

Poster

Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration

Zhongzhi Yu · Zheng Wang · Yonggan Fu · Shi Huihong · Khalid Shaikh · Yingyan Lin


Abstract:

Attention is a fundamental component behind the remarkable achievements of large language models (LLMs). However, our current understanding of the attention mechanism, especially of how attention distributions are established, remains limited. Inspired by recent studies that explore the presence of attention sinks in the initial tokens, which receive disproportionately large attention scores despite their lack of semantic importance, this work delves deeper into this phenomenon. We aim to provide a more profound understanding of the existence of attention sinks within LLMs and to uncover ways to enhance LLMs' achievable accuracy by directly optimizing the attention distributions, without the need for weight finetuning. Specifically, this work begins with comprehensive visualizations of the attention distributions in LLMs during inference across various inputs and tasks. Based on these visualizations, for the first time, we discover that (1) attention sinks occur not only at the start of sequences but also among later tokens of the input, and (2) not all attention sinks have a positive impact on the achievable accuracy of LLMs. Building upon our findings, we propose a training-free Attention Calibration Technique (ACT) that automatically optimizes the attention distributions on the fly during inference in an input-adaptive manner. Through extensive experiments, we demonstrate that our proposed ACT technique can enhance the accuracy of the pretrained Llama2-7B-chat by up to 3.16% across various tasks. The source code will be released upon acceptance.
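The abstract names two steps, locating attention sinks anywhere in the sequence and then calibrating the attention distribution at inference time without retraining, but gives no algorithmic details. The sketch below illustrates the general idea under stated assumptions: the function names, the sink threshold, and the damping factor alpha are placeholders for illustration and are not the paper's actual ACT procedure, which selects sinks and heads input-adaptively.

# Minimal sketch, assuming row-stochastic attention maps of shape
# (num_heads, seq_len, seq_len). Step 1: flag tokens that receive
# disproportionately large attention. Step 2: damp attention flowing into
# selected sinks and renormalize each row. Threshold and alpha are
# illustrative assumptions, not values from the paper.
import torch


def find_attention_sinks(attn: torch.Tensor, threshold: float = 0.3) -> torch.Tensor:
    """Return indices of tokens whose received attention, averaged over
    heads and query positions, exceeds `threshold`."""
    received = attn.mean(dim=0).mean(dim=0)  # (seq_len,) avg attention per key token
    return torch.nonzero(received > threshold, as_tuple=False).flatten()


def calibrate_attention(attn: torch.Tensor, sink_idx: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Scale attention flowing into the selected sink tokens by `alpha`
    and renormalize so each row remains a valid distribution."""
    calibrated = attn.clone()
    calibrated[:, :, sink_idx] *= alpha
    return calibrated / calibrated.sum(dim=-1, keepdim=True)


if __name__ == "__main__":
    torch.manual_seed(0)
    num_heads, seq_len = 4, 8
    # Synthetic attention map with an artificial sink at position 0.
    logits = torch.randn(num_heads, seq_len, seq_len)
    logits[:, :, 0] += 3.0
    attn = torch.softmax(logits, dim=-1)

    sinks = find_attention_sinks(attn)
    print("detected sinks:", sinks.tolist())
    print("row sums after calibration:",
          calibrate_attention(attn, sinks).sum(-1).mean().item())

In practice such a calibration would be applied inside the model's attention modules (e.g., via forward hooks) so that downstream value aggregation uses the adjusted distribution; the snippet above only demonstrates the detect-and-recalibrate logic on a standalone tensor.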
