

Poster

SparQ Attention: Bandwidth-Efficient LLM Inference

Luka Ribar · Ivan Chelombiev · Luke Hudlass-Galley · Charlie Blake · Carlo Luschi · Douglas Orr


Abstract: The computational difficulties of large language model (LLM) inference remain a significant obstacle to their widespread deployment. The need for many applications to support long input sequences and process them in large batches typically causes token generation to be bottlenecked by data transfer. For this reason, we introduce **SparQ Attention**, a technique for increasing the inference throughput of LLMs by utilising memory bandwidth more efficiently within the attention layers, through selective fetching of the cached history. Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. We show that SparQ Attention brings up to $8\times$ savings in attention data transfers without substantial drops in accuracy, by evaluating Llama 2, Mistral and Pythia models on a wide range of downstream tasks.
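To make the data-transfer saving concrete, the sketch below shows one way such selective fetching of the cached history could look for a single attention head at decode time: approximate scores are computed from a small slice of the key cache, and only the top-scoring positions are fetched in full. This is a minimal illustration, not the paper's exact algorithm; the function name, the heuristic of slicing by the largest-magnitude query components, and the parameters `r` and `k` are assumptions for the sake of the example, and the full method includes refinements not shown here.

```python
import numpy as np

def selective_attention_step(q, K, V, r=16, k=64):
    """One decode-time attention step that fetches only part of the KV cache.

    q: (d,) query for the current token
    K, V: (seq_len, d) cached keys and values (nominally held off-chip)
    r: number of query components used for the cheap score approximation
    k: number of history positions whose full keys/values are fetched
    """
    seq_len, d = K.shape

    # 1) Approximate attention scores by reading only r columns of K,
    #    chosen as the r largest-magnitude components of the query.
    top_r = np.argsort(-np.abs(q))[:r]
    approx_scores = K[:, top_r] @ q[top_r] / np.sqrt(d)

    # 2) Keep the k history positions with the highest approximate scores
    #    and fetch only their full keys and values.
    top_k = np.argsort(-approx_scores)[:k]
    K_sel, V_sel = K[top_k], V[top_k]

    # 3) Exact softmax attention over the selected subset.
    scores = K_sel @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_sel

# Usage: a 2048-token cache with 128-dimensional heads; the step reads only
# r columns of K plus 2*k full rows instead of the entire cache.
rng = np.random.default_rng(0)
q = rng.standard_normal(128)
K = rng.standard_normal((2048, 128))
V = rng.standard_normal((2048, 128))
out = selective_attention_step(q, K, V)
print(out.shape)  # (128,)
```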
