

Poster

Bifurcated Attention for Single-Context Large-Batch Sampling

Ben Athiwaratkun · Sujan Kumar Gonugondla · Sanjay Krishna Gouda · Hantian Ding · Qing Sun · Jun Wang · Jiacheng Guo · Liangfu Chen · Haifeng Qian · Parminder Bhatia · Ramesh M Nallapati · Sudipta Sengupta · Bing Xiang


Abstract:

In our study, we present bifurcated attention, a method developed for language model inference in single-context batch sampling settings. This approach aims to reduce redundant memory IO costs, a significant contributor to latency at high batch sizes and long context lengths. Bifurcated attention achieves this by dividing the attention mechanism during incremental decoding into two distinct GEMM operations, one over the KV cache from prefill and one over the KV cache produced during decoding. The method computes exactly the same result as standard attention and retains its usual computational load (FLOPs), but with reduced memory IO. Bifurcated attention is also compatible with multi-query attention, a mechanism known for its reduced KV-cache memory IO, further enabling higher batch sizes and longer context lengths. The resulting efficiency leads to lower latency and improved suitability for real-time applications, e.g., enabling massively parallel answer generation without substantially increasing latency, which enhances performance when integrated with post-processing techniques such as reranking.
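To make the idea concrete, the sketch below illustrates the split described in the abstract: during incremental decoding, attention logits are computed with one GEMM against the KV cache of the shared prefill context (broadcast across the batch instead of being replicated per sample) and a second GEMM against each sample's own decoded KV, then combined under a single softmax so the result matches standard attention exactly. This is a minimal PyTorch sketch under assumed tensor shapes, not the authors' implementation; the function name and arguments are hypothetical.

```python
import torch

def bifurcated_attention(q, k_ctx, v_ctx, k_dec, v_dec, scale):
    # q:      [batch, heads, 1, d]       query for the current decode step
    # k_ctx:  [1, heads, ctx_len, d]     KV cache from the shared prefill context
    # v_ctx:  [1, heads, ctx_len, d]
    # k_dec:  [batch, heads, dec_len, d] per-sample KV cache from decoding
    # v_dec:  [batch, heads, dec_len, d]

    # GEMM 1: logits against the shared context KV.
    # Broadcasting over the batch dim avoids replicating k_ctx per sample.
    logits_ctx = torch.matmul(q, k_ctx.transpose(-1, -2)) * scale  # [b, h, 1, ctx_len]

    # GEMM 2: logits against each sample's own decoded KV.
    logits_dec = torch.matmul(q, k_dec.transpose(-1, -2)) * scale  # [b, h, 1, dec_len]

    # Joint softmax over both segments: exact, same FLOPs as fused attention.
    probs = torch.softmax(torch.cat([logits_ctx, logits_dec], dim=-1), dim=-1)
    p_ctx, p_dec = probs.split([k_ctx.shape[-2], k_dec.shape[-2]], dim=-1)

    # Weighted sums, again broadcasting v_ctx rather than copying it per sample.
    return torch.matmul(p_ctx, v_ctx) + torch.matmul(p_dec, v_dec)  # [b, h, 1, d]
```

Because the shared-context GEMM reads k_ctx and v_ctx only once regardless of batch size, memory IO grows with the per-sample decoded length rather than with batch size times context length, which is the source of the latency savings the abstract describes.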
