

Poster

Simple linear attention language models balance the recall-throughput tradeoff

Evan Sabri Eyuboglu · Simran Arora · Michael Zhang · Aman Timalsina · Silas Alberti · James Zou · Atri Rudra · Christopher Re


Abstract:

We seek sequence mixers that are both high-quality and efficient in wall-clock time. Recently, recall, the ability of a language model to ground its generations in previously seen tokens, has become a critical test of sequence mixer quality. We empirically and theoretically study a broad set of competitive attention and attention-free architectures, identifying a fundamental tradeoff between an architecture's state size and its recall ability. At one end, attention excels at recall but maintains a full KV-cache; at the other, recent recurrent models (e.g., H3, Mamba, RWKV) struggle to perform recall. Motivated by these findings, we explore a new region of this tradeoff curve. We propose Based, built from a simple and natural approximation of attention: global linear attention combined with local sliding window attention. While linear attention methods are efficient in principle, prior implementations are often slow in wall-clock time. Enabled by our IO-aware algorithms, we find that Based remains competitive at up to 1.3B parameters while offering 45-55% higher prefill speeds than competitive baselines (FlashAttention-2 and Mamba). At the same time, Based outperforms prior sub-quadratic architectures on recall.
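To make the two components of the mixer concrete, below is a minimal sketch (not the authors' implementation) of a causal linear-attention pass with a fixed-size recurrent state and an exact softmax attention pass restricted to a local causal window. The feature map `phi`, the window size, the function names (`based_mixer`, etc.), and the additive composition of the two paths are illustrative assumptions; the paper's actual feature map, kernels, and layer composition may differ.

```python
import torch
import torch.nn.functional as F


def phi(x):
    # Illustrative positive feature map for linear attention (assumption;
    # not necessarily the feature map used in the paper).
    return F.elu(x) + 1


def causal_linear_attention(q, k, v):
    # Global path: causal linear attention. Each position attends to all
    # previous positions through prefix sums of k_i v_i^T, so the per-token
    # state is a fixed-size (d x e) matrix rather than a growing KV-cache.
    q, k = phi(q), phi(k)
    kv = torch.einsum("bnd,bne->bnde", k, v).cumsum(dim=1)  # prefix sums of outer products
    z = k.cumsum(dim=1)                                      # prefix sums of keys (normalizer)
    num = torch.einsum("bnd,bnde->bne", q, kv)
    den = torch.einsum("bnd,bnd->bn", q, z).clamp(min=1e-6)
    return num / den.unsqueeze(-1)


def sliding_window_attention(q, k, v, window=64):
    # Local path: exact softmax attention, masked to a causal window of
    # `window` recent tokens.
    n, d = q.shape[1], q.shape[-1]
    scores = torch.einsum("bnd,bmd->bnm", q, k) / d ** 0.5
    idx = torch.arange(n, device=q.device)
    causal = idx[None, :] <= idx[:, None]
    local = (idx[:, None] - idx[None, :]) < window
    scores = scores.masked_fill(~(causal & local), float("-inf"))
    return torch.einsum("bnm,bmd->bnd", scores.softmax(dim=-1), v)


def based_mixer(q, k, v, window=64):
    # Combine the cheap global path with the precise local path. Summation is
    # used here only for illustration; the paper may compose them differently
    # (e.g., in separate layers).
    return causal_linear_attention(q, k, v) + sliding_window_attention(q, k, v, window)


if __name__ == "__main__":
    # Toy usage: batch of 2 sequences, length 128, head dimension 16.
    q, k, v = (torch.randn(2, 128, 16) for _ in range(3))
    out = based_mixer(q, k, v)
    print(out.shape)  # torch.Size([2, 128, 16])
```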
