

Poster

Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference

Piotr Nawrot · Adrian Łańcucki · Marcin Chochowski · David Tarjan · Edoardo Ponti


Abstract: Transformers have emerged as the backbone of large language models (LLMs). However, generation remains inefficient due to the need to store in memory a cache of key–value representations for past tokens, whose size scales linearly with the input sequence length and batch size. As a solution, we propose Dynamic Memory Compression (DMC), a method to fine-tune LLMs for on-line key–value cache compression at inference time. For each layer and head, the model learns to decide whether to append the current keys and values or to merge them with the last item in the cache. The memory size of DMC models therefore lies between that of Transformers (linear growth) and State Space Models (constant). We retrofit pre-trained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers via continued pre-training on a negligible percentage of the original data and without adding any extra parameters. We find that DMC preserves the original downstream performance with up to 4$\times$ cache compression and vastly surpasses widely adopted baselines like grouped-query attention (GQA). GQA and DMC can even be combined to obtain compounded gains. As a result, DMC fits longer contexts and larger batches within any given memory budget. Concretely, DMC increases the throughput of Llama 2 by ~3.4$\times$ on an NVIDIA A100 GPU. We release the DMC code and models at https://github.com/blinded-for-review.
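To make the append-or-merge mechanism concrete, below is a minimal sketch of a per-head compressed KV cache, assuming a binary merge decision per step and a weighted running average as the merge operation. The names (`DMCCacheHead`, `merge`, `omega`) and the averaging rule are illustrative assumptions for this sketch, not the released DMC API or the exact training-time formulation.

```python
# Minimal sketch of an append-or-merge KV-cache update for one attention head.
# Assumption: "merging with the last item" is realized as a weighted running
# average with importance weight omega; all identifiers here are hypothetical.
import torch


class DMCCacheHead:
    """Key-value cache for a single attention head with on-line compression."""

    def __init__(self, head_dim: int):
        self.keys = torch.empty(0, head_dim)     # compressed keys, shape (n, d)
        self.values = torch.empty(0, head_dim)   # compressed values, shape (n, d)
        self.weights = torch.empty(0)            # accumulated merge weights, shape (n,)

    def update(self, k_t: torch.Tensor, v_t: torch.Tensor,
               merge: bool, omega: float) -> None:
        """Append (k_t, v_t) as a new slot, or fold it into the last slot."""
        if merge and len(self.weights) > 0:
            w = self.weights[-1]
            self.keys[-1] = (w * self.keys[-1] + omega * k_t) / (w + omega)
            self.values[-1] = (w * self.values[-1] + omega * v_t) / (w + omega)
            self.weights[-1] = w + omega
        else:
            self.keys = torch.cat([self.keys, k_t.unsqueeze(0)])
            self.values = torch.cat([self.values, v_t.unsqueeze(0)])
            self.weights = torch.cat([self.weights, torch.tensor([omega])])


# Toy usage: a stream of 8 tokens where every other token is merged,
# leaving 4 cache slots instead of 8 (2x compression for this head).
if __name__ == "__main__":
    cache = DMCCacheHead(head_dim=4)
    for t in range(8):
        k_t, v_t = torch.randn(4), torch.randn(4)
        cache.update(k_t, v_t, merge=(t % 2 == 1), omega=1.0)
    print(cache.keys.shape)  # torch.Size([4, 4])
```

Because decisions are made independently per layer and head, different heads can retain different numbers of slots, which is where the variable (learned) compression ratio comes from.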
