

Poster

KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation

Minsik Cho · Mohammad Rastegari · Devang Naik


Abstract:

Large Language Model (LLM) inference has two phases: the prompt phase to output the first token and the extension phase to generate subsequent tokens. In this work, we propose an efficient parallelization scheme, KV-Runahead, to accelerate the prompt phase. The key observation is that the extension phase generates tokens faster than the prompt phase because of the key-value cache (KV-cache). Hence, if we use multiple processes to populate the KV-cache, we naturally parallelize the prompt phase and minimize the time-to-first-token (TTFT). Since the KV-cache is designed to leverage the causal attention computation, our approach avoids unnecessary attention map computation. We also propose performing user-context partitioning upfront to balance out the uneven KV-cache generation (due to causal attention) and to optimize TTFT. Compared with an existing parallelization scheme, notably a combination of tensor and sequential parallelization where keys and values are locally generated and exchanged via all-gather collectives, our experimental results demonstrate that KV-Runahead can offer over 40% and 60% speedups for LLaMA 7B and Falcon 7B, respectively, over the prior art.
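
To illustrate the load-balancing idea, here is a minimal sketch (not the authors' implementation) of causal-attention-aware context partitioning. It assumes the attention cost of token t grows roughly as t + 1, since that token attends to all preceding positions; the function name `partition_context` is hypothetical.

```python
# Hypothetical sketch: split a prompt into contiguous chunks with roughly
# equal causal-attention work, so each parallel worker finishes its share
# of the KV-cache at about the same time. Assumes per-token attention
# cost proportional to (position + 1); not the paper's actual algorithm.

def partition_context(num_tokens: int, num_procs: int) -> list[tuple[int, int]]:
    """Return [start, end) token ranges, one per process. Later chunks
    get fewer tokens because their attention rows are more expensive."""
    cost = [t + 1 for t in range(num_tokens)]   # per-token attention cost
    target = sum(cost) / num_procs              # ideal work per process
    bounds, start, acc = [], 0, 0.0
    for t, c in enumerate(cost):
        acc += c
        # close the current chunk once it reaches the per-process target
        if acc >= target and len(bounds) < num_procs - 1:
            bounds.append((start, t + 1))
            start, acc = t + 1, 0.0
    bounds.append((start, num_tokens))          # last chunk takes the remainder
    return bounds

if __name__ == "__main__":
    # e.g. a 4096-token prompt across 4 workers: earlier chunks are longer
    # because their tokens attend to fewer predecessors.
    print(partition_context(4096, 4))
```

In this sketch, each worker would then compute keys and values only for its own token range, reusing the KV-cache machinery that already exists for the extension phase rather than materializing the full attention map.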
