ICML Poster CaM: Cache Merging for Memory-efficient LLMs Inference


Yuxin Zhang · Yuxuan Du · Gen Luo · Yunshan Zhong · Zhenyu Zhang · Shiwei Liu · Rongrong Ji


Abstract:

Despite the exceptional performance of Large Language Models (LLMs), the substantial volume of key-value (KV) pairs cached during inference presents a barrier to their efficient deployment. To ameliorate this, recent works have aimed to selectively eliminate these caches, informed by the attention scores of associated tokens. However, such cache eviction invariably leads to output perturbation, regardless of the token choice. This perturbation escalates with the compression ratio, which can precipitate a marked deterioration in LLM inference performance. This paper introduces Cache Merging (CaM) as a solution to mitigate this challenge. CaM adaptively merges to-be-evicted caches into the remaining ones, employing a novel sampling strategy governed by the prominence of attention scores within discarded locations. In this manner, CaM enables memory-efficient LLMs to preserve critical token information, even obviating the need to maintain their corresponding caches. Extensive experiments utilizing LLaMA, OPT, and GPT-NeoX across various benchmarks corroborate CaM's proficiency in bolstering the performance of memory-efficient LLMs. Code will be publicly released.
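The merging idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the sampling rule, so the function name `merge_then_evict`, the merge probability (the evicted token's attention score divided by the mean score of the kept tokens, clipped to 1), and the equal split of the evicted value vector across the kept ones are all assumptions made for illustration.

```python
import random


def merge_then_evict(values, attn, evict, keep, rng=None):
    """Illustrative cache merging step (hypothetical, not the paper's exact rule).

    values: list of value vectors, one per cached token
    attn:   list of attention scores, one per cached token
    evict:  index of the token whose cache entry will be dropped
    keep:   indices of the tokens whose cache entries remain
    """
    rng = rng or random.Random(0)

    # Assumption: the more prominent the evicted token's attention score is
    # relative to the kept tokens, the more likely it is to be merged.
    mean_keep = sum(attn[i] for i in keep) / len(keep)
    p_merge = min(1.0, attn[evict] / mean_keep) if mean_keep > 0 else 1.0

    # Bernoulli draw: merge with probability p_merge, otherwise plain eviction.
    merged = rng.random() < p_merge

    out = {i: list(values[i]) for i in keep}
    if merged:
        # Assumption: spread the evicted value vector evenly over the kept ones.
        share = [v / len(keep) for v in values[evict]]
        for i in keep:
            out[i] = [a + b for a, b in zip(out[i], share)]
    return out, p_merge, merged


if __name__ == "__main__":
    values = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]
    attn = [0.5, 0.3, 0.2]
    cache, p, merged = merge_then_evict(values, attn, evict=0, keep=[1, 2])
    print(p, merged, cache)
```

In this toy run the evicted token has a higher attention score than the kept tokens on average, so the assumed merge probability is clipped to 1 and its value vector is folded into the two remaining entries rather than discarded outright.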
