

Poster

Data Engineering for Scaling Language Models to 128K Context

Yao Fu · Rameswar Panda · Xinyao Niu · Xiang Yue · Hannaneh Hajishirzi · Yoon Kim · Hao Peng


Abstract:

We study the continual pretraining recipe for scaling language models' context lengths to 128K, with a focus on data engineering. We hypothesize that long-context modeling, in particular the ability to utilize information at arbitrary input locations, is a capability mostly already acquired through large-scale pretraining, and that this capability can be readily extended to contexts substantially longer than seen during training (e.g., 4K to 128K) through lightweight continual pretraining on an appropriate data mixture. We investigate the quantity and quality of the data for continual pretraining: (1) for quantity, we show that 500 million to 5 billion tokens are enough to enable the model to retrieve information anywhere within the 128K context; (2) for quality, our results place equal emphasis on domain balance and length upsampling. Concretely, naïvely upsampling longer data from certain domains like books, a common practice in existing work, gives suboptimal performance; a balanced domain mixture is equally important. We demonstrate that continual pretraining of the full model on 1B-5B tokens of such data is an effective and affordable strategy for scaling the context length of language models to 128K. Our recipe outperforms strong open-source long-context models and closes the gap to frontier models like GPT-4 128K.
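To make the data-mixture idea in the abstract concrete, the sketch below illustrates one way to upsample long documents within each domain while keeping the overall domain ratios fixed. It is a minimal, hypothetical illustration of the described principle, not the authors' pipeline: the function name, the length threshold, and the boost factor are all assumptions.

```python
# Hypothetical sketch of per-domain length upsampling with a fixed domain mixture.
# Long documents are oversampled only relative to other documents in the *same*
# domain, so the domain balance of the final mixture is preserved.
# All names and constants here are illustrative, not taken from the paper.
import random

LONG_THRESHOLD = 32_000   # assumed token cutoff for a "long" document
LONG_BOOST = 5.0          # assumed sampling weight applied to long documents


def sample_mixture(domain_docs, domain_ratios, n_samples, seed=0):
    """domain_docs: {domain: [(doc_id, token_len), ...]}
    domain_ratios: {domain: fraction of the final mixture}, summing to 1.
    Returns doc_ids whose domain proportions follow domain_ratios, with long
    documents upsampled inside each domain."""
    rng = random.Random(seed)
    mixture = []
    for domain, ratio in domain_ratios.items():
        docs = domain_docs[domain]
        # Weight long documents more heavily within this domain only.
        weights = [LONG_BOOST if length >= LONG_THRESHOLD else 1.0
                   for _, length in docs]
        k = round(n_samples * ratio)
        mixture.extend(doc_id for doc_id, _ in
                       rng.choices(docs, weights=weights, k=k))
    rng.shuffle(mixture)
    return mixture
```

In this sketch, changing LONG_BOOST affects only how often long documents appear within a domain; the per-domain token budgets stay proportional to domain_ratios, which is the balance the abstract argues matters.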
