

Poster

Position Paper: Will we run out of data? Limits of LLM scaling based on human-generated data

Pablo Villalobos · Anson Ho · Jaime Sevilla · Tamay Besiroglu · Lennart Heim · Marius Hobbhahn


Abstract:

Recent progress in language modeling has relied on scaling up training datasets of human-generated text. However, our analysis of current trends predicts that dataset sizes will roughly match the available stock of human-generated text between 2028 and 2032. We explore how progress in language modeling can continue once human-generated text datasets can no longer be scaled up. We argue that increased data efficiency, transfer learning, and synthetic data can sustain progress after human text data is exhausted. By relying on these techniques, the transition beyond public human-generated text, expected by the 2030s, need not dramatically slow progress in language modeling.
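As a rough illustration of the kind of extrapolation the abstract describes (not the authors' actual model), the sketch below assumes dataset sizes grow exponentially while the stock of public human text is roughly fixed, and solves for the year the two curves intersect. All numbers are illustrative assumptions, not estimates from the paper.

```python
import math

# Illustrative assumptions only; the paper's estimates differ.
dataset_tokens_2024 = 1.5e13  # assumed: ~15T tokens in a frontier training run in 2024
growth_per_year = 2.0         # assumed: training dataset size roughly doubles each year
stock_tokens = 3e14           # assumed: ~300T tokens of usable public human text

# Solve dataset_tokens_2024 * growth_per_year**t = stock_tokens for t,
# the number of years until datasets match the available stock.
t = math.log(stock_tokens / dataset_tokens_2024) / math.log(growth_per_year)
print(f"Datasets match the stock around {2024 + t:.1f}")  # ~2028.3 under these assumptions
```

Under these assumed numbers the intersection lands near the start of the 2028-2032 window quoted in the abstract; faster dataset growth or a smaller stock pulls the date earlier, and vice versa.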
