

Poster

QuRating: Selecting High-Quality Data for Training Language Models

Alexander Wettig · Aatmik Gupta · Saumya Malik · Danqi Chen


Abstract:

Selecting high-quality pre-training data is important for creating capable language models, but existing methods rely on simple heuristics. We introduce QuRating, a method for selecting pre-training data that captures the abstract qualities of texts which humans intuitively perceive. In this paper, we employ LLMs to discern these qualities, and enhance their reliability by eliciting pairwise comparisons of texts. We investigate four qualities - writing style, required expertise, facts & trivia, and educational value. We train a QuRater model to learn scalar ratings from pairwise judgments, and use it to annotate a 260B-token training corpus with fine-grained quality ratings. In our experiments, we sample 30B tokens according to different quality ratings and train 1.3B-parameter language models on the selected data. We find that it is important to balance quality and diversity when selecting data. With appropriate sampling, our models achieve lower perplexity and stronger in-context learning performance than baselines. Beyond data selection, we use quality ratings to construct curricula which improve performance while training on the same dataset. We provide an extensive analysis of the characteristics and biases of the quality ratings. We release our prompts, models, and annotated data (QuRatedPajama) to encourage further research.
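The two key steps in the abstract — turning pairwise quality judgments into scalar ratings, and sampling data in a way that trades off quality against diversity — can be sketched in miniature. The actual QuRater is a fine-tuned language model trained on LLM comparisons; below is only a toy illustration, assuming a Bradley-Terry-style pairwise model (a standard choice for learning scalars from comparisons, not confirmed by the abstract) and temperature-controlled sampling. The function names and hyperparameters are hypothetical.

```python
import math
import random

def train_ratings(pairs, n_docs, lr=0.5, epochs=200):
    """Learn one scalar rating per document from pairwise judgments.
    Bradley-Terry model: P(a preferred over b) = sigmoid(r_a - r_b).
    Each pair (a, b) means document a was judged higher quality than b."""
    r = [0.0] * n_docs
    for _ in range(epochs):
        for a, b in pairs:
            p = 1.0 / (1.0 + math.exp(-(r[a] - r[b])))
            g = 1.0 - p  # gradient of -log p w.r.t. (r_a - r_b)
            r[a] += lr * g
            r[b] -= lr * g
    return r

def sample_by_quality(ratings, k, temperature=2.0, seed=0):
    """Sample k document indices without replacement, with probability
    proportional to exp(rating / temperature). A high temperature keeps
    diversity; temperature -> 0 approaches greedy top-k selection."""
    rng = random.Random(seed)
    idx = list(range(len(ratings)))
    weights = [math.exp(r / temperature) for r in ratings]
    chosen = []
    for _ in range(min(k, len(idx))):
        total = sum(weights[i] for i in idx)
        x = rng.random() * total
        for j, i in enumerate(idx):
            x -= weights[i]
            if x <= 0:
                chosen.append(idx.pop(j))
                break
    return chosen
```

In this sketch the temperature plays the role of the quality/diversity balance the abstract highlights: sampling strictly by top rating concentrates the corpus on a narrow slice of data, while a softer distribution retains lower-rated but diverse documents.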
