

Poster

DsDm: Dataset Selection with Datamodels

Logan Engstrom


Abstract:

When selecting data for training large-scale models, standard practice is to filter for examples that match human notions of data quality. Such filtering yields qualitatively clean datapoints that intuitively should improve model behavior. However, in practice the opposite can often happen: we find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.

To develop better methods for selecting data, we start by framing dataset selection as an optimization problem that we can directly solve for: given target tasks, a learning algorithm, and candidate data, select the subset that maximizes model performance. This framework thus avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks. Our resulting method greatly improves language model (LM) performance on both pre-specified tasks and previously unseen tasks. Specifically, choosing target tasks representative of standard LM problems and evaluating on diverse held-out benchmarks, our selected datasets provide a 2x compute multiplier over baseline methods.
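As a rough sketch of the selection problem described above (the notation here is assumed, not taken from the abstract): writing $\mathcal{A}$ for the learning algorithm, $D$ for the candidate data, $k$ for a subset size budget, and $\mathcal{L}_{\mathrm{targ}}$ for loss on the target tasks, the objective can be stated as

$$S^{\star} \in \arg\min_{S \subseteq D,\; |S| \le k} \; \mathcal{L}_{\mathrm{targ}}\big(\mathcal{A}(S)\big),$$

i.e., pick the training subset whose resulting model performs best on the target tasks. Evaluating $\mathcal{L}_{\mathrm{targ}}(\mathcal{A}(S))$ exactly would require retraining for each candidate subset; the title suggests datamodels are used to approximate this mapping from training subsets to target-task performance.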
