

Poster

Position Paper: Measuring Diversity in Datasets

Dora Zhao · Jerone Andrews · Orestis Papakyriakopoulos · Alice Xiang


Abstract:

Machine learning (ML) datasets, often perceived as "neutral," inherently encapsulate abstract and disputed social constructs. Dataset curators frequently employ value-laden terms such as diversity, bias, and quality to characterize datasets. Despite their prevalence, these terms lack clear definitions and validation in datasets. Our research explores the implications of this issue, specifically analyzing "diversity" across 135 image and text datasets. Drawing from social sciences, we leverage principles from measurement theory to pinpoint considerations and offer recommendations on conceptualization, operationalization, and evaluation of diversity in ML datasets. Our recommendations extend to broader implications for ML research, advocating for a more nuanced and well-defined approach to handling value-laden properties in dataset construction.
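The abstract distinguishes between conceptualizing and operationalizing diversity. As a purely illustrative sketch (not a metric proposed in the paper), one common way to operationalize diversity over a single categorical attribute is normalized Shannon entropy; the attribute name and data below are hypothetical.

```python
# Illustrative sketch only: operationalizing "diversity" as normalized
# Shannon entropy over one categorical attribute. This is NOT the paper's
# method; the attribute ("region") and values are hypothetical.
from collections import Counter
import math

def shannon_diversity(labels):
    """Normalized Shannon entropy in [0, 1]: 0 if all items share one
    label, 1 if items are spread uniformly over the observed labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if len(counts) <= 1:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))

# Hypothetical example: geographic region labels for images in a dataset
regions = ["NA", "NA", "EU", "EU", "EU", "AS", "AF", "NA"]
print(f"Normalized diversity over 'region': {shannon_diversity(regions):.2f}")
```

Such a single-attribute score captures only one narrow operationalization; the paper's broader point is that any such choice should be explicitly conceptualized, justified, and evaluated.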
