

Poster

MLAgentBench: Evaluating Language Models for ML Experimentation

Qian Huang · Jian Vora · Percy Liang · Jure Leskovec


Abstract:

An important aspect of research is scientific experimentation, which involves an iterative process of creating hypotheses, designing experiments, running experiments, and analyzing the results. In this paper, we construct a language-model-based agent to perform ML experimentation. To evaluate such agents, we introduce MLAgentBench, a suite of 13 environments ranging from improving model performance on CIFAR-10 to recent research problems like BabyLM. In each environment, an agent can perform actions such as reading and writing files, executing code, and inspecting outputs. With these actions, we observed agents running experiments, analyzing the results, and modifying the code of entire machine learning pipelines, including data processing, architectures, and training procedures. We benchmark agents based on Claude v1.0 and GPT-4 and find that a GPT-4-based agent can feasibly build compelling ML models over many tasks in MLAgentBench, displaying highly interpretable plans and actions. However, success rates vary considerably, ranging from almost 90% on well-established older datasets to as low as 10% on recent Kaggle challenges that were unavailable during the LM's pretraining. Finally, we identify several key challenges for LM-based agents, such as long-term planning and reducing hallucination. Our code is released at https://anonymous.4open.science/r/MLAgentBench/.
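To picture the kind of action interface the abstract describes (reading/writing files, executing code, inspecting outputs), here is a minimal self-contained sketch. It is not the MLAgentBench code or API: the class name `ToyExperimentEnv`, its method names, and the file names are hypothetical placeholders for the sort of actions such an environment might expose to an LM agent.

```python
# Illustrative sketch only; NOT the MLAgentBench API. All names here are
# hypothetical stand-ins for the kind of file-edit / run-code / inspect-output
# loop described in the abstract.
import subprocess
from pathlib import Path


class ToyExperimentEnv:
    """A tiny workspace exposing file and code-execution actions."""

    def __init__(self, workdir: str = "workspace"):
        self.workdir = Path(workdir)
        self.workdir.mkdir(exist_ok=True)

    def read_file(self, name: str) -> str:
        return (self.workdir / name).read_text()

    def write_file(self, name: str, content: str) -> None:
        (self.workdir / name).write_text(content)

    def execute_script(self, name: str, timeout: int = 60) -> str:
        # Run a script in the workspace and return its combined output,
        # which an agent would then inspect to decide its next action.
        result = subprocess.run(
            ["python", name],
            cwd=self.workdir,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return result.stdout + result.stderr


if __name__ == "__main__":
    env = ToyExperimentEnv()
    # A hand-scripted "agent" step: write a training stub, run it, inspect output.
    env.write_file("train.py", "print('val accuracy: 0.71')\n")
    observation = env.execute_script("train.py")
    print("Observation:", observation.strip())
    # A real LM agent would feed this observation back into the model to plan
    # its next edit (e.g., to the data processing or the model architecture).
```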
