Poster

Language Models as Science Tutors

Alexis Chevalier · Jiayi Geng · Alexander Wettig · Howard Chen · Sebastian Mizera · Simon Machado · Arturo Fanlo · Simon Frieder · Zirui Wang · Akshara P · Jiachen Wang · Xindi Wu · Mengzhou Xia · Wenhan Xia · Jiatong Yu · Ellie Thieu · Max Aragon · Zhiyong Ren · Junjie Zhu · Toni Annala · Sanjeev Arora · Danqi Chen


Abstract:

NLP has recently made exciting progress toward training language models (LMs) with strong scientific problem-solving skills. However, the science benchmarks used today are not representative of real-life use cases of LMs, and they do not evaluate long-context understanding of scientific documents. To address this, we introduce TutorEval and TutorChat. TutorEval is a diverse question-answering benchmark consisting of expert-written questions about long chapters from STEM textbooks. TutorEval helps measure the real-life usability of LMs as scientific assistants, and it is the first benchmark combining long contexts, free-form generation, and multi-disciplinary scientific knowledge. Moreover, we show that fine-tuning base models with existing dialogue datasets leads to poor performance on TutorEval. Therefore, we create TutorChat, a dataset of 80,000 long synthetic dialogues about textbooks. We use TutorChat to fine-tune Llemma models with 7B and 34B parameters. These math-specialized LM tutors have a 32K-token context window, and they excel at TutorEval, GSM8K, and MATH compared to other models of their size. Our datasets are drawn from open-source materials, and we release our models, data, and evaluations.
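For readers who want to inspect the released data, a minimal loading sketch follows. It assumes the Hugging Face datasets library; the dataset identifiers (princeton-nlp/TutorEval, princeton-nlp/TutorChat) are assumptions based on the authors' affiliation and are not confirmed on this page.

    from datasets import load_dataset

    # Hypothetical dataset identifiers -- adjust to the actual release names.
    tutoreval = load_dataset("princeton-nlp/TutorEval")   # long-context, expert-written QA benchmark
    tutorchat = load_dataset("princeton-nlp/TutorChat")   # 80K synthetic dialogues about textbook chapters

    # Print the available splits and columns of each release.
    print(tutoreval)
    print(tutorchat)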
