

Poster

Understanding Preference Fine-Tuning for Large Language Models

Anikait Singh · Fahim Tajwar · Archit Sharma · Rafael Rafailov · Jeff Schneider · Tengyang Xie · Stefano Ermon · Chelsea Finn · Aviral Kumar


Abstract:

Learning from preference labels plays a crucial role in fine-tuning effective LM assistants. There are several distinct approaches for doing so, including supervised learning, on-policy reinforcement learning (RL), and contrastive learning. Some empirical results show that RL is required to attain good fine-tuning results, while others find contrastive or even supervised objectives sufficient. Different methods come with different implementation tradeoffs, raising the question: which approaches are important for fine-tuning with preference data? In this paper, we answer this question by performing a rigorous analysis of various fine-tuning techniques on didactic and full-scale LLM problems. Our main finding is that approaches that use on-policy sampling or minimize the likelihood of certain responses (i.e., use a negative gradient) outperform offline and maximum likelihood objectives. We conceptualize our insights and unify methods that use on-policy sampling or a negative gradient under a novel notion of committal objectives: akin to mode-seeking objectives in continuous spaces, committal objectives can quickly perform targeted interventions to alter probability mass on specific bins of a categorical distribution, unlike maximum likelihood objectives. Our analysis provides two practical ways to achieve this committal behavior and actionable insights for selecting and tuning approaches for preference learning.
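
To make the contrast between maximum likelihood and negative-gradient objectives concrete, the following is a minimal toy sketch (not the authors' code): a softmax policy over a handful of candidate responses, updated either by maximizing the likelihood of the preferred response alone or by a DPO-style contrastive loss that also pushes down the dispreferred response. All names and parameters (`K`, `y_w`, `y_l`, `beta`, step counts) are illustrative assumptions.

```python
import torch

# Toy illustration (assumed setup, not from the paper): a categorical "policy"
# over K candidate responses, parameterized by logits. We compare two updates
# on a single preference pair (y_w preferred over y_l):
#   (1) maximum likelihood on the preferred response only, and
#   (2) a contrastive loss that also lowers the dispreferred response's
#       likelihood (a "negative gradient"), in the spirit of DPO-style losses.
K = 5                  # number of candidate responses (assumed)
y_w, y_l = 2, 0        # indices of preferred / dispreferred responses (assumed)
steps, lr = 50, 0.5    # plain gradient descent on the logits

def train(loss_fn):
    logits = torch.zeros(K, requires_grad=True)  # start from a uniform policy
    for _ in range(steps):
        loss = loss_fn(logits)
        loss.backward()
        with torch.no_grad():
            logits -= lr * logits.grad
        logits.grad.zero_()
    return torch.softmax(logits, dim=0)

# (1) Maximum likelihood: only raise the log-probability of y_w.
mle_loss = lambda logits: -torch.log_softmax(logits, dim=0)[y_w]

# (2) Contrastive / negative gradient: raise y_w while explicitly lowering y_l.
beta = 1.0  # temperature-like scale (assumed)
contrastive_loss = lambda logits: -torch.nn.functional.logsigmoid(
    beta * (torch.log_softmax(logits, dim=0)[y_w]
            - torch.log_softmax(logits, dim=0)[y_l]))

print("MLE        :", train(mle_loss).tolist())
print("Contrastive:", train(contrastive_loss).tolist())
```

Printing the two final distributions shows how the contrastive update reallocates probability mass between the preferred and dispreferred responses in a targeted way, whereas the maximum likelihood update only spreads the decrease across all non-preferred responses via the softmax normalizer; this is a loose, simplified analogue of the committal behavior discussed in the abstract.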
