

Poster

RLVF: Learning from Verbal Feedback without Overgeneralization

Moritz Stephan · Alexander Khazatsky · Eric Mitchell · Annie Chen · Sheryl Hsu · Archit Sharma · Chelsea Finn


Abstract:

Large language models (LLMs) are increasingly deployed across a variety of industries and user populations, necessitating the ability to align them with specific use cases and user preferences. Standard methods for such adaptation, such as reinforcement learning from human feedback, require extensive manual annotation. Alternatively, prompting-based approaches to incorporating verbal feedback are efficient, but they struggle to appropriately handle nuanced, context-dependent user preferences, often overgeneralizing the feedback to contexts where it should not apply. We study whether it is possible to adapt language models using verbal feedback without such overgeneralization. To this end, we propose Contextualized Critiques with Constrained Preference Optimization (C3PO). We first introduce a scheme for synthetically generating preference data that is both relevant and irrelevant to the provided feedback. We then fine-tune the language model on the synthetic preference data while minimizing its divergence from the original model on out-of-scope prompts. Our experimental results indicate that our approach effectively applies verbal feedback to relevant scenarios while preserving existing behaviors in irrelevant contexts. Across many examples of human- and GPT-4-generated feedback, C3PO adheres to the given feedback comparably to in-context baselines while reducing overgeneralization by 30%.
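The abstract describes a two-part objective: update the model on feedback-relevant synthetic preference pairs while keeping it close to the original model on out-of-scope prompts. The sketch below illustrates one plausible way to combine these terms, assuming a DPO-style preference loss on in-scope pairs and a likelihood-based regularizer on reference-model completions for out-of-scope prompts; the exact loss terms and weighting used in the paper may differ, and all names here (c3po_style_loss, lambda_out) are illustrative rather than taken from the authors' code.

# Hedged sketch, not the authors' released implementation.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO-style preference loss on synthetic in-scope preference pairs."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

def out_of_scope_regularizer(policy_logp_on_ref_samples):
    """Discourage drift on out-of-scope prompts by maximizing the policy's
    likelihood of completions sampled from the original (reference) model."""
    return -policy_logp_on_ref_samples.mean()

def c3po_style_loss(batch, beta=0.1, lambda_out=1.0):
    """Combined objective: apply the feedback in scope, stay unchanged out of scope."""
    in_scope = dpo_loss(batch["policy_chosen_logp"], batch["policy_rejected_logp"],
                        batch["ref_chosen_logp"], batch["ref_rejected_logp"], beta)
    out_scope = out_of_scope_regularizer(batch["policy_logp_on_ref_samples"])
    return in_scope + lambda_out * out_scope

# Toy usage with per-sequence log-probabilities standing in for real model outputs.
batch = {
    "policy_chosen_logp": torch.tensor([-12.3, -9.8]),
    "policy_rejected_logp": torch.tensor([-15.1, -11.2]),
    "ref_chosen_logp": torch.tensor([-13.0, -10.1]),
    "ref_rejected_logp": torch.tensor([-14.2, -10.9]),
    "policy_logp_on_ref_samples": torch.tensor([-20.5, -18.7]),
}
print(c3po_style_loss(batch))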
