

Poster

AI Alignment with Changing and Influenceable Reward Functions

Micah Carroll · Davis Foote · Anand Siththaranjan · Stuart Russell · Anca Dragan


Abstract:

Current AI alignment techniques treat human preferences as static and model them via a single reward function. However, our preferences change, making the goal of alignment ambiguous: should AI systems act in the interest of our current, past, or future selves? The behavior of AI systems may also influence our preferences, meaning that notions of alignment must also specify which kinds of influence are, and are not, acceptable. The answers to these questions are left undetermined by the current AI alignment paradigm, making it ill-posed. To ground formal discussions of these issues, we introduce Dynamic Reward MDPs (DR-MDPs), which extend MDPs to allow the reward function to change and be influenced by the agent. Using the lens of DR-MDPs, we demonstrate that agents resulting from current alignment techniques will have incentives for influence; that is, they will systematically attempt to shift our future preferences to make them easier to satisfy. We also investigate how one may avoid undesirable influence by leveraging the optimization horizon or by using different DR-MDP optimization objectives that correspond to alternative notions of alignment. Broadly, our work highlights the unintended consequences of applying current alignment techniques to settings with changing and influenceable preferences, and describes the challenges that must be overcome to develop a more general AI alignment paradigm that can accommodate such settings.
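
To make the DR-MDP idea concrete, the sketch below frames it as an MDP in which the reward-function parameter is itself part of the evolving state and can be shifted by the agent's actions. This is a minimal illustrative sketch only, not the paper's formalism or implementation; all names (DRMDP, reward_dynamics, rollout_return) and the particular objective shown are assumptions made here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, TypeVar

S = TypeVar("S")       # environment state
A = TypeVar("A")       # action
Theta = TypeVar("Theta")  # reward-function parameter (e.g., the human's current preferences)


@dataclass
class DRMDP:
    """Illustrative sketch of a Dynamic Reward MDP (all names are hypothetical).

    A standard MDP keeps the reward function fixed; here the reward parameter
    theta evolves over time and can be influenced by the agent's actions.
    """
    transition: Callable[[S, A], S]                  # environment dynamics: s' = T(s, a)
    reward_dynamics: Callable[[Theta, S, A], Theta]  # preference dynamics: theta' = D(theta, s, a)
    reward: Callable[[Theta, S, A], float]           # reward under the current preferences theta
    discount: float = 0.99


def rollout_return(drmdp: DRMDP, policy: Callable[[S, Theta], A],
                   s: S, theta: Theta, horizon: int) -> float:
    """Discounted return where each step is scored by that step's own reward
    parameter: one possible DR-MDP objective among several. Optimizing an
    objective like this can reward steering theta toward preferences that are
    easier to satisfy, i.e., an incentive for influence."""
    total, disc = 0.0, 1.0
    for _ in range(horizon):
        a = policy(s, theta)
        total += disc * drmdp.reward(theta, s, a)
        theta = drmdp.reward_dynamics(theta, s, a)  # the agent's action may shift future preferences
        s = drmdp.transition(s, a)
        disc *= drmdp.discount
    return total
```

As the abstract notes, both the horizon used in such a rollout and the choice of which theta evaluates each step (past, current, or future preferences) change the resulting notion of alignment; the objective above is only one of the alternatives the paper considers.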
