

Poster

Covert Malicious Finetuning: Subverting LLM Safety Training Without Detection

Danny Halawi · Alexander Wei · Eric Wallace · Tony Wang · Nika Haghtalab · Jacob Steinhardt


Abstract:

Black-box finetuning is an emerging interface for adapting state-of-the-art language models, such as GPT-4, to specific user needs. However, such access also opens the door for malicious actors to undermine model safety. In this work, we show that finetuning access can be exploited to compromise model safety training without detection. We propose Covert Malicious Finetuning, a method for constructing malicious datasets in which every individual sample appears innocuous, yet finetuning on the full dataset instills a backdoor that disables safety training. Our method hides malicious content across multiple training samples: harmless samples teach the model a cipher, while enciphered samples teach the model harmful behavior. Applied to GPT-4, our method produces a finetuned model that fulfills harmful instructions 99% of the time, without triggering defenses such as classifiers, safety evaluations, or dataset inspection. Our findings call into question whether finetuning access can be made safe against sophisticated adversaries.
