

Poster

PARDEN, Can You Repeat That? Defending against Jail-Breaks via Repetition

Ziyang Zhang · Qizhen Zhang · Jakob Foerster


Abstract:

Large Language Models (LLMs) have shown success in many natural language processing tasks. Despite rigorous safety alignment processes like red-teaming and preference fine-tuning, supposedly safety-aligned LLMs like Llama-2 and ChatGPT are still susceptible to jailbreaks, limiting their application to real-world problems. One option to protect against these is to add a separate “safety guard” which checks the LLM’s inputs and/or outputs for undesired behaviour. A promising approach is to use the LLM itself as the guard: the underlying idea is that the separate filtering step allows the LLM to escape the auto-regressive trap it is exposed to during sampling. However, baseline methods, e.g. prompting the LLM to classify toxic prompts, show limited performance. We hypothesise that this is due to the domain shift between self-censoring (“Sorry, I can’t do that”) during the alignment phase and the classification format (“Is this prompt malicious?”) at test time. In this work, we propose PARDEN, which avoids the auto-regressive trap and this domain shift by simply asking the model to repeat its own outputs. PARDEN requires neither white-box access to the model nor fine-tuning. We verify the effectiveness of our method on a dataset composed of successful attacks, unsuccessful attacks, and benign prompts. Empirically, we show that PARDEN outperforms existing baselines on jailbreak detection, improving the AUC (Area Under Curve) score from 0.92 to 0.96. Notably, at a fixed true positive rate of 90%, PARDEN reduces the false positive rate (FPR) from 24.8% to 2.0%. This roughly 12x improvement is potentially the difference between a useless and a useful defence, since an FPR of 2% might just be acceptable in practice.
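To make the repeat-and-compare idea concrete, here is a minimal sketch of such a defence, not the authors' implementation: the `generate` and `similarity` callables, the exact repeat prompt, and the fixed threshold are all assumptions for illustration (the abstract's AUC numbers come from sweeping the decision threshold rather than fixing one).

```python
from typing import Callable, Tuple


def repeat_filter(
    prompt: str,
    generate: Callable[[str], str],            # assumed: wraps an LLM API call, returns the completion
    similarity: Callable[[str, str], float],   # assumed: a BLEU-like text-overlap score in [0, 1]
    threshold: float = 0.5,                    # assumed cut-off; in practice swept to trace a ROC curve
) -> Tuple[str, bool]:
    """Return (output, flagged); flagged=True means the output is treated as a jailbreak."""
    # 1. Sample the model's answer as usual (this is where the auto-regressive trap occurs).
    output = generate(prompt)

    # 2. Ask the same model to repeat its own output. A safety-aligned model tends to
    #    refuse to repeat harmful content, so the repetition diverges from the original.
    repeat_prompt = (
        "Here is some text in brackets: [" + output + "]\n"
        "Please repeat the text in brackets."
    )
    repetition = generate(repeat_prompt)

    # 3. If the repetition is dissimilar to the original output, flag it as a likely jailbreak.
    flagged = similarity(output, repetition) < threshold
    return output, flagged
```

Because the guard only sees a repetition task, it stays in the same domain as the self-censoring behaviour learned during alignment, which is the intuition the abstract gives for why this outperforms prompting the model to classify toxicity directly.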
