

Poster

COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability

Xingang Guo · Fangxu Yu · Huan Zhang · Lianhui Qin · Bin Hu


Abstract:

Jailbreaks on large language models (LLMs) have recently received increasing attention. For AI safety, it is important to understand how LLMs behave under attacks with diverse features, and it is therefore crucial to study how to enforce control on adversarial LLM attacks so that they exhibit various features (e.g., stealthiness, sentiment, etc.). In this paper, we formally formulate the controllable attack generation problem and build a novel connection between this problem and controllable text generation, an extensively studied subfield of natural language processing. Building on this connection, we tailor Energy-based Constrained Decoding with Langevin Dynamics (COLD), a state-of-the-art algorithm in controllable text generation, to develop the COLD-Attack framework, which unifies and automates the search for adversarial LLM attacks under a variety of control requirements such as fluency, stealthiness, sentiment, and left-right-coherence. COLD-Attack naturally inherits the advantages of COLD and offers flexibility in addressing various forms of control, allowing us to study new attack settings such as automatic paraphrasing with sentiment control or stealthy attacks with left-right-coherence. Finally, we present comprehensive evaluations on various LLMs to demonstrate the wide applicability of COLD-Attack.
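To make the underlying mechanism concrete, the following is a minimal, self-contained sketch (not the authors' code) of the generic idea that COLD-style methods build on: treating the sequence to be generated as a continuous "soft" tensor of logits, defining an energy function that scores how well the decoded sequence satisfies the desired constraints (fluency, stealthiness, sentiment, etc.), and sampling low-energy sequences with Langevin dynamics before discretizing. The function name `langevin_sample` and the toy target-matching energy are hypothetical placeholders for the real constraint energies used in the paper.

```python
import torch


def langevin_sample(energy_fn, seq_len, vocab_size,
                    steps=200, step_size=0.1, noise_scale=1.0):
    """Sample a token sequence by Langevin dynamics on a soft sequence of logits.

    energy_fn maps a (seq_len, vocab_size) tensor of logits to a scalar energy;
    lower energy means the decoded sequence better satisfies the constraints.
    """
    # Soft sequence: continuous logits over the vocabulary at every position.
    y = torch.randn(seq_len, vocab_size, requires_grad=True)
    for t in range(steps):
        energy = energy_fn(y)
        grad, = torch.autograd.grad(energy, y)
        # Anneal the injected noise toward zero over the course of sampling.
        noise_std = noise_scale * (1.0 - t / steps)
        with torch.no_grad():
            y.add_(-step_size * grad + noise_std * torch.randn_like(y))
    # Decode the soft sequence into discrete tokens (argmax per position).
    return y.detach().argmax(dim=-1)


if __name__ == "__main__":
    # Toy energy: pull the soft sequence toward a fixed one-hot target.
    seq_len, vocab_size = 10, 50
    target = torch.nn.functional.one_hot(
        torch.randint(vocab_size, (seq_len,)), vocab_size).float()
    energy_fn = lambda y: ((torch.softmax(y, dim=-1) - target) ** 2).sum()
    print(langevin_sample(energy_fn, seq_len, vocab_size))
```

In the actual framework, the energy would combine terms such as an attack-success objective on the target LLM together with fluency and other controllability constraints; the sketch above only illustrates the gradient-plus-noise update on soft logits.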
