
Poster

🤳SelfIE: Self-Interpretation of Large Language Model Embeddings

Haozhe Chen · Carl Vondrick · Chengzhi Mao


Abstract:

The expanding impact of Large Language Models (LLMs) demands an increasingly urgent answer to the question: how do LLMs obtain their answers? The ability to understand and control the LLM reasoning process underpins LLM reliability and facilitates future model development. We propose SelfIE (Self-Interpretation of Embeddings), a framework that enables LLMs to interpret their own embeddings in natural language by leveraging their ability to respond to inquiries about a given passage. Capable of interpreting open-world concepts of arbitrary complexity in hidden embeddings, SelfIE reveals LLM internal reasoning in cases such as making ethical decisions, internalizing prompt injection, and recalling harmful knowledge. SelfIE's text descriptions of hidden embeddings open up new avenues for controlling LLM reasoning. We propose Supervised Control, which allows editing open-ended concepts while only requiring gradient computation at individual layers. We also extend RLHF to hidden embeddings and propose Reinforcement Control, which erases harmful knowledge from an LLM without supervision targets. Our approach unlocks more transparent and controllable LLMs, paving the way for more ethical, reliable, and interpretable AI systems.
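
To make the core idea concrete, the sketch below illustrates one way self-interpretation of an embedding can be set up with a Hugging Face causal LM: a hidden state is extracted from a forward pass and patched into a placeholder position of an interpretation prompt processed by the same model, whose generated text then describes the embedding. The model name, layer indices, placeholder position, and prompt wording are illustrative assumptions, not the authors' released implementation.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model name, layer indices, placeholder position, and prompt wording below are
# illustrative assumptions, not the paper's exact implementation.
model_name = "meta-llama/Llama-2-7b-chat-hf"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# 1) Forward pass on the original prompt; keep all hidden states.
orig = tok("The secret ingredient of the recipe is saffron.", return_tensors="pt")
with torch.no_grad():
    out = model(**orig, output_hidden_states=True)

extract_layer, extract_pos = 15, -1  # assumed layer / token position to inspect
embedding = out.hidden_states[extract_layer][0, extract_pos]  # shape: (hidden_dim,)

# 2) Interpretation prompt with a placeholder token; its hidden state will be
#    overwritten by the extracted embedding inside the same LLM.
interp_text = "[INST] _ \nPlease describe the message conveyed above. [/INST]"
interp = tok(interp_text, return_tensors="pt")
placeholder_pos = 2  # index of the "_" token; locate it for your tokenizer

inject_layer = 2  # assumed early layer at which to inject the embedding

def inject(module, args, kwargs):
    hidden = args[0]
    # Only patch during the full-prompt pass, not the cached single-token decode steps.
    if hidden.shape[1] > 1:
        hidden = hidden.clone()
        hidden[0, placeholder_pos] = embedding.to(hidden.dtype)
        return (hidden,) + args[1:], kwargs
    return args, kwargs

handle = model.model.layers[inject_layer].register_forward_pre_hook(inject, with_kwargs=True)
with torch.no_grad():
    gen = model.generate(**interp, max_new_tokens=60, do_sample=False)
handle.remove()

# The generated continuation is the model's natural-language reading of the embedding.
print(tok.decode(gen[0, interp.input_ids.shape[1]:], skip_special_tokens=True))

The same patched forward pass is what the control variants described above would operate on: Supervised Control would backpropagate a supervision signal through the injected layer only, and Reinforcement Control would instead score the generated description with a reward signal.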
