

Poster

Transformers are SSMs: Generalized Models and Efficient Algorithms with Structured State Space Duality

Tri Dao · Albert Gu


Abstract: While Transformers have been the main architecture behind deep learning's success in language modeling, state-space models (SSMs) such as Mamba have recently been shown to match or beat Transformers at small to medium scale. We show that these families of models are actually quite closely related, and develop a rich framework of theoretical connections between SSMs and variants of attention, connected through various decompositions of a well-studied class of structured *semiseparable matrices*. Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose core layer is 2-8$\times$ faster than Mamba's selective SSM, while continuing to outperform Transformers on language modeling. In particular, Mamba-2-2.7B trained on 300B tokens of the Pile matches a Pythia model of twice its size trained on the same dataset.
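To make the duality concrete, the following is a minimal numerical sketch (not the authors' Mamba-2 implementation) of the idea behind SSD for a scalar-recurrence SSM: the linear-time recurrent form and the quadratic, attention-like form built from a 1-semiseparable mask produce the same outputs. All variable names and shapes (`T`, `N`, `a`, `B`, `C`, `x`) are illustrative assumptions.

```python
# Sketch: equivalence of the recurrent (linear-time) and matrix (quadratic,
# attention-like) computations of a selective SSM with A_t = a_t * I.
import numpy as np

rng = np.random.default_rng(0)
T, N = 6, 4                      # sequence length, state dimension
x = rng.standard_normal(T)       # scalar input channel
a = rng.uniform(0.5, 1.0, T)     # per-step decay (selective A_t = a_t * I)
B = rng.standard_normal((T, N))  # input projections B_t
C = rng.standard_normal((T, N))  # output projections C_t

# Linear (recurrent) form: h_t = a_t * h_{t-1} + B_t x_t,  y_t = C_t . h_t
h = np.zeros(N)
y_linear = np.empty(T)
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_linear[t] = C[t] @ h

# Quadratic (attention-like) form: y = M x, where M is a semiseparable matrix
# with M[t, s] = (C_t . B_s) * prod_{k=s+1}^{t} a_k for t >= s, else 0.
M = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        M[t, s] = (C[t] @ B[s]) * np.prod(a[s + 1 : t + 1])
y_quadratic = M @ x

assert np.allclose(y_linear, y_quadratic)
print("max difference:", np.abs(y_linear - y_quadratic).max())
```

The masked matrix `M` plays the role of an (unnormalized) attention matrix, which is the sense in which the abstract connects SSMs to variants of attention through structured semiseparable matrices.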
