

Poster

How Smooth Is Attention?

Valérie Castin · Pierre Ablin · Gabriel Peyré


Abstract:

Self-attention and masked self-attention are at the heart of Transformers' outstanding success. Still, our mathematical understanding of attention, in particular of its Lipschitz properties — which are key to analyzing robustness and expressive power — is incomplete. We provide a detailed study of the Lipschitz constant of self-attention in several practical scenarios, discussing the impact of the sequence length and of layer normalization on the local Lipschitz constant of both unmasked and masked self-attention. In particular, we identify theoretically a large-radius regime where the local Lipschitz constant grows like the square root of the sequence length, up to a constant factor. We also provide upper bounds and matching lower bounds in the mean-field regime, i.e. when the sequence length goes to infinity. Our mean-field framework for masked self-attention is novel and of independent interest. Finally, our experiments show that the large-radius regime describes well what happens with real data, both for a pre-trained BERT model and for a randomly initialized GPT-2 model.
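To make the quantity studied in the abstract concrete, the sketch below numerically estimates the local Lipschitz constant of a single unmasked self-attention layer as the spectral norm of its Jacobian, and prints the estimate for a few sequence lengths. This is not the authors' code: the weight matrices, the embedding dimension, and the choice of i.i.d. Gaussian tokens (a stand-in for the "large radius" setting) are all assumptions made for illustration.

```python
# Minimal sketch: estimate the local Lipschitz constant of self-attention
# as the spectral norm of its Jacobian, for growing sequence length n.
# Weights, dimensions, and inputs are arbitrary choices, not from the paper.
import jax
import jax.numpy as jnp

d = 16  # embedding dimension (assumed)
kq, kk, kv = jax.random.split(jax.random.PRNGKey(0), 3)
W_Q = jax.random.normal(kq, (d, d)) / jnp.sqrt(d)
W_K = jax.random.normal(kk, (d, d)) / jnp.sqrt(d)
W_V = jax.random.normal(kv, (d, d)) / jnp.sqrt(d)

def self_attention(X):
    """Single-head unmasked self-attention; X has shape (n, d)."""
    scores = (X @ W_Q) @ (X @ W_K).T / jnp.sqrt(d)
    return jax.nn.softmax(scores, axis=-1) @ (X @ W_V)

def local_lipschitz_estimate(X):
    """Spectral norm of the Jacobian of self_attention at X."""
    n = X.shape[0]
    J = jax.jacfwd(self_attention)(X).reshape(n * d, n * d)
    return jnp.linalg.norm(J, ord=2)

for n in (8, 32, 128):
    # i.i.d. Gaussian tokens, used here as a rough proxy for the large-radius regime
    X = jax.random.normal(jax.random.PRNGKey(n), (n, d))
    print(n, float(local_lipschitz_estimate(X)))
```

Under these assumptions, comparing the printed estimates across n gives an empirical counterpart to the theoretical square-root-of-sequence-length growth discussed in the abstract.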
