

Poster

Exploring the Benefit of Activation Sparsity in Pre-training

Zhengyan Zhang · Chaojun Xiao · Qiujieli Qin · Yankai Lin · Zhiyuan Zeng · Xu Han · Zhiyuan Liu · Ruobing Xie · Maosong Sun · Jie Zhou


Abstract: Pre-trained Transformers inherently exhibit sparse activation: only a small fraction of neurons are activated for each token. While sparse activation has been exploited to accelerate inference, its potential in pre-training remains untapped. In this work, we first study how activation properties change during pre-training. Our examination reveals that Transformers exhibit sparse activation throughout most of the pre-training process, whereas the activation correlation, i.e., the co-activation probability between each pair of neurons, keeps evolving as training progresses. Leveraging this observation, we propose Switchable Sparse-Dense Learning (SSD). SSD adaptively switches between Mixture-of-Experts (MoE) based sparse training and conventional dense training during pre-training, exploiting the efficiency of sparse training while avoiding its static activation correlation. Compared to dense training, SSD achieves comparable performance on both language modeling and several downstream tasks with identical model size and reduced pre-training costs (up to 1.44$\times$ speedup). Moreover, models trained with SSD can be used directly as MoE models for inference without any further training and achieve a better trade-off between performance and efficiency than baseline methods.
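
The two quantities the abstract relies on, per-token activation sparsity and the pairwise co-activation probability between neurons, can be estimated directly from a layer's FFN activations. The sketch below is illustrative only and assumes ReLU-style activations collected as a (tokens × neurons) NumPy array; the function names and the zero activation threshold are our assumptions, not details from the paper.

```python
import numpy as np

def activation_sparsity(acts: np.ndarray, threshold: float = 0.0) -> float:
    """Fraction of neuron activations that are inactive (<= threshold).

    acts: (num_tokens, num_neurons) FFN activations for one layer.
    """
    active = acts > threshold
    return 1.0 - active.mean()

def coactivation_probability(acts: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """Empirical co-activation probability for each pair of neurons.

    Returns a (num_neurons, num_neurons) matrix whose (i, j) entry is the
    fraction of tokens on which neurons i and j are both active.
    """
    active = (acts > threshold).astype(np.float64)   # (tokens, neurons)
    return active.T @ active / acts.shape[0]

# Toy usage with random ReLU-like activations: 1000 tokens, 512 neurons.
acts = np.maximum(np.random.randn(1000, 512), 0.0)
print(f"sparsity = {activation_sparsity(acts):.2f}")
coact = coactivation_probability(acts)
print("co-activation matrix shape:", coact.shape)
```

Tracking how this co-activation matrix drifts over checkpoints is one way to observe the evolving activation correlation that motivates switching between sparse and dense training.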
