

Poster

Compute Better Spent: Replacing Dense Layers with Structured Matrices

Shikai Qiu · Andres Potapczynski · Marc Finzi · Micah Goldblum · Andrew Wilson


Abstract:

Foundation models have proven the efficacy of scaling up existing architectures in data, parameters, and computation. In these models, the largest fraction of parameters and computation is spent in dense linear layers, whose cost scales quadratically with dimension. While well suited for parallelization, dense layers are far from the only choice. In this work, we explore a range of sub-quadratic linear layers, identifying how to initialize these unconventional structures and scale their learning rates, as well as normalization techniques that stabilize training. Using the insights from our systematic search, we propose a novel structure called Block Tensor-Train (BTT). BTT is a promising replacement for dense linear layers, achieving the best performance under a fixed compute budget across different architectures on both vision and language tasks. Moreover, we find that maximizing the number of parameters per unit of compute is essential for an effective linear layer.
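To make the idea of a sub-quadratic structured replacement for a dense layer concrete, here is a minimal, hedged sketch of a generic two-core tensor-train-style linear layer. It is not the paper's Block Tensor-Train definition or its prescribed initialization; the class name `TTLinear` and parameters `d1`, `d2`, `rank` are assumptions for illustration. The point it shows is that factoring the feature dimension as d = d1·d2 and applying two small contractions uses far fewer parameters and FLOPs than a dense d×d matrix.

```python
# Hedged sketch only: a two-core tensor-train-style linear layer, not the
# paper's exact BTT structure. It maps d = d1*d2 inputs to d1*d2 outputs
# using two small contractions instead of one dense d x d matrix.
import torch
import torch.nn as nn


class TTLinear(nn.Module):
    def __init__(self, d1: int, d2: int, rank: int = 1):
        super().__init__()
        # Core A mixes the first factor of the feature index; core B mixes the
        # second. Fan-in style scaling is an illustrative choice, not the
        # paper's initialization rule.
        self.A = nn.Parameter(torch.randn(d1, d1, rank) / d1**0.5)
        self.B = nn.Parameter(torch.randn(rank, d2, d2) / (rank * d2) ** 0.5)
        self.d1, self.d2 = d1, d2

    def forward(self, x):  # x: (batch, d1*d2)
        b = x.shape[0]
        x = x.view(b, self.d1, self.d2)
        # Contract over the first factor, introducing the rank index r.
        x = torch.einsum("bij,iar->barj", x, self.A)  # (b, d1, r, d2)
        # Contract over the rank index and the second factor.
        x = torch.einsum("barj,rjc->bac", x, self.B)  # (b, d1, d2)
        return x.reshape(b, self.d1 * self.d2)


# Usage: with d1 = d2 = 32 and rank 4, this stands in for a 1024 x 1024 dense
# layer (~1M parameters) using only 2 * 32 * 32 * 4 = 8192 parameters.
layer = TTLinear(d1=32, d2=32, rank=4)
y = layer(torch.randn(8, 1024))
print(y.shape)  # torch.Size([8, 1024])
```

Under these assumptions, the compute per example is O(d^1.5 · rank) rather than O(d^2), which is the kind of parameters-per-FLOP tradeoff the abstract argues matters when choosing a linear layer.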
