Poster

Scaling exponents across parameterizations and optimizers: A large-scale empirical study

Katie Everett · Lechao Xiao · Mitchell Wortsman · Alexander Alemi · Roman Novak · Peter Liu · Izzeddin Gur · Jascha Sohl-Dickstein · Leslie Kaelbling · Jaehoon Lee · Jeffrey Pennington


Abstract:

In large Transformer-based neural networks, parameterizations with well-defined scaling limits can enable hyperparameter transfer across scales when hyperparameter search is infeasible at the scale of the largest models. However, extensive empirical validation of width-scaling parameterizations in realistic architectures is lacking. We investigate the major open questions between the theory and practice of width scaling, and perform extensive empirical scaling studies across all combinations of four parameterizations and three optimizers. We report measured scaling exponents for the learning rate and compare them to theoretical predictions, and investigate the particular impact of two assumptions made in the theoretical setting. We propose measuring the alignment between the parameters and data, a dynamical quantity that impacts the learning rate scaling. Our results suggest several practical takeaways, including the necessity of tuning Adam’s epsilon parameter.
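
As an illustration of what "measuring a scaling exponent for the learning rate" can mean in practice, the sketch below fits a power law lr_opt(width) ≈ A · width^(−c) to measured optimal learning rates at several widths via log-log least squares. This is a minimal sketch, not the paper's code; the widths, learning rates, and function names are illustrative assumptions, not values or results from the study.

```python
# Minimal sketch (not the paper's code): estimate a learning-rate scaling
# exponent c from the learning rate that performed best at each model width,
# assuming an approximate power law  lr_opt(width) ≈ A * width**(-c).
# All numeric values below are synthetic placeholders.
import numpy as np

def fit_lr_scaling_exponent(widths, optimal_lrs):
    """Fit log(lr_opt) = log(A) - c * log(width) by least squares.

    Returns (c, A): the scaling exponent and the prefactor.
    """
    log_w = np.log(np.asarray(widths, dtype=float))
    log_lr = np.log(np.asarray(optimal_lrs, dtype=float))
    slope, intercept = np.polyfit(log_w, log_lr, deg=1)
    return -slope, np.exp(intercept)

if __name__ == "__main__":
    # Hypothetical sweep: for each width, the learning rate that minimized
    # loss (e.g. found by a grid search at small scale).
    widths = [256, 512, 1024, 2048]
    optimal_lrs = [3.2e-3, 1.7e-3, 8.5e-4, 4.1e-4]

    c, A = fit_lr_scaling_exponent(widths, optimal_lrs)
    print(f"fitted exponent c = {c:.2f}")  # roughly 1 for these placeholder values
    print(f"extrapolated lr at width 8192 = {A * 8192**(-c):.2e}")
```

A fitted exponent of this kind is what one would compare against the exponent a given parameterization predicts for the infinite-width limit; the extrapolation step is what makes hyperparameter transfer to larger models possible when a full search at the target scale is infeasible.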
