Poster

Denoising Autoregressive Representation Learning

Yazhe Li · Jorg Bornschein · Ting Chen


Abstract:

While visual representation learning and image generation often use separate techniques, the ability to generate realistic images is intrinsically dependent upon a deep understanding of visual representations. In this paper, we explore the potential of generative pre-training for visual representations. Our method employs a decoder-only Transformer to predict image patches autoregressively. We find that training with a Mean Squared Error (MSE) loss alone already yields strong representations. To bring the approach one step closer to image generation methods, we replace the MSE loss with a diffusion objective by adding a denoising patch decoder. We show that representation quality can be improved by using tailored noise schedules and longer training of larger models; however, these schedules differ significantly from those typically used for image generation purposes. Overall, our approach delivers performance remarkably close to state-of-the-art masked prediction models under the fine-tuning protocol. This marks a significant advancement in representation learning through generative approaches.
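The abstract describes two training objectives for the same autoregressive backbone. The sketch below is a minimal, assumed PyTorch illustration, not the authors' released code; the names `ARPatchTransformer`, `mse_loss`, `denoising_loss`, and `patch_decoder` are hypothetical. It shows next-patch regression with MSE, and a denoising variant in which the target patch is corrupted at a sampled noise level and a small patch decoder reconstructs it conditioned on the autoregressive context.

```python
import torch
import torch.nn as nn

class ARPatchTransformer(nn.Module):
    """Decoder-only Transformer over flattened pixel patches in raster order."""

    def __init__(self, patch_dim=768, dim=512, depth=6, heads=8, max_patches=256):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)                 # patch -> token
        self.pos = nn.Parameter(torch.zeros(1, max_patches, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, patch_dim)                  # pixel regression head

    def forward(self, patches):
        # patches: (B, N, patch_dim); a causal mask makes attention autoregressive
        B, N, _ = patches.shape
        x = self.embed(patches) + self.pos[:, :N]
        causal = torch.triu(torch.full((N, N), float("-inf"), device=x.device), 1)
        return self.blocks(x, mask=causal)                     # (B, N, dim) context


def mse_loss(model, patches):
    # Predict patch t+1 from patches <= t and regress raw pixels with MSE.
    h = model(patches[:, :-1])
    return ((model.head(h) - patches[:, 1:]) ** 2).mean()


def denoising_loss(model, patch_decoder, patches, sigmas):
    # Diffusion-objective variant (sketch): corrupt each target patch at a
    # randomly sampled noise level and have a small per-patch decoder predict
    # the clean patch from (noisy patch, AR context). The paper's exact
    # parameterization, conditioning, and noise schedule may differ.
    h = model(patches[:, :-1])
    target = patches[:, 1:]
    idx = torch.randint(len(sigmas), (target.shape[0], 1, 1), device=target.device)
    noisy = target + sigmas[idx] * torch.randn_like(target)
    denoised = patch_decoder(torch.cat([noisy, h], dim=-1))
    return ((denoised - target) ** 2).mean()
```

Here `patch_decoder` could be a small MLP mapping the concatenated (noisy patch, context) vector back to `patch_dim`; in either objective, the representation used for downstream fine-tuning would be the Transformer's hidden states `h`, not the pixel predictions.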
