

Poster

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Patrick Esser · Robin Rombach · Andreas Blattmann · Jonas Müller · Axel Sauer · Sumith Kulal · Rahim Entezari · Dustin Podell · Frederic Boesel · Dominik Lorenz · Tim Dockhorn · Zion English · Harry Saini · Yam Levi


Abstract:

Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent diffusion formulation that connects data and noise in a straight line. Despite its better theoretical properties, it has not yet become decisively established in practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and that lower validation loss correlates with improved text-to-image synthesis as measured by various metrics and human evaluations. Our largest models are competitive with state-of-the-art models, and we make our experimental data, code, and model weights publicly available.
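To make the two main ideas in the abstract concrete, here is a minimal sketch of a rectified-flow training step. The straight-line interpolation between data and noise follows the abstract's description; the logit-normal timestep density is one plausible way to bias sampling towards intermediate, perceptually relevant noise scales, and all names here (`logit_normal_timesteps`, `rectified_flow_loss`, `model`) are illustrative, not taken from the paper's released code.

```python
import torch

def logit_normal_timesteps(batch_size: int, mean: float = 0.0, std: float = 1.0) -> torch.Tensor:
    """Sample t in (0, 1) with density concentrated at intermediate noise scales."""
    u = torch.randn(batch_size) * std + mean  # u ~ N(mean, std^2)
    return torch.sigmoid(u)                   # t = sigmoid(u) follows a logit-normal law

def rectified_flow_loss(model, x: torch.Tensor) -> torch.Tensor:
    """Regress the constant velocity eps - x along the straight path
    z_t = (1 - t) * x + t * eps connecting data x and noise eps."""
    eps = torch.randn_like(x)
    t = logit_normal_timesteps(x.shape[0]).to(x.device)
    t_b = t.view(-1, *([1] * (x.dim() - 1)))  # broadcast t over spatial dims
    z_t = (1.0 - t_b) * x + t_b * eps         # point on the straight line
    target = eps - x                          # time-independent velocity target
    pred = model(z_t, t)                      # network predicts the velocity
    return torch.mean((pred - target) ** 2)
```

Likewise, the architecture described in the abstract keeps separate weights for the text and image modalities while letting attention run jointly over both token sets. The sketch below shows that idea only in outline, assuming per-modality input and output projections around a shared attention operation; the class and parameter names are hypothetical.

```python
import torch
import torch.nn as nn

class JointAttentionSketch(nn.Module):
    """Separate per-modality projections, joint attention over all tokens."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.in_img = nn.Linear(dim, dim)   # image-specific input projection
        self.in_txt = nn.Linear(dim, dim)   # text-specific input projection
        self.out_img = nn.Linear(dim, dim)  # image-specific output projection
        self.out_txt = nn.Linear(dim, dim)  # text-specific output projection

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor):
        # Concatenate both modalities so every token attends to every other:
        # information flows bidirectionally between image and text tokens.
        h = torch.cat([self.in_img(img_tokens), self.in_txt(txt_tokens)], dim=1)
        h, _ = self.attn(h, h, h)
        n_img = img_tokens.shape[1]
        return self.out_img(h[:, :n_img]), self.out_txt(h[:, n_img:])
```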
