ICML Poster FiT: Flexible Vision Transformer for Diffusion Model

Zeyu Lu · Zidong Wang · Di Huang · Chengyue Wu · Xihui Liu · Wanli Ouyang · Lei Bai

Abstract:

Nature is infinitely resolution-free. In the context of this reality, existing diffusion models, such as Diffusion Transformers, often face challenges when processing image resolutions outside of their trained domain. To overcome this limitation, we present the Flexible Vision Transformer (FiT), a transformer architecture specifically designed for generating images with arbitrary resolutions and aspect ratios. Distinct from fixed-resolution training, FiT adopts a simple yet effective training strategy that accommodates varying aspect ratios for both training and inference. This approach not only fosters resolution generalization but also eliminates biases introduced by image cropping. Furthermore, enhanced by a meticulously adjusted network structure and the integration of training-free extrapolation techniques, FiT exhibits remarkable flexibility in resolution extrapolation generation. Our comprehensive experiments demonstrate the exceptional performance of FiT across a broad range of resolutions, showcasing its effectiveness both within and beyond its training resolution distribution.
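
The abstract does not spell out how varying aspect ratios are accommodated during training. As a rough illustration only, one common way to do this is to resize each image so that its patch grid fits a fixed token budget while keeping the aspect ratio, then pad the token sequence and mask the padding. The sketch below assumes that approach; the function names, the patch size of 16, and the budget of 256 tokens are hypothetical choices for the example, not values taken from the abstract.

```python
import math


def fit_resolution(height, width, patch=16, max_tokens=256):
    """Return a (height, width) that keeps the aspect ratio as closely as
    possible while the patch grid holds at most max_tokens patches.

    Hypothetical helper: patch and max_tokens are illustrative defaults.
    """
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    # Largest uniform scale whose pixel area fits the token budget.
    scale = math.sqrt(max_tokens * patch * patch / (height * width))
    scale = min(scale, 1.0)  # this sketch never upsamples
    # Round each side down to whole patches, keeping at least one patch.
    rows = min(max(1, int(height * scale) // patch), max_tokens)
    # Clamp columns so extreme aspect ratios still respect the budget.
    cols = min(max(1, int(width * scale) // patch), max_tokens // rows)
    return rows * patch, cols * patch


def padding_mask(height, width, patch=16, max_tokens=256):
    """Return a list of max_tokens booleans: True for real patch tokens,
    False for padding appended to reach the fixed sequence length."""
    n_tokens = (height // patch) * (width // patch)
    if n_tokens > max_tokens:
        raise ValueError("image has more patches than max_tokens")
    return [True] * n_tokens + [False] * (max_tokens - n_tokens)


if __name__ == "__main__":
    for size in [(512, 512), (1024, 512), (128, 128)]:
        h, w = fit_resolution(*size)
        mask = padding_mask(h, w)
        print(size, "->", (h, w), "real tokens:", sum(mask))
```

Under these assumptions a square 512 x 512 input maps to 256 x 256 and fills the whole budget, while a 2:1 input keeps its 2:1 shape and leaves a few padded positions that the mask marks as False. No cropping is involved, which is the property the abstract credits with removing cropping bias.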
