1 comment

  • og_kalu 11 hours ago

    Autoregressive Transformers scale and learn much better and faster when trained to predict the "next resolution"/"next scale" — starting from a very small token map and gradually increasing the resolution — rather than being trained to predict the next image token/patch one at a time.

    Related: VAR-CLIP: Text-to-Image Generator with Visual Auto-Regressive Modeling - https://arxiv.org/abs/2408.01181

    STAR: Scale-wise Text-to-image generation via Auto-Regressive representations - https://arxiv.org/abs/2406.10797
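
    The coarse-to-fine idea can be sketched in a few lines: build a pyramid of token maps at increasing resolutions, then train the model so each autoregressive step predicts all tokens of the next scale, conditioned on every coarser scale. The snippet below is a minimal, hypothetical illustration of the data layout (average pooling stands in for the papers' learned multi-scale tokenizer):

    ```python
    import numpy as np

    def build_scale_pyramid(img, scales=(1, 2, 4, 8)):
        """Downsample an image into a coarse-to-fine pyramid via average pooling.

        A next-scale AR model predicts the whole r x r map at scale k+1
        in parallel, conditioned on scales 1..k — instead of emitting one
        patch token at a time.
        """
        h = img.shape[0]
        pyramid = []
        for r in scales:
            f = h // r  # pooling factor that maps the image to an r x r grid
            pooled = img[:r * f, :r * f].reshape(r, f, r, f).mean(axis=(1, 3))
            pyramid.append(pooled)
        return pyramid

    img = np.arange(64.0).reshape(8, 8)
    pyr = build_scale_pyramid(img)
    print([p.shape for p in pyr])  # [(1, 1), (2, 2), (4, 4), (8, 8)]

    # Training sequence: concatenate scales coarse-to-fine; each AR step
    # targets an entire scale (1, then 4, then 16, then 64 tokens = 85 total),
    # so the sequence grows far more gently than raster-order patches.
    seq = np.concatenate([p.ravel() for p in pyr])
    print(seq.size)  # 85
    ```

    Because early steps are tiny, most of the compute lands on a handful of parallel, full-scale predictions rather than thousands of sequential token steps — one plausible reading of why the scaling behaves better.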