1 comment

  • og_kalu 11 hours ago

    Autoregressive Transformers scale and learn much better and faster when trained to predict the "next resolution"/"next scale" — starting from a very small token map and gradually increasing the resolution — rather than being trained to predict the next image token/patch one at a time.

    Related: VAR-CLIP: Text-to-Image Generator with Visual Auto-Regressive Modeling - https://arxiv.org/abs/2408.01181

    STAR: Scale-wise Text-to-image generation via Auto-Regressive representations - https://arxiv.org/abs/2406.10797
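
    The coarse-to-fine idea can be sketched in a few lines: build a pyramid of token maps at increasing resolutions, then train the model so each autoregressive step predicts all tokens of the next scale, conditioned on every coarser scale. The snippet below is a minimal, hypothetical illustration of the data layout (average pooling stands in for the papers' learned multi-scale tokenizer):

    ```python
    import numpy as np

    def build_scale_pyramid(img, scales=(1, 2, 4, 8)):
        """Downsample an image into a coarse-to-fine pyramid via average pooling.

        A next-scale AR model predicts the whole r x r map at scale k+1
        in parallel, conditioned on scales 1..k — instead of emitting one
        patch token at a time.
        """
        h = img.shape[0]
        pyramid = []
        for r in scales:
            f = h // r  # pooling factor that maps the image to an r x r grid
            pooled = img[:r * f, :r * f].reshape(r, f, r, f).mean(axis=(1, 3))
            pyramid.append(pooled)
        return pyramid

    img = np.arange(64.0).reshape(8, 8)
    pyr = build_scale_pyramid(img)
    print([p.shape for p in pyr])  # [(1, 1), (2, 2), (4, 4), (8, 8)]

    # Training sequence: concatenate scales coarse-to-fine; each AR step
    # targets an entire scale (1, then 4, then 16, then 64 tokens = 85 total),
    # so the sequence grows far more gently than raster-order patches.
    seq = np.concatenate([p.ravel() for p in pyr])
    print(seq.size)  # 85
    ```

    Because early steps are tiny, most of the compute lands on a handful of parallel, full-scale predictions rather than thousands of sequential token steps — one plausible reading of why the scaling behaves better.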