$\textbf{Lumina-T2X}$ : Transform Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers
- [2024-04-29] We released the 5B model checkpoint for text-to-image generation.
- [2024-04-25] Support for 720P video generation with arbitrary aspect ratios. Examples below.
- [2024-04-19] Demo examples released.
- [2024-04-05] Code released for Lumina-T2I.
- [2024-04-01] We released the initial version of Lumina-T2I for text-to-image generation.
For training and inference, please refer to the Lumina-T2I README.md.
- Lumina-T2I (Training, Inference, Checkpoints)
- Lumina-T2V
- Training Code
- Web Demo
- CLI Demo
We introduce $\textbf{Lumina-T2X}$, a unified framework for transforming text into any modality, resolution, and duration via flow-based large diffusion transformers.
Features:
- Flow-based Large Diffusion Transformer (Flag-DiT): Lumina-T2X is trained with the flow matching objective and is equipped with many techniques, such as RoPE, RMSNorm, and KQ-norm, demonstrating faster training convergence, stable training dynamics, and a simplified pipeline.
- Any Modalities, Aspect, and Duration within one framework:
  - $\textbf{Lumina-T2X}$ can encode any modality, including images, videos, multi-views of 3D objects, and spectrograms, into a unified 1-D token sequence at any resolution, aspect ratio, and temporal duration.
  - By introducing the `nextline` and `nextframe` tokens, our model supports resolution extrapolation, i.e., generating images/videos at out-of-domain resolutions not encountered during training.
- Low Training Resources: Although longer token sequences generally extend training time, our Large-DiT reduces the total number of training iterations needed, minimizing overall training time and computational resources. Moreover, by employing meticulously curated text-image and text-video pairs featuring high-aesthetic-quality frames and detailed captions, our $\textbf{Lumina-T2X}$ model learns to generate high-resolution images and coherent videos with minimal computational demands. Remarkably, the default Lumina-T2I configuration, equipped with a 5B Flag-DiT and a 7B LLaMA text encoder, requires only 20% of the computational resources needed by PixArt-$\alpha$.
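To make the flow matching objective mentioned above concrete, here is a minimal, hypothetical training-step sketch. This is not the repository's actual code: the `model` signature, tensor shapes, and the straight-path (rectified-flow) formulation are assumptions for illustration.

```python
import torch

def flow_matching_loss(model, x1, cond, t=None):
    """Hedged sketch of one flow matching training step.

    x1: clean data latents; x0: Gaussian noise. The model is trained to
    predict the constant velocity (x1 - x0) along the straight path
    x_t = (1 - t) * x0 + t * x1.
    """
    x0 = torch.randn_like(x1)                      # noise endpoint of the path
    if t is None:
        t = torch.rand(x1.shape[0], device=x1.device)  # per-sample time in [0, 1)
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # reshape t for broadcasting
    xt = (1 - t_) * x0 + t_ * x1                   # interpolated point on the path
    v_target = x1 - x0                             # ground-truth velocity field
    v_pred = model(xt, t, cond)                    # diffusion transformer forward pass
    return torch.nn.functional.mse_loss(v_pred, v_target)
```

Compared with the DDPM noise-prediction objective, this regression target needs no noise schedule, which is one reason flow-based training pipelines are simpler.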
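The `nextline`/`nextframe` mechanism above can be illustrated with a small, hypothetical flattening routine: a (frames, rows, cols) grid of patch tokens becomes one 1-D sequence, with placeholder tokens marking row and frame boundaries so the model can recover the layout at any resolution or duration. The token spellings and nesting here are assumptions, not the repository's API.

```python
# Placeholder boundary tokens (illustrative names).
NEXTLINE, NEXTFRAME = "[nextline]", "[nextframe]"

def flatten_to_sequence(latents):
    """Flatten nested patch tokens [frame][row][col] into one 1-D sequence.

    A NEXTFRAME token precedes every frame after the first, and a NEXTLINE
    token precedes every row after the first within each frame.
    """
    seq = []
    for f, frame in enumerate(latents):
        if f > 0:
            seq.append(NEXTFRAME)  # boundary between consecutive frames
        for r, row in enumerate(frame):
            if r > 0:
                seq.append(NEXTLINE)  # boundary between consecutive rows
            seq.extend(row)
    return seq

# A 2-frame "video" of 2x2 patch tokens:
video = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
# flatten_to_sequence(video)
# -> [1, 2, "[nextline]", 3, 4, "[nextframe]", 5, 6, "[nextline]", 7, 8]
```

Because row width and frame count are signaled by tokens rather than fixed positions, the same sequence format covers images (one frame), videos, multi-view sets, and spectrograms.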
720P Videos:
Prompt: The majestic beauty of a waterfall cascading down a cliff into a serene lake.
video_720p_1.mp4
video_720p_2.mp4
Prompt: A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.
video_tokyo_woman.mp4
360P Videos:
video_360p.mp4
multi_view.mp4
For more demos, visit this website.
We support diverse configurations, including text encoders, DiTs of different parameter sizes, inference methods, and VAE encoders. Additionally, we offer features such as 1D-RoPE, image enhancement, and more.