$\textbf{Lumina-T2X}$: Transform Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers

intro_large

📰 News

  • [2024-04-29] 🔥🔥🔥 We released the 5B model checkpoint built upon it for text-to-image generation.
  • [2024-04-25] 🔥🔥🔥 Support 720P video generation with arbitrary aspect ratio. Examples 🚀🚀🚀
  • [2024-04-19] 🔥🔥🔥 Demo examples released.
  • [2024-04-05] 😆😆😆 Code released for Lumina-T2I.
  • [2024-04-01] 🚀🚀🚀 We release the initial version of Lumina-T2I for text-to-image generation.

🚀 Quick Start

For training and inference, please refer to Lumina-T2I README.md

📑 Open-source Plan

  • Lumina-T2I (Training, Inference, Checkpoints)
  • Lumina-T2V
  • Training Code
  • Web Demo
  • CLI Demo

Introduction

We introduce the $\textbf{Lumina-T2X}$ family, a series of text-conditioned Diffusion Transformers (DiT) capable of transforming textual descriptions into vivid images, dynamic videos, detailed multi-view 3D images, and synthesized speech. At the core of Lumina-T2X lies the Flow-based Large Diffusion Transformer (Flag-DiT), a robust engine that supports up to 7 billion parameters and extends sequence lengths to 128,000 tokens. Drawing inspiration from Sora, Lumina-T2X integrates images, videos, multi-views of 3D objects, and speech spectrograms within a spatial-temporal latent token space, and can generate outputs at any resolution, aspect ratio, and duration.
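To give a rough feel for how resolution and duration translate into sequence length, the sketch below counts latent tokens for an image or a short clip. The function name is hypothetical, and the 8x VAE downsampling factor and 2x2 patch size are illustrative assumptions rather than values stated in this README.

```python
def latent_token_count(height, width, frames=1, vae_downsample=8, patch_size=2):
    """Count latent tokens for `frames` frames of `height` x `width` pixels.

    Assumptions (not taken from the README): the VAE shrinks each spatial
    side by `vae_downsample`, and the transformer groups the latent into
    `patch_size` x `patch_size` patches, one token per patch.
    """
    stride = vae_downsample * patch_size
    if height % stride or width % stride:
        raise ValueError("height and width must be multiples of vae_downsample * patch_size")
    rows = height // stride   # token rows per frame
    cols = width // stride    # token columns per frame
    return frames * rows * cols


print(latent_token_count(1024, 1024))            # 4096
print(latent_token_count(720, 1280, frames=16))  # 57600
```

Under these assumed factors, a 16-frame 720P clip occupies 57,600 tokens, which fits within the 128,000-token sequence length mentioned above.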

🌟 Features:

  • Flow-based Large Diffusion Transformer (Flag-DiT): Lumina-T2X is trained with the flow matching objective and is equipped with many techniques, such as RoPE, RMSNorm, and KQ-norm, demonstrating faster training convergence, stable training dynamics, and a simplified pipeline.
  • Any Modality, Aspect Ratio, and Duration within one framework:
    1. $\textbf{Lumina-T2X}$ can encode any modality, including images, videos, multi-views of 3D objects, and spectrograms, into a unified 1-D token sequence at any resolution, aspect ratio, and temporal duration.
    2. By introducing the nextline and nextframe tokens, our model can support resolution extrapolation, i.e., generating images/videos with out-of-domain resolutions not encountered during training.
  • Low Training Resources: Although increasing the token length generally extends training time, our Large-DiT reduces the total number of training iterations needed, lowering overall training time and computational cost. Moreover, by employing meticulously curated text-image and text-video pairs featuring high aesthetic quality frames and detailed captions, our $\textbf{Lumina-T2X}$ model learns to generate high-resolution images and coherent videos with minimal computational demands. Remarkably, the default Lumina-T2I configuration, equipped with a 5B Flag-DiT and a 7B LLaMA as the text encoder, requires only 20% of the computational resources needed by PixArt-$\alpha$.
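The flow matching objective named in the first feature can be illustrated with a small, dependency-free sketch. It assumes the common linear path from noise (t = 0) to data (t = 1); the README does not state which path or time sampling Lumina-T2X actually uses, and all function names here are hypothetical.

```python
import random


def flow_matching_pair(data, noise, t):
    """Return (x_t, target_velocity) on a linear path from noise (t=0) to data (t=1)."""
    x_t = [(1.0 - t) * n + t * d for d, n in zip(data, noise)]
    velocity = [d - n for d, n in zip(data, noise)]  # constant along the linear path
    return x_t, velocity


def flow_matching_loss(predict_velocity, data, rng):
    """Mean squared error between predicted and target velocity at a random time."""
    noise = [rng.gauss(0.0, 1.0) for _ in data]
    t = rng.random()
    x_t, target = flow_matching_pair(data, noise, t)
    pred = predict_velocity(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(data)


def euler_sample(predict_velocity, noise, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with fixed Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In a real model, `predict_velocity` would be the transformer conditioned on the text prompt; here it is just any callable taking `(x_t, t)` and returning a list of the same length.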

framework
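The role of the nextline and nextframe tokens described in the features above can be sketched as follows. This is only an illustration: the marker strings, the choice to place a marker after every row and every frame, and the function names are assumptions, not the project's actual tokenization code.

```python
NEXTLINE = "<nextline>"    # hypothetical marker closing one row of patch tokens
NEXTFRAME = "<nextframe>"  # hypothetical marker closing one frame


def flatten_tokens(frames):
    """Flatten a [frame][row][col] grid of patch tokens into one 1-D list."""
    sequence = []
    for frame in frames:
        for row in frame:
            sequence.extend(row)
            sequence.append(NEXTLINE)
        sequence.append(NEXTFRAME)
    return sequence


def unflatten_tokens(sequence):
    """Recover the [frame][row][col] grid using only the marker tokens."""
    frames, frame, row = [], [], []
    for token in sequence:
        if token == NEXTLINE:
            frame.append(row)
            row = []
        elif token == NEXTFRAME:
            frames.append(frame)
            frame = []
        else:
            row.append(token)
    return frames
```

Because the layout is carried by the markers rather than by a fixed grid size, the same 1-D sequence format can describe a 2x3 grid, a 3x2 grid, or a multi-frame clip, which is the intuition behind supporting resolutions not seen during training.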

📽️ Demo Examples

Text-to-Image Generation


Text-to-Video Generation

720P Videos:

Prompt: The majestic beauty of a waterfall cascading down a cliff into a serene lake.

video_720p_1.mp4
video_720p_2.mp4

Prompt: A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.

video_tokyo_woman.mp4

360P Videos:

video_360p.mp4

Text-to-3D Generation

multi_view.mp4

More examples

For more demos visit this website

⚙️ Diverse Configurations

We support diverse configurations, including text encoders, DiTs of different parameter sizes, inference methods, and VAE encoders. Additionally, we offer features such as 1D-RoPE, image enhancement, and more.
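As background for the 1D-RoPE feature, here is a minimal sketch of rotary position embedding applied along a single position axis. The interleaved pairing of dimensions and the base of 10000 follow the commonly used formulation and are assumptions here; the README does not describe how Lumina-T2X configures it, and the function names are hypothetical.

```python
import math


def rope_1d(vector, position, base=10000.0):
    """Rotate consecutive pairs of `vector` by angles that grow with `position`."""
    dim = len(vector)
    if dim % 2:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower frequency for later pairs
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        a, b = vector[i], vector[i + 1]
        out.extend([a * cos_t - b * sin_t, a * sin_t + b * cos_t])
    return out


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))
```

The useful property is that the dot product of a rotated query and a rotated key depends only on the difference of their positions, so `dot(rope_1d(q, 5), rope_1d(k, 3))` matches `dot(rope_1d(q, 12), rope_1d(k, 10))` up to floating-point error.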

