keyman9848 / metatransformer

This project forked from invictus717/metatransformer


Meta-Transformer for Unified Multimodal Learning

Home Page: https://arxiv.org/abs/2307.10802

License: Apache License 2.0

Languages: Shell 18.62%, C++ 0.68%, Python 73.71%, Cuda 6.84%, Cython 0.15%

Meta-Transformer's Introduction

1 Multimedia Lab, The Chinese University of Hong Kong
2 OpenGVLab, Shanghai AI Laboratory
* Equal Contribution · Corresponding Author


🌟 Single Foundation Model Supports A Wide Range of Applications

As a foundation model, Meta-Transformer can handle data from 12 modalities, which enables it to support a wide range of applications. As shown in the figure, Meta-Transformer can serve downstream tasks including stock analysis 📈, weather forecasting ☀️ ☔ ☁️ ❄️ ⛄ ⚡, remote sensing 📡, autonomous driving 🚗, social networking 🌍, speech recognition 🔉, and more.

Table 1: Meta-Transformer is capable of handling up to 12 modalities, including natural language, RGB images, point clouds, audio, video, tabular data, graphs, time-series data, hyperspectral images, IMU data, medical images, and infrared images.

🚩🚩🚩 Shared-Encoder, Unpaired Data, More Modalities

This repository is built to explore the potential and extensibility of Transformers for multimodal learning. We leverage the ability of Transformers to handle variable-length sequences and propose a Data-to-Sequence tokenization that follows a shared meta-scheme, which we then apply to 12 modalities: text, image, point cloud, audio, video, infrared, hyperspectral, X-ray, tabular, graph, time-series, and Inertial Measurement Unit (IMU) data.

After obtaining the token sequences, we employ a modality-shared encoder to extract representations across the different modalities. With task-specific heads, Meta-Transformer can then handle various tasks on these modalities, such as classification, detection, and segmentation.
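Below is a minimal, self-contained sketch of the pipeline described above: a modality-specific tokenizer turns raw data into a token sequence, a shared encoder processes tokens from any modality, and a task-specific head produces predictions. Class and parameter names here (ImageTokenizer, MetaTransformerPipeline, the linear head) are illustrative assumptions, not the repository's actual API.

import torch
import torch.nn as nn

class ImageTokenizer(nn.Module):
    # Data-to-Sequence tokenization for the image modality:
    # non-overlapping patches are projected into a sequence of 768-d tokens.
    def __init__(self, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                          # x: (B, 3, H, W)
        tokens = self.proj(x)                      # (B, dim, H/16, W/16)
        return tokens.flatten(2).transpose(1, 2)   # (B, N, dim)

class MetaTransformerPipeline(nn.Module):
    # tokenizer: modality-specific; encoder: shared across modalities; head: task-specific
    def __init__(self, tokenizer, encoder, head):
        super().__init__()
        self.tokenizer, self.encoder, self.head = tokenizer, encoder, head

    def forward(self, x):
        tokens = self.tokenizer(x)                 # (B, N, dim) token sequence
        feats = self.encoder(tokens)               # modality-shared representations
        return self.head(feats.mean(dim=1))        # pool tokens, then predict

# Example: image classification with a hypothetical 1000-class linear head.
# nn.Identity() stands in for the pretrained shared encoder loaded in the demo below.
model = MetaTransformerPipeline(
    tokenizer=ImageTokenizer(),
    encoder=nn.Identity(),
    head=nn.Linear(768, 1000),
)
logits = model(torch.randn(2, 3, 224, 224))        # -> (2, 1000)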

🌟 News

  • 2023.7.25: 🎉🎉🎉 We have released well-documented code for graph data understanding. Implementations for tabular data and point clouds will be released very soon.
  • 2023.7.23: We have released the code and pretrained weights for image understanding and time-series forecasting.
  • 2023.7.22: 🌟🌟🌟 Pretrained weights and a usage demo for Meta-Transformer have been released. Comprehensive documentation and the implementation for the image modality are underway and will be released soon. Stay tuned for more exciting updates! ⌛⌛⌛
  • 2023.7.21: The paper is released on arXiv, and the code will be gradually released.
  • 2023.7.8: GitHub repository initialization.

🔓 Model Zoo

Open-source Modality-Agnostic Models
Model                  Pretraining  Scale  #Param  Download
Meta-Transformer-B16   LAION-2B     Base   85M     ckpt
Meta-Transformer-L14   LAION-2B     Large  302M    ckpt
Demo of Using the Pretrained Encoder
import torch
import torch.nn as nn
from timm.models.vision_transformer import Block

# Load the pretrained weights of the modality-shared encoder
ckpt = torch.load("Meta-Transformer_base_patch16_encoder.pth")

# The base encoder is a stack of 12 standard ViT blocks (dim 768, 12 heads)
encoder = nn.Sequential(*[
            Block(
                dim=768,
                num_heads=12,
                mlp_ratio=4.,
                qkv_bias=True,
                norm_layer=nn.LayerNorm,
                act_layer=nn.GELU
            )
            for _ in range(12)])
encoder.load_state_dict(ckpt, strict=True)
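The snippet below illustrates one way the loaded encoder might be used; it is a sketch, not the repository's documented workflow. It assumes the checkpoint above has been downloaded and that inputs have already been tokenized into a (batch, tokens, 768) sequence by a modality-specific tokenizer (for images, e.g., a 16x16 patch embedding); the linear classification head is a hypothetical task-specific head.

import torch
import torch.nn as nn

tokens = torch.randn(1, 196, 768)       # e.g., 14 x 14 image patches, embedding dim 768

encoder.eval()
with torch.no_grad():
    features = encoder(tokens)          # (1, 196, 768) modality-shared representations

head = nn.Linear(768, 1000)             # hypothetical task-specific head
logits = head(features.mean(dim=1))     # pool over tokens -> (1, 1000)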

🕙 ToDo

  • Meta-Transformer with Large Language Models.
  • Multimodal Joint Training with Meta-Transformer.
  • Support More Modalities and More Tasks.

Contact

🚀🚀🚀 We aspire to shape this repository into a formidable foundation for mainstream AI perception tasks across diverse modalities. Your contributions can play a significant role in this endeavor, and we warmly welcome your participation in our project!

To contact us, never hesitate to send an email to [email protected], [email protected], [email protected], or [email protected]!

Citation

If the code and paper help your research, please kindly cite:

@article{zhang2023metatransformer,
  title={Meta-Transformer: A Unified Framework for Multimodal Learning},
  author={Zhang, Yiyuan and Gong, Kaixiong and Zhang, Kaipeng and Li, Hongsheng and Qiao, Yu and Ouyang, Wanli and Yue, Xiangyu},
  year={2023},
  journal={arXiv preprint arXiv:2307.10802},
}

License

This project is released under the Apache 2.0 license.

Acknowledgement

This code is developed based on excellent open-source projects, including MMClassification, MMDetection, MMSegmentation, OpenPoints, Time-Series-Library, Graphormer, SpectralFormer, and ViT-Adapter.

