
🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.

Home Page: https://otter-ntu.github.io/

License: Other


Introduction

¹S-Lab, Nanyang Technological University  ²Microsoft Research, Redmond
β™  Co-Project Lead  * Equal Contribution  βœ‰ Corresponding Author


Project Page | Otter Paper | MIMIC-IT Paper | MIMIC-IT Dataset

Video Demo: Otter's Conceptual Demo Video | Bilibili ε“”ε“©ε“”ε“©

Interactive Demo:

Otter Demo (video version)

Our hosted models may be temporarily offline due to GPU limitations (for example, when the GPUs are needed to train new models). You can refer to 🏎️ Run Otter Locally to try Otter-Image and Otter-Video more smoothly on your local machine; at least 16 GB of GPU memory (BF16/FP16 mode) is needed for tasks like image/video tagging, captioning, or identifying harmful content.

Checkpoints:

Otter v0.1 supports multiple image inputs as in-context examples and is the first multi-modal instruction-tuned model to organize inputs this way.

Otter v0.2 supports video inputs (frames are arranged as in the original Flamingo implementation) and multiple image inputs (which serve as in-context examples for each other).

Eval Results: Multi-Modal Arena | Multi-Modal AGI Benchmark (upcoming)

🦾 Update

[2023-07-04]

  1. πŸ₯š Update Eggs section for downloading MIMIC-IT dataset.

[2023-06-23]

  1. 🧨 Download MIMIC-IT Dataset. For more details on navigating the dataset, please refer to MIMIC-IT Dataset README.
  2. 🏎️ Run Otter Locally. You can run our model locally with at least 16 GB of GPU memory for tasks like image/video tagging, captioning, and identifying harmful content. We fixed a bug in video inference where frame tensors were mistakenly unsqueezed into an incorrectly shaped vision_x. You can now try running it again with the updated version.

    Make sure to adjust sys.path.append("../..") correctly so that otter.modeling_otter can be imported when launching the model; a minimal sketch follows.
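
    As a minimal sketch, assuming your script lives two directory levels below the repository root (e.g. under pipeline/demo/); the class name imported at the end is our assumption, so check modeling_otter.py for the exact name:

    # Minimal sketch: make the repository root importable before importing otter.modeling_otter.
    # Adjust the relative path to match where your script actually lives.
    import sys
    sys.path.append("../..")

    from otter.modeling_otter import OtterForConditionalGeneration  # assumed class name; verify in modeling_otter.py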

[2023-06-08]

  1. Introducing Project Otter's brand new homepage: https://otter-ntu.github.io/. Check it out now!
  2. Check out our paper introducing MIMIC-IT in detail. Meet MIMIC-IT, the first multimodal in-context instruction tuning dataset with 2.8M instructions! From general scene understanding to spotting subtle differences and enhancing egocentric view comprehension for AR headsets, our MIMIC-IT dataset has it all.
  3. Stay tuned for our upcoming Otter Model v0.2, trained on the MIMIC-IT dataset! It can understand daily scenes, reason in context, spot differences between observations, and act as an egocentric assistant. Check out the conceptual demo video on YouTube or Bilibili!

🦦 Why In-Context Instruction Tuning?

Large Language Models (LLMs) have demonstrated exceptional universal aptitude as few/zero-shot learners for numerous tasks, owing to their pre-training on extensive text data. Among these LLMs, GPT-3 stands out as a prominent model with significant capabilities. Additionally, variants of GPT-3, namely InstructGPT and ChatGPT, have proven effective in interpreting natural language instructions to perform complex real-world tasks, thanks to instruction tuning.

Motivated by the upstream interleaved-format pretraining of the Flamingo model, we present 🦦 Otter, a multi-modal model based on OpenFlamingo (the open-sourced version of DeepMind's Flamingo). We train Otter with in-context instruction tuning on our proposed Multi-Modal In-Context Instruction Tuning (MIMIC-IT) dataset. Otter showcases improved instruction-following and in-context learning ability on both images and videos.

πŸ—„ MIMIC-IT Dataset Details

MIMIC-IT enables the application of an egocentric visual assistant model that can answer questions like "Hey, do you think I left my keys on the table?". Harness the power of MIMIC-IT to unlock the full potential of your AI-driven visual assistant and elevate your interactive vision-language tasks to new heights.

We also introduce Syphus, an automated pipeline for generating high-quality instruction-response pairs in multiple languages. Building upon the framework proposed by LLaVA, we utilize ChatGPT to generate instruction-response pairs based on visual content. To ensure the quality of the generated instruction-response pairs, our pipeline incorporates system messages, visual annotations, and in-context examples as prompts for ChatGPT.
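
As a rough illustration of this kind of prompt assembly (this is not the actual Syphus implementation; the system message, annotations, and in-context examples below are hypothetical placeholders, and the snippet uses the legacy openai<1.0 Python interface):

# Illustrative sketch only -- not the actual Syphus pipeline.
# Assumes the OPENAI_API_KEY environment variable is set.
import openai  # legacy openai<1.0 interface

system_message = "You write instruction-response pairs grounded in the provided visual annotations."
ict_user = "Annotations: a kitchen table with a set of keys on it. Write one instruction-response pair."
ict_assistant = "Instruction: Did I leave my keys on the table? Response: Yes, the keys are on the table."
query = "Annotations: a person placing a phone on a sofa. Write one instruction-response pair."

completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_message},    # system message
        {"role": "user", "content": ict_user},             # in-context example (input)
        {"role": "assistant", "content": ict_assistant},   # in-context example (output)
        {"role": "user", "content": query},                 # new visual annotations to convert
    ],
)
print(completion.choices[0].message["content"])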

For more details, please check the MIMIC-IT dataset.

πŸ€– Otter Model Details

Otter is designed to support multi-modal in-context instruction tuning based on the OpenFlamingo model, which involves conditioning the language model on the corresponding media, such as an image that corresponds to a caption or an instruction-response pair.

We train Otter on the MIMIC-IT dataset with approximately 2.8 million in-context instruction-response pairs, which are structured into a cohesive template to facilitate various tasks. Otter supports video inputs (frames are arranged as in the original Flamingo implementation) and multiple image inputs as in-context examples, and is the first multi-modal instruction-tuned model to do so.

The following template encompasses images, user instructions, and model-generated responses, utilizing the User and GPT role labels to enable seamless user-assistant interactions.

prompt = f"<image>User: {instruction} GPT:<answer> {response}<endofchunk>"

Training the Otter model on the MIMIC-IT dataset allows it to acquire different capabilities, as demonstrated by the LA and SD tasks. Trained on the LA task, the model exhibits exceptional scene comprehension, reasoning abilities, and multi-round conversation capabilities.

# Multi-round conversation
prompt = f"<image>User: {first_instruction} GPT:<answer> {first_response}<|endofchunk|>User: {second_instruction} GPT:<answer>"

Regarding the concept of organizing visual-language in-context examples, we demonstrate here the acquired ability of the Otter model to follow inter-contextual instructions after training on the LA-T2T task. The organized input data format is as follows:

# Multiple in-context examples with similar instructions
prompt = f"<image>User:{ict_first_instruction} GPT: <answer>{ict_first_response}<|endofchunk|><image>User:{ict_second_instruction} GPT: <answer>{ict_second_response}<|endofchunk|><image>User:{query_instruction} GPT: <answer>"

For details on the other tasks, please refer to the appendix of our paper.

πŸ—‚οΈ Environments

  1. Compare the CUDA version reported by nvidia-smi with the one reported by nvcc --version. They should match, or at least the version reported by nvcc --version should be <= the version reported by nvidia-smi.
  2. Install a PyTorch build that matches your CUDA version (e.g., CUDA 11.7 with torch 2.0.0). We have successfully run this code with CUDA 11.1 / torch 1.10.1 and CUDA 11.7 / torch 2.0.0. You can refer to PyTorch's documentation, Latest or Previous.
  3. You may install the remaining dependencies via conda env create -f environment.yml. In particular, make sure that transformers>=4.28.0 and accelerate>=0.18.0 (a quick sanity check is sketched below).
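
A minimal sanity check of the resulting environment (the version thresholds in the comments simply restate the requirements above):

import torch
import transformers
import accelerate

print(torch.__version__, torch.version.cuda)  # PyTorch build and the CUDA version it was compiled against
print(torch.cuda.is_available())              # should be True on a GPU machine
print(transformers.__version__)               # expected >= 4.28.0
print(accelerate.__version__)                 # expected >= 0.18.0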

πŸ€— Hugging Face Model

After configuring the environment, you can use the 🦩 Flamingo model / 🦦 Otter model as a 🤗 Hugging Face model with only a few lines! One click and the model configs/weights are downloaded automatically. Please refer to Huggingface Otter/Flamingo for details.
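
As a minimal loading sketch (the class name and checkpoint below are our assumptions; substitute the released weights you actually want, and see Huggingface Otter/Flamingo for the canonical usage):

import torch
from otter.modeling_otter import OtterForConditionalGeneration  # assumed class name; verify in modeling_otter.py

# Placeholder checkpoint; see the Otter Weights page for released checkpoints.
model = OtterForConditionalGeneration.from_pretrained(
    "luodian/OTTER-9B-INIT",
    torch_dtype=torch.bfloat16,  # BF16 fits the ~16 GB GPU memory budget mentioned above
    device_map="auto",           # requires accelerate, installed per the environment above
)
model.eval()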

β˜„οΈ Training

To train on the MIMIC-IT dataset, use the following commands.

First, run the following command and answer the questions asked. This will generate a config file and save it to the cache folder. The config will then be used automatically to set the proper default options when you run accelerate launch.

accelerate config

Then run the training script. You may need the specially converted weights at luodian/OTTER-9B-INIT, which initialize training for Otter; they are directly converted from OpenFlamingo, with extra tokens added for downstream instruction tuning. You may also start your training on top of any of our trained weights; see Otter Weights for checkpoints and MIMIC-IT for preparing the JSON files.

accelerate launch --config_file=./pipeline/accelerate_configs/accelerate_config_fsdp.yaml \
pipeline/train/instruction_following.py \
--pretrained_model_name_or_path=luodian/OTTER-9B-INIT  \
--dataset_resampled \
--mimicit_path="path/to/DC_instruction.json" \
--images_path="path/to/DC.json" \
--train_config_path="path/to/DC_train.json" \
--batch_size=4 \
--num_epochs=9 \
--report_to_wandb \
--wandb_entity=ntu-slab \
--run_name=otter9B_dense_caption \
--wandb_project=otter9B \
--workers=1 \
--cross_attn_every_n_layers=4 \
--lr_scheduler=cosine \
--delete_previous_checkpoint \
--learning_rate=1e-5 \
--warmup_steps_ratio=0.01

πŸ“‘ Citation

If you found this repository useful, please consider citing:

@article{li2023otter,
  title={Otter: A Multi-Modal Model with In-Context Instruction Tuning},
  author={Li, Bo and Zhang, Yuanhan and Chen, Liangyu and Wang, Jinghao and Yang, Jingkang and Liu, Ziwei},
  journal={arXiv preprint arXiv:2305.03726},
  year={2023}
}

@article{li2023mimicit,
  title={MIMIC-IT: Multi-Modal In-Context Instruction Tuning},
  author={Bo Li and Yuanhan Zhang and Liangyu Chen and Jinghao Wang and Fanyi Pu and Jingkang Yang and Chunyuan Li and Ziwei Liu},
  year={2023},
  eprint={2306.05425},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

πŸ‘¨β€πŸ« Acknowledgements

We thank Jack Hessel for the advice and support, as well as the OpenFlamingo team for their great contribution to the open-source community.

Huge accolades to the Flamingo and OpenFlamingo teams for their work on this great architecture.

πŸ“ Related Projects
