luodian / otter

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.

Home Page: https://otter-ntu.github.io/

License: MIT License

Python 96.86% Jupyter Notebook 1.01% Shell 2.13%
gpt-4 visual-language-learning artificial-inteligence deep-learning foundation-models multi-modality machine-learning chatgpt instruction-tuning large-scale-models

otter's Introduction



Project Credits | Otter Paper | OtterHD Paper | MIMIC-IT Paper

Checkpoints:

For users in mainland China: Open in OpenXLab | Open in OpenXLab

Disclaimer: The code may not be perfectly polished and refactored, but all open-sourced code is tested and runnable, as we also use it to support our own research. If you have any questions, please feel free to open an issue. We eagerly look forward to suggestions and PRs that improve the code quality.

🦾 Update

[2023-11]: Supporting GPT-4V evaluation on 8 benchmarks; announcing OtterHD-8B, improved from Fuyu-8B. Check out OtterHD for details.

  1. 🦦 Added OtterHD, a multimodal model fine-tuned from Fuyu-8B to facilitate fine-grained interpretation of high-resolution visual input without an explicit vision encoder module. All image patches are linearly transformed and processed together with text tokens. This is a very innovative and elegant exploration; inspired by it, we open-sourced the fine-tuning script for Fuyu-8B and improved training throughput by roughly 4-5x with Flash-Attention-2. Try our fine-tuning script at OtterHD.
  2. 🔍 Added MagnifierBench, an evaluation benchmark tailored to assess whether a model can identify tiny objects (about 1% of the image size) and their spatial relationships.
  3. Improved pipeline for Pretrain | SFT | RLHF with (part of) current leading LMMs.
    1. Models: Otter | OpenFlamingo | Idefics | Fuyu
    2. Training Datasets Interface: (Pretrain) MMC4 | LAION2B | CC3M | CC12M, (SFT) MIMIC-IT | M3IT | LLAVAR | LRV | SVIT...
      • We tested the above datasets for both pretraining and instruction tuning with OpenFlamingo and Otter. We also tested them with Idefics and Fuyu for instruction tuning. We will open-source the training scripts gradually.
    3. Benchmark Interface: MagnifierBench/MMBench/MM-VET/MathVista/POPE/MME/ScienceQA/SeedBench. They can be run in one click; please see Benchmark for details.
        datasets:
          - name: magnifierbench
            split: test
            prompt: Answer with the option's letter from the given choices directly.
            api_key: [Your API Key] # GPT-4 or GPT-3.5 is used to judge the answers against the ground truth.
            debug: true # debug=true saves the model responses to a log file.
          - name: mme
            split: test
            debug: true
          - name: mmbench
            split: test
            debug: true

        models:
          - name: gpt4v
            api_key: [Your API Key] # to call the GPT-4V model.
  4. Code refactoring for organizing multiple groups of datasets with an integrated yaml file; see details at managing datasets in MIMIC-IT format. For example,
        IMAGE_TEXT: # Group name should be in [IMAGE_TEXT, TEXT_ONLY, IMAGE_TEXT_IN_CONTEXT]
          LADD: # Dataset name can be anything you want
            mimicit_path: azure_storage/json/LA/LADD_instructions.json # path to the instruction json file
            images_path: azure_storage/Parquets/LA.parquet # path to the image parquet file
            num_samples: -1 # number of samples to use; -1 (the default) means use all samples
          M3IT_CAPTIONING:
            mimicit_path: azure_storage/json/M3IT/captioning/coco/coco_instructions.json
            images_path: azure_storage/Parquets/coco.parquet
            num_samples: 20000
    This is a major change and may make previously working code not runnable; please check the details.
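    As a reference, here is a minimal Python sketch of how such a grouped YAML could be read (assuming PyYAML is installed and that ./shared_scripts/Demo_Data.yaml from the training command below follows this format; the actual parsing in the training code may differ):

        # Hedged sketch: iterate over a grouped training-data YAML like the example above.
        import yaml

        with open("./shared_scripts/Demo_Data.yaml") as f:
            data_cfg = yaml.safe_load(f)

        for group_name, datasets in data_cfg.items():        # e.g. IMAGE_TEXT
            for dataset_name, spec in datasets.items():       # e.g. LADD, M3IT_CAPTIONING
                num_samples = spec.get("num_samples", -1)     # -1 means use all samples
                print(group_name, dataset_name, spec["mimicit_path"], num_samples)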

[2023-08]

  1. Added support for using Azure, Anthropic, PaLM, and Cohere models for Self-Instruct with the Syphus pipeline. To use them, modify this line with your selected model and set your API keys in the environment. For more information, see LiteLLM.

[2023-07]: Announcing the MIMIC-IT dataset for multiple interleaved image-text/video instruction tuning.

  1. 🤗 Check out MIMIC-IT on Hugging Face Datasets.
  2. 🥚 Updated the Eggs section for downloading the MIMIC-IT dataset.
  3. 🥃 Contact us if you wish to develop Otter for your own scenarios (satellite images or funny videos?). We aim to support and assist with Otter's diverse use cases. OpenFlamingo and Otter are strong models thanks to Flamingo's excellently designed architecture, which accepts multiple images/videos and other modality inputs. Let's build more interesting models together.

[2023-06]

  1. 🧨 Download the MIMIC-IT dataset. For more details on navigating the dataset, please refer to the MIMIC-IT Dataset README.
  2. 🏎️ Run Otter locally. You can run our model locally with at least 16 GB of GPU memory for tasks like image/video tagging, captioning, and identifying harmful content. We fixed a bug in video inference where frame tensors were mistakenly unsqueezed into a wrong vision_x shape.

    Make sure to adjust sys.path.append("../..") correctly so that otter.modeling_otter can be found when launching the model (see the minimal snippet after this list).

  3. 🤗 Check out our paper introducing MIMIC-IT in detail. Meet MIMIC-IT, the first multimodal in-context instruction tuning dataset with 2.8M instructions! From general scene understanding to spotting subtle differences and enhancing egocentric view comprehension for AR headsets, the MIMIC-IT dataset has it all.
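    The path adjustment mentioned in the note under item 2 above, as a minimal sketch (the relative path is only an example; point it at the repository root that contains otter/):

        import sys

        # Make the repository root importable so that otter.modeling_otter can be found.
        sys.path.append("../..")

        from otter.modeling_otter import OtterForConditionalGeneration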

🦦 Why In-Context Instruction Tuning?

Large Language Models (LLMs) have demonstrated exceptional universal aptitude as few/zero-shot learners for numerous tasks, owing to their pre-training on extensive text data. Among these LLMs, GPT-3 stands out as a prominent model with significant capabilities. Additionally, variants of GPT-3, namely InstructGPT and ChatGPT, have proven effective in interpreting natural language instructions to perform complex real-world tasks, thanks to instruction tuning.

Motivated by the upstream interleaved-format pretraining of the Flamingo model, we present 🦦 Otter, a multi-modal model based on OpenFlamingo (the open-sourced version of DeepMind's Flamingo). We train Otter in an in-context instruction tuning manner on our proposed Multi-Modal In-Context Instruction Tuning (MIMIC-IT) dataset. Otter showcases improved instruction-following and in-context learning ability on both images and videos.

🗄 MIMIC-IT Dataset Details

MIMIC-IT enables the application of an egocentric visual assistant model that can answer questions like "Hey, do you think I left my keys on the table?". Harness the power of MIMIC-IT to unlock the full potential of your AI-driven visual assistant and elevate your interactive vision-language tasks to new heights.

We also introduce Syphus, an automated pipeline for generating high-quality instruction-response pairs in multiple languages. Building upon the framework proposed by LLaVA, we utilize ChatGPT to generate instruction-response pairs based on visual content. To ensure the quality of the generated instruction-response pairs, our pipeline incorporates system messages, visual annotations, and in-context examples as prompts for ChatGPT.

For more details, please check the MIMIC-IT dataset.

🤖 Otter Model Details

Otter is designed to support multi-modal in-context instruction tuning based on the OpenFlamingo model, which involves conditioning the language model on the corresponding media, such as an image that corresponds to a caption or an instruction-response pair.

We train Otter on the MIMIC-IT dataset with approximately 2.8 million in-context instruction-response pairs, which are structured into a cohesive template to facilitate various tasks. Otter supports video inputs (frames are arranged as in the original Flamingo implementation) and multiple image inputs as in-context examples, making it the first multi-modal instruction-tuned model of this kind.

The following template encompasses images, user instructions, and model-generated responses, utilizing the User and GPT role labels to enable seamless user-assistant interactions.

prompt = f"<image>User: {instruction} GPT:<answer> {response}<endofchunk>"

Training the Otter model on the MIMIC-IT dataset allows it to acquire different capabilities, as demonstrated by the LA and SD tasks. Trained on the LA task, the model exhibits exceptional scene comprehension, reasoning abilities, and multi-round conversation capabilities.

# multi-round conversation
prompt = f"<image>User: {first_instruction} GPT:<answer> {first_response}<|endofchunk|>User: {second_instruction} GPT:<answer>"

Regarding organizing visual-language in-context examples, we demonstrate here the Otter model's acquired ability to follow inter-contextual instructions after training on the LA-T2T task. The organized input data format is as follows:

# Multiple in-context examples with similar instructions
prompt = f"<image>User:{ict_first_instruction} GPT: <answer>{ict_first_response}<|endofchunk|><image>User:{ict_second_instruction} GPT: <answer>{ict_second_response}<|endofchunk|><image>User:{query_instruction} GPT: <answer>"

For other tasks, please refer to the appendix of our paper.

🗂️ Environments

  1. Compare the CUDA version reported by nvidia-smi with the one reported by nvcc --version. They need to match, or at least the version from nvcc --version should be <= the version from nvidia-smi.
  2. Install a PyTorch build that matches your CUDA version (e.g. CUDA 11.7 with torch 2.0.0). We have successfully run this code with CUDA 11.1 + torch 1.10.1 and with CUDA 11.7 + torch 2.0.0. You can refer to PyTorch's documentation, Latest or Previous.
  3. You may install via conda env create -f environment.yml. In particular, make sure transformers>=4.28.0 and accelerate>=0.18.0 (a quick check is sketched right after this list).
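As a quick sanity check for the steps above (a sketch; the exact versions depend on your setup), you can confirm the installed builds from Python:

import torch
import transformers
import accelerate

print("torch:", torch.__version__)                 # e.g. 2.0.0
print("built with CUDA:", torch.version.cuda)      # should be <= the driver version shown by nvidia-smi
print("CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)   # expect >= 4.28.0
print("accelerate:", accelerate.__version__)       # expect >= 0.18.0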

After configuring the environment, you can use the 🦩 Flamingo model / 🦦 Otter model as a 🤗 Hugging Face model with only a few lines! One click, and model configs/weights are downloaded automatically. Please refer to Huggingface Otter/Flamingo for details.
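For example, a minimal loading sketch (the checkpoint name mirrors the one used in the multi-batch inference example further down this page; swap in the Otter/Flamingo weights you actually want):

import transformers
from otter.modeling_otter import OtterForConditionalGeneration

# Model configs and weights are downloaded automatically on first use.
model = OtterForConditionalGeneration.from_pretrained(
    "luodian/otter-9b-hf", device_map="auto"
)
tokenizer = model.text_tokenizer
image_processor = transformers.CLIPImageProcessor()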

☄️ Training

Otter is trained based on OpenFlamingo. You may need to use converted weights from luodian/OTTER-9B-INIT or luodian/OTTER-MPT7B-Init. They are converted from OpenFlamingo-LLaMA7B-v1 and OpenFlamingo-MPT7B-v2 respectively; we added an <answer> token for Otter's downstream instruction tuning.

You may also start your training from any trained Otter weights on top of ours; see them at Otter Weights. You can refer to MIMIC-IT for preparing image/instruction/train json files.

export PYTHONPATH=.
RUN_NAME="Otter_MPT7B"
GPU=8
WORKERS=$((${GPU}*2))

echo "Using ${GPU} GPUs and ${WORKERS} workers"
echo "Running ${RUN_NAME}"

accelerate launch --config_file=./pipeline/accelerate_configs/accelerate_config_zero3.yaml \
    --num_processes=${GPU} \
    pipeline/train/instruction_following.py \
    --pretrained_model_name_or_path=luodian/OTTER-MPT7B-Init \
    --model_name=otter \
    --instruction_format=simple \
    --training_data_yaml=./shared_scripts/Demo_Data.yaml \
    --batch_size=8 \
    --num_epochs=3 \
    --report_to_wandb \
    --wandb_entity=ntu-slab \
    --external_save_dir=./checkpoints \
    --run_name=${RUN_NAME} \
    --wandb_project=Otter_MPTV \
    --workers=${WORKERS} \
    --lr_scheduler=cosine \
    --learning_rate=2e-5 \
    --warmup_steps_ratio=0.01 \
    --save_hf_model \
    --max_seq_len=1024

📑 Citation

If you find this repository useful, please consider citing:

@article{li2023otter,
  title={Otter: A Multi-Modal Model with In-Context Instruction Tuning},
  author={Li, Bo and Zhang, Yuanhan and Chen, Liangyu and Wang, Jinghao and Yang, Jingkang and Liu, Ziwei},
  journal={arXiv preprint arXiv:2305.03726},
  year={2023}
}

@article{li2023mimicit,
    title={MIMIC-IT: Multi-Modal In-Context Instruction Tuning},
    author={Bo Li and Yuanhan Zhang and Liangyu Chen and Jinghao Wang and Fanyi Pu and Jingkang Yang and Chunyuan Li and Ziwei Liu},
    year={2023},
    eprint={2306.05425},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

👨‍🏫 Acknowledgements

We thank Jack Hessel for the advice and support, as well as the OpenFlamingo team for their great contribution to the open source community.

Huge accolades to the Flamingo and OpenFlamingo teams for their work on this great architecture.

๐Ÿ“ Related Projects

otter's People

Contributors

arman-hk, bigjoon, chunyuanli, cliangyu, eltociear, ishaan-jaff, jingkang50, king159, lmms-lab, luodian, pufanyi, zhangyuanhan-ai

otter's Issues

Release the evaluation scripts

Hello, I'm trying to test Otter on MSCOCO dataset, would you please release the evaluation codes? Thanks for your help!

Data issues

Hi, thanks for the amazing work and for releasing MIMIC-IT! There seem to be a few issues:

  • for LLaVA-In-Context, seems meta link here is missing, where I assume it's supposed to be the LAxxx_train.json files? maybe there're misunderstandings, and it seems to me that here does not exclude the LAxx_INS prefix (e.g. cur_image_id.split('_')[-1] for LACONV, LACR_I2I, etc), otherwise LAxx_INS_ prefix is unexpectedly included for reading coco images. and there're some cases that have the key like coco/train2017/000000033471_2.jpg, where no _2 img found?
  • for TV caption, in TVC_instructions.json, seems the image ids do not correspond with the ids in converted TVC.json. There are some repetitive patterns, e.g. TVC_IMG_castle_s07e09_seg02_clip_02_castle_s07e09_seg02_clip_02_00009 or TVC_IMG_s04e13_seg01_clip_00_bbt_s04e13_seg01 such that it requires to rekey by r'(TVC_IMG)_(.+?_clip_[0-9]+)_(.+?_clip_[0-9]+)_([0-9]+)' for both cases
  • for spot difference, probably [:5] here is unexpected, otherwise only 5 examples are used?
  • typo here, seems to be video.VisualStoryTelling

For other datasets, it would be great to release the processed x.json file (I noticed the egg version would be coming soon) as some datasets are too old to acquire/process and some video datasets are large. Thank you!

[Feature Support] For Multi-Batch Data Inference Support

If you wish to generate descriptions for multiple images at once, simply use the following code:

import requests
import torch
import transformers
from PIL import Image
from otter.modeling_otter import OtterForConditionalGeneration

model = OtterForConditionalGeneration.from_pretrained(
    "luodian/otter-9b-hf", device_map="auto"
)

tokenizer = model.text_tokenizer
image_processor = transformers.CLIPImageProcessor()
demo_image_one = Image.open(
    requests.get(
        "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
    ).raw
)
demo_image_two = Image.open(
    requests.get(
        "http://images.cocodataset.org/test-stuff2017/000000028137.jpg", stream=True
    ).raw
)
query_image = Image.open(
    requests.get(
        "http://images.cocodataset.org/test-stuff2017/000000028352.jpg", stream=True
    ).raw
)
vision_x = (
    image_processor.preprocess(
        [demo_image_one, demo_image_two, query_image], return_tensors="pt"
    )["pixel_values"]
    .unsqueeze(1)
    .unsqueeze(1) #Here, we reshape the input images into shape [B, 1, 1, 3, H, W], where T_img=1 and F=1.
)
model.text_tokenizer.padding_side = "left"
lang_x = model.text_tokenizer(
    [
        "<image> User: what does the image describe? GPT: <answer>", "<image> User: what does the image describe? GPT: <answer>", "<image> User: what does the image describe? GPT: <answer>" 
    ], #Here, we provide instructions for all images, respectively.
    return_tensors="pt", padding=True #To avoid different lengths of the instructions
)
generated_text = model.generate(
    vision_x=vision_x.to(model.device),
    lang_x=lang_x["input_ids"].to(model.device),
    attention_mask=lang_x["attention_mask"].to(model.device),
    max_new_tokens=256, #4 seconds; max_new_tokens=512, 7 seconds
    num_beams=1,
    no_repeat_ngram_size=3,
)
for i in range(vision_x.size(0)):
    print(f"Generated text for image {i}: ", model.text_tokenizer.decode(generated_text[i]))

The selling point of our project.

We should detail the following points.

  1. The first instruction-tuning work on top of a V+L unsupervised pre-trained model: Flamingo.
  2. A visual instruction tuning dataset with in-context examples.
  3. low-resource (training efficiency and storage efficiency).

Some questions about dataset construction

Thanks for sharing the inspiring work! After reading the original paper, I have some questions about the construction of the MIMIC-IT dataset.

  1. Which dataset do the samples in Fig.2(b) come from? Is it PVSG repository?
  2. How to get the answers for the sample? Are they manually annotated? I know for the VQA part and the LLaVA part, they are naturally paired with the questions/instructions, how about the PVSG part?
  3. How do you get the instructions/queries for the samples? Are they from some specific set of handwritten templates or are they generated by GPT as in LLaVA?

[dataset] Related instruction IDs for LA In-context are incorrect

I was looking around the annotations for LA in-context, I noticed that the instructions specified as related instructions do not exist. Dense Caption doesn't seem to have this problem.

In [2]: import json

In [3]: with open('/path/to/LA_instructions.json') as f:
   ...:     annotations = json.load(f)
   ...: 

In [4]: list(annotations['data'].keys())[:10]
Out[4]: 
['LACONV_00_INS_000000033471_2',
 'LACONV_00_INS_000000052846_4',
 'LACONV_00_INS_000000334872_3',
 'LACONV_00_INS_000000319154_4',
 'LACONV_00_INS_000000398214_4',
 'LACONV_00_INS_000000520873_4',
 'LACONV_00_INS_000000575173_3',
 'LACONV_00_INS_000000087286_3',
 'LACONV_00_INS_000000032286_4',
 'LACONV_00_INS_000000175217_4']

In [5]: annotations['data']['LACONV_00_INS_000000033471_2']
Out[5]: 
{'instruction': 'Is the bus driving down the street or pulled off to the side?',
 'answer': 'The bus is driving down the street, which is crowded with people and other vehicles.',
 'image_ids': ['LA_00_IMG_000000033471'],
 'rel_ins_ids': ['LACONV_00_INS_000000033471_0',
  'LACONV_00_INS_000000033471_1']}

In [6]: annotations['data']['LACONV_00_INS_000000033471_0']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[6], line 1
----> 1 annotations['data']['LACONV_00_INS_000000033471_0']

KeyError: 'LACONV_00_INS_000000033471_0'

In [7]: annotations['data']['LACONV_00_INS_000000033471_1']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[7], line 1
----> 1 annotations['data']['LACONV_00_INS_000000033471_1']

KeyError: 'LACONV_00_INS_000000033471_1'

Maybe I've misunderstood what related instructions are? Either way, please let me know!

Rewrite Open_Flamingo into a 🤗 huggingface model

  • configuration_flamingo.py: write the configuration file of Flamingo
  • modeling_flamingo.py: rewrite the architecture
  • load the checkpoint: rename the layers and .load_state_dict
  • processing_flamingo.py: delay, as we can use the transformers.CLIPImageProcessor() for now
  • model special loading: device_map=auto
  • model special loading: load_in_8bit=True
  • model forward: give the loss
  • model generation: text+image->model->text
  • model card writing: documentation of Flamingo-9B

A question about media_locations

I read your paper and understand that each sample you use for training will contain multiple examples, i.e., each sample will contain multiple images. The images and text will then be arranged in an interleaved form and fed into the model. I don't know if I understand correctly. And I noticed that your code only inputs one image at a time into the model, which makes me wonder

I would appreciate it if you could answer my questions

[demo] RuntimeError: std::bad_alloc

The example code is pulled from here.
Error:

(otter) user@env:~/Otter$ python test-model.py                             
                                                                                    
Using pad_token, but it is not set yet.                                            
Loading checkpoint shards: 100%|██████████| 4/4 [00:35<00:00,  8.92s/it]
Enter prompts (comma-separated): what are they doing?                              

Prompt: what are they doing?                                                       
Traceback (most recent call last):                                                 
  File "/home/user/Otter/test-model.py", line 141, in <module>                                                                                                        
    response = get_response(frames_list, prompt, model, image_processor)                                                                                              
  File "/home/user/Otter/test-model.py", line 98, in get_response                                                                                                     
    generated_text = model.generate(                                               
  File "/home/user/miniconda3/envs/otter/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context                                        
    return func(*args, **kwargs)                                                   
  File "/home/user/Otter/otter/modeling_otter.py", line 873, in generate                                                                                              
    self._encode_vision_x(vision_x=vision_x)                                       
  File "/home/user/Otter/otter/modeling_otter.py", line 831, in _encode_vision_x                                                                                      
    vision_x = self.vision_encoder(vision_x)[0][:, 1:, :]                                                                                                             
  File "/home/user/miniconda3/envs/otter/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl                                             
    return forward_call(*args, **kwargs)                                           
  File "/home/user/miniconda3/envs/otter/lib/python3.9/site-packages/transformers/models/clip/modeling_clip.py", line 940, in forward                                  
    return self.vision_model(                                                      
  File "/home/user/miniconda3/envs/otter/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl                                             
    return forward_call(*args, **kwargs)                                           
  File "/home/user/miniconda3/envs/otter/lib/python3.9/site-packages/transformers/models/clip/modeling_clip.py", line 865, in forward                                  
    hidden_states = self.embeddings(pixel_values)                                  
  File "/home/user/miniconda3/envs/otter/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl                                             
    return forward_call(*args, **kwargs)                                           
  File "/home/user/miniconda3/envs/otter/lib/python3.9/site-packages/transformers/models/clip/modeling_clip.py", line 195, in forward                                  
    patch_embeds = self.patch_embedding(pixel_values)  # shape = [*, width, grid, grid]                                                                               
  File "/home/user/miniconda3/envs/otter/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl                                             
    return forward_call(*args, **kwargs)                                           
  File "/home/user/miniconda3/envs/otter/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 463, in forward                                                   
    return self._conv_forward(input, self.weight, self.bias)                                                                                                          
  File "/home/user/miniconda3/envs/otter/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward                                             
    return F.conv2d(input, weight, bias, self.stride,                              
RuntimeError: std::bad_alloc

Not sure if this is correct, but I've used this as stated in the example (main):

    model = OtterForConditionalGeneration.from_pretrained(
        "luodian/otter-9b-dc-hf",
    )

My packages are:

(otter) user@env:~/Otter$ pip list | grep -e torch -e xformers
open-clip-torch          2.20.0
torch                    2.0.1
torchaudio               2.0.2
torchvision              0.15.2

Originally posted by @Nntsyeo in #147 (comment)

AttributeError: module transformers has no attribute TFOtterForConditionalGeneration

transformers = 4.28.0

For some reason, I couldn't access huggingface, so I use offline mode
model = OtterForConditionalGeneration.from_pretrained("/my/file/path/config.json", device_map="auto",
from_tf=True)

But the following error was encountered

The model weights are not tied. Please use the tie_weights method before using the infer_auto_device function.
Traceback (most recent call last):
File "/home/user4/cww/Otter-main/pipeline/demo/otter_image.py", line 102, in
model = OtterForConditionalGeneration.from_pretrained("/data/user4/CWW_OTTER/config.json", device_map="auto",
File "/data/anaconda3/envs/otter/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2761, in from_pretrained
model, loading_info = load_tf2_checkpoint_in_pytorch_model(
File "/data/anaconda3/envs/otter/lib/python3.9/site-packages/transformers/modeling_tf_pytorch_utils.py", line 407, in load_tf2_checkpoint_in_pytorch_model
tf_model_class = getattr(transformers, tf_model_class_name)
File "/data/anaconda3/envs/otter/lib/python3.9/site-packages/transformers/utils/import_utils.py", line 1139, in getattr
raise AttributeError(f"module {self.name} has no attribute {name}")
AttributeError: module transformers has no attribute TFOtterForConditionalGeneration

Plans on releasing multilingual model/dataset

Dear author,
Thank you for your great work. I am very interested in the multilingual ability of large language models, and it seems that the currently released model supports English only. The multilingual annotations in the dataset are also unavailable so far. Do you have any plans to release the multilingual annotations or a multilingual model?

[Model] Bug when initializing Openflamingo model

Hi, Otter is a great job. However, when initializing openflamingo model, I come across this problem below.

File "pipeline/train/instruction_following.py", line 351, in main
    model = FlamingoForConditionalGeneration.from_pretrained(
  File "/yang_yu/miniconda3/envs/otter/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2305, in from_pretrained
    config, model_kwargs = cls.config_class.from_pretrained(
  File "/yang_yu/miniconda3/envs/otter/lib/python3.9/site-packages/transformers/configuration_utils.py", line 554, in from_pretrained
    return cls.from_dict(config_dict, **kwargs)
  File "/yang_yu/miniconda3/envs/otter/lib/python3.9/site-packages/transformers/configuration_utils.py", line 701, in from_dict
    config = cls(**config_dict)
  File "/yang_yu/lq/project/Otter/flamingo/configuration_flamingo.py", line 69, in __init__
    if text_config["architectures"][0] == "MPTForCausalLM":
TypeError: 'NoneType' object is not subscriptable

Pull Request CI/CD check

Write pytest for:

  • dataset loader test
  • train test
  • demo1: chat test
  • demo2: in-context learning test
  • github action for github pull request CI/CD

Confusion regarding model versions?

Hi, with the release of the new MIMIC-IT paper, I am a bit confused about the different model versions you have. Is the model on huggingface the same as the one for which you report results in the paper? As I understand it, the results in the paper come from instruction fine-tuning OpenFlamingo on the larger MIMIC-IT dataset, whereas the one on huggingface is fine-tuned on an older, slightly smaller version?
If the two models are different, are the weights of the new model already provided on huggingface?
Further, you mention in your readme that there is a new OtterV2 model that you've released---Is this the newer model you describe in the paper? If yes, could you also provide the links to these model weights?
Thanks for your incredible contributions, and looking forward to playing around with the new Otter models!

Demo & Model Serving

  • Multiple model serving and selection
  • Generation parameters
  • Multi-GPU inference support (with HF model)
  • Data recycle

Can I use a single gpu to train this model?

Before you open an issue, please check if a similar issue already exists or has been closed before.

When you open an issue, please be sure to include the following

  • A descriptive title: Can I use a single gpu to train this model?
  • A detailed description: Thanks for your nice work. I want to train the Otter model. May I use a single GPU to train the model. Could you please share your accelerate config? Thanks!
  • Assign an issue type tag (label):
    • dataset (mimic-it download, usage, etc.),
    • demo (online demo), doc (readme, wiki, paper, video etc.),
    • evaluation (evaluation result, performance of Otter etc.),
    • model (model configuration, components, etc.),
    • train (training configuration, process, code, etc.)

Thank you for your contributions!

Questions about the size of model parameters?

Hi, I am very curious why the parameters of openflamingo-9b-hf will take up 30G, because the parameters of LLaMA-7B only take up 13G, and the perceiver and vision_encoder should not take up much space.

[demo] error encountered running Mini Demo code in https://github.com/Luodian/Otter/blob/main/docs/demo.md.

File "Otter/Otter/xformers_model/clip.py", line 77, in forward
patch_embeds = self.patch_embedding(
File "python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "lib/python3.9/site-packages/torch/nn/modules/conv.py", line 463, in forward
return self._conv_forward(input, self.weight, self.bias)
File "lib/python3.9/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: GET was unable to find an engine to execute this computation

Environment:
torch 2.0.0, python 3.9, cuda11.7,

is otter model architecture the same as openflamingo?

hi to the team! in the codebase i found that flamingo architecture and otter architecture are separated into /flamingo and /otter folder, both in huggingface interfaces.

however, after reading the otter paper i found that otter seems to have no difference in architecture from openflamingo. can this be confirmed? (otherwise i sincerely apologize for my misunderstanding).

if so, if i want to continue training flamingo, should i use modeling_flamingo.py or modeling_otter.py? or are they both ok? heartfully thanks for your kind help.

different atten-mask depending on whether xformer is installed

hi, thanks for your awesome work, which helps me a lot.

i'm trying to deploy otter locally. diving into the code, i found that otter's behavior differs depending on whether xformers is installed. precisely, when xformers is not installed, otter uses the openflamingo-format attention_mask; when installed, otter uses the default attention_mask.

i was wondering if this will introduce any performance gap between using xformers or not? and did the training process use xformers or not?

4.19-4.20 MVP Stage Assignments

Liangyu's work

  • add support for Vicuna pretrained LLM, https://github.com/lm-sys/FastChat#vicuna-weights. @liangyuch
  • support interactive chat (may init from alpaca or vicuna to better support for chat demo) @liangyuch
  • support language only chat @liangyuch
  • !!!!support in-context learning example @liangyuch
  • host collie demo on one or multiple 3090 GPU
    • run python collie_chat.py --checkpoint_path xxx

Bo's work

  • check image preprocess (image mean/variance) in (1) finetuning stage on multi-instruct (2) pretraining stage on mmc4. they need to be aligned. @Luodian
  • check trained model's ability to support (1) interactive chat (2) different instructions.
  • check special tokens are correctly aligned during training and demo evaluation.

Yuanhan's work

  • test accelerator training, or give Bo's access to a prepared environment to test accelerator on 8x32G V100.
  • prepare datasets card, introduce our multi-instruct in-context learning dataset.
  • deepspeed config, solve hard-code problem.
  • review training code;
  • Add LoRA

Data Paths

We use {azure_blob} as the mounted blob's start path

Trained Model Paths

  1. Train on GQA: {azure_blob}/models/collie9B_gqa

Helpful Discussion

  1. confirm training data for current stage
  2. setup demo and support interactive chat.
  3. fsdp, 8bit Adam, LoRA for efficient and distributed tuning.
  4. support Collie9B inference on consumer GPU
  5. enrich instructions from ChatGPT.

Documentation Improvement before our publication

Documentation

refine requirements.txt and environment.yml

  1. check the environment on 8*A100, 8*V100, 4*3090, 2*8*v100

dataset card

dataset description: size, source category, maybe a pie chart?

model card

OpenFlamingo 9B model using a CLIP ViT-Large vision encoder and a LLaMA-7B language model

Licence

LLaMA, OpenFlamingo, and our dataset follow different licences, and we need to align with or inherit some of them.

readme.md

  1. train script on 8*A100, 2*8*V100, 4*3090
  2. title change
  3. demo links
  4. introduction: why multi-modal instruction tuning
  5. an image for main architecture
  6. acknowledgement: we use OpenFlamingo, OFA, etc.
  7. Authorship

Code formatting

To improve the readability of our model code, we should

  1. black format
  2. type hint

maybe a new name? ଘ(੭ˊᵕˋ)੭

Liger: Language 🐯 + vision 🦁

current args description in `instruction_following.py` is not updated to our current training script.

We may need to update them in our next PR.

e.g.

    parser.add_argument("--use_media_placement_augmentation", action="store_true")
    parser.add_argument("--offline", action="store_true")
    parser.add_argument("--num_epochs", type=int, default=1)
    parser.add_argument("--logging_steps", type=int, default=100, help="log loss every n steps")
    # Sum of gradient optimization batch size
    parser.add_argument("--batch_size", type=int, default=128)

    parser.add_argument("--gradient_accumulation_steps", type=int, default=1)
    parser.add_argument(
        "--pretrained_model_name_or_path",
        type=str,
        help="path to huggingface model or model identifier from local path or huggingface.co",
        default=None,
    )
    parser.add_argument(
        "--load_from_original_checkpoint",
        type=str,
        help="path to openflamingo provided checkpoint, in .pt format",
        default=None,
    )
    parser.add_argument(
        "--resume_from_checkpoint",
        action="store_true",
    )
    parser.add_argument(
        "--overwrite_checkpoint",
        action="store_true",
    )
    parser.add_argument(
        "--delete_previous_checkpoint",
        action="store_true",
        help="delete previous checkpoint when saving new checkpoint",
    )
    parser.add_argument(
        "--multi_instruct_path",
        type=str,
        help="path to multi_instruct dataset, this should be a glob pattern such as vision_language_examples.tsv",
    )
    parser.add_argument(
        "--images_path",
        type=str,
        help="path to images_path dataset, this should be a glob pattern such as vision_language_examples.tsv",
    )
    parser.add_argument(
        "--train_config_path",
        type=str,
        help="path to train_config_path dataset, this should be a glob pattern such as vision_language_examples.tsv",
    )
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--learning_rate", default=1e-4, type=float)
    parser.add_argument(
        "--lr_scheduler",
        default="constant",
        type=str,
        help="constant, linear, or cosine",
    )
    parser.add_argument("--loss_multiplier_multi_instruct", type=float, default=1.0)

Computing output likelihoods with the model

Hi, is it possible to get the tokenwise log-likelihood scores of different outputs from the model?

The use-case would be something like:
Given an interleaved image/text input and a list of output text candidates, we should be able to get a score for each output candidate and then return their ranked list, rather than generating the outputs directly. This would be close to how LLMs are evaluated on MCQ tasks. An example from the T0 paper Page 6 (https://arxiv.org/pdf/2110.08207.pdf):

For tasks that involve choosing the correct completion from several options (e.g. multiple choice
question answering), we follow Brown et al. (2020) and use rank classification to evaluate our
model: we compute the log-likelihood of each of the target options under the fine-tuned model and
select the option with the highest log-likelihood as the prediction. For simplicity, we do not apply
length normalization to the log-likelihoods of the target options.

Is it straightforward to do this with Otter? I assume since the LM is built with transformers there should be a possibility to use output score functions already implemented (haven't dug into this yet)?
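One possible approach, sketched under the assumption that the underlying language model exposes standard transformers logits (the helper below is hypothetical; for Otter you would additionally pass vision_x and the attention mask through the model's forward):

# Hypothetical sketch: score a candidate answer by the sum of its token log-likelihoods.
import torch
import torch.nn.functional as F

def candidate_log_likelihood(model, tokenizer, prompt, candidate):
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + candidate, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                 # [1, seq_len, vocab]
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)   # distribution over each next token
    targets = full_ids[:, 1:]
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    candidate_len = full_ids.shape[1] - prompt_ids.shape[1]
    return token_ll[:, -candidate_len:].sum().item()    # only the candidate's tokens

# Rank classification: the candidate with the highest log-likelihood is the prediction
# (no length normalization, as in the T0 setup quoted above).
# scores = {c: candidate_log_likelihood(lm, tok, prompt, c) for c in candidates}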

questions about different instruction files in LA

hi to the team, otter seems really interesting!

when i'm investigating the released dataset, i found that the LA (in my opinion it refers to LLaVA) folder contains more than one instruction file (and their corresponding _train.json); they are

LACONV_instructions.json
LACR_I2I_instructions.json
LACR_T2T_instructions.json
LADD_instructions.json

what do these instruction prefixes actually mean?

I carefully looked into the MIMIC-IT paper; it mentions the LA-T2T task but does not reveal what it is.
From issue #149 I know they are jointly used to train OTTER-9B-LA-InContext, but I am still at sea.

i would extend sincere apology if i missed something, thanks for your patience!

ERROR: Could not build wheels for horovod, which is required to install pyproject.toml-based projects

Hello Otter team!

I have encountered an issue when installing the environment:
× python setup.py bdist_wheel did not run successfully.
│ exit code: 1

ERROR: Failed building wheel for horovod
Running setup.py clean for horovod
Failed to build horovod
ERROR: Could not build wheels for horovod, which is required to install pyproject.toml-based projects

A little bit of context: I directly ran conda env create -f environment.yml

Support 🌻 LoRA

  • A LoRA model
    If we can use peft wrapper, it will be fast and we can test multiple PEFT methods.
    If not, we will write an additional model modifying the current modeling_flamingo

Configure openai key in Syphus

I want to use Syphus on my own dataset, but there was an error when requesting the OpenAI API, and I don't know how to choose OPENAI_API_BASE, OPENAI_API_VERSION, and OPENAI_API_ENGINE. When I use 'chatgpt0301' or other models as the ENGINE, the ENGINE value cannot be accessed. When I drop the engine and use gpt-3.5-turbo as the model, the request still fails.
Can you provide some parameter examples? Thank you!

Accelerate config file

It would be helpful if you could share the Accelerate configuration required to run with 2xRTX-3090-24G GPUs.

[evaluation] MMAGIBench can not be accessed

hi to the team, otter seems really interesting!

when i was reading your MIMIC-IT paper, i noticed the otter model was evaluated on the MMAGIBench benchmark; its citation is

[43] MMAGIBench Team. Mmagibench: A universal multi-modal benchmark towards artificial general intelligence. https://github.com/open-mmlab/mmagibench, 2023. 3, 9, 10

however, the project page of MMAGIBench is not accessible, and i can hardly find more information about this benchmark other than Otter paper, could you please give some hint? thanks in advance!

Multi-image inference scripts

Hello, thanks so much for providing your amazing model. Are there plans to release a colab notebook or example python script for using Otter for in-context learning / multi-image inputs?
Thanks!

Training data examples

Thanks for sharing the amazing project! Can you provide examples of training data to help us better understand the training data format?

[Fix/Train] `image_processor` is missing in `get_mimicit_dataset`

Before you open an issue, please check if a similar issue already exists or has been closed before.

When you open an issue, please be sure to include the following

  • A descriptive title: Image Processor is missing in get_mimicit_dataset
  • A detailed description
  • Assign an issue type tag (label):
    • dataset (mimic-it download, usage, etc.),
    • demo (online demo), doc (readme, wiki, paper, video etc.),
    • evaluation (evaluation result, performance of Otter etc.),
    • model (model configuration, components, etc.),
    • train (training configuration, process, code, etc.)

Image_processor is missing in get_mimicit_dataset function.

Improving Demo

For any demo related issues, please comment in this issue.
