luodian / otter

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.

Home Page: https://otter-ntu.github.io/

License: MIT License

Python 96.86% Jupyter Notebook 1.01% Shell 2.13%
gpt-4 visual-language-learning artificial-inteligence deep-learning foundation-models multi-modality machine-learning chatgpt instruction-tuning large-scale-models

otter's Introduction



Project Credits | Otter Paper | OtterHD Paper | MIMIC-IT Paper

Checkpoints:

For users in mainland China: Open in OpenXLab | Open in OpenXLab

Disclaimer: The code may not be perfectly polished and refactored, but all open-sourced code is tested and runnable, as we also use it to support our own research. If you have any questions, please feel free to open an issue. We eagerly look forward to suggestions and PRs that improve the code quality.

🦾 Update

[2023-11]: Supporting GPT-4V evaluation on 8 benchmarks; announcing OtterHD-8B, improved from Fuyu-8B. Check out OtterHD for details.

  1. 🦦 Added OtterHD, a multimodal model fine-tuned from Fuyu-8B to facilitate fine-grained interpretation of high-resolution visual input without an explicit vision encoder module. All image patches are linearly transformed and processed together with text tokens. This is a very innovative and elegant exploration; inspired by it, we open-sourced the fine-tuning script for Fuyu-8B and improved training throughput by roughly 4-5x with Flash-Attention-2. Try our fine-tuning script at OtterHD.
  2. 🔍 Added MagnifierBench, an evaluation benchmark tailored to assess whether a model can identify tiny objects (about 1% of the image size) and their spatial relationships.
  3. Improved pipeline for Pretrain | SFT | RLHF with (part of) current leading LMMs.
    1. Models: Otter | OpenFlamingo | Idefics | Fuyu
    2. Training Datasets Interface: (Pretrain) MMC4 | LAION2B | CC3M | CC12M, (SFT) MIMIC-IT | M3IT | LLAVAR | LRV | SVIT...
      • We tested the above datasets for both pretraining and instruction tuning with OpenFlamingo and Otter. We also tested them with Idefics and Fuyu for instruction tuning. We will open-source the training scripts gradually.
    3. Benchmark Interface: MagnifierBench/MMBench/MM-VET/MathVista/POPE/MME/ScienceQA/SeedBench. They can be run in one click; please see Benchmark for details.
        datasets:
          - name: magnifierbench
            split: test
            prompt: Answer with the option's letter from the given choices directly.
            api_key: [Your API Key] # GPT-4 or GPT-3.5 is used to judge the answers against the ground truth.
            debug: true # debug=true saves the model responses to a log file.
          - name: mme
            split: test
            debug: true
          - name: mmbench
            split: test
            debug: true

        models:
          - name: gpt4v
            api_key: [Your API Key] # to call the GPT-4V model.
  4. Code refactoring for organizing multiple groups of datasets with an integrated yaml file; see details at managing datasets in MIMIC-IT format. For example,
        IMAGE_TEXT: # Group name should be in [IMAGE_TEXT, TEXT_ONLY, IMAGE_TEXT_IN_CONTEXT]
          LADD: # Dataset name can be anything you want
            mimicit_path: azure_storage/json/LA/LADD_instructions.json # path to the instruction json file
            images_path: azure_storage/Parquets/LA.parquet # path to the image parquet file
            num_samples: -1 # number of samples to use; -1 (the default) means use all samples
          M3IT_CAPTIONING:
            mimicit_path: azure_storage/json/M3IT/captioning/coco/coco_instructions.json
            images_path: azure_storage/Parquets/coco.parquet
            num_samples: 20000
    This is a major change and may make previously working code not runnable; please check the details.
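    As a reference, here is a minimal Python sketch of how such a grouped YAML could be read (assuming PyYAML is installed and that ./shared_scripts/Demo_Data.yaml from the training command below follows this format; the actual parsing in the training code may differ):

        # Hedged sketch: iterate over a grouped training-data YAML like the example above.
        import yaml

        with open("./shared_scripts/Demo_Data.yaml") as f:
            data_cfg = yaml.safe_load(f)

        for group_name, datasets in data_cfg.items():        # e.g. IMAGE_TEXT
            for dataset_name, spec in datasets.items():       # e.g. LADD, M3IT_CAPTIONING
                num_samples = spec.get("num_samples", -1)     # -1 means use all samples
                print(group_name, dataset_name, spec["mimicit_path"], num_samples)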

[2023-08]

  1. Added support for using Azure, Anthropic, PaLM, and Cohere models for Self-Instruct with the Syphus pipeline. To use them, modify this line with your selected model and set your API keys in the environment. For more information, see LiteLLM.

[2023-07]: Announcing the MIMIC-IT dataset for multiple interleaved image-text/video instruction tuning.

  1. 🤗 Check out MIMIC-IT on Hugging Face Datasets.
  2. 🥚 Updated the Eggs section for downloading the MIMIC-IT dataset.
  3. 🥃 Contact us if you wish to develop Otter for your own scenarios (satellite images or funny videos?). We aim to support and assist with Otter's diverse use cases. OpenFlamingo and Otter are strong models thanks to Flamingo's excellently designed architecture, which accepts multiple images/videos and other modality inputs. Let's build more interesting models together.

[2023-06]

  1. 🧨 Download the MIMIC-IT dataset. For more details on navigating the dataset, please refer to the MIMIC-IT Dataset README.
  2. 🏎️ Run Otter locally. You can run our model locally with at least 16 GB of GPU memory for tasks like image/video tagging, captioning, and identifying harmful content. We fixed a bug in video inference where frame tensors were mistakenly unsqueezed into a wrong vision_x shape.

    Make sure to adjust sys.path.append("../..") correctly so that otter.modeling_otter can be found when launching the model (see the minimal snippet after this list).

  3. 🤗 Check out our paper introducing MIMIC-IT in detail. Meet MIMIC-IT, the first multimodal in-context instruction tuning dataset with 2.8M instructions! From general scene understanding to spotting subtle differences and enhancing egocentric view comprehension for AR headsets, the MIMIC-IT dataset has it all.
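    The path adjustment mentioned in the note under item 2 above, as a minimal sketch (the relative path is only an example; point it at the repository root that contains otter/):

        import sys

        # Make the repository root importable so that otter.modeling_otter can be found.
        sys.path.append("../..")

        from otter.modeling_otter import OtterForConditionalGeneration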

🦦 Why In-Context Instruction Tuning?

Large Language Models (LLMs) have demonstrated exceptional universal aptitude as few/zero-shot learners for numerous tasks, owing to their pre-training on extensive text data. Among these LLMs, GPT-3 stands out as a prominent model with significant capabilities. Additionally, variants of GPT-3, namely InstructGPT and ChatGPT, have proven effective in interpreting natural language instructions to perform complex real-world tasks, thanks to instruction tuning.

Motivated by the upstream interleaved-format pretraining of the Flamingo model, we present 🦦 Otter, a multi-modal model based on OpenFlamingo (the open-sourced version of DeepMind's Flamingo). We train Otter in an in-context instruction tuning manner on our proposed Multi-Modal In-Context Instruction Tuning (MIMIC-IT) dataset. Otter showcases improved instruction-following and in-context learning ability on both images and videos.

🗄 MIMIC-IT Dataset Details

MIMIC-IT enables the application of an egocentric visual assistant model that can answer questions like "Hey, do you think I left my keys on the table?". Harness the power of MIMIC-IT to unlock the full potential of your AI-driven visual assistant and elevate your interactive vision-language tasks to new heights.

We also introduce Syphus, an automated pipeline for generating high-quality instruction-response pairs in multiple languages. Building upon the framework proposed by LLaVA, we utilize ChatGPT to generate instruction-response pairs based on visual content. To ensure the quality of the generated instruction-response pairs, our pipeline incorporates system messages, visual annotations, and in-context examples as prompts for ChatGPT.

For more details, please check the MIMIC-IT dataset.

🤖 Otter Model Details

Otter is designed to support multi-modal in-context instruction tuning based on the OpenFlamingo model, which involves conditioning the language model on the corresponding media, such as an image that corresponds to a caption or an instruction-response pair.

We train Otter on the MIMIC-IT dataset with approximately 2.8 million in-context instruction-response pairs, which are structured into a cohesive template to facilitate various tasks. Otter supports video inputs (frames are arranged as in the original Flamingo implementation) and multiple image inputs as in-context examples, making it the first multi-modal instruction-tuned model of this kind.

The following template encompasses images, user instructions, and model-generated responses, utilizing the User and GPT role labels to enable seamless user-assistant interactions.

prompt = f"<image>User: {instruction} GPT:<answer> {response}<endofchunk>"

Training the Otter model on the MIMIC-IT dataset allows it to acquire different capabilities, as demonstrated by the LA and SD tasks. Trained on the LA task, the model exhibits exceptional scene comprehension, reasoning abilities, and multi-round conversation capabilities.

# multi-round conversation
prompt = f"<image>User: {first_instruction} GPT:<answer> {first_response}<|endofchunk|>User: {second_instruction} GPT:<answer>"

Regarding organizing visual-language in-context examples, we demonstrate here the Otter model's acquired ability to follow inter-contextual instructions after training on the LA-T2T task. The organized input data format is as follows:

# Multiple in-context examples with similar instructions
prompt = f"<image>User:{ict_first_instruction} GPT: <answer>{ict_first_response}<|endofchunk|><image>User:{ict_second_instruction} GPT: <answer>{ict_second_response}<|endofchunk|><image>User:{query_instruction} GPT: <answer>"

For other tasks, please refer to the appendix of our paper.

🗂️ Environments

  1. Compare the CUDA version reported by nvidia-smi with the one reported by nvcc --version. They need to match, or at least the version from nvcc --version should be <= the version from nvidia-smi.
  2. Install a PyTorch build that matches your CUDA version (e.g. CUDA 11.7 with torch 2.0.0). We have successfully run this code with CUDA 11.1 + torch 1.10.1 and with CUDA 11.7 + torch 2.0.0. You can refer to PyTorch's documentation, Latest or Previous.
  3. You may install via conda env create -f environment.yml. In particular, make sure transformers>=4.28.0 and accelerate>=0.18.0 (a quick check is sketched right after this list).
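As a quick sanity check for the steps above (a sketch; the exact versions depend on your setup), you can confirm the installed builds from Python:

import torch
import transformers
import accelerate

print("torch:", torch.__version__)                 # e.g. 2.0.0
print("built with CUDA:", torch.version.cuda)      # should be <= the driver version shown by nvidia-smi
print("CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)   # expect >= 4.28.0
print("accelerate:", accelerate.__version__)       # expect >= 0.18.0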

After configuring the environment, you can use the 🦩 Flamingo model / 🦦 Otter model as a 🤗 Hugging Face model with only a few lines! One click, and model configs/weights are downloaded automatically. Please refer to Huggingface Otter/Flamingo for details.
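For example, a minimal loading sketch (the checkpoint name mirrors the one used in the multi-batch inference example further down this page; swap in the Otter/Flamingo weights you actually want):

import transformers
from otter.modeling_otter import OtterForConditionalGeneration

# Model configs and weights are downloaded automatically on first use.
model = OtterForConditionalGeneration.from_pretrained(
    "luodian/otter-9b-hf", device_map="auto"
)
tokenizer = model.text_tokenizer
image_processor = transformers.CLIPImageProcessor()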

☄️ Training

Otter is trained based on OpenFlamingo. You may need to use converted weights from luodian/OTTER-9B-INIT or luodian/OTTER-MPT7B-Init. They are converted from OpenFlamingo-LLaMA7B-v1 and OpenFlamingo-MPT7B-v2 respectively; we added an <answer> token for Otter's downstream instruction tuning.

You may also start your training from any trained Otter weights on top of ours; see them at Otter Weights. You can refer to MIMIC-IT for preparing image/instruction/train json files.

export PYTHONPATH=.
RUN_NAME="Otter_MPT7B"
GPU=8
WORKERS=$((${GPU}*2))

echo "Using ${GPU} GPUs and ${WORKERS} workers"
echo "Running ${RUN_NAME}"

accelerate launch --config_file=./pipeline/accelerate_configs/accelerate_config_zero3.yaml \
    --num_processes=${GPU} \
    pipeline/train/instruction_following.py \
    --pretrained_model_name_or_path=luodian/OTTER-MPT7B-Init \
    --model_name=otter \
    --instruction_format=simple \
    --training_data_yaml=./shared_scripts/Demo_Data.yaml \
    --batch_size=8 \
    --num_epochs=3 \
    --report_to_wandb \
    --wandb_entity=ntu-slab \
    --external_save_dir=./checkpoints \
    --run_name=${RUN_NAME} \
    --wandb_project=Otter_MPTV \
    --workers=${WORKERS} \
    --lr_scheduler=cosine \
    --learning_rate=2e-5 \
    --warmup_steps_ratio=0.01 \
    --save_hf_model \
    --max_seq_len=1024

📑 Citation

If you find this repository useful, please consider citing:

@article{li2023otter,
  title={Otter: A Multi-Modal Model with In-Context Instruction Tuning},
  author={Li, Bo and Zhang, Yuanhan and Chen, Liangyu and Wang, Jinghao and Yang, Jingkang and Liu, Ziwei},
  journal={arXiv preprint arXiv:2305.03726},
  year={2023}
}

@article{li2023mimicit,
    title={MIMIC-IT: Multi-Modal In-Context Instruction Tuning},
    author={Bo Li and Yuanhan Zhang and Liangyu Chen and Jinghao Wang and Fanyi Pu and Jingkang Yang and Chunyuan Li and Ziwei Liu},
    year={2023},
    eprint={2306.05425},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

👨‍🏫 Acknowledgements

We thank Jack Hessel for the advice and support, as well as the OpenFlamingo team for their great contribution to the open source community.

Huge accolades to the Flamingo and OpenFlamingo teams for their work on this great architecture.

๐Ÿ“ Related Projects

otter's People

Contributors

arman-hk, bigjoon, chunyuanli, cliangyu, eltociear, ishaan-jaff, jingkang50, king159, lmms-lab, luodian, pufanyi, zhangyuanhan-ai

otter's Issues

Release the evaluation scripts

Hello, I'm trying to test Otter on MSCOCO dataset, would you please release the evaluation codes? Thanks for your help!

Data issues

Hi, thanks for the amazing work and for releasing MIMIC-IT! There seem to be a few issues:

  • for LLaVA-In-Context, seems meta link here is missing, where I assume it's supposed to be the LAxxx_train.json files? maybe there're misunderstandings, and it seems to me that here does not exclude the LAxx_INS prefix (e.g. cur_image_id.split('_')[-1] for LACONV, LACR_I2I, etc), otherwise LAxx_INS_ prefix is unexpectedly included for reading coco images. and there're some cases that have the key like coco/train2017/000000033471_2.jpg, where no _2 img found?
  • for TV caption, in TVC_instructions.json, seems the image ids do not correspond with the ids in converted TVC.json. There are some repetitive patterns, e.g. TVC_IMG_castle_s07e09_seg02_clip_02_castle_s07e09_seg02_clip_02_00009 or TVC_IMG_s04e13_seg01_clip_00_bbt_s04e13_seg01 such that it requires to rekey by r'(TVC_IMG)_(.+?_clip_[0-9]+)_(.+?_clip_[0-9]+)_([0-9]+)' for both cases
  • for spot difference, probably [:5] here is unexpected, otherwise only 5 examples are used?
  • typo here, seems to be video.VisualStoryTelling

For other datasets, it would be great to release the processed x.json file (I noticed the egg version would be coming soon) as some datasets are too old to acquire/process and some video datasets are large. Thank you!

[Feature Support] For Multi-Batch Data Inference Support

If you wish to generate descriptions for multiple images at once, simply use the following code:

import requests
import torch
import transformers
from PIL import Image
from otter.modeling_otter import OtterForConditionalGeneration

model = OtterForConditionalGeneration.from_pretrained(
    "luodian/otter-9b-hf", device_map="auto"
)

tokenizer = model.text_tokenizer
image_processor = transformers.CLIPImageProcessor()
demo_image_one = Image.open(
    requests.get(
        "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
    ).raw
)
demo_image_two = Image.open(
    requests.get(
        "http://images.cocodataset.org/test-stuff2017/000000028137.jpg", stream=True
    ).raw
)
query_image = Image.open(
    requests.get(
        "http://images.cocodataset.org/test-stuff2017/000000028352.jpg", stream=True
    ).raw
)
vision_x = (
    image_processor.preprocess(
        [demo_image_one, demo_image_two, query_image], return_tensors="pt"
    )["pixel_values"]
    .unsqueeze(1)
    .unsqueeze(1) #Here, we reshape the input images into shape [B, 1, 1, 3, H, W], where T_img=1 and F=1.
)
model.text_tokenizer.padding_side = "left"
lang_x = model.text_tokenizer(
    [
        "<image> User: what does the image describe? GPT: <answer>", "<image> User: what does the image describe? GPT: <answer>", "<image> User: what does the image describe? GPT: <answer>" 
    ], #Here, we provide instructions for all images, respectively.
    return_tensors="pt", padding=True #To avoid different lengths of the instructions
)
generated_text = model.generate(
    vision_x=vision_x.to(model.device),
    lang_x=lang_x["input_ids"].to(model.device),
    attention_mask=lang_x["attention_mask"].to(model.device),
    max_new_tokens=256, #4 seconds; max_new_tokens=512, 7 seconds
    num_beams=1,
    no_repeat_ngram_size=3,
)
for i in range(vision_x.size(0)):
    print(f"Generated text for image {i}: ", model.text_tokenizer.decode(generated_text[i]))

The selling point of our project.

We should detail the following points.

  1. The first instruction-tuning work on top of a V+L unsupervised pre-trained model: Flamingo.
  2. A visual instruction tuning dataset with in-context examples.
  3. low-resource (training efficiency and storage efficiency).

Some questions about dataset construction

Thanks for sharing the inspiring work! After reading the original paper, I have some questions about the construction of the MIMIC-IT dataset.

  1. Which dataset do the samples in Fig.2(b) come from? Is it PVSG repository?
  2. How to get the answers for the sample? Are they manually annotated? I know for the VQA part and the LLaVA part, they are naturally paired with the questions/instructions, how about the PVSG part?
  3. How do you get the instructions/queries for the samples? Are they from some specific set of handwritten templates or are they generated by GPT as in LLaVA?

[dataset] Related instruction IDs for LA In-context are incorrect

I was looking around the annotations for LA in-context, I noticed that the instructions specified as related instructions do not exist. Dense Caption doesn't seem to have this problem.

In [2]: import json

In [3]: with open('/path/to/LA_instructions.json') as f:
   ...:     annotations = json.load(f)
   ...: 

In [4]: list(annotations['data'].keys())[:10]
Out[4]: 
['LACONV_00_INS_000000033471_2',
 'LACONV_00_INS_000000052846_4',
 'LACONV_00_INS_000000334872_3',
 'LACONV_00_INS_000000319154_4',
 'LACONV_00_INS_000000398214_4',
 'LACONV_00_INS_000000520873_4',
 'LACONV_00_INS_000000575173_3',
 'LACONV_00_INS_000000087286_3',
 'LACONV_00_INS_000000032286_4',
 'LACONV_00_INS_000000175217_4']

In [5]: annotations['data']['LACONV_00_INS_000000033471_2']
Out[5]: 
{'instruction': 'Is the bus driving down the street or pulled off to the side?',
 'answer': 'The bus is driving down the street, which is crowded with people and other vehicles.',
 'image_ids': ['LA_00_IMG_000000033471'],
 'rel_ins_ids': ['LACONV_00_INS_000000033471_0',
  'LACONV_00_INS_000000033471_1']}

In [6]: annotations['data']['LACONV_00_INS_000000033471_0']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[6], line 1
----> 1 annotations['data']['LACONV_00_INS_000000033471_0']

KeyError: 'LACONV_00_INS_000000033471_0'

In [7]: annotations['data']['LACONV_00_INS_000000033471_1']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[7], line 1
----> 1 annotations['data']['LACONV_00_INS_000000033471_1']

KeyError: 'LACONV_00_INS_000000033471_1'

Maybe I've misunderstood what related instructions are? Either way, please let me know!

Rewrite Open_Flamingo into a 🤗 huggingface model

  • configuration_flamingo.py: write the configuration file of Flamingo
  • modeling_flamingo.py: rewrite the architecture
  • load the checkpoint: rename the layers and .load_state_dict
  • processing_flamingo.py: delay, as we can use the transformers.CLIPImageProcessor() for now
  • model special loading: device_map=auto
  • model special loading: load_in_8bit=True
  • model forward: give the loss
  • model generation: text+image->model->text
  • model card writing: documentation of Flamingo-9B

A question about media_locations

I read your paper and understand that each sample you use for training will contain multiple examples, i.e., each sample will contain multiple images. The images and text will then be arranged in an interleaved form and fed into the model. I don't know if I understand correctly. And I noticed that your code only inputs one image at a time into the model, which makes me wonder

I would appreciate it if you could answer my questions

[demo] RuntimeError: std::bad_alloc

The example code is pulled from here.
Error:

(otter) user@env:~/Otter$ python test-model.py                             
                                                                                    
Using pad_token, but it is not set yet.                                            
Loading checkpoint shards: 100%|██████████| 4/4 [00:35<00:00,  8.92s/it]
Enter prompts (comma-separated): what are they doing?                              

Prompt: what are they doing?                                                       
Traceback (most recent call last):                                                 
  File "/home/user/Otter/test-model.py", line 141, in <module>                                                                                                        
    response = get_response(frames_list, prompt, model, image_processor)                                                                                              
  File "/home/user/Otter/test-model.py", line 98, in get_response                                                                                                     
    generated_text = model.generate(                                               
  File "/home/user/miniconda3/envs/otter/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context                                        
    return func(*args, **kwargs)                                                   
  File "/home/user/Otter/otter/modeling_otter.py", line 873, in generate                                                                                              
    self._encode_vision_x(vision_x=vision_x)                                       
  File "/home/user/Otter/otter/modeling_otter.py", line 831, in _encode_vision_x                                                                                      
    vision_x = self.vision_encoder(vision_x)[0][:, 1:, :]                                                                                                             
  File "/home/user/miniconda3/envs/otter/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl                                             
    return forward_call(*args, **kwargs)                                           
  File "/home/user/miniconda3/envs/otter/lib/python3.9/site-packages/transformers/models/clip/modeling_clip.py", line 940, in forward                                  
    return self.vision_model(                                                      
  File "/home/user/miniconda3/envs/otter/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl                                             
    return forward_call(*args, **kwargs)                                           
  File "/home/user/miniconda3/envs/otter/lib/python3.9/site-packages/transformers/models/clip/modeling_clip.py", line 865, in forward                                  
    hidden_states = self.embeddings(pixel_values)                                  
  File "/home/user/miniconda3/envs/otter/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl                                             
    return forward_call(*args, **kwargs)                                           
  File "/home/user/miniconda3/envs/otter/lib/python3.9/site-packages/transformers/models/clip/modeling_clip.py", line 195, in forward                                  
    patch_embeds = self.patch_embedding(pixel_values)  # shape = [*, width, grid, grid]                                                                               
  File "/home/user/miniconda3/envs/otter/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl                                             
    return forward_call(*args, **kwargs)                                           
  File "/home/user/miniconda3/envs/otter/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 463, in forward                                                   
    return self._conv_forward(input, self.weight, self.bias)                                                                                                          
  File "/home/user/miniconda3/envs/otter/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward                                             
    return F.conv2d(input, weight, bias, self.stride,                              
RuntimeError: std::bad_alloc

Not sure if this is correct, but I've used this as stated in the example (main):

    model = OtterForConditionalGeneration.from_pretrained(
        "luodian/otter-9b-dc-hf",
    )

My packages are:

(otter) user@env:~/Otter$ pip list | grep -e torch -e xformers
open-clip-torch          2.20.0
torch                    2.0.1
torchaudio               2.0.2
torchvision              0.15.2

Originally posted by @Nntsyeo in #147 (comment)

AttributeError: module transformers has no attribute TFOtterForConditionalGeneration

transformers = 4.28.0

For some reason, I couldn't access huggingface, so I use offline mode
model = OtterForConditionalGeneration.from_pretrained("/my/file/path/config.json", device_map="auto",
from_tf=True)

But the following error was encountered

The model weights are not tied. Please use the tie_weights method before using the infer_auto_device function.
Traceback (most recent call last):
File "/home/user4/cww/Otter-main/pipeline/demo/otter_image.py", line 102, in
model = OtterForConditionalGeneration.from_pretrained("/data/user4/CWW_OTTER/config.json", device_map="auto",
File "/data/anaconda3/envs/otter/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2761, in from_pretrained
model, loading_info = load_tf2_checkpoint_in_pytorch_model(
File "/data/anaconda3/envs/otter/lib/python3.9/site-packages/transformers/modeling_tf_pytorch_utils.py", line 407, in load_tf2_checkpoint_in_pytorch_model
tf_model_class = getattr(transformers, tf_model_class_name)
File "/data/anaconda3/envs/otter/lib/python3.9/site-packages/transformers/utils/import_utils.py", line 1139, in getattr
raise AttributeError(f"module {self.name} has no attribute {name}")
AttributeError: module transformers has no attribute TFOtterForConditionalGeneration

Plans on releasing multilingual model/dataset

Dear author,
Thank you for your great work. I am very interested in the multilingual ability of large language models, and it seems that the currently released model supports English only. The multilingual annotations in the dataset are also unavailable so far. Do you have any plans to release the multilingual annotations or a multilingual model?

[Model] Bug when initializing Openflamingo model

Hi, Otter is a great job. However, when initializing openflamingo model, I come across this problem below.

File "pipeline/train/instruction_following.py", line 351, in main
    model = FlamingoForConditionalGeneration.from_pretrained(
  File "/yang_yu/miniconda3/envs/otter/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2305, in from_pretrained
    config, model_kwargs = cls.config_class.from_pretrained(
  File "/yang_yu/miniconda3/envs/otter/lib/python3.9/site-packages/transformers/configuration_utils.py", line 554, in from_pretrained
    return cls.from_dict(config_dict, **kwargs)
  File "/yang_yu/miniconda3/envs/otter/lib/python3.9/site-packages/transformers/configuration_utils.py", line 701, in from_dict
    config = cls(**config_dict)
  File "/yang_yu/lq/project/Otter/flamingo/configuration_flamingo.py", line 69, in __init__
    if text_config["architectures"][0] == "MPTForCausalLM":
TypeError: 'NoneType' object is not subscriptable

Pull Request CI/CD check

Write pytest for:

  • dataset loader test
  • train test
  • demo1: chat test
  • demo2: in-context learning test
  • github action for github pull request CI/CD

Confusion regarding model versions?

Hi, with the release of the new MIMIC-IT paper, I am a bit confused about the different model versions you have. Is the model on huggingface the same as the one for which you report results in the paper? As I understand it, the results in the paper come from instruction fine-tuning OpenFlamingo on the larger MIMIC-IT dataset, whereas the one on huggingface is fine-tuned on an older, slightly smaller version?
If the two models are different, are the weights of the new model already provided on huggingface?
Further, you mention in your readme that there is a new OtterV2 model that you've released---Is this the newer model you describe in the paper? If yes, could you also provide the links to these model weights?
Thanks for your incredible contributions, and looking forward to playing around with the new Otter models!

Demo & Model Serving

  • Multiple model serving and selection
  • Generation parameters
  • Multi-GPU inference support (with HF model)
  • Data recycle

Can I use a single gpu to train this model?

Before you open an issue, please check if a similar issue already exists or has been closed before.

When you open an issue, please be sure to include the following

  • A descriptive title: Can I use a single gpu to train this model?
  • A detailed description: Thanks for your nice work. I want to train the Otter model. May I use a single GPU to train the model. Could you please share your accelerate config? Thanks!
  • Assign an issue type tag (label):
    • dataset (mimic-it download, usage, etc.),
    • demo (online demo), doc (readme, wiki, paper, video etc.),
    • evaluation (evaluation result, performance of Otter etc.),
    • model (model configuration, components, etc.),
    • train (training configuration, process, code, etc.)

Thank you for your contributions!

Questions about the size of model parameters?

Hi, I am very curious why the parameters of openflamingo-9b-hf will take up 30G, because the parameters of LLaMA-7B only take up 13G, and the perceiver and vision_encoder should not take up much space.

[demo] error encountered running Mini Demo code in https://github.com/Luodian/Otter/blob/main/docs/demo.md.

File "Otter/Otter/xformers_model/clip.py", line 77, in forward
patch_embeds = self.patch_embedding(
File "python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "lib/python3.9/site-packages/torch/nn/modules/conv.py", line 463, in forward
return self._conv_forward(input, self.weight, self.bias)
File "lib/python3.9/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: GET was unable to find an engine to execute this computation

Environment:
torch 2.0.0, python 3.9, cuda11.7,

is otter model architecture the same as openflamingo?

hi to the team! in the codebase i found that flamingo architecture and otter architecture are separated into /flamingo and /otter folder, both in huggingface interfaces.

however, after reading the otter paper i found that otter seems to have no difference in architecture from openflamingo. can this be confirmed? (otherwise i sincerely apologize for my misunderstanding).

if so, if i want to continue training flamingo, should i use modeling_flamingo.py or modeling_otter.py? or are they both ok? heartfully thanks for your kind help.

different atten-mask depending on whether xformer is installed

hi, thanks for your awesome work, which helps me a lot.

i'm trying to deploy otter locally. diving into the code, i found that otter's behavior differs depending on whether xformers is installed. precisely, when xformers is not installed, otter uses the openflamingo-format attention_mask; when installed, otter uses the default attention_mask.

i was wondering if this will introduce any performance gap between using xformers or not? and did the training process use xformers or not?

4.19-4.20 MVP Stage Assignments

Liangyu's work

  • add support for Vicuna pretrained LLM, https://github.com/lm-sys/FastChat#vicuna-weights. @liangyuch
  • support interactive chat (may init from alpaca or vicuna to better support for chat demo) @liangyuch
  • support language only chat @liangyuch
  • !!!!support in-context learning example @liangyuch
  • host collie demo on one or multiple 3090 GPU
    • run python collie_chat.py --checkpoint_path xxx

Bo's work

  • check image preprocess (image mean/variance) in (1) finetuning stage on multi-instruct (2) pretraining stage on mmc4. they need to be aligned. @Luodian
  • check trained model's ability to support (1) interactive chat (2) different instructions.
  • check special tokens are correctly aligned during training and demo evaluation.

Yuanhan's work

  • test accelerator training, or give Bo's access to a prepared environment to test accelerator on 8x32G V100.
  • prepare datasets card, introduce our multi-instruct in-context learning dataset.
  • deepspeed config, solve hard-code problem.
  • review training code;
  • Add LoRA

Data Paths

We use {azure_blob} as the mounted blob's start path

Trained Model Paths

  1. Train on GQA: {azure_blob}/models/collie9B_gqa

Helpful Discussion

  1. confirm training data for current stage
  2. setup demo and support interactive chat.
  3. fsdp, 8bit Adam, LoRA for efficient and distributed tuning.
  4. support Collie9B inference on consumer GPU
  5. enrich instructions from ChatGPT.

Documentation Improvement before our publication

Documentation

refine requirements.txt and environment.yml

  1. check the environment on 8*A100, 8*V100, 4*3090, 2*8*v100

dataset card

dataset description: size, source category, maybe a pie chart?

model card

OpenFlamingo 9B model using a CLIP ViT-Large vision encoder and a LLaMA-7B language model

Licence

LLaMA, OpenFlamingo, and our dataset follow different licences, and we need to align with or inherit some of them.

readme.md

  1. train script on 8*A100, 2*8*V100, 4*3090
  2. title change
  3. demo links
  4. introduction: why multi-modal instruction tuning
  5. an image for main architecture
  6. acknowledgement: we use OpenFlamingo, OFA, etc.
  7. Authorship

Code formatting

To improve the readability of our model code, we should

  1. black format
  2. type hint

maybe a new name? ଘ(੭ˊᵕˋ)੭

Liger: Language 🐯 + vision 🦁

current args description in `instruction_following.py` is not updated to our current training script.

We may need to update them in our next PR.

e.g.

    parser.add_argument("--use_media_placement_augmentation", action="store_true")
    parser.add_argument("--offline", action="store_true")
    parser.add_argument("--num_epochs", type=int, default=1)
    parser.add_argument("--logging_steps", type=int, default=100, help="log loss every n steps")
    # Sum of gradient optimization batch size
    parser.add_argument("--batch_size", type=int, default=128)

    parser.add_argument("--gradient_accumulation_steps", type=int, default=1)
    parser.add_argument(
        "--pretrained_model_name_or_path",
        type=str,
        help="path to huggingface model or model identifier from local path or huggingface.co",
        default=None,
    )
    parser.add_argument(
        "--load_from_original_checkpoint",
        type=str,
        help="path to openflamingo provided checkpoint, in .pt format",
        default=None,
    )
    parser.add_argument(
        "--resume_from_checkpoint",
        action="store_true",
    )
    parser.add_argument(
        "--overwrite_checkpoint",
        action="store_true",
    )
    parser.add_argument(
        "--delete_previous_checkpoint",
        action="store_true",
        help="delete previous checkpoint when saving new checkpoint",
    )
    parser.add_argument(
        "--multi_instruct_path",
        type=str,
        help="path to multi_instruct dataset, this should be a glob pattern such as vision_language_examples.tsv",
    )
    parser.add_argument(
        "--images_path",
        type=str,
        help="path to images_path dataset, this should be a glob pattern such as vision_language_examples.tsv",
    )
    parser.add_argument(
        "--train_config_path",
        type=str,
        help="path to train_config_path dataset, this should be a glob pattern such as vision_language_examples.tsv",
    )
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--learning_rate", default=1e-4, type=float)
    parser.add_argument(
        "--lr_scheduler",
        default="constant",
        type=str,
        help="constant, linear, or cosine",
    )
    parser.add_argument("--loss_multiplier_multi_instruct", type=float, default=1.0)

Computing output likelihoods with the model

Hi, is it possible to get the tokenwise log-likelihood scores of different outputs from the model?

The use-case would be something like:
Given an interleaved image/text input and a list of output text candidates, we should be able to get a score for each output candidate and then return their ranked list, rather than generating the outputs directly. This would be close to how LLMs are evaluated on MCQ tasks. An example from the T0 paper Page 6 (https://arxiv.org/pdf/2110.08207.pdf):

For tasks that involve choosing the correct completion from several options (e.g. multiple choice
question answering), we follow Brown et al. (2020) and use rank classification to evaluate our
model: we compute the log-likelihood of each of the target options under the fine-tuned model and
select the option with the highest log-likelihood as the prediction. For simplicity, we do not apply
length normalization to the log-likelihoods of the target options.

Is it straightforward to do this with Otter? I assume since the LM is built with transformers there should be a possibility to use output score functions already implemented (haven't dug into this yet)?
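One possible approach, sketched under the assumption that the underlying language model exposes standard transformers logits (the helper below is hypothetical; for Otter you would additionally pass vision_x and the attention mask through the model's forward):

# Hypothetical sketch: score a candidate answer by the sum of its token log-likelihoods.
import torch
import torch.nn.functional as F

def candidate_log_likelihood(model, tokenizer, prompt, candidate):
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + candidate, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                 # [1, seq_len, vocab]
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)   # distribution over each next token
    targets = full_ids[:, 1:]
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    candidate_len = full_ids.shape[1] - prompt_ids.shape[1]
    return token_ll[:, -candidate_len:].sum().item()    # only the candidate's tokens

# Rank classification: the candidate with the highest log-likelihood is the prediction
# (no length normalization, as in the T0 setup quoted above).
# scores = {c: candidate_log_likelihood(lm, tok, prompt, c) for c in candidates}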

questions about different instruction files in LA

hi to the team, otter seems really interesting!

when i'm investigating the released dataset, i found that the LA (in my opinion it refers to LLaVA) folder contains more than one instruction file (and their corresponding _train.json); they are

LACONV_instructions.json
LACR_I2I_instructions.json
LACR_T2T_instructions.json
LADD_instructions.json

what do these instruction prefixes actually mean?

I carefully looked into the MIMIC-IT paper; it mentions the LA-T2T task but does not reveal what it is.
From issue #149 I know they are jointly used to train OTTER-9B-LA-InContext, but I am still at sea.

i would extend sincere apology if i missed something, thanks for your patience!

ERROR: Could not build wheels for horovod, which is required to install pyproject.toml-based projects

Hello Otter team!

I have encountered an issue when installing the environment:
× python setup.py bdist_wheel did not run successfully.
│ exit code: 1

ERROR: Failed building wheel for horovod
Running setup.py clean for horovod
Failed to build horovod
ERROR: Could not build wheels for horovod, which is required to install pyproject.toml-based projects

A little bit of context: I directly ran conda env create -f environment.yml

Support 🌻 LoRA

  • A LoRA model
    If we can use peft wrapper, it will be fast and we can test multiple PEFT methods.
    If not, we will write an additional model modifying the current modeling_flamingo

Configure openai key in Syphus

I want to use Syphus on my own dataset, but there was an error when requesting the OpenAI API, and I don't know how to choose OPENAI_API_BASE, OPENAI_API_VERSION, and OPENAI_API_ENGINE. When I use 'chatgpt0301' or other models as the ENGINE, the ENGINE value cannot be accessed. When I drop the engine and use gpt-3.5-turbo as the model, the request still fails.
Can you provide some parameter examples? Thank you!

Accelerate config file

It would be helpful if you could share the Accelerate configuration required to run with 2xRTX-3090-24G GPUs.

[evaluation] MMAGIBench can not be accessed

hi to the team, otter seems really interesting!

when i was reading your MIMIC-IT paper, i noticed the otter model was evaluated on the MMAGIBench benchmark; its citation is

[43] MMAGIBench Team. Mmagibench: A universal multi-modal benchmark towards artificial general intelligence. https://github.com/open-mmlab/mmagibench, 2023. 3, 9, 10

however, the project page of MMAGIBench is not accessible, and i can hardly find more information about this benchmark other than Otter paper, could you please give some hint? thanks in advance!

Multi-image inference scripts

Hello, thanks so much for providing your amazing model. Are there plans to release a colab notebook or example python script for using Otter for in-context learning / multi-image inputs?
Thanks!

Training data examples

Thanks for sharing the amazing project! Can you provide examples of training data to help us better understand the training data format?

[Fix/Train] `image_processor` is missing in `get_mimicit_dataset`

Before you open an issue, please check if a similar issue already exists or has been closed before.

When you open an issue, please be sure to include the following

  • A descriptive title: Image Processor is missing in get_mimicit_dataset
  • A detailed description
  • Assign an issue type tag (label):
    • dataset (mimic-it download, usage, etc.),
    • demo (online demo), doc (readme, wiki, paper, video etc.),
    • evaluation (evaluation result, performance of Otter etc.),
    • model (model configuration, components, etc.),
    • train (training configuration, process, code, etc.)

Image_processor is missing in get_mimicit_dataset function.

Improving Demo

For any demo related issues, please comment in this issue.
