
tiger-ai-lab / mantis

Official code for the paper "Mantis: Multi-Image Instruction Tuning"

Home Page: https://tiger-ai-lab.github.io/Mantis/

License: Apache License 2.0

Languages: Python 87.59%, Shell 10.20%, Jupyter Notebook 2.21%
Topics: language, vision, fuyu, llava-llama3, lmm, mantis, mllm, video, vlm, multi-image-understanding

Contributors: jdf-prog, wenhuchen

mantis's Issues

Question about mantis-eval matching criteria

Hi,

Thank you for open-sourcing this great work. I appreciate the team's efforts in putting this together.

I have a question about the evaluation criteria in mantis-eval, specifically for "short-answer" questions. It looks like correctness for "short-answer" questions is judged by an exact match between the model's output and the reference answer, without further parsing (see the edit below). However, the prompt template for this question type also instructs the model to output both an analysis and a final answer.

In this case, I noticed that a model may give the correct answer (for example, "Yes") followed by some reasoning, but such an answer is not counted as correct because of how exact matching works.

Could you help me understand why it's written this way? Would it make sense to improve the matching rule? Thanks.

Edit:
I just saw that the model's output is parsed so that only the text after "Final Answer: " is kept. This makes much more sense. However, I noticed that a model sometimes answers correctly but with more than one word.
Do you think it makes sense to loosen the matching criteria? Alternatively, the instruction in the prompt template could be made clearer, for example by adding a sentence like "Answer the question in a single word or phrase."
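A minimal Python sketch of the two matching rules discussed in this thread, for comparison. It assumes the answer follows the last "Final Answer:" marker in the output; the function names (extract_final_answer, normalize, exact_match, loose_match) and the normalization (lowercasing, stripping punctuation) are illustrative assumptions, not the actual mantis-eval code.

```python
import re


def extract_final_answer(output: str, marker: str = "Final Answer:") -> str:
    """Return the text after the last `marker`, or the whole output if absent."""
    idx = output.rfind(marker)
    answer = output[idx + len(marker):] if idx != -1 else output
    return answer.strip()


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def exact_match(output: str, reference: str) -> bool:
    # Strict rule: the whole parsed answer must equal the reference.
    return normalize(extract_final_answer(output)) == normalize(reference)


def loose_match(output: str, reference: str) -> bool:
    # Looser rule (an assumption, not mantis-eval's rule): the parsed answer
    # must *start with* the reference words, so "Yes, because ..." matches "Yes".
    answer_words = normalize(extract_final_answer(output)).split()
    ref_words = normalize(reference).split()
    return bool(ref_words) and answer_words[: len(ref_words)] == ref_words


if __name__ == "__main__":
    out = "Analysis: both images show a dog.\nFinal Answer: Yes, both are dogs."
    print(exact_match(out, "Yes"))  # False
    print(loose_match(out, "Yes"))  # True
```

With the prefix rule, "Final Answer: Yes, both are dogs." counts as correct for the reference "Yes", while exact matching rejects it. The trade-off is that a prefix rule can also accept answers that start with the reference and then contradict it, so tightening the prompt template may be the safer fix.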

About the conda environment for fine-tuning

Nice work, and thanks for your contribution!

We are running instruction-tuning experiments with Mantis-8B-siglip-llama3. Pretraining and instruction fine-tuning with LoRA work fine, but full-parameter fine-tuning does not: the warning below appears and training gets stuck. I'm posting it here for others' reference.

Invalidate trace cache @ step 344: expected module 345, but got module 1

According to the referenced issue, this might be caused by how accelerate or deepspeed is installed. Since setup.py in this repo does not specify any versions, could you share the exact versions you used for fine-tuning, for dependencies such as torch, accelerate, and deepspeed?

Thanks in advance.
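For anyone reporting or comparing environments for this kind of hang, a small helper like the one below prints the installed versions of the relevant packages using the standard library's importlib.metadata. The package list is an assumption based on the dependencies named above (transformers is added as a guess); it does not reflect the versions the Mantis authors used.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


if __name__ == "__main__":
    # Packages mentioned in the question; transformers is an added assumption.
    for name, ver in installed_versions(
        ["torch", "accelerate", "deepspeed", "transformers"]
    ).items():
        # Print in requirements.txt style so the output can be pasted and pinned.
        print(f"{name}=={ver}" if ver else f"# {name}: not installed")
```

The printed lines can be pasted into an issue or a requirements file to pin an environment that is known to work.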

Idefics2 full fine-tuning fails with RuntimeError: shape mismatch

I'm working on fine-tuning Idefics2 with multiple images per instruction.
I followed this script for full fine-tuning: https://github.com/TIGER-AI-Lab/Mantis/blob/89d34077bd87b66eaadc13117add553e3a3d4c0b/mantis/train/scripts/train_idefics2_full.sh

Here is the command:

NCCL_DEBUG=WARN accelerate launch --config_file=./accelerate_configs/accelerate_config_zero3.yaml \
    --machine_rank 0 --main_process_ip 10.29.35.44 --main_process_port 12956 \
    --num_machines 1 --num_processes 8 \
    train_idefics2.py \
    --model_name_or_path HuggingFaceM4/idefics2-8b \
    --data_config_file custom_data_config.yaml \
    --data_format chat \
    --run_name 240523_idefics2_mantis \
    --output_dir 240523_idefics2_mantis \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "steps" \
    --save_strategy "steps" \
    --save_steps 200 \
    --eval_steps 200 \
    --save_total_limit 5 \
    --learning_rate 2e-5 \
    --weight_decay 0.01 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 10 \
    --gradient_checkpointing True \
    --dataloader_num_workers 5 \
    --report_to wandb \
    --do_train \
    --lora_enabled False \
    --qlora_enabled False \
    --dora_enabled False \
    --max_seq_len 512 \
    --fp16 \
    --attn_implementation eager

The error I got:

[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:   File "/home/user/Mantis/mantis/models/idefics2/modeling_idefics2.py", line 1677, in forward
[rank0]:     inputs_embeds = self.inputs_merger(
[rank0]:   File "/home/user/Mantis/mantis/models/idefics2/modeling_idefics2.py", line 1564, in inputs_merger
[rank0]:     new_inputs_embeds[special_image_token_mask] = reshaped_image_hidden_states
[rank0]: RuntimeError: shape mismatch: value tensor of shape [256, 4096] cannot be broadcast to indexing result of shape [192, 4096]

Any suggestions on how to fix it?

Thanks in advance.
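One plausible reading of this error, offered as an assumption rather than a confirmed diagnosis: if each image is encoded into a fixed number of visual tokens (64 per image is assumed here for Idefics2 without image splitting), then 256 hidden-state rows correspond to 4 images, while only 192 image-token positions (3 images' worth) remain in input_ids. That pattern is consistent with --max_seq_len 512 truncating away the placeholder tokens of the last image. The hypothetical helpers below check for this before the forward pass; image_token_id and tokens_per_image must come from your own processor and model config.

```python
from typing import Sequence


def count_image_tokens(input_ids: Sequence[int], image_token_id: int) -> int:
    """Number of image placeholder positions in a tokenized example."""
    return sum(1 for t in input_ids if t == image_token_id)


def check_image_tokens(input_ids: Sequence[int], image_token_id: int,
                       num_images: int, tokens_per_image: int) -> None:
    """Raise ValueError if the placeholders do not match the image features.

    The model scatters num_images * tokens_per_image feature rows into the
    placeholder positions, so the two counts must be equal.
    """
    found = count_image_tokens(input_ids, image_token_id)
    expected = num_images * tokens_per_image
    if found != expected:
        raise ValueError(
            f"{found} image tokens in input_ids, but {num_images} images x "
            f"{tokens_per_image} tokens = {expected}; the sequence may have "
            "been truncated by max_seq_len"
        )


def min_len_for_images(input_ids: Sequence[int], image_token_id: int) -> int:
    """Shortest max length that keeps every image placeholder (0 if none)."""
    positions = [i for i, t in enumerate(input_ids) if t == image_token_id]
    return positions[-1] + 1 if positions else 0


if __name__ == "__main__":
    IMG = 9  # hypothetical image token id
    ids = [1] + [IMG] * 4 + [2, 3] + [IMG] * 4 + [4]  # 2 images x 4 tokens
    check_image_tokens(ids, IMG, num_images=2, tokens_per_image=4)  # passes
    print(min_len_for_images(ids, IMG))  # 11
    try:
        check_image_tokens(ids[:8], IMG, num_images=2, tokens_per_image=4)
    except ValueError as err:
        print("truncated:", err)
```

If the check fails on real data, raising --max_seq_len or skipping examples whose min_len_for_images exceeds it would avoid the mismatch; whether the Mantis training script already offers such filtering is not confirmed here.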
