tiger-ai-lab / mantis


121.0 8.0 9.0 77.67 MB

Official code for Paper "Mantis: Multi-Image Instruction Tuning"

Home Page: https://tiger-ai-lab.github.io/Mantis/

License: Apache License 2.0

Python 87.59% Shell 10.20% Jupyter Notebook 2.21%

language vision fuyu llava-llama3 lmm mantis mllm video vlm multi-image-understanding

mantis's People


awayfromkeyboardwarrior scott-mao techthiyanes scottsuk0306 mqianliu whitefu xiechengmude chris-tng

mantis's Issues

Support Chinese?

Excellent work! BTW, does the model support Chinese?

Nice work! Does Mantis use an image separator when sending images to the LLM?

Hi, I want to ask: does Mantis use an image separator between images when sending them to the LLM? From what I can tell, LLaVA doesn't have one, and the data used in Mantis doesn't provide a separator string either.

Also, which way do you think is better, especially if video frames are considered as input as well?

Question about mantis-eval matching criteria

Hi,

Thank you for open-sourcing this great work. I appreciate the team's efforts in putting this together.

I have a question about the evaluation criteria in mantis-eval, specifically for "short-answer" questions. It looks like the correctness of a "short-answer" response is judged by an exact match between the model's output and the reference answer, ~~without further parsing~~ (see the edit below). But the prompt template for this type of question also instructs the model to output both an analysis and a final answer.

In this case, I noticed that a model would give the correct answer (for example, "Yes") followed by some reasoning, but such an answer wouldn't be counted as correct because of how the exact match works.

Could you help me understand why it's written like this? Does it make sense to improve the matching rule? Thanks.

Edit:
I just saw that there is parsing on the model's output that only takes the outputs after "Final Answer: ". This makes much more sense. However, I noticed that sometimes a model would answer correctly but with more than one word.
Do you think it makes sense to loosen the matching criteria? Alternatively, it would also make sense to make the instruction in the prompt template clearer, for example by adding one more sentence like "Answer the question in a single word or phrase."
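
The two behaviors discussed above (taking only the text after "Final Answer: " and then matching it strictly or loosely) could be sketched roughly as follows. This is a minimal illustration, not the actual mantis-eval code: the function names `extract_final_answer`, `exact_match` and `loose_match` are hypothetical, and the normalization (lowercasing, stripping punctuation) is an assumption.

```python
import string
from typing import List

FINAL_ANSWER_MARKER = "Final Answer:"  # marker mentioned in the issue


def extract_final_answer(output: str) -> str:
    """Return the text after the last marker, or the whole output if it is absent."""
    _, found, tail = output.rpartition(FINAL_ANSWER_MARKER)
    return tail.strip() if found else output.strip()


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation, split into words (assumed normalization)."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def exact_match(output: str, reference: str) -> bool:
    """Strict rule: the parsed answer must equal the reference after normalization."""
    return normalize(extract_final_answer(output)) == normalize(reference)


def loose_match(output: str, reference: str) -> bool:
    """Loosened rule: the reference words appear as a contiguous run in the answer."""
    answer = normalize(extract_final_answer(output))
    ref = normalize(reference)
    if not ref:
        return False
    return any(answer[i:i + len(ref)] == ref
               for i in range(len(answer) - len(ref) + 1))


output = "Analysis: both images show the same cat.\nFinal Answer: Yes, they match."
print(exact_match(output, "Yes"))  # False: the answer has more than one word
print(loose_match(output, "Yes"))  # True: "yes" occurs in the parsed answer
```

A loosened rule like this trades missed answers for possible false positives (an answer such as "yes or no" would also match "Yes"), so tightening the prompt instruction may be the safer of the two options.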

about conda env for finetuning

Nice work! Thanks for the contribution.

We are carrying out instruction tuning experiments with Mantis-8B-siglip-llama3. Pretraining and instruction fine-tuning with LoRA work fine, but full-parameter fine-tuning does not: the warning below came up and fine-tuning got stuck. I am posting it here for others' reference.

Invalidate trace cache @ step 344: expected module 345, but got module 1

Referring to the issue, this might be due to how accelerate or deepspeed is installed. Since setup.py in this repo has no version specifications, could you share the exact versions you use for fine-tuning for dependencies such as torch, accelerate, and deepspeed?

Thanks in advance.
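
For anyone comparing environments, a small script like the one below can report the installed versions of the packages named above. It is only a sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` is hypothetical, and the package list is just the three dependencies mentioned in this issue.

```python
from importlib import metadata

# Packages named in the issue; extend the tuple as needed.
PACKAGES = ("torch", "accelerate", "deepspeed")


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```

Pasting this output into a report makes it easier to tell whether a hang comes from a version mismatch or from the training configuration.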

I have a question about the evaluation of Q-BENCH dataset

Why not use q-bench2-a1-pair-test.json for q-bench2?

Idefics2 full fine-tuning getting RuntimeError: shape mismatch

I'm working on fine-tuning Idefics2 with multiple images in the instruction.
I followed this script for full fine-tuning: https://github.com/TIGER-AI-Lab/Mantis/blob/89d34077bd87b66eaadc13117add553e3a3d4c0b/mantis/train/scripts/train_idefics2_full.sh

Here is the command:

NCCL_DEBUG=WARN accelerate launch --config_file=./accelerate_configs/accelerate_config_zero3.yaml \
    --machine_rank 0 --main_process_ip 10.29.35.44 --main_process_port 12956 \
    --num_machines 1 --num_processes 8 \
    train_idefics2.py \
    --model_name_or_path HuggingFaceM4/idefics2-8b \
    --data_config_file custom_data_config.yaml \
    --data_format chat \
    --run_name 240523_idefics2_mantis \
    --output_dir 240523_idefics2_mantis \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "steps" \
    --save_strategy "steps" \
    --save_steps 200 \
    --eval_steps 200 \
    --save_total_limit 5 \
    --learning_rate 2e-5 \
    --weight_decay 0.01 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 10 \
    --gradient_checkpointing True \
    --dataloader_num_workers 5 \
    --report_to wandb \
    --do_train \
    --lora_enabled False \
    --qlora_enabled False \
    --dora_enabled False \
    --max_seq_len 512 \
    --fp16 \
    --attn_implementation eager

The error I got is:

[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:   File "/home/user/Mantis/mantis/models/idefics2/modeling_idefics2.py", line 1677, in forward
[rank0]:     inputs_embeds = self.inputs_merger(
[rank0]:   File "/home/user/Mantis/mantis/models/idefics2/modeling_idefics2.py", line 1564, in inputs_merger
[rank0]:     new_inputs_embeds[special_image_token_mask] = reshaped_image_hidden_states
[rank0]: RuntimeError: shape mismatch: value tensor of shape [256, 4096] cannot be broadcast to indexing result of shape [192, 4096]

Any suggestions on how to fix it?

Thanks in advance
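
The shapes in the traceback say that 256 rows of image features are being written into only 192 image-placeholder positions. One possible cause, not confirmed in this thread, is that truncation to `--max_seq_len 512` cuts off some placeholder tokens while the image features stay complete. The sketch below reproduces that situation in plain Python. The token ids, the helper names `check_image_tokens` and `truncate`, and the figure of 64 placeholder tokens per image are all assumptions for illustration (64 is merely consistent with 256 = 4 × 64 and 192 = 3 × 64).

```python
IMAGE_TOKEN_ID = 32001  # hypothetical placeholder id, for illustration only
TEXT_TOKEN_ID = 1       # hypothetical id standing in for any text token
TOKENS_PER_IMAGE = 64   # assumption: consistent with 256 = 4 * 64, 192 = 3 * 64


def check_image_tokens(input_ids, num_images,
                       image_token_id=IMAGE_TOKEN_ID,
                       tokens_per_image=TOKENS_PER_IMAGE):
    """Return (found, expected) counts of image placeholder tokens.

    Merging image features into the text embeddings can only work
    when the two numbers are equal.
    """
    found = sum(1 for tok in input_ids if tok == image_token_id)
    expected = num_images * tokens_per_image
    return found, expected


def truncate(input_ids, max_seq_len):
    """Naive right-side truncation of a token sequence."""
    return input_ids[:max_seq_len]


# Four turns, each with 100 text tokens followed by one image (64 placeholders).
sample = []
for _ in range(4):
    sample += [TEXT_TOKEN_ID] * 100 + [IMAGE_TOKEN_ID] * TOKENS_PER_IMAGE

found, expected = check_image_tokens(truncate(sample, 512), num_images=4)
print(found, expected)  # prints "192 256": the fourth image lost its placeholders
```

If a check like this fails on real samples, raising the sequence length or filtering out over-long samples would be the first things to try.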

Missing FUYU device_map fix for multi-GPU setups

Issue Description

The FUYU model implementation in this repository currently lacks support for multi-GPU setups. This issue has already been addressed and fixed in Hugging Face's transformers repository.

Relevant Pull Request

Here's the link to the PR in the Hugging Face repository: huggingface/transformers#29880
