eric-ai-lab / minigpt-5

Official implementation of paper "MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens"

Home Page: https://eric-ai-lab.github.io/minigpt-5.github.io/

License: Apache License 2.0

Language: Python 100.00%
Topics: diffusion-models multimodal-generation multimodal-llm transformers

minigpt-5's Introduction

MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens

Kaizhi Zheng*, Xuehai He*, Xin Eric Wang

University of California, Santa Cruz

(Teaser figure)

Large Language Models (LLMs) have garnered significant attention for their advancements in natural language processing, demonstrating unparalleled prowess in text comprehension and generation. Yet, the simultaneous generation of images with coherent textual narratives remains an evolving frontier. In response, we introduce an innovative interleaved vision-and-language generation technique anchored by the concept of "generative vokens", acting as the bridge for harmonized image-text outputs. Our approach is characterized by a distinctive two-staged training strategy focusing on description-free multimodal generation, where the training requires no comprehensive descriptions of images. To bolster model integrity, classifier-free guidance is incorporated, enhancing the effectiveness of vokens on image generation. Our model, MiniGPT-5, exhibits substantial improvement over the baseline Divter model on the MMDialog dataset and consistently delivers superior or comparable multimodal outputs in human evaluations on the VIST dataset, highlighting its efficacy across diverse benchmarks.

Model Architecture

(Model architecture diagram)

Getting Started

Installation

1. Download repo and create environment

Clone our repo and create a new Python environment.

git clone https://github.com/eric-ai-lab/MiniGPT-5.git
cd MiniGPT-5
conda create -n minigpt5 python=3.9
conda activate minigpt5
pip install -r requirements.txt

2. Prepare the pretrained weights

Our model is based on the pretrained MiniGPT-4 (including Vicuna and BLIP-2). Please download the Vicuna V0 7B weights. Then, set the path to the Vicuna weights in the model config file at Line 16.

Since the pretrained MiniGPT-4 aligned checkpoint is small, we already include it in the config folder, and its path is set in the config file at Line 10.
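
For reference, the two settings look roughly like this (a minimal sketch; the key names below are assumptions based on the MiniGPT-4 config format, so follow the files linked above for the exact locations):

# model config, Line 16: point the LLM weight path at your local Vicuna V0 7B folder (key name assumed)
llama_model: "/path/to/vicuna-7b-v0/"

# config file, Line 10: path to the bundled MiniGPT-4 aligned checkpoint (key name assumed)
ckpt: "config/prerained_minigpt4_7b.pth"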

3. Download MiniGPT-5 Checkpoint

Since our model is trained in two stages (Stage 1: Unimodal Alignment Stage; Stage 2: Multimodal Learning Stage), we provide checkpoints for both stages here:

Stage 1 (CC3M): Download
Stage 2 (VIST): Download
Stage 2 (MMDialog): Download

Stage 2 needs the pretrained weights from Stage 1, so always download the Stage 1 weights first.

Please download these weights into a single folder; we will refer to this folder as WEIGHT_FOLDER in the following sections.
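
After downloading, WEIGHT_FOLDER should contain the three checkpoint files used by the commands below, for example:

ls WEIGHT_FOLDER
# stage1_cc3m.ckpt  stage2_vist.ckpt  stage2_mmdialog.ckpt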

Demo

We provide a Python script to try our model. It takes a two-turn multimodal input and generates multimodal outputs under the examples folder.

cd examples
export IS_STAGE2=True
python3 playground.py --stage1_weight WEIGHT_FOLDER/stage1_cc3m.ckpt \
                      --test_weight WEIGHT_FOLDER/stage2_vist.ckpt

Evaluation

Our model is evaluated on three datasets: CC3M, VIST, and MMDialog. Due to licensing, we only share a few dataset examples under the datasets folder. If you want to fully test the performance, please download the full datasets and format them into the same data structure under the datasets folder; the expected layout is sketched below.
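
For reference, the minimal layout assumed by the example commands in this README looks like this (only the annotation files referenced below are shown; image files and additional splits should follow the example structure shipped in the datasets folder):

datasets/
  CC3M/
    cc3m_val.tsv                # Stage 1 --test_data_path
  VIST/
    val_cleaned.json            # Stage 2 (VIST) --test_data_path
  MMDialog/
    test/
      test_conversations.txt    # Stage 2 (MMDialog) --test_data_path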

1. Stage 1: Unimodal Alignment Stage (CC3M) evaluation

During this stage, the goal is to generate the correct image given an image description.

Generation (if you have more than one GPU, you can set --gpus to 0,1,2,...):

export IS_STAGE2=False
export WEIGHTFOLDER=WEIGHT_FOLDER
export DATAFOLDER=datasets/CC3M
export OUTPUT_FOLDER=outputs
python3 train_eval.py --test_data_path cc3m_val.tsv \
                      --test_weight stage1_cc3m.ckpt \
                      --gpus 0

Calculate Metric:

export CC3M_FOLDER=datasets/CC3M
python3 metric.py --test_weight stage1_cc3m.ckpt

2. Stage 2: Multimodal Learning Stage (VIST) evaluation

The model takes the preceding multimodal story sequence and generates either unimodal or multimodal outputs. By default, the code evaluates the multimodal input & image generation setting. To test other settings, please remove the "not test" condition at Line 280.

Generation:

export IS_STAGE2=True
export WEIGHTFOLDER=WEIGHT_FOLDER
export DATAFOLDER=datasets/VIST
export OUTPUT_FOLDER=outputs
python3 train_eval.py --test_data_path val_cleaned.json \
                      --test_weight stage2_vist.ckpt \
                      --stage1_weight stage1_cc3m.ckpt \
                      --gpus 0

Calculate Metric:

python3 metric.py --test_weight stage2_vist.ckpt

3. Stage 2: Multimodal Learning Stage (MMDialog) evaluation

The model takes the multimodal inputs from previous turns and generates a multimodal response for multimodal conversations.

Generation:

export IS_STAGE2=True
export WEIGHTFOLDER=WEIGHT_FOLDER
export DATAFOLDER=datasets/MMDialog
export OUTPUT_FOLDER=outputs
python3 train_eval.py --test_data_path test/test_conversations.txt \
                      --test_weight stage2_mmdialog.ckpt \
                      --stage1_weight stage1_cc3m.ckpt \
                      --gpus 0

Calculate Metric:

python3 metric.py --test_weight stage2_mmdialog.ckpt

Training

1. Stage 1 training

Download the CC3M dataset and format it into the same data structure as the examples in the datasets folder.

Here, we use the test data as an example:

export IS_STAGE2=False
export WEIGHTFOLDER=WEIGHT_FOLDER
export DATAFOLDER=datasets/CC3M
python3 train_eval.py --is_training True \
                      --train_data_path cc3m_val.tsv \
                      --val_data_path cc3m_val.tsv \
                      --model_save_name stage1_cc3m_{epoch}-{step} \
                      --gpus 0

2. Stage 2 training

Download the VIST or MMDialog dataset and format it into the same data structure as the examples in the datasets folder.

Here, we use the VIST test data as an example:

export IS_STAGE2=True
export WEIGHTFOLDER=WEIGHT_FOLDER
export DATAFOLDER=datasets/VIST
python3 train_eval.py --is_training True \
                      --train_data_path val_cleaned.json \
                      --val_data_path val_cleaned.json \
                      --stage1_weight stage1_cc3m.ckpt \
                      --model_save_name stage2_vist_{epoch}-{step} \
                      --gpus 0

If you find MiniGPT-5 useful in your research or applications, please cite as below:

@misc{zheng2023minigpt5,
      title={MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens}, 
      author={Kaizhi Zheng and Xuehai He and Xin Eric Wang},
      year={2023},
      journal={arXiv preprint arXiv:2310.02239}
}

minigpt-5's People

Contributors

eltociear, eric-xw, guspan-tanadi, jkooy, kzzheng, shivanipalya26


minigpt-5's Issues

[BUG] Still not working: there is an error (TypeError: unsupported operand type(s) for //: 'NoneType' and 'int') when running python playground.py

Hello, there is an error (TypeError: unsupported operand type(s) for //: 'NoneType' and 'int') when running python playground.py.

Operating system: ubuntu 20.04

Python 3.9.18

Other parameters: same as MiniGPT-5/requirements.txt

All three ckpt files are located in MiniGPT-5/config. The configuration files have all been updated accordingly. The weights used are Vicuna-7B-v1.1. However, the following error still occurred.

Running "python playground.py --stage1_weight /root/MiniGPT-5/config/stage1_cc3m.ckpt --test_weight /root/MiniGPT-5/config/stage2_vist.ckpt" produced the following error:

Seed set to 42
Loading VIT
Traceback (most recent call last):
  File "/root/MiniGPT-5/examples/playground.py", line 40, in <module>
    minigpt5 = MiniGPT5_Model.load_from_checkpoint(stage1_ckpt, strict=False, map_location="cpu", encoder_model_config=model_args, **vars(training_args))
  File "/root/anaconda3/envs/minigpt5/lib/python3.9/site-packages/lightning/pytorch/core/module.py", line 1552, in load_from_checkpoint
    loaded = _load_from_checkpoint(
  File "/root/anaconda3/envs/minigpt5/lib/python3.9/site-packages/lightning/pytorch/core/saving.py", line 89, in _load_from_checkpoint
    model = _load_state(cls, checkpoint, strict=strict, **kwargs)
  File "/root/anaconda3/envs/minigpt5/lib/python3.9/site-packages/lightning/pytorch/core/saving.py", line 156, in _load_state
    obj = cls(**_cls_kwargs)
  File "/root/MiniGPT-5/model.py", line 68, in __init__
    self.model = MiniGPT5.from_config(minigpt4_config.model_cfg)
  File "/root/MiniGPT-5/minigpt4/models/mini_gpt4.py", line 247, in from_config
    model = cls(
  File "/root/MiniGPT-5/minigpt4/models/mini_gpt5.py", line 46, in __init__
    super().__init__(*args, **kwargs)
  File "/root/MiniGPT-5/minigpt4/models/mini_gpt4.py", line 53, in __init__
    self.visual_encoder, self.ln_vision = self.init_vision_encoder(
  File "/root/MiniGPT-5/minigpt4/models/blip2.py", line 65, in init_vision_encoder
    visual_encoder = create_eva_vit_g(
  File "/root/MiniGPT-5/minigpt4/models/eva_vit.py", line 416, in create_eva_vit_g
    model = VisionTransformer(
  File "/root/MiniGPT-5/minigpt4/models/eva_vit.py", line 259, in __init__
    self.patch_embed = PatchEmbed(
  File "/root/MiniGPT-5/minigpt4/models/eva_vit.py", line 190, in __init__
    num_patches = (img_size[1] // patch_size[1]) * (img_size[0] // patch_size[0])
TypeError: unsupported operand type(s) for //: 'NoneType' and 'int'

Bad text generation performance

I ran the sample and got a bad generated text sequence:

Generated text: <unk>i had everything we needed , it was going to be an excellent bar bq . [IMG2] grill made my sister happy she could finally come down for a drink with her brother and his wife as well overtime ! the food looked so good there where were all of us on this nice day in their way at last time outside what they enjoyed being partaking that del is everyone 's out again ? yes just because ... yeah but did not enjoy playing those sals today even though sure does every one has something wronged here before lunch he played these ones until you say `` are already '' oh dinner guys want another slice right now dose knows who hated having once says them when everybody wanted pizza by then saying oohthinkyouknowherbythenextimeoutsideofcourseyourselvesplaydinnerwhenfinewantedeveryonehadsthedatewhatwekeptforgettingitreadyandnowtheyreallgoingfoodaliciouslikecheesecakeburgerscanibraisedoodreamtomeverylongagainjustliketheywouldlastnightwithmeaswellhereisanotherpicturebutthistimemysisterhasabeerinhandshehope n'twondersomedaythatshessmellwill

I think the bad text may be the reason no images are generated. Any idea how to solve this problem? Thanks.

Construction of dataset

Thanks for releasing your great work!

I am a little confused about why you construct a conversation every 100 samples in the CC3M dataset:

if i%100 == 0 and not test:

Is this some kind of data augmentation strategy?

Questions w.r.t implementation details of VIST and MMDialog

Hi, I have a few questions about your implementation details, could you help answer them? Thanks~

  1. The image resolution of MMDialog. It seems that MMDialog officially releases all images at 256x256 resolution, while SD v2.1 requires training images with resolution no less than 512. How do you mitigate this gap?

  2. Access to the full VIST dataset. It seems that some images of the original VIST dataset are unavailable for download. Could you provide detailed statistics (e.g., how many images / stories you actually use) for training and evaluation?

No image generated when running examples/playground.py

Generated text: i was so excited to see everyone . [IMG2] family and the bar bbq 's , you can be in there now ! ###
but image_out is None, so I get this error:

  File "/x/MiniGPT-5/examples/playground.py", line 85, in <module>
    ax.imshow(image_out)
  File "/root/miniconda3/envs/myconda/lib/python3.9/site-packages/matplotlib/__init__.py", line 1442, in inner
    return func(ax, *map(sanitize_sequence, args), **kwargs)
  File "/root/miniconda3/envs/myconda/lib/python3.9/site-packages/matplotlib/axes/_axes.py", line 5665, in imshow
    im.set_data(X)
  File "/root/miniconda3/envs/myconda/lib/python3.9/site-packages/matplotlib/image.py", line 701, in set_data
    raise TypeError("Image data of dtype {} cannot be converted to "
TypeError: Image data of dtype object cannot be converted to float

TypeError: Unexpected type <class 'NoneType'>

Thanks for your wonderful work!
I tried to run "python3 metric.py --test_weight stage2_mmdialog.ckpt", but encountered the following error:
TypeError: Unexpected type <class 'NoneType'>
I would greatly appreciate it if you could help me solve this error.

Poor generation results (with normal text and image outputs)

Hi,
Your work will be highly cited.

When I run playground.py, it generates the following results.

(screenshots of the generated text and images attached)

The only change I made is replacing
"self.image_pipeline.to(self.device, PRECISION)"
with
"self.image_pipeline.to(self.device)"
since otherwise it leads to the error:
"TypeError: to() takes from 1 to 2 positional arguments but 3 were given"

The following is the log that I have when running the code:
Seed set to 42
Loading VIT
Loading VIT Done
Loading Q-Former
Loading Q-Former Done
Loading LLAMA
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at huggingface/transformers#24565
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████| 2/2 [00:11<00:00, 5.95s/it]Loading LLAMA Done
Load BLIP2-LLM Checkpoint: ./config/prerained_minigpt4_7b.pth
text_encoder/pytorch_model.fp16.safetensors not found
Fetching 16 files: 100%|██████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 16648.19it/s]/opt/conda/envs/minigpt5/lib/python3.9/site-packages/transformers/generation/utils.py:1270: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation )
warnings.warn(
Generated text: [IMG0] [IMG7] [IMG0] ' [IMG0] their [IMG0] ihrer to me what thet, [IMG0] leur sont ihrr a [IMG0] eux them there that they have been [IMG0] ihnen who him est her his [IMG0] lui ihm he knew. sizing of themselves__, [IMG0] quien er sein hé was_their loro manière in de seu_.
their étaient zijn làb _ [IMG0] Their's [IMG0] deren_they était à érthé ihre were on [IMG0] 的 ellos [IMG0] hers at it è your are an étant eren had died [IMG0] på [IMG0] whose ép eran cél son era thé års éeʁ� space sua∂.Picame one? [IMG0] whom she is__er их essereт у pénêtrement [IMG0] erano haar [IMG0] when der ils där careerés le témère than étéιם from fait méa Théria! [IMG0] 452 [IMG0] -toème its own fé détéré vériti and eenérném [IMG0] their están being\ερן" about you [IMG0] -ed himself déjà< for many… They hade said élőрσи в theorem을.< One might be having arrived; [IMG0] estaba involved___ thing [IMG0] made the seus mědété que être un péner rég—himselfétait býlэ , perché celuiה들αre�
100%|████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:02<00:00, 18.50it/s]/home/VD/tong/mmreasoning/MiniGPT-5/./examples/playground.py:81: UserWarning: Glyph 30340 (\N{CJK UNIFIED IDEOGRAPH-7684}) missing from current font.
plt.savefig(os.path.join(current_dir, f'test_{i}.png'), bbox_inches='tight')
/home/VD/tong/mmreasoning/MiniGPT-5/./examples/playground.py:81: UserWarning: Glyph 51012 (\N{HANGUL SYLLABLE EUL}) missing from current font.
plt.savefig(os.path.join(current_dir, f'test_{i}.png'), bbox_inches='tight')
/home/VD/tong/mmreasoning/MiniGPT-5/./examples/playground.py:81: UserWarning: Glyph 46308 (\N{HANGUL SYLLABLE DEUL}) missing from current font.
plt.savefig(os.path.join(current_dir, f'test_{i}.png'), bbox_inches='tight')
Generated text: [IMG0] [IMG7] [IMG0] and [IMG0] de was [IMG0] were their seat, [IMG0] there in the bq [IMG0] your [IMG0] she était involved [IMG0] you for him zijnb q bý is it een [IMG0] hering qu's bar estaba his été one of lui [IMG0] deren he had been a là quelle étaient its sein_\him.##
there are many měre sich dessen? [IMG0] that_er they would have è _ théirheimt_.``` [IMG0] leuré eux to be héi meme themselves._ [IMG0] those sont seu… loro erano în quel_, där_mε étant; being é méthèmeréd by Me [IMG0] Thea membre who stée être leurs car era 1 [IMG0] One Man [IMG0] There termed [IMG0] Their Party [IMG0] -à [IMG0] has [IMG0] Théเתה명�� from scene [IMG0] Étه knew about them was herself_his name [IMG0] They'd himself в this moment célhey的 scène [IMG0] their hétself à elle его fame—they�� , which made hisэρ들יσе, [IMG0] на la ép y léς on elـe pén son — sua_; me��вן i원을 suo= [IMG0] That man' story told everyone but ihn,---\the air_\θ [IMG0] whose family
100%|████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:03<00:00, 16.30it/s]
...

Is there anything I could do to improve the results?

[BUG] Hello, there was an error (TypeError: unsupported operand type(s) for //: 'NoneType' and 'int') running python playground.py

Operating system: ubuntu 20.04

Python 3.9.18

Other parameters: same as MiniGPT-5/requirements.txt

All three ckpt files are located in MiniGPT-5/config. The configuration files have been updated accordingly. The weights used are Vicuna-7B-v1.1. However, the following error still occurred.

Running "python playground.py --stage1_weight /root/MiniGPT-5/config/stage1_cc3m.ckpt --test_weight /root/MiniGPT-5/config/stage2_vist.ckpt", the following error occurred:

Seed set to 42
Loading VIT
Traceback (most recent call last):
  File "/root/MiniGPT-5/examples/playground.py", line 40, in <module>
    minigpt5 = MiniGPT5_Model.load_from_checkpoint(stage1_ckpt, strict=False, map_location="cpu", encoder_model_config=model_args, **vars(training_args))
  File "/root/anaconda3/envs/minigpt5/lib/python3.9/site-packages/lightning/pytorch/core/module.py", line 1552, in load_from_checkpoint
    loaded = _load_from_checkpoint(
  File "/root/anaconda3/envs/minigpt5/lib/python3.9/site-packages/lightning/pytorch/core/saving.py", line 89, in _load_from_checkpoint
    model = _load_state(cls, checkpoint, strict=strict, **kwargs)
  File "/root/anaconda3/envs/minigpt5/lib/python3.9/site-packages/lightning/pytorch/core/saving.py", line 156, in _load_state
    obj = cls(**_cls_kwargs)
  File "/root/MiniGPT-5/model.py", line 68, in __init__
    self.model = MiniGPT5.from_config(minigpt4_config.model_cfg)
  File "/root/MiniGPT-5/minigpt4/models/mini_gpt4.py", line 247, in from_config
    model = cls(
  File "/root/MiniGPT-5/minigpt4/models/mini_gpt5.py", line 46, in __init__
    super().__init__(*args, **kwargs)
  File "/root/MiniGPT-5/minigpt4/models/mini_gpt4.py", line 53, in __init__
    self.visual_encoder, self.ln_vision = self.init_vision_encoder(
  File "/root/MiniGPT-5/minigpt4/models/blip2.py", line 65, in init_vision_encoder
    visual_encoder = create_eva_vit_g(
  File "/root/MiniGPT-5/minigpt4/models/eva_vit.py", line 416, in create_eva_vit_g
    model = VisionTransformer(
  File "/root/MiniGPT-5/minigpt4/models/eva_vit.py", line 259, in __init__
    self.patch_embed = PatchEmbed(
  File "/root/MiniGPT-5/minigpt4/models/eva_vit.py", line 190, in __init__
    num_patches = (img_size[1] // patch_size[1]) * (img_size[0] // patch_size[0])
TypeError: unsupported operand type(s) for //: 'NoneType' and 'int'

How to use lm loss for voken during training?

As stated in the paper "During the training, we append the vokens to the positions of ground truth images and train the model to predict vokens within text generation"

So in the interleaved image and text training data, should each image be replaced with the form [img1], [img2], ... [img8]?

For a sample of the following form:
[s][text1] [text2] [text3] [img1], [img2], ... [img8]

Should the label for calculating L_text then be of the following form?
[text2] [text3] [img1], [img2], ... [img8][/s]

And would there be two problems:
1. All images are represented by vokens arranged in the same way.
2. The large model is forced to generate the [img] tokens sequentially.
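
For illustration only (not the repository's actual code), one minimal way to build the interleaved targets described in the quoted sentence could look like the sketch below; the function, token ids, and the id range used for [IMG0]..[IMG7] are hypothetical:

# Hypothetical sketch: append the eight voken ids at each ground-truth image position
# so the LM loss also supervises voken prediction alongside ordinary text tokens.
def build_interleaved_targets(text_ids, image_positions, voken_ids):
    """text_ids: token ids of the text; image_positions: indices in text_ids after which
    an image occurs; voken_ids: ids of [IMG0]..[IMG7]. Returns (input_ids, labels)."""
    input_ids, labels = [], []
    for i, tok in enumerate(text_ids):
        input_ids.append(tok)
        labels.append(tok)               # text tokens, supervised by L_text
        if i in image_positions:
            input_ids.extend(voken_ids)  # vokens sit where the ground-truth image is
            labels.extend(voken_ids)     # and are predicted like any other token
    return input_ids, labels

# Example: three text tokens followed by one image -> eight voken ids appended.
ids, labels = build_interleaved_targets([101, 102, 103], {2}, list(range(32000, 32008)))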

Special token not generated when running playground.py in the demo

When I'm trying to run the demo, it says the following:

python3 playground.py --stage1_weight ../WEIGHT_FOLDER/stage1_cc3m.ckpt --test_weight ../WEIGHT_FOLDER/stage2_vist.ckpt
/home/Intern1/anaconda3/envs/minigpt5/lib/python3.9/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
Seed set to 42
/home/Intern1/anaconda3/envs/minigpt5/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(
Loading VIT
Loading VIT Done
Loading Q-Former
Loading Q-Former Done
Loading LLAMA
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at huggingface/transformers#24565
Loading checkpoint shards: 100%|█| 2/2 [01:19<00:00, 39.9
Loading LLAMA Done
Load BLIP2-LLM Checkpoint: ../config/prerained_minigpt4_7b.pth
/home/Intern1/anaconda3/envs/minigpt5/lib/python3.9/site-packages/torch/nn/modules/transformer.py:306: UserWarning: enable_nested_tensor is True, but self.use_nested_tensor is False because encoder_layer.norm_first was True
warnings.warn(f"enable_nested_tensor is True, but self.use_nested_tensor is False because {why_not_sparsity_fast_path}")
Loading pipeline components...: 0%| | 0/6 [00:00<?, ?it/home/Intern1/anaconda3/envs/minigpt5/lib/python3.9/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
Loading pipeline components...: 100%|█| 6/6 [00:00<00:00,
Traceback (most recent call last):
  File "/home/Intern1/Yun_SRP/MiniGPT-5/examples/playground.py", line 76, in <module>
    ax.imshow(image_out)
  File "/home/Intern1/anaconda3/envs/minigpt5/lib/python3.9/site-packages/matplotlib/__init__.py", line 1442, in inner
    return func(ax, *map(sanitize_sequence, args), **kwargs)
  File "/home/Intern1/anaconda3/envs/minigpt5/lib/python3.9/site-packages/matplotlib/axes/_axes.py", line 5665, in imshow
    im.set_data(X)
  File "/home/Intern1/anaconda3/envs/minigpt5/lib/python3.9/site-packages/matplotlib/image.py", line 701, in set_data
    raise TypeError("Image data of dtype {} cannot be converted to "
TypeError: Image data of dtype object cannot be converted to float

Help with finetuning

If I want to finetune the model with my own dataset, which files should I modify? And how can I make the model generate multiple images in one turn?

Bad performance on example

Hello, I tried to load the model and run the example, but the generated text seems wrong.

Below is the generated text:

Generated text: <unk> [IMG0] [IMG7] [IMG0] that [IMG0] their [IMG0] ihrer there are fait leurt, [IMG0] they sont ihrréms [IMG0] ihre lui eran in Mame
## the barbec [IMG0] his été herem'rs."> [IMG0] их themselves_\thereir is their sein_mâ [IMG0] loro were him its\ seu_. ###

Besides, the generated image also looks weird, though better than the text.

I ran this code in a Jupyter notebook and it reports no errors. Do you have any idea what causes this?

Thanks.

Should input_embeddings and out_embeddings be updated in Stage2?

Hi! Thanks for your interesting work!

I just found that when using LoRA, the input_embeddings and out_embeddings are not updated because of the following code:

self.llama_model.base_model.model.model.embed_tokens.original_module.weight.requires_grad = False
self.llama_model.base_model.model.lm_head.original_module.weight.requires_grad = False

Considering that LoRA is used for Stage 2, does this mean the input_embeddings and out_embeddings are only updated in Stage 1? If so, the two lines are redundant, since PEFT already sets them as not trainable.
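
As a quick sanity check (a generic PyTorch sketch, not project code), you can print whether the embedding and lm_head weights are actually trainable after PEFT wraps the model:

for name, param in self.llama_model.named_parameters():
    if "embed_tokens" in name or "lm_head" in name:
        print(name, param.requires_grad)  # expect False for the original_module weights in Stage 2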

STAGE 1 TRAINING

Thank you for your great work!
In the Stage 1 training mentioned in the paper, is the input to the LLM images and text? The description "After the pretraining stage, the model is capable of generating images for single text descriptions" in the article suggests a unimodal input.
Also, what does "The Language Model (LLM) is then tasked with only generating these placeholders for text creation" mean? Does the Stage 1 LLM only need to generate the placeholders, without other token embeddings?

A question about inference: no "output_img_token" generated (playground.py)

Regarding the inference results: in my experiment, only [IMG2] appeared in the generated text, but not [IMG0]. I wonder whether this is normal, because [IMG0] is the "output_img_id", and without it you cannot enter the image generation branch. Looking forward to your answer, thank you.
(screenshots attached)

About the maximum length

Hi Young-Jin,

What did you set the model's maximum length to? When I set it to 1024, it caused a CUDA out-of-memory issue.

Best regards,
Young-Jin

OSError: Can't load tokenizer for 'bert-base-uncased'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'bert-base-uncased' is the correct path to a directory containing all relevant files for a BertTokenizer tokenizer.

I encountered an error while running playground.py. Can you help me check where the problem is? Thank you very much!
(error screenshot attached)

I have already set up the environment according to the steps (screenshot attached).

The files needed for download are also prepared:
vicuna-7b (screenshot attached)
Checkpoint (screenshot attached)

The file paths have also been configured (screenshots attached).

pip list:
accelerate 0.27.2
aiofiles 23.2.1
aiohttp 3.8.4
aiosignal 1.3.1
altair 5.2.0
antlr4-python3-runtime 4.9.3
anyio 4.3.0
appdirs 1.4.4
argon2-cffi 23.1.0
argon2-cffi-bindings 21.2.0
arrow 1.3.0
asttokens 2.4.1
async-lru 2.0.4
async-timeout 4.0.2
attrs 22.2.0
Babel 2.14.0
beautifulsoup4 4.12.3
bleach 6.1.0
blis 0.7.11
braceexpand 0.1.7
catalogue 2.0.10
cchardet 2.1.7
certifi 2024.2.2
cffi 1.16.0
chardet 5.1.0
charset-normalizer 3.3.2
click 8.1.7
comm 0.2.1
confection 0.1.4
contourpy 1.0.7
cycler 0.11.0
cymem 2.0.8
debugpy 1.8.1
decorator 5.1.1
decord 0.6.0
defusedxml 0.7.1
diffusers 0.21.4
docker-pycreds 0.4.0
exceptiongroup 1.2.0
executing 2.0.1
fastapi 0.109.2
fastjsonschema 2.19.1
ffmpy 0.3.2
filelock 3.9.0
fonttools 4.38.0
fqdn 1.5.1
frozenlist 1.3.3
fsspec 2024.2.0
ftfy 6.1.3
gitdb 4.0.11
GitPython 3.1.42
gradio 3.24.1
gradio_client 0.0.8
h11 0.14.0
httpcore 1.0.3
httpx 0.26.0
huggingface-hub 0.20.3
idna 3.6
importlib-metadata 7.0.1
importlib-resources 5.12.0
iopath 0.1.10
ipykernel 6.29.2
ipython 8.18.1
isoduration 20.11.0
jedi 0.19.1
Jinja2 3.1.3
joblib 1.3.2
json5 0.9.17
jsonpointer 2.4
jsonschema 4.21.1
jsonschema-specifications 2023.12.1
jupyter_client 8.6.0
jupyter_core 5.7.1
jupyter-events 0.9.0
jupyter-lsp 2.2.2
jupyter_server 2.12.5
jupyter_server_terminals 0.5.2
jupyterlab 4.1.2
jupyterlab_pygments 0.3.0
jupyterlab_server 2.25.3
kiwisolver 1.4.4
langcodes 3.3.0
lightning 2.2.0.post0
lightning-utilities 0.10.1
linkify-it-py 2.0.3
llvmlite 0.42.0
markdown-it-py 2.2.0
MarkupSafe 2.1.5
matplotlib 3.7.0
matplotlib-inline 0.1.6
mdit-py-plugins 0.3.3
mdurl 0.1.2
mistune 3.0.2
mpmath 1.3.0
multidict 6.0.4
murmurhash 1.0.10
nbclient 0.9.0
nbconvert 7.16.1
nbformat 5.9.2
nest-asyncio 1.6.0
networkx 3.2.1
nltk 3.8.1
notebook 7.1.0
notebook_shim 0.2.4
numba 0.59.0
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.19.3
nvidia-nvjitlink-cu12 12.3.101
nvidia-nvtx-cu12 12.1.105
omegaconf 2.3.0
open-clip-torch 2.24.0
opencv-python 4.9.0.80
orjson 3.9.14
overrides 7.7.0
packaging 23.0
pandas 2.2.0
pandocfilters 1.5.1
parso 0.8.3
pathlib_abc 0.1.1
pathy 0.11.0
peft 0.8.2
pexpect 4.9.0
pillow 10.2.0
pip 23.3.1
platformdirs 4.2.0
portalocker 2.8.2
preshed 3.0.9
prometheus_client 0.20.0
prompt-toolkit 3.0.43
protobuf 4.25.3
psutil 5.9.4
ptyprocess 0.7.0
pure-eval 0.2.2
pycocoevalcap 1.2
pycocotools 2.0.6
pycparser 2.21
pydantic 1.10.14
pydub 0.25.1
Pygments 2.17.2
pynndescent 0.5.11
pyparsing 3.0.9
python-dateutil 2.8.2
python-json-logger 2.0.7
python-multipart 0.0.9
pytorch-fid 0.3.0
pytorch-lightning 2.2.0.post0
pytz 2024.1
PyYAML 6.0
pyzmq 25.1.2
referencing 0.33.0
regex 2022.10.31
requests 2.31.0
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
rouge 1.0.1
rpds-py 0.18.0
safetensors 0.4.2
scikit-learn 1.4.1.post1
scipy 1.12.0
semantic-version 2.10.0
Send2Trash 1.8.2
sentence-transformers 2.2.2
sentencepiece 0.2.0
sentry-sdk 1.40.5
setproctitle 1.3.3
setuptools 68.2.2
six 1.16.0
smart-open 6.4.0
smmap 5.0.1
sniffio 1.3.0
soupsieve 2.5
spacy 3.5.1
spacy-legacy 3.0.12
spacy-loggers 1.0.5
srsly 2.4.8
stack-data 0.6.3
starlette 0.36.3
sympy 1.12
tenacity 8.2.2
terminado 0.18.0
thinc 8.1.12
threadpoolctl 3.3.0
timm 0.6.13
tinycss2 1.2.1
tokenizers 0.13.3
tomli 2.0.1
toolz 0.12.1
torch 2.2.0
torch-fidelity 0.3.0
torchmetrics 1.3.1
torchvision 0.17.0
tornado 6.4
tqdm 4.64.1
traitlets 5.14.1
transformers 4.31.0
triton 2.2.0
typer 0.7.0
types-python-dateutil 2.8.19.20240106
typing_extensions 4.9.0
tzdata 2024.1
uc-micro-py 1.0.3
umap-learn 0.5.5
uri-template 1.3.0
urllib3 2.2.1
uvicorn 0.27.1
wandb 0.16.3
wasabi 1.1.2
wcwidth 0.2.13
webcolors 1.13
webdataset 0.2.48
webencodings 0.5.1
websocket-client 1.7.0
websockets 12.0
wheel 0.41.2
xformers 0.0.24
yarl 1.8.2
zipp 3.14.0

ABOUT GPU RAM

I run the playground.py file with 16 GB of GPU RAM, but it's still not enough. Can you suggest some ways to run it without needing 16 GB of GPU RAM?

Problems about the implementation

Thank you very much for sharing such fascinating work. I've encountered some questions regarding implementation details while going through the code. I hope to get your insights.

  1. During the training procedure, are the generation targets of an image "[IMG0][IMG1][IMG3]...[IMG7]"? However, I found that the model only generates [IMG0] during the evaluation procedure. Did I make any mistake?
  2. I have another question about the "hidden states output" of the generative model, at Line 307-135 of model.py:

    predicted_images_ft = None
    if len(special_token_index):
        idx = special_token_index[0,0]
        t2i_input_embedding = last_hidden_state[idx][-1]
        assert t2i_input_embedding.shape[1] == self.img_token_num # HERE for the reference
        img0_output_feature = last_hidden_state[idx-1][-1][:, -1:]
        t2i_input_embedding = torch.cat([img0_output_feature, t2i_input_embedding[:, :-1]], dim=1)
        t2i_input_embedding = self.fc(t2i_input_embedding)
        mapping_feature = self.llm_to_t2i_mapping(src=t2i_input_embedding, tgt=self.t2i_decoder_prompt)

    I wonder why last_hidden_state[idx][-1] (the position of [IMG0]) always has a shape of [1, 8, 4096]?

About the stage 1 pretraining

When I tried to reproduce the Stage 1 experiment, I found that the image loss (LDM, mentioned in the paper) couldn't converge. So I wonder if there is any trick I missed. I also noticed that you use the SNR loss as the default setting for the image loss, which isn't mentioned in the paper. Looking forward to your reply.

Acknowledgement on LAVIS

Hi authors,

It seems the codebase has overlaps with LAVIS. In fact, MiniGPT-4 is largely built on our library.

Would you mind updating the arXiv manuscript and the README with a reference to the LAVIS library?

Thanks.

Enhancement about SDXL / LLaMA-2

Hi author,
Thanks for your impressive work! Are you planning to pre-train the model with stronger submodules, like LLaMA-2 and SDXL?

vision encoder used in MiniGPT-5

Nice work!
We are developing a benchmark for evaluating MLLMs and would like to report MiniGPT-5 on it.
Which vision encoder is used in MiniGPT-5? The same as MiniGPT-4 (EVA-G or something similar)? 😆

More pretrained models

Interesting work! I want to test the other settings in Table 1, including No Context, Text Context, and Image Context. Can you share these trained models?

Question about garbled text

Thank you for your fascinating work. I have a question regarding our experience using MiniGPT-5 to train on our dataset. We've noticed that the output contains garbled text (as shown in the figure below). Could you please advise on what might be causing this issue?
(screenshot of the garbled output attached)

About image comprehension task

Firstly, thank you for your contributions to multi-modal large language model (MLLM) research with MiniGPT-5. I'm experiencing an issue while testing the model's image comprehension capabilities.

Issue Description:
The model consistently generates meaningless images and text when provided with an image input.

Reproduction Steps:

  1. Running playground.py with the same example produces expected outputs.
  2. Text-only inputs result in reasonable responses. For example:
    • Text Input: "###Human: Can you tell me a joke? ###Assistant:"
    • Generated Text: "Sure! What did the snowman say to his wife? Can we go in circles around a little longer, hon?) ###"
  3. However, with image inputs, the responses are not meaningful. Here is an example:
    • Text Input: "Give the following images in ImageContent format. You will be able to see the images once I provide it to you. ###Human: Can you describe the imageImageContent? ###Assistant:"
    • Image used: (attached)
    • Generated output: Text: "yes i can [IMG0] ###"; Image: (attached)

In other cases, the model refuses to generate any useful text output and produces only meaningless text.

Given that MiniGPT-5 builds upon MiniGPT-4, which handled similar tasks effectively, I am curious about your insights on this issue. Have you encountered or tested this scenario during development?

Thank you for your time and assistance.

The value of hidden_states at line 689 of modeling_llama.py is NaN!!

Hello, when I ran your program, I found that the value of hidden_states at line 689 of modeling_llama.py is NaN!

I triggered this bug when running playground.py. I used the official parameters for all settings, except that, because my GPU memory is only 24 GB, I set low_resource=True in the MiniGPT-4 config.
