Code and models for NExT-GPT: Any-to-Any Multimodal Large Language Model

Home Page: https://next-gpt.github.io/

License: BSD 3-Clause "New" or "Revised" License

Python 99.92% Shell 0.08%
chatgpt foundation-models gpt-4 instruction-tuning large-language-models llm multi-modal-chatgpt multimodal visual-language-learning

next-gpt's Introduction

NExT-GPT: Any-to-Any Multimodal LLM

Shengqiong Wu, Hao Fei*, Leigang Qu, Wei Ji, and Tat-Seng Chua. (*Correspondence)

NExT++, School of Computing, National University of Singapore



This repository hosts the code, data, and model weights of NExT-GPT, the first end-to-end MM-LLM that perceives input and generates output in arbitrary combinations (any-to-any) of text, image, video, audio, and beyond.


🎉 News

  • [2023.09.15] 🚀🚀 Release the code of NExT-GPT in version 7b_tiva_v0.
  • [2023.09.27] 🔨🧩 Added modality-blended batch sampler.
  • [2023.10.01] 📢📢 Release the T2M instruction dataset.
  • [2023.10.04] 👏👏 Release the checkpoint of NExT-GPT in version 7b_tiva_v0.
  • [2023.10.15] 🔨🚀 Update of NExT-GPT in version 7b_tiva_v0.

👉 TODO

  • Release MosIT data.
  • Updating NExT-GPT with more types & sizes of LLMs.
  • Empowering NExT-GPT with more modalities of inputs & outputs.
  • ...

Example Demos

Here we showcase examples generated from NExT-GPT. For more examples, kindly visit the webpage, or the online live demo.

example_5_Trim.mp4
example_6_Trim.mp4
example_9_Trim.mp4

Brief Introduction

NExT-GPT is built on top of an existing pre-trained LLM, multimodal encoders, and SoTA diffusion models, with sufficient end-to-end instruction tuning.


  • Multimodal Encoding Stage. Established encoders encode inputs in various modalities, and a projection layer maps these representations into language-like representations that the LLM can comprehend.
  • LLM Understanding and Reasoning Stage. An existing open-source LLM serves as the core that processes the input for semantic understanding and reasoning. The LLM not only generates text tokens directly but also produces unique “modality signal” tokens, which instruct the decoding layers whether to output multimodal content and, if so, which content.
  • Multimodal Generation Stage. On receiving the multimodal signals with specific instructions from the LLM (if any), the Transformer-based output projection layers map the signal token representations into representations that the downstream multimodal decoders can understand.

For more technical details, kindly refer to the paper.


Getting Started


1. Code Structure

├── figures
├── data
│   ├── T-X_pair_data
│   │   ├── audiocap                      # text-audio pairs data
│   │   │   ├── audios                    # audio files
│   │   │   └── audiocap.json             # the audio captions
│   │   ├── cc3m                          # text-image pairs data
│   │   │   ├── images                    # image files
│   │   │   └── cc3m.json                 # the image captions
│   │   └── webvid                        # text-video pairs data
│   │   │   ├── videos                    # video files
│   │   │   └── webvid.json               # the video captions
│   ├── IT_data                           # instruction data
│   │   ├── T+X-T_data                    # text+[image/audio/video] to text instruction data
│   │   │   ├── alpaca                    # textual instruction data
│   │   │   ├── llava                     # visual instruction data
│   │   ├── T-T+X                         # synthesized text to text+[image/audio/video] instruction data
│   │   └── MosIT                         # Modality-switching Instruction Tuning instruction data
├── code
│   ├── config
│   │   ├── base.yaml                     # the model configuration
│   │   ├── stage_1.yaml                  # enc-side alignment training configuration
│   │   ├── stage_2.yaml                  # dec-side alignment training configuration
│   │   └── stage_3.yaml                  # instruction-tuning configuration
│   ├── dsconfig
│   │   ├── stage_1.json                  # deepspeed configuration for enc-side alignment training
│   │   ├── stage_2.json                  # deepspeed configuration for dec-side alignment training
│   │   └── stage_3.json                  # deepspeed configuration for instruction-tuning training
│   ├── dataset
│   │   ├── base_dataset.py
│   │   ├── catalog.py                    # the catalog information of the dataset
│   │   ├── cc3m_dataset.py               # process and load text-image pair dataset
│   │   ├── audiocap_dataset.py           # process and load text-audio pair dataset
│   │   ├── webvid_dataset.py             # process and load text-video pair dataset
│   │   ├── T+X-T_instruction_dataset.py  # process and load text+x-to-text instruction dataset
│   │   ├── T-T+X_instruction_dataset.py  # process and load text-to-text+x instruction dataset
│   │   └── concat_dataset.py             # process and load multiple datasets
│   ├── model
│   │   ├── ImageBind                     # the code from ImageBind Model
│   │   ├── common
│   │   ├── anyToImageVideoAudio.py       # the main model file
│   │   ├── agent.py
│   │   ├── modeling_llama.py
│   │   ├── custom_ad.py                  # the audio diffusion
│   │   ├── custom_sd.py                  # the image diffusion
│   │   ├── custom_vd.py                  # the video diffusion
│   │   ├── layers.py                     # the output projection layers
│   │   └── ...
│   ├── scripts
│   │   ├── train.sh                      # training NExT-GPT script
│   │   └── app.sh                        # deploying demo script
│   ├── header.py
│   ├── process_embeddings.py             # precompute the caption embeddings
│   ├── train.py                          # training
│   ├── inference.py                      # inference
│   ├── demo_app.py                       # deploy Gradio demonstration
│   └── ...
├── ckpt
│   ├── delta_ckpt                        # tunable NExT-GPT params
│   │   ├── nextgpt
│   │   │   ├── 7b_tiva_v0                # the directory to save the log file
│   │   │   │   ├── log                   # the logs
│   └── ...
│   ├── pretrained_ckpt                   # frozen params of pretrained modules
│   │   ├── imagebind_ckpt
│   │   │   ├── huge                      # version
│   │   │   │   └── imagebind_huge.pth
│   │   ├── vicuna_ckpt
│   │   │   ├── 7b_v0                     # version
│   │   │   │   ├── config.json
│   │   │   │   ├── pytorch_model-00001-of-00002.bin
│   │   │   │   ├── tokenizer.model
│   │   │   │   └── ...
├── LICENCE.md
├── README.md
└── requirements.txt

2. Environment Preparation

Please first clone the repo and install the required environment, which can be done by running the following commands:

conda create -n nextgpt python=3.8

conda activate nextgpt

# CUDA 11.6
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia

git clone https://github.com/NExT-GPT/NExT-GPT.git
cd NExT-GPT

pip install -r requirements.txt

3. Training/Adapting NExT-GPT on Your Own

3.1. Preparing Pre-trained Checkpoint

NExT-GPT is trained on top of the following excellent existing models. Please follow the instructions to prepare the checkpoints.

  • ImageBind is the unified image/video/audio encoder. The pre-trained checkpoint can be downloaded from here with version huge. Afterward, put the imagebind_huge.pth file at [./ckpt/pretrained_ckpt/imagebind_ckpt/huge].
  • Vicuna: first prepare the LLaMA by following the instructions [here]. Then put the pre-trained model at [./ckpt/pretrained_ckpt/vicuna_ckpt/].
  • Image Diffusion is used to generate images. NExT-GPT uses Stable Diffusion with version v1-5. (will be automatically downloaded)
  • Audio Diffusion for producing audio content. NExT-GPT employs AudioLDM with version l-full. (will be automatically downloaded)
  • Video Diffusion for the video generation. We employ ZeroScope with version v2_576w. (will be automatically downloaded)

3.2. Preparing Dataset

Please download the following datasets used for model training:

A) T-X pairs data

B) Instruction data

3.3. Precomputing Embeddings

In decoding-side alignment training, we minimize the distance between the representations of the signal tokens and those of the captions. To save time and memory, we precompute the text embeddings for image, audio and video captions using the text encoder within the respective diffusion models.
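As a rough illustration of the alignment objective above, the sketch below computes a mean squared distance between signal-token representations and precomputed caption embeddings. The source only says "distance"; mean squared error is an assumption here, and the function name `mean_squared_distance` is hypothetical rather than taken from the repository.

```python
from typing import Sequence


def mean_squared_distance(signal_repr: Sequence[Sequence[float]],
                          caption_embed: Sequence[Sequence[float]]) -> float:
    """Mean squared difference between two equally shaped 2-D arrays.

    Assumption: MSE stands in for the "distance" named in the text.
    Rows are tokens, columns are embedding dimensions.
    """
    if len(signal_repr) != len(caption_embed):
        raise ValueError("token counts differ")
    total, count = 0.0, 0
    for row_s, row_c in zip(signal_repr, caption_embed):
        if len(row_s) != len(row_c):
            raise ValueError("embedding widths differ")
        for s, c in zip(row_s, row_c):
            total += (s - c) ** 2
            count += 1
    if count == 0:
        raise ValueError("empty input")
    return total / count
```

Because the caption embeddings do not change during training, computing them once ahead of time avoids running the diffusion text encoder on every step.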

Please run this command before the following training of NExT-GPT, where the produced embedding file will be saved at [./data/embed].

cd ./code/
python process_embeddings.py ../data/T-X_pair_data/cc3m/cc3m.json image ../data/embed/ runwayml/stable-diffusion-v1-5

Note of arguments:

  • args[1]: path of caption file;
  • args[2]: modality, which can be image, video, and audio;
  • args[3]: saving path of embedding file;
  • args[4]: corresponding pre-trained diffusion model name.
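To make the positional arguments concrete, here is a minimal sketch of how the four values could be read and validated. It is not the actual contents of process_embeddings.py; `EmbeddingJob` and `parse_embedding_args` are hypothetical names used only for illustration.

```python
from typing import List, NamedTuple

VALID_MODALITIES = ("image", "video", "audio")


class EmbeddingJob(NamedTuple):
    caption_path: str      # args[1]: path of caption file
    modality: str          # args[2]: image, video, or audio
    save_path: str         # args[3]: saving path of embedding file
    diffusion_model: str   # args[4]: pre-trained diffusion model name


def parse_embedding_args(argv: List[str]) -> EmbeddingJob:
    """Map a sys.argv-style list (argv[0] is the script name) to named fields."""
    if len(argv) != 5:
        raise ValueError("expected 4 arguments: caption file, modality, "
                         "save path, diffusion model name")
    job = EmbeddingJob(*argv[1:5])
    if job.modality not in VALID_MODALITIES:
        raise ValueError(f"unknown modality: {job.modality!r}")
    return job
```

Calling it with the command shown above would yield `modality == "image"` and `diffusion_model == "runwayml/stable-diffusion-v1-5"`.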

3.4. Training NExT-GPT

First of all, please refer to the base configuration file [./code/config/base.yaml] for the basic system setting of overall modules.

Then, the training of NExT-GPT starts with this script:

cd ./code
bash scripts/train.sh

Specifying the command:

deepspeed --include localhost:0 --master_addr 127.0.0.1 --master_port 28459 train.py \
    --model nextgpt \
    --stage 1\
    --save_path  ../ckpt/delta_ckpt/nextgpt/7b_tiva_v0/\
    --log_path ../ckpt/delta_ckpt/nextgpt/7b_tiva_v0/log/

where the key arguments are:

  • --include: localhost:0 tells deepspeed to use GPU (CUDA device) number 0.
  • --stage: training stage.
  • --save_path: the directory which saves the trained delta weights. This directory will be automatically created.
  • --log_path: the directory which saves the log file.
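If you launch several stages from Python rather than editing train.sh by hand, the command can be assembled programmatically. The sketch below only uses the flags shown in the command above; the helper `build_train_command` and its parameters are hypothetical, and the log directory is assumed to live under the save directory as in the example.

```python
from typing import List


def build_train_command(stage: int,
                        gpu_ids: str = "0",
                        run_name: str = "7b_tiva_v0",
                        master_port: int = 28459) -> List[str]:
    """Return the deepspeed launch command for one training stage as an argv list."""
    if stage not in (1, 2, 3):
        raise ValueError("stage must be 1, 2, or 3")
    save_path = f"../ckpt/delta_ckpt/nextgpt/{run_name}/"
    return [
        "deepspeed",
        "--include", f"localhost:{gpu_ids}",
        "--master_addr", "127.0.0.1",
        "--master_port", str(master_port),
        "train.py",
        "--model", "nextgpt",
        "--stage", str(stage),
        "--save_path", save_path,
        "--log_path", save_path + "log/",
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from inside the ./code directory.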

The whole NExT-GPT training involves 3 steps:

  • Step-1: Encoding-side LLM-centric Multimodal Alignment. This stage trains the input projection layer while freezing ImageBind, the LLM, and the output projection layer.

    Just run the above train.sh script by setting: --stage 1

    Also refer to the running config file [./code/config/stage_1.yaml] and the deepspeed config file [./code/dsconfig/stage_1.json] for more step-wise configurations.

    Note that the datasets used for training in this step are listed in dataset_name_list, and each dataset name must precisely match its definition in [./code/dataset/catalog.py].

  • Step-2: Decoding-side Instruction-following Alignment. This stage trains the output projection layers while freezing ImageBind, the LLM, and the input projection layers.

    Just run the above train.sh script by setting: --stage 2

    Also refer to the running config file [./code/config/stage_2.yaml] and the deepspeed config file [./code/dsconfig/stage_2.json] for more step-wise configurations.

  • Step-3: Instruction Tuning. This stage instruction-tunes 1) the LLM via LoRA, 2) the input projection layer and 3) the output projection layer on the instruction dataset.

    Just run the above train.sh script by setting: --stage 3

    Also refer to the running config file [./code/config/stage_3.yaml] and the deepspeed config file [./code/dsconfig/stage_3.json] for more step-wise configurations.
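The three steps differ mainly in which modules receive gradients. The table-like sketch below restates that split in code; the module labels (`imagebind`, `llm`, `llm_lora_adapter`, and so on) are illustrative names chosen here, not identifiers from the repository.

```python
from typing import Dict, Set

# Modules that exist as pretrained or projection components (illustrative labels).
BASE_MODULES: Set[str] = {"imagebind", "llm", "input_projection", "output_projection"}

# What each stage trains, as described in the steps above.
TRAINABLE: Dict[int, Set[str]] = {
    1: {"input_projection"},
    2: {"output_projection"},
    # Stage 3 tunes the LLM through a LoRA adapter, so the base LLM weights stay frozen.
    3: {"llm_lora_adapter", "input_projection", "output_projection"},
}


def frozen_modules(stage: int) -> Set[str]:
    """Return the base modules that stay frozen in the given stage."""
    if stage not in TRAINABLE:
        raise ValueError("stage must be 1, 2, or 3")
    return BASE_MODULES - TRAINABLE[stage]
```

For example, `frozen_modules(1)` gives ImageBind, the LLM, and the output projection, matching the Step-1 description.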

4. Running NExT-GPT System

4.1. Preparing Checkpoints

First, load the pre-trained NExT-GPT system.

4.2. Deploying Gradio Demo

Upon completion of the checkpoint loading, you can run the demo locally via:

cd ./code
bash scripts/app.sh

Specifying the key arguments as:

  • --nextgpt_ckpt_path: the path of pre-trained NExT-GPT params.

Contact

For any questions or feedback, feel free to contact Shengqiong Wu and Hao Fei.

Citation

If you find NExT-GPT useful in your research or applications, please kindly cite:

@article{wu2023nextgpt,
  title={NExT-GPT: Any-to-Any Multimodal LLM},
  author={Shengqiong Wu and Hao Fei and Leigang Qu and Wei Ji and Tat-Seng Chua},
  journal = {CoRR},
  volume = {abs/2309.05519},
  year={2023}
}

Acknowledgements

You may refer to related work that serves as foundations for our framework and code repository, Vicuna, ImageBind, Stable Diffusion, AudioLDM, and Zeroscope. We also partially draw inspirations from PandaGPT, VPGTrans, GILL, CoDi, Video-LLaMA, and MiniGPT-4. Thanks for their wonderful works.

License Notices

This repository is under BSD 3-Clause License. NExT-GPT is a research project intended for non-commercial use only. One must NOT use the code of NExT-GPT for any illegal, harmful, violent, racist, or sexual purposes. One is strictly prohibited from engaging in any activity that will potentially violate these guidelines. Any potential commercial use of this code should be approved by the authors.

next-gpt's People

Contributors

chocowu, eltociear, next-gpt, scofield7419

next-gpt's Issues

Are all images first padded to create a video using the PadIm2Video function, and then encoded using conv3D?

Sorry for the stupid question. I don't have the computational power to really run the code but I want to study the code and get the logic. Correct me if wrong, there is no preprocessor for ImageOnly. Every image is padded to a video and encoded, passing PadIm2Video and conv3D. What is the reason for doing this? Is it worth it? (I assume processing images are cheaper than processing video.)

Thank you for your answer and discussion!

video error

File "demo_app.py", line 312, in predict
response = model.generate({
File "/NExT-GPT/code/model/anyToImageVideoAudio.py", line 982, in generate
vid_outputs = self.generate_videos(generated_ids, generated_video_embeddings, all_gen_vid_idx, None,
File "/NExT-GPT/code/model/anyToImageVideoAudio.py", line 813, in generate_videos
assert generated_ids[0,
AssertionError: (tensor([32006, 32007, 32008, 32009, 32010, 32011, 32012, 32013, 32014, 32015,
32016, 9427, 13, 2277, 29937, 29871, 13, 13, 1576, 1967,
3697, 263, 9427, 2381], device='cuda:0'), [32006, 32007, 32008, 32009, 32010, 32011, 32012, 32013, 32014, 32015, 32016, 32017, 32018, 32019, 32020, 32021, 32022, 32023, 32024, 32025, 32026, 32027, 32028, 32029])

Error when downloading cc3m

When downloading the cc3m dataset, an error is constantly displayed: 'Field "caption" does not exist in table schema'.

After reviewing the img2dataset documentation, I found that a header row needs to be added to the TSV file first:

pip install sed

sed -i '1s/^/caption\turl\n/' Train_GCC-training.tsv

img2dataset --url_list Train_GCC-training.tsv --input_format "tsv" \
    --url_col "url" --caption_col "caption" --output_format webdataset \
    --output_folder cc3m --processes_count 16 --thread_count 64 --image_size 256 \
    --enable_wandb True

Should embed_tokens.weight and lm_head.weight be frozen in stage1 and stage 2

In stage 1 and stage 2 these two weights are trainable, and the layer names are "llama_model.model.embed_tokens.weight" and "llama_model.lm_head.weight".

But it seems that stage 3 does not load these two weights correctly, as the layer names in stage 3 are "llama_model.base_model.model.model.embed_tokens.weight" and "llama_model.base_model.model.lm_head.weight".
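To illustrate the mismatch, the sketch below rewrites stage 1/2 key names into the stage 3 form by inserting the extra "base_model.model." segment seen in the names above. This only demonstrates the naming difference on plain dictionaries; `remap_to_lora_keys` is a hypothetical helper, and whether such a remap is the right fix for the repository is not confirmed here.

```python
from typing import Any, Dict


def remap_to_lora_keys(state_dict: Dict[str, Any],
                       root: str = "llama_model.",
                       lora_prefix: str = "base_model.model.") -> Dict[str, Any]:
    """Insert the wrapper prefix after `root` for keys that do not have it yet."""
    remapped = {}
    for key, value in state_dict.items():
        if key.startswith(root) and not key.startswith(root + lora_prefix):
            key = root + lora_prefix + key[len(root):]
        remapped[key] = value
    return remapped
```

Keys that do not start with "llama_model." (for example projection layers) pass through unchanged.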

ValueError: Non-consecutive added token '<unk>' found. Should have index 32000 but has index 0 in saved vocabulary

I made the modification suggested in issue #24, yet the error still occurs.

Initializing tokenizer from ../ckpt/pretrained_ckpt/vicuna_ckpt/7b_v0 ...

File "demo_app.py", line 30, in <module>
model = NextGPTModel(**args)
File "NExT-GPT/code/model/anyToImageVideoAudio.py", line 96, in __init__
self.llama_tokenizer = LlamaTokenizer.from_pretrained(tokenizer_path, use_fast=False)
File "anaconda3/envs/nextgpt/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1812, in from_pretrained
return cls._from_pretrained(
File "anaconda3/envs/nextgpt/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2031, in _from_pretrained
raise ValueError(
ValueError: Non-consecutive added token '<unk>' found. Should have index 32000 but has index 0 in saved vocabulary.

About the input video frames

Hi! Thanks for your great work!

I have looked into the code about video processing, but I did not figure out how many frames of a video are encoded into ImageBind encoder. Would you mind telling me some details about this? Thanks!

7b_tiva_v0

the holiday is coming, can you release 7b_tiva_v0 model today? Thanks

Unable to generate visual

Hi there, NextGPT is a great model and would serve as a good foundation for more formidable models that understand more modalities. I also understand that this is just the initial version. Nonetheless, the model seems to be unable to follow instructions or generate coherent visuals. It also seems like the LLM's text generation ability was heavily impacted.

train data in stage 3

I follow 3.2. Preparing Dataset to prepare the data.
Specifically, for T2M data, I use the JSON file you published, but according to catalog.py, raw multimodal data (such as images) is required to generate embedding, but I am not sure which dataset you are using to generate T2M data. For the images, I thought you were using the CC3M dataset to generate T2M data, but I randomly opened several images and found that they did not match the text content in your JSON file. It is possible that you are not using the CC3M dataset to generate T2M data.

Could you disclose what dataset your T2M data was generated from?

Regarding commercial use

Hi, I see that you have used imageBind as the underlying model to generate embeddings, and your license is open for commercial use. However, as far as I know, imageBind is open for non-commercial usage only. Am not an expert at the licenses, but just wanted to know if I want to build a product on top of NExT-GPT (intended for commercial purposes), would it be an issue?

about inference

When I run inference.py with 7b_tiva_v0, I find that the model can't stably generate image/video/audio. The LLM always produces output without signal tokens, like this:
image

Training details

What is the number of GPUs you use during each stage of training?
Could you release the log file during training?

Problem when running the app demo on part 4

Hi,
I have already downloaded 'nextgpt_7b_tiva_v0' weights from huggingface, but when I run 'bash scripts/app.sh', it gives me this error:

Setting ds_accelerator to cuda (auto detect)
/home/kjj8053/anaconda3/envs/nextgpt/lib/python3.8/site-packages/torchvision/transforms/_functional_video.py:6: UserWarning: The 'torchvision.transforms._functional_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms.functional' module instead.
warnings.warn(
/home/kjj8053/anaconda3/envs/nextgpt/lib/python3.8/site-packages/torchvision/transforms/_transforms_video.py:22: UserWarning: The 'torchvision.transforms._transforms_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms' module instead.
warnings.warn(
/home/kjj8053/research/NExT-GPT/code/model/custom_sd.py:28: FutureWarning: Importing DiffusionPipeline or ImagePipelineOutput from diffusers.pipeline_utils is deprecated. Please import from diffusers.pipelines.pipeline_utils instead.
from diffusers.pipeline_utils import DiffusionPipeline
[!] load base configuration: config/base.yaml
Traceback (most recent call last):
File "demo_app.py", line 28, in <module>
args.update(load_config(args))
File "/home/kjj8053/research/NExT-GPT/code/config/__init__.py", line 27, in load_config
if args['mode']:
KeyError: 'mode'

fail to download webvid

0. wget -nc http://www.robots.ox.ac.uk/~maxbain/webvid/results_10M_train.csv
1. pip install video2dataset
2. video2dataset --url_list="results_10M_train.csv" \
       --input_format="csv" \
       --output-format="webdataset" \
       --output_folder="dataset" \
       --url_col="contentUrl" \
       --caption_col="name" \
       --save_additional_columns='[videoid,page_idx,page_dir,duration]' \
       --enable_wandb=True \
       --config="path/to/config.yaml"

Then the following errors were output:
File "/public1/home/aaaaaa/anaconda3/envs/nextgpt/lib/python3.8/site-packages/video2dataset/configs/__init__.py", line 8, in <module>
"default": OmegaConf.load(os.path.join(configs_path, "default.yaml")),
File "/public1/home/stu52275901023/anaconda3/envs/nextgpt/lib/python3.8/site-packages/omegaconf/omegaconf.py", line 189, in load
with io.open(os.path.abspath(file_), "r", encoding="utf-8") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/public1/home/stu52275901023/anaconda3/envs/nextgpt/lib/python3.8/site-packages/video2dataset/configs/default.yaml'

[Bugs] Package & Code Problems found during local deployment

Hi,

I started a remote instance to test local deployment.

Rig :
Ubuntu 20.04
python 3.10
Cuda 11.8
RTX3090

Here are the problems I found when running the demo app locally:

cd ./code
bash scripts/train.sh

#1, Additional pip packages required (Python complained about missing modules; they have not been added to requirements.txt so far)

omegaconf
tensorboard

#2, Wrong local python modules

This line throws an error:

import data

This suggests that the data folder in the root directory should also contain an __init__.py to become a Python module? I commented this line out in order to proceed.

#3, Python dict key error

Then I got this error log, it seems python is complaining about missing key.

[!] load base configuration: config/base.yaml
Traceback (most recent call last):
  File "/root/CodeSapce/NextGPT/code/demo_app.py", line 28, in <module>
    args.update(load_config(args))
  File "/root/CodeSapce/NextGPT/code/config/__init__.py", line 27, in load_config
    if args['mode']:
KeyError: 'mode'

I changed it to

if args.get('mode'):

I don't think it's good practice to use [] to test whether a key exists; we should use .get() instead.
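The difference between the two lookups can be shown on a plain dictionary. This is a minimal sketch of the pattern described above, assuming `args` is an ordinary `dict`; `mode_is_set` is a hypothetical helper, not code from the repository.

```python
def mode_is_set(args: dict) -> bool:
    """Return True only when 'mode' is present and truthy.

    args['mode'] raises KeyError when the key is missing,
    whereas args.get('mode') returns None instead.
    """
    return bool(args.get("mode"))


def mode_or_default(args: dict, default: str = "validate") -> str:
    """Read 'mode', falling back to a default (the default value is an assumption)."""
    return args.get("mode", default)
```

With `.get()`, a missing inference-time key degrades to a default instead of aborting the demo at startup.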

#4, Missing inference time config key value in yaml

And there are more dict key errors for keys that do not exist:

Initializing language decoder from ../ckpt/pretrained_ckpt/vicuna_ckpt/7b_v0 ...
Loading checkpoint shards: 100%|██████████| 2/2 [00:07<00:00,  3.66s/it]
Traceback (most recent call last):
  File "/root/CodeSapce/NextGPT/code/demo_app.py", line 29, in <module>
    model = NextGPTModel(**args)
  File "/root/CodeSapce/NextGPT/code/model/anyToImageVideoAudio.py", line 63, in __init__
    if self.args['freeze_lm']:
KeyError: 'freeze_lm'

#5, CUDA visible device env variable

This line of code sets CUDA_VISIBLE_DEVICES to 7 in the Python code for the demo app, so I am guessing you have 8 CUDA devices and only make the last one visible in your demo app?

os.environ['CUDA_VISIBLE_DEVICES'] = '7'

We should probably do it in the shell command, so that users are free to set this env variable themselves instead of modifying the code to fit their individual CUDA device setup:

export CUDA_VISIBLE_DEVICES=7
cd code
./scripts/app.sh
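On the Python side, one way to honor a value exported in the shell is to set the variable only when it is absent. The sketch below shows that pattern; `select_cuda_devices` and its default of "0" are assumptions for illustration, and the call would need to run before any CUDA library is initialized for it to take effect.

```python
import os


def select_cuda_devices(default: str = "0") -> str:
    """Keep CUDA_VISIBLE_DEVICES from the environment, else fall back to `default`.

    os.environ.setdefault only writes the value when the variable is not set,
    so a shell-level `export CUDA_VISIBLE_DEVICES=7` is left untouched.
    """
    return os.environ.setdefault("CUDA_VISIBLE_DEVICES", default)
```

This keeps a sensible default for single-GPU machines without hard-coding a device index that only exists on an 8-GPU host.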

Using huggingface's pytorch_model.pt, the answer to the error is as follows

Thanks for the colourful work, I'm having some problems running it
image
text prompt: ### Human: hello

Assistant:

all_gen_img_idx: []
all_gen_vid_idx: []
all_gen_aud_idx: []
text_outputs: ['surely surely surely surely surely surely surely surely surely surely surely surely surely surely surely surely surely surely surely surely surely surely surely surely surely surely surely surely surely surely surely surely surely surely surely surely surely critical critical

FileNotFoundError: [Errno 2] No such file or directory

Hello, when performing Prepare Vicuna Checkpoint at "https://github.com/NExT-GPT/NExT-GPT/blob/main/ckpt/pretrained_ckpt/prepare_vicuna.md", the generated file is "vicuna_ckpt/7b_v0/", In "https://github.com/NExT-GPT/NExT-GPT/blob/main/README.md", Step-2 of 4.1. Preparing Checkpoints is ./ckpt/delta_ckpt/nextgpt/7b_tiva_v0, as shown above It is 7b_v0 followed by 7b_tiva_v0. Change 7b_v0 to 7b_tiva_v0 and run sh scripts/app.sh to report an error.
FileNotFoundError: [Errno 2] No such file or directory:
'../ckpt/delta_ckpt/nextgpt/7b_tiva_v0/pytorch_model.pt'

Question about generation process

During the image generation process, what's in positions of is their embedding? which is independent of the last generated representations. If so, can I assume that the four representations were generated in parallel?

Fail to create image from text prompt

I ran the demo from huggingface.
I tried to produce some images from texts.
But I got this:

I'm sorry, but I'm an AI language model and I don't have the capability to produce images. However, I can help you describe the image you want to see.

Is this expected? Or I got something wrong.

About audio and video special token

I found that the special tokens "", " ", "", "" you added to tokenizer were not used in promot_warp and _prepare_xxx_embed, only "", "/Img>" used in both video and audio cases. so it seems to be unnecessary to define them ?

wrong answer

image
I used the checkpoint from huggingface directly, and I added the line ' parser.add_argument('--mode', type=str, default='validate')' to demo_app.py to avoid the error. In addition, I found the response had not '' token when I printed the results.

A problem when combining the weights

(nextgpt) admin1:~/nextgpt/src$ python -m fastchat.model.apply_delta --base llama_hf --target ./vicuna_ckpt/7b_v0/ --delta vicuna
Loading the delta weights from vicuna
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in huggingface/transformers#24565
Loading checkpoint shards: 100%|██████████| 2/2 [00:07<00:00, 3.90s/it]
Loading the base model from llama_hf
Loading checkpoint shards: 100%|██████████| 2/2 [00:09<00:00, 4.91s/it]
Applying the delta
Applying delta: 0%| | 0/291 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/public1/home/stu52275901023/anaconda3/envs/nextgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/public1/home/stu52275901023/anaconda3/envs/nextgpt/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/public1/home/stu52275901023/anaconda3/envs/nextgpt/lib/python3.8/site-packages/fastchat/model/apply_delta.py", line 165, in <module>
apply_delta(args.base_model_path, args.target_model_path, args.delta_path)
File "/public1/home/stu52275901023/anaconda3/envs/nextgpt/lib/python3.8/site-packages/fastchat/model/apply_delta.py", line 140, in apply_delta
param.data += delta.state_dict()[name]
RuntimeError: The size of tensor a (32000) must match the size of tensor b (32001) at non-singleton dimension 0

Demo not working

I wanted to commend you on the excellent application you've built. The demo worked flawlessly yesterday for some of the videos I tried, but unfortunately, it's not working today. I'd appreciate any assistance in resolving this issue.

MosIT data

Great job! When will you open source MosIT data?

error: image_diffusion: stabilityai/stable-diffusion-2

image_diffusion: stabilityai/stable-diffusion-2

File "NExT-GPT/code/model/custom_sd.py", line 592, in call
prompt_embeds = self._encode_prompt(
File "/NExT-GPT/code/model/custom_sd.py", line 369, in _encode_prompt
prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 1024 but got size 768 for tensor number 1 in the list.

Suggestion: Expand Model Outreach with hosting Gradio demo on HuggingFace Hub

Hi 🤗 !
Congratulations! Very cool work with Gradio demo linked on the Readme! It would be nice to host the awesome demo on the Hugging Face Hub !

Some of the benefits of sharing your models through the Hub would be:

  • A wider reach of your work to the ecosystem
  • Seamless integration with popular libraries and frameworks, enhancing usability
  • Real-time feedback and collaboration opportunities with a global community of researchers and developers

This is a step-by-step guide explaining the process in case you're interested. 😊 This is our docs on Community GPU Grants.

image

code structure

the code structure is very cool and makes it easy to understand the repo. Was wondering how did you generate it? Did you make it manually?

Why train each modal separately?

Thank you for the release of the code! According to README: "--mm_root_path .../.../x, x varies from [images, audios, videos]", it seems that, for three stages, data from different modals is trained separately. Is it reasonable?

Package versions have conflicting dependencies

`Using cached PyYAML-6.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (701 kB)
ERROR: Cannot install PyYAML==6.0 and PyYAML==6.0.1 because these package versions have conflicting dependencies.

The conflict is caused by:
The user requested PyYAML==6.0
The user requested PyYAML==6.0.1`

Does anyone know which PyYAML pin I should relax, and to which version?
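The pip message means the same package was requested with two different `==` pins, so no single version can satisfy both. A small sketch for locating such duplicate pins in requirement lines is shown below; `conflicting_pins` and the sample lines are hypothetical, and the parser handles only simple `name==version` entries.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set

# Matches simple pins such as "PyYAML==6.0"; extras and markers are not handled.
PIN = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*==\s*([^\s;#]+)")


def conflicting_pins(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Return packages that are pinned with `==` to more than one version."""
    pins: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        match = PIN.match(line)
        if match:
            # Package names are case-insensitive, so normalise before grouping.
            pins[match.group(1).lower()].add(match.group(2))
    return {name: versions for name, versions in pins.items() if len(versions) > 1}


# Hypothetical requirements content reproducing the reported conflict.
sample = ["numpy==1.24.3", "PyYAML==6.0", "pyyaml==6.0.1  # duplicate pin"]
conflicts = conflicting_pins(sample)
```

Keeping a single pin, or relaxing one of them to a range such as `PyYAML>=6.0`, removes the duplicate request. Which version the project actually needs is not stated in the issue.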

Evaluation scripts

Hi! Thank you for releasing code for your model!
Could you please share evaluation scripts as well? Or describe how you ran evals? Thank you!

Bad results during inference

I used the 7b_tiva_v0 weights downloaded from Hugging Face, and when I ran bash script/app.sh, I got the following results:
[image]

AttributeError: 'Namespace' object has no attribute 'update'

After executing the command bash scripts/app.sh

from diffusers.pipeline_utils import DiffusionPipeline
Traceback (most recent call last):
File "demo_app.py", line 28, in <module>
args.update(load_config(args))
AttributeError: 'Namespace' object has no attribute 'update'

How to isolate this problem?
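The error follows from `argparse.Namespace` not being a dictionary: it has no `update` method. One common way around this, sketched below, is to convert the namespace with `vars()` and merge the loaded configuration into the resulting dict. The `load_config` stub, the `--model` flag, and the returned values are hypothetical stand-ins; the real function in the repository may take different arguments and return different keys.

```python
import argparse


def load_config(args: dict) -> dict:
    """Hypothetical stand-in for the repository's config loader."""
    return {"lora_r": 32, "stage": args.get("stage", 1)}  # placeholder values


parser = argparse.ArgumentParser()
parser.add_argument("--model", default="nextgpt")  # illustrative flag
namespace = parser.parse_args([])  # empty list: ignore the real command line

args = dict(vars(namespace))       # copy the Namespace attributes into a plain dict
args.update(load_config(args))     # dict.update works where Namespace.update does not
```

Code that later reads `args['lora_r']` then works on the merged dict, whereas attribute access such as `args.lora_r` would no longer apply.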

Questions about the training objectives

Hi, thank you for the fantastic work! I have a few questions about this work:

  • Have you compared the effectiveness of generating signal tokens versus directly generating captions? In NExT-GPT, the representation of "<IMG_0> <IMG_1> <IMG_2> <IMG_3>" is aligned with the outputs of the CLIP text encoder. What if we directly generate the corresponding caption (<IMG_0> {caption} <IMG_1>) and input it to the CLIP text encoder? The motivation is that it is often challenging to optimize a model with a combination of L2 loss and cross-entropy, because the relative weight of the two is difficult to control. By directly generating captions, we would only need to optimize the entire model through cross-entropy, and the Output Projection would no longer be needed.

  • Does introducing the L2 loss contribute to improved performance on image-to-text tasks (such as COCO captioning)? Have you conducted any related ablation experiments?

  • I noticed that Next-GPT attained an impressive CIDEr score of 156.7 on COCO, surpassing previous approaches utilizing finetuning and SCST. Did Next-GPT undergo finetuning or SCST on COCO?
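For reference, the trade-off raised in the first question can be written as a weighted sum. This is a generic formulation and is not taken from the paper: $\lambda$ is an assumed weighting coefficient, $h_{1:k}$ the hidden states of the $k$ signal tokens, $f$ the output projection, and $e$ the CLIP text-encoder embedding of the caption.

```latex
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{CE}} \;+\; \lambda \,\bigl\lVert f(h_{1:k}) - e \bigr\rVert_2^2
```

Under the caption-generation alternative proposed in the question, the second term would vanish and only $\mathcal{L}_{\mathrm{CE}}$ would remain.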

KeyError: 'lora_r' in demo_app.py

File "demo_app.py", line 27, in <module>
model = NextGPTModel(**args)
File "/NExT-GPT/code/model/anyToImageVideoAudio.py", line 80, in __init__
r=self.args['lora_r'],
KeyError: 'lora_r'
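The `KeyError` indicates that the dictionary passed to `NextGPTModel` has no `lora_r` entry, which usually means the configuration that defines it was not merged into `args`. A small check before constructing the model can surface this earlier. This is a sketch: `REQUIRED_KEYS` and `missing_keys` are hypothetical, and only `lora_r` is known from the traceback.

```python
from typing import Iterable, List, Mapping

# Only "lora_r" appears in the traceback; extend with other keys as needed.
REQUIRED_KEYS = ("lora_r",)


def missing_keys(args: Mapping[str, object], required: Iterable[str] = REQUIRED_KEYS) -> List[str]:
    """Return the required keys that are absent from args, in order."""
    return [key for key in required if key not in args]


args = {"model": "nextgpt", "stage": 3}  # illustrative values
absent = missing_keys(args)
if absent:
    report = f"config is missing: {', '.join(absent)}"
```

Printing `report` before the model is built points at the configuration step rather than at the model constructor.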
