
lyuchenyang / macaw-llm

1.5K stars, 32 watchers, 119 forks, 36.72 MB

Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration

License: Apache License 2.0

Python 98.38% Shell 1.62%
language-model multi-modal-learning natural-language-processing deep-learning machine-learning neural-networks

macaw-llm's People

Contributors

bingshuailiu, longyuewangdcu, lyuchenyang, seeledu


macaw-llm's Issues

Data filtering step

When processing the dataset, there is a filtering criterion:

if 'caption' in e['instruction'] or 'caption' in e['response'] or ' no ' in e['response'] or 'not' in e['response']:
    continue

Why do we need such a filtering step?
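For context, a minimal sketch of how this check typically sits inside the preprocessing loop (the example list here is made up for illustration; only the condition is copied from above):

# Sketch only: apply the quoted filter to a small, made-up list of example dicts.
all_examples = [
    {'instruction': 'Describe the scene.', 'response': 'A dog runs on the beach.'},
    {'instruction': 'Write a caption for this image.', 'response': 'A cat on a sofa.'},
    {'instruction': 'Is there a car?', 'response': 'There is no car here.'},
]
filtered = []
for e in all_examples:
    if ('caption' in e['instruction'] or 'caption' in e['response']
            or ' no ' in e['response'] or 'not' in e['response']):
        continue
    filtered.append(e)
print(len(filtered))  # 1: only the first example survives the filter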

GPU Memory Requirement

Thank you for your awesome work! I would like to know the minimum GPU memory needed to run this project. Can it run on 2x 3090 GPUs?

Performance of the model

Hello,
I tried to load the pre-trained model you provided and run the following example from AVSD data:

  {
        "instruction": "Is the woman already in the room?",
        "input": "",
        "output": "Yes ahe is already in the room",
        "image": null,
        "audio": null,
        "video": "7UPGT.mp4"
    },

Basically, to prepare the whisper model, clip model, and llama model, I used the following:

   # save whisper, clip, and llama models for future use.
from transformers import CLIPModel, LlamaModel
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
from transformers import WhisperForConditionalGeneration
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
llama7b_model = LlamaModel.from_pretrained("decapoda-research/llama-7b-hf")

clip_model.save_pretrained('pretrained_models/clip_model/')
whisper_model.save_pretrained('pretrained_models/whisper_model/')
llama7b_model.save_pretrained('pretrained_models/llama7b_model/')

To load the macaw model you provided, I used the following:

if __name__ == "__main__":
    clip_config = CLIPConfig.from_pretrained('pretrained_models/clip_model/')
    whisper_config = WhisperConfig.from_pretrained('pretrained_models/whisper_model/')
    llm_config = AutoConfig.from_pretrained('pretrained_models/llama7b_model/')
    tokenizer = get_tokenizer("pretrained_models/macaw/", tokenizer_cls=LlamaTokenizer)
    llm_config.vocab_size = len(tokenizer)
    print("llm_config: ", llm_config)

    model_config = MM_LLMs_Config(
        n_frames=6,
        attention_heads=32,
        image_conv_kernel=48,
        image_conv_stride=36,
        video_conv_kernel=36,
        video_conv_stride=30,
        audio_conv_kernel=240,
        audio_conv_stride=220,
        clip_config=clip_config, whisper_config=whisper_config, llm_config=llm_config
    )

    macaw_model = MM_LLMs.from_pretrained(
        'pretrained_models/macaw/',
        config=model_config,
        # load_in_8bit=True,
        # torch_dtype=torch.float16,
        # device_map=device_map,
    )
    TOKENIZER = get_tokenizer("pretrained_models/macaw/", tokenizer_cls=LlamaTokenizer)

I run the model by:

macaw_model.eval()
with torch.no_grad():
    generate_ids = macaw_model(data_item)
print("generate_ids: ", generate_ids)
input_texts = TOKENIZER.batch_decode(data_item["input_ids"], skip_special_tokens=True, clean_up_tokenization_spaces=False)
generated_texts = TOKENIZER.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print("input_texts: ", input_texts)
print("generated_texts: ", generated_texts)

Then I tested the above AVSD example. What I get is:

input_texts: ['Below is an instruction that describes a task, with or without input. Write a response that appropriately completes the request.\n\n### Instruction:\nIs the woman already in the room?\n\n### Response:\n\n']
generated_texts: ['\n\n']

So you can see, the output is nonsense. I tried some other examples, as well as pure text input, but the results are not satisfying. May I ask what may be wrong?

Questions about the files - which files to download

Thanks for the cool project! I have two questions:

  1. Which files exactly should we download? The COCO, VQA, etc. datasets contain many files, and I believe only some of them are needed. For example, I downloaded the following:

Stage 1:

1. Download the COCO image dataset (2014 Train images [83K/13GB]) from: https://cocodataset.org/#download, unzip to current folder (train2014/).

2. Download the Macaw dataset: https://github.com/lyuchenyang/Macaw-LLM/blob/main/data/generated_examples_coco.json

3. Download the Macaw dataset: https://github.com/lyuchenyang/Macaw-LLM/blob/main/data/generated_examples_avsd.json

4. Download the Charades video dataset (Data (scaled to 480p, 13 GB)) from: https://prior.allenai.org/projects/charades, unzip to current folder (Charades_v1_480/).

5. In the current folder, create a folder named "avsd/". In "./avsd/", create "./avsd/videos/", "./avsd/audios/", and "./avsd/images/". Move all the videos from "Charades_v1_480/" to "./avsd/videos/".

6. In the current folder, create a folder named "coco/". In "./coco/", create "./coco/images/". Move all the images from "train2014/" to "./coco/images/".

Stage 2:

1. From https://visualqa.org/download.html download "Training annotations 2017 v2.0*", "Validation annotations 2017 v2.0*", "Training questions 2017 v2.0*", "Validation questions 2017 v2.0*". Put them in "./vqa/" and unzip.

2. From https://video-dialog.com/ download AVSD Dataset (4 files), put them into "./avsd/".

But I'm not sure whether that is all we need.

  2. In combine_visual_and_audio_names() of the supervised preprocessing Python script, there is:

def add_image_names(dir=None):
    all_examples = json_load(dir)['annotations']

    for ind, e in enumerate(tqdm(all_examples)):
        _image_dir = e['image_path']
        if len(_image_dir.split('_')[-1].split('.')[0]) < 12:
            i_str = _image_dir.split('_')[-1].split('.')[0]
            n_str = '0' * (12 - len(i_str)) + i_str
            _image_dir = _image_dir.replace(i_str, n_str)
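For reference, the padding above normalizes short COCO IDs to the 12-digit filename form; a quick sketch with a hypothetical un-padded name:

_image_dir = 'COCO_train2014_492606.jpg'         # hypothetical un-padded name
i_str = _image_dir.split('_')[-1].split('.')[0]  # '492606'
n_str = '0' * (12 - len(i_str)) + i_str          # '000000492606'
print(_image_dir.replace(i_str, n_str))          # COCO_train2014_000000492606.jpg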

However, I can't find any "image_path" field in any of the above json files.

Looking forward to your answer. Thank you.

Paths for pretrained models

Hi, can you please provide huggingface paths for the following?

clip_config = CLIPConfig.from_pretrained('trained_models/clip_model')
whisper_config = WhisperConfig.from_pretrained('trained_models/whisper_model')

I tried with openai/clip-vit-base-patch16 and openai/whisper-base but there seems to be a mismatch in shapes upon loading the model.
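For what it's worth, the loading code in the "Performance of the model" issue above resizes the LLaMA vocabulary to the Macaw tokenizer before calling from_pretrained, which is a common source of shape mismatches; a minimal sketch of that setup with the hub IDs tried here (the tokenizer path is an assumption borrowed from that issue):

from transformers import AutoConfig, CLIPConfig, WhisperConfig, LlamaTokenizer

clip_config = CLIPConfig.from_pretrained("openai/clip-vit-base-patch16")
whisper_config = WhisperConfig.from_pretrained("openai/whisper-base")
llm_config = AutoConfig.from_pretrained("decapoda-research/llama-7b-hf")

# The Macaw checkpoint expects the enlarged vocabulary of its own tokenizer,
# so vocab_size must match len(tokenizer) before the model is loaded.
tokenizer = LlamaTokenizer.from_pretrained("pretrained_models/macaw/")
llm_config.vocab_size = len(tokenizer)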

Thanks

Requirement Versions

Multiple requirements have no version pins, which leads to problems during installation. (A small sketch for recording working versions follows the list below.)

protobuf
scikit-learn
moviepy
ffmpeg-python
tqdm
pandas
opencv-python
clip
openai-whisper
appdirs
loralib
bitsandbytes
black
black[jupyter]
fire
gradio
peft
deepspeed
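As a stopgap, here is a minimal sketch (run in an environment where the install happens to work) that prints the installed version of each listed distribution so they can be pinned; black[jupyter] is an extra of black and is covered by the black entry:

from importlib.metadata import version, PackageNotFoundError

packages = [
    "protobuf", "scikit-learn", "moviepy", "ffmpeg-python", "tqdm", "pandas",
    "opencv-python", "clip", "openai-whisper", "appdirs", "loralib",
    "bitsandbytes", "black", "fire", "gradio", "peft", "deepspeed",
]

for name in packages:
    try:
        print(f"{name}=={version(name)}")       # e.g. for pinning in requirements.txt
    except PackageNotFoundError:
        print(f"# {name}: not installed")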

Different LLM backbones?

Hi, the README mentions several different LLM backbones, but the paper seems to reference only LLaMA and a brief code search didn't turn up any mentions of Vicuna or Bloom. Did you train this with other LLMs beyond LLaMA and if so, where can we find the trained weights for these?

Thank you!

Question about setting pad token

Hi, may I know how to set the pad token?
In a previous version of the code, it was set to [32006]. I checked the LLaMA tokenizer files, and 32006 is not used yet. Can I use any ID that is not already used?
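For reference, a minimal sketch of one common way to register a dedicated pad token (the '<pad>' string and the checkpoint name are assumptions, not necessarily what this repo does):

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
tokenizer.add_special_tokens({"pad_token": "<pad>"})  # registers a new id beyond the base 32000-token vocab
print(tokenizer.pad_token, tokenizer.pad_token_id)

# The model's embedding table must then grow to cover the new id,
# otherwise ids such as 32006 fall outside the original vocabulary:
# model.resize_token_embeddings(len(tokenizer))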

TypeError: string indices must be integers, not 'str'

preprocess_data_unsupervised.py", line 105, in preprocess_alpaca_to_tensor_dataset
texts = PROMPT_DICT['prompt_input'].format(e['instruction'], e['input']) if e['input'] != "" else PROMPT_DICT['prompt_no_input'].format(e['instruction'])
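For context, this error usually means e is a string rather than a dict, which happens when the loop iterates over a dict (yielding string keys) instead of a list of example dicts; a minimal sketch of the failure mode with made-up data:

PROMPT = "### Instruction:\n{}\n\n### Input:\n{}\n"

data_as_list = [{"instruction": "Describe the image.", "input": ""}]
data_as_dict = {"example_0": {"instruction": "Describe the image.", "input": ""}}

for e in data_as_list:
    PROMPT.format(e["instruction"], e["input"])   # fine: e is a dict

for e in data_as_dict:
    PROMPT.format(e["instruction"], e["input"])   # TypeError: e is the string key "example_0"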

Call for paper

Hi, I appreciate your great work! I wonder whether there is any paper related to this project.

Some weights of MM_LLMs were not initialized from the model checkpoint at ./mm_llms_trainer/ and are newly initialized:

Thank you very much for your outstanding work. I encountered the following problem when loading model weights. When I used torch.load to load pytorch_model.bin, I found that this part of the weights was indeed missing.
Some weights of MM_LLMs were not initialized from the model checkpoint at ./mm_llms_trainer/ and are newly initialized: ['video_long_self_attention.in_proj_bias', 'video_long_self_attention.bias_v', 'video_long_self_attention.in_proj_weight', 'video_long_self_attention.out_proj.bias', 'video_long_self_attention.bias_k', 'video_long_self_attention.out_proj.weight']
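A minimal diagnostic sketch (not the repo's code) to list which of those keys the checkpoint actually contains:

import torch

# Load only the state dict and check for the prefix named in the warning.
state_dict = torch.load("./mm_llms_trainer/pytorch_model.bin", map_location="cpu")
keys = [k for k in state_dict if k.startswith("video_long_self_attention")]
print(keys or "no video_long_self_attention.* keys in this checkpoint")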

Resource problem?

With three big models like CLIP, LLaMA, and Whisper, how much VRAM do we need at minimum to host a demo? Is it possible to host them on a single 4090 GPU?

What is the pad ID for tokenizer?

In the trainer file, I saw:

special_tokens = {
    '<image>': 32000,
    '</image>': 32001,
    '<audio>': 32002,
    '</audio>': 32003,
    '<video>': 32004,
    '</video>': 32005,
}

But in the preprocessing files, I didn't see these tokens being set. Instead, I printed the token IDs and found that the PAD token ID seems to be 32000. What is the potential problem? What is the pad_id for the tokenizer?
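A minimal sketch for checking how the tokenizer actually maps these tokens (the tokenizer path is borrowed from the "Performance of the model" issue above and is an assumption here):

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("pretrained_models/macaw/")
for tok in ["<image>", "</image>", "<audio>", "</audio>", "<video>", "</video>"]:
    print(tok, tokenizer.convert_tokens_to_ids(tok))
print("pad:", tokenizer.pad_token, tokenizer.pad_token_id)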

Questions about Model

Dear Author,

I would like to express my sincere gratitude for your open-source contributions. Your neural network model has left a deep impression on me. It seems that your model is driven by text information (CLIP aligns images and text, while Whisper aligns audio and text), and the ultimate goal of the model appears to be more inclined towards multimodal QA and multimodal captioning. However, I have the following questions:

  1. The dimensions of different modalities are vastly different. How do you balance the information from different modalities in your network?
  2. In real-world scenarios, there may be missing modalities. Do you need to input information from all three modalities during the training/inference process of your model, or can you only input certain modalities?

I am looking forward to your work and hope to see your article soon. Thank you.

Best regards,
RitchieAlpha

Using pad_token, but it is not set yet.

Hi, when I run "preprocess_data_supervised.py" with the llama-7b-hf tokenizer, it shows "Using pad_token, but it is not set yet" and "Truncation was not explicitly activated but max_length is provided a specific value, ...".

Is it ok?
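For reference, both warnings go away if a pad token is assigned and truncation is requested explicitly before tokenizing; a minimal sketch (reusing the eos token as padding is an assumption, not necessarily what the repo intends):

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token   # assumption: reuse </s> as padding
batch = tokenizer(["hello world"], padding="max_length", max_length=16, truncation=True)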

Which tokenizer are you using?

Which tokenizer are you using?
tokenizer = AutoTokenizer.from_pretrained('trained_models/llama_tokenizer')
This seems to not work.

which llama tokenizer to use?

In the preprocessing file, we have tokenizer = AutoTokenizer.from_pretrained('trained_models/llama_tokenizer'). This won't fetch the LLaMA tokenizer from HF. Which LLaMA tokenizer should we use, given that there are several versions on HF? Thanks.
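For what it's worth, other issues in this thread load decapoda-research/llama-7b-hf, so one hedged option is to download that tokenizer and save it to the path the script expects:

from transformers import LlamaTokenizer

# Assumption: the same LLaMA checkpoint referenced elsewhere in these issues.
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
tokenizer.save_pretrained("trained_models/llama_tokenizer")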

Always have same response

Hi, I have loaded your pre-trained weights and tried some instructions. However, I found the model responded with the same answer no matter what image I gave.

model = MM_LLMs.from_pretrained(
    "trained_model/mm_llms_trainer",
    config=model_config,
)
model.eval()
# ...

instruction = "How many boats are in the picture?"
template = f"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:"

input_ids = tokenizer.encode(template)  # template is already an f-string, so no further .format() is needed
eos_token_id = tokenizer.eos_token_id
if eos_token_id in input_ids:
    input_ids.remove(eos_token_id)
input_ids = torch.tensor([input_ids], dtype=torch.int).to(device)

# image
# image = preprocess(Image.open("data/image_sample/COCO_train2014_000000492606.jpg"))
# image = preprocess(Image.open("data/image_sample/COCO_train2014_000000344896.jpg"))
image = preprocess(Image.open("data/image_sample/COCO_train2014_000000407061.jpg"))
image = image.unsqueeze(0)

with torch.no_grad():
    bs = 1
    
    inputs = {
        "videos": None,
        "images": image.half(),
        "audios": None,
        "input_ids": input_ids,
        'image_starts': torch.tensor([tokenizer.convert_tokens_to_ids('<image>')] * bs, dtype=torch.int),
        'image_ends': torch.tensor([tokenizer.convert_tokens_to_ids('</image>')] * bs, dtype=torch.int),
        'audio_starts': torch.tensor([tokenizer.convert_tokens_to_ids('<audio>')] * bs, dtype=torch.int),
        'audio_ends': torch.tensor([tokenizer.convert_tokens_to_ids('</audio>')] * bs, dtype=torch.int),
        'video_starts': torch.tensor([tokenizer.convert_tokens_to_ids('<video>')] * bs, dtype=torch.int),
        'video_ends': torch.tensor([tokenizer.convert_tokens_to_ids('</video>')] * bs, dtype=torch.int),
    }

    for k,v in inputs.items():
        if v is not None:
            inputs[k] = v.to(device)
    inputs['inference'] = True
    
    
    text_embeddings, attention_mask, labels, debug = model.prepare_inputs_for_generation(inputs)
    
    print()
    print(text_embeddings.size())
        

    model_output = model.llm(inputs_embeds=text_embeddings, attention_mask=attention_mask, labels=labels)
    generate_ids = model.llm.generate(inputs_embeds=text_embeddings, max_new_tokens=128, eos_token_id=2, bos_token_id=1, pad_token_id=32006)

The output I get is:
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
How many boats are in the picture?

### Response:
========================================
There are 5000 in the picture.
========================================

No matter what image I give to the model, it always replies "There are 5000 in the picture." to the same prompt. It seems the model just ignores the multi-modal inputs and replies based on text alone.

Did I do anything wrong? Thank you.

Missing License File

Hello,
Thanks for sharing this work.

But the repo seems to be missing a LICENSE file, which makes it difficult for people to decide whether they can use this project in their work.

Has a decision regarding licensing been made?

Thanks!

please update the demo code?

Hi, dear authors,
Thanks for sharing this great work. I noticed that you have uploaded the training and evaluation code, but there is no demo code (e.g., for VQA). It would be great if you could release the demo code. Thank you.

How to get the whisper, clip, and llama model used by macaw?

I used the following code to get the pretrained models:

from transformers import CLIPModel, LlamaModel
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
from transformers import WhisperForConditionalGeneration
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
llama7b_model = LlamaModel.from_pretrained("decapoda-research/llama-7b-hf")
clip_model.save_pretrained('trained_models/clip_model/')
whisper_model.save_pretrained('trained_models/whisper_model/')
llama7b_model.save_pretrained('trained_models/llama7b_model/')

Is this correct?

missing file "data/all_visual_names.json"

Hi, thank you for making such great work open source.
However, I have encountered some issues:

  1. When I run inference.sh, I get a missing-file error for 'data/all_visual_names.json'. How can I get this file?
  2. Are there trained models we can use for inference directly?
