
lyuchenyang / macaw-llm

1.5K stars, 32 watchers, 119 forks, 36.72 MB

Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration

License: Apache License 2.0

Python 98.38% Shell 1.62%
language-model multi-modal-learning natural-language-processing deep-learning machine-learning neural-networks

macaw-llm's People

Contributors

bingshuailiu, longyuewangdcu, lyuchenyang, seeledu


macaw-llm's Issues

Data filtering step

When processing the dataset, there is a filtering criterion:

if 'caption' in e['instruction'] or 'caption' in e['response'] or ' no ' in e['response'] or 'not' in e['response']:
    continue

Why do we need such a filtering step?
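For context, a minimal sketch of how this check typically sits inside the preprocessing loop (the example list here is made up for illustration; only the condition is copied from above):

# Sketch only: apply the quoted filter to a small, made-up list of example dicts.
all_examples = [
    {'instruction': 'Describe the scene.', 'response': 'A dog runs on the beach.'},
    {'instruction': 'Write a caption for this image.', 'response': 'A cat on a sofa.'},
    {'instruction': 'Is there a car?', 'response': 'There is no car here.'},
]
filtered = []
for e in all_examples:
    if ('caption' in e['instruction'] or 'caption' in e['response']
            or ' no ' in e['response'] or 'not' in e['response']):
        continue
    filtered.append(e)
print(len(filtered))  # 1: only the first example survives the filter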

GPU Memory Requirement

Thank you for your awesome work! I would like to know the minimum GPU memory needed to run this project. Can it run on 2x 3090 GPUs?

Performance of the model

Hello,
I tried to load the pre-trained model you provided and run the following example from AVSD data:

  {
        "instruction": "Is the woman already in the room?",
        "input": "",
        "output": "Yes ahe is already in the room",
        "image": null,
        "audio": null,
        "video": "7UPGT.mp4"
    },

Basically, to prepare the whisper model, clip model, and llama model, I used the following:

   # save whisper, clip, and llama models for future use.
from transformers import CLIPModel, LlamaModel
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
from transformers import WhisperForConditionalGeneration
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
llama7b_model = LlamaModel.from_pretrained("decapoda-research/llama-7b-hf")

clip_model.save_pretrained('pretrained_models/clip_model/')
whisper_model.save_pretrained('pretrained_models/whisper_model/')
llama7b_model.save_pretrained('pretrained_models/llama7b_model/')

To load the macaw model you provided, I used the following:

if __name__ == "__main__":
    clip_config = CLIPConfig.from_pretrained('pretrained_models/clip_model/')
    whisper_config = WhisperConfig.from_pretrained('pretrained_models/whisper_model/')
    llm_config = AutoConfig.from_pretrained('pretrained_models/llama7b_model/')
    tokenizer = get_tokenizer("pretrained_models/macaw/", tokenizer_cls=LlamaTokenizer)
    llm_config.vocab_size = len(tokenizer)
    print("llm_config: ", llm_config)

    model_config = MM_LLMs_Config(
        n_frames=6,
        attention_heads=32,
        image_conv_kernel=48,
        image_conv_stride=36,
        video_conv_kernel=36,
        video_conv_stride=30,
        audio_conv_kernel=240,
        audio_conv_stride=220,
        clip_config=clip_config, whisper_config=whisper_config, llm_config=llm_config
    )

    macaw_model = MM_LLMs.from_pretrained(
        'pretrained_models/macaw/',
        config=model_config,
        # load_in_8bit=True,
        # torch_dtype=torch.float16,
        # device_map=device_map,
    )
    TOKENIZER = get_tokenizer("pretrained_models/macaw/", tokenizer_cls=LlamaTokenizer)

I run the model by:

macaw_model.eval()
with torch.no_grad():
    generate_ids = macaw_model(data_item)
print("generate_ids: ", generate_ids)
input_texts = TOKENIZER.batch_decode(data_item["input_ids"], skip_special_tokens=True, clean_up_tokenization_spaces=False)
generated_texts = TOKENIZER.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print("input_texts: ", input_texts)
print("generated_texts: ", generated_texts)

Then I tested the above AVSD example. What I get is:

input_texts: ['Below is an instruction that describes a task, with or without input. Write a response that appropriately completes the request.\n\n### Instruction:\nIs the woman already in the room?\n\n### Response:\n\n']
generated_texts: ['\n\n']

So you can see, the output is nonsense. I tried some other examples, as well as pure text input, but the results are not satisfying. May I ask what may be wrong?

Questions about the files - which files to download

Thanks for the cool project! I have two questions:

  1. Which files exactly should we download? The COCO, VQA, etc. datasets contain many files, and I believe only some of them are needed. For example, I downloaded the following:

Stage 1:

1. Download the COCO image dataset (2014 Train images [83K/13GB]) from: https://cocodataset.org/#download, unzip to current folder (train2014/).

2. Download the Macaw dataset: https://github.com/lyuchenyang/Macaw-LLM/blob/main/data/generated_examples_coco.json

3. Download the Macaw dataset: https://github.com/lyuchenyang/Macaw-LLM/blob/main/data/generated_examples_avsd.json

4. Download the Charades video dataset (Data (scaled to 480p, 13 GB)) from: https://prior.allenai.org/projects/charades, unzip to current folder (Charades_v1_480/).

5. In the current folder, create a folder named "avsd/". In "./avsd/", create "./avsd/videos/", "./avsd/audios/", and "./avsd/images/". Move all the videos from "Charades_v1_480/" to "./avsd/videos/".

6. In the current folder, create a folder named "coco/". In "./coco/", create "./coco/images/". Move all the images from "train2014/" to "./coco/images/".

Stage 2:

1. From https://visualqa.org/download.html download "Training annotations 2017 v2.0*", "Validation annotations 2017 v2.0*", "Training questions 2017 v2.0*", "Validation questions 2017 v2.0*". Put them in "./vqa/" and unzip.

2. From https://video-dialog.com/ download AVSD Dataset (4 files), put them into "./avsd/".

But I'm not sure whether that is all we need.

  2. In combine_visual_and_audio_names() of the supervised preprocessing Python script, there is:

def add_image_names(dir=None):
    all_examples = json_load(dir)['annotations']

    for ind, e in enumerate(tqdm(all_examples)):
        _image_dir = e['image_path']
        if len(_image_dir.split('_')[-1].split('.')[0]) < 12:
            i_str = _image_dir.split('_')[-1].split('.')[0]
            n_str = '0' * (12 - len(i_str)) + i_str
            _image_dir = _image_dir.replace(i_str, n_str)
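For reference, the padding above normalizes short COCO IDs to the 12-digit filename form; a quick sketch with a hypothetical un-padded name:

_image_dir = 'COCO_train2014_492606.jpg'         # hypothetical un-padded name
i_str = _image_dir.split('_')[-1].split('.')[0]  # '492606'
n_str = '0' * (12 - len(i_str)) + i_str          # '000000492606'
print(_image_dir.replace(i_str, n_str))          # COCO_train2014_000000492606.jpg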

However, I can't find any "image_path" field in any of the above json files.

Looking forward to your answer. Thank you.

Paths for pretrained models

Hi, can you please provide huggingface paths for the following?

clip_config = CLIPConfig.from_pretrained('trained_models/clip_model')
whisper_config = WhisperConfig.from_pretrained('trained_models/whisper_model')

I tried with openai/clip-vit-base-patch16 and openai/whisper-base but there seems to be a mismatch in shapes upon loading the model.
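For what it's worth, the loading code in the "Performance of the model" issue above resizes the LLaMA vocabulary to the Macaw tokenizer before calling from_pretrained, which is a common source of shape mismatches; a minimal sketch of that setup with the hub IDs tried here (the tokenizer path is an assumption borrowed from that issue):

from transformers import AutoConfig, CLIPConfig, WhisperConfig, LlamaTokenizer

clip_config = CLIPConfig.from_pretrained("openai/clip-vit-base-patch16")
whisper_config = WhisperConfig.from_pretrained("openai/whisper-base")
llm_config = AutoConfig.from_pretrained("decapoda-research/llama-7b-hf")

# The Macaw checkpoint expects the enlarged vocabulary of its own tokenizer,
# so vocab_size must match len(tokenizer) before the model is loaded.
tokenizer = LlamaTokenizer.from_pretrained("pretrained_models/macaw/")
llm_config.vocab_size = len(tokenizer)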

Thanks

Requirement Versions

Multiple requirements have no version pins, which leads to problems during installation. (A small sketch for recording working versions follows the list below.)

protobuf
scikit-learn
moviepy
ffmpeg-python
tqdm
pandas
opencv-python
clip
openai-whisper
appdirs
loralib
bitsandbytes
black
black[jupyter]
fire
gradio
peft
deepspeed
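As a stopgap, here is a minimal sketch (run in an environment where the install happens to work) that prints the installed version of each listed distribution so they can be pinned; black[jupyter] is an extra of black and is covered by the black entry:

from importlib.metadata import version, PackageNotFoundError

packages = [
    "protobuf", "scikit-learn", "moviepy", "ffmpeg-python", "tqdm", "pandas",
    "opencv-python", "clip", "openai-whisper", "appdirs", "loralib",
    "bitsandbytes", "black", "fire", "gradio", "peft", "deepspeed",
]

for name in packages:
    try:
        print(f"{name}=={version(name)}")       # e.g. for pinning in requirements.txt
    except PackageNotFoundError:
        print(f"# {name}: not installed")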

Different LLM backbones?

Hi, the README mentions several different LLM backbones, but the paper seems to reference only LLaMA and a brief code search didn't turn up any mentions of Vicuna or Bloom. Did you train this with other LLMs beyond LLaMA and if so, where can we find the trained weights for these?

Thank you!

Question about setting pad token

Hi, may I know how to set the pad token?
In a previous version of the code, it was set to [32006]. I checked the LLaMA tokenizer files, and 32006 is not used yet. Can I use any ID that is not already used?
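For reference, a minimal sketch of one common way to register a dedicated pad token (the '<pad>' string and the checkpoint name are assumptions, not necessarily what this repo does):

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
tokenizer.add_special_tokens({"pad_token": "<pad>"})  # registers a new id beyond the base 32000-token vocab
print(tokenizer.pad_token, tokenizer.pad_token_id)

# The model's embedding table must then grow to cover the new id,
# otherwise ids such as 32006 fall outside the original vocabulary:
# model.resize_token_embeddings(len(tokenizer))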

TypeError: string indices must be integers, not 'str'

preprocess_data_unsupervised.py", line 105, in preprocess_alpaca_to_tensor_dataset
texts = PROMPT_DICT['prompt_input'].format(e['instruction'], e['input']) if e['input'] != "" else PROMPT_DICT['prompt_no_input'].format(e['instruction'])
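For context, this error usually means e is a string rather than a dict, which happens when the loop iterates over a dict (yielding string keys) instead of a list of example dicts; a minimal sketch of the failure mode with made-up data:

PROMPT = "### Instruction:\n{}\n\n### Input:\n{}\n"

data_as_list = [{"instruction": "Describe the image.", "input": ""}]
data_as_dict = {"example_0": {"instruction": "Describe the image.", "input": ""}}

for e in data_as_list:
    PROMPT.format(e["instruction"], e["input"])   # fine: e is a dict

for e in data_as_dict:
    PROMPT.format(e["instruction"], e["input"])   # TypeError: e is the string key "example_0"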

Call for paper

Hi, I appreciate your great work! I wonder whether there is any paper related to this project.

Some weights of MM_LLMs were not initialized from the model checkpoint at ./mm_llms_trainer/ and are newly initialized:

Thank you very much for your outstanding work. I encountered the following problem when loading model weights. When I used torch.load to load pytorch_model.bin, I found that this part of the weights was indeed missing.
Some weights of MM_LLMs were not initialized from the model checkpoint at ./mm_llms_trainer/ and are newly initialized: ['video_long_self_attention.in_proj_bias', 'video_long_self_attention.bias_v', 'video_long_self_attention.in_proj_weight', 'video_long_self_attention.out_proj.bias', 'video_long_self_attention.bias_k', 'video_long_self_attention.out_proj.weight']
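A minimal diagnostic sketch (not the repo's code) to list which of those keys the checkpoint actually contains:

import torch

# Load only the state dict and check for the prefix named in the warning.
state_dict = torch.load("./mm_llms_trainer/pytorch_model.bin", map_location="cpu")
keys = [k for k in state_dict if k.startswith("video_long_self_attention")]
print(keys or "no video_long_self_attention.* keys in this checkpoint")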

Resource problem?

With three big models like CLIP, LLaMA, and Whisper, how much VRAM do we need at minimum to host a demo? Is it possible to host them on a single 4090 GPU?

What is the pad ID for tokenizer?

In the trainer file, I saw:

special_tokens = {
    '<image>': 32000,
    '</image>': 32001,
    '<audio>': 32002,
    '</audio>': 32003,
    '<video>': 32004,
    '</video>': 32005,
}

But in the preprocessing files, I didn't see these tokens being set. Instead, I printed the token IDs and found that the PAD token ID seems to be 32000. What is the potential problem? What is the pad_id for the tokenizer?
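A minimal sketch for checking how the tokenizer actually maps these tokens (the tokenizer path is borrowed from the "Performance of the model" issue above and is an assumption here):

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("pretrained_models/macaw/")
for tok in ["<image>", "</image>", "<audio>", "</audio>", "<video>", "</video>"]:
    print(tok, tokenizer.convert_tokens_to_ids(tok))
print("pad:", tokenizer.pad_token, tokenizer.pad_token_id)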

Questions about Model

Dear Author,

I would like to express my sincere gratitude for your open-source contributions. Your neural network model has left a deep impression on me. It seems that your model is driven by text information (CLIP aligns images and text, while Whisper aligns audio and text), and the ultimate goal of the model appears to be more inclined towards multimodal QA and multimodal captioning. However, I have the following questions:

  1. The dimensions of different modalities are vastly different. How do you balance the information from different modalities in your network?
  2. In real-world scenarios, there may be missing modalities. Do you need to input information from all three modalities during the training/inference process of your model, or can you only input certain modalities?

I am looking forward to your work and hope to see your article soon. Thank you.

Best regards,
RitchieAlpha

Using pad_token, but it is not set yet.

Hi, when I run "preprocess_data_supervised.py" with the llama-7b-hf tokenizer, it shows "Using pad_token, but it is not set yet" and "Truncation was not explicitly activated but max_length is provided a specific value, ...".

Is it ok?
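For reference, both warnings go away if a pad token is assigned and truncation is requested explicitly before tokenizing; a minimal sketch (reusing the eos token as padding is an assumption, not necessarily what the repo intends):

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token   # assumption: reuse </s> as padding
batch = tokenizer(["hello world"], padding="max_length", max_length=16, truncation=True)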

Which tokenizer are you using?

Which tokenizer are you using?
tokenizer = AutoTokenizer.from_pretrained('trained_models/llama_tokenizer')
This seems to not work.

which llama tokenizer to use?

In the preprocessing file, we have tokenizer = AutoTokenizer.from_pretrained('trained_models/llama_tokenizer'). This won't fetch the LLaMA tokenizer from HF. Which LLaMA tokenizer should we use, given that there are several versions on HF? Thanks.
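For what it's worth, other issues in this thread load decapoda-research/llama-7b-hf, so one hedged option is to download that tokenizer and save it to the path the script expects:

from transformers import LlamaTokenizer

# Assumption: the same LLaMA checkpoint referenced elsewhere in these issues.
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
tokenizer.save_pretrained("trained_models/llama_tokenizer")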

Always have same response

Hi, I have loaded your pre-trained weights and tried some instructions. However, I found the model responded with the same answer no matter what image I gave.

model = MM_LLMs.from_pretrained(
    "trained_model/mm_llms_trainer",
    config=model_config,
)
model.eval()
# ...

instruction = "How many boats are in the picture?"
template = f"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:"

input_ids = tokenizer.encode(template)  # template is already an f-string, so no further .format() is needed
eos_token_id = tokenizer.eos_token_id
if eos_token_id in input_ids:
    input_ids.remove(eos_token_id)
input_ids = torch.tensor([input_ids], dtype=torch.int).to(device)

# image
# image = preprocess(Image.open("data/image_sample/COCO_train2014_000000492606.jpg"))
# image = preprocess(Image.open("data/image_sample/COCO_train2014_000000344896.jpg"))
image = preprocess(Image.open("data/image_sample/COCO_train2014_000000407061.jpg"))
image = image.unsqueeze(0)

with torch.no_grad():
    bs = 1
    
    inputs = {
        "videos": None,
        "images": image.half(),
        "audios": None,
        "input_ids": input_ids,
        'image_starts': torch.tensor([tokenizer.convert_tokens_to_ids('<image>')] * bs, dtype=torch.int),
        'image_ends': torch.tensor([tokenizer.convert_tokens_to_ids('</image>')] * bs, dtype=torch.int),
        'audio_starts': torch.tensor([tokenizer.convert_tokens_to_ids('<audio>')] * bs, dtype=torch.int),
        'audio_ends': torch.tensor([tokenizer.convert_tokens_to_ids('</audio>')] * bs, dtype=torch.int),
        'video_starts': torch.tensor([tokenizer.convert_tokens_to_ids('<video>')] * bs, dtype=torch.int),
        'video_ends': torch.tensor([tokenizer.convert_tokens_to_ids('</video>')] * bs, dtype=torch.int),
    }

    for k,v in inputs.items():
        if v is not None:
            inputs[k] = v.to(device)
    inputs['inference'] = True
    
    
    text_embeddings, attention_mask, labels, debug = model.prepare_inputs_for_generation(inputs)
    
    print()
    print(text_embeddings.size())
        

    model_output = model.llm(inputs_embeds=text_embeddings, attention_mask=attention_mask, labels=labels)
    generate_ids = model.llm.generate(inputs_embeds=text_embeddings, max_new_tokens=128, eos_token_id=2, bos_token_id=1, pad_token_id=32006)

The output I get is:
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
How many boats are in the picture?

### Response:
========================================
There are 5000 in the picture.
========================================

No matter what image I give to the model, it always replies "There are 5000 in the picture." to the same prompt. It seems the model just ignores the multi-modal inputs and replies based on text alone.

Did I do anything wrong? Thank you.

Missing License File

Hello,
Thanks for sharing this work.

But the repo seems to be missing a LICENSE file, which makes it difficult for people to decide whether they can use this project in their work.

Has a decision regarding licensing been made?

Thanks!

please update the demo code?

Hi, dear authors,
Thanks for sharing this great work. I noticed that you have uploaded the training and evaluation code, but there is no demo code (e.g., for VQA). It would be great if you could release the demo code. Thank you.

How to get the whisper, clip, and llama model used by macaw?

I used the following code to get the pretrained models:

from transformers import CLIPModel, LlamaModel
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
from transformers import WhisperForConditionalGeneration
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
llama7b_model = LlamaModel.from_pretrained("decapoda-research/llama-7b-hf")
clip_model.save_pretrained('trained_models/clip_model/')
whisper_model.save_pretrained('trained_models/whisper_model/')
llama7b_model.save_pretrained('trained_models/llama7b_model/')

Is this correct?

missing file "data/all_visual_names.json"

Hi, thank you for making such great work open source.
However, I have encountered some issues:

  1. When I run inference.sh, I get a missing-file error for 'data/all_visual_names.json'. How can I get this file?
  2. Are there trained models we can use for inference directly?
