
AffectGPT: Audio-Video-Text Aligned, Human-centric Model for Explainable Multimodal Emotion Reasoning

Task

Explainable multimodal emotion reasoning aims to predict emotion labels and the reasoning process behind these predictions using multimodal clues.


Details can be found in our paper: Explainable Multimodal Emotion Reasoning

@article{lian2023explainable,
  title={Explainable Multimodal Emotion Reasoning},
  author={Lian, Zheng and Sun, Licai and Xu, Mingyu and Sun, Haiyang and Xu, Ke and Wen, Zhuofan and Chen, Shun and Liu, Bin and Tao, Jianhua},
  journal={arXiv preprint arXiv:2306.15401},
  year={2023}
}

Dataset

To construct the initial dataset, we select samples from the large-scale video emotion dataset MER 2023. Due to the high annotation cost, we randomly select and annotate 100 non-neutral samples to form the initial dataset.

  1. Download MER 2023 Dataset

To download the dataset, please fill out an EULA and send it to [email protected]. It requires participants to use this dataset only for academic research and not to edit samples or upload them to the Internet.

@article{lian2023mer,
  title={MER 2023: Multi-label Learning, Modality Robustness, and Semi-Supervised Learning},
  author={Lian, Zheng and Sun, Haiyang and Sun, Licai and Zhao, Jinming and Liu, Ye and Liu, Bin and Yi, Jiangyan and Wang, Meng and Cambria, Erik and Zhao, Guoying and others},
  journal={arXiv preprint arXiv:2304.08981},
  year={2023}
}
  2. Annotations

We provide emotion labels and their reasoning annotations in ./instruction-dataset/EmoReason.

gt-chi.csv and gt-eng.csv contain the Chinese and English versions of the annotations, respectively.
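
As a quick sanity check, the annotation files can be loaded with pandas. This is a minimal sketch (not part of the repository); the paths follow the layout described above, and the exact column names may differ, so inspect them before relying on this snippet.

import pandas as pd

# Minimal sketch: load both annotation files and inspect them.
eng = pd.read_csv("./instruction-dataset/EmoReason/gt-eng.csv")
chi = pd.read_csv("./instruction-dataset/EmoReason/gt-chi.csv")

print(eng.shape, chi.shape)        # roughly 100 annotated non-neutral samples
print(eng.columns.tolist())        # inspect the available columns
print(eng.head(3))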

Baselines

We evaluate PandaGPT, Valley, VideoChat, Video-ChatGPT, and Video-LLaMA. Prediction results can be found in ./baseline_results.

# files
gt.csv: ground truth annotations
pandagpt.csv: prediction results of PandaGPT
valley.csv: prediction results of Valley
videochat(emb).csv: prediction results of VideoChat-Embed
videochat(text).csv: prediction results of VideoChat-Text
videochatgpt.csv: prediction results of Video-ChatGPT
videollama.csv: prediction results of Video-LLaMA

# columns:
names: video name
chi_reasons: reasoning process (in Chinese). We do not provide an English version; you can generate one with the translation capabilities of ChatGPT or other toolkits.
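
For reference, a baseline's predictions can be aligned with the ground truth by video name. The sketch below is based on the columns listed above and assumes all files in ./baseline_results share the names column; it is an illustration, not part of the repository.

import pandas as pd

# Minimal sketch: join one baseline's predictions with the ground-truth
# annotations on the video name.
gt = pd.read_csv("./baseline_results/gt.csv")
pred = pd.read_csv("./baseline_results/videollama.csv")

merged = gt.merge(pred, on="names", suffixes=("_gt", "_pred"))
print(merged.head())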

Evaluation

We provide ChatGPT-based evaluation approaches. See evaluation.py for more details.

Please ensure you have an OpenAI API key and can access ChatGPT.

# evaluate for full datasets
python eval_release.py --data_root=[your data root] --openai_key=[your openai key]

# run only a quick pass for debugging
python eval_release.py --data_root=[your data root] --openai_key=[your openai key] --debug
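
The prompts and scoring logic live in eval_release.py. Purely as an illustration of what a ChatGPT-based check involves, the sketch below issues one scoring request with the legacy openai package (openai<1.0); the prompt, model name, and scoring scale here are placeholders rather than the ones used in our evaluation.

import openai

openai.api_key = "YOUR_OPENAI_KEY"   # passed via --openai_key in the real script

def score_reasoning(ground_truth: str, prediction: str) -> str:
    # Hypothetical prompt: ask ChatGPT how well the predicted reasoning matches
    # the annotated one. eval_release.py uses its own prompt and parsing.
    prompt = (
        "Ground-truth emotion reasoning:\n" + ground_truth + "\n\n"
        "Predicted emotion reasoning:\n" + prediction + "\n\n"
        "Rate the overlap between the two descriptions from 0 to 10 and explain briefly."
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]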

AffectGPT

We aim to train an audio-video-text aligned model for explainable multimodal emotion reasoning. Specifically, we modify the code of Video-LLaMA to support audio-video-text aligned training. Meanwhile, we unify the input format across different datasets, remove some unnecessary modules, add model.eval() during inference, etc. More details can be found in our implementation.
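
One of the changes mentioned above, calling model.eval() during inference, conceptually looks like the sketch below; the model object and generate call are placeholders, not our actual interface.

import torch

def run_inference(model, samples):
    # Disable dropout/batch-norm updates and gradient tracking so that
    # inference is deterministic and memory-efficient.
    model.eval()
    outputs = []
    with torch.no_grad():
        for sample in samples:
            outputs.append(model.generate(sample))  # placeholder generate call
    return outputs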

  1. Environment Preparation

Different CUDA versions may require different installation approaches:

## machine1: 
# conda env create -f environment.yml

## machine2: cuda=11.7
# conda create --name=videollama python=3.9
# conda install pytorch==1.13.0 torchvision==0.14.0 torchaudio==0.13.0 pytorch-cuda=11.7 -c pytorch -c nvidia
# pip install -r requirement.txt # Note: remove packages that are already installed from requirement.txt

## machine3: cuda=10.2
# conda create --name=videollama python=3.9
# conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=10.2 -c pytorch
# conda install transformers
# pip install -r requirement.txt # Note: remove packages that are already installed from requirement.txt

## other errors: if you hit libGL.so.1 errors in opencv, run: yum install mesa-libGL
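
Whichever machine setup you use, a quick check that PyTorch sees the GPU can save time later (a minimal sketch, not part of the repository):

import torch, torchvision, torchaudio

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA device:", torch.cuda.get_device_name(0))
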
  2. Dataset Preparation
# instruction-dataset/LLaVA
1. llava_instruct_150k.json: https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_instruct_150k.json
2. COCO train2017: https://cocodataset.org/#download

# instruction-dataset/EmoReason
1. To download the dataset, please fill out an [EULA](https://drive.google.com/file/d/1LOW2e6ZuyUjurVF0SNPisqSh4VzEl5lN) and send it to [email protected]. It requires participants to use this dataset only for academic research and not to edit or upload samples to the Internet.
2. gt-eng.csv and gt-chi.csv

# instruction-dataset/MiniGPT-4
https://drive.google.com/file/d/1nJXhoEcy3KTExr17I7BXqY5Y9Lx_-n-9/view

# instruction-dataset/VideoChat
videochat_instruct_11k.json: https://drive.google.com/file/d/1C-7xmf42QUEi4ApXTcxBHr5nLvTWXyUi
webvid: https://github.com/m-bain/webvid. We also modify download.py to accelerate downloading.
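
After downloading, you can verify that the expected files are in place. The paths below follow the listing above (the MiniGPT-4 file is omitted because its final filename depends on how you save it); this is a minimal sketch, not part of the repository.

import os

# Minimal sketch: check that key instruction-dataset files exist.
expected = [
    "instruction-dataset/LLaVA/llava_instruct_150k.json",
    "instruction-dataset/EmoReason/gt-eng.csv",
    "instruction-dataset/EmoReason/gt-chi.csv",
    "instruction-dataset/VideoChat/videochat_instruct_11k.json",
]
for path in expected:
    print(("ok     " if os.path.exists(path) else "MISSING"), path)
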
  3. Pre-trained Model Preparation
# models/finetune_vicuna7b_audiobranch.pth
https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune_vicuna7b_audiobranch.pth

# models/finetune_vicuna7b_videobranch.pth
https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-vicuna7b-v2.pth

# models/blip2_pretrained_flant5xxl.pth
https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_flant5xxl.pth

# models/eva_vit_g.pth
https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/eva_vit_g.pth

# models/imagebind_huge.pth
https://dl.fbaipublicfiles.com/imagebind/imagebind_huge.pth

# models/pretrained_minigpt4.pth
https://drive.google.com/file/d/1a4zLvaiDBr-36pasffmgpvH5P7CKmpze

# models/vicuna-7b-v0
llama-7b-hf: https://huggingface.co/decapoda-research/llama-7b-hf/tree/main
vicuna-7b-delta-v0: https://huggingface.co/lmsys/vicuna-7b-delta-v0/tree/main
python3 apply_delta.py --base {path for llama-7b-hf} --delta {path for vicuna-7b-delta-v0} --target vicuna-7b-v0

# ./bert-base-uncased
https://huggingface.co/bert-base-uncased
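
Once the checkpoints are downloaded, a short script can confirm that every file listed above is present and non-trivially sized (a minimal sketch, not part of the repository):

import os

checkpoints = [
    "models/finetune_vicuna7b_audiobranch.pth",
    "models/finetune_vicuna7b_videobranch.pth",
    "models/blip2_pretrained_flant5xxl.pth",
    "models/eva_vit_g.pth",
    "models/imagebind_huge.pth",
    "models/pretrained_minigpt4.pth",
]
for ckpt in checkpoints:
    if os.path.exists(ckpt):
        print(f"ok      {ckpt} ({os.path.getsize(ckpt) / 2**20:.0f} MB)")
    else:
        print(f"MISSING {ckpt}")

for folder in ["models/vicuna-7b-v0", "bert-base-uncased"]:
    print(("ok     " if os.path.isdir(folder) else "MISSING"), folder)
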
  4. Training and Inference
conda activate videollama

# training with one gpu
python train.py --cfg-path=./train_configs/multimodal_llama_stage3_finetune.yaml

# training with 8 gpus
torchrun --nproc_per_node=8 train.py --cfg-path=./train_configs/multimodal_llama_stage3_finetune.yaml

# inference for one epoch (emotion reasoning results for a single checkpoint)
python inference.py --gpu-id=0 --cfg-path="eval_configs/video_llama_eval_withaudio_stage3.yaml" \
					--ckpt_root="video_llama/output/multimodal_llama_stage3_finetune/20230722155059_100_1000_1000" \
					--test_epochs="0-0"

# inference for multiple epochs (emotion reasoning results for each checkpoint in the range)
python inference.py --gpu-id=0 --cfg-path="eval_configs/video_llama_eval_withaudio_stage3.yaml" \
					--ckpt_root="video_llama/output/multimodal_llama_stage3_finetune/20230722155059_100_1000_1000" \
					--test_epochs="0-9"
