
cross-modal-bert's Introduction

Cross-Modal-BERT

Implementation of the paper: CM-BERT: Cross-Modal BERT for Text-Audio Sentiment Analysis (ACM MM 2020)

In this paper, we propose Cross-Modal BERT (CM-BERT), which introduces audio-modality information to help the text modality fine-tune the pre-trained BERT model. As the core unit of CM-BERT, masked multimodal attention is designed to dynamically adjust the weight of words through cross-modal interaction.
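To make the idea concrete, below is a minimal PyTorch sketch of masked multimodal attention as described above. It is an illustration under simplifying assumptions (projection sizes, fusion weights, and mask handling are guesses), not the authors' implementation:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MaskedMultimodalAttention(nn.Module):
        """Illustrative sketch: fuse text-text and text-audio attention scores,
        mask padded positions, and re-weight the text representations."""
        def __init__(self, text_dim=768, audio_dim=5, attn_dim=768):
            super().__init__()
            self.text_proj = nn.Linear(text_dim, attn_dim)    # queries and text keys
            self.audio_proj = nn.Linear(audio_dim, attn_dim)  # audio keys
            self.w_text = nn.Parameter(torch.tensor(1.0))     # fusion weights (assumed learnable)
            self.w_audio = nn.Parameter(torch.tensor(1.0))

        def forward(self, text, audio, padding_mask):
            # text:  (batch, seq_len, text_dim)   BERT token representations
            # audio: (batch, seq_len, audio_dim)  word-aligned acoustic features
            # padding_mask: (batch, 1, seq_len), 1 for real tokens, 0 for padding
            q = self.text_proj(text)
            k_text = self.text_proj(text)
            k_audio = self.audio_proj(audio)
            scale = q.size(-1) ** 0.5
            score_text = torch.matmul(q, k_text.transpose(-1, -2)) / scale    # text-text scores
            score_audio = torch.matmul(q, k_audio.transpose(-1, -2)) / scale  # text-audio scores
            fused = self.w_text * score_text + self.w_audio * score_audio
            fused = fused.masked_fill(padding_mask == 0, -1e9)  # mask padded positions
            attn = F.softmax(fused, dim=-1)
            return torch.matmul(attn, text), attn  # re-weighted text features, attention map

In the repository this logic lives in the BertFinetun module on top of the last BERT layer (see the issues below); the sketch only shows how audio scores can adjust the word-level attention weights.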

The architecture of the proposed method:

[Figure: CM-BERT architecture]

Usage

1. Install all required libraries:

pip install -r requirements.txt

2. Get the pre-trained BERT model and update --bert_model in run_classifier.py accordingly.

You can download the pre-trained BERT model from the official Google BERT release, or use the following commands:

wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
unzip uncased_L-12_H-768_A-12.zip
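
After unzipping, --bert_model should point at the extracted directory. A hedged illustration of what that might look like inside run_classifier.py (the actual argument definition in the repository may differ):

    import argparse

    # Hypothetical sketch: the --bert_model argument pointing at the unzipped directory.
    parser = argparse.ArgumentParser()
    parser.add_argument("--bert_model",
                        default="./uncased_L-12_H-768_A-12",
                        type=str,
                        help="Path to the pre-trained BERT directory or a name such as bert-base-uncased")
    args = parser.parse_args()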

3. Run the experiments:

python run_classifier.py

Results

Experimental results on the CMU-MOSI dataset.

| Model | Modality | Acc7 (%) | Acc2 (%) | F1 (%) | MAE | Corr |
| --- | --- | --- | --- | --- | --- | --- |
| EF-LSTM | T+A+V | 33.7 | 75.3 | 75.2 | 1.023 | 0.608 |
| LMF | T+A+V | 32.8 | 76.4 | 75.7 | 0.912 | 0.668 |
| MFN | T+A+V | 34.1 | 77.4 | 77.3 | 0.965 | 0.632 |
| MARN | T+A+V | 34.7 | 77.1 | 77.0 | 0.968 | 0.625 |
| RMFN | T+A+V | 38.3 | 78.4 | 78.0 | 0.922 | 0.681 |
| MFM | T+A+V | 36.2 | 78.1 | 78.1 | 0.951 | 0.662 |
| MCTN | T+A+V | 35.6 | 79.3 | 79.1 | 0.909 | 0.676 |
| MulT | T+A+V | 40.0 | 83.0 | 82.8 | 0.871 | 0.698 |
| T-BERT | T | 41.5 | 83.2 | 82.3 | 0.784 | 0.774 |
| CM-BERT (ours) | T+A | 44.9 | 84.5 | 84.5 | 0.729 | 0.791 |

Citation

If you use this method in your research, please cite this article:

@inproceedings{10.1145/3394171.3413690,
author = {Yang, Kaicheng and Xu, Hua and Gao, Kai},
title = {CM-BERT: Cross-Modal BERT for Text-Audio Sentiment Analysis},
year = {2020},
isbn = {9781450379885},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3394171.3413690},
doi = {10.1145/3394171.3413690},
abstract = {Multimodal sentiment analysis is an emerging research field that aims to enable machines to recognize, interpret, and express emotion. Through the cross-modal interaction, we can get more comprehensive emotional characteristics of the speaker. Bidirectional Encoder Representations from Transformers (BERT) is an efficient pre-trained language representation model. Fine-tuning it has obtained new state-of-the-art results on eleven natural language processing tasks like question answering and natural language inference. However, most previous works fine-tune BERT only base on text data, how to learn a better representation by introducing the multimodal information is still worth exploring. In this paper, we propose the Cross-Modal BERT (CM-BERT), which relies on the interaction of text and audio modality to fine-tune the pre-trained BERT model. As the core unit of the CM-BERT, masked multimodal attention is designed to dynamically adjust the weight of words by combining the information of text and audio modality. We evaluate our method on the public multimodal sentiment analysis datasets CMU-MOSI and CMU-MOSEI. The experiment results show that it has significantly improved the performance on all the metrics over previous baselines and text-only finetuning of BERT. Besides, we visualize the masked multimodal attention and proves that it can reasonably adjust the weight of words by introducing audio modality information.},
booktitle = {Proceedings of the 28th ACM International Conference on Multimedia},
pages = {521–528},
numpages = {8},
keywords = {pretrained model, multimodal sentiment analysis, attention network},
location = {Seattle, WA, USA},
series = {MM '20}
}


cross-modal-bert's Issues

CMU-MOSEI Experimentation Settings

Hi @Kaicheng-Yang0828,

I have tried to run CM-BERT on the CMU-MOSEI dataset. However, I only managed to get around 74.15 acc2 instead of the 83.6 mentioned in #1 (comment).

Below are the steps that I have taken:

  1. Changed audio pickle file from MOSI_cmu_audio_CLS.pickle to MOSEI_cmu_audio_CLS.pickle

  2. Changed dev, test and train tsv files to hold CMU-MOSEI data

  3. Modified loaded pickle file to MOSEI_cmu_audio_CLS.pickle in run_classifier.py

  4. Modified dimension of audio modality from 5 to 74 in model.py (see the sanity-check sketch below)
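
For reference, a small sanity-check sketch for these changes (the path and the 74-dimensional feature size follow the steps above, not the repository itself):

    import pickle

    # Hedged sketch: verify the MOSEI audio pickle matches the dimension set in model.py (74).
    with open('data/audio/MOSEI_cmu_audio_CLS.pickle', 'rb') as f:
        train_audio, valid_audio, test_audio = pickle.load(f)

    for name, arr in [('train', train_audio), ('valid', valid_audio), ('test', test_audio)]:
        print(name, arr.shape)  # expected (num_samples, seq_len, 74)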

Is there any step that I have done incorrectly or missed? Also, would it be possible to share the code files for CMU-MOSEI?

Your help is very much appreciated.
Thanks!

Dataset question

Hello, I am also working on combining the audio and text modalities, but my own data is a long-tailed dataset and the training results are not very good. I tried SMOTE for data balancing, but performance did not improve. I am not sure what to modify or improve next. Do you have any suggestions?

Trained model

Hi,

Is there a trained model available? I want to use it for a downstream task, so it'd be great to have the already trained model and use it to predict some of my test samples.

Thank you very much.

audio data

Hello, I used wav2vec 2.0 to extract audio features. The dimension of each audio sample is (50, 512), and I changed the Conv1d input to 512, but the accuracy is always zero. Do you have any suggestions?
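
One thing worth checking with such features (a general PyTorch note, not specific to this repository): Conv1d expects input shaped (batch, channels, length), so (50, 512) features need to be transposed before a layer with in_channels=512.

    import torch
    import torch.nn as nn

    # Hedged sketch: wav2vec 2.0 features of shape (batch, 50, 512) must be transposed
    # to (batch, 512, 50) before a Conv1d whose in_channels is 512.
    audio = torch.randn(8, 50, 512)                                 # (batch, time, feature_dim)
    conv = nn.Conv1d(in_channels=512, out_channels=768, kernel_size=1)
    out = conv(audio.transpose(1, 2))                               # (batch, 768, 50)
    print(out.shape)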

CMU-MOSEI data files

Hello,

Do you have the data files and result table for CMU-MOSEI dataset available?

Thanks!

Question about the text labels

As a complete beginner, I would like to ask how the label values are determined. For example, what do labels such as -3.0, -2.0, and 1.4 represent?

About the Visualization of the attention matrices.

Hi,
Thanks for your great project! I have a question about visualizing the attention matrices. Is it correct to use text_attention and fusion_attention in the following plotting code?

    import matplotlib.pyplot as plt

    # text_attention / fusion_attention are assumed to be (batch, seq_len, seq_len);
    # if they are torch tensors, convert with .detach().cpu().numpy() first.
    plt.subplot(2, 1, 1)
    plt.imshow(text_attention[0], cmap='Blues', interpolation='nearest')
    plt.colorbar()
    plt.xlabel('Source')
    plt.ylabel('Target')
    plt.title('Visualization of the Text Attention Matrix')
    plt.subplot(2, 1, 2)
    plt.imshow(fusion_attention[0], cmap='Blues', interpolation='nearest')
    plt.colorbar()
    plt.xlabel('Source')
    plt.ylabel('Target')
    plt.title('Visualization of the Fusion Attention Matrix')
    plt.tight_layout()
    plt.show()

Question regarding sequence lengths and padding masks

Hello, first of all congratulations on your paper; amazing work!
I am currently trying to adapt your masked multimodal attention module for multimodal dialog act classification, and I have a few questions about your architecture.

As written in Section 4.2 of the paper, the sequence lengths of the audio features are not the same as those of the text features. I assume this is because BERT also takes punctuation marks and other special tokens as inputs in the sequence.
Analyzing your implementation of the masked multimodal attention (found in BertFinetun in your code), I see that you use a padding mask; tracing it back, I found that it only covers the text data and has shape (batch_size, 1, max_seq_length).

Now my question is, since the audio features have smaller sequence lengths and were padded with additional zero vectors to match the sequence lengths of the text features, why don't you use a separate padding mask for the audio features?
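
For illustration, a minimal sketch of how a separate audio padding mask could be built from the zero-padded audio frames and combined with the text mask (my own suggestion, not code from the repository):

    import torch

    def audio_padding_mask(audio):
        # audio: (batch, max_seq_len, audio_dim); padded frames are all-zero vectors
        nonzero = audio.abs().sum(dim=-1) > 0        # True for real frames, False for padding
        return nonzero.unsqueeze(1).float()          # (batch, 1, max_seq_len)

    text_mask = torch.ones(2, 1, 50)                 # toy text mask
    audio = torch.cat([torch.randn(2, 30, 5), torch.zeros(2, 20, 5)], dim=1)
    combined = text_mask * audio_padding_mask(audio) # mask positions padded in either modality
    print(combined.shape)                            # torch.Size([2, 1, 50])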

Fine-tuning problem

Many thanks to the authors for open-sourcing the code. I noticed that the code does not unfreeze the final classifier layer (run_classifier.py, line 129). When I unfroze the classifier for comparison, accuracy was slightly lower, though that may be due to unsuitable hyperparameters. Did you keep it frozen because unfreezing the classifier performs worse, or is there another reason?

What does the feature dimension represent?

Hi,
when I run the code
train_audio,valid_audio,test_audio= pickle.load(open('Cross-Modal-BERT-master/data/audio/MOSI_cmu_audio_CLS.pickle','rb'))

I look at the size of the train_audio , valid_audio and test_audio:
train_audio.shape=(1284, 50, 5),valid_audio.shape=(229, 50, 5), test_audio.shape=(686, 50, 5)

1284, 229, and 686 are the numbers of samples, but what do 50 and 5 mean?

Looking forward to your reply.
Thanks.

About the video data

Hello, if I want to obtain the video counterpart of the MOSI_cmu_audio_CLS.pickle data, what should I do?

pip install -r requirements.txt gives an error?

I am using Python 3.10.2. At first, pip kept backtracking through boto3 and botocore versions to satisfy the requirements until I ran out of memory, so I looked up the boto3 and botocore versions from around the time CM-BERT was created. However, the next issue came from numpy:

    WARNING: The candidate selected for download or install is a yanked version: 'torchvision' candidate (version 0.2.1 at https://files.pythonhosted.org/packages/ca/0d/f00b2885711e08bd71242ebe7b96561e6f6d01fdb4b9dcf4d37e2e13c5e1/torchvision-0.2.1-py2.py3-none-any.whl (from https://pypi.org/simple/torchvision/))
    Reason for being yanked: So that users won't accidentally install this when using python 3.11
    Installing collected packages: urllib3, pathlib, idna, chardet, tqdm, six, regex, pillow, numpy, joblib, jmespath, docutils, certifi, torchvision, scipy, requests, python-dateutil, scikit-learn, botocore, s3transfer, boto3, pytorch-pretrained-bert
    DEPRECATION: numpy is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at pypa/pip#8559
    Running setup.py install for numpy ... error
    error: subprocess-exited-with-error

    × Running setup.py install for numpy did not run successfully.
    │ exit code: 1
    ╰─> [278 lines of output]

So I tried pip install -r requirements.txt --use-pep517 based on a suggestion I found online, but it still did not work. Please help so I can try your code. Thank you.
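
A workaround that sometimes helps with older pinned requirements (a hedged suggestion, not verified against this repository): use an environment with an older Python, since the pinned torchvision and numpy versions predate Python 3.10.

    conda create -n cmbert python=3.7
    conda activate cmbert
    pip install -r requirements.txt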
