
cross-modal-bert's Introduction

Cross-Modal-BERT

Implementation of the paper: CM-BERT: Cross-Modal BERT for Text-Audio Sentiment Analysis (ACM MM 2020)

In this paper, we propose Cross-Modal BERT (CM-BERT), which introduces audio-modality information to help the text modality fine-tune the pre-trained BERT model. As the core unit of CM-BERT, masked multimodal attention is designed to dynamically adjust the weight of words through cross-modal interaction.
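To make the idea concrete, below is a minimal PyTorch sketch of masked multimodal attention as described above. It is an illustration under simplifying assumptions (projection sizes, fusion weights, and mask handling are guesses), not the authors' implementation:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MaskedMultimodalAttention(nn.Module):
        """Illustrative sketch: fuse text-text and text-audio attention scores,
        mask padded positions, and re-weight the text representations."""
        def __init__(self, text_dim=768, audio_dim=5, attn_dim=768):
            super().__init__()
            self.text_proj = nn.Linear(text_dim, attn_dim)    # queries and text keys
            self.audio_proj = nn.Linear(audio_dim, attn_dim)  # audio keys
            self.w_text = nn.Parameter(torch.tensor(1.0))     # fusion weights (assumed learnable)
            self.w_audio = nn.Parameter(torch.tensor(1.0))

        def forward(self, text, audio, padding_mask):
            # text:  (batch, seq_len, text_dim)   BERT token representations
            # audio: (batch, seq_len, audio_dim)  word-aligned acoustic features
            # padding_mask: (batch, 1, seq_len), 1 for real tokens, 0 for padding
            q = self.text_proj(text)
            k_text = self.text_proj(text)
            k_audio = self.audio_proj(audio)
            scale = q.size(-1) ** 0.5
            score_text = torch.matmul(q, k_text.transpose(-1, -2)) / scale    # text-text scores
            score_audio = torch.matmul(q, k_audio.transpose(-1, -2)) / scale  # text-audio scores
            fused = self.w_text * score_text + self.w_audio * score_audio
            fused = fused.masked_fill(padding_mask == 0, -1e9)  # mask padded positions
            attn = F.softmax(fused, dim=-1)
            return torch.matmul(attn, text), attn  # re-weighted text features, attention map

In the repository this logic lives in the BertFinetun module on top of the last BERT layer (see the issues below); the sketch only shows how audio scores can adjust the word-level attention weights.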

The architecture of the proposed method:

[Figure: CM-BERT architecture]

Usage

1. Install all required libraries:

pip install -r requirements.txt

2. Get the pre-trained BERT model and update --bert_model in run_classifier.py accordingly.

You can download the pre-trained BERT model from the official Google BERT release, or use the following commands:

wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
unzip uncased_L-12_H-768_A-12.zip
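
After unzipping, --bert_model should point at the extracted directory. A hedged illustration of what that might look like inside run_classifier.py (the actual argument definition in the repository may differ):

    import argparse

    # Hypothetical sketch: the --bert_model argument pointing at the unzipped directory.
    parser = argparse.ArgumentParser()
    parser.add_argument("--bert_model",
                        default="./uncased_L-12_H-768_A-12",
                        type=str,
                        help="Path to the pre-trained BERT directory or a name such as bert-base-uncased")
    args = parser.parse_args()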

3. Run the experiments:

python run_classifier.py

Results

Experimental results on the CMU-MOSI dataset.

| Model | Modality | Acc7 (%) | Acc2 (%) | F1 (%) | MAE | Corr |
| --- | --- | --- | --- | --- | --- | --- |
| EF-LSTM | T+A+V | 33.7 | 75.3 | 75.2 | 1.023 | 0.608 |
| LMF | T+A+V | 32.8 | 76.4 | 75.7 | 0.912 | 0.668 |
| MFN | T+A+V | 34.1 | 77.4 | 77.3 | 0.965 | 0.632 |
| MARN | T+A+V | 34.7 | 77.1 | 77.0 | 0.968 | 0.625 |
| RMFN | T+A+V | 38.3 | 78.4 | 78.0 | 0.922 | 0.681 |
| MFM | T+A+V | 36.2 | 78.1 | 78.1 | 0.951 | 0.662 |
| MCTN | T+A+V | 35.6 | 79.3 | 79.1 | 0.909 | 0.676 |
| MulT | T+A+V | 40.0 | 83.0 | 82.8 | 0.871 | 0.698 |
| T-BERT | T | 41.5 | 83.2 | 82.3 | 0.784 | 0.774 |
| CM-BERT (ours) | T+A | 44.9 | 84.5 | 84.5 | 0.729 | 0.791 |

Citation

If you use this method in your research, please cite this article:

@inproceedings{10.1145/3394171.3413690,
author = {Yang, Kaicheng and Xu, Hua and Gao, Kai},
title = {CM-BERT: Cross-Modal BERT for Text-Audio Sentiment Analysis},
year = {2020},
isbn = {9781450379885},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3394171.3413690},
doi = {10.1145/3394171.3413690},
abstract = {Multimodal sentiment analysis is an emerging research field that aims to enable machines to recognize, interpret, and express emotion. Through the cross-modal interaction, we can get more comprehensive emotional characteristics of the speaker. Bidirectional Encoder Representations from Transformers (BERT) is an efficient pre-trained language representation model. Fine-tuning it has obtained new state-of-the-art results on eleven natural language processing tasks like question answering and natural language inference. However, most previous works fine-tune BERT only base on text data, how to learn a better representation by introducing the multimodal information is still worth exploring. In this paper, we propose the Cross-Modal BERT (CM-BERT), which relies on the interaction of text and audio modality to fine-tune the pre-trained BERT model. As the core unit of the CM-BERT, masked multimodal attention is designed to dynamically adjust the weight of words by combining the information of text and audio modality. We evaluate our method on the public multimodal sentiment analysis datasets CMU-MOSI and CMU-MOSEI. The experiment results show that it has significantly improved the performance on all the metrics over previous baselines and text-only finetuning of BERT. Besides, we visualize the masked multimodal attention and proves that it can reasonably adjust the weight of words by introducing audio modality information.},
booktitle = {Proceedings of the 28th ACM International Conference on Multimedia},
pages = {521–528},
numpages = {8},
keywords = {pretrained model, multimodal sentiment analysis, attention network},
location = {Seattle, WA, USA},
series = {MM '20}
}


cross-modal-bert's Issues

CMU-MOSEI Experimentation Settings

Hi @Kaicheng-Yang0828,

I have tried to run CM-BERT on the CMU-MOSEI dataset. However, I only managed to get around 74.15 acc2 instead of the 83.6 mentioned in #1 (comment).

Below are the steps that I have taken:

  1. Changed audio pickle file from MOSI_cmu_audio_CLS.pickle to MOSEI_cmu_audio_CLS.pickle

  2. Changed dev, test and train tsv files to hold CMU-MOSEI data

  3. Modified loaded pickle file to MOSEI_cmu_audio_CLS.pickle in run_classifier.py

  4. Modified dimension of audio modality from 5 to 74 in model.py (see the sanity-check sketch below)
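
For reference, a small sanity-check sketch for these changes (the path and the 74-dimensional feature size follow the steps above, not the repository itself):

    import pickle

    # Hedged sketch: verify the MOSEI audio pickle matches the dimension set in model.py (74).
    with open('data/audio/MOSEI_cmu_audio_CLS.pickle', 'rb') as f:
        train_audio, valid_audio, test_audio = pickle.load(f)

    for name, arr in [('train', train_audio), ('valid', valid_audio), ('test', test_audio)]:
        print(name, arr.shape)  # expected (num_samples, seq_len, 74)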

Is there any step that I have done incorrectly or missed? Also, would it be possible to share the code files for CMU-MOSEI?

Your help is very much appreciated.
Thanks!

Dataset question

Hello, I am also working on combining the audio and text modalities, but my own data is a long-tailed dataset and the training results are not very good. I tried SMOTE for data balancing, but performance did not improve. I am not sure what to modify or improve next. Do you have any suggestions?

Trained model

Hi,

Is there a trained model available? I want to use it for a downstream task, so it'd be great to have the already trained model and use it to predict some of my test samples.

Thank you very much.

audio data

Hello, I used wav2vec 2.0 to extract audio features. The dimension of each audio sample is (50, 512), and I changed the Conv1d input to 512, but the accuracy is always zero. Do you have any suggestions?
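
One thing worth checking with such features (a general PyTorch note, not specific to this repository): Conv1d expects input shaped (batch, channels, length), so (50, 512) features need to be transposed before a layer with in_channels=512.

    import torch
    import torch.nn as nn

    # Hedged sketch: wav2vec 2.0 features of shape (batch, 50, 512) must be transposed
    # to (batch, 512, 50) before a Conv1d whose in_channels is 512.
    audio = torch.randn(8, 50, 512)                                 # (batch, time, feature_dim)
    conv = nn.Conv1d(in_channels=512, out_channels=768, kernel_size=1)
    out = conv(audio.transpose(1, 2))                               # (batch, 768, 50)
    print(out.shape)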

CMU-MOSEI data files

Hello,

Do you have the data files and result table for CMU-MOSEI dataset available?

Thanks!

Question about the text labels

As a complete beginner, I would like to ask how the label values are determined. For example, what do labels such as -3.0, -2.0, and 1.4 represent?

About the Visualization of the attention matrices.

Hi,
Thanks for your great project! I have a question about visualizing the attention matrices. Is it correct to use text_attention and fusion_attention in the following plotting code?

    import matplotlib.pyplot as plt

    # text_attention / fusion_attention are assumed to be (batch, seq_len, seq_len);
    # if they are torch tensors, convert with .detach().cpu().numpy() first.
    plt.subplot(2, 1, 1)
    plt.imshow(text_attention[0], cmap='Blues', interpolation='nearest')
    plt.colorbar()
    plt.xlabel('Source')
    plt.ylabel('Target')
    plt.title('Visualization of the Text Attention Matrix')
    plt.subplot(2, 1, 2)
    plt.imshow(fusion_attention[0], cmap='Blues', interpolation='nearest')
    plt.colorbar()
    plt.xlabel('Source')
    plt.ylabel('Target')
    plt.title('Visualization of the Fusion Attention Matrix')
    plt.tight_layout()
    plt.show()

Question regarding sequence lengths and padding masks

Hello, first of all congratulations on your paper; amazing work!
I am currently trying to adapt your masked multimodal attention module for multimodal dialog act classification, and I have a few questions about your architecture.

As written in Section 4.2 of the paper, the sequence lengths of the audio features are not the same as those of the text features. I assume this is because BERT also takes punctuation marks and other special tokens as inputs in the sequence.
Analyzing your implementation of the masked multimodal attention (found in BertFinetun in your code), I see that you use a padding mask; tracing it back, I found that it only covers the text data and has shape (batch_size, 1, max_seq_length).

Now my question is, since the audio features have smaller sequence lengths and were padded with additional zero vectors to match the sequence lengths of the text features, why don't you use a separate padding mask for the audio features?
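
For illustration, a minimal sketch of how a separate audio padding mask could be built from the zero-padded audio frames and combined with the text mask (my own suggestion, not code from the repository):

    import torch

    def audio_padding_mask(audio):
        # audio: (batch, max_seq_len, audio_dim); padded frames are all-zero vectors
        nonzero = audio.abs().sum(dim=-1) > 0        # True for real frames, False for padding
        return nonzero.unsqueeze(1).float()          # (batch, 1, max_seq_len)

    text_mask = torch.ones(2, 1, 50)                 # toy text mask
    audio = torch.cat([torch.randn(2, 30, 5), torch.zeros(2, 20, 5)], dim=1)
    combined = text_mask * audio_padding_mask(audio) # mask positions padded in either modality
    print(combined.shape)                            # torch.Size([2, 1, 50])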

Fine-tuning problem

Many thanks to the authors for open-sourcing the code. I noticed that the code does not unfreeze the final classifier layer (run_classifier.py, line 129). When I unfroze the classifier for comparison, accuracy was slightly lower, though that may be due to unsuitable hyperparameters. Did you keep it frozen because unfreezing the classifier performs worse, or is there another reason?

What does the feature dimension represent?

Hi,
when I run the code
train_audio,valid_audio,test_audio= pickle.load(open('Cross-Modal-BERT-master/data/audio/MOSI_cmu_audio_CLS.pickle','rb'))

I look at the size of the train_audio , valid_audio and test_audio:
train_audio.shape=(1284, 50, 5),valid_audio.shape=(229, 50, 5), test_audio.shape=(686, 50, 5)

1284, 229, and 686 are the numbers of samples, but what do 50 and 5 mean?

Looking forward to your reply.
Thanks.

About the video data

Hello, if I want to obtain the video counterpart of the MOSI_cmu_audio_CLS.pickle data, what should I do?

pip install -r requirements.txt gives an error?

I am using Python 3.10.2. At first, pip kept backtracking through boto3 and botocore versions to satisfy the requirements until I ran out of memory, so I looked up the boto3 and botocore versions from around the time CM-BERT was created. However, the next issue came from numpy:

    WARNING: The candidate selected for download or install is a yanked version: 'torchvision' candidate (version 0.2.1 at https://files.pythonhosted.org/packages/ca/0d/f00b2885711e08bd71242ebe7b96561e6f6d01fdb4b9dcf4d37e2e13c5e1/torchvision-0.2.1-py2.py3-none-any.whl (from https://pypi.org/simple/torchvision/))
    Reason for being yanked: So that users won't accidentally install this when using python 3.11
    Installing collected packages: urllib3, pathlib, idna, chardet, tqdm, six, regex, pillow, numpy, joblib, jmespath, docutils, certifi, torchvision, scipy, requests, python-dateutil, scikit-learn, botocore, s3transfer, boto3, pytorch-pretrained-bert
    DEPRECATION: numpy is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at pypa/pip#8559
    Running setup.py install for numpy ... error
    error: subprocess-exited-with-error

    × Running setup.py install for numpy did not run successfully.
    │ exit code: 1
    ╰─> [278 lines of output]

So I tried pip install -r requirements.txt --use-pep517 based on a suggestion I found online, but it still did not work. Please help so I can try your code. Thank you.
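
A workaround that sometimes helps with older pinned requirements (a hedged suggestion, not verified against this repository): use an environment with an older Python, since the pinned torchvision and numpy versions predate Python 3.10.

    conda create -n cmbert python=3.7
    conda activate cmbert
    pip install -r requirements.txt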
