ddlBoJack / emotion2vec
[ACL 2024] Official PyTorch code for extracting features and training downstream models with emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation
Thank you for sharing your nice work!
In the script emotion2vec_extract_features.sh, I noticed that features are extracted from the last layer. Have you tried extracting features from other layers as well? I'm just curious whether this choice is based on empirical insight.
Thank you very much for open-sourcing such a good emotion pre-trained model.
I saw a description on ModelScope along these lines: first, emotion2vec is fine-tuned on academic speech emotion recognition datasets, and then 150,000 hours of Chinese and English data are annotated, keeping samples whose text emotion matches the speech emotion with high confidence.
Could you also open-source the text emotion model and the speech emotion model trained on the academic datasets? I would like to train a 3-class model based on this approach.
Thanks!
Hi,
Thank you for the great work you've done on this model! Is there any way to run batched inference with funasr? I've been trying to batch with padding and set padding_mask to mask out the unused frames, but I'm not getting the same results as when I run inference sequentially.
Here's a sample of the code I'm using. I've tried a number of different argument configurations: there are several mask parameters, and it seems like mask refers to the MLM pretraining scheme, while padding_mask refers to the attention mask? I'm not sure, though, because there's no documentation. Any guidance would be appreciated.
import torch
from funasr.utils.load_utils import load_audio_text_image_video
from funasr import AutoModel
from torch.nn.utils.rnn import pad_sequence

model = AutoModel(model="iic/emotion2vec_plus_large").model
model.eval()
model.to("cuda")

padding_value = -1
# audios is a list of audio tensors resampled to 16 kHz
x = load_audio_text_image_video(audios)
x = [torch.nn.functional.layer_norm(x_, x_.shape).squeeze() for x_ in x]
masked_x = pad_sequence(x, batch_first=True, padding_value=padding_value).to("cuda")
mask = masked_x == padding_value  # True at padded positions

out = model.extract_features(masked_x, mask=False, padding_mask=mask, remove_extra_tokens=True)
out_mask = out["padding_mask"]
feats = out["x"]
feats[out_mask] = 0  # zero out padded frames before pooling
print(feats.sum(dim=1) / (~out_mask).sum(dim=1).unsqueeze(-1))  # masked mean over time
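For reference, this is the kind of sequential loop I compare against: a minimal sketch that reuses x and model from the snippet above and assumes the same (undocumented) extract_features signature. The per-utterance means from this loop are what the batched run should reproduce.

seq_feats = []
with torch.no_grad():
    for x_ in x:  # the layer-normed tensors from above
        out_one = model.extract_features(
            x_.unsqueeze(0).to("cuda"),  # batch of one, so no padding needed
            mask=False,
            padding_mask=None,
            remove_extra_tokens=True,
        )
        seq_feats.append(out_one["x"].squeeze(0).mean(dim=0))  # mean over time
print(torch.stack(seq_feats))  # compare element-wise with the batched means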
Hello, the group chat QR code has expired.
I want to know if emotion2vec can run on an ARM server.
How can I fine-tune the emotion2vec+large model on another dataset without using the process you used for IEMOCAP?
I tried to use four features and your bash script train.sh, but I got this error:
File "C:\Users\doki_engbu\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\spawn.py", line 122, in spawn_main
exitcode = _main(fd, parent_sentinel)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\doki_engbu\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\spawn.py", line 132, in _main
self = reduction.pickle.load(from_parent)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_pickle.UnpicklingError: pickle data was truncated
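For what it's worth, this spawn_main / "pickle data was truncated" trace is the classic symptom of Windows' spawn start method re-importing the training script in each DataLoader worker. A minimal sketch of the usual fix, with main() as a hypothetical stand-in for whatever train.sh ultimately runs; alternatively, set num_workers=0:

import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    # Dummy data; in the real script this would be the feature dataset.
    dataset = TensorDataset(torch.randn(8, 768), torch.zeros(8, dtype=torch.long))
    loader = DataLoader(dataset, batch_size=4, num_workers=2)
    for feats, labels in loader:
        print(feats.shape, labels.shape)

if __name__ == "__main__":  # required on Windows, where workers are spawned
    main()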
I'd like to join the group; could the owner update the QR code? Thanks!
Hi @ddlBoJack,
Please share some information about the checkpoint file linked in the README. Is it the best-performing model so far?
Also, in the train.py file given for IEMOCAP, are the features frame-level or utterance-level?
Thanks,
I tested the given emotion2vec_large on RAVDESS Speech and RAVDESS Song. The weighted accuracy on RAVDESS Speech is 87%, similar to the result in the paper, but the result on RAVDESS Song is 64%, which is very different from the paper. Is there any difference in how the two datasets are tested? I don't know why.
As per the title.
Could you please share the script to train the network for the upstream task? I want to fine-tune the model.
Thanks!
Hello, I'm new to this field. I'd like to ask why I got a poor result when I used the utterance-level features you provided for emotion recognition; the WA was only around 60%. I also use only a linear layer as the base model.
I am looking forward to your answer, thank you.
Thank you for your contribution; your work is truly amazing. However, I would like to train emotion2vec for a pretraining task. Could you provide the source code or offer any suggestions?
Sorry for missing the last update.
Thank you for providing the code!
I am a novice in the field of SER. I have trained the downstream model using the provided train.npy, train.lengths, and train.emo files, but I'm unsure how to use the resulting model for category inference on the features in train.npy.
I noticed that the shape of the provided train.npy is (1253877, 768). In my understanding, it represents 1253877 samples with 768-dimensional features each. I would like to classify these 1253877 samples using the trained model. How can I achieve this?
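A minimal sketch of how those files fit together, under the assumption that train.npy holds frame-level features, train.lengths lists the number of frames per utterance (one integer per line), and a plain nn.Linear stands in for the trained downstream head:

import numpy as np
import torch
import torch.nn as nn

feats = np.load("train.npy")                       # (total_frames, 768)
lengths = np.loadtxt("train.lengths", dtype=int)   # frames per utterance

# Mean-pool each utterance's frames into a single 768-dim vector.
offsets = np.cumsum(lengths)[:-1]
utt_feats = np.stack([seg.mean(axis=0) for seg in np.split(feats, offsets)])
utt_feats = torch.from_numpy(utt_feats).float()    # (num_utterances, 768)

clf = nn.Linear(768, 4)  # hypothetical stand-in; load the trained head's weights instead
with torch.no_grad():
    preds = clf(utt_feats).argmax(dim=-1)          # one class index per utterance
print(preds[:10])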
When trying to run prediction with the fine-tuned models through ModelScope:

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_pipeline = pipeline(
    task=Tasks.emotion_recognition,
    model="iic/emotion2vec_plus_large")  # alternatives: iic/emotion2vec_plus_seed, iic/emotion2vec_plus_base, iic/emotion2vec_base_finetuned
rec_result = inference_pipeline(
    'https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav',
    output_dir="./outputs", granularity="utterance", extract_embedding=False)
print(rec_result)
I run into this error:

The above exception was the direct cause of the following exception:

MaxRetryError                             Traceback (most recent call last)
Cell In[40], line 7
      1 '''
      2 Using the emotion representation model
      3 rec_result only contains {'feats'}
...
--> 515 raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
    517 log.debug("Incremented Retry for (url='%s'): %r", url, new_retry)
    519 return new_retry

MaxRetryError: None: Max retries exceeded with url: https://www.modelscope.cn/api/v1/models/iic/emotion2vec_plus_large/repo?Revision=master&FilePath=emotion2vec+data.png (Caused by HTTPError('404 Client Error: Not Found for url: https://www.modelscope.cn/api/v1/models/iic/emotion2vec_plus_large/repo?Revision=master&FilePath=emotion2vec+data.png'))
How are utterance embeddings obtained? Are they derived from frame-level features through convolution or pooling?
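My reading of the README is that the utterance-level embedding is simply the temporal average of the frame-level features, not a learned convolution; a one-line sketch, with frame_feats standing in for a real (T, 768) frame-level output:

import numpy as np

frame_feats = np.random.randn(250, 768).astype(np.float32)  # stand-in for real frames
utt_embedding = frame_feats.mean(axis=0)                    # (768,) utterance embedding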
omegaconf.errors.ValidationError: Object of unsupported type: '_MISSING_TYPE'
full_key:
reference_type=None
object_type=None
Is this due to a package version conflict? I can't solve this problem.
When loading the data2vec2 model using fairseq.checkpoint_utils.load_model_ensemble_and_task([ckpt_path]), an error occurred: KeyError: '_name'. Could you please tell me how to solve this model-loading problem?
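KeyError: '_name' during checkpoint loading usually means the custom model/task classes were never registered before fairseq parsed the stored config. A sketch of the usual remedy, assuming the emotion2vec/data2vec2 code lives in a local directory that can be imported as a fairseq user module (the path is a placeholder):

from argparse import Namespace
from fairseq import checkpoint_utils, utils

# Register the custom model and task classes before loading the checkpoint.
utils.import_user_module(Namespace(user_dir="path/to/emotion2vec/upstream"))

models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(["path/to/checkpoint.pt"])
model = models[0].eval()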
I encountered an issue while performing inference using the iic/emotion2vec_plus_large model with FunASR. Here's the traceback of the error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/lianqi/anaconda3/envs/funasr/lib/python3.11/site-packages/funasr/auto/auto_model.py", line 253, in generate
model = self.model if model is None else model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lianqi/anaconda3/envs/funasr/lib/python3.11/site-packages/funasr/auto/auto_model.py", line 471, in inference_with_vad
)
KeyError: 'text'
from funasr import AutoModel
import librosa
import soundfile as sf

model_emotion = AutoModel(model="iic/emotion2vec_plus_base", model_revision="master",
                          vad_model="fsmn-vad", vad_model_revision="v2.0.4",
                          max_single_segment_time=19000,
                          )

# Resample the input to 16 kHz before inference
y, sr = librosa.load(wav_file)
y_16k = librosa.resample(y, orig_sr=sr, target_sr=16000)
sf.write("./temp.wav", y_16k, 16000, subtype='PCM_24')

res_emotion = model_emotion.generate("./temp.wav", output_dir="./outputs", granularity="utterance", extract_embedding=True)
print(res_emotion)
>>> model_emotion = AutoModel(model="iic/emotion2vec_plus_base", model_revision="master",
... vad_model="fsmn-vad", vad_model_revision="v2.0.4",
... max_single_segment_time=1000,
... )
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.blocks.0.0.weight, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.blocks.0.0.bias, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.blocks.1.0.weight, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.blocks.1.0.bias, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.blocks.2.0.weight, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.blocks.2.0.bias, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.blocks.3.0.weight, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.blocks.3.0.bias, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.proj.weight, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.proj.bias, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
2024-07-02 17:45:00,793 - modelscope - INFO - Use user-specified model revision: v2.0.4
>>>
>>> res_emotion = model_emotion.generate("./temp.wav", output_dir="./outputs", granularity="utterance", extract_embedding=True)
rtf_avg: 2.022: 100%|██████████| 1/1 [00:34<00:00, 34.72s/it]
[per-segment progress bars elided: rtf_avg falls from 2.878 to 0.023 as the remaining VAD segments count down from 261 to 8]
0%|          | 0/1 [01:13<?, ?it/s]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/lianqi/anaconda3/envs/funasr/lib/python3.11/site-packages/funasr/auto/auto_model.py", line 253, in generate
model = self.model if model is None else model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lianqi/anaconda3/envs/funasr/lib/python3.11/site-packages/funasr/auto/auto_model.py", line 471, in inference_with_vad
)
KeyError: 'text'
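Until inference_with_vad handles emotion2vec output, one workaround is to skip the combined pipeline: run fsmn-vad on its own, cut the waveform into the returned segments, and score each segment separately. A sketch under the assumption that the VAD result is a list of [start_ms, end_ms] pairs under the 'value' key, and that generate accepts a raw waveform with an fs argument, as in FunASR's examples:

import soundfile as sf
from funasr import AutoModel

vad = AutoModel(model="fsmn-vad", model_revision="v2.0.4")
emo = AutoModel(model="iic/emotion2vec_plus_base", model_revision="master")

wav, sr = sf.read("./temp.wav")  # already resampled to 16 kHz mono
segments = vad.generate(input="./temp.wav")[0]["value"]  # [[start_ms, end_ms], ...]

for start_ms, end_ms in segments:
    chunk = wav[int(start_ms * sr / 1000):int(end_ms * sr / 1000)].astype("float32")
    res = emo.generate(chunk, fs=sr, granularity="utterance", extract_embedding=False)
    print(start_ms, end_ms, res[0]["labels"], res[0]["scores"])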
I was trying to create the IEMOCAP embeddings on my own, but my GPU with 8 GB of memory gave me a CUDA OOM. How much memory do I need for this?
Please update the QR code.
extrafeature only works with the base model. Is there any plan to fix this?
Hi, thank you very much for your work.
I want to build on it for some further experiments, but I have not found any model fine-tuning code on ModelScope or GitHub.
Could you guide me on how to fine-tune and retrain your model?
Many thanks.
Hello, could you update the WeChat group QR code?
Hello there! I'm currently trying to use emotion2vec for sentiment analysis tasks and appreciate your work. After reading the related papers and documentation, I noticed that you provide instructions on how to predict from speech or text modalities separately.
However, I am also interested in combining both speech and text data (i.e., Speech + Text) for multimodal emotion prediction. From what I have found in the literature, this seems like an important application scenario.
Could you please provide a simple example demonstrating how to integrate these two modalities and run the model? I believe this would be highly beneficial for other users as well.
Thank you!
Hey, thanks for the open-source release!
I wanted to ask whether emotion2vec performs better than https://github.com/audeering/w2v2-how-to.
Thanks in advance.
Hello!
Thank you for such nice work!
I am performing speaker diarization with pyannote and want to run emotion detection on the audio segments I receive from the diarization model. The segments are of different sizes; I'm sure I'll have to do some kind of splitting because of CUDA OOM on very long segments (around 200 s), but I'm wondering what the optimal segment size is for the emotion2vec_plus_large model: 3 seconds, 15 seconds, or something else?
Thank you!
Hello! One of my recent works used emotion2vec. Could I join the group chat to communicate with you? My WeChat QR code is my profile picture; if you are not busy, you can add me by scanning it. Thank you very much.
What is the Emo-262 dataset? Did your group collect it, and will it be made available to the public? How can I get it?
Hint: the dataset name LSSED in the Table 2 caption is misspelled as LSED; you may want to check the paper.
Actually, I have a requirement: for long audio I need target emotion classification probabilities at a finer granularity, e.g. one every 5 s. But the current pipeline API is too tightly encapsulated to support this; it only computes a single global average. It would be great if the pipeline interface could take an extra segment-length input so that the returned probabilities gain a time dimension.
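In the meantime, a client-side sketch of the slice-and-score idea: cut the long waveform into fixed 5 s windows and call the model once per window, which yields one probability vector per time step. The window length and the 'labels'/'scores' output format follow the ModelScope examples; treat the details as assumptions:

import soundfile as sf
from funasr import AutoModel

model = AutoModel(model="iic/emotion2vec_plus_large")
wav, sr = sf.read("long_audio.wav")  # assumed 16 kHz mono
win = 5 * sr                         # 5-second windows

for i in range(0, len(wav), win):
    chunk = wav[i:i + win].astype("float32")
    res = model.generate(chunk, granularity="utterance", extract_embedding=False)
    print(f"{i / sr:.1f}s", res[0]["labels"], res[0]["scores"])  # one score vector per window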
Dear Authors,
You have only shared train.npy, train.lengths, and train.emo in the iemocap_downstream folder.
Would you mind also sharing the test and dev versions of these files? This would make testing your models more convenient.
Thank you in advance.
Best regards,
Aaron
Thank you for creating e2v. How can I access the previous model that output only a few labels instead of 9?
I find the new checkpoint (plus large) to be much worse than the old one, at least for Persian.
The model also hallucinates a lot on short inputs (1-2 seconds), even in English.