ddlBoJack / emotion2vec
[ACL 2024] Official PyTorch code for extracting features and training downstream models with emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation
Thank you for sharing your nice work!
In the script emotion2vec_extract_features.sh, I noticed that features are extracted from the last layer. Have you tried extracting features from other layers as well? I'm just curious whether this choice is based on empirical insight.
Thank you very much for open-sourcing such a good emotion pre-trained model.
I saw a description on ModelScope along these lines: first, emotion2vec is fine-tuned on academic speech emotion recognition datasets, and then 150,000 hours of Chinese and English data are annotated, keeping samples whose text emotion matches the speech emotion with high confidence.
Could you also open-source the text emotion model and the speech emotion model trained on the academic datasets? I would like to train a 3-class model based on this approach.
Thanks!
Hi,
Thank you for the great work you've done on this model! Is there any way to run batched inference with funasr? I've been trying to batch with padding and set padding_mask to mask out the unused frames, but I'm not getting the same results as when I run inference sequentially.
Here's a sample of the code I'm using. I've tried a number of different argument configurations: there are several mask parameters, and it seems like mask refers to the MLM pretraining scheme, while padding_mask refers to the attention mask? I'm not sure, though, because there's no documentation. Any guidance would be appreciated.
import torch
from funasr.utils.load_utils import load_audio_text_image_video
from funasr import AutoModel
from torch.nn.utils.rnn import pad_sequence

model = AutoModel(model="iic/emotion2vec_plus_large").model
model.eval()
model.to("cuda")

padding_value = -1
# audios is a list of audio tensors resampled to 16 kHz
x = load_audio_text_image_video(audios)
x = [torch.nn.functional.layer_norm(x_, x_.shape).squeeze() for x_ in x]
masked_x = pad_sequence(x, batch_first=True, padding_value=padding_value).to("cuda")
mask = masked_x == padding_value  # True at padded positions

out = model.extract_features(masked_x, mask=False, padding_mask=mask, remove_extra_tokens=True)
out_mask = out["padding_mask"]
feats = out["x"]
feats[out_mask] = 0  # zero out padded frames before pooling
print(feats.sum(dim=1) / (~out_mask).sum(dim=1).unsqueeze(-1))  # masked mean over time
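For reference, this is the kind of sequential loop I compare against: a minimal sketch that reuses x and model from the snippet above and assumes the same (undocumented) extract_features signature. The per-utterance means from this loop are what the batched run should reproduce.

seq_feats = []
with torch.no_grad():
    for x_ in x:  # the layer-normed tensors from above
        out_one = model.extract_features(
            x_.unsqueeze(0).to("cuda"),  # batch of one, so no padding needed
            mask=False,
            padding_mask=None,
            remove_extra_tokens=True,
        )
        seq_feats.append(out_one["x"].squeeze(0).mean(dim=0))  # mean over time
print(torch.stack(seq_feats))  # compare element-wise with the batched means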
Hello, the group chat QR code has expired.
I want to know if emotion2vec can run on an ARM server.
How can I fine-tune the emotion2vec+large model on another dataset without using the process you used for IEMOCAP?
I tried to use four features and your bash script train.sh, but I got this error:
File "C:\Users\doki_engbu\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\spawn.py", line 122, in spawn_main
exitcode = _main(fd, parent_sentinel)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\doki_engbu\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\spawn.py", line 132, in _main
self = reduction.pickle.load(from_parent)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_pickle.UnpicklingError: pickle data was truncated
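For what it's worth, this spawn_main / "pickle data was truncated" trace is the classic symptom of Windows' spawn start method re-importing the training script in each DataLoader worker. A minimal sketch of the usual fix, with main() as a hypothetical stand-in for whatever train.sh ultimately runs; alternatively, set num_workers=0:

import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    # Dummy data; in the real script this would be the feature dataset.
    dataset = TensorDataset(torch.randn(8, 768), torch.zeros(8, dtype=torch.long))
    loader = DataLoader(dataset, batch_size=4, num_workers=2)
    for feats, labels in loader:
        print(feats.shape, labels.shape)

if __name__ == "__main__":  # required on Windows, where workers are spawned
    main()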
I'd like to join the group; could the owner update the QR code? Thanks!
Hi @ddlBoJack,
Please share some information about the checkpoint file linked in the README. Is it the best-performing model so far?
Also, in the train.py file given for IEMOCAP, are the features frame-level or utterance-level?
Thanks,
I tested the given emotion2vec_large on RAVDESS Speech and RAVDESS Song. The weighted accuracy on RAVDESS Speech is 87%, similar to the result in the paper, but the result on RAVDESS Song is 64%, which is very different from the paper. Is there any difference in how the two datasets are tested? I don't know why.
As per the title.
Could you please share the script to train the network for the upstream task? I want to fine-tune the model.
Thanks!
Hello, I'm new to this field. I'd like to ask why I got a poor result when I used the utterance-level features you provided for emotion recognition; the WA was only around 60%. I also use only a linear layer as the base model.
I am looking forward to your answer, thank you.
Thank you for your contribution; your work is truly amazing. However, I would like to train emotion2vec for a pretraining task. Could you provide the source code or offer any suggestions?
Sorry for missing the last update.
Thank you for providing the code!
I am a novice in the field of SER. I have trained the downstream model using the provided train.npy, train.lengths, and train.emo files, but I'm unsure how to use the resulting model for category inference on the features in train.npy.
I noticed that the shape of the provided train.npy is (1253877, 768). In my understanding, it represents 1253877 samples with 768-dimensional features each. I would like to classify these 1253877 samples using the trained model. How can I achieve this?
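A minimal sketch of how those files fit together, under the assumption that train.npy holds frame-level features, train.lengths lists the number of frames per utterance (one integer per line), and a plain nn.Linear stands in for the trained downstream head:

import numpy as np
import torch
import torch.nn as nn

feats = np.load("train.npy")                       # (total_frames, 768)
lengths = np.loadtxt("train.lengths", dtype=int)   # frames per utterance

# Mean-pool each utterance's frames into a single 768-dim vector.
offsets = np.cumsum(lengths)[:-1]
utt_feats = np.stack([seg.mean(axis=0) for seg in np.split(feats, offsets)])
utt_feats = torch.from_numpy(utt_feats).float()    # (num_utterances, 768)

clf = nn.Linear(768, 4)  # hypothetical stand-in; load the trained head's weights instead
with torch.no_grad():
    preds = clf(utt_feats).argmax(dim=-1)          # one class index per utterance
print(preds[:10])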
When trying to run prediction with the fine-tuned models through ModelScope:

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_pipeline = pipeline(
    task=Tasks.emotion_recognition,
    model="iic/emotion2vec_plus_large")  # alternatives: iic/emotion2vec_plus_seed, iic/emotion2vec_plus_base, iic/emotion2vec_base_finetuned
rec_result = inference_pipeline(
    'https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav',
    output_dir="./outputs", granularity="utterance", extract_embedding=False)
print(rec_result)
I run into this error:

The above exception was the direct cause of the following exception:

MaxRetryError                             Traceback (most recent call last)
Cell In[40], line 7
      1 '''
      2 Using the emotion representation model
      3 rec_result only contains {'feats'}
...
--> 515 raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
    517 log.debug("Incremented Retry for (url='%s'): %r", url, new_retry)
    519 return new_retry

MaxRetryError: None: Max retries exceeded with url: https://www.modelscope.cn/api/v1/models/iic/emotion2vec_plus_large/repo?Revision=master&FilePath=emotion2vec+data.png (Caused by HTTPError('404 Client Error: Not Found for url: https://www.modelscope.cn/api/v1/models/iic/emotion2vec_plus_large/repo?Revision=master&FilePath=emotion2vec+data.png'))
How are utterance embeddings obtained? Are they derived from frame-level features through convolution or pooling?
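My reading of the README is that the utterance-level embedding is simply the temporal average of the frame-level features, not a learned convolution; a one-line sketch, with frame_feats standing in for a real (T, 768) frame-level output:

import numpy as np

frame_feats = np.random.randn(250, 768).astype(np.float32)  # stand-in for real frames
utt_embedding = frame_feats.mean(axis=0)                    # (768,) utterance embedding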
omegaconf.errors.ValidationError: Object of unsupported type: '_MISSING_TYPE'
full_key:
reference_type=None
object_type=None
Is this due to a package version conflict? I can't solve this problem.
When loading the data2vec2 model using fairseq.checkpoint_utils.load_model_ensemble_and_task([ckpt_path]), an error occurred: KeyError: '_name'. Could you please tell me how to solve this model-loading problem?
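KeyError: '_name' during checkpoint loading usually means the custom model/task classes were never registered before fairseq parsed the stored config. A sketch of the usual remedy, assuming the emotion2vec/data2vec2 code lives in a local directory that can be imported as a fairseq user module (the path is a placeholder):

from argparse import Namespace
from fairseq import checkpoint_utils, utils

# Register the custom model and task classes before loading the checkpoint.
utils.import_user_module(Namespace(user_dir="path/to/emotion2vec/upstream"))

models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(["path/to/checkpoint.pt"])
model = models[0].eval()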
I encountered an issue while performing inference using the iic/emotion2vec_plus_large model with FunASR. Here's the traceback of the error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/lianqi/anaconda3/envs/funasr/lib/python3.11/site-packages/funasr/auto/auto_model.py", line 253, in generate
model = self.model if model is None else model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lianqi/anaconda3/envs/funasr/lib/python3.11/site-packages/funasr/auto/auto_model.py", line 471, in inference_with_vad
)
KeyError: 'text'
from funasr import AutoModel
import librosa
import soundfile as sf

model_emotion = AutoModel(model="iic/emotion2vec_plus_base", model_revision="master",
                          vad_model="fsmn-vad", vad_model_revision="v2.0.4",
                          max_single_segment_time=19000,
                          )

# Resample the input to 16 kHz before inference
y, sr = librosa.load(wav_file)
y_16k = librosa.resample(y, orig_sr=sr, target_sr=16000)
sf.write("./temp.wav", y_16k, 16000, subtype='PCM_24')

res_emotion = model_emotion.generate("./temp.wav", output_dir="./outputs", granularity="utterance", extract_embedding=True)
print(res_emotion)
>>> model_emotion = AutoModel(model="iic/emotion2vec_plus_base", model_revision="master",
... vad_model="fsmn-vad", vad_model_revision="v2.0.4",
... max_single_segment_time=1000,
... )
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.blocks.0.0.weight, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.blocks.0.0.bias, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.blocks.1.0.weight, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.blocks.1.0.bias, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.blocks.2.0.weight, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.blocks.2.0.bias, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.blocks.3.0.weight, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.blocks.3.0.bias, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.proj.weight, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
Warning, miss key in ckpt: modality_encoders.AUDIO.decoder.proj.bias, /home/lianqi/.cache/modelscope/hub/iic/emotion2vec_plus_base/model.pt
2024-07-02 17:45:00,793 - modelscope - INFO - Use user-specified model revision: v2.0.4
>>>
>>> res_emotion = model_emotion.generate("./temp.wav", output_dir="./outputs", granularity="utterance", extract_embedding=True)
rtf_avg: 2.022: 100%|██████████| 1/1 [00:34<00:00, 34.72s/it]
[per-segment progress bars elided: rtf_avg falls from 2.878 to 0.023 as the remaining VAD segments count down from 261 to 8]
0%|          | 0/1 [01:13<?, ?it/s]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/lianqi/anaconda3/envs/funasr/lib/python3.11/site-packages/funasr/auto/auto_model.py", line 253, in generate
model = self.model if model is None else model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lianqi/anaconda3/envs/funasr/lib/python3.11/site-packages/funasr/auto/auto_model.py", line 471, in inference_with_vad
)
KeyError: 'text'
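Until inference_with_vad handles emotion2vec output, one workaround is to skip the combined pipeline: run fsmn-vad on its own, cut the waveform into the returned segments, and score each segment separately. A sketch under the assumption that the VAD result is a list of [start_ms, end_ms] pairs under the 'value' key, and that generate accepts a raw waveform with an fs argument, as in FunASR's examples:

import soundfile as sf
from funasr import AutoModel

vad = AutoModel(model="fsmn-vad", model_revision="v2.0.4")
emo = AutoModel(model="iic/emotion2vec_plus_base", model_revision="master")

wav, sr = sf.read("./temp.wav")  # already resampled to 16 kHz mono
segments = vad.generate(input="./temp.wav")[0]["value"]  # [[start_ms, end_ms], ...]

for start_ms, end_ms in segments:
    chunk = wav[int(start_ms * sr / 1000):int(end_ms * sr / 1000)].astype("float32")
    res = emo.generate(chunk, fs=sr, granularity="utterance", extract_embedding=False)
    print(start_ms, end_ms, res[0]["labels"], res[0]["scores"])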
I was trying to create the IEMOCAP embeddings on my own, but my GPU with 8 GB of memory gave me a CUDA OOM. How much memory do I need for this?
Please update the QR code.
extrafeature only works with the base model. Is there any plan to fix this?
Hi, thank you very much for your work.
I want to build on it for some further experiments, but I have not found any model fine-tuning code on ModelScope or GitHub.
Could you guide me on how to fine-tune and retrain your model?
Many thanks.
Hello, could you update the WeChat group QR code?
Hello there! I'm currently trying to use emotion2vec for sentiment analysis tasks and appreciate your work. After reading the related papers and documentation, I noticed that you provide instructions on how to predict from speech or text modalities separately.
However, I am also interested in combining both speech and text data (i.e., Speech + Text) for multimodal emotion prediction. From what I have found in the literature, this seems like an important application scenario.
Could you please provide a simple example demonstrating how to integrate these two modalities and run the model? I believe this would be highly beneficial for other users as well.
Thank you!
Hey, thanks for the open-source release!
I wanted to ask whether emotion2vec performs better than https://github.com/audeering/w2v2-how-to.
Thanks in advance.
Hello!
Thank you for such nice work!
I am performing speaker diarization with pyannote and want to run emotion detection on the audio segments I receive from the diarization model. The segments are of different sizes; I'm sure I'll have to do some kind of splitting because of CUDA OOM on very long segments (around 200 s), but I'm wondering what the optimal segment size is for the emotion2vec_plus_large model: 3 seconds, 15 seconds, or something else?
Thank you!
Hello! One of my recent works used emotion2vec. Could I join the group chat to communicate with you? My WeChat QR code is my profile picture; if you are not busy, you can add me by scanning it. Thank you very much.
What is the Emo-262 dataset? Did your group collect it, and will it be made available to the public? How can I get it?
Hint: the dataset name LSSED in the Table 2 caption is misspelled as LSED; you may want to check the paper.
Actually, I have a requirement: for long audio I need target emotion classification probabilities at a finer granularity, e.g. one every 5 s. But the current pipeline API is too tightly encapsulated to support this; it only computes a single global average. It would be great if the pipeline interface could take an extra segment-length input so that the returned probabilities gain a time dimension.
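In the meantime, a client-side sketch of the slice-and-score idea: cut the long waveform into fixed 5 s windows and call the model once per window, which yields one probability vector per time step. The window length and the 'labels'/'scores' output format follow the ModelScope examples; treat the details as assumptions:

import soundfile as sf
from funasr import AutoModel

model = AutoModel(model="iic/emotion2vec_plus_large")
wav, sr = sf.read("long_audio.wav")  # assumed 16 kHz mono
win = 5 * sr                         # 5-second windows

for i in range(0, len(wav), win):
    chunk = wav[i:i + win].astype("float32")
    res = model.generate(chunk, granularity="utterance", extract_embedding=False)
    print(f"{i / sr:.1f}s", res[0]["labels"], res[0]["scores"])  # one score vector per window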
Dear Authors,
You have only shared train.npy, train.lengths, and train.emo in the iemocap_downstream folder.
Would you mind also sharing the test and dev versions of these files? This would make testing your models more convenient.
Thank you in advance.
Best regards,
Aaron
Thank you for creating e2v. How can I access the previous model that output only a few labels instead of 9?
I find the new checkpoint (plus large) to be much worse than the old one, at least for Persian.
The model also hallucinates a lot on short inputs (1-2 seconds), even in English.