audeering / w2v2-how-to

How to use our public wav2vec2 dimensional emotion model

License: MIT License

speech-emotion-recognition deep-learning wav2vec2 transformer-models arousal dominance valence msp-podcast onnx

w2v2-how-to's Introduction

How to use our public dimensional emotion model

An introduction to our model for dimensional speech emotion recognition based on wav2vec 2.0. The model is available from doi:10.5281/zenodo.6221127 and released under CC BY-NC-SA 4.0. The model was created by fine-tuning the pre-trained wav2vec2-large-robust model on MSP-Podcast (v1.7). The pre-trained model was pruned from 24 to 12 transformer layers before fine-tuning. In this tutorial we use the ONNX export of the model. The original Torch model is hosted at Hugging Face. Further details are given in the associated paper.

License

The model can be used for non-commercial purposes, see CC BY-NC-SA 4.0. For commercial usage, a license for devAIce must be obtained. The source code in this GitHub repository is released under the MIT license.

Quick start

Create and activate a Python virtual environment, then install audonnx.

$ pip install audonnx

Load the model and test it on a random signal.

import audeer
import audonnx
import numpy as np


url = 'https://zenodo.org/record/6221127/files/w2v2-L-robust-12.6bc4a7fd-1.1.0.zip'
cache_root = audeer.mkdir('cache')
model_root = audeer.mkdir('model')

archive_path = audeer.download_url(url, cache_root, verbose=True)
audeer.extract_archive(archive_path, model_root)
model = audonnx.load(model_root)

sampling_rate = 16000
signal = np.random.normal(size=sampling_rate).astype(np.float32)
model(signal, sampling_rate)
{'hidden_states': array([[-0.00711814,  0.00615957, -0.00820673, ...,  0.00666412,
          0.00952989,  0.00269193]], dtype=float32),
 'logits': array([[0.6717072 , 0.6421313 , 0.49881312]], dtype=float32)}

The hidden states might be used as embeddings for related speech emotion recognition tasks. The order in the logits output is: arousal, dominance, valence.
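
For convenience, the three logits can be zipped with their dimension names. A minimal sketch continuing the quick-start snippet above (the names list simply restates the order documented here):

outputs = model(signal, sampling_rate)

# The logits arrive in the documented order: arousal, dominance, valence.
names = ['arousal', 'dominance', 'valence']
scores = dict(zip(names, outputs['logits'][0].tolist()))
print(scores)  # {'arousal': 0.67..., 'dominance': 0.64..., 'valence': 0.49...}

# The hidden states can serve as an utterance-level embedding,
# e.g. as input features for a downstream classifier.
embedding = outputs['hidden_states'][0]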

Tutorial

For a detailed introduction, please check out the notebook.

$ pip install -r requirements.txt
$ jupyter notebook notebook.ipynb 

Citation

If you use our model in your own work, please cite the following paper:

@article{wagner2023dawn,
    title={Dawn of the Transformer Era in Speech Emotion Recognition: Closing the Valence Gap},
    author={Wagner, Johannes and Triantafyllopoulos, Andreas and Wierstorf, Hagen and Schmitt, Maximilian and Burkhardt, Felix and Eyben, Florian and Schuller, Bj{\"o}rn W},
    journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
    pages={1--13},
    year={2023},
}

w2v2-how-to's People

Contributors

frankenjoe, hagenw

w2v2-how-to's Issues

Convert VAD to Ekman

Hello,

This model provides VAD values in 3D space.

However, the Ekman model is more intuitive for sharing results with users.

I have found papers with 3D representations hinting at how to perform this conversion.

Are you aware of a straightforward approach to perform the conversion between both models?

Ideally in Python, but any hint on the algorithm would also do.

Best,

Ed
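
There is no canonical conversion in this repository. A common heuristic in the literature is nearest-prototype matching in VAD space; a minimal sketch, where the prototype coordinates are illustrative placeholders (not values from the paper) that would need calibration against real data:

import numpy as np

# Hypothetical arousal/dominance/valence prototypes for Ekman's six
# basic emotions, scaled to the model's [0, 1] output range. These
# coordinates are illustrative placeholders only.
PROTOTYPES = {
    'anger':     (0.9, 0.8, 0.2),
    'disgust':   (0.5, 0.6, 0.2),
    'fear':      (0.8, 0.2, 0.2),
    'happiness': (0.7, 0.6, 0.9),
    'sadness':   (0.2, 0.3, 0.2),
    'surprise':  (0.8, 0.5, 0.6),
}

def vad_to_ekman(arousal, dominance, valence):
    # Return the Ekman category whose prototype is nearest in VAD space.
    point = np.array([arousal, dominance, valence])
    return min(
        PROTOTYPES,
        key=lambda name: np.linalg.norm(point - np.array(PROTOTYPES[name])),
    )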

Range value of arousal, valence, dominance

I wonder what the value range of arousal, valence, and dominance is. As far as I know, the model output is a logit vector of size 3 representing those dimensions, and its values appear to lie in [0, 1]. I see that you use the MSP-Conversation Corpus for fine-tuning. But when I looked at the MSP-Conversation Corpus paper, they mentioned that
"Notice that the values of the traces are in the range between -100 and 100. The figure shows that extreme values are uncommon. Most of the annotations are concentrated between -40 to 40 for valence, -20 to 50 for arousal, and -20 to 40 for dominance"

Do you guys normalize that feature, or do something related?
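
The repository itself does not document the normalization, but mapping raw traces in [-100, 100] to [0, 1] would be a simple affine rescaling. A sketch of that arithmetic (whether this is the scheme actually used is an assumption only the authors can confirm):

def rescale(x, src=(-100.0, 100.0), dst=(0.0, 1.0)):
    # Affinely map x from the src interval onto the dst interval.
    lo, hi = src
    a, b = dst
    return a + (x - lo) * (b - a) / (hi - lo)

rescale(40.0)   # -> 0.7
rescale(-20.0)  # -> 0.4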

Negative values for Arousal

When I run the model, I get some negative values for the arousal element. I thought arousal, dominance, and valence range between 0 and 1. Can anyone interpret what is happening, or what these negative values mean?
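
The observed negative values suggest the regression head is not bounded to [0, 1]. If downstream code assumes that range, one pragmatic (lossy) option is to clip; a minimal sketch:

import numpy as np

# Clamp predictions into [0, 1] before further processing.
logits = model(signal, sampling_rate)['logits']
clipped = np.clip(logits, 0.0, 1.0)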

Memory Leak during inference

Hi,

I have more than 100,000 audio files (each audio file is about 1-2 minutes). My goal is to use the API to infer arousal, dominance, and valence from these audio files. I simply loop over the audio files and feed them to the API one by one, but there seems to be a memory leak after about 5,000 iterations.

The error looks like this:

[E:onnxruntime:, sequential_executor.cc:514 onnxruntime::ExecuteKernel] Non-zero status code returned while running Softmax node. Name:'Softmax_246' Status Message: C:\a\_work\1\s\onnxruntime\core\framework\bfc_arena.cc:376 onnxruntime::BFCArena::AllocateRawInternal Failed to allocate memory for requested buffer of size 14304160000

I was wondering if there is some way to fix this problem? I am really new to deep learning frameworks and look forward to your help. (I am running the code on a CPU machine with 32 GB of RAM.)
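
The failed allocation grows with input length, so one mitigation is to process each file in fixed-length chunks instead of as a single long signal. A minimal sketch, not part of the repo (audiofile is assumed for reading, and averaging chunk logits is a simplification):

import audiofile
import audonnx
import numpy as np

model = audonnx.load('model')

def predict_in_chunks(path, chunk_dur=30.0):
    # Split the signal into fixed-length chunks so that ONNX Runtime
    # never has to allocate one huge activation buffer per file.
    signal, sampling_rate = audiofile.read(path, always_2d=False)
    chunk = int(chunk_dur * sampling_rate)
    logits = [
        model(signal[start:start + chunk].astype(np.float32), sampling_rate)['logits']
        for start in range(0, len(signal), chunk)
    ]
    # Average arousal/dominance/valence over all chunks.
    return np.concatenate(logits).mean(axis=0)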

Other pretrained models

Hi authors,

Is it possible to release the XLS version of the model or the CNN14 model?
My current project needs a classifier that predicts valence mainly from paralinguistic cues. I read in the paper's analysis that the released w2v2-L-robust model learned sentiment from the linguistic content.
So I'm wondering if it's possible to access one of your other models that, in your analysis, does not rely on linguistic content as much? It would be a great help!

Thanks!

Error in using audinterface.Feature

import audinterface

# model is the audonnx model from the quick start; sr is the sampling rate.
interface = audinterface.Feature(
    model.outputs["logits"].labels,
    process_func=model,
    process_func_applies_sliding_window=False,
    process_func_args={
        "outputs": "logits",
    },
    sampling_rate=sr,
    resample=True,
    verbose=True,
    win_dur=1.0,
    hop_dur=0.5,
)

AttributeError: type object 'type' has no attribute 'id'.

This happens with the latest version, 0.9.0.
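
An untested workaround while this is open: wrap the model in a plain function that selects the logits itself, which avoids the process_func_args plumbing entirely. A sketch, not a confirmed fix (model and sr as in the snippet above):

def process_func(signal, sampling_rate):
    # Call the model directly and pick the logits output ourselves.
    return model(signal, sampling_rate)['logits']

interface = audinterface.Feature(
    model.outputs['logits'].labels,
    process_func=process_func,
    sampling_rate=sr,
    resample=True,
    win_dur=1.0,
    hop_dur=0.5,
    verbose=True,
)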

There is an error when running the notebook code


ConnectionRefusedError Traceback (most recent call last)
~/miniconda3/envs/torch/lib/python3.7/urllib/request.py in do_open(self, http_class, req, **http_conn_args)
1349 h.request(req.get_method(), req.selector, req.data, headers,
-> 1350 encode_chunked=req.has_header('Transfer-encoding'))
1351 except OSError as err: # timeout error

~/miniconda3/envs/torch/lib/python3.7/http/client.py in request(self, method, url, body, headers, encode_chunked)
1280 """Send a complete request to the server."""
-> 1281 self._send_request(method, url, body, headers, encode_chunked)
1282

~/miniconda3/envs/torch/lib/python3.7/http/client.py in _send_request(self, method, url, body, headers, encode_chunked)
1326 body = _encode(body, 'body')
-> 1327 self.endheaders(body, encode_chunked=encode_chunked)
1328

~/miniconda3/envs/torch/lib/python3.7/http/client.py in endheaders(self, message_body, encode_chunked)
1275 raise CannotSendHeader()
-> 1276 self._send_output(message_body, encode_chunked=encode_chunked)
1277

~/miniconda3/envs/torch/lib/python3.7/http/client.py in _send_output(self, message_body, encode_chunked)
1035 del self._buffer[:]
-> 1036 self.send(msg)
1037

~/miniconda3/envs/torch/lib/python3.7/http/client.py in send(self, data)
975 if self.auto_open:
--> 976 self.connect()
977 else:

~/miniconda3/envs/torch/lib/python3.7/http/client.py in connect(self)
1442
-> 1443 super().connect()
1444

~/miniconda3/envs/torch/lib/python3.7/http/client.py in connect(self)
947 self.sock = self._create_connection(
--> 948 (self.host,self.port), self.timeout, self.source_address)
949 self.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

~/miniconda3/envs/torch/lib/python3.7/socket.py in create_connection(address, timeout, source_address)
727 try:
--> 728 raise err
729 finally:

~/miniconda3/envs/torch/lib/python3.7/socket.py in create_connection(address, timeout, source_address)
715 sock.bind(source_address)
--> 716 sock.connect(sa)
717 # Break explicitly a reference cycle

ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

URLError Traceback (most recent call last)
/tmp/ipykernel_2304374/3507476301.py in
20 url,
21 dst_path,
---> 22 verbose=True,
23 )
24

~/miniconda3/envs/torch/lib/python3.7/site-packages/audeer/core/io.py in download_url(url, destination, force_download, verbose)
188 pbar.update(block_size)
189
--> 190 urllib.request.urlretrieve(url, destination, reporthook=bar_update)
191
192 return destination

~/miniconda3/envs/torch/lib/python3.7/urllib/request.py in urlretrieve(url, filename, reporthook, data)
245 url_type, path = splittype(url)
246
--> 247 with contextlib.closing(urlopen(url, data)) as fp:
248 headers = fp.info()
249

~/miniconda3/envs/torch/lib/python3.7/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
220 else:
221 opener = _opener
--> 222 return opener.open(url, data, timeout)
223
224 def install_opener(opener):

~/miniconda3/envs/torch/lib/python3.7/urllib/request.py in open(self, fullurl, data, timeout)
523 req = meth(req)
524
--> 525 response = self._open(req, data)
526
527 # post-process response

~/miniconda3/envs/torch/lib/python3.7/urllib/request.py in _open(self, req, data)
541 protocol = req.type
542 result = self._call_chain(self.handle_open, protocol, protocol +
--> 543 '_open', req)
544 if result:
545 return result

~/miniconda3/envs/torch/lib/python3.7/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
501 for handler in handlers:
502 func = getattr(handler, meth_name)
--> 503 result = func(*args)
504 if result is not None:
505 return result

~/miniconda3/envs/torch/lib/python3.7/urllib/request.py in https_open(self, req)
1391 def https_open(self, req):
1392 return self.do_open(http.client.HTTPSConnection, req,
-> 1393 context=self._context, check_hostname=self.check_hostname)
1394
1395 https_request = AbstractHTTPHandler.do_request

~/miniconda3/envs/torch/lib/python3.7/urllib/request.py in do_open(self, http_class, req, **http_conn_args)
1350 encode_chunked=req.has_header('Transfer-encoding'))
1351 except OSError as err: # timeout error
-> 1352 raise URLError(err)
1353 r = h.getresponse()
1354 except:

URLError: <urlopen error [Errno 111] Connection refused>

Fine-tune on another dataset

Hi,

I am currently conducting a research project with my partner on developing an SER model for New Zealand English. We evaluated the model you provided here and achieved promising results, but would like to fine-tune it on another corpus.

We were wondering what input format the model expects our dataset to be in for training. We have it as a Dataset object using the datasets library from HuggingFace. The debug console in the image below shows the structure of our Dataset. It currently has audio, arousal, and valence annotations as inputs to the model.

[image: debug console showing the structure of the Dataset]

Was this the input used, or was a different input expected?
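
The repository does not ship training code, so the expected layout is not documented here. For wav2vec2-style fine-tuning, a common shape is raw 16 kHz audio plus numeric targets; a hypothetical sketch with the HuggingFace datasets library (all column names are assumptions, not the authors' specification):

from datasets import Audio, Dataset

# Hypothetical minimal layout: file paths decoded to 16 kHz audio,
# plus numeric regression targets in [0, 1].
ds = Dataset.from_dict({
    'audio': ['clip_001.wav', 'clip_002.wav'],
    'arousal': [0.61, 0.35],
    'valence': [0.48, 0.72],
}).cast_column('audio', Audio(sampling_rate=16_000))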

audonnx requirements: trainer and onnx depend on different protobuf versions

pip install audonnx

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
trainer 0.0.20 requires protobuf<3.20,>=3.9.2, but you have protobuf 3.20.3 which is incompatible.

So I installed protobuf 3.9.2, and then:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
onnx 1.13.1 requires protobuf<4,>=3.20.2, but you have protobuf 3.9.2 which is incompatible.
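
Since the two pins are mutually exclusive in one environment, one way out is to install audonnx into its own virtual environment, isolated from trainer:

$ python -m venv w2v2-env
$ source w2v2-env/bin/activate
$ pip install audonnx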
