modelscope / 3d-speaker

A Repository for Single- and Multi-modal Speaker Verification, Speaker Recognition and Speaker Diarization

License: Apache License 2.0

campplus speaker-diarization speaker-verification voxceleb 3d-speaker eres2net rdino language-identification modelscope cnceleb

3d-speaker's Introduction




3D-Speaker is an open-source toolkit for single- and multi-modal speaker verification, speaker recognition, and speaker diarization. All pretrained models are accessible on ModelScope. We also release a large-scale speech corpus, likewise named 3D-Speaker, to facilitate research on speech representation disentanglement.

Quickstart

Install 3D-Speaker

git clone https://github.com/alibaba-damo-academy/3D-Speaker.git && cd 3D-Speaker
conda create -n 3D-Speaker python=3.8
conda activate 3D-Speaker
pip install -r requirements.txt

Running experiments

# Speaker verification: ERes2Net on 3D-Speaker dataset
cd egs/3dspeaker/sv-eres2net/
bash run.sh
# Speaker verification: ERes2NetV2 on 3D-Speaker dataset
cd egs/3dspeaker/sv-eres2netv2/
bash run.sh
# Speaker verification: CAM++ on 3D-Speaker dataset
cd egs/3dspeaker/sv-cam++/
bash run.sh
# Speaker verification: ECAPA-TDNN on 3D-Speaker dataset
cd egs/3dspeaker/sv-ecapa/
bash run.sh
# Self-supervised speaker verification: RDINO on 3D-Speaker dataset
cd egs/3dspeaker/sv-rdino/
bash run.sh
# Self-supervised speaker verification: SDPN on VoxCeleb dataset
cd egs/voxceleb/sv-sdpn/
bash run.sh
# Audio and multimodal Speaker diarization:
cd egs/3dspeaker/speaker-diarization/
bash run_audio.sh
bash run_video.sh
# Language identification
cd egs/3dspeaker/language-identification
bash run.sh

Inference using pretrained models from ModelScope

All pretrained models are released on ModelScope.

# Install modelscope
pip install modelscope
# ERes2Net trained on 200k labeled speakers
model_id=iic/speech_eres2net_sv_zh-cn_16k-common
# ERes2NetV2 trained on 200k labeled speakers
model_id=iic/speech_eres2netv2_sv_zh-cn_16k-common
# CAM++ trained on 200k labeled speakers
model_id=iic/speech_campplus_sv_zh-cn_16k-common
# Run CAM++ or ERes2Net inference
python speakerlab/bin/infer_sv.py --model_id $model_id

# SDPN trained on VoxCeleb
model_id=iic/speech_sdpn_ecapa_tdnn_sv_en_voxceleb_16k
# Run SDPN inference
python speakerlab/bin/infer_sv_ssl.py --model_id $model_id
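
The same checkpoints can also be driven through the ModelScope pipeline API rather than the CLI scripts. A minimal sketch (the input wav names are placeholders; see the revision-pinned usage example in the issues below for a complete call):

from modelscope.pipelines import pipeline

sv_pipeline = pipeline(
    task='speaker-verification',
    model='iic/speech_campplus_sv_zh-cn_16k-common',
)
# Pass a pair of 16 kHz wav paths or URLs; the result contains a score and a yes/no decision.
result = sv_pipeline(['speaker1_a.wav', 'speaker1_b.wav'])
print(result)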

Overview of Content

What's new 🔥

Contact

If you have any comments or questions about 3D-Speaker, please contact us by

  • email: {chenyafeng.cyf, zsq174630, tongmu.wh, shuli.cly}@alibaba-inc.com

License

3D-Speaker is released under the Apache License 2.0.

Acknowledgements

3D-Speaker contains third-party components and code modified from open-source repositories, including:
Speechbrain, Wespeaker, D-TDNN, DINO, VICReg, TalkNet-ASD, Ultra-Light-Fast-Generic-Face-Detector-1MB

Citations

If you find this repository useful, please consider giving a star ⭐ and citation 🦖:

@inproceedings{chen2024eres2netv2,
  title={ERes2NetV2: Boosting Short-Duration Speaker Verification Performance with Computational Efficiency},
  author={Chen, Yafeng and Zheng, Siqi and Wang, Hui and Cheng, Luyao and others},
  booktitle={INTERSPEECH},
  year={2024}
}
@article{chen2024sdpn,
  title={Self-Distillation Prototypes Network: Learning Robust Speaker Representations without Supervision},
  author={Chen, Yafeng and Zheng, Siqi and Wang, Hui and Cheng, Luyao and others},
  url={https://arxiv.org/pdf/2308.02774},
  year={2024}
}
@article{chen20243d,
  title={3D-Speaker-Toolkit: An Open Source Toolkit for Multi-modal Speaker Verification and Diarization},
  author={Chen, Yafeng and Zheng, Siqi and Wang, Hui and Cheng, Luyao and others},
  url={https://arxiv.org/pdf/2403.19971},
  year={2024}
}
@inproceedings{zheng20233d,
  title={3D-Speaker: A Large-Scale Multi-Device, Multi-Distance, and Multi-Dialect Corpus for Speech Representation Disentanglement},
  author={Zheng, Siqi and Cheng, Luyao and Chen, Yafeng and Wang, Hui and Chen, Qian},
  url={https://arxiv.org/pdf/2306.15354},
  year={2023}
}
@inproceedings{wang2023cam++,
  title={CAM++: A Fast and Efficient Network For Speaker Verification Using Context-Aware Masking},
  author={Wang, Hui and Zheng, Siqi and Chen, Yafeng and Cheng, Luyao and Chen, Qian},
  booktitle={INTERSPEECH},
  year={2023}
}
@inproceedings{chen2023enhanced,
  title={An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification},
  author={Chen, Yafeng and Zheng, Siqi and Wang, Hui and Cheng, Luyao and Chen, Qian and Qi, Jiajun},
  booktitle={INTERSPEECH},
  year={2023}
}
@inproceedings{chen2023pushing,
  title={Pushing the limits of self-supervised speaker verification using regularized distillation framework},
  author={Chen, Yafeng and Zheng, Siqi and Wang, Hui and Cheng, Luyao and Chen, Qian},
  booktitle={ICASSP},
  year={2023}
}

3d-speaker's People

Contributors

alibaba-oss, geekorangeluyao, querryton, speaker-lover, wanghuii1, yfchenlucky, yfchenmodelscope


3d-speaker's Issues

Error when running bash run.sh

I ran run.sh in sv-cam++ with a single GPU, and it fails at Stage 3 with the error below. Is this a Python problem? Any advice would be appreciated.
Stage3: Training the speaker model...
/root/miniconda3/envs/3D-Speaker/bin/python: can't open file 'speakerlab/bin/train.py': [Errno 20] Not a directory
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 3209) of binary: /root/miniconda3/envs/3D-Speaker/bin/python
Traceback (most recent call last):
File "/root/miniconda3/envs/3D-Speaker/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/root/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/root/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/root/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

speakerlab/bin/train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-11-02_16:34:10
host : autodl-container-9ee2119752-04687cb0
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 3209)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Problem with training part.

Hi, I am Nathan, and I am facing some problems with the training part.

My env
Centos7.5
#PIP
pytorch-wpe 0.0.1
rotary-embedding-torch 0.5.3
torch 1.12.1+cu113 // To use CUDA, I reinstalled torch and torchaudio.
torch-complex 0.4.3
torchaudio 0.12.1+cu113
torchvision 0.13.1+cu113

#rpm
libcudnn8-devel-8.2.0.53-1.cuda11.3.x86_64
libcudnn8-8.2.0.53-1.cuda11.3.x86_64

libnccl-devel-2.9.9-1+cuda11.3.x86_64
libnccl-2.9.9-1+cuda11.3.x86_64

To run a script, I followed 'egs/voxceleb/sv-ecapa/run.sh'.
I set 4 GPUs (with a single GPU it does not work either),
but I got the error below.

Stage3: Training the speaker model...
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


2024-02-15 14:31:58,001 - INFO: Use GPU: 3 for training.
2024-02-15 14:31:58,003 - INFO: Use GPU: 2 for training.
2024-02-15 14:31:58,009 - INFO: Use GPU: 1 for training.
2024-02-15 14:31:58,011 - INFO: Use GPU: 0 for training.
Traceback (most recent call last):
File "speakerlab/bin/train.py", line 176, in <module>
main()
File "speakerlab/bin/train.py", line 60, in main
model = torch.nn.parallel.DistributedDataParallel(model)
File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 646, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
(The same traceback is printed by each of the four ranks; the copies are interleaved in the original log.)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 121550 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 121547) of binary: /home/asr/miniconda3/envs/3D-Speaker/bin/python
Traceback (most recent call last):
File "/home/asr/miniconda3/envs/3D-Speaker/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

speakerlab/bin/train.py FAILED

Failures:
[1]:
time : 2024-02-15_14:32:03
host : e7bcf3a85e2c
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 121548)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-02-15_14:32:03
host : e7bcf3a85e2c
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 121549)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-02-15_14:32:03
host : e7bcf3a85e2c
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 121547)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Corrupted data

While processing the data with the provided scripts, I found one 0 KB file: 3dspeaker/train/3D_SPK_00014/3D_SPK_00014_008_Device06_Distance08_Dialect00.wav

Inference acceleration

When applying the speaker-classification module to hundreds of millions of utterances, how should VAD and embedding extraction be batched for efficient inference?

Thanks in advance for your reply and suggestions.

Windows

Is there a supported way to set up the training environment on Windows?

Questions about the speech_eres2net_sv_zh-cn_16k-common pretrained model

1. The model is described as trained on 200k speakers, but the 3D-Speaker corpus contains only 10,000 speakers. Was additional data used?
2. I used this model to extract embeddings for the CNCeleb test and enrollment sets, then computed EER with the project's compute_score_metrics.py. I get 4.08, noticeably higher than the reported 2.8. Is that expected?

About the new 250k ERes2Net model

Hello,
First of all, thank you very much for the models and code you have contributed on ModelScope.
I saw that ModelScope recently released a 250k ERes2Net model, "speech_eres2net_base_250k_sv_zh-cn_16k-common".
Below is the code I use to run inference with it locally:

model_id=damo/speech_eres2net_base_250k_sv_zh-cn_16k-common
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path

However, I found that this model is missing some required configuration entries, e.g.:

ERes2Net_Large_3D_Speaker = {
    'obj': 'speakerlab.models.eres2net.ResNet.ERes2Net',
    'args': {
        'feat_dim': 80,
        'embedding_size': 512,
        'm_channels': 64,
    },
}

supports = {...}

I would appreciate your help. Thank you very much!

Missing transcripts?

I read the FAQ on the page, but some transcripts still seem to be missing; for example, speaker 3D_SPK_00001 does not appear in transcription/train_transcription or transcription/test_transcription.
Did I miss something, or are transcripts provided for only part of the corpus?

ValueError: need at least one array to stack

/opt/conda/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:1416: FutureWarning: The default value of n_init will change from 10 to 'auto' in 1.4. Set the value of n_init explicitly to suppress the warning
super()._check_params_vs_input(X, default_n_init=10)
Traceback (most recent call last):
File "/vepfs/code/MossFormer/3D-Speaker/egs/3dspeaker/speaker-diarization/local/cluster_and_postprocess_h5.py", line 93, in audio_only_func_getnums
labels = cluster(embeddings)
File "/vepfs/code/MossFormer/3D-Speaker/speakerlab/process/cluster.py", line 186, in call
labels = self.filter_minor_cluster(labels, X, self.min_cluster_size)
File "/vepfs/code/MossFormer/3D-Speaker/speakerlab/process/cluster.py", line 203, in filter_minor_cluster
major_center = np.stack([x[labels == i].mean(0)
File "/opt/conda/lib/python3.10/site-packages/numpy/core/shape_base.py", line 445, in stack
raise ValueError('need at least one array to stack')
ValueError: need at least one array to stack

During handling of the above exception, another exception occurred:

===========================
When running labels = cluster(embeddings)  # embeddings shape [14, 192]
the error above was raised.
What causes this, and how can it be fixed?
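
A minimal sketch of a possible guard, assuming filter_minor_cluster collects cluster centers only from clusters of at least min_cluster_size: with only 14 segments, every cluster may fall below that size, leaving np.stack with an empty list. Names follow the traceback above; this is an illustration, not the project's official fix.

import numpy as np

def filter_minor_cluster(labels, x, min_cluster_size):
    # labels: 1-D integer array of cluster ids; x: [num_segments, emb_dim]
    counts = np.bincount(labels)
    major_labels = [i for i, c in enumerate(counts) if c >= min_cluster_size]
    if not major_labels:
        # every cluster is "minor": keep labels unchanged instead of
        # calling np.stack on an empty list
        return labels
    major_center = np.stack([x[labels == i].mean(0) for i in major_labels])
    minor_mask = ~np.isin(labels, major_labels)
    if minor_mask.any():
        # reassign minor segments to the nearest major center
        # (dot-product similarity, assuming length-normalized embeddings)
        sims = x[minor_mask] @ major_center.T
        labels[minor_mask] = np.array(major_labels)[sims.argmax(1)]
    return labels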

sv-rdino - RuntimeError

I have been trying to train sv-rdino, and at runtime the code reported the following error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2048]] is at version 3;

How should we solve this problem?

Classifier?

Hello, the eres2net model consists of two parts, an embedding extractor and a classifier, but only the embedding-extraction pretrained model is provided. Would you consider releasing the pretrained classifier as well?

Questions about SV verification results

Problem description

When running speaker verification, one recording contains speech while the other is almost silent (no speech). The score should fall below the 0.6 threshold, yet it comes out above 0.6. Is there a way to inspect the basis for the model's decision? And what is a reasonable value for the threshold in general?

Model used

damo/speech_campplus_sv_cn_cnceleb_16k

Result

{'score': 0.68535, 'text': 'yes'}
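
For context, embedding-based verification assumes both inputs contain speech; scores on silent or non-speech audio are not meaningful, so gating inputs with VAD before scoring is advisable. The decision threshold can also be passed explicitly. A sketch based on the pipeline usage shown later on this page (the wav paths are placeholders):

from modelscope.pipelines import pipeline

sv_pipeline = pipeline(
    task='speaker-verification',
    model='damo/speech_campplus_sv_cn_cnceleb_16k',
)
# thr sets the same-speaker decision threshold; raise it for stricter decisions.
result = sv_pipeline(['probe_a.wav', 'probe_b.wav'], thr=0.6)
print(result)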

Which version of FunASR does speaker-diarization require? Stage 6 produces no output

With the latest FunASR==1.0.4, I had to add model_revision and modify vad_pipeline(wpath), but Stage 6 then fails with the error below. Switching back to the older 0.8.8 does not work either.

Stage 1: Prepare input wavs...
--2024-01-30 18:07:32--  https://modelscope.cn/api/v1/models/damo/speech_eres2net-large_speaker-diarization_common/repo?Revision=master&FilePath=examples/example.wav
Resolving modelscope.cn (modelscope.cn)... 39.101.130.40
Connecting to modelscope.cn (modelscope.cn)|39.101.130.40|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 30720078 (29M) [application/octet-stream]
Saving to: 'examples/example.wav'

examples/example.wav                   100%[==========================================================================>]  29.30M  43.9MB/s  in 0.7s

2024-01-30 18:07:34 (43.9 MB/s) - 'examples/example.wav' saved [30720078/30720078]

--2024-01-30 18:07:34--  https://modelscope.cn/api/v1/models/damo/speech_eres2net-large_speaker-diarization_common/repo?Revision=master&FilePath=examples/example.rttm
Resolving modelscope.cn (modelscope.cn)... 39.101.130.40
Connecting to modelscope.cn (modelscope.cn)|39.101.130.40|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1329 (1.3K) [application/octet-stream]
Saving to: 'examples/example.rttm'

examples/example.rttm                  100%[==========================================================================>]   1.30K  --.-KB/s  in 0s

2024-01-30 18:07:34 (29.3 MB/s) - 'examples/example.rttm' saved [1329/1329]

Stage2: Do vad for input wavs...
2024-01-30 18:07:37,343 - modelscope - INFO - PyTorch version 2.0.0+cu118 Found.
2024-01-30 18:07:37,345 - modelscope - INFO - Loading ast index from /home/winner/.cache/modelscope/ast_indexer
2024-01-30 18:07:37,470 - modelscope - INFO - Loading done! Current index file version is 1.11.1, with md5 e4ea8cecd8079cde83f512df2bae21a7 and a total number of 956 components indexed
[2024-01-30 18:07:38,659] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Please install rotary_embedding_torch by: 
 pip install -U rotary_embedding_torch
Please install rotary_embedding_torch by: 
 pip install -U rotary_embedding_torch
Please install rotary_embedding_torch by: 
 pip install -U rotary_embedding_torch
Please install rotary_embedding_torch by: 
 pip install -U rotary_embedding_torch
2024-01-30 18:07:44,757 - modelscope - INFO - Use user-specified model revision: v2.0.4
2024-01-30 18:07:45,018 - modelscope - INFO - initiate model from /home/winner/.cache/modelscope/hub/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch
2024-01-30 18:07:45,018 - modelscope - INFO - initiate model from location /home/winner/.cache/modelscope/hub/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch.
2024-01-30 18:07:45,019 - modelscope - INFO - initialize model from /home/winner/.cache/modelscope/hub/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch
2024-01-30 18:07:49,164 - modelscope - WARNING - No preprocessor field found in cfg.
2024-01-30 18:07:49,164 - modelscope - WARNING - No val key and type key found in preprocessor domain of configuration.json file.
2024-01-30 18:07:49,164 - modelscope - WARNING - Cannot find available config to build preprocessor at mode inference, current config: {'model_dir': '/home/winner/.cache/modelscope/hub/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch'}. trying to build by task and model information.
2024-01-30 18:07:49,164 - modelscope - WARNING - No preprocessor key ('funasr', 'voice-activity-detection') found in PREPROCESSOR_MAP, skip building preprocessor.
[INFO]: Start computing VAD...
rtf_avg: 0.225: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.69s/it]
rtf_avg: 594.604: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.90s/it]
[INFO]: VAD json is prepared in exp/json/vad.json
Stage3: Prepare subsegments info...
[INFO]: Generate sub-segmetns...
[INFO]: Subsegments json is prepared in exp/json/subseg.json
Stage4: Extract speaker embeddings...
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
2024-01-30 18:08:21,239 - modelscope - INFO - PyTorch version 2.0.0+cu118 Found.
2024-01-30 18:08:21,241 - modelscope - INFO - Loading ast index from /home/winner/.cache/modelscope/ast_indexer
2024-01-30 18:08:21,262 - modelscope - INFO - PyTorch version 2.0.0+cu118 Found.
2024-01-30 18:08:21,264 - modelscope - INFO - Loading ast index from /home/winner/.cache/modelscope/ast_indexer
2024-01-30 18:08:21,274 - modelscope - INFO - PyTorch version 2.0.0+cu118 Found.
2024-01-30 18:08:21,275 - modelscope - INFO - Loading ast index from /home/winner/.cache/modelscope/ast_indexer
2024-01-30 18:08:21,362 - modelscope - INFO - PyTorch version 2.0.0+cu118 Found.
2024-01-30 18:08:21,363 - modelscope - INFO - Loading ast index from /home/winner/.cache/modelscope/ast_indexer
2024-01-30 18:08:21,382 - modelscope - INFO - PyTorch version 2.0.0+cu118 Found.
2024-01-30 18:08:21,384 - modelscope - INFO - Loading ast index from /home/winner/.cache/modelscope/ast_indexer
2024-01-30 18:08:21,386 - modelscope - INFO - PyTorch version 2.0.0+cu118 Found.
2024-01-30 18:08:21,388 - modelscope - INFO - Loading ast index from /home/winner/.cache/modelscope/ast_indexer
2024-01-30 18:08:21,394 - modelscope - INFO - Loading done! Current index file version is 1.11.1, with md5 e4ea8cecd8079cde83f512df2bae21a7 and a total number of 956 components indexed
2024-01-30 18:08:21,414 - modelscope - INFO - Loading done! Current index file version is 1.11.1, with md5 e4ea8cecd8079cde83f512df2bae21a7 and a total number of 956 components indexed
2024-01-30 18:08:21,430 - modelscope - INFO - Loading done! Current index file version is 1.11.1, with md5 e4ea8cecd8079cde83f512df2bae21a7 and a total number of 956 components indexed
2024-01-30 18:08:21,486 - modelscope - INFO - Loading done! Current index file version is 1.11.1, with md5 e4ea8cecd8079cde83f512df2bae21a7 and a total number of 956 components indexed
2024-01-30 18:08:21,502 - modelscope - INFO - Loading done! Current index file version is 1.11.1, with md5 e4ea8cecd8079cde83f512df2bae21a7 and a total number of 956 components indexed
2024-01-30 18:08:21,510 - modelscope - INFO - Loading done! Current index file version is 1.11.1, with md5 e4ea8cecd8079cde83f512df2bae21a7 and a total number of 956 components indexed
2024-01-30 18:08:21,716 - modelscope - INFO - PyTorch version 2.0.0+cu118 Found.
2024-01-30 18:08:21,718 - modelscope - INFO - Loading ast index from /home/winner/.cache/modelscope/ast_indexer
2024-01-30 18:08:21,829 - modelscope - INFO - Loading done! Current index file version is 1.11.1, with md5 e4ea8cecd8079cde83f512df2bae21a7 and a total number of 956 components indexed
2024-01-30 18:08:21,835 - modelscope - INFO - PyTorch version 2.0.0+cu118 Found.
2024-01-30 18:08:21,837 - modelscope - INFO - Loading ast index from /home/winner/.cache/modelscope/ast_indexer
2024-01-30 18:08:21,968 - modelscope - INFO - Loading done! Current index file version is 1.11.1, with md5 e4ea8cecd8079cde83f512df2bae21a7 and a total number of 956 components indexed
[2024-01-30 18:08:22,719] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-30 18:08:22,719] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-30 18:08:22,743] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-30 18:08:22,763] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-30 18:08:22,797] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-30 18:08:22,825] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-30 18:08:23,048] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-30 18:08:23,275] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
2024-01-30 18:08:32,879 - modelscope - INFO - Use user-specified model revision: v1.0.0
WARNING: The number of threads exceeds the number of files
WARNING: The number of threads exceeds the number of files
WARNING: The number of threads exceeds the number of files
[INFO] Start computing embeddings...
[INFO] Start computing embeddings...
WARNING: The number of threads exceeds the number of files
WARNING: The number of threads exceeds the number of files
[WARNING] Embeddings has been saved previously. Skip it.
[WARNING] Embeddings has been saved previously. Skip it.
WARNING: The number of threads exceeds the number of files
Stage5: Perform clustering and output sys rttms...
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[INFO] Start clustering...
[INFO] Start clustering...
[INFO] Start clustering...
WARNING: The number of threads exceeds the number of files
WARNING: The number of threads exceeds the number of files
WARNING: The number of threads exceeds the number of files
WARNING: The number of threads exceeds the number of files
WARNING: The number of threads exceeds the number of files
/home/winner/anaconda3/envs/py38-pt200/lib/python3.8/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  warnings.warn(
/home/winner/anaconda3/envs/py38-pt200/lib/python3.8/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  warnings.warn(
/home/winner/anaconda3/envs/py38-pt200/lib/python3.8/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  warnings.warn(
Stage6: Get the final metrics...
Computing DER...
2024-01-30 18:08:53,245 - INFO: Concatenating individual RTTM files...
2024-01-30 18:08:53,285 - INFO: MS: 2.069159, FA: 0.203668, SER: 0.000000, DER: 2.272828
Computing ACC...
error,there is no fileid_sys in ref rttm: output
seg pur error,there is no fileid_sys in ref rttm: %s output
eval_elems_seg error,there is no fileid_sys in ref rttm: %s output
All metrics have been done.

About VoxCeleb DINO

I have recently been revisiting your project. I trained DINO on VoxCeleb2, but the EER is only around 14% after the first few epochs. I am not sure whether this is normal; could you share a copy of your training log? Thank you very much.


Example cannot run

Stage5: Get the final metrics...
Refrttm.list is not detected. Can't calculate the result

Low GPU Training speed of CAM++?

Hello, thank you for open-sourcing the CAM++ model. The results are impressive!

I tried to train CAM++ but found it slightly slower than ResNet34; the same training config is used for both models (2x A100).
Interestingly, after exporting both models to ONNX and running them with onnxruntime on CPU, CAM++ is still about 3 times faster than ResNet34 (about 1/3 the RTF), consistent with the conclusion in your recent PR from 2023-04-20.

My question is: do you also observe that CAM++ trains more slowly than ResNet34, and how do you explain it? Lower inference RTF on CPU but lower training speed on GPU?

Inconsistent Performance and Loss when Resuming Training

Thank you for your excellent work. 🙂

We have observed that whenever we resume training with a different number of epochs after training completion, the loaded historical model exhibits significantly lower accuracy compared to the corresponding epoch during the original training. For instance, when loading a model trained for 100 epochs, its performance is only comparable to that of a model trained for 30 epochs.

This inconsistency in performance after resuming training poses a challenge for us to continue training from a checkpoint and obtain the desired results.

(training-curve screenshots attached in the original issue)

Transcription

Some audio clips in 3D-Speaker have no corresponding transcripts.

Fine-tuning

Hello, after training DINO I would like to fine-tune with labeled data. How do I load the previously trained .pth model?
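
A minimal sketch of loading a previously trained checkpoint before fine-tuning (the path and the key layout inside the checkpoint are assumptions; inspect your own .pth to see whether the weights sit at the top level or under a key such as 'model'):

import torch

# 'model' here stands for the embedding backbone you are fine-tuning
# (e.g. the same network class used during DINO training); constructing it
# is project-specific and omitted.
checkpoint = torch.load('exp/dino/models/your_checkpoint.pth', map_location='cpu')
state_dict = checkpoint.get('model', checkpoint) if isinstance(checkpoint, dict) else checkpoint
# strict=False tolerates extra heads (e.g. the DINO projection head) that
# the fine-tuning model does not have.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print('missing keys:', missing)
print('unexpected keys:', unexpected)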

DDP WARNING

Hello, and thank you very much for adding ECAPA-TDNN. While training the VoxCeleb ecapatdnn recipe in egs, I ran into the warning below and do not know what causes it.
(screenshot of the warning attached in the original issue)

Error loading the ERes2Net model

When loading the speech_eres2net_sv_zh-cn_16k-common model with torch.load, I get _pickle.UnpicklingError: invalid load key, '\x08'. Have you run into this before? Environment: Python 3.10.9, torch 1.12.1.
The same code loads the speech_campplus_sv_zh-cn_16k-common model without any problem.
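
"invalid load key" from torch.load usually means the file on disk is not a valid PyTorch checkpoint, for example an incomplete download or a Git LFS pointer file. A sketch of re-fetching the weights through the ModelScope hub before loading (the checkpoint filename inside the model repo is an assumption; list the downloaded directory to find the real one):

import os
import torch
from modelscope.hub.snapshot_download import snapshot_download

model_dir = snapshot_download('iic/speech_eres2net_sv_zh-cn_16k-common')
print(os.listdir(model_dir))  # locate the actual checkpoint file
ckpt_path = os.path.join(model_dir, 'pretrained_eres2net.ckpt')  # hypothetical name
state = torch.load(ckpt_path, map_location='cpu')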

Methods for fine-tuning of pretrained models in modelscope

Hello, thank you for the wonderful repository! It has really helped.
Currently, our team is trying to fine-tune the ERes2Net-200k model published on ModelScope using a large amount of speech data. Since I was not able to fine-tune it properly, I suspect several configuration parameters need to be modified for this task. Could you please share those details? If the fine-tuning succeeds with good results, I will share the methodology with the community.

EResNet result on VoxCeleb is not comparable

I ran the exact same script for the EResNet experiment on VoxCeleb. The EER and minDCF I got are 1.0105 and 0.1146, which do not match the paper. The only difference is that I trained the model on 4 A100 machines, but I doubt that is the reason. Could you please provide the train.log and train_epoch.log files?

I also noticed that in prepare_data_csv.csv the default segment duration is 4 seconds, but in conf/eres2net.yaml it is 3 seconds. May I ask why?

Error occurred during "bash run.sh" for speaker diarization

Hi, my name is Nathan. I am trying to run 3D-Speaker to obtain an RTTM from a pretrained model on ModelScope,
but I get the error below.

(3D-Speaker) [asr@0419bb3cf325 speaker-diarization]$ bash run.sh
Stage 1: Prepare input wavs...
--2024-02-05 09:07:39-- https://modelscope.cn/api/v1/models/damo/speech_eres2net-large_speaker-diarization_common/repo?Revision=master&FilePath=examples/2speakers_example.wav
Resolving modelscope.cn (modelscope.cn)... 39.101.130.40
Connecting to modelscope.cn (modelscope.cn)|39.101.130.40|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2528044 (2.4M) [application/octet-stream]
Saving to: 'examples/2speakers_example.wav'

100%[===========================================================================>] 2,528,044 831KB/s in 3.0s

2024-02-05 09:07:43 (831 KB/s) - 'examples/2speakers_example.wav' saved [2528044/2528044]

--2024-02-05 09:07:43-- https://modelscope.cn/api/v1/models/damo/speech_eres2net-large_speaker-diarization_common/repo?Revision=master&FilePath=examples/2speakers_example.rttm
Resolving modelscope.cn (modelscope.cn)... 39.101.130.40
Connecting to modelscope.cn (modelscope.cn)|39.101.130.40|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 380 [application/octet-stream]
Saving to: 'examples/2speakers_example.rttm'

100%[===========================================================================>] 380 --.-K/s in 0s

2024-02-05 09:07:44 (40.0 MB/s) - 'examples/2speakers_example.rttm' saved [380/380]

Stage2: Do vad for input wavs...
2024-02-05 09:07:46,885 - modelscope - INFO - PyTorch version 1.13.1 Found.
2024-02-05 09:07:46,886 - modelscope - INFO - Loading ast index from /home/asr/.cache/modelscope/ast_indexer
2024-02-05 09:07:47,056 - modelscope - INFO - Updating the files for the changes of local files, first time updating will take longer time! Please wait till updating done!
2024-02-05 09:07:47,083 - modelscope - INFO - AST-Scanning the path "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/modelscope" with the following sub folders ['models', 'metrics', 'pipelines', 'preprocessors', 'trainers', 'msdatasets', 'exporters']
2024-02-05 09:08:18,037 - modelscope - INFO - Scanning done! A number of 964 components indexed or updated! Time consumed 30.954344987869263s
2024-02-05 09:08:18,114 - modelscope - INFO - Loading done! Current index file version is 1.12.0, with md5 ccb085697b83dbefd09232fac3402a63 and a total number of 964 components indexed
Please install rotary_embedding_torch by:
pip install -U rotary_embedding_torch
Please install rotary_embedding_torch by:
pip install -U rotary_embedding_torch
Please Requires the ffmpeg CLI and ffmpeg-python package to be installed.
Please install rotary_embedding_torch by:
pip install -U rotary_embedding_torch
Please install rotary_embedding_torch by:
pip install -U rotary_embedding_torch
2024-02-05 09:08:22,477 - modelscope - WARNING - Model revision not specified, use revision: v2.0.4
2024-02-05 09:08:22,825 - modelscope - INFO - initiate model from /home/asr/.cache/modelscope/hub/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch
2024-02-05 09:08:22,826 - modelscope - INFO - initiate model from location /home/asr/.cache/modelscope/hub/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch.
2024-02-05 09:08:22,827 - modelscope - INFO - initialize model from /home/asr/.cache/modelscope/hub/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch
2024-02-05 09:08:22,874 - modelscope - WARNING - No preprocessor field found in cfg.
2024-02-05 09:08:22,875 - modelscope - WARNING - No val key and type key found in preprocessor domain of configuration.json file.
2024-02-05 09:08:22,875 - modelscope - WARNING - Cannot find available config to build preprocessor at mode inference, current config: {'model_dir': '/home/asr/.cache/modelscope/hub/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch'}. trying to build by task and model information.
2024-02-05 09:08:22,875 - modelscope - WARNING - No preprocessor key ('funasr', 'voice-activity-detection') found in PREPROCESSOR_MAP, skip building preprocessor.
2024-02-05 09:08:22,876 - modelscope - INFO - cuda is not available, using cpu instead.
[INFO]: Start computing VAD...
rtf_avg: 0.043: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.22it/s]
Traceback (most recent call last):
File "local/voice_activity_detection.py", line 90, in <module>
main()
File "local/voice_activity_detection.py", line 71, in main
for vad_t in vad_time['text']:
TypeError: list indices must be integers or slices, not str

If I print vad_time, I get:
[{'key': 'rand_key_2yW4Acq9GFz6Y', 'value': [[5240, 29010], [29290, 37360], [37640, 67570], [67860, 78980]]}]

I do not understand where the 'text' field is supposed to come from.
Please check this problem.
Thank you.
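
Judging from the printed structure, newer FunASR versions return a list of {'key': ..., 'value': [[start_ms, end_ms], ...]} dicts instead of a dict with a 'text' field. A sketch of adapting the loop to the format shown above (field names taken from the printed output; exact return formats vary across FunASR versions):

# structure printed above
vad_time = [{'key': 'rand_key_2yW4Acq9GFz6Y',
             'value': [[5240, 29010], [29290, 37360], [37640, 67570], [67860, 78980]]}]

for item in vad_time:
    for start_ms, end_ms in item['value']:
        # segment boundaries are in milliseconds
        print(start_ms / 1000.0, end_ms / 1000.0)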

Preparing CNCeleb wavs

While preparing CNCeleb, I found that the flac2wav step has no flac2wav.py under local/. When I used the one from sv-ecapa instead, some FLAC files failed to convert to WAV.

On the selection of num_of_spk in speaker-diarization

The spectral clustering in speakerlab/process/cluster.py uses the following code to estimate the number of speakers:

lambda_gap_list = self.getEigenGaps(
                lambdas[self.min_num_spks - 1:self.max_num_spks + 1])
num_of_spk = np.argmax(lambda_gap_list) + self.min_num_spks

But other related projects use the following code to estimate the number of speakers:

num_spks = num_spks if num_spks is not None \
                else cp.argmax(cp.diff(eig_values[:max_num_spks + 1])) + 1
num_spks = max(num_spks, min_num_spks)

# another
lambda_gap_list = self.getEigenGaps(lambdas[1 : self.max_num_spkrs])

num_of_spk = (
    np.argmax(
        lambda_gap_list[
            : min(self.max_num_spkrs, len(lambda_gap_list))
        ]
    )
    if lambda_gap_list
    else 0
) + 2

I would like to know the theoretical basis for this design. If the amount of speech per speaker is uneven, for example when one speaker speaks very little, is this estimate still valid? Could you point to relevant references? Thank you in advance for your answer.
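
For reference, a self-contained sketch of the eigen-gap heuristic that the first snippet implements: the estimated speaker count is where the gap between consecutive Laplacian eigenvalues, restricted to [min_num_spks, max_num_spks], is largest. The affinity construction and bounds here are illustrative, not the project's exact pipeline.

import numpy as np

def estimate_num_spks(embeddings, min_num_spks=1, max_num_spks=10):
    # cosine affinity between length-normalized embeddings
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = np.clip(x @ x.T, 0.0, 1.0)
    # unnormalized graph Laplacian
    laplacian = np.diag(affinity.sum(axis=1)) - affinity
    eigvals = np.sort(np.linalg.eigvalsh(laplacian))
    # gaps between consecutive eigenvalues inside the allowed range
    gaps = np.diff(eigvals[min_num_spks - 1:max_num_spks + 1])
    return int(np.argmax(gaps)) + min_num_spks

When one speaker contributes very few segments, the corresponding eigen-gap can be small, which is one reason implementations differ in the index range they search over; whether the estimate stays reliable in that regime is exactly the question raised above.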

About the learning rate

After switching to your cosine schedule, I see the learning rate increase every epoch. Shouldn't it decay from 0.2 down to 0.00005 under a cosine schedule?
(learning-rate curve screenshot attached in the original issue)
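
One common explanation (an assumption here, since the scheduler configuration is not shown) is a warmup phase: many cosine schedules first ramp the learning rate up linearly for a number of steps or epochs before the cosine decay toward the minimum begins, so an increasing LR early in training can be expected. A minimal sketch of such a warmup-plus-cosine rule:

import math

def lr_at(step, total_steps, warmup_steps, max_lr=0.2, min_lr=5e-5):
    if step < warmup_steps:
        # linear warmup: the LR increases during the first warmup_steps
        return max_lr * (step + 1) / warmup_steps
    # cosine decay from max_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

If the LR keeps rising well past the configured warmup length, the schedule is likely misconfigured, e.g. a warmup set in epochs but stepped per batch.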

About the result

I reproduced the code using the DINO framework with multi-crop (two 2 s local crops, one 3 s global crop) and ECAPA (512), without RDINO. The final result is 5.0. Is that a normal result?

The example compares whether two recordings come from the same person; how do I compare against an audio library containing many people?

from modelscope.pipelines import pipeline
sv_pipeline = pipeline(
    task='speaker-verification',
    model='damo/speech_campplus_sv_zh-cn_16k-common',
    model_revision='v1.0.0'
)
speaker1_a_wav = 'https://modelscope.cn/api/v1/models/damo/speech_campplus_sv_zh-cn_16k-common/repo?Revision=master&FilePath=examples/speaker1_a_cn_16k.wav'
speaker1_b_wav = 'https://modelscope.cn/api/v1/models/damo/speech_campplus_sv_zh-cn_16k-common/repo?Revision=master&FilePath=examples/speaker1_b_cn_16k.wav'
speaker2_a_wav = 'https://modelscope.cn/api/v1/models/damo/speech_campplus_sv_zh-cn_16k-common/repo?Revision=master&FilePath=examples/speaker2_a_cn_16k.wav'
# same-speaker recordings
result = sv_pipeline([speaker1_a_wav, speaker1_b_wav])
print(result)
# different-speaker recordings
result = sv_pipeline([speaker1_a_wav, speaker2_a_wav])
print(result)
# a custom score threshold can be set; the higher the threshold, the stricter the same-speaker decision
result = sv_pipeline([speaker1_a_wav, speaker2_a_wav], thr=0.31)
print(result)
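
A sketch of one way to scale this to a many-speaker library: extract one embedding per enrolled speaker (how you obtain embeddings is an assumption here, e.g. via speakerlab/bin/infer_sv.py or your own wrapper) and score a probe against all of them with cosine similarity:

import numpy as np

def best_match(probe_emb, gallery, thr=0.31):
    # gallery: dict mapping speaker_id -> 1-D numpy embedding
    probe = probe_emb / np.linalg.norm(probe_emb)
    best_id, best_score = None, -1.0
    for spk_id, emb in gallery.items():
        score = float(probe @ (emb / np.linalg.norm(emb)))  # cosine similarity
        if score > best_score:
            best_id, best_score = spk_id, score
    # below the threshold, report the probe as an unknown speaker
    return (best_id, best_score) if best_score >= thr else (None, best_score)

For very large galleries, stack the embeddings into one matrix and replace the Python loop with a single matrix-vector product (or an approximate nearest-neighbor index).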
