This project is a modified version of ASR for Chinese (https://github.com/CynthiaSuwi/ASR-for-Chinese-Pipeline). That project mainly targets Mandarin. This project reuses its pipeline but takes its data from Mozilla Common Voice's Hong Kong Cantonese dataset (https://commonvoice.mozilla.org/en/datasets, zh-HK_100h_2020-12-11), together with corpus information from PyCantonese (https://pycantonese.org/searches.html). Training is therefore based on a Cantonese corpus and dataset.
-
Setup:
System: Ubuntu 20.04, with GPU hardware.
GPU driver (as reported by nvidia-smi): NVIDIA-SMI 460.80, Driver Version 460.80, CUDA Version 11.2
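Python 3.6: install it with "sudo apt-get install python3.6" (if the package is not found on Ubuntu 20.04, it may have to come from an additional repository such as the deadsnakes PPA).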
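
Before continuing, it can help to confirm that the interpreter and the GPU driver are visible. The following is a minimal sketch and not part of this repository; the function name check_environment is hypothetical, and it assumes nvidia-smi is on the PATH when a driver is installed.

```python
import shutil
import subprocess
import sys


def check_environment(required=(3, 6)):
    """Report the Python version and whether nvidia-smi can see a GPU."""
    report = {
        "python_version": "{}.{}".format(*sys.version_info[:2]),
        "python_ok": sys.version_info[:2] == tuple(required),
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
        "gpu_summary": None,
    }
    if report["nvidia_smi_found"]:
        try:
            # Ask the driver for the GPU name and driver version only.
            result = subprocess.run(
                ["nvidia-smi", "--query-gpu=name,driver_version",
                 "--format=csv,noheader"],
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE,
                universal_newlines=True,
                timeout=30,
            )
            if result.returncode == 0:
                report["gpu_summary"] = result.stdout.strip()
        except (OSError, subprocess.SubprocessError):
            # Leave gpu_summary as None if the query could not be run.
            pass
    return report


print(check_environment())
```

If "python_ok" is False or "gpu_summary" is None, the setup above is probably incomplete.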
-
Clone the source code: "git clone https://github.com/kathykyt/cantonese_ASR.git"
-
Create a Python virtual environment: "cd cantonese_ASR", then run "virtualenv -p /usr/bin/python3.6 venv"
-
Activate the Python virtual environment: "source venv/bin/activate"
-
Install required packages: "pip install -r requirements.txt"
-
Visit https://commonvoice.mozilla.org/en/datasets and download the Cantonese dataset, zh-HK_100h_2020-12-11. The downloaded file is zh-HK.tar.gz. Copy it into the cantonese_ASR/dataset/ directory with "cp zh-HK.tar.gz {your top directory}/cantonese_ASR/dataset/"
-
Extract the file: "tar xvf zh-HK.tar.gz"
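
The same extraction can be done from Python, which also makes it easy to see which metadata files the archive contains. This is a minimal sketch, not part of the repository. The helper name extract_dataset is hypothetical, and the archive's internal layout is not assumed here; the function only lists whatever .tsv files it finds. Use it only on archives you trust, since tarfile.extractall writes whatever paths the archive contains.

```python
import tarfile
from pathlib import Path


def extract_dataset(archive, dest):
    """Extract a .tar.gz archive into dest and return the .tsv files found."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(archive), "r:gz") as tar:
        tar.extractall(str(dest))
    # Paths are returned relative to dest, with forward slashes.
    return sorted(p.relative_to(dest).as_posix() for p in dest.rglob("*.tsv"))
```

For example, extract_dataset("zh-HK.tar.gz", ".") run from cantonese_ASR/dataset/ would unpack the archive in place and return the list of .tsv files.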
-
Prepare the WAV files for training and testing. The Common Voice clips are MP3, so they have to be converted to .wav files. From cantonese_ASR/dataset/, run "./convert_to_mp3.py", then run "./convert_to_mp3_test.py" (despite their names, these are the scripts that produce the .wav files).
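
For readers who want to see what such a conversion involves, here is a minimal sketch. It is not the repository's convert_to_mp3.py. It assumes the ffmpeg command-line tool is installed, and the 16 kHz mono output format is an assumption rather than something this README specifies. The function names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_ffmpeg_command(mp3_path, wav_dir, sample_rate=16000):
    """Build the ffmpeg argument list that converts one MP3 to a mono WAV."""
    mp3_path = Path(mp3_path)
    wav_path = Path(wav_dir) / (mp3_path.stem + ".wav")
    return [
        "ffmpeg", "-y",           # overwrite an existing output file
        "-i", str(mp3_path),      # input clip
        "-ar", str(sample_rate),  # output sample rate (assumed value)
        "-ac", "1",               # one audio channel (assumed)
        str(wav_path),
    ]


def convert_directory(mp3_dir, wav_dir, sample_rate=16000):
    """Convert every .mp3 directly under mp3_dir; return how many were done."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    Path(wav_dir).mkdir(parents=True, exist_ok=True)
    converted = 0
    for mp3_path in sorted(Path(mp3_dir).glob("*.mp3")):
        subprocess.run(
            build_ffmpeg_command(mp3_path, wav_dir, sample_rate),
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        converted += 1
    return converted
```

Keeping the command construction separate from the loop makes it possible to inspect the exact ffmpeg invocation before converting the full set of clips.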
-
The trained model files are written under model_speech/, so create the directory m251 there: from model_speech/, run "mkdir m251"
-
To start the training, cd cantonese_ASR and run "python train_mspeech.py". Remember to activate the Python virtual environment before issuing the command.
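
Two easy mistakes at this point are launching training outside the virtual environment and forgetting the model_speech/m251 directory. The following pre-flight check is a minimal sketch and not part of the repository; both helper names are hypothetical.

```python
import sys
from pathlib import Path


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; newer ones and venv
    # make sys.prefix differ from sys.base_prefix.
    base = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base


def ensure_model_dir(repo_root, name="m251"):
    """Create model_speech/<name> under repo_root if missing and return its path."""
    model_dir = Path(repo_root) / "model_speech" / name
    model_dir.mkdir(parents=True, exist_ok=True)
    return model_dir
```

Calling ensure_model_dir(".") from the repository root and checking that in_virtualenv() returns True before running train_mspeech.py covers both points.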
-
Please be patient: training is very slow, even with a GPU.