Transformer Based Korean Sentence Spacing Corrector
This solution is made available under the Apache 2.0 license. See the LICENSE file.
It is recommended that you run the training on a machine with an NVIDIA GPU, with drivers and CUDA installed.
- Clone this repo and cd into it.
- Install dependencies, preferably in a virtual env.
a. Optional: Create new virtual env. Conda example below.
conda create --name TKOrrector python=3.9 -y
conda activate TKOrrector
b. Install PyTorch with CUDA
conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia
or
b. Install PyTorch without GPU
conda install pytorch torchvision torchaudio cpuonly -c pytorch
c. Install dependencies
pip install -r requirements.txt
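After installing, it can help to confirm which PyTorch build ended up in the environment. The following is a minimal sketch, not part of this repo: the `check_torch` helper name is hypothetical, and it assumes `python` resolves to the interpreter of the active environment.

```shell
# Hypothetical helper: print the installed PyTorch version and whether
# PyTorch can see a CUDA device. Falls back to a message if either
# python or torch is missing.
check_torch() {
    python -c 'import torch; print(torch.__version__, torch.cuda.is_available())' 2>/dev/null \
        || echo "PyTorch check failed (python or torch not available)"
}

check_torch
```

If the second value printed is `False` on a GPU machine, the CPU-only build was probably installed, or the driver is not visible to PyTorch.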
You can run the pretrained model without training.
- Download the pretrained model and extract it into the current directory (`tar zxvf TKOrrector.tar.gz`).
- Run the demo:
sh demo.sh
OR
sh demo_realtime.sh
for real-time correction instead of waiting for the Enter key to be pressed (a GPU is required).
OR
- Run a pre-packaged container with the pretrained model already downloaded.
- docker run --gpus all -it paulhkim80/tkorrector
OR (to revert to the non-real-time, non-GPU-accelerated mode)
- docker run -e "STARTUP_CMD=demo.sh" -it paulhkim80/tkorrector
Note: paulhkim80/tkorrector:0.2 and later (including latest) use real-time correction as the startup script. For the best experience, run the image on a node with an NVIDIA GPU (--gpus all).
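Since the real-time demo needs a GPU and the Enter-key demo does not, the choice between the two scripts can be automated. This is a rough sketch under assumptions: the `pick_demo` helper is hypothetical, and it treats a working `nvidia-smi` as a proxy for a usable GPU, which does not guarantee that PyTorch itself can use CUDA.

```shell
# Hypothetical helper: choose the demo script based on whether the
# NVIDIA driver utility runs successfully on this machine.
pick_demo() {
    if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
        echo "demo_realtime.sh"   # real-time mode requires a GPU
    else
        echo "demo.sh"            # Enter-key mode also works on CPU
    fi
}

DEMO="$(pick_demo)"
echo "Selected: $DEMO"
# sh "$DEMO"   # uncomment to launch from the repository root
```

The same check could decide between the two `docker run` variants above.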
Example demo run screen and results.
- Go to the NIKL Corpus Download Site and apply for a new license.
The corpus is free, but you need to sign an agreement. It is recommended that you upload the corpus file to an object store such as GCS, so that additional machines (for example, a GCP GCE VM with a GPU that you rent only for training) can download it quickly without a large upfront cost. Edit src/download_corpus.sh so that it downloads the corpus file and expands it into the designated directory.
cd src
sh download_corpus.sh
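Staging the corpus on GCS might look roughly like the sketch below. The bucket name, archive name, and the `run`, `upload_corpus`, and `fetch_corpus` helpers are all hypothetical, and this is not necessarily how `download_corpus.sh` works. It only assumes the `gsutil cp` command from the Google Cloud SDK. By default it prints the commands instead of running them.

```shell
# Hypothetical names: replace with your own bucket and corpus archive.
BUCKET="gs://my-tkorrector-bucket"
ARCHIVE="nikl_corpus.tar.gz"

# Print the command when DRY_RUN=1 (the default), execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# One-time upload from the machine that holds the licensed download.
upload_corpus() { run gsutil cp "$ARCHIVE" "$BUCKET/$ARCHIVE"; }

# Fetch onto each new training VM.
fetch_corpus() { run gsutil cp "$BUCKET/$ARCHIVE" "$ARCHIVE"; }

upload_corpus
fetch_corpus
```

Set `DRY_RUN=0` once the names point at a real bucket and file.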
Change lines 51 and 53 in prepare_corpus_with_tokenizer.sh to increase the training dataset size.
The second argument is the number of files to include in the training set, plus 1.
`get_corpus "../data/$CORPUS1/*" 10`
The above command would include 9 files (the manual PDF file is skipped) from the Newspaper corpus.
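The off-by-one in the second argument is easy to get wrong, so a small helper can do the arithmetic. This is an illustrative sketch: `get_corpus_limit` is a hypothetical name, and it only encodes the "file count plus 1" rule described above, not the internals of `get_corpus`.

```shell
# Hypothetical helper: convert the desired number of corpus files into
# the second argument of get_corpus (file count + 1).
get_corpus_limit() {
    echo $(( $1 + 1 ))
}

N_FILES=9
# Print the line to place in prepare_corpus_with_tokenizer.sh.
printf 'get_corpus "../data/$CORPUS1/*" %s\n' "$(get_corpus_limit "$N_FILES")"
# prints: get_corpus "../data/$CORPUS1/*" 10
```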
- Run the data preparation command.
sh prepare_corpus_with_tokenizer.sh
- Run the training command.
sh train.sh
- After training is done, evaluate the model on the test dataset with batch translations by running the command below.
sh calculate_metrics.sh