
ThaiLMCut - Word Tokenizer for the Thai Language based on Transfer Learning and a Bidirectional LSTM

About

  • The tokenizer utilizes transfer learning from a character language model trained on a large Thai hotel review corpus and InterBEST2009.
  • At the moment, the tokenizer supports only Thai text. Text that includes English characters or special symbols will not be tokenized correctly, since the model was trained exclusively on Thai text (without any spaces, special symbols, or digits).
  • We will soon release a model that supports those characters as well.
  • Try ThaiLMCut in Colab
  • Paper: ThaiLMCut: Unsupervised Pretraining for Thai Word Segmentation
  • bibtex

  • An example input from a hotel review

Requirements

  • Python 3.5+
  • PyTorch 1.0+
  • numpy

How to set up

Download the weight file from:

https://drive.google.com/file/d/1e39tNMfUFzYQ4MDHTMyNWfNUxu9RoaTA/view?usp=sharing

Move the weight file to this path:

lmcut/weight/.

Create a package wheel using:

python setup.py bdist_wheel

Install the package using:

pip install dist/lmcut*
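
To verify that the installation succeeded, a quick one-line check (it simply calls the tokenize function described in the next section; the sample sentence is only an illustration):

python -c "from lmcut import tokenize; print(tokenize('โรงแรมดีมาก'))"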

How to use

Tokenize a given Thai text

After importing the package, you can tokenize any Thai text by using:

from lmcut import tokenize
text = "โรงแรมดี สวยงามน่าอยู่มากๆ"
result = tokenize(text)
print(result)

The result will be a list of tokens:

['โรง', 'แรม', 'ดี', 'สวยงาม', 'น่า', 'อยู่', 'มาก', 'ๆ']
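
If you need a single segmented string rather than a list, you can simply join the tokens. The "|" separator below is only an illustrative choice; it matches the boundary marker used by the InterBEST2009 corpus in the training sections:

from lmcut import tokenize

# Tokenize and join the tokens with "|" to get one segmented string.
# The separator is an arbitrary choice for illustration.
text = "โรงแรมดี สวยงามน่าอยู่มากๆ"
segmented = "|".join(tokenize(text))
print(segmented)  # โรง|แรม|ดี|สวยงาม|น่า|อยู่|มาก|ๆ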

Train a language model

  • Define the training and development datasets in train/get_corpus_lm.py
  • The input data can be any text; an example input can be found in data/TEST_100K.txt
  • If you use InterBEST2009, the boundary markers must be removed (see train/get_corpus.py; a minimal sketch of this step follows the training command below)

To train a new language model, you could run:
python train/LanguageModel.py --dataset [dataset name] --batchSize 60  --char_dropout_prob 0.01  --char_embedding_size 200   --hidden_dim 500  --layer_num 3  --learning_rate 0.0001 --sequence_length 100  --epoch 20 --len_lines_per_chunk 1000 --optim [adam or sgd] --lstm_num_direction [2 for bidirectional LSTM]  --add_note "..add some note.."
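
As a rough sketch of the boundary-marker removal mentioned above (the actual dataset preparation lives in train/get_corpus_lm.py and train/get_corpus.py; the file paths below are placeholders for illustration only):

# Sketch: strip the "|" word-boundary markers from an InterBEST2009 file
# so that the raw text can be used for language model training.
# Input/output paths are placeholders, not files created by the repo.
with open("data/news_00001.txt", encoding="utf-8") as f:
    raw = f.read()

lm_text = raw.replace("|", "")  # remove the boundary markers

with open("data/news_00001_lm.txt", "w", encoding="utf-8") as f:
    f.write(lm_text)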

To resume the training of a language model, you could run:

python train/LanguageModel.py --load_from [model name] --dataset [dataset name] --learning_rate 0.0001 --epoch 20 --optim [adam or sgd]

Train a new tokenizer

  • The expected input is InterBEST2009 or any corpus with the boundary marker | (see the example below)
  • Define the train, dev, and test datasets in train/get_corpus.py
  • An example of the input text can be found in data/news_00001.txt
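
For illustration, a line of boundary-marked input looks like this (using the example sentence from the usage section above; the exact segmentation is only indicative):

โรง|แรม|ดี|สวยงาม|น่า|อยู่|มาก|ๆ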

To train a new tokenizer, you could run:

python Tokenizer.py --epoch 5 --lstm_num_direction 2 --batchSize 30 --sequence_length 80 --char_embedding_size 100 --hidden_dim 60 --layer_num 2 --optim [adam or sgd] --learning_rate 0.0001

To transfer the embedding layer and recurrent layer of a pre-trained language model, you could run:

python Tokenizer.py --load_from [language model name] --epoch 5  --learning_rate 0.0001

To resume the training of a tokenizer, you could run:

python Tokenizer.py --load_from [tokenizer name] --epoch 5  --learning_rate 0.0001 
  • Use --over_write 1 if you want to overwrite the weights of the resumed model (see the example at the end of this section)

  • With --over_write 0, the trained model will be saved as a new model

  • For more details about other arguments, see train/Tokenizer.py and train/LanguageModel.py

  • data/news_00001.txt and data/TEST_100K.txt are from the InterBEST2009 corpus, which is publicly available from NECTEC
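
For example, to resume a tokenizer and overwrite its existing weights, you could combine the resume command with the --over_write flag described above (the model name is a placeholder):

python Tokenizer.py --load_from [tokenizer name] --epoch 5 --learning_rate 0.0001 --over_write 1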

Credits

Acknowledgements

The project is funded by TrustYou. The author would like to sincerely thank TrustYou and the other contributors to this project.

Contributors

  • Suteera Seeha
  • Ivan Bilan
  • Liliana Mamani Sanchez
  • Johannes Huber
  • Michael Matuschek

License

All original code in this project is licensed under the MIT License. See the included LICENSE file.
