undertheseanlp / word_tokenize

Vietnamese Word Tokenize

Python 100.00%

natural-language-processing nlp vietnamese vietnamese-nlp word-segmentation

word_tokenize's Issues

Add more training data to current model.

I'm currently trying to reproduce your workflow. As far as I know, the word_tokenize module of underthesea uses only VLSP2016 as training data. I think training on more data may help improve the model.
For the time being, we can incorporate the VLSP2013 data into the existing set.
Here are the steps I took to preprocess the data:
+) In the repository, for VLSP2016 you took the data in CoNLL format (kindly correct me if I'm wrong), then used preprocessing_vlsp2016.py to convert it to IOB2 format, which produces three files in the corpus folder: train, dev, and test. I'll use the same approach here. Since VLSP2013 uses a different data format, I wrote a separate script to transform it into the same IOB2 format.
+) Then I would concatenate train2013 + train2016 -> train, dev2016 -> dev, and test2013 + test2016 -> test.
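
The two preprocessing steps above can be sketched roughly as follows. This is only an illustration, not the script mentioned in the issue: it assumes the VLSP2013 files hold one sentence per line, with words separated by spaces and the syllables of a multi-syllable word joined by underscores. The function names are hypothetical.

```python
from typing import Iterable, List


def sentence_to_iob2(sentence: str) -> List[str]:
    """Convert one underscore-segmented sentence to 'syllable TAG' lines.

    Assumed input format: words separated by spaces, syllables of one
    word joined by '_' (e.g. 'cà_phê việt .').
    """
    lines = []
    for word in sentence.split():
        # Keep a bare '_' token as-is instead of dropping it.
        syllables = [s for s in word.split("_") if s] or [word]
        for i, syllable in enumerate(syllables):
            tag = "B-W" if i == 0 else "I-W"  # first syllable begins a word
            lines.append(f"{syllable} {tag}")
    return lines


def corpus_to_iob2(sentences: Iterable[str]) -> str:
    """Convert many sentences, separating them with one blank line."""
    blocks = ["\n".join(sentence_to_iob2(s)) for s in sentences if s.strip()]
    return "\n\n".join(blocks) + "\n"


def concatenate_iob2(*corpora: str) -> str:
    """Concatenate IOB2 corpora with exactly one blank line between them."""
    return "\n\n".join(c.strip() for c in corpora if c.strip()) + "\n"
```

With these helpers, the combined training set would be `concatenate_iob2(train2013, train2016)`, where both arguments are the IOB2 text of the respective split.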
Although a train/dev/test evaluation is a good strategy here, I think we would be better off using grid search with cross-validation to tune the model's hyper-parameters.
Please share your thoughts here.
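
To make the grid search suggestion concrete, here is a minimal standard-library sketch of k-fold cross-validation over a parameter grid. It is not tied to the repository's training code: `train_and_score` is a hypothetical callable you would supply yourself (it trains on one list of sentences, evaluates on another, and returns a score where higher is better), and the parameter names are whatever your trainer accepts.

```python
from itertools import product
from statistics import mean
from typing import Any, Callable, Dict, List, Sequence, Tuple


def k_fold_splits(n_items: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, validation_indices) pairs for k interleaved folds."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    splits = []
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, held_out))
    return splits


def grid_search_cv(
    sentences: Sequence[Any],
    param_grid: Dict[str, List[Any]],
    train_and_score: Callable[[List[Any], List[Any], Dict[str, Any]], float],
    k: int = 5,
) -> Tuple[Dict[str, Any], float]:
    """Try every parameter combination and keep the best mean fold score."""
    best_params: Dict[str, Any] = {}
    best_score = float("-inf")
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [
            train_and_score(
                [sentences[i] for i in train_idx],
                [sentences[i] for i in val_idx],
                params,
            )
            for train_idx, val_idx in k_fold_splits(len(sentences), k)
        ]
        score = mean(scores)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Under this scheme the dev split is no longer needed for tuning, so it could be folded into the training data while the test split stays held out for the final evaluation.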

Training custom data not working

I am trying to train a model on custom input data.

Just a simple text file containing:

custom_train_data.txt

capuchino B-W
.   B-W

cà B-W
phê I-W
việt I-W
.   B-W

macchiato B-W
.   B-W

trà B-W
đào I-W
.   B-W

bánh B-W
ngọt I-W
.   B-W

latte B-W
.   B-W

cà B-W
phê I-W
ý I-W
.   B-W

capuchino B-W
.   B-W

cà B-W
phê I-W
việt I-W
.   B-W

macchiato B-W
.   B-W

trà B-W
đào I-W
.   B-W

bánh B-W
ngọt I-W
.   B-W

latte B-W
.   B-W

cà B-W
phê I-W
ý I-W
.   B-W

bún B-W
huế I-W
.   B-W

bánh B-W
huế I-W
.   B-W

chè B-W
huế I-W
.   B-W

cuốn B-W
huế I-W
.   B-W

cơm B-W
hến I-W
.   B-W

bánh B-W
canh I-W
.   B-W

Then I ran the training script as follows:

python train.py --train custom_train_data.txt

This generated the model.bin file without errors.

However, when I copied the model into underthesea at underthesea/underthesea/word_tokenize/model_9.bin (I enabled debugging and made sure the right model was being used) and tried to tokenize a string with it, it did not work: the syllables "cơm" and "hến" were not merged into one word.

>> underthesea.word_tokenize('cơm hến tại nhà hàng Việt')
['cơm', 'hến', 'tại', 'nhà', 'hàng', 'Việt']

So what do you think is the problem here?
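
One thing that can be ruled out independently of the model is the training file itself. The sketch below parses IOB2 text like the file above and merges the syllables back into words, so you can check that the labels decode to the segmentation you expect. It is a standalone check with hypothetical function names, not part of underthesea, and it does not say why the trained model behaves as it does.

```python
from typing import List, Tuple

TaggedSentence = List[Tuple[str, str]]  # [(syllable, tag), ...]


def read_iob2(text: str) -> List[TaggedSentence]:
    """Parse IOB2 text into sentences; blank lines separate sentences."""
    sentences: List[TaggedSentence] = []
    current: TaggedSentence = []
    for line in text.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        # split() accepts any run of spaces or tabs between the columns
        syllable, tag = line.split()
        current.append((syllable, tag))
    if current:
        sentences.append(current)
    return sentences


def merge_words(tagged: TaggedSentence) -> List[str]:
    """Join each B-W syllable with the I-W syllables that follow it."""
    words: List[str] = []
    for syllable, tag in tagged:
        if tag == "I-W" and words:
            words[-1] += " " + syllable
        else:
            words.append(syllable)
    return words
```

For example, `merge_words(read_iob2(open("custom_train_data.txt", encoding="utf-8").read())[0])` decodes the first sentence of the file.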
