niger-volta-lti / yoruba-adr
Automatic Diacritic Restoration of Yorùbá language Text
License: MIT License
Fix the dropout option warning, especially as it's distracting and muddies the output of the Jupyter prediction notebook.
/Users/iroro/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py:46:
UserWarning: dropout option adds dropout after all but last recurrent layer,
so non-zero dropout expects num_layers greater than 1,
but got dropout=0.3 and num_layers=1 "num_layers={}".format(dropout, num_layers))
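A minimal sketch (not the repo's exact code) of the usual fix: only pass a non-zero dropout to the LSTM constructor when there is more than one layer, since PyTorch applies recurrent dropout between layers.

```python
import torch.nn as nn

num_layers, dropout = 1, 0.3
rnn = nn.LSTM(
    input_size=500,
    hidden_size=128,
    num_layers=num_layers,
    # Only meaningful (and warning-free) when num_layers > 1
    dropout=dropout if num_layers > 1 else 0.0,
)
```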
The paved road for this project is Python 3.6 (soon 3.7), but we should also support Python 3.5. @dadelani has reported some errors with the training script. Ensure that we retain backward compatibility with Python 3.5.
I tried running the code on our server with Python 3.5 and it gave some encoding errors. I tried it on another system with Python 3.6 and it works. Since I don't have root permission, I need to find a way to make the code work under Python 3.5.
Traceback (most recent call last):
  File "/work/smg/v-david/try_project/yoruba-adr/src/make_parallel_text.py", line 172, in <module>
    main()
  File "/work/smg/v-david/try_project/yoruba-adr/src/make_parallel_text.py", line 159, in main
    examples = list(make_data(ARGS.source_file, ARGS.min_len, ARGS.max_len))
  File "/work/smg/v-david/try_project/yoruba-adr/src/make_parallel_text.py", line 111, in make_data
    print("Skipping: " + line2)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 24-25: ordinal not in range(128)
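A hedged workaround sketch (generic advice, not repo code): on Python 3.5 with a POSIX/ASCII locale, stdout defaults to the ascii codec, so printing Yorùbá text raises UnicodeEncodeError. Rewrapping stdout as UTF-8 avoids it.

```python
import io
import sys

# Force UTF-8 output regardless of the server's locale settings.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8")
print("Skipping: Yorùbá")  # now encodes cleanly

# Alternatively, set the environment variable PYTHONIOENCODING=utf-8
# before invoking the script.
```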
Since OpenNMT-py is now on PyPI, we don't need to keep a full fork of the src in src/onmt
We do need scorers and other utilities (like code to prepare a model for release, stripping out optimizer info and keeping only model weights and biases)
TODO:
Refactor the top-level scripts and code in src to use a pip-installed OpenNMT-py for training and evaluation, using only the custom scoring and utils source where necessary.
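A minimal sketch of the model-release utility described above (assumption: OpenNMT-py style checkpoints, which keep the optimizer state under an "optim" key); the paths are hypothetical.

```python
import torch

checkpoint = torch.load("models/yo_adr.pt", map_location="cpu")
checkpoint["optim"] = None  # drop optimizer state, keep weights and vocab
torch.save(checkpoint, "models/yo_adr.release.pt")
```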
It will be useful to have, in addition to the OpenNMT implementation, support for training self-attentive (Transformer) models with the reference/canonical implementation from Google: https://github.com/tensorflow/tensor2tensor
Add Travis CI (or another continuous integration system), so that when we push breaking changes, say to dependencies like yoruba-text or OpenNMT-py, we catch issues early on.
It's also just good practice for a confidence-inspiring open-source project. Confam!
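A minimal sketch of what the Travis configuration might look like (hypothetical .travis.yml, not repo config; the tests/ path is an assumption):

```yaml
language: python
python:
  - "3.5"
  - "3.6"
install:
  - pip install -r requirements.txt
script:
  - python -m pytest tests/  # assumes a tests/ directory exists
```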
I'm having some challenges reproducing the code on my local machine. The issue seems to be an error with torchtext but I honestly can't seem to figure out what exactly is causing it.
To confirm it's not an environment issue, I tried running on google colab too but the same issue comes up, so it doesn't seem to be a versioning issue.
Do you have any idea what I might be missing?
Hi @ruohoruotsi @emmadedayo
I wonder what "Size" means here. Is it the number of layers and rnn_size?
You don't use any pre-trained word embeddings, is that right?
How do you embed the words?
Hi team, how can I predict a sentence with the models in /models?
Do you have any suggestions?
Thanks for your support.
With small data it's not strange that there's a gap between src vocab size and tgt vocab size.
But with larger data, is something wrong if src vocab size is 50002 and tgt vocab size is 50004? I thought it would have to be bigger.
Thank you.
...
[2019-03-05 17:58:34,600 INFO] * reloading ./data/demo.train.8.pt.
[2019-03-05 17:58:36,239 INFO] * tgt vocab size: 50004.
[2019-03-05 17:58:36,661 INFO] * src vocab size: 50002.
[INFO] running Bahdanau seq2seq training, for GPU training add: -gpuid 0
[2019-03-05 17:58:40,955 INFO] * src vocab size = 50002
[2019-03-05 17:58:40,956 INFO] * tgt vocab size = 50004
[2019-03-05 17:58:40,956 INFO] Building model...
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py:46: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.3 and num_layers=1
"num_layers={}".format(dropout, num_layers))
[2019-03-05 17:58:42,292 INFO] NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(50002, 500, padding_idx=1)
        )
      )
    )
    (rnn): LSTM(500, 128, dropout=0.3)
  )
  (decoder): InputFeedRNNDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(50004, 500, padding_idx=1)
        )
      )
    )
    (dropout): Dropout(p=0.3)
    (rnn): StackedLSTM(
      (dropout): Dropout(p=0.3)
      (layers): ModuleList(
        (0): LSTMCell(628, 128)
      )
    )
    (attn): GlobalAttention(
      (linear_out): Linear(in_features=256, out_features=128, bias=False)
    )
  )
  (generator): Sequential(
    (0): Linear(in_features=128, out_features=50004, bias=True)
    (1): LogSoftmax()
  )
)
[2019-03-05 17:58:42,292 INFO] encoder: 25323560
[2019-03-05 17:58:42,292 INFO] decoder: 31873380
...
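A hedged explanation sketch (assumption: OpenNMT-py style vocab construction on top of torchtext): the source field adds <unk> and <pad>, while the target field also adds <s> and </s>, so a 50000-word cap yields exactly the sizes in the log.

```python
VOCAB_CAP = 50000  # the -src_vocab_size / -tgt_vocab_size cap
SRC_SPECIALS = ["<unk>", "<pad>"]
TGT_SPECIALS = ["<unk>", "<pad>", "<s>", "</s>"]  # target adds BOS/EOS

print(VOCAB_CAP + len(SRC_SPECIALS))  # 50002, matches the src log line
print(VOCAB_CAP + len(TGT_SPECIALS))  # 50004, matches the tgt log line
```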
To more easily normalize Yorùbá Wikipedia articles, create a partially diacritized dataset that keeps only the diacritic marks below the vowels.
The dataset can be used in the following ways:
Motivation:
From my observation of written Yorùbá text, the majority of people, especially young people, don't know the tonal marks (high, mid, and low) that go above the vowel letters, but many people know how (and want to be able) to distinguish between symbols with and without the under-dot, e.g. E vs Ẹ, O vs Ọ, and S vs Ṣ, especially with the availability of Google Gboard on Android phones.
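A minimal sketch (not repo code) of how such a partially diacritized dataset could be produced: decompose to NFD, drop the combining tone marks above vowels, keep the dot-below, and recompose. Treating the macron (U+0304) as a tone mark is an assumption.

```python
import unicodedata

# Combining marks to strip: acute (high tone), grave (low tone), macron (mid).
TONE_MARKS = {"\u0301", "\u0300", "\u0304"}

def strip_tone_marks(text: str) -> str:
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)  # dot-below (U+0323) survives

print(strip_tone_marks("Yorùbá"))  # -> Yoruba
print(strip_tone_marks("ọ̀rọ̀"))    # -> ọrọ
```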
The ADR model is too big.
Add all models from the Improving ADR paper to Bintray: https://bintray.com/ruohoruotsi/prebuilt-models
Alternatively, find a place to host them on GCP. We need a cloud solution, as these models are too big to live on local machines or inside a 100 MB max PyPI sdist or wheel package.
Our fork of OpenNMT used in ./src/, https://github.com/ruohoruotsi/OpenNMT-py, is very much behind the current https://github.com/OpenNMT/OpenNMT-py HEAD. Update our code to the latest.
Additionally, decide whether a submodule-like mechanism that doesn't require manual merging would facilitate maintenance of framework dependencies like OpenNMT or T2T.
Resources: https://stackoverflow.com/questions/6500524/alternatives-to-git-submodules
Fix Lẹ́síkà logic to ensure that vocabulary is shared across training, dev and test splits.
http://www.albertauyeung.com/post/generating-ngrams-python/
https://gist.github.com/amontalenti/7975313
https://developer.ibm.com/articles/cc-patterns-artificial-intelligence-part2/
https://pypi.org/project/icegrams/
http://www.ling.helsinki.fi/kit/2014s/clt237/nltk-02-2-print.shtml
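In the spirit of the linked resources, a minimal n-gram counting sketch (illustrative only, not the Lẹ́síkà implementation):

```python
from collections import Counter

def ngrams(tokens, n):
    # Zip n staggered views of the token list into n-gram tuples.
    return zip(*(tokens[i:] for i in range(n)))

tokens = "ọmọ ní ilé ni ó wà".split()
print(Counter(ngrams(tokens, 2)).most_common(3))
```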
Fine-tune the seq2seq decoder parameters (like beam width) for the ADR task. This also includes error analysis on the validation & test sets, so we have a deep understanding of the model's performance.
Please refer to this document from the CMU-LTI: https://github.com/neubig/nmt-tips
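A minimal sketch of a beam-width sweep for that error analysis (assumption: an OpenNMT-py style translate.py with -model/-src/-output/-beam_size flags; all paths are hypothetical):

```python
import subprocess

for beam in (1, 2, 5, 10):
    subprocess.run([
        "python", "translate.py",
        "-model", "models/yo_adr.pt",
        "-src", "data/dev.src.txt",
        "-output", "pred.beam{}.txt".format(beam),
        "-beam_size", str(beam),
    ], check=True)
```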
Add enhancements to the model that include:
new data from text-reserve (TImi_Wuraola text, new books, dictionaries & proverbs, 1-5 grams, e.g. Agbanilolúwa ==> a-gba-ẹni-ni-olúwa from Yorùbá Name), taking the Yorùbá word vocabulary as input (to constrain predictions) to that canonical set.
During prediction there can be a lookup (perhaps best implemented in Ìrànlọ́wọ́) that validates that an entered word is in the dictionary and either rejects it or looks up a nearest neighbour from a pretrained text embedding; see the sketch after this list.
Prepare Iroyin as a validation dataset
Once training data prep is complete, hand over to David to retrain on his GPU.
Twitter Yorùbá scraper for conversational text (create a new Ìrànlọ́wọ́ issue when we get here).
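A minimal sketch of the prediction-time dictionary lookup described above (hypothetical names, not the Ìrànlọ́wọ́ API): accept words found in the canonical vocabulary, otherwise fall back to the nearest neighbour in a pretrained embedding space.

```python
import numpy as np

def validate(word, vocab, embeddings, vectorize):
    """vocab: set of canonical Yorùbá words; embeddings: {word: np.ndarray};
    vectorize: maps an arbitrary word into the same embedding space."""
    if word in vocab:
        return word
    v = vectorize(word)

    # Cosine similarity against the canonical vocabulary.
    def cosine(w):
        e = embeddings[w]
        return np.dot(v, e) / (np.linalg.norm(v) * np.linalg.norm(e) + 1e-9)

    return max(embeddings, key=cosine)
```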
For reference, this Slack thread captures all of the detailed discussion about next steps: https://yorubaname.slack.com/archives/C16A699LY/p1564362548029800
Can you help with this? I'm using Python 3, and whenever I run the .sh script in the terminal I get a "module nltk not found" error, even though I have installed NLTK.
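A generic diagnostic sketch (not repo-specific): this error after installing NLTK usually means pip installed it for a different interpreter than the one the .sh script invokes.

```python
import sys

print(sys.executable)  # the interpreter actually running the script
try:
    import nltk
    print("nltk OK:", nltk.__version__)
except ImportError:
    # Install NLTK for this exact interpreter, then re-run:
    #   /path/printed/above -m pip install nltk
    print("nltk missing for this interpreter")
```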
I want to train on a different language. Which parts do I have to modify: the data, the tokenizer, the vocabulary set, ...?
Thank you for your response.