
niger-volta-lti / yoruba-adr

24 stars, 6 watchers, 11 forks, 157 MB

Automatic Diacritic Restoration of Yorùbá language Text

License: MIT License

Shell 5.44% Python 78.94% Jupyter Notebook 15.62%
diacritics text-processing seq2seq neural-machine-translation yoruba african-languages orthographic-diacritics adr attention python3

yoruba-adr's People

Contributors

ruohoruotsi


yoruba-adr's Issues

[FIX] drop-out option warning

Fix the dropout option warning, especially as it's distracting and muddies the output of the Jupyter prediction notebook.

/Users/iroro/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py:46: 
UserWarning: dropout option adds dropout after all but last recurrent layer,
so non-zero dropout expects num_layers greater than 1, 
but got dropout=0.3 and num_layers=1
  "num_layers={}".format(dropout, num_layers))
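
A minimal sketch of one way to silence this: only pass a non-zero dropout to the RNN constructor when there is more than one recurrent layer (build_encoder_rnn is a hypothetical helper, not code from this repo).

import torch.nn as nn

def build_encoder_rnn(input_size, hidden_size, num_layers, dropout):
    # PyTorch only applies RNN dropout between stacked layers, so a
    # single-layer LSTM with dropout > 0 triggers this UserWarning.
    # Zero it out explicitly when there is nothing to drop between.
    effective_dropout = dropout if num_layers > 1 else 0.0
    return nn.LSTM(input_size, hidden_size,
                   num_layers=num_layers,
                   dropout=effective_dropout)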

[FIX] training script for python3.5

The paved road for this project is Python 3.6 (soon 3.7), but we should support Python 3.5. @dadelani has reported some errors with the training script. Ensure that we retain backward compatibility with Python 3.5.

I tried running the code on our server with Python 3.5 and it gave some encoding errors. I tried it on another system with Python 3.6 and it works. Since I don't have root permission, I need to find a way to make sure the code works with Python 3.5.

Traceback (most recent call last):
  File "/work/smg/v-david/try_project/yoruba-adr/src/make_parallel_text.py", line 172, in <module>
    main()
  File "/work/smg/v-david/try_project/yoruba-adr/src/make_parallel_text.py", line 159, in main
    examples = list(make_data(ARGS.source_file, ARGS.min_len, ARGS.max_len))
  File "/work/smg/v-david/try_project/yoruba-adr/src/make_parallel_text.py", line 111, in make_data
    print("Skipping: " + line2)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 24-25: ordinal not in range(128)
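
The failure comes from print() writing Yorùbá characters to an ASCII-configured stdout on the server. A minimal, Python 3.5-compatible sketch of a workaround (exporting PYTHONIOENCODING=utf-8 before running the script is an alternative):

import io
import sys

# Re-wrap stdout as UTF-8 so print("Skipping: " + line2) can emit Yorùbá
# diacritics even when the server locale defaults to ASCII.
# sys.stdout.reconfigure(encoding="utf-8") is simpler but needs Python 3.7+.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8")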

Remove OpenNMT-py code

  • Since OpenNMT-py is now on PyPI, we don't need to keep a full fork of the src in src/onmt

  • We do need scorers and other utilities (like code to prepare a model for release, stripping out optimizer info and keeping only model weights and biases)

TODO:
Refactor the top-level scripts and the code in src to use a pip-installed OpenNMT-py for training and evaluation, keeping only the custom scoring and utility source where necessary.
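
A minimal sketch of the release-preparation step mentioned above: load a trained checkpoint, drop the optimizer state, and re-save only what inference needs. The "optim" key and the file paths are assumptions about the checkpoint layout and should be checked against the installed OpenNMT-py version.

import torch

def release_model(checkpoint_path, output_path):
    # Load on CPU so this also works on machines without a GPU.
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    # The optimizer state can dominate the file size; inference only needs
    # the model weights/biases, the generator, and the vocab/options.
    checkpoint["optim"] = None
    torch.save(checkpoint, output_path)

release_model("models/yo_adr.pt", "models/yo_adr_release.pt")  # placeholder paths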

[ADD] Travis CI

Add Travis CI (or another continuous integration system) so that when we push breaking changes, say to dependencies like yoruba-text or OpenNMT-py, we catch issues early on.

It's also just good practice for an open-source project people can have confidence in. Confam!

Issues with reproduction

I'm having some challenges reproducing the code on my local machine. The issue seems to be an error with torchtext but I honestly can't seem to figure out what exactly is causing it.

Here's a stacktrace.
[Screenshot: the torchtext stack trace, 2019-05-14]

To confirm it's not an environment issue, I tried running on Google Colab too, but the same issue comes up, so it doesn't seem to be a versioning issue.

[Screenshot: the same error on Google Colab, 2019-05-14]

Do you have any idea what I might be missing?

Are there any limitations on vocab size?

With small data it isn't surprising that there is a gap between the src vocab size and the tgt vocab size.
But with larger data, is something wrong if the src vocab size is 50002 and the tgt vocab size is 50004?
I would expect them to be bigger.
Thank you.

...
[2019-03-05 17:58:34,600 INFO]  * reloading ./data/demo.train.8.pt.
[2019-03-05 17:58:36,239 INFO]  * tgt vocab size: 50004.
[2019-03-05 17:58:36,661 INFO]  * src vocab size: 50002.
[INFO] running Bahdanau seq2seq training, for GPU training add: -gpuid 0 
[2019-03-05 17:58:40,955 INFO]  * src vocab size = 50002
[2019-03-05 17:58:40,956 INFO]  * tgt vocab size = 50004
[2019-03-05 17:58:40,956 INFO] Building model...
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py:46: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.3 and num_layers=1
  "num_layers={}".format(dropout, num_layers))
[2019-03-05 17:58:42,292 INFO] NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(50002, 500, padding_idx=1)
        )
      )
    )
    (rnn): LSTM(500, 128, dropout=0.3)
  )
  (decoder): InputFeedRNNDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(50004, 500, padding_idx=1)
        )
      )
    )
    (dropout): Dropout(p=0.3)
    (rnn): StackedLSTM(
      (dropout): Dropout(p=0.3)
      (layers): ModuleList(
        (0): LSTMCell(628, 128)
      )
    )
    (attn): GlobalAttention(
      (linear_out): Linear(in_features=256, out_features=128, bias=False)
    )
  )
  (generator): Sequential(
    (0): Linear(in_features=128, out_features=50004, bias=True)
    (1): LogSoftmax()
  )
)
[2019-03-05 17:58:42,292 INFO] encoder: 25323560
[2019-03-05 17:58:42,292 INFO] decoder: 31873380
...
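
Those numbers are most likely just the 50,000-word frequency cap plus special tokens: the source vocab adds an unknown and a padding token (50002), while the target vocab also adds sentence-start and sentence-end tokens (50004), so nothing is wrong. A minimal sketch of how such a capped vocabulary is typically built (the token names are illustrative, not necessarily the exact ones the preprocessing code uses):

from collections import Counter

def build_vocab(tokenized_sentences, max_size, specials):
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Keep the max_size most frequent tokens and prepend the specials, so
    # the final size is min(max_size, distinct tokens) + len(specials).
    most_common = [tok for tok, _ in counts.most_common(max_size)]
    return specials + most_common

# Toy data; with a 50,000 cap this is why src ends up at 50,002 entries
# (<unk>, <blank>) and tgt at 50,004 (<unk>, <blank>, <s>, </s>).
sents = [["ade", "lo", "si", "oja"], ["ade", "ra", "isu"]]
src_vocab = build_vocab(sents, 50000, ["<unk>", "<blank>"])
tgt_vocab = build_vocab(sents, 50000, ["<unk>", "<blank>", "<s>", "</s>"])
print(len(src_vocab), len(tgt_vocab))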

Prepare partially diacritized input dataset

To more easily normalize Yorùbá Wikipedia articles, create a partially diacritized dataset that carries only the diacritic marks below the vowels.

The dataset can be used in the following ways:

  1. Train on partially diacritized text, i.e. sentences with the correct under-marks as input and the corresponding fully diacritized sentences as output. I believe this will give better accuracy than what we already have. If it gives very high accuracy, we can then consider:
  2. Training a model that turns non-diacritized text into partially diacritized text, and then training from that output to fully diacritized text, i.e. [non-diacritized text] ====> [partially diacritized text] ====> [fully diacritized text]

Motivation:
From my observation of how Yorùbá text is written, the majority of people, especially young people, don't know the tonal marks (high, mid, and low) that go above the vowel letters, but many people know how (and want to be able) to distinguish between symbols with and without the lower mark, e.g. E vs Ẹ, O vs Ọ and S vs Ṣ, especially with the availability of Google Gboard on Android phones.
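
A minimal sketch, using only the standard library, of how the partially diacritized side could be derived from existing fully diacritized text: decompose each string, drop the combining acute and grave tone marks, and keep the dot below (the exact set of marks to strip would need to be checked against the corpus).

import unicodedata

# Combining marks to remove: acute (high tone) and grave (low tone).
# The combining dot below (U+0323), which distinguishes E/Ẹ, O/Ọ, S/Ṣ,
# is deliberately kept.
TONE_MARKS = {"\u0301", "\u0300"}

def partially_diacritize(text):
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", stripped)

print(partially_diacritize("ọmọdé kékeré"))  # -> "ọmọde kekere"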

[FIX] ADR model size

The ADR model is too big.

  • The training script emits a ~200 MB model, but normal people cannot be downloading 200 MB, haba!!
  • What is making up the large size? My suspicion is that PyTorch is also saving other data along with the weights/biases. Investigate and optimize the model size so that we can either store it locally (within GitHub's limits) or at least make it an easier download.
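
A minimal sketch for answering "what is making up the large size": load the checkpoint on CPU and report roughly how many megabytes each top-level entry contributes. Key names vary by OpenNMT-py version, and the path is a placeholder.

import torch

def report_checkpoint_size(path):
    checkpoint = torch.load(path, map_location="cpu")
    for key, value in checkpoint.items():
        if isinstance(value, dict):
            # state_dict-like entries: sum the tensor sizes.
            n_bytes = sum(t.numel() * t.element_size()
                          for t in value.values()
                          if torch.is_tensor(t))
            print("{}: ~{:.1f} MB".format(key, n_bytes / 1e6))
        else:
            print("{}: {}".format(key, type(value).__name__))

report_checkpoint_size("models/yo_adr.pt")  # placeholder path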

Ìrànlọ́wọ́:

Can I train a model with a larger text corpus?

I have trained with 8 million sentences and it works well.
But with a larger corpus (more than 5 times larger) I have a problem with memory. How do I deal with it? Which parameter do I have to change?
I use a Tesla K40m with 12 GiB of memory.
Thank you ;) .

Tune ADR decoder parameters

Fine-tune the seq2seq decoder parameters (like beam width) for the ADR task. This also includes error analysis on the validation & test sets so we have a deep understanding of the model's performance.

Please refer to this document from the CMU-LTI: https://github.com/neubig/nmt-tips
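
A minimal sketch of the tuning loop: decode the validation set at several beam widths and compare a simple word-level accuracy against the reference. translate_fn is a stand-in for whatever decode entry point ends up being used, and the data below is toy data.

def word_accuracy(hyp_lines, ref_lines):
    # Fraction of tokens restored exactly; a simplistic ADR metric.
    correct = total = 0
    for hyp, ref in zip(hyp_lines, ref_lines):
        for h, r in zip(hyp.split(), ref.split()):
            correct += int(h == r)
            total += 1
    return correct / max(total, 1)

def sweep_beam_widths(translate_fn, src_lines, ref_lines, widths=(1, 2, 5, 10, 20)):
    # translate_fn takes (src_lines, beam_size) and returns hypothesis lines.
    return {w: word_accuracy(translate_fn(src_lines, w), ref_lines) for w in widths}

# Toy usage with an identity "decoder" just to show the shape of the loop.
print(sweep_beam_widths(lambda lines, beam_size: lines,
                        ["ojo dara"], ["òjò dára"]))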

[ADD] enhancements for new training session

Add enhancements to the model, including:

  • New data from the text-reserve (TImi_Wuraola text, new books, dictionaries & proverbs, 1-5 grams, e.g. Agbanilolúwa ==> a-gba-ẹni-ni-olúwa from Yorùbá Name), taking the Yorùbá word vocabulary as input to constrain predictions to that canonical set.

  • During prediction there can be a lookup (perhaps best implemented in Ìrànlọ́wọ́) that validates that an entered word is in the dictionary, and either rejects it or looks up a nearest neighbour from a pretrained text embedding (see the sketch after this list).

  • Prepare Iroyin as a validation dataset.

  • Once training data prep is complete, hand over to David to retrain on his GPU.

  • A Twitter Yorùbá scraper for conversational text (create a new Ìrànlọ́wọ́ issue when we get here).

For reference, this thread captures all of the detailed discussion about next steps:
https://yorubaname.slack.com/archives/C16A699LY/p1564362548029800
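
A minimal sketch of the prediction-time lookup described in the second bullet above, assuming the canonical word list and a pretrained embedding table are available as plain Python structures. All data here is toy data; a real version would load the Ìrànlọ́wọ́ / yoruba-text resources.

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def validate_word(word, vocabulary, embeddings):
    # Return the word if it is canonical, otherwise its nearest neighbour
    # in embedding space, or None if it cannot be repaired.
    if word in vocabulary:
        return word
    if word not in embeddings:
        return None
    query = embeddings[word]
    return max(vocabulary,
               key=lambda w: cosine(query, embeddings.get(w, [0.0] * len(query))))

# Toy data standing in for the canonical word list and text embeddings.
vocabulary = {"olúwa", "ọmọ"}
embeddings = {"olúwa": [0.9, 0.1], "ọmọ": [0.1, 0.9], "oluwa": [0.88, 0.12]}
print(validate_word("oluwa", vocabulary, embeddings))  # -> "olúwa"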

Module nltk not found

Can you help with this? I'm using Python 3, and whenever I run the .sh script in the terminal the error "module nltk not found" shoots out, even though I have installed nltk.
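
This usually means nltk was installed for a different interpreter than the one the .sh script invokes; running python3 -m pip install nltk against that same interpreter normally fixes it. A quick diagnostic sketch (not project code), worth running with the exact python3 the script uses:

import sys
print(sys.executable)    # which interpreter is actually being used
import nltk              # raises ModuleNotFoundError if nltk is missing here
print(nltk.__version__)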


Reduce model sizes in preparation for Productization

  • For SageMaker the model size is too big.
  • Use the model release preparation code to reduce the size, so that we don't incur additional expenses on AWS.
  • Apply this to all trained models. It will be interesting to see how big the Transformer models end up being.

Training in new languages

I want to train in a different language. What parts do I have to modify (data, tokenizer, vocabulary set, ...)?
Thank you for your response.
