
niger-volta-lti / yoruba-adr

24 stars, 6 watchers, 11 forks, 157 MB

Automatic Diacritic Restoration of Yorùbá language Text

License: MIT License

Shell 5.44% Python 78.94% Jupyter Notebook 15.62%
diacritics text-processing seq2seq neural-machine-translation yoruba african-languages orthographic-diacritics adr attention python3

yoruba-adr's People

Contributors

ruohoruotsi


yoruba-adr's Issues

[FIX] drop-out option warning

Fix the dropout option warning, especially as it's distracting and muddies the output of the Jupyter prediction notebook.

/Users/iroro/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py:46: 
UserWarning: dropout option adds dropout after all but last recurrent layer,
so non-zero dropout expects num_layers greater than 1, 
but got dropout=0.3 and num_layers=1
  "num_layers={}".format(dropout, num_layers))
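
A minimal sketch of one way to silence this: only pass a non-zero dropout to the RNN constructor when there is more than one recurrent layer (build_encoder_rnn is a hypothetical helper, not code from this repo).

import torch.nn as nn

def build_encoder_rnn(input_size, hidden_size, num_layers, dropout):
    # PyTorch only applies RNN dropout between stacked layers, so a
    # single-layer LSTM with dropout > 0 triggers this UserWarning.
    # Zero it out explicitly when there is nothing to drop between.
    effective_dropout = dropout if num_layers > 1 else 0.0
    return nn.LSTM(input_size, hidden_size,
                   num_layers=num_layers,
                   dropout=effective_dropout)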

[FIX] training script for python3.5

The paved road for this project is Python 3.6 (soon 3.7), but we should support Python 3.5. @dadelani has reported some errors with the training script. Ensure that we retain backward compatibility with Python 3.5.

I tried running the code on our server with Python 3.5 and it gave some encoding errors. I tried it on another system with Python 3.6 and it works. Since I don't have root permission, I need to find a way to make sure the code works with Python 3.5.

Traceback (most recent call last):
  File "/work/smg/v-david/try_project/yoruba-adr/src/make_parallel_text.py", line 172, in <module>
    main()
  File "/work/smg/v-david/try_project/yoruba-adr/src/make_parallel_text.py", line 159, in main
    examples = list(make_data(ARGS.source_file, ARGS.min_len, ARGS.max_len))
  File "/work/smg/v-david/try_project/yoruba-adr/src/make_parallel_text.py", line 111, in make_data
    print("Skipping: " + line2)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 24-25: ordinal not in range(128)
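
The failure comes from print() writing Yorùbá characters to an ASCII-configured stdout on the server. A minimal, Python 3.5-compatible sketch of a workaround (exporting PYTHONIOENCODING=utf-8 before running the script is an alternative):

import io
import sys

# Re-wrap stdout as UTF-8 so print("Skipping: " + line2) can emit Yorùbá
# diacritics even when the server locale defaults to ASCII.
# sys.stdout.reconfigure(encoding="utf-8") is simpler but needs Python 3.7+.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8")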

Remove OpenNMT-py code

  • Since OpenNMT-py is now on PyPI, we don't need to keep a full fork of the src in src/onmt

  • We do need scorers and other utilities (like code to prepare a model for release, stripping out optimizer info and keeping only model weights and biases)

TODO:
Refactor the top-level scripts and the code in src to use a pip-installed OpenNMT-py for training and evaluation, keeping only the custom scoring and utility source where necessary.
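
A minimal sketch of the release-preparation step mentioned above: load a trained checkpoint, drop the optimizer state, and re-save only what inference needs. The "optim" key and the file paths are assumptions about the checkpoint layout and should be checked against the installed OpenNMT-py version.

import torch

def release_model(checkpoint_path, output_path):
    # Load on CPU so this also works on machines without a GPU.
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    # The optimizer state can dominate the file size; inference only needs
    # the model weights/biases, the generator, and the vocab/options.
    checkpoint["optim"] = None
    torch.save(checkpoint, output_path)

release_model("models/yo_adr.pt", "models/yo_adr_release.pt")  # placeholder paths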

[ADD] Travis CI

Add Travis CI (or another continuous integration system) so that when we push breaking changes, say to dependencies like yoruba-text or OpenNMT-py, we catch issues early on.

It's also just good practice for an open-source project people can have confidence in. Confam!

Issues with reproduction

I'm having some challenges reproducing the code on my local machine. The issue seems to be an error with torchtext but I honestly can't seem to figure out what exactly is causing it.

Here's a stacktrace.
[Screenshot: the torchtext stack trace, 2019-05-14]

To confirm it's not an environment issue, I tried running on Google Colab too, but the same issue comes up, so it doesn't seem to be a versioning issue.

[Screenshot: the same error on Google Colab, 2019-05-14]

Do you have any idea what I might be missing?

Are there any limitations on vocab size?

With small data it isn't surprising that there is a gap between the src vocab size and the tgt vocab size.
But with larger data, is something wrong if the src vocab size is 50002 and the tgt vocab size is 50004?
I would expect them to be bigger.
Thank you.

...
[2019-03-05 17:58:34,600 INFO]  * reloading ./data/demo.train.8.pt.
[2019-03-05 17:58:36,239 INFO]  * tgt vocab size: 50004.
[2019-03-05 17:58:36,661 INFO]  * src vocab size: 50002.
[INFO] running Bahdanau seq2seq training, for GPU training add: -gpuid 0 
[2019-03-05 17:58:40,955 INFO]  * src vocab size = 50002
[2019-03-05 17:58:40,956 INFO]  * tgt vocab size = 50004
[2019-03-05 17:58:40,956 INFO] Building model...
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py:46: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.3 and num_layers=1
  "num_layers={}".format(dropout, num_layers))
[2019-03-05 17:58:42,292 INFO] NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(50002, 500, padding_idx=1)
        )
      )
    )
    (rnn): LSTM(500, 128, dropout=0.3)
  )
  (decoder): InputFeedRNNDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(50004, 500, padding_idx=1)
        )
      )
    )
    (dropout): Dropout(p=0.3)
    (rnn): StackedLSTM(
      (dropout): Dropout(p=0.3)
      (layers): ModuleList(
        (0): LSTMCell(628, 128)
      )
    )
    (attn): GlobalAttention(
      (linear_out): Linear(in_features=256, out_features=128, bias=False)
    )
  )
  (generator): Sequential(
    (0): Linear(in_features=128, out_features=50004, bias=True)
    (1): LogSoftmax()
  )
)
[2019-03-05 17:58:42,292 INFO] encoder: 25323560
[2019-03-05 17:58:42,292 INFO] decoder: 31873380
...
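
Those numbers are most likely just the 50,000-word frequency cap plus special tokens: the source vocab adds an unknown and a padding token (50002), while the target vocab also adds sentence-start and sentence-end tokens (50004), so nothing is wrong. A minimal sketch of how such a capped vocabulary is typically built (the token names are illustrative, not necessarily the exact ones the preprocessing code uses):

from collections import Counter

def build_vocab(tokenized_sentences, max_size, specials):
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Keep the max_size most frequent tokens and prepend the specials, so
    # the final size is min(max_size, distinct tokens) + len(specials).
    most_common = [tok for tok, _ in counts.most_common(max_size)]
    return specials + most_common

# Toy data; with a 50,000 cap this is why src ends up at 50,002 entries
# (<unk>, <blank>) and tgt at 50,004 (<unk>, <blank>, <s>, </s>).
sents = [["ade", "lo", "si", "oja"], ["ade", "ra", "isu"]]
src_vocab = build_vocab(sents, 50000, ["<unk>", "<blank>"])
tgt_vocab = build_vocab(sents, 50000, ["<unk>", "<blank>", "<s>", "</s>"])
print(len(src_vocab), len(tgt_vocab))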

Prepare partially diacritized input dataset

To more easily normalize Yorùbá Wikipedia articles, create a partially diacritized dataset that carries only the diacritic marks below the vowels.

The dataset can be used in the following ways:

  1. Train on partially diacritized text, i.e. sentences with the correct under-marks as input and the corresponding fully diacritized sentences as output. I believe this will give better accuracy than what we already have. If it gives very high accuracy, we can then consider:
  2. Training a model that turns non-diacritized text into partially diacritized text, and then training from that output to fully diacritized text, i.e. [non-diacritized text] ====> [partially diacritized text] ====> [fully diacritized text]

Motivation:
From my observation of how Yorùbá text is written, the majority of people, especially young people, don't know the tonal marks (high, mid, and low) that go above the vowel letters, but many people know how (and want to be able) to distinguish between symbols with and without the lower mark, e.g. E vs Ẹ, O vs Ọ and S vs Ṣ, especially with the availability of Google Gboard on Android phones.
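
A minimal sketch, using only the standard library, of how the partially diacritized side could be derived from existing fully diacritized text: decompose each string, drop the combining acute and grave tone marks, and keep the dot below (the exact set of marks to strip would need to be checked against the corpus).

import unicodedata

# Combining marks to remove: acute (high tone) and grave (low tone).
# The combining dot below (U+0323), which distinguishes E/Ẹ, O/Ọ, S/Ṣ,
# is deliberately kept.
TONE_MARKS = {"\u0301", "\u0300"}

def partially_diacritize(text):
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", stripped)

print(partially_diacritize("ọmọdé kékeré"))  # -> "ọmọde kekere"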

[FIX] ADR model size

The ADR model is too big.

  • The training script emits a ~200 MB model, but normal people cannot be downloading 200 MB, haba!!
  • What is making up the large size? My suspicion is that PyTorch is also saving other data along with the weights/biases. Investigate and optimize the model size so that we can either store it locally (within GitHub's limits) or at least make it an easier download.
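
A minimal sketch for answering "what is making up the large size": load the checkpoint on CPU and report roughly how many megabytes each top-level entry contributes. Key names vary by OpenNMT-py version, and the path is a placeholder.

import torch

def report_checkpoint_size(path):
    checkpoint = torch.load(path, map_location="cpu")
    for key, value in checkpoint.items():
        if isinstance(value, dict):
            # state_dict-like entries: sum the tensor sizes.
            n_bytes = sum(t.numel() * t.element_size()
                          for t in value.values()
                          if torch.is_tensor(t))
            print("{}: ~{:.1f} MB".format(key, n_bytes / 1e6))
        else:
            print("{}: {}".format(key, type(value).__name__))

report_checkpoint_size("models/yo_adr.pt")  # placeholder path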

Ìrànlọ́wọ́:

Can I train a model with a larger text corpus?

I have trained with 8 million sentences and it works well.
But with a larger corpus (more than 5 times larger) I have a problem with memory. How do I deal with it? Which parameter do I have to change?
I use a Tesla K40m with 12 GiB of memory.
Thank you ;) .

Tune ADR decoder parameters

Fine-tune the seq2seq decoder parameters (like beam width) for the ADR task. This also includes error analysis on the validation & test sets so we have a deep understanding of the model's performance.

Please refer to this document from the CMU-LTI: https://github.com/neubig/nmt-tips
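
A minimal sketch of the tuning loop: decode the validation set at several beam widths and compare a simple word-level accuracy against the reference. translate_fn is a stand-in for whatever decode entry point ends up being used, and the data below is toy data.

def word_accuracy(hyp_lines, ref_lines):
    # Fraction of tokens restored exactly; a simplistic ADR metric.
    correct = total = 0
    for hyp, ref in zip(hyp_lines, ref_lines):
        for h, r in zip(hyp.split(), ref.split()):
            correct += int(h == r)
            total += 1
    return correct / max(total, 1)

def sweep_beam_widths(translate_fn, src_lines, ref_lines, widths=(1, 2, 5, 10, 20)):
    # translate_fn takes (src_lines, beam_size) and returns hypothesis lines.
    return {w: word_accuracy(translate_fn(src_lines, w), ref_lines) for w in widths}

# Toy usage with an identity "decoder" just to show the shape of the loop.
print(sweep_beam_widths(lambda lines, beam_size: lines,
                        ["ojo dara"], ["òjò dára"]))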

[ADD] enhancements for new training session

Add enhancements to the model, including:

  • New data from the text-reserve (TImi_Wuraola text, new books, dictionaries & proverbs, 1-5 grams, e.g. Agbanilolúwa ==> a-gba-ẹni-ni-olúwa from Yorùbá Name), taking the Yorùbá word vocabulary as input to constrain predictions to that canonical set.

  • During prediction there can be a lookup (perhaps best implemented in Ìrànlọ́wọ́) that validates that an entered word is in the dictionary, and either rejects it or looks up a nearest neighbour from a pretrained text embedding (see the sketch after this list).

  • Prepare Iroyin as a validation dataset.

  • Once training data prep is complete, hand over to David to retrain on his GPU.

  • A Twitter Yorùbá scraper for conversational text (create a new Ìrànlọ́wọ́ issue when we get here).

For reference, this thread captures all of the detailed discussion about next steps:
https://yorubaname.slack.com/archives/C16A699LY/p1564362548029800
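
A minimal sketch of the prediction-time lookup described in the second bullet above, assuming the canonical word list and a pretrained embedding table are available as plain Python structures. All data here is toy data; a real version would load the Ìrànlọ́wọ́ / yoruba-text resources.

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def validate_word(word, vocabulary, embeddings):
    # Return the word if it is canonical, otherwise its nearest neighbour
    # in embedding space, or None if it cannot be repaired.
    if word in vocabulary:
        return word
    if word not in embeddings:
        return None
    query = embeddings[word]
    return max(vocabulary,
               key=lambda w: cosine(query, embeddings.get(w, [0.0] * len(query))))

# Toy data standing in for the canonical word list and text embeddings.
vocabulary = {"olúwa", "ọmọ"}
embeddings = {"olúwa": [0.9, 0.1], "ọmọ": [0.1, 0.9], "oluwa": [0.88, 0.12]}
print(validate_word("oluwa", vocabulary, embeddings))  # -> "olúwa"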

Module nltk not found

Can you help with this? I'm using Python 3, and whenever I run the .sh script in the terminal the error "module nltk not found" shoots out, even though I have installed nltk.
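
This usually means nltk was installed for a different interpreter than the one the .sh script invokes; running python3 -m pip install nltk against that same interpreter normally fixes it. A quick diagnostic sketch (not project code), worth running with the exact python3 the script uses:

import sys
print(sys.executable)    # which interpreter is actually being used
import nltk              # raises ModuleNotFoundError if nltk is missing here
print(nltk.__version__)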


Reduce model sizes in preparation for Productization

  • For SageMaker the model size is too big.
  • Use the model release preparation code to reduce the size, so that we don't incur additional expenses on AWS.
  • Apply this to all trained models. It will be interesting to see how big the Transformer models end up being.

Training in new languages

I want to train in a different language. What parts do I have to modify (data, tokenizer, vocabulary set, ...)?
Thank you for your response.
