
cahya-wirawan / indonesian-language-models

Indonesian Language Models and Their Usage

Home Page: https://cahya-wirawan.github.io/indonesian-language-models

License: MIT License

Languages: Jupyter Notebook 97.76%, Python 2.24%
Topics: machine-learning, deep-learning, nlp, language-model, pytorch, fastai, transformer, huggingface-transformers

indonesian-language-models's Introduction

Indonesian Language Models

A language model is a probability distribution over word sequences, used to predict the next word given the words that precede it. This ability makes the language model a core component of modern natural language processing. It is used for many different tasks, such as speech recognition, conversational AI, information retrieval, sentiment analysis, and text summarization.
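As a minimal sketch of this idea, the following Python snippet estimates next-word probabilities from bigram counts. The three-sentence corpus and the function names `train_bigram` and `next_word_probs` are made up for illustration; the models in this repository are neural networks, not count-based bigram models.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Add sentence boundary markers so the first and last words are modeled too.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, prev):
    """Return P(next | prev) as a dict, estimated by relative frequency."""
    following = counts.get(prev, Counter())
    total = sum(following.values())
    if total == 0:
        return {}
    return {word: c / total for word, c in following.items()}


# Hypothetical toy corpus, for illustration only.
corpus = [
    "saya suka makan nasi",
    "saya suka minum teh",
    "dia suka makan roti",
]
model = train_bigram(corpus)
probs = next_word_probs(model, "suka")

# Print candidate next words after "suka", most likely first.
for word, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {p:.2f}")
```

A neural language model plays the same role, but it conditions on a much longer context and generalizes to word sequences it has never seen.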

For this reason, many big companies are competing to build larger and larger language models, such as Google's BERT, Facebook's RoBERTa, or OpenAI's GPT-3 with its massive number of parameters. Most of the time, they build language models only for English and a few other European languages. Countries with low-resource languages face big challenges in catching up in this technology race.

The author has therefore been building language models for Indonesian, starting with ULMFiT in 2018. The first language model was trained only on Indonesian Wikipedia, which is very small compared to the datasets used to train English language models.

Universal Language Model Fine-tuning (ULMFiT)

Jeremy Howard and Sebastian Ruder proposed ULMFiT in early 2018 as a novel method for fine-tuning language models for inductive transfer learning. The ULMFiT language model for Indonesian was trained as part of the author's project while learning fastai. It achieved a perplexity of 27.67 on Indonesian Wikipedia.
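For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-probability that the model assigns to each token of the evaluation text. The sketch below shows the calculation on made-up probabilities; the function name `perplexity` and the input values are illustrative assumptions, and the evaluation setup behind the 27.67 figure is not described here.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)


# Hypothetical probabilities a model assigned to the correct next token
# at each position of a short evaluation text.
example_probs = [0.20, 0.05, 0.50, 0.10]
print(f"perplexity: {perplexity(example_probs):.2f}")

# Sanity check: a model that always spreads its probability evenly over
# 4 choices has a perplexity of 4 (up to floating-point rounding).
uniform_ppl = perplexity([0.25] * 10)
print(f"uniform over 4 choices: {uniform_ppl:.2f}")
```

Read this way, a perplexity of 27.67 means the model is, on average, about as uncertain as if it were choosing uniformly among roughly 28 candidate tokens at each step. Lower is better.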

Transformers

Ashish Vaswani et al. proposed the Transformer in the paper Attention Is All You Need. It is a novel architecture that aims to solve sequence-to-sequence tasks while handling long-range dependencies with ease.

At the time of writing (March 2021), there are already more than 50 different types of transformer-based language models (according to the model list at Hugging Face), such as BERT, GPT-2, Longformer, or mT5, built by companies and individual contributors. The author has also built several Indonesian transformer-based language models using the Hugging Face Transformers library and hosts them on the Hugging Face model hub.

indonesian-language-models's People

Contributors

cahya-wirawan, guspan-tanadi

indonesian-language-models's Issues

File not found in your Nofile.io link

Dear Pak Cahya,

I tried your script with my dataset, but ran into a problem because I don't have the wiki_id_lm.h5 file. I think the file is at the Nofile.io link in the README, but when I click it I get a "file not found" message. Could you help me fix this?

Thank you
Sigit Purnomo

ulmfit_test.py error

Mas Cahya, requirement.txt does not say which fastai version is required.

However, as far as I can tell, the call:
seq_rnn = get_language_model(vs, em_sz, nh, nl, 1)
seems to be intended for the fastai 0.70 module.

I checked versions 1.0.0 and above; in most of them the function asks for more than 4 inputs. I also tried to fix it myself, but ran into a problem with the dict_key.

In any case, with fastai >= 1.0.0 (I am using version 1.0.60), this error appears:

File "ulmfit_test.py", line 26, in <module>
    seq_rnn = get_language_model(vs, em_sz, nh, nl, 1)
TypeError: get_language_model() takes from 2 to 4 positional arguments but 5 were given

I tried to rework it by looking for a function that could run the script, following the existing Jupyter notebook files, but that looks time-consuming, so I decided to ask you directly.

By the way, I plan to generate WAV files in order to build a speech dataset with the help of espeakng + the Indonesian MBROLA voice.

To add variety to the voices, I will also use the English voice, but with modified text, so that the WAV files produced by the English espeak voice sound like an Indonesian speaker reading the text.

For example: kamu = kaamuu / kaamoo

In short, it is a syllable-based modification intended to make espeak pronounce words the way an Indonesian speaker would.

The main problem remains that speech recognition with DeepSpeech needs a very large dataset to give good results (more than 500 hours), while Mozilla Common Voice has only 3 hours.
The plan is to use your model to generate random texts of 5-7 words, which are then read by espeak and saved as WAV files.

This is the first time I have set up an environment to run your script, and this is how it went: there is an error in the inputs to the function, and I do not yet know whether there are other errors.

Perhaps you could share the fastai version you used when you wrote the script.

Transfer Learning

Bang Cahya, could I have the 'wiki_id_lm.h5' and 'wiki_id_itos.pkl' files? I am trying to build a speech recognition system, but it is very hard to find a dataset for building the language model, let alone one for training.

The first stage of speech recognition is building the language model. Since I am using DeepSpeech, I will build it with KenLM, or convert an existing model to the KenLM format. I would also appreciate some guidance on converting to the KenLM format.

Thank you, Mas. Technological progress for the Indonesian language is very slow, mostly because people who are interested in learning tend to start from scratch. Our technology would surely develop more solidly if there were more ready-made building blocks like the ones you have created.
