This project forked from aliosm/arabic-text-diacritization

Benchmark Arabic text diacritization dataset

License: MIT License

Jupyter Notebook 42.05% Python 57.95%

Arabic Text Diacritization

This repository contains the dataset, helpers, and systems comparison for our paper on Arabic Text Diacritization:

"Arabic Text Diacritization Using Deep Neural Networks", Ali Fadel, Ibraheem Tuffaha, Bara' Al-Jawarneh, and Mahmoud Al-Ayyoub, ICCAIS 2019.

Files

  • train.txt - Contains 50,000 lines of diacritized Arabic text for use as the training set
  • val.txt - Contains 2,500 lines of diacritized Arabic text for use as the validation set
  • test.txt - Contains 2,500 lines of diacritized Arabic text for use as the test set
  • constants
    • ARABIC_LETTERS_LIST.pickle - Contains the list of Arabic letters
    • CLASSES_LIST.pickle - Contains the list of all possible classes
    • DIACRITICS_LIST.pickle - Contains the list of all diacritics
  • count_characters.py - Counts the number of Arabic letters and diacritics in a file
  • count_fathatan.py - Counts the number of fathatan occurrences before and after Alif in all files from a folder
  • diacritization_stat.py - Calculates DER and WER using the gold data and the predicted output
  • diacritics_rate_extractor.py - Keeps only the lines whose rate of diacritics to Arabic characters is p% or more, in all files from a folder
  • file_lookup.py - Searches for a line in all files from a folder
  • fix_fathatan.py - Changes after-Alif fathatan to before-Alif fathatan in a file
  • remove_diacritics.py - Removes diacritics from a file
  • transliteration.py - Converts a file from Arabic text to Buckwalter transliteration and vice-versa
  • pre_process_tashkeela_corpus.ipynb - Pre-process Tashkeela Corpus data
  • ali-soft - Contains some bugs that exist in Ali-Soft system
  • farasa - Contains Farasa system output, fixed output, and DER/WER statistics
  • harakat - Contains Harakat system testing script, output, fixed output, and DER/WER statistics
  • madamira - Contains MADAMIRA system output, fixed output, and DER/WER statistics
  • mishkal - Contains Mishkal system output, fixed output, and DER/WER statistics
  • shakkala - Contains Shakkala system data splitting script, output, fixed output, and DER/WER statistics
  • tashkeela_model - Contains Tashkeela-Model system output, fixed output, and DER/WER statistics for each n-gram model provided by them
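
As a rough illustration of what helpers such as remove_diacritics.py and count_characters.py do, the sketch below strips and counts diacritics in a string. It is not the repository's code: the function names are hypothetical, and the diacritic set is assumed to be the eight common marks in the Unicode range U+064B to U+0652 rather than the contents of DIACRITICS_LIST.pickle.

```python
# Assumption: the eight common Arabic diacritics, U+064B (fathatan)
# through U+0652 (sukun). The repository loads its own list from
# constants/DIACRITICS_LIST.pickle, which may differ.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def remove_diacritics(text: str) -> str:
    """Return text with every character in DIACRITICS dropped."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def count_diacritics(text: str) -> int:
    """Return how many characters of text are in DIACRITICS."""
    return sum(ch in DIACRITICS for ch in text)


if __name__ == "__main__":
    # kaf + fatha, ta + fatha, ba + fatha
    word = "\u0643\u064E\u062A\u064E\u0628\u064E"
    print(remove_diacritics(word))
    print(count_diacritics(word))
```

Applying the same functions line by line to train.txt would give an undiacritized copy of the training data and its total diacritic count under the assumed set.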

Note: All code in this repository was tested on Ubuntu 18.04.
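
The DER and WER figures reported by diacritization_stat.py can be pictured with the simplified sketch below. It assumes DER is the percentage of characters whose diacritics differ from the gold text and WER is the percentage of words containing at least one such character. The function names are hypothetical, every non-diacritic character is counted, and the diacritic set is assumed as U+064B to U+0652, so the numbers need not match the repository's script, which may apply different counting rules.

```python
# Assumption: the eight common Arabic diacritics (U+064B..U+0652).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def split_diacritics(word):
    """Return a list of (base_char, diacritics) pairs for one word."""
    pairs = []
    for ch in word:
        if ch not in DIACRITICS:
            pairs.append((ch, ""))
        elif pairs:
            # Attach the mark to the preceding base character.
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        # A diacritic with no preceding base character is ignored.
    return pairs


def der_wer(gold: str, pred: str):
    """Return (DER, WER) in percent for two diacritized strings.

    Both strings must contain the same words once diacritics are removed,
    and gold must contain at least one word.
    """
    gold_words, pred_words = gold.split(), pred.split()
    if len(gold_words) != len(pred_words):
        raise ValueError("gold and predicted text differ in word count")

    char_total = char_errors = word_errors = 0
    for g_word, p_word in zip(gold_words, pred_words):
        g_pairs = split_diacritics(g_word)
        p_pairs = split_diacritics(p_word)
        if [b for b, _ in g_pairs] != [b for b, _ in p_pairs]:
            raise ValueError("gold and predicted text differ in base characters")
        # Compare marks order-insensitively (e.g. shadda + fatha in either order).
        errors = sum(
            sorted(g) != sorted(p)
            for (_, g), (_, p) in zip(g_pairs, p_pairs)
        )
        char_total += len(g_pairs)
        char_errors += errors
        word_errors += errors > 0

    der = 100 * char_errors / char_total
    wer = 100 * word_errors / len(gold_words)
    return der, wer
```

With a gold line of two three-letter words and a prediction that gets one diacritic wrong in the second word, this sketch reports one erroneous character out of six and one erroneous word out of two.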

Contributors

  1. Ali Hamdi Ali Fadel.
  2. Ibraheem Tuffaha.
  3. Bara' Al-Jawarneh.
  4. Mahmoud Al-Ayyoub.

License

The project is available as open source under the terms of the MIT License.
