lafifi-24 / arabic-dialect-identification

Fine-tune BERT models to classify Arabic text by different dialects.

Jupyter Notebook 99.58% Python 0.42%
arabert arabic-dialects arabic-nlp arabic-dialect-identification arabic-bert bert bert-models

Arabic-dialect-identification

This project aims to develop an NLP system that accurately identifies the dialect of an Arabic text. We trained and tested the system on large datasets, combining fine-tuned pre-trained BERT models with classical baselines (Multinomial Naive Bayes and Random Forest).

Website link:

https://arabic-dialect-id.streamlit.app/

APIs:

To make our Arabic Dialect Identification models accessible to users, we deployed three models (arabert, arabicbert, and arbert) on the Hugging Face platform and call them from our application through the Hugging Face APIs.
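As a rough illustration of calling a hosted model over HTTP, here is a minimal sketch. It is not the project's actual integration code: the endpoint root, the `{"inputs": ...}` payload, and the list-of-`{label, score}` response shape are assumptions based on the commonly documented Hugging Face Inference API for text classification, and the model ID and token are placeholders you would supply yourself.

```python
import json
import urllib.request

# Assumption: classic hosted Inference API root; check the current Hugging Face docs.
API_ROOT = "https://api-inference.huggingface.co/models/"


def build_request(model_id, text, token):
    """Build a POST request for a text-classification model (hypothetical model_id)."""
    payload = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        API_ROOT + model_id,
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def top_label(response):
    """Return (label, score) of the best candidate.

    Assumed response shape: [[{"label": ..., "score": ...}, ...]] or a flat list.
    """
    candidates = response[0] if response and isinstance(response[0], list) else response
    best = max(candidates, key=lambda c: c["score"])
    return best["label"], best["score"]


def classify(model_id, text, token):
    """Send the text to the hosted model and return its most likely dialect label."""
    with urllib.request.urlopen(build_request(model_id, text, token)) as resp:
        return top_label(json.load(resp))
```

Keeping request construction and response parsing separate from the network call makes both easy to check offline; `classify` is the only function that needs a valid token and network access.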

Dataset Construction

The dataset used in this project is a collection of Arabic sentences and their corresponding dialect labels. The dataset was constructed by combining data from multiple sources:

Name | Source | Paper
arabic_pos_dialect | https://huggingface.co/datasets/arabic_pos_dialect | ---
IADD: An integrated Arabic dialect identification dataset | https://github.com/JihadZa/IADD | https://www.sciencedirect.com/science/article/pii/S2352340921010519
QADI: Arabic Dialect Identification in the Wild | https://github.com/qcri/QADI | https://www.researchgate.net/publication/341396032_Arabic_Dialect_Identification_in_the_Wild
The MADAR Arabic Dialect Corpus and Lexicon | https://sites.google.com/nyu.edu/madar/?pli=1 | https://aclanthology.org/L18-1535.pdf

In addition to these sources, we also collected 10,000 sentences in Modern Standard Arabic (MSA) from Wikipedia. These MSA examples were added to the dataset to balance the number of examples across the different dialects.

The following plot shows the distribution of data in our Arabic Dialect Identification dataset: the number of examples used for each dialect, broken down by source.

[Figure: number of examples per dialect, by source]

Preprocessing

Before training the model, the data is preprocessed by performing the following steps:

  • drop all words and characters that are not Arabic (such as tags)
  • remove repeated letters and one-letter words
  • apply the arabert preprocessing
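The first two steps above can be sketched with standard-library regular expressions. This is an illustration under stated assumptions, not the project's code: "Arabic" is taken to mean the letter range U+0621 to U+064A (so digits, punctuation, and diacritics are dropped too), "repeated letters" is taken to mean runs of three or more identical letters collapsed to one, and `clean_text` is a hypothetical name. The third step (arabert preprocessing) is left out because it depends on the external arabert package.

```python
import re

# Assumption: keep only Arabic letters U+0621..U+064A and whitespace.
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")
# Assumption: a run of 3 or more identical Arabic letters counts as elongation.
REPEATED_LETTER = re.compile(r"([\u0621-\u064A])\1{2,}")


def clean_text(text):
    """Drop non-Arabic characters, collapse elongated letters, remove one-letter words."""
    text = NON_ARABIC.sub(" ", text)        # step 1: tags, Latin text, digits, emoji
    text = REPEATED_LETTER.sub(r"\1", text)  # step 2a: collapse each long run to one letter
    tokens = [tok for tok in text.split() if len(tok) > 1]  # step 2b: one-letter words
    return " ".join(tokens)
```

Replacing non-Arabic characters with a space rather than deleting them keeps neighbouring Arabic words from being glued together when a tag or Latin word sits between them.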

Modeling

In this project, we experimented with different models for Arabic Dialect Identification: Random Forest, Naive Bayes, a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM), and BERT models such as AraBERT, ArabicBERT, ARBERT, MARBERT, and multilingual BERT. We trained each model on the preprocessed dataset and evaluated its performance on two test sets:

  • QADI test set: This test set contains manually annotated Arabic sentences, each labeled with the country of its dialect. It contains about 200 examples for each of the 18 dialects.

  • arabic_pos_dialect test set: This test set contains manually annotated Arabic sentences, each labeled with the region of its dialect (GLF, MGR, EGY, LEV). It contains 350 examples for each of the 4 regions, for a total of 1,400 examples.

Model | QADI F1 Score | arabic_pos_dialect F1 Score
TF-IDF + Multinomial NB | 0.7506 | 0.8671
TF-IDF + RandomForest | 0.7435 | 0.7228
Bidirectional LSTM | 0.5251 | --
arabert | 0.7637 | 0.8621
arbert | 0.7424 | 0.8792
arabic bert | 0.7465 | 0.86642
marbert | 0.7374 | --
multilingual BERT | 0.6683 | --
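The table does not state how the F1 score is averaged across dialects. As one plausible reading, the sketch below computes a macro-averaged F1 (the unweighted mean of per-class F1 scores) in plain Python. The macro averaging and the function name `macro_f1` are assumptions, not something the source confirms.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true, pred in zip(y_true, y_pred):
        if true == pred:
            tp[true] += 1
        else:
            fp[pred] += 1  # predicted class gains a false positive
            fn[true] += 1  # true class gains a false negative
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

With balanced test sets such as the ones described above, macro averaging weights every dialect equally, which is why it is a common choice for this kind of benchmark.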

Confusion Matrix

QADI test set

In the confusion matrix, each cell counts the test examples with a given true dialect that the model assigned to a given predicted dialect. The diagonal cells are correct predictions, and the off-diagonal cells show which dialects the model confuses with each other.
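For readers who want to reproduce such a matrix from their own predictions, here is a minimal sketch in plain Python. The orientation (rows are true labels, columns are predicted labels) and the function name are assumptions chosen for illustration; the plots in this project may use a different layout.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count (true, predicted) pairs; rows are true labels, columns are predictions."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true, pred in zip(y_true, y_pred):
        matrix[index[true]][index[pred]] += 1
    return matrix


# Usage with the four region labels of the arabic_pos_dialect test set.
regions = ["EGY", "GLF", "LEV", "MGR"]
example = confusion_matrix(
    ["EGY", "EGY", "GLF", "LEV"],
    ["EGY", "GLF", "GLF", "LEV"],
    regions,
)
```

Each row of `example` sums to the number of test sentences with that true label, so dividing a row by its sum gives per-dialect recall on the diagonal.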

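arabert Model

[Figure: confusion matrix of the arabert model on the QADI test set]

arabicbert Model

[Figure: confusion matrix of the arabicbert model on the QADI test set]

TF-IDF + Multinomial NB

[Figure: confusion matrix of TF-IDF + Multinomial NB on the QADI test set]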

arabic_pos_dialect test set

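arabert Model

[Figure: confusion matrix of the arabert model on the arabic_pos_dialect test set]

arabicbert Model

[Figure: confusion matrix of the arabicbert model on the arabic_pos_dialect test set]

TF-IDF + Multinomial NB

[Figure: confusion matrix of TF-IDF + Multinomial NB on the arabic_pos_dialect test set]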
