arabic-nlp

This repo explores the value of several common text classification techniques for an Arabic language dataset. It also serves as an introduction to those techniques for a reader who has basic or intermediate Python skills but little familiarity with text classification.

Each notebook also digs into the data to offer some thoughts on what types of text provide a challenge for each text classification approach. But you don't need to understand any Arabic to (hopefully) benefit from the explanations of machine learning and text classification!

We use this dataset of Arabic news labeled for category: https://data.mendeley.com/datasets/322pzsdxwy/1

The authors provide this description of the dataset:

RTAnews dataset is a collections of multi-label Arabic texts, collected form Russia Today in Arabic news portal. It consists of 23,837 texts (news articles) distributed over 40 categories, and divided into 15,001 texts for the training and 8,836 texts for the test. The original dataset (without preprocessing), a preprocessed version of the dataset, versions of the dataset in MEKA and Mulan formats, single-label version, and WEAK version all are available.

00-text-preprocessing.ipynb cleans the raw dataset and produces a train/val/test split that we'll use throughout this repo for consistency.
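As a rough illustration of this kind of step, the sketch below normalizes Arabic text and makes a reproducible three-way split using only the standard library. The specific cleaning rules (stripping diacritics and tatweel, unifying alef variants), the function names, and the 80/10/10 ratios are assumptions made for illustration; the notebook itself may clean and split the data differently.

```python
import random
import re

# Arabic diacritics (tashkeel, U+064B..U+0652) and the tatweel elongation mark (U+0640).
DIACRITICS_AND_TATWEEL = re.compile(r"[\u064B-\u0652\u0640]")
# Alef variants carrying madda or hamza: U+0622, U+0623, U+0625.
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")


def normalize_arabic(text):
    """Apply a few common (assumed) normalization steps to Arabic text."""
    text = DIACRITICS_AND_TATWEEL.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)  # map every variant to bare alef
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


def train_val_test_split(items, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle a copy of items with a fixed seed and cut it into three parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(100))
```

Fixing the seed is what makes the split reusable across notebooks: every later experiment sees the same training, validation, and test examples.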

01-sklearn-binary.ipynb simplifies our problem to binary classification and introduces the text classification utilities in sklearn.
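To give a flavor of the sklearn utilities involved, here is a minimal sketch of a binary text classifier: TF-IDF features fed into logistic regression. The four toy headlines and the "sports vs. not sports" framing are placeholders invented for this example (they are not taken from RTAnews), and the choice of TfidfVectorizer with LogisticRegression is an assumption rather than a statement of what the notebook uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy placeholder corpus (not from RTAnews): 1 = sports, 0 = not sports.
texts = [
    "فاز الفريق في المباراة",     # the team won the match
    "سجل اللاعب في المباراة",     # the player scored in the match
    "ارتفعت أسعار النفط اليوم",   # oil prices rose today
    "انخفضت أسعار الذهب اليوم",   # gold prices fell today
]
labels = [1, 1, 0, 0]

# The pipeline turns raw strings into TF-IDF vectors, then fits the classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Classify an unseen headline built from sports vocabulary.
new_text = ["فاز اللاعب في المباراة"]  # the player won the match
prediction = model.predict(new_text)
probabilities = model.predict_proba(new_text)
```

Wrapping the vectorizer and classifier in one pipeline means the same vocabulary learned during `fit` is applied automatically at prediction time.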

02-sklearn-multiclass.ipynb introduces the multiclass classification utilities in sklearn, since our dataset actually has 40 categories of news.
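Moving from two classes to many requires little change in sklearn, since estimators such as LogisticRegression accept more than two labels directly. The sketch below uses three made-up categories with two toy headlines each; the data, the category names, and the choice of estimator are illustrative assumptions, not the notebook's actual setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy placeholder corpus (not from RTAnews) with three hypothetical categories.
texts = [
    "فاز الفريق في المباراة",     # the team won the match
    "سجل اللاعب في المباراة",     # the player scored in the match
    "ارتفعت أسعار النفط اليوم",   # oil prices rose today
    "انخفضت أسعار الذهب اليوم",   # gold prices fell today
    "التقى الرئيس مع الوزير",     # the president met with the minister
    "تحدث الرئيس مع السفير",      # the president spoke with the ambassador
]
labels = ["sports", "sports", "economy", "economy", "politics", "politics"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

new_text = ["فاز اللاعب في المباراة"]  # the player won the match
predicted_label = model.predict(new_text)[0]

# One probability per class, in the order given by model.classes_.
class_probabilities = dict(zip(model.classes_, model.predict_proba(new_text)[0]))
```

Inspecting the per-class probabilities, rather than only the top prediction, is one way to see which categories a model tends to confuse.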

03-word2vec.ipynb introduces the concept of word embeddings and uses word2vec to train word embeddings from scratch with gensim.

04-fasttext.ipynb uses pre-trained embeddings for Arabic available through the fasttext package.

05-laser.ipynb introduces LASER, a toolkit that produces multilingual sentence embeddings in a single shared space.

06-lstms.ipynb introduces LSTMs, a type of neural network that works well for text classification, and builds an LSTM from scratch using keras.

07-multilingual-bert.ipynb introduces transformer models in general and multilingual BERT, a large language model pre-trained on many languages including Arabic.

Contributors

ajw-42
