Coder Social home page Coder Social logo

aranizer's Introduction

AraNizer

Description

AraNizer is a sophisticated toolkit of custom tokenizers tailored for Arabic language processing. Integrating advanced methodologies such as SentencePiece and Byte Pair Encoding (BPE), these tokenizers are specifically designed for seamless integration with the transformers and sentence_transformers libraries. The AraNizer suite offers a range of tokenizers, each optimized for distinct NLP tasks and accommodating varying vocabulary sizes to cater to a multitude of linguistic applications.

Key Features

  • Versatile Tokenization: Supports multiple tokenization strategies (BPE, SentencePiece) for varied NLP tasks.
  • Broad Vocabulary Range: Customizable tokenizers with vocabulary sizes ranging from 32k to 86k.
  • Seamless Integration: Compatible with popular libraries like transformers and sentence_transformers.
  • Optimized for Arabic: Specifically engineered for the intricacies of the Arabic language.

Installation

Install AraNizer effortlessly with pip:

pip install aranizer

Usage

Importing Tokenizers

Import your desired tokenizer from AraNizer. Available tokenizers include:

  • BEP variants: aranizer_bpe50k, aranizer_bpe64k,
  • SentencePiece variants: aranizer_bpe86k, aranizer_sp32k, aranizer_sp50k, aranizer_sp64k, aranizer_sp86k.
from aranizer import aranizer_sp32k  # Replace with your chosen tokenizer

#Load your tokenizer:
tokenizer = aranizer_sp32k.get_tokenizer()

Tokenizing Text

Tokenize Arabic text using the selected tokenizer:

text = "مثال على النص العربي"  # Example Arabic text
tokens = tokenizer.tokenize(text)
print(tokens)

Encoding and Decoding

Encode text into token ids and decode back to text.

Encoding: To encode text, use the encode method.

text = "مثال على النص العربي"  # Example Arabic text
encoded_output = tokenizer.encode(text, add_special_tokens=True)
print(encoded_output)

Decoding: To convert token ids back to text, use the decode method:

decoded_text = tokenizer.decode(encoded_output)
print(decoded_text)

Available Tokenizers

- aranizer_bpe32k: Based on BEP Tokenizer with Vocab Size of 32k
- aranizer_bpe50k: Based on BEP Tokenizer with Vocab Size of 50k
- aranizer_bpe64k: Based on BEP Tokenizer with Vocab Size of 64k
- aranizer_bpe86k: Based on BEP Tokenizer with Vocab Size of 86k
- aranizer_sp32k: Based on Sentence Peice Tokenizer with Vocab Size of 32k
- aranizer_sp50k: Based on Sentence Peice Tokenizer with Vocab Size of 50k
- aranizer_sp64k: Based on Sentence Peice Tokenizer with Vocab Size of 64k
- aranizer_sp86k: Based on Sentence Peice Tokenizer with Vocab Size of 86k

System Requirements

  • Python 3.x
  • transformers library

Contact:

For queries or assistance, please contact [email protected].

Acknowledgments:

This work is maintained by the Robotics and Internet-of-Things Lab at Prince Sultan University.

Team:

  • Prof. Anis Koubaa (Lab Leader)
  • Dr. Lahouari Ghouti (NLP Team Leader)
  • Eng. Omar Najjar (AI Research Assistant)
  • Eng. Serry Sebai (NLP Research Engineer)

Version:

0.1.8

Citations:

Coming soon

aranizer's People

Contributors

aniskoubaa avatar omarnj-lab avatar muhassar avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.