amankshihab / tener-malayalam

Named entity recognition in Malayalam using BiLSTM and TENER (Transformer Encoder)

Home Page: https://tener-malayalam-k2i4cw0x9c.streamlit.app/

byte-pair-encoding deep-learning indic-languages malayalam named-entity-recognition natual-language-processing pytorch

Named Entity Recognition In Malayalam

This repo implements and compares two models, a Bidirectional LSTM and TENER (Transformer Encoder for Named Entity Recognition), on the ai4bharat-IndicNER dataset.

A live demo is available at https://tener-malayalam-k2i4cw0x9c.streamlit.app/

Pretrained Models

Please find the pretrained weights at: https://drive.google.com/drive/folders/13DQ7zTz8fiSTkwmScpd8ZuO5mVtC4y_C?usp=sharing

File Structure

  • modeling_TENER.ipynb has the code for training and contains some results as well.
  • models/ contains the code definitions for both the models.
  • malayalam_ner.py implements a helper class that makes it easier to predict with either of the models.
  • predict.py contains code for running inference on a single string.

Tokenization & Embedding

  • Byte Pair Encoding (BPE) is used for tokenization.
  • It was chosen after comparing it with fastText.
  • The tokens were also vectorized using its vectorizer.
  • It was taken from BPEmb, which has pretrained embedding models for over 275 languages.
  • More details can be found on the BPEmb project page, including the page for the specific model I used.
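
To make the tokenization step more concrete, here is a toy sketch of how byte pair encoding learns merge rules and then segments a word into subword units. It is a pure-Python illustration on a made-up corpus, not the BPEmb implementation or its pretrained vocabulary; the function names `merge_pair`, `learn_bpe` and `apply_bpe` are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts out as a tuple of single characters.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties go to the lexicographically largest pair.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment `word` by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["abab", "abab", "abc"]  # made-up toy corpus
merges = learn_bpe(corpus, num_merges=2)
print(merges)                     # [('a', 'b'), ('ab', 'ab')]
print(apply_bpe("abab", merges))  # ['abab']
print(apply_bpe("cab", merges))   # ['c', 'ab']
```

Frequent character sequences end up as single tokens, while rarer words fall back to smaller pieces, which is why subword vocabularies cope well with morphologically rich languages.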

The Models

  1. Bidirectional LSTM

    • Uses 3 layers, with a hidden size of 200.
    • Uses ReLU as the activation function.
    • Combines manually initialized weights and LayerNorm layers for numerical stability.
  2. TENER

    • Employs an adaptation of TENER.
    • Compared to the paper, the CRF layer at the end has been dropped.
    • Here, I have set d_model = 512 and n_heads = 16.
    • A weight vector has been used in the loss function to account for the imbalance of tags in the dataset.
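
The weighted loss mentioned for TENER can be sketched as follows. This is an illustration under assumptions, not the repo's code: the README does not say how the weight vector was chosen, so the inverse-frequency scheme, the tag names, and the helper names `inverse_frequency_weights` and `weighted_cross_entropy` are all hypothetical. In PyTorch, a vector like this would typically be passed as the `weight` argument of `torch.nn.CrossEntropyLoss`.

```python
import math
from collections import Counter


def inverse_frequency_weights(tags, tag_set):
    """Weight each tag by total / (num_tags * count), so rare tags get larger weights."""
    counts = Counter(tags)
    total = len(tags)
    return {
        tag: total / (len(tag_set) * counts[tag]) if counts[tag] else 0.0
        for tag in tag_set
    }


def weighted_cross_entropy(logits, targets, weights):
    """Weighted mean of per-token negative log-likelihoods.

    logits:  list of dicts mapping tag -> raw score, one dict per token
    targets: list of gold tags, one per token
    weights: dict mapping tag -> class weight
    """
    weighted_sum, weight_total = 0.0, 0.0
    for scores, gold in zip(logits, targets):
        # Log-sum-exp with the max subtracted for numerical stability.
        top = max(scores.values())
        log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
        nll = log_z - scores[gold]
        weighted_sum += weights[gold] * nll
        weight_total += weights[gold]
    # Normalise by the total weight of the targets, not by the token count.
    return weighted_sum / weight_total


# Hypothetical, heavily imbalanced tag distribution.
train_tags = ["O"] * 8 + ["B-PER"] * 2
weights = inverse_frequency_weights(train_tags, ["O", "B-PER"])
print(weights)  # {'O': 0.625, 'B-PER': 2.5}
```

With these weights, a mistake on the rare `B-PER` tag contributes four times as much to the loss as a mistake on `O`, which discourages the model from predicting `O` everywhere.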

Results

  • With BiLSTM, the highest F1-score obtained is 0.96 and the lowest val_loss is 0.09.

  • With TENER, the highest F1-score obtained is 0.98 and the lowest val_loss is 0.05.

NOTE: This was on the test set provided with the dataset.
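
For reference, an F1-score for a tagging task can be computed as in the sketch below. The README does not state whether the reported scores are token-level or entity-level, or how they are averaged, so this example assumes a token-level, micro-averaged F1 over non-`O` tags; the function name `token_f1` and the example tags are hypothetical.

```python
def token_f1(gold, pred, outside="O"):
    """Micro-averaged token-level precision, recall and F1, ignoring the `outside` tag."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # True positive: an entity tag was predicted and it matches the gold tag.
    tp = sum(1 for g, p in zip(gold, pred) if p != outside and g == p)
    # False positive: an entity tag was predicted but the gold tag differs.
    fp = sum(1 for g, p in zip(gold, pred) if p != outside and g != p)
    # False negative: the gold tag is an entity tag but the prediction differs.
    fn = sum(1 for g, p in zip(gold, pred) if g != outside and g != p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = ["B-PER", "I-PER", "O", "B-LOC", "O"]
pred = ["B-PER", "O", "O", "B-LOC", "B-ORG"]
precision, recall, f1 = token_f1(gold, pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.667 0.667
```

Excluding `O` from the counts matters on imbalanced data: a model that tags every token as `O` would score high on plain accuracy but gets an F1 of 0 here.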

Caveats

  • While testing, I observed that the model performs better on sentences from the test set than on headlines or titles.
