sangitanlp / sangita

A Natural Language Toolkit for Indian Languages

License: Apache License 2.0

Python 100.00%
natural-language-processing deep-learning deep-neural-networks lstm recurrent-neural-networks machine-learning python

sangita's Introduction

Sangita.



A Natural Language Toolkit for Indian Languages

What is Sangita?

Sangita is a natural language toolkit for Indian languages built in Python. The aim of the project is to provide basic Natural Language Functionalities that include tokenization, lemmatisation, stemming, named entity recognition and Part of Speech Tagging for popular Indian Languages with Deep Neural Networks being employed for some of these tasks.

Dependencies

* Keras
* Scikit Learn
* Corpora are stored in the Sangita Data repository

License

The code and the models are distributed under the Apache 2.0 License.

We have used the following datasets; their respective licenses are enclosed along with them.

  • Hindi Dependency Treebank - LANGUAGE TECHNOLOGIES RESEARCH CENTER, IIIT Hyderabad.

    • Creative Commons License Attribution-NonCommercial-ShareAlike 4.0 International.

Contributions

Issues relating to GirlScript Summer of Code are referenced with the respective tags:

* Cakewalk - 10 points
* Intermediate - 20 points
* Pro - 30 points
* TopCoder - 50 points

You can look at the ongoing issues on this project board.

sangita's People

Contributors: djokester

sangita's Issues

Improve the Accuracy of the Gender Tagger

The (word, gender) tuple is currently available here
In accordance with Issue #9 we will move this file to Sangita Data
We will also create a new repository for Hindi Word Vectors and one for machine learning models. These will be referenced in a separate issue.
Along with this, we will remove the dependency on Scikit-Learn and work only with Keras.
The task list is given below

  • Move the gender.py to Sangita Data - Cakewalk.

  • Create a fresh set of word vectors and store it under a new repository especially for word vectors. - Pro.

  • Train the word vectors against the gender tags, and store the model under a separate repository. - Intermediate.

  • Refactor the code here, to accommodate these changes. - Intermediate.
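Before the word vectors can be trained against the gender tags, each word has to be turned into a fixed-length numeric sequence. The sketch below shows one stdlib-only way to do that preprocessing step; the function names are illustrative and not part of Sangita's existing code.

```python
# A minimal sketch (hypothetical helper names) of the preprocessing step that
# would feed a Keras gender classifier: map each character of a Hindi word to
# an integer index and right-pad to a fixed length.

def build_vocab(words):
    """Assign a 1-based index to every character seen in the corpus (0 = padding)."""
    chars = sorted({ch for w in words for ch in w})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode(word, vocab, maxlen=10):
    """Encode a word as a fixed-length sequence of character indices."""
    ids = [vocab.get(ch, 0) for ch in word[:maxlen]]
    return ids + [0] * (maxlen - len(ids))  # right-pad with zeros

words = ["लड़का", "लड़की", "किताब"]  # the (word, gender) corpus would supply these
vocab = build_vocab(words)
print(encode("लड़का", vocab))
```

Sequences produced this way can be fed to a Keras `Embedding` layer, which learns the actual word vectors during training.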

Create a Website for Sangita

Checklist for the Project:

  • Use a white background

  • Choose a nice font.

  • Top menu bar with four items: Code, Corpora, Documentation, Blog

  • Code should redirect to GitHub repo sangita

  • Corpora should redirect to the Organisation Page

  • Documentation should redirect to Readme

  • Blog should be a separate page with the Quora blog page posts embedded on it. Instructions are given here.

The project should be pushed to this repo.

Discovering Datasets

The repo currently doesn't have a specific Hindi corpus to work on. We are looking for corpora that satisfy the following points:

  • Type of Datasets: We are looking for something that can be used to train the stemmer, lemmatiser, tokenizer, POS tagger and the named entity recognizer on.
    • POS tagger: The corpora should have sentences with the parts of speech or “Shabdo ke prakaar” tagged in them. Read about what POS in Hindi is here and here.
    • We will also need a wordnet, something like this: http://www.cfilt.iitb.ac.in/wordnet/webhwn/hindi_examples.php
    • Dataset containing stems of words along with the different forms in which each word can appear, for the stemmer.
  • Restrictions on the dataset:

Some of the datasets that can be used might be available with the LTRC committee in IIT-B.
This issue is about discovering good Hindi Corpora for this project. Participants and contributors can search for and create PRs adding the datasets and links to the datasets.

Guidelines before sending Pull Requests:

  • This will be an issue of variable difficulty. You score points depending on the difficulty of the data extraction.
  • First, comment a link to the dataset on this issue, along with details about the data and its licensing.
  • Once a mentor approves it, you need to add the dataset to Sangita Data. A mentor will assist you in this task.
  • You can use alternative methods, such as web scraping, to generate the data yourself.

Here is a rough outline of the requirements.

  • WordNet

  • Word, Lemma Pairs

  • Word, POS pairs

  • Others
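
To make the Word, Lemma and Word, POS requirements above concrete, the sketch below parses one assumed format: one tab-separated pair per line. The layout is an illustration for contributors, not an existing Sangita Data convention.

```python
# A minimal sketch of a (word, lemma) pair corpus as tab-separated lines.
# The sample data and the one-pair-per-line layout are assumptions.
import csv
import io

sample = "लड़कों\tलड़का\nकिताबें\tकिताब\n"  # word<TAB>lemma, one pair per line

def load_pairs(text):
    """Parse tab-separated (word, lemma) pairs from a corpus file's contents."""
    return [tuple(row) for row in csv.reader(io.StringIO(text), delimiter="\t")]

pairs = load_pairs(sample)
print(pairs)  # [('लड़कों', 'लड़का'), ('किताबें', 'किताब')]
```

The same loader works for (word, POS) pairs; only the second column's meaning changes.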

Add setup.py to the base repository

setup.py is a Python file which tells you that the module/package you are about to install has been packaged and distributed with Distutils, the standard for distributing Python modules. This allows you to easily install Python packages. Often it's enough to run:
python setup.py install
and the module will install itself.

Write the setup.py file for this package.
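A minimal sketch of what that setup.py could look like, using setuptools (the modern successor to the Distutils approach described above). The version number and dependency pins are placeholders to be checked against the actual repository.

```python
# A minimal setup.py sketch for this package. Version, packages, and metadata
# values are placeholders, not the project's confirmed configuration.
from setuptools import setup, find_packages

setup(
    name="sangita",
    version="0.1.0",  # placeholder version
    description="A Natural Language Toolkit for Indian Languages",
    url="https://github.com/sangitanlp/sangita",
    license="Apache License 2.0",
    packages=find_packages(),  # picks up the package's modules automatically
    install_requires=["keras", "scikit-learn"],  # dependencies from the README
)
```

With this file in place, `python setup.py install` (or `pip install .`) installs the package.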

Implement Sangita in Bengali

Task List for getting Started

  • Implement Tokeniser for Bengali

  • Find Datasets for the Language to move forward with.
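A starting point for the first task above could be a simple rule-based tokenizer over the Unicode Bengali block; this is a sketch only, and a real tokenizer would need many more rules (clitics, numerals, abbreviations).

```python
# A minimal sketch of a rule-based Bengali tokenizer: match runs of characters
# in the Unicode Bengali block (U+0980–U+09FF), treat the danda (।, U+0964) and
# double danda (॥, U+0965) as separate tokens, and fall back to single
# non-space characters for everything else.
import re

TOKEN = re.compile(r"[\u0980-\u09FF]+|[\u0964\u0965]|\S")

def tokenize(text):
    """Return a list of word and punctuation tokens from Bengali text."""
    return TOKEN.findall(text)

print(tokenize("আমি ভাত খাই।"))  # ['আমি', 'ভাত', 'খাই', '।']
```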

Implement Sangita in Telugu

New language - Telugu

  • Find a data corpus for the Telugu language. Add links in the comments on this issue

  • Implement tokenizer for Telugu

  • Create a lemmatizer for Telugu

  • Create a POS Tagger for Telugu.

Extracting the BenLem Dataset

Extraction of Word, Lemma pairs from the BenLem dataset.
Citation: A. Chakrabarty and U. Garain (2015): BenLem (a Bengali Lemmatizer) and its Role in WSD, in ACM Trans. Asian and Low-Resource Language Information Processing (TALLIP).

Transfer all the Data To Sangita Data

Currently some of the data is housed in Corpora in this repository.
We need to transfer this data, along with the rest of the incoming data, to Sangita Data.
The reason for doing so is that PyPI has a limit on the size of the packages one can upload.
This will require code refactoring, changing of file types, and writing an installation script for automatic installation of the data files.

Here is a task list for the issue

  • Change the file type to something more usable and smaller in size.

  • Change the directory structure at Sangita Data.

  • Transfer the files and refactor the code to reflect the change in location.

  • Write an installation script.
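
The installation script from the task list above could look roughly like this; the repository slug and file paths are assumptions for illustration, not confirmed locations in Sangita Data.

```python
# A minimal sketch of a data-installation script: build raw GitHub URLs for
# files in the Sangita Data repository and download them into a local data
# directory. The repo slug and file paths below are hypothetical.
import os
import urllib.request

def raw_url(repo, path, branch="master"):
    """Return the raw.githubusercontent.com URL for a file in a GitHub repo."""
    return f"https://raw.githubusercontent.com/{repo}/{branch}/{path}"

def install_data(repo, files, dest="sangita_data"):
    """Download each corpus file into dest, creating the directory if needed."""
    os.makedirs(dest, exist_ok=True)
    for path in files:
        target = os.path.join(dest, os.path.basename(path))
        urllib.request.urlretrieve(raw_url(repo, path), target)  # network call
        print("installed", target)

# Example call with hypothetical repo and paths:
# install_data("sangitanlp/sangita-data", ["hindi/gender.tsv"])
```

Running such a script at install time keeps the PyPI package itself small while still making the corpora available locally.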

Improve Upon the Idea Implemented in The Stemmer

Have a look at the code in stemmer.py.
Then try to improve upon it or provide an alternative approach.
Requirements:

  • Proficiency in Python.
  • Proficiency in Hindi and basic word inflections.
  • Knowledge of the different types of stemmers.

This issue involves three steps:

  • Looking for appropriate data to help build rules for creating the stemmer.
  • Formulating linguistic rules for the stemmer.
  • Implementing the linguistic rules in code.
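
The third step above can be sketched as a suffix-stripping stemmer in the spirit of lightweight rule-based Hindi stemmers. The suffix list here is a tiny illustrative sample, not a complete rule set, and is not the rules used in stemmer.py.

```python
# A minimal sketch of a rule-based Hindi stemmer: strip the longest matching
# inflectional suffix, keeping at least two characters of the stem. The
# suffix list is a small illustrative sample only.
SUFFIXES = ["ियों", "ियाँ", "ाओं", "ाएँ", "ों", "ें", "ीं", "ो", "े", "ी", "ा"]

def stem(word):
    """Strip the longest matching inflectional suffix from a Hindi word."""
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) > len(suf) + 1:
            return word[: -len(suf)]
    return word

print(stem("लड़कों"))   # लड़क
print(stem("किताबें"))  # किताब
```

Checking candidate rules against a corpus of (word, stem) pairs, as described in the first step, is what turns a sketch like this into a usable stemmer.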

The pull request must contain a detailed layout of how the developer sought to tackle the problem.
It must also contain the source of the data and its licensing specifications. Along with this, there must be sufficient examples to demonstrate the working of the stemmer.
