sangitanlp / sangita

A Natural Language Toolkit for Indian Languages

License: Apache License 2.0

Python 100.00%
natural-language-processing deep-learning deep-neural-networks lstm recurrent-neural-networks machine-learning python

sangita's Introduction

Sangita.



A Natural Language Toolkit for Indian Languages

What is Sangita?

Sangita is a natural language toolkit for Indian languages built in Python. The aim of the project is to provide basic Natural Language Functionalities that include tokenization, lemmatisation, stemming, named entity recognition and Part of Speech Tagging for popular Indian Languages with Deep Neural Networks being employed for some of these tasks.

Dependencies

* Keras
* Scikit Learn
* Corpora are stored in the Sangita Data repository

License

The code and the models are distributed under the Apache 2.0 License.

We have used the following datasets; their respective licenses are enclosed along with them.

  • Hindi Dependency Treebank - LANGUAGE TECHNOLOGIES RESEARCH CENTER, IIIT Hyderabad.

    • Creative Commons License Attribution-NonCommercial-ShareAlike 4.0 International.

Contributions

Issues relating to GirlScript Summer of Code are referenced with the respective tags:

* Cakewalk - 10 points
* Intermediate - 20 points
* Pro - 30 points
* TopCoder - 50 points

You can look at the ongoing issues on this project board.

sangita's People

Contributors: djokester

sangita's Issues

Improve the Accuracy of the Gender Tagger

The (word, gender) tuple is currently available here
In accordance with Issue #9 we will move this file to Sangita Data
We will also create a new repository for Hindi Word Vectors and one for machine learning models. These will be referenced in a separate issue.
Along with this, we will remove the dependency on Scikit-Learn and work only with Keras.
The task list is given below

  • Move the gender.py to Sangita Data - Cakewalk.

  • Create a fresh set of word vectors and store it under a new repository especially for word vectors. - Pro.

  • Train the word vectors against the gender tags, and store the model under a separate repository. - Intermediate.

  • Refactor the code here, to accommodate these changes. - Intermediate.
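Before the word vectors can be trained against the gender tags, each word has to be turned into a fixed-length numeric sequence. The sketch below shows one stdlib-only way to do that preprocessing step; the function names are illustrative and not part of Sangita's existing code.

```python
# A minimal sketch (hypothetical helper names) of the preprocessing step that
# would feed a Keras gender classifier: map each character of a Hindi word to
# an integer index and right-pad to a fixed length.

def build_vocab(words):
    """Assign a 1-based index to every character seen in the corpus (0 = padding)."""
    chars = sorted({ch for w in words for ch in w})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode(word, vocab, maxlen=10):
    """Encode a word as a fixed-length sequence of character indices."""
    ids = [vocab.get(ch, 0) for ch in word[:maxlen]]
    return ids + [0] * (maxlen - len(ids))  # right-pad with zeros

words = ["लड़का", "लड़की", "किताब"]  # the (word, gender) corpus would supply these
vocab = build_vocab(words)
print(encode("लड़का", vocab))
```

Sequences produced this way can be fed to a Keras `Embedding` layer, which learns the actual word vectors during training.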

Create a Website for Sangita

Checklist for the Project:

  • Use a white background

  • Choose a nice font.

  • Top menu bar with four items: Code, Corpora, Documentation, Blog

  • Code should redirect to GitHub repo sangita

  • Corpora should redirect to the Organisation Page

  • Documentation should redirect to Readme

  • Blog should be a separate page with the Quora blog page posts embedded on it. Instructions are given here.

The project should be pushed to this repo.

Discovering Datasets

The repo currently doesn't have a specific Hindi corpus to work on. We are looking for corpora that satisfy the following points:

  • Type of Datasets: We are looking for something that can be used to train the stemmer, lemmatiser, tokenizer, POS tagger and the named entity recognizer on.
    • POS tagger: The corpora should have sentences with the parts of speech or “Shabdo ke prakaar” tagged in them. Read about what POS in Hindi is here and here.
    • We will also need a wordnet, something like this: http://www.cfilt.iitb.ac.in/wordnet/webhwn/hindi_examples.php
    • Dataset containing stems of words along with the different forms in which each word can appear, for the stemmer.
  • Restrictions on the dataset:

Some of the datasets that can be used might be available with the LTRC committee in IIT-B.
This issue is about discovering good Hindi Corpora for this project. Participants and contributors can search for and create PRs adding the datasets and links to the datasets.

Guidelines before sending Pull Requests:

  • This will be an issue of variable difficulty. You score points depending on the difficulty of the data extraction.
  • First, comment a link to the dataset on this issue, along with details about the data and its licensing.
  • Once a mentor approves it, you need to add the dataset to Sangita Data. A mentor will assist you in this task.
  • You can use alternative methods, such as web scraping, to generate the data yourself.

Here is a rough outline of the requirements.

  • WordNet

  • Word, Lemma Pairs

  • Word, POS pairs

  • Others
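
To make the Word, Lemma and Word, POS requirements above concrete, the sketch below parses one assumed format: one tab-separated pair per line. The layout is an illustration for contributors, not an existing Sangita Data convention.

```python
# A minimal sketch of a (word, lemma) pair corpus as tab-separated lines.
# The sample data and the one-pair-per-line layout are assumptions.
import csv
import io

sample = "लड़कों\tलड़का\nकिताबें\tकिताब\n"  # word<TAB>lemma, one pair per line

def load_pairs(text):
    """Parse tab-separated (word, lemma) pairs from a corpus file's contents."""
    return [tuple(row) for row in csv.reader(io.StringIO(text), delimiter="\t")]

pairs = load_pairs(sample)
print(pairs)  # [('लड़कों', 'लड़का'), ('किताबें', 'किताब')]
```

The same loader works for (word, POS) pairs; only the second column's meaning changes.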

Add setup.py to the base repository

setup.py is a Python file which tells you that the module/package you are about to install has been packaged and distributed with Distutils, the standard for distributing Python modules. This allows you to easily install Python packages. Often it's enough to run:
python setup.py install
and the module will install itself.

Write the setup.py file for this package.
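A minimal sketch of what that setup.py could look like, using setuptools (the modern successor to the Distutils approach described above). The version number and dependency pins are placeholders to be checked against the actual repository.

```python
# A minimal setup.py sketch for this package. Version, packages, and metadata
# values are placeholders, not the project's confirmed configuration.
from setuptools import setup, find_packages

setup(
    name="sangita",
    version="0.1.0",  # placeholder version
    description="A Natural Language Toolkit for Indian Languages",
    url="https://github.com/sangitanlp/sangita",
    license="Apache License 2.0",
    packages=find_packages(),  # picks up the package's modules automatically
    install_requires=["keras", "scikit-learn"],  # dependencies from the README
)
```

With this file in place, `python setup.py install` (or `pip install .`) installs the package.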

Implement Sangita in Bengali

Task List for getting Started

  • Implement Tokeniser for Bengali

  • Find Datasets for the Language to move forward with.
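A starting point for the first task above could be a simple rule-based tokenizer over the Unicode Bengali block; this is a sketch only, and a real tokenizer would need many more rules (clitics, numerals, abbreviations).

```python
# A minimal sketch of a rule-based Bengali tokenizer: match runs of characters
# in the Unicode Bengali block (U+0980–U+09FF), treat the danda (।, U+0964) and
# double danda (॥, U+0965) as separate tokens, and fall back to single
# non-space characters for everything else.
import re

TOKEN = re.compile(r"[\u0980-\u09FF]+|[\u0964\u0965]|\S")

def tokenize(text):
    """Return a list of word and punctuation tokens from Bengali text."""
    return TOKEN.findall(text)

print(tokenize("আমি ভাত খাই।"))  # ['আমি', 'ভাত', 'খাই', '।']
```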

Implement Sangita in Telugu

New language - Telugu

  • Find a data corpus for the Telugu language. Add links in the comments on this issue

  • Implement tokenizer for Telugu

  • Create a lemmatizer for Telugu

  • Create a POS Tagger for Telugu.

Extracting the BenLem Dataset

Extraction of Word, Lemma pairs from the BenLem dataset.
Citation: A. Chakrabarty and U. Garain (2015): BenLem (a Bengali Lemmatizer) and its Role in WSD, in ACM Trans. Asian and Low-Resource Language Information Processing (TALLIP).

Transfer all the Data To Sangita Data

Currently some of the data is housed in Corpora in this repository.
We need to transfer this data, along with the rest of the incoming data, to Sangita Data.
The reason for doing so is that PyPI has a limit on the size of the packages one can upload.
This will require code refactoring, changing of file types, and writing an installation script for automatic installation of the data files.

Here is a task list for the issue

  • Change the file type to something more usable and smaller in size.

  • Change the directory structure at Sangita Data.

  • Transfer the files and refactor the code to reflect the change in location.

  • Write an installation script.
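
The installation script from the task list above could look roughly like this; the repository slug and file paths are assumptions for illustration, not confirmed locations in Sangita Data.

```python
# A minimal sketch of a data-installation script: build raw GitHub URLs for
# files in the Sangita Data repository and download them into a local data
# directory. The repo slug and file paths below are hypothetical.
import os
import urllib.request

def raw_url(repo, path, branch="master"):
    """Return the raw.githubusercontent.com URL for a file in a GitHub repo."""
    return f"https://raw.githubusercontent.com/{repo}/{branch}/{path}"

def install_data(repo, files, dest="sangita_data"):
    """Download each corpus file into dest, creating the directory if needed."""
    os.makedirs(dest, exist_ok=True)
    for path in files:
        target = os.path.join(dest, os.path.basename(path))
        urllib.request.urlretrieve(raw_url(repo, path), target)  # network call
        print("installed", target)

# Example call with hypothetical repo and paths:
# install_data("sangitanlp/sangita-data", ["hindi/gender.tsv"])
```

Running such a script at install time keeps the PyPI package itself small while still making the corpora available locally.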

Improve Upon the Idea Implemented in The Stemmer

Have a look at the code in stemmer.py.
Then try to improve upon it or provide an alternative approach.
Requirements:

  • Proficiency in Python.
  • Proficiency in Hindi and basic word inflections.
  • Knowledge of the different types of stemmers.

This issue involves three steps:

  • Looking for appropriate data to help build rules for creating the stemmer.
  • Formulating linguistic rules for the stemmer.
  • Implementing the linguistic rules in code.
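
The third step above can be sketched as a suffix-stripping stemmer in the spirit of lightweight rule-based Hindi stemmers. The suffix list here is a tiny illustrative sample, not a complete rule set, and is not the rules used in stemmer.py.

```python
# A minimal sketch of a rule-based Hindi stemmer: strip the longest matching
# inflectional suffix, keeping at least two characters of the stem. The
# suffix list is a small illustrative sample only.
SUFFIXES = ["ियों", "ियाँ", "ाओं", "ाएँ", "ों", "ें", "ीं", "ो", "े", "ी", "ा"]

def stem(word):
    """Strip the longest matching inflectional suffix from a Hindi word."""
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) > len(suf) + 1:
            return word[: -len(suf)]
    return word

print(stem("लड़कों"))   # लड़क
print(stem("किताबें"))  # किताब
```

Checking candidate rules against a corpus of (word, stem) pairs, as described in the first step, is what turns a sketch like this into a usable stemmer.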

The pull request must contain a detailed layout of how the developer sought to tackle the problem.
It must also contain the source of the data and its licensing specifications. Along with this, there must be sufficient examples to demonstrate the working of the stemmer.
