kartikaggarwal98 / indian_parallelcorpus

Curated list of publicly available parallel corpus for Indian Languages

machinetranslation indian-languages nlp corpus parallel-corpus parallel-corpora multilingual-translation low-resource-machine-translation low-resource-languages neural-machine-translation

Parallel Corpus for Indian Languages

Parallel data available for training machine translation models in Indic languages: Assamese, Bengali, Gujarati, Gondi, Hindi, Kannada, Manipuri, Marathi, Malayalam, Oriya, Punjabi, Sanskrit, Tamil, Telugu.

Assamese-X

  1. Samanantar Corpus
  2. As-En PMIndia Corpus
  3. As-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row asm-eng.

Bengali-X

  1. Samanantar Corpus
  2. Bn-En BUET Parallel Corpus: 2.75 million Bengali-English sentence pairs (EMNLP 2020)
  3. Bn-En Project Anuvaad
  4. Bn-En Indian Parallel Corpora
  5. CVIT-IIITH PIB Multilingual Corpus: en, gu, hi, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  6. CVIT-IIITH Mann ki Baat Corpus: en, gu, hi, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  7. Bn-En Indian-Language Dataset
  8. Bn-En Asian Language Treebank (ALT) Parallel Corpus
  9. Bn-En PMIndia Corpus
  10. Bn-En OPUS: Set source as en and target as bn
  11. Bn-En SUPARA 0.8M: Requires an IEEE DataPort Subscription
  12. Bn-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row ben-eng.

Gujarati-X

  1. Samanantar Corpus
  2. Gu-En WikiTitles Parallel Corpus: wikititles-v1.gu-en.tsv.gz
  3. Gu-En Project Anuvaad
  4. Gu-En Tsardia
  5. CVIT-IIITH PIB Multilingual Corpus: en, bn, hi, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  6. CVIT-IIITH Mann ki Baat Corpus: en, bn, hi, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  7. Gu-En Shahparth123
  8. Gu-En PMIndia Corpus
  9. Gu-En Bible Corpus
  10. Gu-En OPUS: Set source as en and target as gu
  11. Gu-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row guj-eng.
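Several entries above are distributed as compressed tab-separated files, for example wikititles-v1.gu-en.tsv.gz. The following is a minimal sketch of reading such a file in Python. It assumes each line holds a source segment and a target segment separated by a single tab; check the actual column layout before relying on it. The function name `read_parallel_tsv` is hypothetical and not part of any listed corpus.

```python
import gzip
from typing import Iterator, Tuple


def read_parallel_tsv(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs from a gzipped TSV file.

    Assumption: column 0 is the source segment and column 1 the target.
    Lines with fewer than two columns or an empty side are skipped.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # blank or malformed line
            source, target = fields[0].strip(), fields[1].strip()
            if source and target:
                yield source, target


if __name__ == "__main__":
    # The file must be downloaded first; this path is only an example.
    for source, target in read_parallel_tsv("wikititles-v1.gu-en.tsv.gz"):
        print(source, "|||", target)
        break
```

Reading lazily with a generator keeps memory use flat, which matters for the larger corpora in this list.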

Gondi-X

  1. Gondi-Hindi Parallel Corpus

Hindi-X

  1. Samanantar Corpus
  2. Hi-En IITB Parallel Corpus: v3.0 released
  3. Hi-En Project Anuvaad
  4. Hi-En Indian Parallel Corpora
  5. CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  6. CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  7. Hi-En Asian Language Treebank (ALT) Parallel Corpus
  8. Hi-En PMIndia Corpus
  9. Hi-En Bible Corpus
  10. Hi-En Wiki Matrix Comparable Corpus
  11. Hi-En OPUS: Set source as en and target as hi. [Some of these corpora are part of the IITB Parallel Corpus.]
  12. Hi-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row hin-eng.
  13. IIITH Code-Mix Hi-En Corpus
  14. Hi-En Flickr 8k: Multimodal Dataset
  15. Hi-San Parallel Corpus: Hindi-Sanskrit monolingual and parallel data from the Ramayana, Rigveda, Bhagavad Gita, etc.

Kannada-X

  1. Samanantar Corpus
  2. Kn-En Project Anuvaad
  3. Kn-En PMIndia Corpus
  4. Kn-En Bible Corpus
  5. Kn-En OPUS: Set source as en and target as kn
  6. Kn-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row kan-eng.

Manipuri-X

  1. Mn-En PMIndia Corpus

Marathi-X

  1. Samanantar Corpus
  2. Mr-En Project Anuvaad
  3. CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, hi, ml, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  4. CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, hi, ml, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  5. Mr-En PMIndia Corpus
  6. Mr-En Bible Corpus
  7. Mr-En OPUS: Set source as en and target as mr
  8. Mr-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row mar-eng.

Malayalam-X

  1. Samanantar Corpus
  2. Ml-En Project Anuvaad
  3. Ml-En Indian Parallel Corpora
  4. CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, hi, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  5. CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, hi, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  6. Ml-En Indian-Language Dataset
  7. Ml-En English_Malayalam_ParallelCorpora
  8. Ml-En PMIndia Corpus
  9. Ml-En Bible Corpus
  10. Ml-En OPUS: Set source as en and target as ml
  11. Ml-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row mal-eng.

Oriya-X

  1. Samanantar Corpus
  2. Or-En MTEnglish2Odia
  3. Or-En OdiEnCorp 2.0
  4. Or-En OdiEnCorp 1.0
  5. Or-En IndoWordnet Parallel Corpus
  6. CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, hi, ml, mr, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  7. CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, hi, ml, mr, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  8. Or-En PMIndia Corpus
  9. Or-En OPUS: Set source as en and target as or
  10. Or-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row ori-eng.

Punjabi-X

  1. Samanantar Corpus
  2. Pu-En Project Anuvaad
  3. Pu-En Punjabi-English Corpus
  4. Pu-En PMIndia Corpus
  5. Pu-En OPUS: Set source as en and target as pa
  6. Pu-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row pan-eng.

Sanskrit-X

  1. San-Hi Parallel Corpus: Sanskrit-Hindi monolingual and parallel data from the Ramayana, Rigveda, Bhagavad Gita, etc.

Tamil-X

  1. Samanantar Corpus
  2. Ta-En Project Anuvaad
  3. Ta-En Indian Parallel Corpora
  4. Ta-En National Language Process Center
  5. Ta-En EnTam
  6. CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, hi, ml, mr, or, pa, te, ur. [Source-code, pretrained models and other resources also available.]
  7. CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, hi, ml, mr, or, pa, te, ur. [Source-code, pretrained models and other resources also available.]
  8. Ta-En Indian-Language Dataset
  9. Ta-En Multiple Dataset Links
  10. Ta-En PMIndia Corpus
  11. Ta-En Parallel Corpus
  12. Ta-En OPUS: Set source as en and target as ta
  13. Ta-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row tam-eng.

Telugu-X

  1. Samanantar Corpus
  2. Te-En Project Anuvaad
  3. Te-En Indian Parallel Corpora
  4. CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, hi, ml, mr, or, pa, ta, ur. [Source-code, pretrained models and other resources also available.]
  5. CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, hi, ml, mr, or, pa, ta, ur. [Source-code, pretrained models and other resources also available.]
  6. Te-En Indian-Language Dataset
  7. Te-En PMIndia Corpus
  8. Te-En Bible Corpus
  9. Te-En OPUS: Set source as en and target as te
  10. Te-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row tel-eng.
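Each Backtranslated Tatoeba Challenge entry above names the row to look for (asm-eng, ben-eng, and so on). As a convenience, the rows named in this list can be collected into a small lookup. This is only a sketch: the dictionary and the `tatoeba_row` helper are hypothetical names, and the table covers only the languages for which this list gives a row.

```python
# Row names copied from the Backtranslated Tatoeba Challenge entries in this list.
TATOEBA_ROWS = {
    "Assamese": "asm-eng",
    "Bengali": "ben-eng",
    "Gujarati": "guj-eng",
    "Hindi": "hin-eng",
    "Kannada": "kan-eng",
    "Marathi": "mar-eng",
    "Malayalam": "mal-eng",
    "Oriya": "ori-eng",
    "Punjabi": "pan-eng",
    "Tamil": "tam-eng",
    "Telugu": "tel-eng",
}


def tatoeba_row(language: str) -> str:
    """Return the row name listed for a language, ignoring case and outer spaces."""
    key = language.strip().title()
    if key not in TATOEBA_ROWS:
        raise KeyError(f"No Tatoeba Challenge row is listed for {language!r}")
    return TATOEBA_ROWS[key]


if __name__ == "__main__":
    print(tatoeba_row("hindi"))
```

Languages without a listed row (Gondi, Manipuri, Sanskrit) raise a `KeyError` instead of returning a guessed code.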

Other Resources

  1. PMIndia Parallel Corpus Creation: Code for creating a parallel corpus from pmindia.gov.in. [Paper Link]

Contributors

kartikaggarwal98, madaan
