banadda / salt

This project forked from sunbirdai/salt-data-archive

Multi-way parallel text corpus of 5 key Ugandan languages.

License: Creative Commons Attribution Share Alike 4.0 International

Jupyter Notebook 100.00%

salt's Introduction

SALT (Sunbird African Language Technology) dataset

SALT is a multi-way parallel text corpus of English and five key Ugandan languages: Luganda, Lugbara, Acholi, Runyankole and Ateso. The dataset contains 25,000 sentences covering a range of topics of local relevance, such as agriculture, health and society. Each sentence is translated into all six languages.
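
Because every sentence exists in all six languages, a single record can supply training examples for any source/target direction. The sketch below illustrates that idea only: the dict-per-sentence layout, the `translation_pairs` helper and the placeholder strings are assumptions made for illustration, not the repository's actual data format or code.

```python
from itertools import permutations

# Language names are taken from the README. Keying a record by language
# name is a hypothetical layout, not the dataset's real schema.
LANGUAGES = ["English", "Luganda", "Lugbara", "Acholi", "Runyankole", "Ateso"]


def translation_pairs(record):
    """Yield (source_lang, target_lang, source_text, target_text) for
    every ordered pair of languages in one multi-way parallel record."""
    for src, tgt in permutations(LANGUAGES, 2):
        yield src, tgt, record[src], record[tgt]


# Placeholder text only; real sentences come from the dataset files.
record = {lang: f"<{lang} sentence>" for lang in LANGUAGES}

pairs = list(translation_pairs(record))
print(len(pairs))  # 6 languages * 5 targets each = 30 directed pairs
```

Under these assumptions, one multi-way record expands to 30 directed sentence pairs, so the same corpus can serve models for any of the language directions.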

More details are in the following publication:

Machine Translation For African Languages: Community Creation Of Datasets And Models In Uganda. Benjamin Akera, Jonathan Mukiibi, Lydia Sanyu Naggayi, Claire Babirye, Isaac Owomugisha, Solomon Nsumba, Joyce Nakatumba-Nabende, Engineer Bainomugisha, Ernest Mwebaze, John Quinn. 3rd Workshop on African Natural Language Processing, 2022. [pdf]

This repository contains a notebook for training and evaluation which can be used to replicate the results in the paper. To try translation with the resulting models, see translate.sunbird.ai.

Changelog

[v1.2] 2023-04-21

Added TTS data for Luganda and English.

[v1.1] 2023-04-19

Data format changed, some duplicate sentences removed, and sentences partitioned into splits (train: 23,497; dev: 500; test: 500).

[v1.0] 2021-12-01

Initial commit of 25,000 sentences, with each sentence translated from English to Acholi, Ateso, Luganda, Lugbara and Runyankole.

salt's People

Contributors

cmplx-xyttmt, jqug, nlsanyu, emwebaze
