dsfsi / puodata

Curated corpora for Setswana. Used to train PuoBERTa.

License: Creative Commons Attribution Share Alike 4.0 International

african-languages african-nlp corpora natural-language-processing setswana south-africa tn tsn dsfsi-datasets

PuoData: A curated corpora for Setswana

Give Feedback 📑: DSFSI Resource Feedback Form

We believe that PuoData is a valuable resource for the Setswana language community. We hope that PuoData will be used to develop new and innovative applications that benefit the Setswana-speaking community.

Dataset Curation

| Dataset Name | Kind | Num. of Tokens |
|---|---|---|
| PuoData | | |
| NCHLT Setswana \cite{eiselen2014developing} | Government Documents | 1,010,147 |
| Nalibali Setswana | Children's Books | 57,654 |
| Setswana Bible | Book(s) | 879,630 |
| SA Constitution | Official Document | 56,194 |
| Leipzig Setswana Corpus BW | Curated Dataset | 219,149 |
| Leipzig Setswana Corpus ZA | Curated Dataset | 218,037 |
| SABC Dikgang tsa Setswana FB (Facebook) | News Headlines | 167,119 |
| SABC MotswedingFM FB | Online Content | 33,092 |
| Leipzig Setswana Wiki | Online Content | 230,333 |
| Setswana Wiki | Online Content | 183,168 |
| Vukuzenzele Monolingual TSN | Government News | 157,798 |
| gov-za Cabinet speeches TSN | Government Speeches | 591,920 |
| Department Basic Education TSN | Education Material | 708,965 |
| PuoData Total | 25MB on disk | 4,513,206 |
| PuoData+JW300 | | |
| JW300 Setswana | Book(s) | 19,782,122 |
| PuoData+JW300 Total | 124MB on disk | 24,295,328 |
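
The totals in the table can be re-derived from the per-source counts. The snippet below is a minimal sketch that does this in Python. The variable names (`puodata_tokens`, `jw300_tokens`) and the dictionary keys are labels chosen for this example and are not part of the dataset itself. The numbers are copied from the table.

```python
# Token counts per source, copied from the Dataset Curation table.
# The dictionary keys are illustrative labels, not file or dataset identifiers.
puodata_tokens = {
    "NCHLT Setswana": 1_010_147,
    "Nalibali Setswana": 57_654,
    "Setswana Bible": 879_630,
    "SA Constitution": 56_194,
    "Leipzig Setswana Corpus BW": 219_149,
    "Leipzig Setswana Corpus ZA": 218_037,
    "SABC Dikgang tsa Setswana FB": 167_119,
    "SABC MotswedingFM FB": 33_092,
    "Leipzig Setswana Wiki": 230_333,
    "Setswana Wiki": 183_168,
    "Vukuzenzele Monolingual TSN": 157_798,
    "gov-za Cabinet speeches TSN": 591_920,
    "Department Basic Education TSN": 708_965,
}
jw300_tokens = 19_782_122

# Sum the curated sources, then add JW300 for the combined corpus.
puodata_total = sum(puodata_tokens.values())
combined_total = puodata_total + jw300_tokens

print(f"PuoData total: {puodata_total:,}")          # PuoData total: 4,513,206
print(f"PuoData+JW300 total: {combined_total:,}")   # PuoData+JW300 total: 24,295,328
```

Both sums agree with the totals reported in the table. JW300 alone accounts for most of the combined corpus.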

Dataset Uses

We used this corpus to train PuoBERTa, 🤗 https://huggingface.co/dsfsi/PuoBERTa. It is also part of the corpus used for PuoBERTaJW300.

Citation Information

Bibtex Reference

@inproceedings{marivate2023puoberta,
  title   = {PuoBERTa: Training and evaluation of a curated language model for Setswana},
  author  = {Vukosi Marivate and Moseli Mots'Oehli and Valencia Wagner and Richard Lastrucci and Isheanesu Dzingirai},
  year    = {2023},
  booktitle= {SACAIR 2023 (To Appear)},
  keywords = {NLP},
  preprint_url = {https://arxiv.org/abs/2310.09141},
  dataset_url = {https://github.com/dsfsi/PuoBERTa},
  software_url = {https://huggingface.co/dsfsi/PuoBERTa}
}

License

PuoData is licensed under CC-BY-SA-4.0. The monolingual data have different licenses, depending on the license of the source news website.

Dataset Contact

For more details, reach out or check our website.

Email: [email protected]

Enjoy exploring Setswana through AI!
