cisnlp / glot500

Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages -- ACL 2023

Home Page: https://aclanthology.org/2023.acl-long.61

License: Other

Languages: Python 67.14%, Shell 13.94%, TeX 18.93%
Topics: acl, multilingual, multilingual-models, multilingual-nlp, nlp, xlm, xlm-r, glot500, glot, natural-language-processing

glot500's Introduction

Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages


Introduction

This repository contains information about the Glot500 model, data, and code.

  • Glot500-m is an extended version of XLM-R-base, covering more than 500 languages, compared to the 104 languages of XLM-R. Glot500-m is available at huggingface-models.

  • Glot2000-c comprises corpora for over 2000 languages; Glot500-c is the subset of Glot2000-c covering the over 500 languages that have more than 30,000 sentences.

Glot500-m

You can use this model directly with a pipeline for masked language modeling, after installing the dependencies (pip install transformers sentencepiece):

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='cis-lmu/glot500-base')
>>> unmasker("Hello I'm a <mask> model.")
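Each returned candidate is a dict with the keys score, token, token_str, and sequence; schematically (the values here are placeholders, not real model output):

[{'score': 0.1, 'token': 1234, 'token_str': '...', 'sequence': "Hello I'm a ... model."}, ...]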

Here is how to use this model to get the features of a given text in PyTorch:

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('cis-lmu/glot500-base')
model = AutoModelForMaskedLM.from_pretrained("cis-lmu/glot500-base")

# prepare input
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')

# forward pass
output = model(**encoded_input, output_hidden_states=True)
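To get a single vector per sentence, here is a minimal sketch continuing the snippet above (mean pooling over the last layer; the 768 in the final comment assumes the base model's hidden size):

import torch

with torch.no_grad():
    output = model(**encoded_input, output_hidden_states=True)

# output.hidden_states is a tuple with one tensor per layer (embeddings plus
# each transformer layer), each of shape (batch, seq_len, hidden_size)
last_hidden = output.hidden_states[-1]

# average over real tokens only, masking out padding positions
mask = encoded_input['attention_mask'].unsqueeze(-1)
sentence_embedding = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768]) for the base model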

Glot500-m Evaluation

We provide an in-depth evaluation of the Glot500-m model and baselines in our paper. Each number below is an average over the head languages, the tail languages, or all languages; see the paper for detailed results per task and language. Glot500-m outperforms XLM-R-B (base) on all tasks for tail languages and for head languages (except POS), and outperforms XLM-R-L (large) for tail languages. In the paper, the best result per task and language set is shown in bold. (For pseudoperplexity, lower is better; for all other tasks, higher is better.)

| Task | tail: XLM-R-B | tail: XLM-R-L | tail: Glot500-m | head: XLM-R-B | head: XLM-R-L | head: Glot500-m | all: XLM-R-B | all: XLM-R-L | all: Glot500-m |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Pseudoperplexity | 304.2 | 168.6 | 12.2 | 12.5 | 8.4 | 11.8 | 247.8 | 136.4 | 11.64 |
| Sentence Retrieval Tatoeba (Top 10 Acc.) | 32.6 | 33.6 | 59.8 | 66.2 | 71.1 | 75.0 | 56.6 | 60.4 | 70.7 |
| Sentence Retrieval Bible (Top 10 Acc.) | 7.4 | 7.1 | 43.2 | 54.2 | 58.3 | 59.0 | 19.3 | 20.1 | 47.3 |
| Text Classification (F1) | 13.7 | 13.9 | 46.6 | 51.3 | 60.5 | 54.7 | 23.3 | 25.8 | 48.7 |
| NER (F1) | 47.5 | 51.8 | 60.7 | 61.8 | 66.0 | 63.9 | 55.3 | 59.5 | 62.4 |
| POS (F1) | 41.7 | 43.5 | 62.3 | 76.4 | 78.4 | 76.0 | 65.8 | 67.7 | 71.8 |
| Roundtrip Alignment (Acc.) | 2.57 | 3.13 | 4.45 | 3.42 | 4.06 | 5.46 | 2.77 | 3.34 | 4.68 |

Glot500-c

This is an overview of the corpora included in Glot500-c, as presented in our paper. Glot500-c will be sent via email upon filling out the data request form. The part that we can redistribute is available at huggingface-dataset. For more information, check the table below.

Disclaimer: Please note that while the data sources used in this study do not explicitly prohibit the reuse of data for research purposes, only some of them have copyright statements explicitly permitting such use. Additionally, certain sources prohibit the redistribution of data; data from these sources is omitted from the published version of Glot500-c.
Regarding the ND (NoDerivs) constraint on some datasets, we only change the format of the container while preserving the original contents. The first column of the table indicates the availability of each corpus in the downloadable Glot500-c (yes/no/partially).

We request all users of Glot500-c to cite the original creators of the datasets and to comply with each dataset's license. A BibTeX file is available.

If you are a dataset owner and wish to update any part of this overview, or do not want your dataset to be included in Glot500-c, please send us an email at [email protected] .

Glot500-c overview table:

Available Dataset Related Papers Languages Domain / Notes Data collection / Verification method License
Partially 1000Langs - 1500 languages Religious Web-crawled Apache License 2.0
Yes Add Link arz, afb, ajp, apc Dialects, Arabic commentaries Annotated Freely available for research purposes
Yes AfriBERTa Link amh, hau, ibo, orm, pcm, som, swa, tir, yor mostly BBC, some Common Crawl Apache License 2.0
Yes AfroMAFT Link ; Link afr, amh, ara, eng, fra, hau, ibo, mlg, nya, orm, pcm, kin, sna, som, sot, swa, xho, yor, zul Language Adaptation Corpus https://www.nationalarchives.gov.uk/doc/non-commercial-government-licence/version/2/
Yes AI4Bharat Link pan, hin, ben, ori, asm, guj, mar, kan, tel, mal, tam News, magazine, blog posts Automatically curated CC BY-NC-SA 4.0
Yes AIFORTHAI-LotusCorpus - tha Large vOcabulary Thai continUous Speech recognition (LOTUS) corpus CC BY-NC-SA 3.0 TH, 2005 Copyright by National Electronics and Computer Technology Center (NECTEC). For more information, visit http://www.nectec.or.th/rdi/lotus
Yes Akuapem - aka Parallel sentences Verified by native speakers CC-BY 4.0
Yes Anuvaad - hin, ben, tam, mal, tel, kan, mar, pan, guj, asm, urd, ori Various domains (General, Legal, Education, Healthcare, Automobile, News) CC-BY 4.0
Yes AraBench Link arz, apc, afb, ary Translations of 'travelling phrases', blogs, tv transcripts, Bible Available Dialectal Arabic-English resources with curated evaluation sets Apache License 2.0
Yes AUTSHUMATO - tsn, tso South African government domain Creative Commons Attribution 2.5 South Africa License
Yes Bianet Link kur, eng, tur Parallel news corpus Automatically curated CC-BY-SA 4.0 open license
Yes BLOOM Link
aaa, abc, ada, adq, aeu, agq, ags, ahk, aia, ajz, aka, ame, amp, amu, ann, aph, awa, awb, azn, azo, bag, bam, baw, bax, bbk, bcc, bce, bec, bef, bfd, bfm, bfn, bgf, bho, bhs, bis, bjn, bjr, bkc, bkh, bkm, bkx, bob, bod, boz, bqm, bra, brb, bri, brv, bss, bud, buo, bwt, bwx, bxa, bya, bze, bzi, cak, cbr, cgc, chd, chp, cim, clo, cmo, csw, cuh, cuv, dag, ddg, ded, dig, dje, dmg, dnw, dtp, dtr, dty, dug, eee, ekm, enb, enc, ewo, fli, fon, fub, fuh, gal, gbj, gou, gsw, guc, guz, gwc, hao, hbb, hig, hil, hla, hna, hre, hro, idt, ilo, ino, isu, jgo, jmx, jra, kak, kam, kau, kbq, kbx, kby, kek, ken, khb, kik, kin, kjb, kmg, kmr, kms, kmu, kqr, krr, ksw, kvt, kwd, kwu, kwx, kxp, kyq, laj, lan, lbr, lfa, lgg, lgr, lhm, lhu, lkb, llg, lmp, lns, loh, lsi, lts, lug, luy, lwl, mai, mam, mdr, mfh, mfj, mgg, mgm, mgo, mgq, mhx, miy, mkz, mle, mlk, mlw, mmu, mne, mnf, mnw, mot, mqj, mrn, mry, msb, muv, mve, mxu, myk, myx, mzm, nas, nco, new, nge, ngn, nhx, njy, nla, nlv, nod, nsk, nsn, nso, nst, nuj, nwe, nwi, nxa, nxl, nyo, nyu, nza, odk, oji, oki, omw, ozm, pae, pag, pbt, pce, pcg, pdu, pea, pex, pis, pkb, pmf, pnz, psp, pwg, qaa, qub, quc, quf, quz, qve, qvh, qvm, qvo, qxh, rel, rnl, roo, rue, rug, saq, sat, sdk, sea, sgd, shn, sml, snk, snl, sox, sps, ssn, stk, sxb, syw, taj, tbj, tdb, tdg, tdt, teo, tet, the, thk, thl, thy, tio, tkd, tnl, tnn, tnp, tnt, tod, tom, tpi, tpl, tpu, tsb, tsn, tso, tuv, tuz, tvs, udg, unr, ven, vif, war, wbm, wbr, wms, wni, wnk, wtk, xkg, xmd, xmg, xmm, xog, xty, yas, yav, ybb, ybh, ybi, ydd, yea, yet, yin, ymp, zaw, zlm, zuh
Web Crawl from Internet and filtering CC BY 4.0
Yes CMU_Haitian_Creole - hat, eng Medical domain phrases and sentences in English translated into Haitian Creole by Eriksen Translations, Inc. Curated http://www.speech.cs.cmu.edu/haitian/text/COPYING
Yes CC100 Link ; Link asm, ful, grn, lim, lin, lug, nso, orm, que, roh, srd, ssw, tsn, wol Web Crawl from Internet Statistical Machine Translation at the University of Edinburgh makes no claims of intellectual property on the work of preparation of the corpus. By using this, you are also bound by the Common Crawl terms of use in respect of the content contained in the dataset.
Yes CCNet Link Multiple languages Multiple domains Datasets from Common Crawl MIT License
Yes Clarin (subset) - Multiple languages Multiple domains Multiple CC-BY 4.0
Yes CORP.NCHLT - nde, nso, sot, ssw, tsn, tso, ven, xho, zul Various Various Creative Commons Attribution 2.5 South Africa License
Yes DART Link arz, afb, acm, apc, ary Tweets Annotators involved also for quality control Publicly available
Yes Earthlings Link
acu, afr, amh, amu, asm, aze, bel, ben, bod, bus, cak, cbc, cbs, cbv, ceb, chv, coe, crn, csb, cym, des, div, dop, epo, eus, fao, gle, glg, guj, gum, gym, hat, hbs, hye, ido, ilo, ipi, isl, jav, kab, kal, kan, kaz, khm, kir, knv, kpr, kur, kyc, kyq, lao, lez, lus, maa, mal, mar, maz, mkd, mlg, mlp, mon, mop, mpx, mri, mya, myy, nep, opm, ori, pan, pck, pir, poh, ptu, pus, que, sab, sah, scn, sin, sja, sme, snd, som, srd, srm, sua, swa, tat, tbc, tbz, tca, tel, tgk, tgl, tpi, tuk, ubu, udm, uig, urd, uzb, wal, wln, wol, yid, yor
Subset of CommonCrawl Crawl from Internet and filtering GNU-GPL v.3 License
Yes Flores200 Link
ace_Arab, ace_Latn, acm_Arab, acq_Arab, aeb_Arab, afr_Latn, ajp_Arab, aka_Latn, als_Latn, amh_Ethi, apc_Arab, arb_Arab, arb_Latn, ars_Arab, ary_Arab, arz_Arab, asm_Beng, ast_Latn, awa_Deva, ayr_Latn, azb_Arab, azj_Latn, bak_Cyrl, bam_Latn, ban_Latn, bel_Cyrl, bem_Latn, ben_Beng, bho_Deva, bjn_Arab, bjn_Latn, bod_Tibt, bos_Latn, bug_Latn, bul_Cyrl, cat_Latn, ceb_Latn, ces_Latn, cjk_Latn, ckb_Arab, crh_Latn, cym_Latn, dan_Latn, deu_Latn, dik_Latn, dyu_Latn, dzo_Tibt, ell_Grek, eng_Latn, epo_Latn, est_Latn, eus_Latn, ewe_Latn, fao_Latn, fij_Latn, fin_Latn, fon_Latn, fra_Latn, fur_Latn, fuv_Latn, gaz_Latn, gla_Latn, gle_Latn, glg_Latn, grn_Latn, guj_Gujr, hat_Latn, hau_Latn, heb_Hebr, hin_Deva, hne_Deva, hrv_Latn, hun_Latn, hye_Armn, ibo_Latn, ilo_Latn, ind_Latn, isl_Latn, ita_Latn, jav_Latn, jpn_Jpan, kab_Latn, kac_Latn, kam_Latn, kan_Knda, kas_Arab, kas_Deva, kat_Geor, kaz_Cyrl, kbp_Latn, kea_Latn, khk_Cyrl, khm_Khmr, kik_Latn, kin_Latn, kir_Cyrl, kmb_Latn, kmr_Latn, knc_Arab, knc_Latn, kon_Latn, kor_Hang, lao_Laoo, lij_Latn, lim_Latn, lin_Latn, lit_Latn, lmo_Latn, ltg_Latn, ltz_Latn, lua_Latn, lug_Latn, luo_Latn, lus_Latn, lvs_Latn, mag_Deva, mai_Deva, mal_Mlym, mar_Deva, min_Arab, min_Latn, mkd_Cyrl, mlt_Latn, mni_Beng, mos_Latn, mri_Latn, mya_Mymr, nld_Latn, nno_Latn, nob_Latn, npi_Deva, nso_Latn, nus_Latn, nya_Latn, oci_Latn, ory_Orya, pag_Latn, pan_Guru, pap_Latn, pbt_Arab, pes_Arab, plt_Latn, pol_Latn, por_Latn, prs_Arab, quy_Latn, ron_Latn, run_Latn, rus_Cyrl, sag_Latn, san_Deva, sat_Olck, scn_Latn, shn_Mymr, sin_Sinh, slk_Latn, slv_Latn, smo_Latn, sna_Latn, snd_Arab, som_Latn, sot_Latn, spa_Latn, srd_Latn, srp_Cyrl, ssw_Latn, sun_Latn, swe_Latn, swh_Latn, szl_Latn, tam_Taml, taq_Latn, taq_Tfng, tat_Cyrl, tel_Telu, tgk_Cyrl, tgl_Latn, tha_Thai, tir_Ethi, tpi_Latn, tsn_Latn, tso_Latn, tuk_Latn, tum_Latn, tur_Latn, twi_Latn, tzm_Tfng, uig_Arab, ukr_Cyrl, umb_Latn, urd_Arab, uzn_Latn, vec_Latn, vie_Latn, war_Latn, wol_Latn, xho_Latn, ydd_Hebr, yor_Latn, yue_Hant, zho_Hans, zho_Hant, zsm_Latn, zul_Latn
Misc Human annotated CC-BY-SA 4.0
FrenchEwe - ewe, fra Parallel sentences Annotated CC-BY 4.0
Yes FFR Link fon, fra Parallel sentences Clean curated corpora MIT License and Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International
Yes GiossaMedia Link ; Link spa, grn Parallel sentences, news and social media Automatically curated; also used by NLLB, freely available
Yes Glosses Link 256 languages Disambiguated glosses Wikipedia, Wiktionary, WordNet, OmegaWiki and Wikidata. CC BY-NC-SA 3.0
Yes Habibi Link arz, afb, acm, ary, apd, apc Song lyrics Collected from the Web Freely available for research purposes
Yes Hindialect Link anp, awa, ben, bgc, bhb, bhd, bho, bjj, bns, bra, gbm, guj, hin, hne, kfq, kfy, mag, mar, mis, mup, noe, pan, raj, san Folksongs (all in Devanagari script) CC BY-NC-SA 4.0
Yes HornMT - aar, amh, eng, orm, som, tir multi-way parallel corpus CC-BY 4.0
Yes IITB Link eng, hin Collected from different sources and corpora Automatically collected CC-BY-NC 4.0
Yes Indiccorp Link asm, ben, guj, kan, mal, mar, ory, pan, tel Web Web crawled CC BY-NC-SA 4.0
Yes isiZulu - zul, eng English sentences, sampled from News Crawl datasets that were translated into isiZulu Annotated CC BY 4.0
Yes JESC Link eng, jpn Movie and tv subtitles Web-crawled CC-BY-NC 4.0
Yes JParaCrawl Link eng, jpn Various domains Web crawled, automatically aligned Custom License
No JW - Religious Web crawled Private
Yes KinyaSMT Link kin, eng Bible+other Automatically translated GNU General Public License v3.0
Yes LeipzigData Link
aar, ace, ach, aka, als, als-al, als-sqi, anw, arg, arz, asm, ast, aym, aze, azj, azj-az, bak, bam, ban, ban-id, bar, bcl, bem, bew, bih, bik, bjn, bjn-id, bod, bos, bpy, bua, bug, cdo, ceb, che, chv, ckb, cos, csb, diq, div, div-mv, dsb, dyu, ekk, emk, eml, ewe, ext, fao, fao-fo, fon, frr, fuc, ful, gan, glk, glv, gom, grn, gsw, gsw-ch, guj, hat, hat-ht, hbs, hbs-rs, hif, hil, hsb, ibb, ibo, ido, ile, ilo, ina, kab, kal, kal-gl, kas, kbd, kde, kea, khk, kik, kin, kng, knn, knn-in, koi, kom, kon, krc, ksh, ksw, lad, lgg, lim, lim-nl, lin, lmo, ltz, ltz-lu, lug, lup, lus, lus-in, lvs, mad, mad-id, mai, mhr, min, min-id, mkw, mlt, mos, mri, mri-nz, mrj, mwl, myv, mzn, nan, nap-tara, nav, nbl, ndo, nds, nds-nl, new, ngl, nno, nno-no, nob, nob-com, nob-no, nso, nso-za, nya, nyn, oci, oci-fr, orm, oss, pag, pam, pap, pcm, pfl, plt, pms, pnb, pnt, pus, roh, roh-ch, rom, rue, rue-ua, run, sah, san, scn, sco, seh, sgs, sin, skr, sme, sme-no, smi, sna, sna-zw, snd, snk, som, sot, sot-za, srd, ssw, ssw-za, suk, sun, sun-id, sus, swa, swh, szl, tat, tel, tem, tgk, tgk-tj, tgk-uz, tgl, tir, tiv, tsn, tsn-bw, tsn-za, tso, tso-za, tuk, tuk-tm, tum, tyv, udm, uig, uzb, uzn-uz, vec, vec-br, vec-hr, ven, ven-za, vls, vol, vro, war, wln, wol, wuu, xmf, ydd, yid, yor, zea, zha, zsm, zul, zul-za
Wikipedia, News, WebCrawl corpora of different years Crawl from Internet CC BY-NC-SA 3.0
Yes Lindat - Multiple languages Multiple Multiple CC-BY-NC 4.0
Yes Lingala_Song_Lyrics - fra, lin Scraped from the website www.ndombolo.co, which has almost 30 songs in Lingala with their French translations Web-scraped; also used by NLLB, freely available
Lyrics -
aar, abq, adq, ady, agx, aih, ain, aka, akk, ale, ami, ang, arg, arn, arp, asm, ast, aym, bak, bam, bci, bft, bfy, bgc, bhb, bho, bik, bis, bns, bod, bsk, bvd, bya, cab, cbk, cha, che, chg, cho, chr, chv, ckm, cnr, com, cor, cre, crh, csb, ctg, dak, dng, doi, dua, dum, dyu, dzo, enm, evn, ewe, ewo, ext, fao, fij, fon, frm, fro, fur, gag, gbm, gil, gla, glg, glk, gmh, goh, gon, got, gqn, grc, grt, hif, hil, hlb, hne, hop, hsb, ido, ina, inh, ist, izh, jam, jbo, kab, kas, kbd, kca, kdr, kea, kfy, kha, kik, kin, kio, kir, kjh, kmb, kok, kom, kon, krc, krl, kru, ksh, kum, lad, lbj, ldd, lij, lin, lki, lkt, lmo, ltg, lzh, lzz, mag, mah, mai, mbx, mby, min, mjw, mnc, mni, mnk, mns, moh, mos, mrg, mus, mwl, mxi, nan, nap, nav, nds, new, nio, niu, nog, non, nys, oci, odt, ohu, orm, ory, ota, pag, pap, pau, pcd, pcm, pdt, pjt, pli, pnt, pot, que, qya, raj, rar, rhg, roh, rom, rop, rtm, rup, sag, sah, sat, scn, sco, sdc, sel, sgh, sgs, sjn, skr, slr, smn, srn, ssw, sux, syl, szl, tah, tat, tbh, tcy, tet, tir, tlh, tpi, tsn, tuk, twe, twi, tyv, tzo, udm, uig, uki, ulk, unr, vec, ven, vep, vot, wbl, wol, wym, xal, xmf, xno, xxb, yux, zap, zha, zpu, zun, zza
Song lyrics Web-crawled
Yes MaCoCu Link mlt Crawl from Internet and filtering CC0 - No Rights Reserved
Yes Makerere MT Corpus - lug, eng Parallel sentences Annotated CC BY 4.0
Yes Masakhane MT Corpus - African languages Multiple domains Multiple MIT License
Yes Mburisano_Covid - afr, eng, nde, sot, ssw, tsn, tso, ven, xho, zul Corpus with limited domain Manually translated CC BY 3.0
Yes MC4 Link aze, ceb, cos, fil, guj, hat, haw, hmn, ibo, ltz, mlt, mri, nya, smo, sna, sot, sun, tgk, yor, zul Web Crawl from Internet ODC-By
Yes Menyo20K Link yor, eng Parallel, multidomain: news articles (JW), TED talks, movie transcripts, radio transcripts, science and technology texts, and other short articles curated from the web and by professional translators Various sources Non-commercial use
Yes Minangkabau corpora Link min_Latn, ind Parallel sentences Annotated MIT License
Yes MoT Link kin, lin, nde, orm, bod, tir Data collected from Voice of America (VOA) news websites MIT License
Partially MTData Link Multiple languages Various sources Multiple licenses (check spreadsheet)
Yes Nart/abkhaz - abk multiple sources Creative Commons Universal Public Domain License
Yes Ndc without informant codes dan, fao, isl, ovd, swe Nordic Dialect Corpus comprises recorded speech data from the Nordic countries, in languages that belong to the North Germanic language family. Various CC BY-NC-SA 4.0
Yes NLLB_seed Link ace_Arab, ace_Latn, ary, arz, bam, ban, bho, bja_Arab, bjn_Latn, bug, crh, dik, dzo, fur, fuv, grn, hne, kas_Latn, kas_Deva, knc_Arab, knc_Latn, lij, lim, lmo, ltg, mag, mni, mri, nus, prs, pbt, scn, shn, srd, szl, taq_Tfng, taq_Latn, tzm, vec Collection of topics in different fields of knowledge and human activity Professionally-translated sentences in the Wikipedia domain CC-BY-SA 4.0
OfisPublik Link ; Link bre Texts from the Ofis Publik ar Brezhoneg (Breton Language Board) provided by Francis Tyers
Partially OPUS Link Collection of translated texts from the web Automatically collected Multiple licenses (check spreadsheet)
Yes OSCAR Link
als, arg, arz, asm, ast, ava, aze, bak, bho, bod, bos, bpy, bxr, ceb, che, chv, ckb, cor, diq, div, dsb, eml, gom, grn, guj, hbs, hsb, ido, ilo, ina, jbo, kom, krc, lez, lim, lmo, ltz, mai, mhr, min, mlt, mrj, mzn, nah, nds, new, nno, oci, oss, pms, pnb, que, sah, scn, sun, tat, tgk, tuk, vol, war, wln, wuu, xal, xmf, yor
Web crawled Crawl from Internet and filtering CC BY 4.0
Yes ParaCrawl (subset) Link eng, ukr Various domains Web-crawled CC0
Upon direct request Parallel Bible Corpus Link Religious Automatically collected You can contact Michael Cysouw, Philipps University of Marburg, to request access to the PBC for academic purposes.
Yes Parallel Corpora for Ethiopian Languages Link amh, orm, tir Parallel sentences, religious domain Automatically curated CC-BY 4.0
Yes Phontron - eng, jpn Wikipedia Annotated CC-BY-SA 3.0
Yes QADI Link afb, abv, arq, arz, acm, apc, ary, acx, ajp, apd, aeb Tweets Tweets Apache License 2.0
Yes Quechua-IIC Link que multiple sources Apache License 2.0
Yes Shami Link apc, ajp Several topics from regular conversations such as politics, education, society, health care, housekeeping and others Automatic and manual approaches Apache License 2.0
Yes SLI_GalWeb.1.0 Link glg Galician political party, newspaper, government official website Crawling data from many Web data sources CC BY 4.0
Yes Stanford NLP: nmt Link eng, deu, cze
Partially StatMT - Multiple languages Various sources Various sources Multiple licenses (check spreadsheet)
Yes Tatoeba -
abk, acm, ady, afb, afh, afr, aii, ain, ajp, akl, aln, alt, amh, ang, aoz, apc, ara, arg, arq, ary, arz, asm, ast, avk, awa, ayl, aym, aze, bak, bal, bam, ban, bar, bcl, bel, ben, ber, bfz, bho, bis, bjn, bod, bom, bos, bre, brx, bua, bul, bvy, bzt, cat, cay, cbk, ceb, ces, cha, che, chg, chn, cho, chr, chv, cjy, ckb, ckt, cmn, cmo, cor, cos, cpi, crh, crk, crs, csb, cycl, cym, cyo, dan, deu, diq, div, dng, drt, dsb, dtp, dws, egl, ell, emx, eng, enm, epo, est, eus, evn, ewe, ext, fao, fij, fin, fkv, fra, frm, fro, frr, fry, fuc, fur, fuv, gaa, gag, gan, gbm, gcf, gil, gla, gle, glg, glv, gom, gos, got, grc, grn, gsw, guc, guj, hak, hat, hau, haw, hax, hbo, hdn, heb, hif, hil, hin, hnj, hoc, hrv, hrx, hsb, hsn, hun, hye, iba, ibo, ido, igs, iii, ike, ile, ilo, ina, ind, isl, ita, izh, jam, jav, jbo, jdt, jpa, jpn, kaa, kab, kal, kam, kan, kas, kat, kaz, kek, kha, khm, kin, kir, kiu, kjh, klj, kmr, knc, koi, kor, kpv, krc, krl, ksh, kum, kxi, kzj, laa, lad, lao, lat, ldn, lfn, lij, lim, lin, lit, liv, lkt, lld, lmo, lou, ltg, ltz, lug, lut, lvs, lzh, lzz, mad, mah, mai, mal, mar, max, mdf, mfa, mfe, mgm, mhr, mic, mik, min, mkd, mlg, mlt, mnc, mni, mnr, mnw, moh, mon, mri, mrj, mus, mvv, mwl, mww, mya, myv, nah, nan, nau, nav, nch, nds, new, ngt, ngu, niu, nld, nlv, nnb, nno, nob, nog, non, nov, npi, nst, nus, nya, nys, oar, oci, ofs, oji, ood, ori, orv, osp, oss, osx, ota, otk, pag, pal, pam, pan, pap, pau, pcd, pdc, pes, pfl, phn, pli, pms, pnb, pol, por, ppl, prg, pus, quc, que, qxq, qya, rap, rel, rhg, rif, roh, rom, ron, rue, run, rus, ryu, sag, sah, san, sat, scn, sco, sdh, sgs, shi, shs, shy, sin, sjn, skr, slk, slv, sma, sme, smo, sna, snd, som, sot, spa, sqi, srd, srn, srp, ssw, stq, sun, sux, swc, swe, swg, swh, syc, szl, tah, tam, tat, tel, tet, tgk, tgl, tha, thv, tig, tir, tkl, tlh, tly, tmr, tmw, toi, tok, ton, tpi, tpw, tsn, tso, tts, tuk, tur, tvl, tyv, tzl, udm, uig, ukr, umb, urd, urh, uzb, vec, vep, vie, vol, vro, war, wln, wol, wuu, xal, xho, xmf, xqa, yid, yor, yua, yue, zea, zgh, zlm, zsm, zul, zza
180922 version Voluntary contributions of thousands of members CC-BY 2.0 FR, CC0 1.0 Universal (more info)
Yes TeDDi Link
abk, aey, amp, ape, apu, arn, arz, ayz, bmi, bsk, bsn, cha, ckt, crk, dgz, dni, fij, gni, gry, gug, gyd, hae, hau, hix, hnj, imn, jac, kal, kan, kew, kgo, khk, kio, kjq, kut, laj, lue, lvk, mig, mph, mya, myh, myp, mzh, naq, ote, pav, plt, pwn, qvi, ram, rap, rma, sag, spp, swh, tiw, tml, tzm, vma, wba, wic, wyb, xsu, yad, yaq, yor, zoc, zul
Collection of different sources (see paper) Language identification and filtering CC BY-NC-SA 4.0
Yes TICO Link amh, ara, ben, ckb, din, eng, fas, fra, fuv, hau, hin, ind, khm, knc, kmr, lug, lin, mar, msa, mya, npi, nus, orm, prs, por, pus, rus, kin, som, spa, swh, tam, tir_et, tir_er, tgl, urd, zho, zul COVID-19 materials for a variety of the world's languages Annotated CC0 1.0 Universal
Yes TIL Link aze, bak, chv, eng, kaz, kir, rus, tuk, tur, tat, uig, uzb Large-scale parallel corpus combining most of the public datasets for 22 Turkic languages Automatically collected CC BY-NC-SA 4.0
Yes Tilde Link Various domains Automatically curated CC-BY 4.0
Yes W2C - 122 languages Corpus Automatically collected from Wikipedia and the web CC BY-SA 3.0
Yes WAT 2020 https://arxiv.org/abs/2008.04550 Asian languages Multiple domains Collection of corpora CC-BY-NC 4.0
Yes Wikipedia -
aar, abk, ace, ady, aka, als, ang, arc, arg, arz, asm, ast, atj, ava, aym, aze, bak, bam, bar, bcl, ben, bih, bis, bjn, bod, bos, bpy, bre, bug, bul, bxr, cbk, cdo, ceb, cha, che, cho, chr, chu, chv, chy, ckb, cor, cos, cre, crh, csb, din, diq, div, dsb, dty, dzo, eml, ewe, ext, fao, fij, frp, frr, ful, fur, gag, gan, glg, glk, glv, gom, gor, got, grn, guj, hak, hat, haw, hbs, hif, hmo, hsb, ibo, ido, iii, iku, ile, ilo, ina, inh, ipk, isl, jam, jbo, jpn, kaa, kab, kal, kas, kbd, kbp, kik, kin, koi, kom, kon, krc, ksh, kua, lad, lbe, lez, lfn, lij, lim, lin, lmo, lrc, ltg, ltz, lug, lzh, mah, mai, mdf, mhr, min, mlt, mri, mrj, mus, mwl, myv, mzn, nah, nan, nap, nau, nav, ndo, nds, new, nno, nov, nrm, nso, nya, oci, olo, orm, oss, pag, pam, pan, pap, pcd, pdc, pfl, pih, pli, pms, pnb, pnt, que, rmy, roh, rue, run, rup, rus, sag, sah, sat, scn, sco, sgs, sme, smo, sna, sot, srd, srn, ssw, stq, sun, szl, tah, tat, tcy, tet, tgk, tir, ton, tpi, tsn, tso, tuk, tum, twi, tyv, udm, vec, ven, vep, vls, vol, vro, war, wln, wol, wuu, xal, xmf, yor, yue, zea, zha, zul
20221001 Wikipedia CC BY-NC-SA 3.0
Yes WikiMatrix Link 85 languages Wikipedia Automatically curated CC-BY-SA
Yes Workshop on NER for South and South East Asian Languages Link ben, ori, urd Annotated Data can be freely used for non-profit research work under the Creative Commons License.
XhosaNavy Link xho, eng South African Navy parallel corpus
Yes XLSum Link aze, guj, ibo, orm, run, tir, yor BBC CC BY-NC-SA 4.0

↑ top

Training and Evaluation Code

Prerequisites

We use two settings due to package conflicts:

  • Major: Python 3.9, requirements.txt
  • Evaluation: Python 3.6, evaluation/requirements.txt

Data preparation

To train both the tokenizer and the model of Glot500-m, we need to prepare a balanced corpus covering all languages.

Go to 'preprocessing/' and run:

bash merge_files.sh

Specify --data_directory with the directory containing the data for each language and --save_directory with the directory where the merged file is saved. For Glot500, we set --scale 1 for training the tokenizer and --scale 30 for continued pretraining of the model.
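For example, a hypothetical invocation with placeholder paths (check merge_files.sh for its exact interface):

bash merge_files.sh --data_directory /path/to/per_language_data --save_directory /path/to/merged --scale 1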

Vocabulary Extension

Go to 'tokenization/' and run:

bash train.sh

Specify --input_fname with the merged data file for training the tokenizer and --save_directory with the directory for saving the final tokenizer.
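For example, again with placeholder paths (check train.sh for its exact interface):

bash train.sh --input_fname /path/to/merged/merged_scale1.txt --save_directory /path/to/glot500_tokenizer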

Continued Pretraining

Go to 'modeling/' and run:

bash train_bash.sh

Specify train_file with the merged data file for continued pretraining of the model, --tokenizer_name with the trained Hugging Face-style tokenizer, --output_dir with the directory for saving logs and checkpoints during training, and --cache_dir with the directory for the Hugging Face cache.
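A hypothetical invocation with placeholder paths (the arguments may instead be set inside train_bash.sh; check the script):

bash train_bash.sh --train_file /path/to/merged/merged_scale30.txt --tokenizer_name /path/to/glot500_tokenizer --output_dir /path/to/checkpoints --cache_dir /path/to/hf_cache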

↑ top

Evaluation

Download Datasets

For downloading datasets for NER, POS, and Sentence Retrieval Tatoeba, first go to 'evaluation/download_data' and create a download folder with mkdir -p download. You then need to manually download panx_dataset (for NER) from here (note that it will download as AmazonPhotos.zip) to the download directory. Finally, run the following command under 'evaluation/download_data' to download and process the datasets:

bash download_data.sh
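Putting the steps together (the location of the manually downloaded archive is a placeholder):

cd evaluation/download_data
mkdir -p download
mv ~/Downloads/AmazonPhotos.zip download/  # the manually downloaded panx_dataset
bash download_data.sh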

For downloading the datasets for Sentence Retrieval Bible and Round-Trip Alignment, you can contact Michael Cysouw, Philipps University of Marburg, to request access to the Parallel Bible Corpus for academic purposes.

Sequence Labeling

For NER evaluation, go to 'evaluation/tagging' and run:

bash evaluate_ner.sh

Specify DATA_DIR with the directory of the NER dataset and OUTPUT_DIR with the directory for saving the fine-tuned models and evaluation results.
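For example (assuming the script reads these as environment variables; they may instead need to be edited at the top of evaluate_ner.sh, and the POS and retrieval scripts below follow the same pattern):

DATA_DIR=/path/to/ner_data OUTPUT_DIR=/path/to/ner_output bash evaluate_ner.sh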

For POS evaluation, go to 'evaluation/tagging' and run:

bash evaluate_pos.sh

Specify DATA_DIR with the directory of the POS dataset and OUTPUT_DIR with the directory for saving the fine-tuned models and evaluation results.

Sentence Retrieval

For Sentence Retrieval Tatoeba evaluation, go to 'evaluation/retrieval' and run:

bash evaluate_retrieval_tatoeba.sh

Specify DATA_DIR with the directory of the Sentence Retrieval Tatoeba dataset and OUTPUT_DIR with the directory for saving the fine-tuned models and evaluation results.

For Sentence Retrieval Bible evaluation, go to 'evaluation/retrieval' and run:

bash evaluate_retrieval_bible.sh

Specify DATA_DIR with the directory of the Sentence Retrieval Bible dataset and OUTPUT_DIR with the directory for saving the fine-tuned models and evaluation results.

Round-Trip Alignment

For Round-Trip Alignment evaluation, go to 'evaluation/round-trip' and run:

python evaluate_roundtrip.py

↑ top

Citation

If you find our model, data, or the data overview useful for your research, please cite:

@inproceedings{imanigooghari-etal-2023-glot500,
	title        = {Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages},
	author       = {ImaniGooghari, Ayyoob  and Lin, Peiqin  and Kargaran, Amir Hossein  and Severini, Silvia  and Jalili Sabet, Masoud  and Kassner, Nora  and Ma, Chunlan  and Schmid, Helmut  and Martins, Andr{\'e}  and Yvon, Fran{\c{c}}ois  and Sch{\"u}tze, Hinrich},
	year         = 2023,
	month        = jul,
	booktitle    = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
	publisher    = {Association for Computational Linguistics},
	address      = {Toronto, Canada},
	pages        = {1082--1117},
	url          = {https://aclanthology.org/2023.acl-long.61}
}

Acknowledgements

This repository is built on top of transformers and xtreme.

glot500's People

Contributors

ayyoobimani, kargaranamir


glot500's Issues

NER

How to reproduce the NER evaluation?

glot500-large

Are there plans to train a large glot500 model?
Thanks for your work so far!

Access to Glot500-c

Hello,

I came across your work through a virtual talk by Prof. Schütze and found it to be a valuable resource. I'm particularly interested in the Glot500-c (Glot500 corpus) data.

At the moment, your README mentions that access to the corpus will be given after filling out an online form, and that the form will be available soon. Is there any tentative date for the release of the form? The multilingual corpus would greatly assist my research group in our research on language models.

Thank you for maintaining such a helpful repository, your work is greatly appreciated 😊

Dataset not deduplicated

I've re-trained a Mistral (~ Llama) language-specific tokenizer on the training portion of the Yoruba samples and noticed strange tokens. As an example:

{"Ìròyìn▁tó▁ṣe▁kókó▁Àbámọ̀▁ni▁yóò▁gbẹ̀yin▁ẹgbẹ́▁Association▁of▁Stingy▁Men▁tí▁kò▁fẹ́▁náwó▁fóbìnrin-▁Akeugbagoldwákàtí▁9▁sẹ́yìn▁Gbọ́,▁Ìṣẹ́jú▁kan▁BBC▁07:00▁UTCwákàtí▁kan▁sẹ́yìn▁Wo▁ohun▁tí▁a▁mọ̀▁nípa▁gbèdéke▁ti▁Sunday▁Igboho▁fún▁àwọn▁Fulani▁ní▁Ibarapa▁àti▁èsì▁tí▁wọ́n▁fún▁unwákàtí▁3▁sẹ́yìn▁Ìwádìí▁kíkún▁lóríi▁kókó▁ìròyìn▁Amad▁Diallo▁darapọ̀▁mọ́▁Manchester▁United8▁Sẹ́rẹ́▁2021▁Èsíò!": 29494}

which is a token that occurs 1832 times in the training split (rg $STRING yoruba_textified.txt | wc -l) in ever so slightly different contexts (i.e., near duplicates).

I hence checked for duplicates in the dataset and found abundant hard duplicates: 4.5M lines reduce to 1.16M unique ones. I understand that datasets for low-resource languages are noisy, but I presume users expect hard duplicates not to occur.

To reproduce:

from datasets import load_dataset
from collections import Counter
import numpy as np
import pandas as pd

dataset = load_dataset("cis-lmu/Glot500", "yor_Latn", split="train")
counter = Counter(dataset["text"])
c = sorted(counter.items(), key=lambda counts: counts[1])
_, counts = zip(*c)
counts = np.array(counts)
print(
    pd.DataFrame(counts)
    .round(0)
    .describe(percentiles=[0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99])
    .T
)
# original counts is 4.5M
#        count     mean       std  min  10%  25%  50%  75%  90%  95%  99%   max
#    1167327.0  3.84903  1.379304  2.0  2.0  2.0  5.0  5.0  5.0  5.0  5.0  10.0

I have briefly checked train splits for some other languages, which also to varying degree comprise duplicates.

kin_Latn: nearly correct

Original length:  415405
Deduplicated length:  401856
      count      mean       std  min  10%  25%  50%  75%  90%  95%  99%  max
0  401856.0  1.033716  0.180498  1.0  1.0  1.0  1.0  1.0  1.0  1.0  2.0  2.0

uzb_Latn: OK

Original length:  3182175
Deduplicated length:  3182175
       count  mean  std  min  10%  25%  50%  75%  90%  95%  99%  max
0  3182175.0   1.0  0.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0

ibo_Latn: bad

Original length:  5608630
Deduplicated length:  1526812
       count      mean       std  min  10%  25%  50%  75%  90%  95%  99%   max
0  1526812.0  3.673425  0.739383  2.0  2.0  4.0  4.0  4.0  4.0  4.0  4.0  20.0

wol_Latn: nearly correct

Original length:  92358
Deduplicated length:  92357
     count      mean       std  min  10%  25%  50%  75%  90%  95%  99%  max
0  92357.0  1.000011  0.003291  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  2.0
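Until the corpus is deduplicated upstream, a minimal workaround sketch (exact string matching only; single-process filtering so the stateful closure is safe):

from datasets import load_dataset

dataset = load_dataset("cis-lmu/Glot500", "yor_Latn", split="train")

# keep only the first occurrence of each line; set.add returns None,
# so the lambda is True exactly when the text has not been seen before
seen = set()
deduplicated = dataset.filter(lambda ex: not (ex["text"] in seen or seen.add(ex["text"])))
print(len(dataset), "->", len(deduplicated))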

Inconsistent columns in arrow files on Hugging Face datasets

tl;dr: some shards of the languages below (and potentially more) have the extra column "__index_level_0__", so the dataset cannot be fully loaded.

Thanks for providing a potentially super cool dataset for multilingual NLP research!

While my request for access to the full Glot500-c is still awaiting processing, I thought I would try to use what's available on Hugging Face and quickly ran into the issue already documented here https://huggingface.co/datasets/cis-lmu/Glot500/discussions/3

While I am still loading the dataset, among the train splits of the first 139 languages the arrow files of

afr_Latn
amh_Ethi
ara_Arab
en_Latn
fra_Latn
hau_Latn
mlg_Latn
nya_Latn
sna_Latn
som_Latn
sot_Latn
swa_Latn
zul_Latn

have inconsistent column names. That is, some shards have "__index_level_0__" as an extra column. The Python script below is slow, but it should eventually fix the problem.

# not mega pretty but gets the job done
from datasets import load_dataset, concatenate_datasets
from pathlib import Path
from datasets.exceptions import DatasetGenerationError

CWD = Path.cwd() # inside Glot500 folder
BACKUP = Path("../Glot500_backup")
if not BACKUP.exists():
    BACKUP.mkdir()
SPLIT = "train"
langs = [p for p in CWD.glob("*") if p.is_dir() and "_" in str(p)]


def fix(lang: str, lang_split_dir: str, paths: list[Path]):
    datasets = []
    original_dir = BACKUP / lang / SPLIT
    if not original_dir.exists():
        original_dir.mkdir(parents=True)

    for path in paths:
        new_path = original_dir.joinpath(path.name)
        path.rename(new_path)

    for path in original_dir.glob("*.arrow"):
        datasets.append(
            load_dataset("arrow", data_files={"train": str(path)}, split="train")
        )
    col = "__index_level_0__"
    datasets_ = []
    counter = 0
    for d in datasets:
        if col in d.features:
            d_ = d.remove_columns(col)
            counter += 1
        else:
            d_ = d
        datasets_.append(d_)
    print(f"Cleaned up {counter} shards for {SPLIT} of {lang}")
    dataset = concatenate_datasets(datasets_)
    dataset.save_to_disk(lang_split_dir)


datasets = {}
for i, lang in enumerate(langs):
    print(f"Processing {i}/{len(langs)}: {lang}")
    lang_train = lang / "train"
    lang_train_arrow = list(map(str, lang_train.glob("*.arrow")))
    try:
        datasets[lang] = load_dataset(
            "arrow", data_files={"train": lang_train_arrow}, split="train"
        )
    except DatasetGenerationError as e:
        print(f"Fixing {lang}")
        fix(lang.stem, str(lang_train), list(lang_train.glob("*.arrow")))
        lang_train_arrow = list(map(str, lang_train.glob("*.arrow")))
        datasets[lang] = load_dataset(
            "arrow", data_files={"train": lang_train_arrow}, split="train"
        )

Proposed fix: I suppose it would be relatively straightforward for you to run a variant of the above script and reupload the fully loadable dataset.

I would highly appreciate also getting full access to the dataset :)

Thanks a lot in advance!
