transnewguineaorg's Introduction

CLDF dataset derived from Greenhill's "TransNewGuinea.org" from 2015

How to cite

If you use these data, please cite

  • the original source

    Greenhill, Simon J. (2015): TransNewGuinea.org: An Online Database of New Guinea Languages. PLoS ONE 10.10: e0141563.

  • the derived dataset using the DOI of the particular released version you were using

Description

This dataset is licensed under a CC-BY-4.0 license

Available online at http://transnewguinea.org

Conceptlists in Concepticon:

Statistics

CLDF validation Glottolog: 99% Concepticon: 95% Source: 100% BIPA: 99% CLTS SoundClass: 99%

  • Varieties: 1,017
  • Concepts: 1,166
  • Lexemes: 146,463
  • Sources: 151
  • Synonymy: 1.16
  • Invalid lexemes: 0
  • Tokens: 705,910
  • Segments: 415 (4 BIPA errors, 4 CLTS sound class errors, 410 CLTS modified)
  • Inventory size (avg): 27.65

Contributors

Name GitHub user Description Role
Simon J. Greenhill @SimonGreenhill patron Author
Robert Forkel @xrotwang author Other
Tiago Tresoldi @tresoldi profile Other

CLDF Datasets

The following CLDF datasets are available in cldf:

transnewguineaorg's Issues

Wrong Glottocode for Proto-Central-Sogeram

As raised in lexibank/lexibank-analysed/issues/35, there is a wrong Glottocode for Proto-Central-Sogeram. The lexibank script is downloading the language data directly from http://transnewguinea.org. We can probably fix this with a short replacement table:

OLD Glottocode NEW Glottocode
cent2257

I couldn't find the right Glottocode for the doculect (and the source link doesn't seem to work on my computer). So, if there aren't any suggestions on it, we can use this table.
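A replacement table of this kind could be applied as in the following minimal sketch. The names GLOTTOCODE_FIXES and fix_glottocode are hypothetical and not part of the lexibank script; the new value is left empty, as in the table above, because the correct Glottocode is not known.

```python
# Hypothetical replacement table: keys are Glottocodes as delivered upstream,
# values are the codes to use instead. The replacement for cent2257 is
# unknown, so it is left empty here, mirroring the table in the issue.
GLOTTOCODE_FIXES = {
    "cent2257": "",
}


def fix_glottocode(code):
    """Return the replacement for a known-bad Glottocode, else the code unchanged."""
    return GLOTTOCODE_FIXES.get(code, code)
```

Codes that are not listed in the table pass through untouched, so the table only needs one row per known error.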

@SimonGreenhill, @xrotwang, or @tresoldi, could any of you kindly help me with this issue for the new release of Lexibank?

pybtex code needs to be updated or pybtex version fixed to 0.22.2

The dataset can't be built with the most recent pybtex. Either the code that builds the bibliography needs to be updated or the version of pybtex needs to be pinned to 0.22.2 in setup.py (I assume we prefer the former?).

for source in sorted(sources):
    # this is ugly, I wish pybtex made this easier!
    bib = parse_string(sources[source]["bibtex"], "bibtex")
    old_key = bib.entries.keys()[0]
    bib.entries[old_key].key = source
    bib.entries = OrderedCaseInsensitiveDict(
        [(source, bib.entries[old_key])]
    )
    args.writer.add_sources(bib)

https://docs.pybtex.org/history.html

pybtex's parse_string now gives a KeysView that can't be indexed.
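One way to update the code rather than pin the version is to take the first key by iteration instead of by index. The sketch below assumes that bib.entries iterates over its keys like a standard mapping; a plain dict stands in for it so the snippet runs without pybtex, and first_key is a hypothetical helper name.

```python
def first_key(mapping):
    """Return the first key of a mapping without indexing its keys view."""
    return next(iter(mapping))


# A plain dict stands in for bib.entries here (assumption: the real object
# iterates over its keys in insertion order, like a standard mapping).
entries = {"greenhill2015": {"title": "TransNewGuinea.org"}}
old_key = first_key(entries)
```

In the loop above, this would replace the line that indexes bib.entries.keys() directly. Wrapping the view in list(...) before indexing would work as well.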

Should it be "nd" or "n d"? Inconsistency across datasets

In joophonosemantic, the sequence "nd" is segmented as a single segment:

https://github.com/lexibank/joophonosemantic/blob/master/etc/orthography.tsv#L192

Example:

Enga-33_one-1,,Enga,33_one,m.e.nd.ɑ.i,m.e.nd.ɑ.i,m e nd ɑ i,,,,,^ m . e . nd . ɑ . i $,default 

However, the same sequence is segmented as "n d" in this dataset:

enga-wapi-one-1,164073,enga-wapi,one,mendai,mendai,m e n d a i,,davies_and_comrie1985,,,^ m e n d a i $,default

Is one of the two notations better, and is it possible to normalize them? The difference creates inconsistencies in the sound correspondence study.

See the same issue on joophonosemantic:

lexibank/joophonosemantic#13 (comment)

@LinguList noted there that the decision probably needs to be made also for mb, ng, etc.

@SimonGreenhill, what do you think?
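Until a decision is made, the two segmentations can be made comparable for a correspondence study by merging the affected pairs after the fact. This is only a sketch: the pair list and the function name merge_prenasalized are hypothetical, and whether "nd" should be one segment in a given language is exactly the open question here.

```python
# Hypothetical list of nasal + stop pairs to treat as a single segment,
# following the sequences named in this issue (nd, mb, ng).
PRENASALIZED = {("m", "b"), ("n", "d"), ("n", "g")}


def merge_prenasalized(segments):
    """Join listed nasal + stop pairs in a space-separated segment string."""
    tokens = segments.split()
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in PRENASALIZED:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return " ".join(merged)
```

Applied to the Enga example from this dataset, "m e n d a i" becomes "m e nd a i", which matches the segmentation used in joophonosemantic.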

Make CLDF creation idempotent

When running makecldf on my machine, basically all rows in all files were changed, I suspect because of differing dict or iterdir ordering. This should be fixed by sorting explicitly.
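Explicit sorting could look like the following minimal sketch. The helper names sorted_files and sorted_items are hypothetical and do not come from the dataset's build script.

```python
from pathlib import Path


def sorted_files(directory):
    """List directory entries by name rather than in filesystem order."""
    return sorted(Path(directory).iterdir(), key=lambda path: path.name)


def sorted_items(mapping):
    """Return (key, value) pairs ordered by key rather than insertion order."""
    return sorted(mapping.items())
```

Path.iterdir() yields entries in an arbitrary, platform-dependent order, so sorting by name is what makes repeated runs produce the same row order.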

Concept mapping foot/calf

  • proto-trans-new-guinea-calf-1 (*kondC) is mapped to calf, but is probably calf of leg?
  • siawi-calf-1 (*gʌǏaǏi) is mapped to calf, but is probably calf of leg?

Thanks to @AnnikaTjuka for spotting this in CLICS.

Should transnewguinea.org be a provider?

i.e. provide multiple datasets (one per source).

This would mean that

  1. concept lists can be lifted into concepticon (these will need to be checked with their original sources for e.g. item ordering, so that would be some work).
  2. orthography/segmentation etc is easier.
  3. cognates are easier (i.e. some sources have published cognates that we could include).

...but there'd be maintenance issues with an extra ~150 sources (although I guess many are not interesting?)

Wrong Language Family name

I just spotted a wrong family name for the following languages: instead of Lower Sepik-Ramu, it should be Lower Sepik, as it is in Glottolog. Since you solved issue #23, can I kindly ask you to fix this upstream for the upcoming Lexibank release, @SimonGreenhill?

ID Name Glottocode Glottolog_Name ISO639P3code Macroarea Latitude Longitude Family
  angoram Angoram ango1255 Angoram aog Papunesia -4.07758 144.028 Lower Sepik-Ramu
  angoram-kambrindo Angoram (Kambrindo Dialect) ango1255 Angoram aog Papunesia -4.07758 144.028 Lower Sepik-Ramu
  angoram-kanduanum Angoram (Kanduanum Dialect) ango1255 Angoram aog Papunesia -4.07758 144.028 Lower Sepik-Ramu
  angoram-karau Murik (Karau Dialect) muri1260 Murik (Papua New Guinea) mtf Papunesia -3.86959 144.166 Lower Sepik-Ramu
  angoram-magendo Angoram (Magendo Dialect) ango1255 Angoram aog Papunesia -4.07758 144.028 Lower Sepik-Ramu
  angoram-marbuk Angoram (Marbuk Dialect) ango1255 Angoram aog Papunesia -4.07758 144.028 Lower Sepik-Ramu
  angoram-wagamut Murik (Wagamut Dialect) muri1260 Murik (Papua New Guinea) mtf Papunesia -3.86959 144.166 Lower Sepik-Ramu
  aruamu Aruamu arua1260 Aruamu msy Papunesia -4.28957 144.842 Lower Sepik-Ramu
  awar Awar awar1249 Awar aya Papunesia -4.13554 144.836 Lower Sepik-Ramu
  banaro Banaro bana1292 Banaro byz Papunesia -4.56615 144.329 Lower Sepik-Ramu
  bosmun Bosmun bosn1248 Bosngun bqs Papunesia -4.16201 144.647 Lower Sepik-Ramu
  chambri Chambri cham1313 Chambri can Papunesia -4.27831 143.097 Lower Sepik-Ramu
  chambri-kilimbit Chambri (Kilimbit Dialect) cham1313 Chambri can Papunesia -4.27831 143.097 Lower Sepik-Ramu
  kanggape Kanggape kang1291 Kanggape igm Papunesia -4.4153 144.816 Lower Sepik-Ramu
  kayan Kayan kaia1245 Kaian kct Papunesia -4.06543 144.756 Lower Sepik-Ramu
  kire Kire kire1240 Kire geb Papunesia -4.25927 144.711 Lower Sepik-Ramu
  kopar Kopar kopa1248 Kopar xop Papunesia -3.98259 144.466 Lower Sepik-Ramu
  kopar-singarin Kopar (Singarin Dialect) kopa1248 Kopar xop Papunesia -3.98259 144.466 Lower Sepik-Ramu
  marangis Marangis wata1253 Watam wax Papunesia -4.02591 144.583 Lower Sepik-Ramu
  mbore Mbore bore1247 Borei gai Papunesia -4.08697 144.715 Lower Sepik-Ramu
  murik Murik muri1260 Murik (Papua New Guinea) mtf Papunesia -3.86959 144.166 Lower Sepik-Ramu
  proto-lower-sepik Proto-Lower-Sepik lowe1437 Lower Sepik-Ramu         Lower Sepik-Ramu
  proto-wag Proto-Watam-Awar-Gamay wagg1235 Ottilien         Lower Sepik-Ramu
  rao Rao raoo1244 Rao rao Papunesia -4.85243 144.511 Lower Sepik-Ramu
  tabriak Tabriak tabr1243 Tabriak tzx Papunesia -4.49162 143.593 Lower Sepik-Ramu
  tanggu Tanggu tang1355 Tanggu tgu Papunesia -4.4621 144.916 Lower Sepik-Ramu
  yimas Yimas yima1243 Yimas yee Papunesia -4.71731 143.572 Lower Sepik-Ra

vowels should be represented as diphthongs

We have many cases like "b o u ŋ", where "ou" should be represented as a diphthong.

A way to handle this is to assemble all vowel sounds and define all possible combinations between them. Another way is to parse the data with CLTS and check for vowel sounds in the profiles and extract them. In any case, with this dataset, it seems very reasonable to do so.
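The first approach, assembling the vowels and defining all combinations between them, could be sketched as follows. The vowel inventory is a hypothetical placeholder (the real one would have to come from the orthography profile or CLTS), and treating every vowel pair as a diphthong is an assumption that needs checking per language.

```python
from itertools import permutations

# Hypothetical vowel inventory; the real list would be taken from the data.
VOWELS = set("aeiouɛɔʌəɑ")

# All ordered pairs of two different vowels, e.g. "ou", "ai", "ɛi".
DIPHTHONGS = {first + second for first, second in permutations(VOWELS, 2)}


def merge_vowel_pairs(segments):
    """Join adjacent vowel pairs in a space-separated segment string."""
    tokens = segments.split()
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] + tokens[i + 1] in DIPHTHONGS:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return " ".join(merged)
```

The merge is greedy from left to right, so a run of three vowels is split into a pair followed by a single vowel; that choice is arbitrary and may not suit every language in the dataset.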

Many occurrences of [y] seem like they might actually be [j]

Hi,

From this dataset, we get many cases of [y] in intervocalic contexts in the sound correspondence study. Erich and I suspect that some of these [y] (if not all) might actually be [j]. Can someone who worked on the orthographic profile have a look at it and make the necessary changes?

We find 2577 rows with intervocalic [y]:

yaqay-one-1,23022,yaqay,one,kayaqamaere,kayaqamaere,k a y a q a m a e r e,,voorhoeve-1975,,,^ k a y a q a m a e r e $,default
yawiyo-wosawari-smoke-1,196768,yawiyo-wosawari,smoke,tiyam,tiyam,t i y a m,,conrad-and-dye-1975,,,^ t i y a m $,default
siawi-snake-1,199527,siawi,snake,wiyɛmi,wiyɛmi,w i y ɛ m i,,conrad-and-dye-1975,,,^ w i y ɛ m i $,default

1016 rows with Vy in final position:

kalam-stick-3,195818,kalam,stick,mon-day,mon-day,m o n + d a y,,pawley-2013,,,^ m o n - d a y $,default
kalam-to-swallow-1,195826,kalam,to-swallow,kalay-,kalay-,k a l a y,,pawley-2013,,,^ k a l a y -$,default
proto-awyu-dumut-fish-2,229008,proto-awyu-dumut,fish,*rɔxay,*rɔxay,r ɔ x a y,,healey-1970,,,^* r ɔ x a y $,default

and 4208 rows with yV in initial position:

proto-dumut-their-1,229572,proto-dumut,their,*yagi,*yagi,y a g i,,healey-1970,,,^* y a g i $,default
yawiyo-wosawari-no-not-1,196753,yawiyo-wosawari,no-not,yasʌ safiye,yasʌ_safiye,y a s ʌ + s a f i y e,,conrad-and-dye-1975,,,^ y a s ʌ _ s a f i y e $,default
worin-star-1,157000,worin,star,yɔmbɔŋ gire,yɔmbɔŋ_gire,y ɔ m b ɔ ŋ + g i r e,,mcelhanon_and_voorhoeve1970,,,^ y ɔ m b ɔ ŋ _ g i r e $,default

There are even more, as I ignored markers such as "+" in the search. Moreover, I cannot find this grapheme at all in interconsonantal position in the forms. By comparison, we find only 300 occurrences of [j] in the entire dataset:

kuot-good-1,165850,kuot,good,mur,mur,m u r,adj,lindstrom2008,,,^ m u r $,default
kuot-wet-2,165893,kuot,wet,sərap,sərap,s ə r a p,adj,lindstrom2008,,,^ s ə r a p $,default
klon-halerman-seven-1,35152,klon-halerman,seven,us'uuj,us'uuj,u s uː j,,stokhof-1975,,,^ u s ' uu j $,default
kobon-empty-2,196055,kobon,empty,ij,ij,i j,+,pawley-2013,,,^ i j $,default
wambon-dog-4,214216,wambon,dog,ʔɑɴɢɑj,ʔɑɴɢɑj,ʔ ɑ ɴ ɢ ɑ j,,hughes-2009,,,^ ʔ ɑ ɴ ɢ ɑ j $,default

Someone who knows the languages might also want to check whether "j" in orthographic form always maps to [j] (and not, for example, [ʒ]).
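The three searches above (intervocalic, final, initial) can be approximated on the Segments column as in the sketch below. It is not the script used for the counts: the vowel list and the function name y_contexts are hypothetical, and unlike the original search it splits on "+" so that each word of a multi-word form is classified separately.

```python
# Hypothetical vowel inventory used to decide what counts as "V".
VOWELS = set("aeiouɛɔʌəɑ")


def y_contexts(segments):
    """Classify every "y" in a segment string by its position in its word."""
    contexts = []
    for word in segments.split("+"):
        tokens = word.split()
        for i, token in enumerate(tokens):
            if token != "y":
                continue
            vowel_before = i > 0 and tokens[i - 1] in VOWELS
            vowel_after = i + 1 < len(tokens) and tokens[i + 1] in VOWELS
            if vowel_before and vowel_after:
                contexts.append("intervocalic")
            elif i == 0 and vowel_after:
                contexts.append("initial")
            elif i == len(tokens) - 1 and vowel_before:
                contexts.append("final")
            else:
                contexts.append("other")
    return contexts
```

For the yawiyo-wosawari form "y a s ʌ + s a f i y e" this reports one initial and one intervocalic "y".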

The problematic rows in the orthography profile are around here:
https://github.com/lexibank/transnewguineaorg/blob/master/etc/orthography.tsv#L569

tagging @LinguList @SimonGreenhill

Add a bracket cleaner and a splitter

I just ran a test on orthography profiles, using lingpy profile -i cldf-metadata.json -o orthography.tsv --clts --cldf --column=form --context. This reveals a rather long list of problems that would not occur if the forms were created with a consistent bracket remover and a check for splitters such as comma, semicolon, etc. (although these may already be handled).

Profile is here:

orthography.txt

A further problem: some 5 to 10 strings (visible if you sort the forms in Excel) appear to be empty but are not captured, probably because "—" is not recognized as a symbol for empty strings.

If these are captured, I think a preliminary orthography profile could be possible.
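A bracket cleaner, a splitter, and the empty-string check could be combined roughly as in the sketch below. The separator set, the bracket types, and the name clean_forms are assumptions made for illustration; they are not the dataset's actual form handling.

```python
import re

# Hypothetical choices: comma and semicolon as separators, round and square
# brackets as comment markers, and the dash named in this issue (U+2014)
# as a marker for missing data.
SEPARATORS = re.compile(r"[,;]")
BRACKETS = re.compile(r"\([^)]*\)|\[[^\]]*\]")
EMPTY_MARKERS = {"", "\u2014"}


def clean_forms(value):
    """Strip bracketed material, split on separators, and drop empty forms."""
    without_brackets = BRACKETS.sub("", value)
    parts = (part.strip() for part in SEPARATORS.split(without_brackets))
    return [part for part in parts if part not in EMPTY_MARKERS]
```

Brackets are removed before splitting so that a comma inside a bracketed comment does not produce a spurious extra form.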

Ready for release?

Hey Simon,

Data update done in: #21

Is there anything else you'd like to see getting done for the data set? I can do some more small refinements (scaffolding etc.) and some housekeeping but as far as I can tell it should be ready for release.
