lexibank / transnewguineaorg Goto Github PK


CLDF dataset derived from Greenhill's "TransNewGuinea.org" from 2015

License: Creative Commons Attribution 4.0 International

Python 17.50% TeX 82.50%

clics3 lexibank1

transnewguineaorg's Introduction

CLDF dataset derived from Greenhill's "TransNewGuinea.org" from 2015

How to cite

If you use these data please cite

the original source

Greenhill, Simon J. (2015): TransNewGuinea.org: An Online Database of New Guinea Languages. PLoS ONE 10.10: e0141563.
the derived dataset using the DOI of the particular released version you were using

Description

This dataset is licensed under a CC-BY-4.0 license

Available online at http://transnewguinea.org

Conceptlists in Concepticon:

Greenhill-2015-2525

Statistics

Varieties: 1,017
Concepts: 1,166
Lexemes: 146,463
Sources: 151
Synonymy: 1.16
Invalid lexemes: 0
Tokens: 705,910
Segments: 415 (4 BIPA errors, 4 CLTS sound class errors, 410 CLTS modified)
Inventory size (avg): 27.65

Contributors

Name	GitHub user	Description	Role
Simon J. Greenhill	@SimonGreenhill	patron	Author
Robert Forkel	@xrotwang	author	Other
Tiago Tresoldi	@tresoldi	profile	Other

CLDF Datasets

The following CLDF datasets are available in cldf:

CLDF Wordlist at cldf/cldf-metadata.json


transnewguineaorg's Issues

forms need to be thoroughly checked again

@tresoldi, if you check the forms thoroughly, you will see that many of them contain "-verb", "give", etc. These are currently treated as if they were sound sequences. This needs very careful checking.

Wrong Glottocode for Proto-Central-Sogeram

As raised in lexibank/lexibank-analysed/issues/35, there is a wrong Glottocode for Proto-Central-Sogeram. The lexibank script is downloading the language data directly from http://transnewguinea.org. We can probably fix this with a short replacement table:

OLD Glottocode	NEW Glottocode
cent2257

I couldn't find the right Glottocode for the doculect (and the source link doesn't seem to work on my computer). So, if there aren't any suggestions on it, we can use this table.
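A replacement table like the one above could be applied with a simple lookup. This is only a sketch: the names `GLOTTOCODE_REPLACEMENTS` and `fix_glottocode` are hypothetical, and the empty replacement value mirrors the empty NEW column in the table (no correct Glottocode is known yet).

```python
# Hypothetical replacement table: wrong Glottocode -> replacement.
# The replacement is left empty because the issue gives no NEW value.
GLOTTOCODE_REPLACEMENTS = {"cent2257": ""}


def fix_glottocode(code, replacements=GLOTTOCODE_REPLACEMENTS):
    """Return the replacement for a known-wrong Glottocode, else the code unchanged."""
    return replacements.get(code, code)


fixed = fix_glottocode("cent2257")    # replaced by the (empty) table value
kept = fix_glottocode("ango1255")     # not in the table, returned unchanged
```

Keeping the table as data rather than as scattered `if` statements makes it easy to add the correct code once someone identifies it.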

@SimonGreenhill, @xrotwang, or @tresoldi, could one of you kindly help me with this issue for the new release of Lexibank?

pybtex code needs to be updated or pybtex version fixed to 0.22.2

The dataset can't be built with the most recent pybtex. Either the code that builds the bibliography needs to be updated or the version of pybtex needs to be pinned to 0.22.2 in setup.py (I assume we prefer the former?).

transnewguineaorg/lexibank_transnewguineaorg.py

Lines 67 to 75 in 24fe745

    
    for source in sorted(sources):
        # this is ugly, I wish pybtex made this easier!
        bib = parse_string(sources[source]["bibtex"], "bibtex")
        old_key = bib.entries.keys()[0]
        bib.entries[old_key].key = source
        bib.entries = OrderedCaseInsensitiveDict(
            [(source, bib.entries[old_key])]
        )
        args.writer.add_sources(bib)

https://docs.pybtex.org/history.html

pybtex's parse_string now gives entries whose keys() is a KeysView that can't be indexed.
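One way to update the code rather than pin pybtex is to take the first key by iteration instead of by index. The sketch below uses a plain dict as a stand-in for `bib.entries`, so it runs without pybtex; the helper name `rekey_single_entry` and the example keys are hypothetical.

```python
from collections import OrderedDict


def rekey_single_entry(entries, new_key):
    """Return a one-item mapping holding the first value of `entries` under `new_key`."""
    # next(iter(...)) works on any mapping or keys view,
    # whereas entries.keys()[0] raises TypeError on a view object.
    old_key = next(iter(entries))
    return OrderedDict([(new_key, entries[old_key])])


# Stand-in for a parsed bibliography with exactly one entry.
entries = {"Greenhill2015": {"title": "TransNewGuinea.org"}}
rekeyed = rekey_single_entry(entries, "greenhill-2015")
```

In the dataset script the same idea would replace only the `old_key = ...` line; the rest of the loop could stay as it is.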

Should it be "nd" or "n d"? Inconsistency across datasets

In joophonosemantic, the sequence "nd" is segmented as a single segment:

https://github.com/lexibank/joophonosemantic/blob/master/etc/orthography.tsv#L192

Example:

Enga-33_one-1,,Enga,33_one,m.e.nd.ɑ.i,m.e.nd.ɑ.i,m e nd ɑ i,,,,,^ m . e . nd . ɑ . i $,default

However, the same sequence is segmented as "n d" in this dataset:

enga-wapi-one-1,164073,enga-wapi,one,mendai,mendai,m e n d a i,,davies_and_comrie1985,,,^ m e n d a i $,default

Is one of the two notations better, and is it possible to normalize them? The difference creates inconsistencies in the sound correspondence study.

See the same issue on joophonosemantic:

lexibank/joophonosemantic#13 (comment)

@LinguList noted there that the decision probably needs to be made also for mb, ng, etc.

@SimonGreenhill, what do you think?
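Whichever notation is chosen, the two datasets could be brought in line with a small post-processing step. The sketch below merges "n d" into "nd"; the function name `merge_clusters` and the cluster list are assumptions, and the opposite direction (splitting "nd" into "n d") would be just as easy to write.

```python
def merge_clusters(segments, clusters=("nd", "mb", "ng")):
    """Join adjacent segments that together form one of `clusters`.

    `segments` is a space-separated segment string such as "m e n d a i".
    """
    segs = segments.split()
    merged = []
    i = 0
    while i < len(segs):
        pair = "".join(segs[i:i + 2])
        if i + 1 < len(segs) and pair in clusters:
            merged.append(pair)   # treat the two segments as one
            i += 2
        else:
            merged.append(segs[i])
            i += 1
    return " ".join(merged)


normalized = merge_clusters("m e n d a i")
```

Note that this merges every listed sequence blindly; a language where "n" and "d" meet across a morpheme boundary would need a more careful rule.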

Make CLDF creation idempotent

When running makecldf on my machine, basically all rows in all files were changed - I suspect due to different ordering of dict or iterdir. This should be fixed by explicitly sorting.
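A minimal sketch of what explicit sorting could look like. The helper names `sorted_files` and `sorted_rows` are hypothetical and not part of the dataset script.

```python
import tempfile
from pathlib import Path


def sorted_files(directory):
    """List file names in a stable order; Path.iterdir() alone gives no order guarantee."""
    return sorted(p.name for p in Path(directory).iterdir() if p.is_file())


def sorted_rows(rows_by_id):
    """Return the values of a mapping ordered by key, independent of insertion order."""
    return [rows_by_id[key] for key in sorted(rows_by_id)]


# Small demonstration with files created in arbitrary order.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.csv", "a.csv", "c.csv"):
        (Path(tmp) / name).write_text("")
    names = sorted_files(tmp)
```

With every directory listing and every dict iteration sorted like this, two runs on different machines should write rows in the same order.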

error with concept: affix-body-part

@LinguList Is that a problem that needs fixing?

$ cldfbench lexibank.makecldf lexibank_*.py
[…]
INFO    error with concept: affix-body-part

INFO    found 1 errors in concepts
[…]

re-run due to dot in title

Concept mapping foot/calf

proto-trans-new-guinea-calf-1 (*kondC) is mapped to calf but probably is calf of leg?
siawi-calf-1 (*gʌǏaǏi) is mapped to calf but probably is calf of leg?

Thanks to @AnnikaTjuka for spotting this in CLICS.

Should transnewguinea.org be a provider

i.e. provide multiple datasets (one per source).

This would mean that

concept lists can be lifted into concepticon (these will need to be checked with their original sources for e.g. item ordering, so that would be some work).
orthography/segmentation etc is easier.
cognates are easier (i.e. some sources have published cognates that we could include).

...but there'd be maintenance issues with an extra ~150 sources (although I guess many are not interesting?)

check concepts by the CALC group

Add orthography profile

Wrong Language Family name

I just spotted a wrong family name for the following languages: instead of Lower Sepik-Ramu, it should be Lower Sepik, as in Glottolog. Since you solved issue #23, can I kindly ask you to fix this upstream for the upcoming Lexibank release, @SimonGreenhill?

ID	Name	Glottocode	Glottolog_Name	ISO639P3code	Macroarea	Latitude	Longitude	Family
angoram	Angoram	ango1255	Angoram	aog	Papunesia	-4.07758	144.028	Lower Sepik-Ramu
angoram-kambrindo	Angoram (Kambrindo Dialect)	ango1255	Angoram	aog	Papunesia	-4.07758	144.028	Lower Sepik-Ramu
angoram-kanduanum	Angoram (Kanduanum Dialect)	ango1255	Angoram	aog	Papunesia	-4.07758	144.028	Lower Sepik-Ramu
angoram-karau	Murik (Karau Dialect)	muri1260	Murik (Papua New Guinea)	mtf	Papunesia	-3.86959	144.166	Lower Sepik-Ramu
angoram-magendo	Angoram (Magendo Dialect)	ango1255	Angoram	aog	Papunesia	-4.07758	144.028	Lower Sepik-Ramu
angoram-marbuk	Angoram (Marbuk Dialect)	ango1255	Angoram	aog	Papunesia	-4.07758	144.028	Lower Sepik-Ramu
angoram-wagamut	Murik (Wagamut Dialect)	muri1260	Murik (Papua New Guinea)	mtf	Papunesia	-3.86959	144.166	Lower Sepik-Ramu
aruamu	Aruamu	arua1260	Aruamu	msy	Papunesia	-4.28957	144.842	Lower Sepik-Ramu
awar	Awar	awar1249	Awar	aya	Papunesia	-4.13554	144.836	Lower Sepik-Ramu
banaro	Banaro	bana1292	Banaro	byz	Papunesia	-4.56615	144.329	Lower Sepik-Ramu
bosmun	Bosmun	bosn1248	Bosngun	bqs	Papunesia	-4.16201	144.647	Lower Sepik-Ramu
chambri	Chambri	cham1313	Chambri	can	Papunesia	-4.27831	143.097	Lower Sepik-Ramu
chambri-kilimbit	Chambri (Kilimbit Dialect)	cham1313	Chambri	can	Papunesia	-4.27831	143.097	Lower Sepik-Ramu
kanggape	Kanggape	kang1291	Kanggape	igm	Papunesia	-4.4153	144.816	Lower Sepik-Ramu
kayan	Kayan	kaia1245	Kaian	kct	Papunesia	-4.06543	144.756	Lower Sepik-Ramu
kire	Kire	kire1240	Kire	geb	Papunesia	-4.25927	144.711	Lower Sepik-Ramu
kopar	Kopar	kopa1248	Kopar	xop	Papunesia	-3.98259	144.466	Lower Sepik-Ramu
kopar-singarin	Kopar (Singarin Dialect)	kopa1248	Kopar	xop	Papunesia	-3.98259	144.466	Lower Sepik-Ramu
marangis	Marangis	wata1253	Watam	wax	Papunesia	-4.02591	144.583	Lower Sepik-Ramu
mbore	Mbore	bore1247	Borei	gai	Papunesia	-4.08697	144.715	Lower Sepik-Ramu
murik	Murik	muri1260	Murik (Papua New Guinea)	mtf	Papunesia	-3.86959	144.166	Lower Sepik-Ramu
proto-lower-sepik	Proto-Lower-Sepik	lowe1437	Lower Sepik-Ramu					Lower Sepik-Ramu
proto-wag	Proto-Watam-Awar-Gamay	wagg1235	Ottilien					Lower Sepik-Ramu
rao	Rao	raoo1244	Rao	rao	Papunesia	-4.85243	144.511	Lower Sepik-Ramu
tabriak	Tabriak	tabr1243	Tabriak	tzx	Papunesia	-4.49162	143.593	Lower Sepik-Ramu
tanggu	Tanggu	tang1355	Tanggu	tgu	Papunesia	-4.4621	144.916	Lower Sepik-Ramu
yimas	Yimas	yima1243	Yimas	yee	Papunesia	-4.71731	143.572	Lower Sepik-Ra

vowels should be represented as diphthongs

We have many cases like "b o u ŋ", where "ou" should be represented as a diphthong.

A way to handle this is to assemble all vowel sounds and define all possible combinations between them. Another way is to parse the data with CLTS and check for vowel sounds in the profiles and extract them. In any case, with this dataset, it seems very reasonable to do so.
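The first approach (assemble the vowels and define all combinations) could be sketched as follows. The vowel inventory passed in is an assumption for illustration, and the function names are hypothetical.

```python
from itertools import product


def diphthong_candidates(vowels):
    """All ordered pairs of two different vowels, e.g. "ou", "ai"."""
    return {a + b for a, b in product(vowels, repeat=2) if a != b}


def merge_diphthongs(segments, vowels):
    """Join adjacent vowel segments into a single diphthong segment."""
    candidates = diphthong_candidates(vowels)
    segs = segments.split()
    merged = []
    i = 0
    while i < len(segs):
        pair = "".join(segs[i:i + 2])
        if i + 1 < len(segs) and pair in candidates:
            merged.append(pair)
            i += 2
        else:
            merged.append(segs[i])
            i += 1
    return " ".join(merged)


result = merge_diphthongs("b o u ŋ", vowels="aeiou")
```

The merge is greedy and pairwise, so a sequence of three vowels becomes one diphthong plus a single vowel. Whether a given vowel sequence really is a diphthong in a particular language still has to be decided by someone who knows the data.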

Many occurrences of [y] seem like they might actually be [j]

Hi,

From this dataset, we get many cases of [y] in intervocalic contexts in the sound correspondence study. Erich and I suspect that some of these [y] (if not all) might actually be [j]. Can someone who worked on the orthography profile have a look at it and make the necessary changes?

We find 2577 rows with intervocalic [y]:

yaqay-one-1,23022,yaqay,one,kayaqamaere,kayaqamaere,k a y a q a m a e r e,,voorhoeve-1975,,,^ k a y a q a m a e r e $,default
yawiyo-wosawari-smoke-1,196768,yawiyo-wosawari,smoke,tiyam,tiyam,t i y a m,,conrad-and-dye-1975,,,^ t i y a m $,default
siawi-snake-1,199527,siawi,snake,wiyɛmi,wiyɛmi,w i y ɛ m i,,conrad-and-dye-1975,,,^ w i y ɛ m i $,default

1016 rows with Vy in final:

kalam-stick-3,195818,kalam,stick,mon-day,mon-day,m o n + d a y,,pawley-2013,,,^ m o n - d a y $,default
kalam-to-swallow-1,195826,kalam,to-swallow,kalay-,kalay-,k a l a y,,pawley-2013,,,^ k a l a y -$,default
proto-awyu-dumut-fish-2,229008,proto-awyu-dumut,fish,*rɔxay,*rɔxay,r ɔ x a y,,healey-1970,,,^* r ɔ x a y $,default

and 4208 rows with yV in initial:

proto-dumut-their-1,229572,proto-dumut,their,*yagi,*yagi,y a g i,,healey-1970,,,^* y a g i $,default
yawiyo-wosawari-no-not-1,196753,yawiyo-wosawari,no-not,yasʌ safiye,yasʌ_safiye,y a s ʌ + s a f i y e,,conrad-and-dye-1975,,,^ y a s ʌ _ s a f i y e $,default
worin-star-1,157000,worin,star,yɔmbɔŋ gire,yɔmbɔŋ_gire,y ɔ m b ɔ ŋ + g i r e,,mcelhanon_and_voorhoeve1970,,,^ y ɔ m b ɔ ŋ _ g i r e $,default

There are even more, as I ignored markers such as "+" in the search. Moreover, I cannot find this grapheme in interconsonantal position in any of the forms. By comparison, we find only 300 occurrences of [j] in the entire dataset:

kuot-good-1,165850,kuot,good,mur,mur,m u r,adj,lindstrom2008,,,^ m u r $,default
kuot-wet-2,165893,kuot,wet,sərap,sərap,s ə r a p,adj,lindstrom2008,,,^ s ə r a p $,default
klon-halerman-seven-1,35152,klon-halerman,seven,us'uuj,us'uuj,u s uː j,,stokhof-1975,,,^ u s ' uu j $,default
kobon-empty-2,196055,kobon,empty,ij,ij,i j,+,pawley-2013,,,^ i j $,default
wambon-dog-4,214216,wambon,dog,ʔɑɴɢɑj,ʔɑɴɢɑj,ʔ ɑ ɴ ɢ ɑ j,,hughes-2009,,,^ ʔ ɑ ɴ ɢ ɑ j $,default

Someone who knows the languages might also want to check whether "j" in orthographic form always maps to [j] (and not, for example, [ʒ]).
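Counts like the ones above could be reproduced by classifying each [y] by its neighbours in the segmented form. This is a rough sketch, not the search that produced the numbers in this issue: the vowel set is an assumption, and "+" is simply treated as a non-vowel.

```python
# Assumed vowel inventory, for illustration only.
VOWELS = set("aeiouɛɔʌəɑ")


def y_contexts(segments, vowels=VOWELS):
    """Return the contexts ("initial", "intervocalic", "final") in which "y" occurs."""
    segs = segments.split()
    found = set()
    for i, seg in enumerate(segs):
        if seg != "y":
            continue
        prev_is_vowel = i > 0 and segs[i - 1] in vowels
        next_is_vowel = i + 1 < len(segs) and segs[i + 1] in vowels
        if prev_is_vowel and next_is_vowel:
            found.add("intervocalic")
        elif i == 0 and next_is_vowel:
            found.add("initial")
        elif i == len(segs) - 1 and prev_is_vowel:
            found.add("final")
    return found


contexts = y_contexts("y a s ʌ + s a f i y e")
```

Running this over the Segments column and tallying the returned sets would give per-context row counts that can be compared against the figures reported here.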

The problematic rows in the orthography profile are around here:
https://github.com/lexibank/transnewguineaorg/blob/master/etc/orthography.tsv#L569

tagging @LinguList @SimonGreenhill

Add a bracket cleaner and a splitter

I just ran a test on orthography profiles, using lingpy profile -i cldf-metadata.json -o orthography.tsv --clts --cldf --column=form --context. This reveals a rather long list of problems, which would not occur if the form was created by using a consistent bracket remover, as well as a check for splitters, like comma, semi-colon, etc. (although they may be handled).

Profile is here:

orthography.txt

A further problem: some 5 to 10 strings (visible if you sort the forms in Excel) appear to be empty but are not captured as such, probably because "—" is not recognized as a symbol for empty strings.

If these are captured, I think a preliminary orthography profile could be possible.
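A bracket cleaner, a splitter, and an empty-string check could look roughly like this. The sketch uses only the standard library; the function name `clean_form`, the separator set, and the list of missing-data markers are assumptions, not the dataset's actual settings.

```python
import re

BRACKETED = re.compile(r"\s*[\(\[][^\)\]]*[\)\]]")  # "(...)" or "[...]" with leading space
SEPARATORS = re.compile(r"[,;]")                     # assumed splitters
MISSING = {"", "—", "-"}                             # assumed markers for empty entries


def clean_form(raw):
    """Strip bracketed comments, split on separators, and drop empty entries."""
    without_brackets = BRACKETED.sub("", raw)
    parts = (part.strip() for part in SEPARATORS.split(without_brackets))
    return [part for part in parts if part not in MISSING]


forms = clean_form("mendai (n.), tiyam")
```

Only entries that consist of nothing but a missing-data marker are dropped, so a form such as "kalay-" with a trailing hyphen is kept.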

place concepts in etc/concepts.csv

Concepticon links are currently in json, they will be easier to check if added to etc/concepts.csv

Ready for release?

Hey Simon,

Data update done in: #21

Is there anything else you'd like to see getting done for the data set? I can do some more small refinements (scaffolding etc.) and some housekeeping but as far as I can tell it should be ready for release.

