lexibank / northeuralex


CLDF dataset derived from Dellert et al.'s "NorthEuraLex" from 2020

License: Creative Commons Attribution 4.0 International

TeX 96.80% Python 3.20%

clics3 lexibank1

northeuralex's Introduction

CLDF dataset derived from Dellert et al.'s "NorthEuraLex (Version 0.9)" from 2020

How to cite

If you use these data, please cite

the original source:

Dellert, J., Daneyko, T., Münch, A. et al. (2020). NorthEuraLex (Version 0.9). Language Resources and Evaluation. https://doi.org/10.1007/s10579-019-09480-6

the derived dataset, using the DOI of the particular released version you were using.

Description

This dataset is licensed under a CC-BY-4.0 license

Available online at http://www.northeuralex.org

Conceptlists in Concepticon:

Dellert-2017-1016

Notes

This large database covers 107 language varieties of Northern Eurasia. For the conversion to CLDF, we considerably adjusted the IPA transcriptions in the source.

Statistics

Varieties: 107 (linked to 107 different Glottocodes)
Concepts: 1,016 (linked to 954 different Concepticon concept sets)
Lexemes: 121,611
Sources: 1
Synonymy: 1.15
Invalid lexemes: 0
Tokens: 699,892
Segments: 678 (0 BIPA errors, 0 CLTS sound class errors, 676 CLTS modified)
Inventory size (avg): 52.43

Contributors

Name	GitHub user	Description	Role
Tiago Tresoldi	@tresoldi	patron	Other
Julius Steuer	@justeuer	orthographic profile	Other
Johann-Mattis List	@LinguList	code, integration	Editor
Robert Forkel	@xrotwang	code, integration	Editor
Johannes Dellert		editor	DataCurator, DataManager, Author
Pavel Sofroniev	@pavelsof	original team CLDF curation	DataCurator, DataManager

CLDF Datasets

The following CLDF datasets are available in cldf:

CLDF Wordlist at cldf/cldf-metadata.json
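
As a rough illustration of what the metadata file describes, the sketch below lists the tables declared in a CLDF metadata file using only the Python standard library. It assumes the usual CSVW layout of CLDF metadata (a top-level "tables" list whose entries carry "url" and, for standard components, "dc:conformsTo"). The helper names list_tables and load_metadata and the small inline example are hypothetical; the example is not this dataset's actual metadata.

```python
import json
from pathlib import Path


def list_tables(metadata):
    """Return (component, url) pairs for each table in parsed CLDF metadata.

    Tables without a "dc:conformsTo" entry get the component name None.
    """
    result = []
    for table in metadata.get("tables", []):
        conforms_to = table.get("dc:conformsTo")
        # Component URIs end in e.g. "#FormTable"; keep only the fragment.
        component = conforms_to.rsplit("#", 1)[-1] if conforms_to else None
        result.append((component, table.get("url")))
    return result


def load_metadata(path):
    """Parse a CLDF metadata file such as cldf/cldf-metadata.json."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Hypothetical minimal metadata, for illustration only.
example = {
    "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#Wordlist",
    "tables": [
        {"url": "forms.csv",
         "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#FormTable"},
        {"url": "languages.csv",
         "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#LanguageTable"},
    ],
}
for component, url in list_tables(example):
    print(component, url)
```

With a local checkout, list_tables(load_metadata("cldf/cldf-metadata.json")) would be the corresponding call. For real work the pycldf package is the more robust route, since it also validates the data against the metadata.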


northeuralex's Issues

Diphthongs transcribed as sequences of simple vowels for Manchu (and others)

The problem seems to be present in many languages. For some (such as Japanese) I would not expect to find diphthongs, but for Manchu and English this is surely inadequate. The following snippet reports, for each language, whether any of its forms contains a diphthong:

from cltoolkit import Wordlist
from lexibank_northeuralex import Dataset as NorthEuraLex
from pycldf import Dataset
from pyclts import CLTS

# BIPA transcription system, used to classify each segment by sound type.
clts = CLTS()
bipa = clts.bipa

# Load the NorthEuraLex CLDF wordlist into cltoolkit.
word_list = Wordlist([Dataset.from_metadata(NorthEuraLex().cldf_dir.joinpath('cldf-metadata.json'))])

for i, language in enumerate(word_list.languages):
    # Collect the sound types of all segments in this language's forms.
    sound_types = set()
    for form in language.forms:
        for segment in form.data["Segments"]:
            sound_types.add(bipa[segment].type)
    # Print index, language code, whether any diphthong occurs, and the name.
    print("{} {}\t{}\t{}".format(i+1, language.id.split('-')[1], 'diphthong' in sound_types, language.name))

Partial output:

52 eng  False   English
75 mnc  False   Manchu
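
To see where such vowel sequences actually occur, one could scan each segmented form for directly adjacent vowels. The sketch below does this with the standard library only. The vowel set VOWELS is a small hypothetical stand-in for a proper CLTS lookup, the function names are invented for illustration, and the segment strings are copied from the "Invalid segments" table further down this page.

```python
# Hypothetical, deliberately small vowel set; a real check would ask CLTS.
VOWELS = set("aeiouyɐɛɪɔʊəæɑ")


def is_vowel(segment):
    """Treat a segment as a vowel if its base (first) character is one."""
    return bool(segment) and segment[0] in VOWELS


def adjacent_vowels(segments):
    """Return all pairs of directly adjacent vowel segments."""
    return [
        (left, right)
        for left, right in zip(segments, segments[1:])
        if is_vowel(left) and is_vowel(right)
    ]


# Segment strings copied from the "Invalid segments" table on this page.
examples = {
    "khk-EssenN-1": "^ j i t e",
    "lit-hineingehenV-1": "^ j iː ɛ j ɪˑ tʲ ɪ",
    "lit-einwickelnV-1": "^ j iː ʋʲ iː nʲ ɪ oː tʲ ɪ",
}
for form_id, segments in examples.items():
    print(form_id, adjacent_vowels(segments.split()))
```

An adjacent pair found this way is only a candidate: it may be a genuine diphthong split in two, or a hiatus between two syllables, so each language still needs a manual decision in the orthography profile.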

change citation to the most recent paper by Dellert et al.

https://doi.org/10.1007/s10579-019-09480-6

Identify sources

@LinguList says: "sources: since to my knowledge there's no link from source to language, we're left alone and cannot tell which one belongs to which source. We could ask Johannes on some of these points."

first release for CLICS

fix the language names

They appear as ISO codes only, which looks ugly.

Invalid segments

@tresoldi, would you mind having a look at those?

ID	LANGUAGE	CONCEPT	FORM	SEGMENTS
khk-EssenN-1	khk	EssenN	ʲite	^ j i t e
khk-GelachterN-1	khk	GelachterN	ʲinetem	^ j i n e t e m
khk-GriffN-1	khk	GriffN	ʲiʃ	^ j i ʃ
khk-KanteN-1	khk	KanteN	ʲirmeɡ	^ j i r m e g
khk-SpeiseN-1	khk	SpeiseN	ʲite	^ j i t e
khk-StammN-1	khk	StammN	ʲiʃ	^ j i ʃ
khk-essenV-1	khk	essenV	ʲitex	^ j i t e x
khk-geratenV-1	khk	geratenV	ʲiɮrex	^ j i ɮ r e x
khk-glaubenV-1	khk	glaubenV	ʲitʰɡex	^ j i tʰ g e x
khk-groA-1	khk	groA	ʲix	^ j i x
khk-kommenV-1	khk	kommenV	ʲirex	^ j i r e x
khk-lachenV-1	khk	lachenV	ʲinex	^ j i n e x
khk-pfeifenV-1	khk	pfeifenV	ʲisɡerex	^ j i s g e r e x
khk-scharfA-2	khk	scharfA	ʲirtʰej	^ j i r tʰ e j
khk-schickenV-1	khk	schickenV	ʲiɮɡex	^ j i ɮ g e x
khk-soADV-2	khk	soADV	ʲinɡet	^ j i n g e t
khk-zunehmenV-1	khk	zunehmenV	ʲixsex	^ j i x s e x
lit-BuchtN-1	lit	BuchtN	ʲiːɫɐnkɐ	^ j iː lˠ ɐ n k ɐ
lit-eintretenV-1	lit	eintretenV	ʲiːʒɛnˑɡtʲɪ	^ j iː ʒ ɛ nˑ g tʲ ɪ
lit-einwickelnV-1	lit	einwickelnV	ʲiːʋʲiːnʲɪoːtʲɪ	^ j iː ʋʲ iː nʲ ɪ oː tʲ ɪ
lit-hinaufADV-2	lit	hinaufADV	ʲiː ʋʲɪrˑʃuː	^ j iː + ʋʲ ɪ rˑ ʃ uː
lit-hineingehenV-1	lit	hineingehenV	ʲiːɛʲɪˑtʲɪ	^ j iː ɛ j ɪˑ tʲ ɪ
lit-steckenV-2	lit	steckenV	ʲiːkʲɪʃtʲɪ	^ j iː kʲ ɪ ʃ tʲ ɪ
lit-verschiedenA-1	lit	verschiedenA	ʲiːʋɐʲɪˑrʊs	^ j iː ʋ ɐ j ɪˑ r ʊ s
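
Every form in the table above begins with the modifier letter ʲ, which has no preceding base sound to attach to. A check along the following lines could flag such forms before segmentation. It is a minimal sketch using only the standard library: the function name is hypothetical, and treating Unicode categories "Lm" and "Mn" as "cannot start a form" is an assumption made here, not a rule taken from the dataset's code.

```python
import unicodedata


def starts_with_modifier(form):
    """True if the first character is a modifier letter or combining mark."""
    if not form:
        return False
    # "Lm" = modifier letter (e.g. ʲ, ʰ, ː); "Mn" = nonspacing combining mark.
    return unicodedata.category(form[0]) in {"Lm", "Mn"}


# (ID, FORM) pairs copied from the table above.
rows = [
    ("khk-EssenN-1", "ʲite"),
    ("khk-GriffN-1", "ʲiʃ"),
    ("lit-BuchtN-1", "ʲiːɫɐnkɐ"),
]
flagged = [form_id for form_id, form in rows if starts_with_modifier(form)]
print(flagged)
```

Forms flagged this way would still need a linguistic decision, for example whether the initial ʲ should become a full segment j (as in the SEGMENTS column above) or be dropped.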

correct the orthography profile

The orthography profile was generated automatically and has not been corrected manually; this should be done.

finalize and submit new version with modified diphthongs

add Julius Steuer to contributors.md
re-run with fresh virtual env

I will do this later.
