lexibank / diacl

CLDF dataset derived from Carling's "Diachronic Atlas of Comparative Linguistics" from 2017

License: Creative Commons Attribution 4.0 International

Languages: Python 13.15%, TeX 86.85%

diacl's Introduction

CLDF validation

How to cite

If you use these data, please cite

  • the original source

    Carling, Gerd (ed.) 2017. Diachronic Atlas of Comparative Linguistics Online. Lund: Lund University. (DOI/URL: https://diacl.ht.lu.se/). Accessed on: 2019-02-07.

  • the derived dataset using the DOI of the particular released version you were using

Description

This dataset is licensed under a CC-BY-4.0 license.

Available online at https://diacl.ht.lu.se/

Conceptlists in Concepticon:

Notes

From Diachronic Atlas of Comparative Linguistics (DiACL)—A database for ancient language typology

An important additional resource of the database DiACL is constituted by basic vocabulary lists, consisting of a Swadesh 100-list, analysed by cognacy and with loans removed. Nearly all languages for Eurasia that are in the data set DiACL Typology/Eurasia have complementary sets of basic vocabulary, with the exception of North-East and North-West Caucasian languages, for which cognacy analysis is not available. The basic vocabulary data set has been compiled according to the same basic principles as the typological set: we aim towards symmetry between extinct and contemporary languages (i.e., concerning polymorphism), and all data points are sourced in reliable literature. The basic vocabulary data set is a useful resource, for instance for testing typological against lexical change, or for establishing a lexical phylogenetic tree, against which gain and loss rates of typological data can be measured. The basic vocabulary data can be retrieved from the following URL: https://diacl.ht.lu.se/WordList/Index.

Statistics

CLDF validation | Glottolog: 94% | Concepticon: 100% | Source: 100%

  • Varieties: 422
  • Concepts: 542
  • Lexemes: 60,206
  • Sources: 357
  • Synonymy: 1.38

Contributors

Name | GitHub user | Description | Role
Robert Forkel | @xrotwang | digitization, code | Other
Christoph Rzymski | @chrzyki | patron | Other
Gerd Carling | | | Author, Distributor, Editor

CLDF Datasets

The following CLDF datasets are available in cldf:

diacl's People

Contributors

chrzyki, johenglisch, simongreenhill, xrotwang


diacl's Issues

smart splitters of linguistic entries

DiACL is particularly challenging for automatic parsing from value to form: cells contain up to five entries, at times even more, separated by different delimiters. We thus need to define the splitting function carefully, and I recommend taking only the first of the forms in the data, not because I think it is the best form, but because that is more principled than a random selection or something else. What is clear: we can't handle this manually, but we need to check the data thoroughly to make sure that, for example, all separators are identified.
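
A minimal sketch of the "first form only" strategy using clldutils (which pylexibank builds on); the separators, brackets and the example value are assumptions for illustration, not taken from the DiACL data:

from clldutils.misc import nfilter
from clldutils.text import split_text_with_context, strip_brackets

def first_form(value, separators=";/,", brackets=None):
    """Return the first non-missing form from a multi-value cell, or None."""
    brackets = brackets or {"(": ")"}
    forms = nfilter(
        strip_brackets(chunk, brackets=brackets).strip()
        for chunk in split_text_with_context(value, separators=separators, brackets=brackets)
        if chunk.strip() not in ('?', '-'))
    return forms[0] if forms else None

print(first_form('ruka / šaka (palm)'))  # -> 'ruka'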

Towards pylexibank 2.0

See b43368a

Still working on the code (imports, tqdm, concept lists), but could someone please check the way I handle the forms and whether this makes sense? This uses replacements as proposed in lexibank/pylexibank#142. Specifically:

# Imports added for context: attrs and clldutils are actual pylexibank
# dependencies; `log` stands in for the package logger used by the snippet.
import logging

import attr
from clldutils import misc, text

log = logging.getLogger(__name__)


@attr.s
class FormSpec(object):
    brackets = attr.ib(
        default={"(": ")"},
        validator=attr.validators.instance_of(dict))
    separators = attr.ib(default=";/,")
    missing_data = attr.ib(
        default=('?', '-'),
        validator=attr.validators.instance_of((tuple, list)))
    strip_inside_brackets = attr.ib(
        default=True,
        validator=attr.validators.instance_of(bool))
    replacements = attr.ib(
        default={},
        validator=attr.validators.instance_of(dict))
    first_form_only = attr.ib(
        default=False,
        validator=attr.validators.instance_of(bool))

    def clean(self, form, item=None):
        """
        Called when a row is added to a CLDF dataset.

        :param form: the raw form string extracted from the value.
        :param item: the source row the form belongs to (unused here).
        :return: None to skip the form, or the cleaned form as string.
        """
        for s, t in sorted(self.replacements.items(), key=lambda x: len(x[0]), reverse=True):
            form = form.replace(s, t)

        if form not in self.missing_data:
            if self.strip_inside_brackets:
                return text.strip_brackets(form, brackets=self.brackets)
            return form

    def split(self, item, value, lexemes=None):
        lexemes = lexemes or {}
        if value in lexemes:
            log.debug('overriding via lexemes.csv: %r -> %r' % (value, lexemes[value]))
            value = lexemes[value]
        if self.first_form_only:
            return misc.nfilter(
                self.clean(form, item=item)
                for form in text.split_text_with_context(
                    value, separators=self.separators, brackets=self.brackets))[:1]
        else:
            return misc.nfilter(
                self.clean(form, item=item)
                for form in text.split_text_with_context(
                    value, separators=self.separators, brackets=self.brackets))
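
For a quick sanity check of the handling described above, a minimal usage sketch (the example values are made up; it relies on the imports added at the top of the snippet):

spec = FormSpec(replacements={'_': ' '}, first_form_only=True)
print(spec.split(None, 'foo_bar; baz (uncertain)'))  # -> ['foo bar']
print(spec.split(None, '?'))                         # '?' is a missing-data marker -> []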

Identify DIACL concepts which map to the same concept set?

DIACL contains concepts from multiple concept lists, but does not merge or identify them. The lexibank CLDF could either do this (once we have Concepticon mappings for all lists) or just do what DIACL does. Since a CLDF Wordlist does not have a many-to-many relation between lexemes and concepts, the latter results in a multiplication of mostly identical forms, each associated with a different concept (which in turn are not very different from each other). This multiplication of forms would also lead to a multiplication of cognate sets.

So using the DIACL data would typically have to start with filtering everything by concept list. But I guess that's still preferable to adding more complex merging code when creating the CLDF?
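
A rough sketch of how one could list the concepts affected, grouping parameters by their Concepticon mapping; the path cldf/parameters.csv and the columns ID and Concepticon_ID follow the usual lexibank layout and would need to be checked against this dataset's metadata:

import csv
from collections import defaultdict

# Group DIACL concepts (parameters) by the Concepticon concept set they map to.
by_concepticon = defaultdict(list)
with open('cldf/parameters.csv', encoding='utf8') as f:
    for row in csv.DictReader(f):
        if row.get('Concepticon_ID'):
            by_concepticon[row['Concepticon_ID']].append(row['ID'])

# Concept sets that more than one DIACL concept maps to:
for cid, param_ids in sorted(by_concepticon.items()):
    if len(param_ids) > 1:
        print(cid, param_ids)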

Problems in Serbian

There are some mismappings; for example, the data has about six words for DEER. We were informed by somebody who wrote to Joshua Jackson, who then wrote to me:

“Cow would be what is translated as Deer.
Krava = Cow
Vo = Ox
Bik = Bull
Jelen = Deer
Jelena is one of the common names in Serbia (likely related to Helen rather than Jelen, though)
Right beneath is KRV, which is BLOOD and definitely not the meat. MESO is meat. 
Above that KORA = it is a bark, but leather is KOŽA
Am I missing something important? 
Jegulja is an EEL, not a snake, ZMIJA is a snake. 
Konj is a male horse, kobila is a mare. 
Jare is NOT a lamb. Jagnje (janje) is a lamb, jare is a baby goat, not a sheep. 
Jagoda is a strawberry, not a grape. Grožđe is a term for the grapes, grozd is singular.”

I suggest we manually correct these cases via lexemes.csv overrides (see the sketch below). I would also inform the DIACL editors about this.

Or, @chrzyki, @xrotwang, is it possible that the error (something swapped here) is on the side of the pylexibank script?
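
To illustrate the override mechanism, a hypothetical correction (the values here are placeholders; the actual wrong entries would have to be taken from the raw DIACL export, and in the repository the mapping would live in etc/lexemes.csv, which pylexibank passes into split() as shown in the snippet above):

spec = FormSpec()
# Hypothetical override: replace a form that was entered under the wrong meaning.
overrides = {'jare': 'jagnje'}
print(spec.split(None, 'jare', lexemes=overrides))  # -> ['jagnje']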
