lexibank / diacl

CLDF dataset derived from Carling's "Diachronic Atlas of Comparative Linguistics" from 2017

License: Creative Commons Attribution 4.0 International

Languages: Python 13.15%, TeX 86.85%

diacl's Introduction

CLDF validation

How to cite

If you use these data, please cite

  • the original source

    Carling, Gerd (ed.) 2017. Diachronic Atlas of Comparative Linguistics Online. Lund: Lund University. (DOI/URL: https://diacl.ht.lu.se/). Accessed on: 2019-02-07.

  • the derived dataset using the DOI of the particular released version you were using

Description

This dataset is licensed under a CC-BY-4.0 license.

Available online at https://diacl.ht.lu.se/

Conceptlists in Concepticon:

Notes

From Diachronic Atlas of Comparative Linguistics (DiACL)—A database for ancient language typology

An important additional resource of the database DiACL is constituted by basic vocabulary lists, consisting of a Swadesh 100-list, analysed by cognacy and with loans removed. Nearly all languages for Eurasia that are in the data set DiACL Typology/Eurasia have complementary sets of basic vocabulary, with the exception of North-East and North-West Caucasian languages, for which cognacy analysis is not available. The basic vocabulary data set has been compiled according to the same basic principles as the typological set: we aim towards symmetry between extinct and contemporary languages (i.e., concerning polymorphism), and all data points are sourced in reliable literature. The basic vocabulary data set is a useful resource, for instance for testing typological against lexical change, or for establishing a lexical phylogenetic tree, against which gain and loss rates of typological data can be measured. The basic vocabulary data can be retrieved from the following URL: https://diacl.ht.lu.se/WordList/Index.

Statistics

CLDF validation | Glottolog: 94% | Concepticon: 100% | Source: 100%

  • Varieties: 422
  • Concepts: 542
  • Lexemes: 60,206
  • Sources: 357
  • Synonymy: 1.38

Contributors

Name | GitHub user | Description | Role
Robert Forkel | @xrotwang | digitization, code | Other
Christoph Rzymski | @chrzyki | patron | Other
Gerd Carling | | | Author, Distributor, Editor

CLDF Datasets

The following CLDF datasets are available in cldf:

diacl's People

Contributors

chrzyki, johenglisch, simongreenhill, xrotwang


diacl's Issues

smart splitters of linguistic entries

DiACL is particularly challenging for automatic parsing from value to form: cells contain up to five entries, at times even more, separated by different delimiters. We thus need to define the splitting function carefully, and I recommend taking only the first of the forms in the data, not because I think it is the best form, but because that is more principled than a random selection or something else. What is clear: we can't handle this manually, but we need to check the data thoroughly to make sure that, for example, all separators are identified.
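
A minimal sketch of the "first form only" strategy using clldutils (which pylexibank builds on); the separators, brackets and the example value are assumptions for illustration, not taken from the DiACL data:

from clldutils.misc import nfilter
from clldutils.text import split_text_with_context, strip_brackets

def first_form(value, separators=";/,", brackets=None):
    """Return the first non-missing form from a multi-value cell, or None."""
    brackets = brackets or {"(": ")"}
    forms = nfilter(
        strip_brackets(chunk, brackets=brackets).strip()
        for chunk in split_text_with_context(value, separators=separators, brackets=brackets)
        if chunk.strip() not in ('?', '-'))
    return forms[0] if forms else None

print(first_form('ruka / šaka (palm)'))  # -> 'ruka'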

Towards pylexibank 2.0

See b43368a

Still working on the code (imports, tqdm, concept lists), but could someone please check the way I handle the forms and whether this makes sense? This uses replacements as proposed in lexibank/pylexibank#142. Specifically:

# Imports added for context: attrs and clldutils are actual pylexibank
# dependencies; `log` stands in for the package logger used by the snippet.
import logging

import attr
from clldutils import misc, text

log = logging.getLogger(__name__)


@attr.s
class FormSpec(object):
    brackets = attr.ib(
        default={"(": ")"},
        validator=attr.validators.instance_of(dict))
    separators = attr.ib(default=";/,")
    missing_data = attr.ib(
        default=('?', '-'),
        validator=attr.validators.instance_of((tuple, list)))
    strip_inside_brackets = attr.ib(
        default=True,
        validator=attr.validators.instance_of(bool))
    replacements = attr.ib(
        default={},
        validator=attr.validators.instance_of(dict))
    first_form_only = attr.ib(
        default=False,
        validator=attr.validators.instance_of(bool))

    def clean(self, form, item=None):
        """
        Called when a row is added to a CLDF dataset.

        :param form: the raw form string extracted from the value.
        :param item: the source row the form belongs to (unused here).
        :return: None to skip the form, or the cleaned form as string.
        """
        for s, t in sorted(self.replacements.items(), key=lambda x: len(x[0]), reverse=True):
            form = form.replace(s, t)

        if form not in self.missing_data:
            if self.strip_inside_brackets:
                return text.strip_brackets(form, brackets=self.brackets)
            return form

    def split(self, item, value, lexemes=None):
        lexemes = lexemes or {}
        if value in lexemes:
            log.debug('overriding via lexemes.csv: %r -> %r' % (value, lexemes[value]))
            value = lexemes[value]
        if self.first_form_only:
            return misc.nfilter(
                self.clean(form, item=item)
                for form in text.split_text_with_context(
                    value, separators=self.separators, brackets=self.brackets))[:1]
        else:
            return misc.nfilter(
                self.clean(form, item=item)
                for form in text.split_text_with_context(
                    value, separators=self.separators, brackets=self.brackets))
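
For a quick sanity check of the handling described above, a minimal usage sketch (the example values are made up; it relies on the imports added at the top of the snippet):

spec = FormSpec(replacements={'_': ' '}, first_form_only=True)
print(spec.split(None, 'foo_bar; baz (uncertain)'))  # -> ['foo bar']
print(spec.split(None, '?'))                         # '?' is a missing-data marker -> []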

Identify DIACL concepts which map to the same concept set?

DIACL contains concepts from multiple concept lists, but does not merge or identify them. The lexibank CLDF could either do this (once we have Concepticon mappings for all lists) or just do what DIACL does. Since a CLDF Wordlist does not have a many-to-many relation between lexemes and concepts, the latter results in a multiplication of mostly identical forms, each associated with a different concept (which in turn are not very different from each other). This multiplication of forms would also lead to a multiplication of cognate sets.

So using the DIACL data would typically have to start with filtering everything by concept list. But I guess that's still preferable to adding more complex merging code when creating the CLDF?
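
A rough sketch of how one could list the concepts affected, grouping parameters by their Concepticon mapping; the path cldf/parameters.csv and the columns ID and Concepticon_ID follow the usual lexibank layout and would need to be checked against this dataset's metadata:

import csv
from collections import defaultdict

# Group DIACL concepts (parameters) by the Concepticon concept set they map to.
by_concepticon = defaultdict(list)
with open('cldf/parameters.csv', encoding='utf8') as f:
    for row in csv.DictReader(f):
        if row.get('Concepticon_ID'):
            by_concepticon[row['Concepticon_ID']].append(row['ID'])

# Concept sets that more than one DIACL concept maps to:
for cid, param_ids in sorted(by_concepticon.items()):
    if len(param_ids) > 1:
        print(cid, param_ids)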

Problems in Serbian

There are some mismappings; for example, the data has about six words for DEER. We were informed by somebody who wrote to Joshua Jackson, who then wrote to me:

“Cow would be what is translated as Deer.
Krava = Cow
Vo = Ox
Bik = Bull
Jelen = Deer
Jelena is one of the common names in Serbia (likely related to Helen rather than Jelen, though)
Right beneath is KRV, which is BLOOD and definitely not the meat. MESO is meat. 
Above that KORA = it is a bark, but leather is KOŽA
Am I missing something important? 
Jegulja is an EEL, not a snake, ZMIJA is a snake. 
Konj is a male horse, kobila is a mare. 
Jare is NOT a lamb. Jagnje (janje) is a lamb, jare is a baby goat, not a sheep. 
Jagoda is a strawberry, not a grape. Grožđe is a term for the grapes, grozd is singular.”

I suggest we manually correct these cases via lexemes.csv overrides (see the sketch below). I would also inform the DIACL editors about this.

Or, @chrzyki, @xrotwang, is it possible that the error (something swapped here) is on the side of the pylexibank script?
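
To illustrate the override mechanism, a hypothetical correction (the values here are placeholders; the actual wrong entries would have to be taken from the raw DIACL export, and in the repository the mapping would live in etc/lexemes.csv, which pylexibank passes into split() as shown in the snippet above):

spec = FormSpec()
# Hypothetical override: replace a form that was entered under the wrong meaning.
overrides = {'jare': 'jagnje'}
print(spec.split(None, 'jare', lexemes=overrides))  # -> ['jagnje']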
