concepticon / concepticon-data

The curation repository for the data behind Concepticon.

Home Page: https://concepticon.clld.org

Python 5.57% TeX 91.56% HTML 0.76% JavaScript 1.57% RouterOS Script 0.55%
concepts cross-linguistic-data linguistics

concepticon-data's People

Contributors

anaphory, annikatjuka, blag, carolinhu, chrzyki, cysouw, evoling, fredericblum, ilchec, kristina-pianykh, laiyunfan, lannin, lingulist, macyl, marthuis, martino-vic, mathildavz, mottaam, muffinlinwist, natalia-morozova, phylostar, schweikhard, simongreenhill, stasreichert, tresoldi, wu-urbanek, xrotwang

concepticon-data's Issues

Sidwell-2015-20X

Paul Sidwell has just published (or will soon publish) his Austro-Asiatic data officially, so we can cite it and map it. I have the list, the draft, and the citation (we should re-confirm that it matches the data file). The list is interesting because it gives a South-East Asian perspective on stable words, similar to the lists by Matisoff and others.

Lists for cross-linguistic naming tests

Interestingly, historical linguistics now has another potential collaboration partner: neurologists. Neurologists use "naming tests" to assess the degree of aphasia and the like, and they have realized that it might be interesting to look at Swadesh lists:

http://www.sciencedirect.com/science/article/pii/S088761770700011X

The list contains forty items, this time chosen according to practical criteria that matter in neurology.

The data has already been converted to table form. All that is needed is to link it (and this will be quick). The resource also has a nice collection of forty photographs (apparently for aphasia tests, in which the doctor asks patients to say what they see), which we can link via the URL.

Concept list of more than 1000 concepts by Pelkey-2011

The SIL study on Phula languages (Sino-Tibetan family) gives two lists: a master list of more than 1100 concepts (difficult to map), and a list of etyma, for which, according to the author, a 660-item master list from another source was used (apparently based on Bradley 1979). How well this list can be added depends mainly on how well the concepts can be extracted from the PDF and how clearly they are described.

match concepticon general with STEDT taxonomy?

This will be tedious, but they have this nice historically informed taxonomy in STEDT:

Having the concepticon mapped to it would be nice.

The problem is the size, and again the question of what to do if a resource is so big that we can only take a small part of it for the concepticon. Is that still the same kind of "concept list", or is it something else? I am thinking of major links to

  • wordnet (we have an indirect one, but no "real" one)
  • STEDT
  • some association data (age of acquisition or whatever)
  • wikipedia (most definitions for plants and animals were now directly taken from there)

But I have the feeling that we should not call these things "concept lists", and we won't be able to get full coverage for the whole bunch of about 2500 concept sets we have at the moment.

Generally, there are two possibilities:

  • (a) decide upon a subset of a certain number of concepts and link this subset as a concept list
  • (b) open a new category for meta-data which was assembled directly for the concept sets

I think that (b) would be better, also for users to find what they are looking for...

Tests for the Concepticon

While working on the concepticon, more things come to mind that we need tests for. I suggest we collect them here so that we do not forget them.

  • tests for concepticon.tsv should check for the number of categories, like person/thing, but also for the rough semantic field (animals, the body), etc., since spelling differences may create errors here, and the categories and fields are actually fixed

Of course, more tests are surely needed, but this is what I can think of at the moment.
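The check described above could be sketched roughly as follows. This is only an illustration: the column names `SEMANTICFIELD` and `ONTOLOGICAL_CATEGORY` and the allowed values are assumptions made for the example, and the real fixed vocabularies would have to come from the project itself.

```python
import csv
import io

# Hypothetical fixed vocabularies; the real sets must be taken from the project.
ALLOWED = {
    "ONTOLOGICAL_CATEGORY": {"Person/Thing", "Action/Process", "Property", "Other"},
    "SEMANTICFIELD": {"Animals", "The body", "Kinship"},
}


def find_invalid_values(tsv_text, allowed=ALLOWED):
    """Return (row ID, column, value) for every value outside the fixed sets."""
    problems = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        for column, values in allowed.items():
            if row[column] not in values:
                problems.append((row["ID"], column, row[column]))
    return problems


# Made-up sample rows with two deliberate spelling differences.
sample = (
    "ID\tGLOSS\tSEMANTICFIELD\tONTOLOGICAL_CATEGORY\n"
    "1\tHAND\tThe body\tPerson/Thing\n"
    "2\tDOG\tAnimal\tPerson/Thing\n"
    "3\tWALK\tThe body\tAction/process\n"
)
problems = find_invalid_values(sample)
for row_id, column, value in problems:
    print(f"row {row_id}: unexpected {column} value {value!r}")
```

Because the comparison is an exact string match, a spelling or capitalisation difference is reported instead of silently creating a new category.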

new lists by Anvita Abbi to be scanned and added

There are, apparently, proposed regional Swadesh lists for Indian languages in the book by A. Abbi (I looked at the TOC, and there seem to be two or three new lists of different sizes):

Abbi, Anvita (2001). A Manual of Linguistic Field Work and Structures of Indian
Languages. München: Lincom Europa.

Is there a copy of that book in Marburg, @cysouw, and if so, could you send a student to scan it (or the relevant parts)?

Lee and Hasegawa (Japonic)

I don't know yet how many concepts the list has, but the supplementary material is available (as a PDF, unfortunately, so it will have to be typed up).

Refine kinship terms and add parabank list

We should refine the kinship terms by following what they have in ParaBank. The question is how to handle the relations, but maybe we can just leave this to the ParaBank list that is to be added.

Provide mapping to babelnet, where possible

BabelNet is a very nice resource that defines its own synsets and maps them to OmegaWiki, Princeton WordNet, MultiWordNet, and Wikipedia. We should try to map our concepts to it, where possible. I am currently preparing an automatic pre-mapping using their API. We will probably have unmappable items, but coverage of about 80% (which seems realistic) would be very nice. To test BabelNet, check out the Babelfy website, where one can simply enter words and see which synsets are inferred for them.

If the mapping can be provided for, say, 80% of our concepts, we can delete the OmegaWiki links, since they should be included in BabelNet (and it is unlikely that we will be able to add more than the ones we already have, given how slow the OmegaWiki API is...).

Verify that concepts meaning "thin" are now correctly handled

In our online-alpha-version, the words for "thin" all mean different things (as can be seen from the Chinese labels). The concepticon major list has already been adjusted in this sense, offering enough concepts here, but it needs to be confirmed that the links are actually applied to all the data now.

Provide mapping example using the python script on the website or to be linked on github

When launching the concepticon in version 1.0, there should be one test example showing how users can map their own lists. This requires a little explanation and a reference to LingPy, since the code for comparing glosses is implemented in the meaning module of LingPy. It still seems worthwhile to have a brief description and a link on the main page, so that anyone who wants to can use the resource to quickly link a list.

list by Bengtson and Ruhlen

I have just typed up this list. It contains 27 items with GLOBAL etymologies, which the authors say represent some kind of proto-world. Surprisingly, there are two words for "leg". The concepts are a bit strange and contain many semantic-shift meanings (breast-milk), but this is interesting when comparing the list with other areal lists, which merge meanings such as meat/animal.

Handling identical or almost identical lists across multiple publications

Following up on the discussion in PR #58, I thought it would be useful to turn this into an issue.

It seems that the lists by Shirō Hattori are an excellent example of identical concept lists across different publications. There is still the question of how to handle this. A simple solution would be to list several references and indicate in the note column of conceptlists.tsv which additional data point stems from which list (suppose a list had only English at first, but a later, otherwise identical edition added Japanese). This is feasible, since there are not too many lists where this needs to be done. Another possibility would be to add yet another column to conceptlists.tsv, called USEDBY or something similar, indicating whether this very list was used in further publications.

The list by Dyen et al. 1992/1997 would be a use case for this, since it is exactly identical to Swadesh-1952-200, except that they use uppercase where Swadesh used normal orthography. If we otherwise keep following a policy which says that, theoretically, no two lists are the same (and one can argue about this), then Dyen 1992 should also be added, and Hattori's lists should be split into three or four.

We basically already have a small problem of ontology and epistemology in the concepticon. It is clear that we cannot simply trust papers that say they used the list by Swadesh, for example, since they often use new concept labels while claiming to be based on some list. So the most coherent way might be to add all lists we can get, but this may then turn out to be redundant, so an "ALSO-USED-BY" column (or whatever better label) may be a compromise solution.

But I'm by no means completely convinced by either of the solutions mentioned above...
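To decide whether two published lists really are "the same", a comparison that ignores case and spacing could be a first filter. The following is a minimal sketch under that assumption; the function names are hypothetical and the two short excerpts are made up for illustration, not copied from the actual lists.

```python
def normalize(label):
    """Lowercase a concept label and collapse runs of whitespace."""
    return " ".join(label.casefold().split())


def same_list(labels_a, labels_b):
    """True if both lists have the same labels in the same order,
    ignoring differences in case and spacing."""
    return [normalize(x) for x in labels_a] == [normalize(x) for x in labels_b]


# Made-up excerpts: one in normal orthography, one in uppercase.
normal_orthography = ["all", "and", "animal"]
uppercase_variant = ["ALL", "AND", "ANIMAL"]
relabelled_variant = ["all", "and", "beast"]

print(same_list(normal_orthography, uppercase_variant))
print(same_list(normal_orthography, relabelled_variant))
```

Lists that pass such a check would be candidates for a shared entry with several references, while lists that fail it (because of new concept labels) would still need a manual decision.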

Dixon's list has wrong numbers

In the original Dixon list, the number 14 is missing in the source itself (apparently an entry for ear (1), which is simply not there). Furthermore, our current entry "12" is entry 11a in Dixon's list. This should be updated quickly once the IDs have been changed.

Add links to wikidata

I didn't know about Wikidata before, but now that I have checked it, it seems it could give us valuable information in many respects, not for all concept sets, of course, but probably for quite a few. Here is, for example, the Wikidata entry for hand:

They also seem to have a good API, so one may be able to search more or less automatically for good matches...

Missing concept lists

Currently we have metadata on 58 concept lists in conceptlists.tsv but only 50 concept lists in conceptlists/. The missing ones are

  • Kassian-2010-110
  • Gabelentz-1891-120
  • Snider-2004-1700
  • Zorc-1974-100
  • Shevoroshkin-1991-23
  • Swadesh-1960-100
  • Marsden-1782-50
  • Pallas-1785-441

I think we should either add these, maybe even as empty stubs, or remove them from conceptlists.tsv. If we add empty concept list files for these, we might need a flag in conceptlists.tsv that signals the processing status, to prevent them from being imported into the concepticon app.
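A comparison like the one that produced the list above can be sketched as follows. It assumes that conceptlists.tsv has an `ID` column and that each list is stored as `<ID>.tsv` inside conceptlists/; both are assumptions made for this example, and the demo uses a temporary directory with made-up contents.

```python
import csv
import tempfile
from pathlib import Path


def missing_concept_lists(metadata_file, lists_dir):
    """Compare list IDs in the metadata file with the files in lists_dir.

    Returns two sorted lists: IDs with metadata but no file,
    and files without a metadata entry.
    """
    with open(metadata_file, newline="", encoding="utf-8") as handle:
        described = {row["ID"] for row in csv.DictReader(handle, delimiter="\t")}
    present = {path.stem for path in Path(lists_dir).glob("*.tsv")}
    return sorted(described - present), sorted(present - described)


# Small demo in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "conceptlists").mkdir()
    (root / "conceptlists.tsv").write_text(
        "ID\tNOTE\nSwadesh-1952-200\t\nZorc-1974-100\t\n", encoding="utf-8"
    )
    (root / "conceptlists" / "Swadesh-1952-200.tsv").write_text(
        "ID\tENGLISH\n", encoding="utf-8"
    )
    no_file, no_metadata = missing_concept_lists(
        root / "conceptlists.tsv", root / "conceptlists"
    )

print("metadata without file:", no_file)
print("file without metadata:", no_metadata)
```

Run as part of the test suite, such a check would catch a mismatch in either direction before the data is imported.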

unify column names in concept lists

Concept lists should only have a GLOSS column if English is not among the list's source languages (or if an English gloss used in the source was corrected, in which case the column ENGLISH would list the gloss as found in the source, and GLOSS the corrected version). Otherwise, the column GLOSS should be renamed to ENGLISH.
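The rule above can be written down as a small check. This is a sketch only: the function name is hypothetical, and it assumes that the source languages of a list are available as a list of lowercase language names.

```python
def gloss_column_problem(source_languages, columns):
    """Return a message if a list breaks the GLOSS/ENGLISH rule, else None.

    GLOSS is acceptable when English is not a source language, or when an
    ENGLISH column holding the uncorrected source gloss is also present.
    """
    languages = {lang.strip().lower() for lang in source_languages}
    names = set(columns)
    if "english" in languages and "GLOSS" in names and "ENGLISH" not in names:
        return "GLOSS should be renamed to ENGLISH"
    return None


# English source with only GLOSS: violates the rule.
print(gloss_column_problem(["english"], ["ID", "NUMBER", "GLOSS"]))
# English source with both columns: GLOSS is the corrected version.
print(gloss_column_problem(["english"], ["ID", "ENGLISH", "GLOSS"]))
# No English source: GLOSS is fine.
print(gloss_column_problem(["chinese"], ["ID", "CHINESE", "GLOSS"]))
```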

Concepticon data (original stuff by Good et al.) doesn't provide links to the concepticon concepts

I have already converted the Good data, which contains IDS, WOLD, and one further mapping (Usher-Whitehouse), to CSV. However, we have a problem with the URLs there, since they produce an error on the website:

Should we still leave those urls in the file, or just discard them? The concepticon IDs, which are referenced in the data, seem to be OK, as far as I checked.

Sutton and Walsh -- 1987

I don't have access to this book, but it could be useful to include it, since it seems to offer new concept lists for the Australian region:

Sutton, P. and M. Walsh (1987). Wordlist for Australian Languages. Canberra:
Australian Institute of Aboriginal Studies.

Ardila's 2007 wordlist on cross-linguistic naming tests

This is a list proposed for neurology and based on Swadesh lists (humanities feeds science): the author proposes to use the Swadesh lists to test for aphasia, strokes, and so on. The list has already been typed up but still needs to be mapped.

Uralex Data should best be added before next release

Given how closely we work with Uralex in lexibank and the like, we should try to have an official Uralex version for the next release (this can be done quickly, but we need the proper list, authorised by the Uralex people).

Blust-1981-200

Robert Blust was kind enough to give me the first alternative Swadesh list he created (presented in a talk in 1981). The list is thus already digital, but it still needs to be linked (since it is much more precise than the list of the ABVD).

LOGOS children dictionary with flashcards

The Logos Children dictionary has translations for some 1000 concepts into some 60 languages. It may be interesting to link these concepts, especially since they offer images for each of the concepts, which may be useful to have for certain purposes.

BLESS-Data on semantic associations

The BLESS dataset offers some interesting semantic associations (hyperonymy, etc.) for 200 concrete words.

The data may be interesting for the concepticon, since it offers additional accounts on semantic relations between words/concepts.

Missing Russian source concepts

The following three concept lists list Russian as source language, but do not have a corresponding column RUSSIAN:

  • Starostin-1991-110
  • Jachontov-1991-100
  • Jachontov-1991-65
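Cases like the three above could be found automatically by comparing the declared source languages of each list with the columns of its file. The sketch below assumes that the source languages are available as a list of language names and that the corresponding column is the uppercased language name; the function name and the sample header are made up for illustration.

```python
def missing_language_columns(source_languages, columns):
    """Return the source languages that have no matching (uppercase) column."""
    present = {name.upper() for name in columns}
    cleaned = [lang.strip() for lang in source_languages]
    return [lang for lang in cleaned if lang.upper() not in present]


# A list that declares Russian and English but only has an ENGLISH column.
header = ["ID", "NUMBER", "ENGLISH", "CONCEPTICON_ID"]
print(missing_language_columns(["russian", "english"], header))
```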

Conceptlist from Grollemund et al. 2015

In "Bantu expansion shows that habitat alters the route and pace of human dispersals", Grollemund et al. use wordlists for ~420 Bantu languages containing words for 100 concepts, sampled from the 159 concepts in

Hombert J-M, Van der Veen L, Medjo Mve P (2011) ALGAB, Atlas Linguistique du GABon (Laboratoire Dynamique du Langage, Lyon).
Available at http://www.ddl.ish-lyon.cnrs.fr/equipes/index.asp?Langue=FR&Equipe=8&Page=Action&ActionNum=48
Accessed November 10, 2014.

Testing routine for wrongly assigned links based on distribution of concept labels

There is a simple test we can run that enables us to find obviously wrong links:

  • assemble all concept labels in a dictionary
  • count and store the different concepticon-ids which are assigned to the same label
  • list the labels which are potentially problematic

The point is the following: when mapping, one may overlook a very good concept set and link to another one instead, as I probably did with "no, not", which I inconsistently linked to either "NO" or "NO OR NOT". The procedure stated above would show which lists have the same concept labels but link them differently.

Note, however, that this procedure cannot be fully automated. In some cases, I linked words based on the concept labels in other languages (compare "dull" linked to "dull (of knife)" and "dull (stupid)"), because in some lists I have Chinese translations and therefore know better which concept was intended.

So the routine can either be integrated into the testing itself, in which case it should only raise warnings, not errors, or it can be used separately, as a tool for those who link a new list to the concepticon.
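The three steps of the routine could be sketched like this. The input format, a sequence of (list ID, concept label, concept set) triples, is an assumption made for the example, and the list IDs are made up. Following the remark above, the sketch only raises warnings.

```python
import warnings
from collections import defaultdict


def inconsistent_links(rows):
    """Find labels that are linked to more than one concept set.

    rows: iterable of (list_id, label, concept_set) triples.
    Returns {label: {concept_set: [list_id, ...]}} for suspicious labels only.
    """
    # Step 1 and 2: collect, per label, the concept sets it is linked to.
    sets_by_label = defaultdict(lambda: defaultdict(set))
    for list_id, label, concept_set in rows:
        sets_by_label[label.strip().lower()][concept_set].add(list_id)
    # Step 3: keep only labels linked to two or more different concept sets.
    return {
        label: {cs: sorted(lists) for cs, lists in sets.items()}
        for label, sets in sets_by_label.items()
        if len(sets) > 1
    }


rows = [
    ("List-A", "no, not", "NO"),
    ("List-B", "No, not", "NO OR NOT"),
    ("List-A", "hand", "HAND"),
    ("List-B", "hand", "HAND"),
]
suspicious = inconsistent_links(rows)
for label, links in suspicious.items():
    warnings.warn(f"label {label!r} is linked inconsistently: {links}")
```

A reported label is only a candidate for review, since a different link may be justified by information from other languages in the list.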

Badges for the Concept-Lists

In order to allow for quality control and the like, it would be great if we could create badges for all concept lists processed. This would answer questions like:

  1. are there mergers inside the list, that is, are two or more concepts linked to the same concept sets?
  2. [maybe silly, but might be interesting] how often do the concepts in the very list occur in other lists on average, that is, how "unique" is the list regarding its inventory
  3. [also not necessary, but good for proof-checking] how large is the Levenshtein distance on average between the concept labels in the list and the concept labels of the concepticon
  4. which lists are most similar in terms of overlaps in concepts to the given list

I think this may be interesting information that we could assemble automatically whenever the concepticon data is parsed, and it would be worthwhile to show it. It may be enough to write a script that computes the values and adds them automatically to the file conceptlists.tsv, since that is probably where the information would be displayed afterwards...
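Three of the values above (mergers, average Levenshtein distance, and overlap between lists) could be computed along the following lines. This is a sketch under assumptions: a concept list is represented as (label, concept set gloss) pairs, overlap is measured as the Jaccard index, and all function names and sample data are made up for the example.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def mergers(links):
    """Concept sets that two or more labels of the same list are linked to."""
    counts = Counter(gloss for _, gloss in links)
    return sorted(gloss for gloss, n in counts.items() if n > 1)


def mean_label_distance(links):
    """Average case-insensitive edit distance between label and concept set gloss."""
    if not links:
        return 0.0
    distances = [levenshtein(label.lower(), gloss.lower()) for label, gloss in links]
    return sum(distances) / len(distances)


def overlap(concepts_a, concepts_b):
    """Jaccard index of the concept sets used by two lists."""
    a, b = set(concepts_a), set(concepts_b)
    return len(a & b) / len(a | b) if a | b else 0.0


links = [("arm", "ARM OR HAND"), ("hand", "ARM OR HAND"), ("foot", "FOOT")]
print("mergers:", mergers(links))
print("mean label distance:", mean_label_distance(links))
print("overlap:", overlap({"HAND", "FOOT"}, {"HAND", "EYE"}))
```

Ranking all other lists by `overlap` with a given list would answer the fourth question, and the results could then be written to additional columns.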

MRC Psycholinguistic Database might give interesting subsets

The MRC Psycholinguistic Database contains a large amount of metadata that may be interesting for conceptual studies. It is difficult to link, however, and the link should be limited to a subset of the data (maybe a concept list of its own that offers the metadata for a specific purpose).

This is not urgent, but I consider it useful to collect conceptual metadata that could be partially linked at some point...

Buck -- 1949

This list has been digitized by many people, but their versions all differ from the original. I have now digitized the full list myself (with a little help from OCR). It still needs re-editing, but afterwards it should be mapped as completely as possible to the Concepticon, since this list is historically quite important: it was proposed independently of Swadesh, and it was used in a couple of projects thereafter.

We could then also link this list to all available conlang lists. These are interesting because they made their own mapping, which could be compared with the mappings made by the concepticon.

The lists can all be found here and they offer them for download in CSV.

Note that the Buck-1949 list in the conlang archive is not identical to Buck's original. I made some quick tests and found that they do not retain the original wording and sometimes add certain words to the labels. If we quote Buck, it should be as close as possible to the original.

Typo: rain

Hello from Tübingen!

First of all, thank you all for building and maintaining this project!

Here are a couple of typos that I found in the concept names:

  • RAIN (PRECIPATION) --> RAIN (PRECIPITATION) // ID: 658
  • FRONT TOOTH (INEISOR_ --> FRONT TOOTH (INCISOR) // ID: 442

The latter has been fixed here in the repo, but apparently the fix has not reached the web server.

Kind regards,
Pavel
