Tracing semantic shifts for Russian

This repository contains the code and the datasets for the paper "Tracing cultural diachronic semantic shifts in Russian using word embeddings: test sets and baselines" by Vadim Fomin, Daria Bakshandaeva, Julia Rodina and Andrey Kutuzov (accepted to Dialog-2019). The slides presented at Dialog-2019 are available in dialogue_slides.pdf.

Dataset

The micro.csv file in the datasets directory contains Russian adjectives manually annotated for temporal semantic shifts in the time span from 2000 to 2014.

It consists of 280 entries. For each entry, three annotators considered a word from the WORD column (e.g., свиной 'of a swine, related to a swine') and decided to what degree the word had changed its meaning from the year in the BASE_YEAR column (e.g., 2009, the year when swine flu was widely discussed in the media) to the next year. Individual annotator scores can be found in the ASSESSOR1, ASSESSOR2, and ASSESSOR3 columns. Scores are on a scale from 0 to 2. To calculate the final score, the arithmetic mean of the three scores was taken (the ASSESSOR_MEAN column) and rounded to the nearest integer. The rounded value is treated as the ground truth (the GROUND TRUTH column).
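The scoring rule described above can be sketched in a few lines of Python. This is an illustration only, not code from the repository: the function name `ground_truth` and the example scores are hypothetical, and the sketch assumes exactly the procedure stated in the text (mean of the annotator scores, rounded to the nearest integer).

```python
import math


def ground_truth(scores):
    """Return (mean, rounded mean) for annotator scores on the 0-2 scale.

    Hypothetical helper: mirrors the rule described in the text,
    not the authors' own code.
    """
    if not scores or any(s not in (0, 1, 2) for s in scores):
        raise ValueError("each score must be 0, 1 or 2")
    avg = sum(scores) / len(scores)
    # With three integer scores the mean is a multiple of 1/3, so it is
    # never exactly x.5 and rounding to the nearest integer is unambiguous.
    return avg, math.floor(avg + 0.5)


# Made-up annotator scores, for illustration only.
mean_score, label = ground_truth([2, 1, 1])
print(mean_score, label)
```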

The macro.csv file in the same directory contains 215 Russian words. 43 of them (38 nouns and 5 adjectives) are manually picked words that have undergone semantic changes from pre-Soviet through Soviet times. There are also four fillers for each target word (152 nouns and 20 adjectives). The target words are marked as 1 and the fillers are marked as 0.

See the paper for further details of the dataset creation.

Code

The algos directory contains our implementation of the semantic shift detection algorithms used to trace semantic shifts in Russian words:

  • Jaccard distance
  • Kendall tau distance
  • Procrustes alignment
  • Global Anchors

Using the code

Given two embedding models, you can measure the degree of semantic change for any word X, provided that X is present in both models.

Run the score_word.py script as follows:

python3 score_word.py -w X -m1 2000.model -m2 2014.model

This prints the scores from each of the four algorithms (a higher score means higher similarity between the word's meanings in the two models):

KendallTau score: -0.05795918367346939 (from -1 to 1)
Jaccard score: 0.0 (from 0 to 1)
Global Anchors score: 0.36681556701660156 (from -1 to 1)
Procrustes aligner score: 0.17986169457435608 (from -1 to 1)
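If you want to collect these scores for many words, the printed lines can be parsed into a dictionary. The following is a hedged sketch, not part of the repository: `parse_scores` is a hypothetical helper, and it assumes the output format stays exactly as in the example above ("<name> score: <value> (from ... to ...)").

```python
import re

# Matches lines such as "Jaccard score: 0.0 (from 0 to 1)".
SCORE_LINE = re.compile(
    r"^(?P<name>.+?) score: (?P<value>-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
)


def parse_scores(output):
    """Map each algorithm name to its score; ignore unrelated lines."""
    scores = {}
    for line in output.splitlines():
        match = SCORE_LINE.match(line.strip())
        if match:
            scores[match.group("name")] = float(match.group("value"))
    return scores


# The sample output shown above, copied verbatim.
sample = """\
KendallTau score: -0.05795918367346939 (from -1 to 1)
Jaccard score: 0.0 (from 0 to 1)
Global Anchors score: 0.36681556701660156 (from -1 to 1)
Procrustes aligner score: 0.17986169457435608 (from -1 to 1)
"""

scores = parse_scores(sample)
print(scores)
```

The text passed to `parse_scores` could come from running score_word.py with the standard `subprocess` module and capturing its standard output.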

Historical embedding models for Russian

The diachronic word embedding models we used in the paper are available for download at https://rusvectores.org/news_history/diachrony_russian/.

BibTeX

@article{fomin-et-al-2019,
  title={Tracing cultural diachronic semantic shifts in {R}ussian using word embeddings: test sets and baselines},
  author={Fomin, Vadim and Bakshandaeva, Daria and Rodina, Julia and Kutuzov, Andrey},
  journal={Komp'yuternaya Lingvistika i Intellektual'nye Tekhnologii: Dialog conference},
  pages={203--218},
  url={http://www.dialog-21.ru/media/4598/fominvplusetal-116.pdf},
  year={2019}
}
