Coder Social home page Coder Social logo

akutuzov / russian_wsi Goto Github PK

View Code? Open in Web Editor NEW
5.0 2.0 0.0 5.78 MB

Russian Word Sense Induction system (for the RUSSE'18 shared task)

License: GNU General Public License v3.0

Python 100.00%
word-sense-induction word-sense-disambiguation word-embeddings russian nlp

russian_wsi's Introduction

Word Sense Induction for Russian

This is an implementation of Russian Word Sense induction system designed for the RUSSE'18 shared task. This system ranked 2nd among 19 participants for the wiki-wiki task. The paper describing it is accepted to the 24rd International Conference on Computational Linguistics and Intellectual Technologies (Dialogue’2018).

The approach is based on clustering averaged word vectors ("semantic fingerprints") of contexts for ambiguous words. Word vectors can be extracted from any word embedding model.

Input data must correspond to the RUSSE'18 data format, as exemplified by the files in the RUSSE_data subdirectory in this repository. Note that the words in the word and context columns of the input dataset should match entries in your word embedding model: for example, regarding lemmatization and PoS-tags. We provide each RUSSE'18 dataset in 3 versions: raw text, lemmas, lemmas + PoS tags.

Word embedding models for Russian can be downloaded, for example, from RusVectōrēs web service.

Usage:

python3 wsi.py [-h] --input INPUT --model MODEL [--weights] [--2stage] [--test]
  -h, --help     show this help message and exit
  --input INPUT  Path to input file with contexts
  --model MODEL  Path to word2vec model
  --weights      Use word weights?
  --2stage       2-stage clustering?
  --test         Make predictions for test file with no gold labels?

For example,

python3 wsi.py --input RUSSE_data/wiki-wiki/train_tagged.csv --model ruscorpora_upos_skipgram_300_5_2018.vec.gz

will use the word embedding model ruscorpora_upos_skipgram_300_5_2018.vec.gz to induce senses for all ambiguous words from train_tagged.csv, cluster the contexts according to these senses, and evaluate the resulting clustering against gold labels in the dataset. The new dataset containing cluster assignments will be saved to train_tagged_predictions.csv.

The --test argument is used when gold labels are not available (like in test_tagged.csv files). In this case, the script will make predictions, but not evaluation. Off by default.

The --weights argument will additionally weight words by their frequencies when creating semantic fingerprints of context utterances. We highly recommend this setting, as it greatly improves the performance. However, for this to work, you need a word embedding model saved in native Gensim format (to store word frequencies in the original training corpus). That's why it is off by default.

The --2stage argument tells the script to perform clustering in 2 stages: first, induce the number of clusters using Affinity Propagation and then do actual clustering with the induced number using Spectral Clustering. This increased performance for some datasets, but not all, therefore it is off by default.

russian_wsi's People

Contributors

akutuzov avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.