Coder Social home page Coder Social logo

cross-nvc's Introduction

Crosslingual Neural Vector Conceptualization (NVC)

Applying NVC to aligned word vectors of English and Chinese.

Accompanying code for the paper:

@inproceedings{raithel_cross_nvc_2019,
  title = {Crosslingual Neural Vector Conceptualization},
  booktitle = {NLPCC Workshop on Explainable Artificial Intelligence},
  author = {Raithel, Lisa and Schwarzenberg, Robert},
  year = {2019}
  }

Installation

Create and activate an environment with Python 3.6.

conda create --name CNVC python=3.6
source activate CNVC

After cloning the repository, install the requirements.

pip install -r requirements.txt

Data

Word Vectors:

The current underlying word vectors were learned with the fastText model [2]. Please download the pre-trained word vectors for English and Chinese (or any other second language for that matter) to the data directory:

wget -P ~/cross_nvc/src/data/ "https://dl.fbaipublicfiles.com/fasttext/vectors-aligned/wiki.en.align.vec"
wget -P ~/cross_nvc/src/data/ "https://dl.fbaipublicfiles.com/fasttext/vectors-aligned/wiki.zh.align.vec"

Microsoft Concept Graph

For using the Microsoft Concept Graph (MCG) data from scratch, please download the data here:

wget -P ~/cross_nvc/src/data/ "https://concept.research.microsoft.com/Home/StartDownload"

The MCG data dump does only comprise concepts, instances and associated counts, no probabilities or REP values. These need to be calculated before training. Therefore, see the script utils/ms_concept_graph_scoring.py in the notebook which calculates all needed probabilities and writes the data to a TSV file.

This takes some time, therefore we recommend to download the already preprocessed data from here and unzip into

./src/data/

This dump includes

  • the data from the MCG as TSV file with REP values (data-concept-instance-relations-with-rep.tsv)
  • a JSON file of the same data, but filtered for instances that also occur in the fastText vocabulary (raw_data_dict_fasttext_maxlenFalse.json)
  • the filtered data including only concepts that have at least 100 instance with a REP > -10 (filtered_data_i100_v-10_fasttext_maxlenFalse)
  • Chinese translations of the English test dataset (translations_aligned)
  • the test set on which we reported the scores in the paper (test_set.csv)

We recommend to use the preprocessed data input_data/raw_data_dict_fasttext_maxlenFalse.json which includes all concepts and instances with their corresponding REP values in a JSON file.

Run and Replicate Experiments with the Notebook

The jupyter notebook demo_cross_nvc.ipynb demonstrates how to use our (pre-)trained neural vector conceptualization (NVC) model to display the reported activation profiles.

Navigate to src/ and start the notebook:

jupyter notebook demo_cross_nvc.ipynb 

or

export CUDA_VISIBLE_DEVICES=DEVICE_NUM jupyter notebook demo_cross_nvc.ipynb

if you want to run it on GPU DEVICE_NUM.

You can run two versions of the notebook:

  1. Use our pre-trained NVC model
  2. Train a new model

Use the pre-trained model

If you run all cells of the notebook in the given order and without changing anything except the necessary file paths, our pre-trained cross-NVC model is applied to a given filtered dataset and the results are reported at the end of the notebook.

Train a new model

Follow the instructions in the notebook to train and evaluate a new model.

[1] A. Joulin, P. Bojanowski, T. Mikolov, H. Jegou, E. Grave, Loss in Translation: Learning Bilingual Word Mapping with a Retrieval Criterion [2] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

cross-nvc's People

Contributors

rbtsbg avatar

Stargazers

 avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

Forkers

rbtsbg

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.