
crosslingual-cca's Introduction

Cross-lingual Word Vectors Projection Using CCA

Manaal Faruqui

This tool projects the word vectors of two different languages into a shared space where they are maximally correlated. It accompanies (Faruqui and Dyer, 2014), which reports that the projected vectors perform much better than the original vectors on a variety of lexical semantic evaluation tasks.

Requirements:-

  1. Python 2.7
  2. Matlab accessible from the shell

Data you need:-

  1. Language1 Word Vector File
  2. Language2 Word Vector File
  3. Word Alignment File

Each vector file should have one word vector per line as follows (space delimited):-

the -1.0 2.4 -0.3 ...

The word alignment file should have the following format (one word pair per line):-

lang1word ||| lang2word

For examples, see en-sample.txt and de-sample.txt (uncompress these vector files first) and align-sample.txt.
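The two input formats above can be read with a few lines of Python. This is a minimal sketch for checking your own files, not code taken from the tool; the function names `parse_vector_line` and `parse_alignment_line` and the word pair `house ||| Haus` are made up for illustration.

```python
def parse_vector_line(line):
    """Parse 'word v1 v2 ...' into (word, [float, ...])."""
    # split() with no arguments splits on any run of whitespace
    fields = line.split()
    if len(fields) < 2:
        raise ValueError("expected a word followed by numbers: %r" % line)
    return fields[0], [float(value) for value in fields[1:]]


def parse_alignment_line(line):
    """Parse 'lang1word ||| lang2word' into (lang1word, lang2word)."""
    parts = [part.strip() for part in line.split("|||")]
    if len(parts) != 2 or not all(parts):
        raise ValueError("expected 'lang1word ||| lang2word': %r" % line)
    return parts[0], parts[1]


# Illustrative inputs (hypothetical, not taken from the sample files)
word, vector = parse_vector_line("the -1.0 2.4 -0.3\n")
pair = parse_alignment_line("house ||| Haus\n")
```

Running both parsers over your files before calling the shell scripts is a cheap way to catch malformed lines early.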

Projecting the embeddings in both languages to a shared space:

./project_vectors.sh Lang1VectorFile Lang2VectorFile WordAlignFile OutFile Ratio

./project_vectors.sh en-sample.txt de-sample.txt align-sample.txt out 0.5

where Ratio is a float between 0 and 1. It is the fraction of the original vector length (number of dimensions) that you want your projected vectors to have.
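To get a feel for what a given Ratio means, the sketch below estimates the length of the projected vectors. The helper `projected_dim` is hypothetical, and rounding up with `ceil` is an assumption; the tool may round differently, so check the output files for the exact size.

```python
import math


def projected_dim(orig_dim, ratio):
    """Estimate the projected vector length for a given Ratio.

    Assumption: the result is rounded up; the tool may round differently.
    """
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1], got %r" % ratio)
    return int(math.ceil(ratio * orig_dim))


# With Ratio 0.5, 300-dimensional input vectors would keep about 150 dimensions.
dims = projected_dim(300, 0.5)
```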

Output

Two files named OutFile_orig1_projected.txt and OutFile_orig2_projected.txt,

which are your new projected word vectors, enjoy ! :D

Projecting the embeddings of language 1 to the vector space of language 2:

./project_vectors_to_lang2.sh Lang1VectorFile Lang2VectorFile WordAlignFile ProjectionFromLang1SpaceToLang2Space Lang1WordEmbeddingsProjectedToLang2Space

./project_vectors_to_lang2.sh en-sample.txt de-sample.txt align-sample.txt en-de-projection projected-en-word-embeddings

Unlike project_vectors.sh, the number of columns (i.e., size of word embeddings) in Lang1VectorFile and Lang2VectorFile must match when using project_vectors_to_lang2.sh. The number of rows (i.e., vocabulary size) may be different. Otherwise, the input files to project_vectors_to_lang2.sh are identical to those of project_vectors.sh.

Output

ProjectionFromLang1SpaceToLang2Space is a serialization of a square matrix whose dimensions both equal the word embedding length in Lang1VectorFile (or Lang2VectorFile; they must match). Standard canonical correlation analysis returns two matrices (A, B), which represent the linear transformations from the language 1 vector space to the shared space and from the language 2 vector space to the shared space, respectively. The matrix in this file is the product AB^-1, that is, A multiplied by the inverse of B.
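The following sketch illustrates why multiplying A by the inverse of B maps language 1 vectors into the language 2 space. It assumes a row-vector convention (a vector x is projected as xA), uses tiny made-up 2x2 matrices rather than real CCA output, and relies on exact fractions so the equality check is not affected by rounding. The helpers `matmul` and `inverse_2x2` are written here for illustration only.

```python
from fractions import Fraction as F


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def inverse_2x2(M):
    """Invert a 2x2 matrix; raises if it is singular."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


# Made-up stand-ins for the CCA matrices (not real output of the tool)
A = [[F(2), F(0)], [F(1), F(1)]]   # language 1 space -> shared space
B = [[F(1), F(1)], [F(0), F(2)]]   # language 2 space -> shared space

projection = matmul(A, inverse_2x2(B))   # language 1 space -> language 2 space

x = [[F(3), F(4)]]                       # a language 1 vector (row vector)
shared = matmul(x, A)                    # x in the shared space
x_in_lang2 = matmul(x, projection)       # x in the language 2 space

# Sending the result through B lands on the same shared-space point as xA.
assert matmul(x_in_lang2, B) == shared
```

In other words, AB^-1 first moves a language 1 vector into the shared space and then undoes the language 2 mapping, which is only possible when B is square and invertible; this is consistent with the requirement that both embedding sizes match.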

Lang1WordEmbeddingsProjectedToLang2Space consists of word embeddings for language 1 words (as read from Lang1VectorFile), projected to the vector space in which language 2 vectors live.

Reference

@InProceedings{faruqui-dyer:2014:EACL,
  author    = {Faruqui, Manaal  and  Dyer, Chris},
  title     = {Improving Vector Space Word Representations Using Multilingual Correlation},
  booktitle = {Proceedings of EACL},
  year      = {2014}
}

crosslingual-cca's People

Contributors

dhgarrette, mfaruqui, wammar


crosslingual-cca's Issues

Sample Output

Is it possible to include example intermediate results / final results? I'm interested in re-implementing the project_vectors.m in python, but I don't have MATLAB so it'd be hard to validate my implementation.

Trailing spaces

The code doesn't work properly when there are trailing white spaces at the end of every line in the word vector file.
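One possible workaround, sketched here under the assumption that the vector files are UTF-8 text, is to strip trailing whitespace from every line before running the scripts. The function names are hypothetical and this code is not part of the tool.

```python
import io


def strip_trailing_whitespace(lines):
    """Yield each line with trailing whitespace removed and one newline added."""
    for line in lines:
        yield line.rstrip() + "\n"


def clean_file(in_path, out_path):
    """Write a cleaned copy of a vector file (assumes UTF-8 encoding)."""
    with io.open(in_path, encoding="utf-8") as src:
        with io.open(out_path, "w", encoding="utf-8") as dst:
            for line in strip_trailing_whitespace(src):
                dst.write(line)


# Illustrative lines with trailing spaces and a tab
cleaned = list(strip_trailing_whitespace(["the -1.0 2.4 -0.3   \n",
                                          "haus 0.5 1.0 \t\n"]))
```

Calling `clean_file` on each vector file and passing the cleaned copies to the scripts leaves the original files untouched.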
