Geometry-aware Multilingual Embedding

Code for learning multilingual embeddings using the method reported in:

Pratik Jawanpuria, Arjun Balgovind, Anoop Kunchukuttan, Bamdev Mishra. Learning Multilingual Word Embeddings in Latent Metric Space: A Geometric Approach. Transactions of the Association for Computational Linguistics (TACL), Volume 7, pp. 107-120, 2019.

Environment Setup

Do the following steps in order:

  1. Clone the repository

  2. Create a Python virtual environment without TensorFlow (if TensorFlow is present, Pymanopt gives out-of-memory errors).

  3. pip install numpy scipy ipdb

  4. pip install git+https://github.com/pymanopt/pymanopt.git --upgrade

  5. In the Pymanopt code (located at C:\Anaconda\envs\ENVRNMT_NAME\Lib\site-packages\pymanopt\tools\autodiff on Windows, or the Linux equivalent), add the parameter allow_input_downcast=True to the calls to theano.function at lines 46, 49, 101, and 104.

  6. conda install theano pygpu

  7. In Users\USER_NAME, create a file named .theanorc.txt with the following content:

     [global]
     device = cuda
     floatX = float32
    
  8. Install cupy based on your CUDA version

  9. Two GPUs are needed
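The configuration file from step 7 can also be generated from Python. The following is a minimal sketch, not part of the repository: the helper name write_theanorc is hypothetical, and the target path is whatever location your Theano installation reads its configuration from (Users\USER_NAME\.theanorc.txt in the steps above).

```python
import configparser
from pathlib import Path


def write_theanorc(path: Path, device: str = "cuda", floatx: str = "float32") -> Path:
    """Write a minimal Theano config with a [global] section to `path`.

    Hypothetical helper for illustration; it only reproduces the three
    lines shown in step 7.
    """
    config = configparser.ConfigParser()
    # Keep the case of option names, so that floatX is not lowercased.
    config.optionxform = str
    config["global"] = {"device": device, "floatX": floatx}
    with open(path, "w") as handle:
        config.write(handle)
    return path
```

For example, write_theanorc(Path.home() / ".theanorc.txt") would create the file in the current user's home directory.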

Note: When using this setup with Pymanopt, import CuPy before importing Theano. Theano sometimes fails with an error saying it cannot find the correct CUDA version, and importing CuPy first fixes the issue.
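One way to make the import order explicit is to load the modules through a small helper at the top of the entry script. This is only a sketch under assumptions: the helper import_in_order is hypothetical and not part of the repository, and it records a missing module as None instead of raising.

```python
import importlib


def import_in_order(module_names):
    """Import modules one at a time, in the order given.

    Returns a dict mapping each name to the imported module, or to None
    if the module is not installed. Hypothetical helper for illustration.
    """
    loaded = {}
    for name in module_names:
        try:
            loaded[name] = importlib.import_module(name)
        except ImportError:
            loaded[name] = None
    return loaded
```

Calling import_in_order(["cupy", "theano"]) imports CuPy first and Theano second, which is the order the note recommends. In a script, two plain import statements in that order have the same effect.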

Datasets

The datasets can be downloaded by running the following commands in vecmap_data/ and muse_data/, respectively:

./get_vecmap_data.sh
./get_muse_data.sh

Reproducing Results

The results that have been reported in Learning Multilingual Word Embeddings in Latent Metric Space: A Geometric Approach can be reproduced by running the following scripts:

  • Results of the GeoMM algorithm reported in Tables 1, 2, and 6:

      ./geomm_results.sh
    
  • Results of the GeoMM-Multi algorithm reported in Tables 1, 2, and 6:

      ./geomm_multi_results.sh
    
  • Results of the GeoMM-Semi algorithm reported in Table 7:

      ./geomm_semi_results.sh
    

Note: Because our code uses CUDA and FP32 precision, it may not be possible to reproduce our results exactly, owing to minor numerical variations in GPU operations. The effect on the final results is negligible: the variations we have observed usually lie within an error margin of 0.1 or 0.2.
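As a general illustration of why FP32 results can depend on the order of operations (this is a property of floating-point arithmetic, not a measurement of this repository's code), the sketch below emulates single-precision addition with the standard library and adds the same three numbers in two different orders. The helper names to_f32 and add_f32 are hypothetical.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def add_f32(x: float, y: float) -> float:
    """Add two values and round the sum to FP32, emulating single precision."""
    return to_f32(to_f32(x) + to_f32(y))


a, b, c = 1e8, -1e8, 1.0
left_to_right = add_f32(add_f32(a, b), c)  # (a + b) + c
right_to_left = add_f32(a, add_f32(b, c))  # a + (b + c)
print(left_to_right, right_to_left)  # prints: 1.0 0.0
```

The second grouping loses the 1.0 because FP32 values near 1e8 are spaced 8 apart. Parallel reductions on a GPU may group additions differently from run to run, which is one common source of small discrepancies.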

Note: We have added geomm_optimized.py, which can replace geomm.py in all use cases. It reduces the time taken for the en-es pair from 188.5 seconds to 6.5 seconds.

GeoMM Embeddings

We provide GeoMM bilingual and multilingual embeddings. These are normalized embeddings in the latent space. The embeddings are made available under the following license: Creative Commons Attribution-NonCommercial 4.0 International License.

MUSE Dataset

These embeddings have been trained jointly using en-XX MUSE bilingual dictionaries and Wikipedia FastText embeddings.

de en es fr ru zh

VecMap Dataset

These embeddings have been trained jointly using en-XX bilingual dictionaries and embeddings from the VecMap dataset.

de en es fi it

English-Indian language bilingual embeddings

These bilingual embeddings have been trained using the CommonCrawl+Wikipedia FastText Embeddings and the MUSE bilingual dictionaries.

en-hi en-bn en-ta
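Because the embeddings share a latent space, a word in one language can be matched to its nearest neighbour in another by cosine similarity. The sketch below is a minimal illustration, not the evaluation procedure of the paper. It assumes the files use the word2vec text format (a "count dim" header line followed by one "word v1 ... vd" line per word), which should be checked against the downloaded files. The helper names and the toy vectors are made up for the example.

```python
import io
import math


def read_embeddings(handle):
    """Parse word2vec-style text (assumed format): header 'count dim', then one word per line."""
    count, dim = map(int, handle.readline().split())
    vectors = {}
    for _ in range(count):
        parts = handle.readline().rstrip().split(" ")
        vectors[parts[0]] = [float(v) for v in parts[1:dim + 1]]
    return vectors


def normalize(vec):
    """Scale a vector to unit length (a no-op for already normalized embeddings)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def nearest(query, candidates):
    """Return the candidate word whose vector has the highest cosine similarity to `query`."""
    q = normalize(query)

    def score(word):
        return sum(x * y for x, y in zip(q, normalize(candidates[word])))

    return max(candidates, key=score)


# Toy two-dimensional data standing in for real embedding files.
en = read_embeddings(io.StringIO("2 2\ncat 1.0 0.0\ndog 0.0 1.0\n"))
es = read_embeddings(io.StringIO("2 2\ngato 0.9 0.1\nperro 0.2 0.8\n"))
print(nearest(en["cat"], es))  # prints: gato
```

With real files, the io.StringIO objects would be replaced by open file handles. For full vocabularies a vectorized library such as NumPy would be far faster than this pure-Python loop.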

Acknowledgements

The data-processing part of our code was taken from Mikel Artetxe's VecMap repository.

References

Please cite Learning Multilingual Word Embeddings in Latent Metric Space: A Geometric Approach if you found the resources in this repository useful.

@article{jawanpuria2018learning,
  title={Learning multilingual word embeddings in latent metric space: a geometric approach},
  author={Jawanpuria, Pratik and Balgovind, Arjun and Kunchukuttan, Anoop and Mishra, Bamdev},
  journal={Transactions of the Association for Computational Linguistics (TACL)},
  volume={7},
  pages={107--120},
  year={2019}
}
