
monoses's Introduction

Monoses

This is an open source implementation of our unsupervised machine translation system, described in the papers listed in the Publications section below.

In addition, it also includes tools to induce bilingual lexica through unsupervised machine translation, as described in our bilingual lexicon induction paper (see Publications below).

If you use this software for academic research, please cite the relevant paper(s).

Requirements

  • Python 3 with PyTorch (tested with v0.4) and editdistance, available from your PATH
  • Java
  • Moses v4.0, compiled under third-party/moses/
  • FastAlign, compiled under third-party/fast_align/build/
  • Phrase2vec, compiled under third-party/phrase2vec/
  • VecMap, available under third-party/vecmap/
  • Fairseq (tested with v0.6), available under third-party/fairseq/
  • Subword-NMT, available under third-party/subword-nmt/
  • SacreBLEU, available under third-party/sacrebleu/

A script is provided to download all the dependencies under third-party/:

./get-third-party.sh

Note, however, that the script only downloads their source code, and you will need to compile Moses (including contrib/sigtest-filter and moses2), FastAlign and Phrase2vec yourself, and install Fairseq's dependencies. Please refer to the original documentation of each tool for detailed instructions on how to accomplish this.
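The exact build steps depend on your environment, but as a rough sketch (assuming standard build tools such as CMake, a C++ compiler and Boost are already installed; defer to each project's documentation for the authoritative instructions), the compilation typically looks like this:

(cd third-party/fast_align && mkdir -p build && cd build && cmake .. && make)
(cd third-party/phrase2vec && make)
(cd third-party/moses && ./bjam -j4)

Building moses2 and contrib/sigtest-filter within the Moses tree may require additional targets or flags, and one way to install Fairseq's dependencies is pip install --editable third-party/fairseq; again, check the respective documentation.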

In addition, you will also need to compile the tuning module in Java (which is based on Z-MERT) as follows:

cd training/tuning/zmert
make

Usage

The following command trains an unsupervised machine translation system from monolingual corpora using the exact same settings described in our most recent paper:

python3 train.py --src SRC.MONO.TXT --src-lang SRC \
                 --trg TRG.MONO.TXT --trg-lang TRG \
                 --working MODEL-DIR

The parameters in the above command should be provided as follows:

  • SRC.MONO.TXT and TRG.MONO.TXT are the source and target language monolingual corpora. You should just provide the raw text, and the training script will take care of all the necessary preprocessing (tokenization, deduplication, etc.).
  • SRC and TRG are the source and target language codes (e.g. 'en', 'fr', 'de'). These are used for language-specific corpus preprocessing using standard Moses tools.
  • MODEL-DIR is the directory in which to save the output model; a concrete example invocation is shown below.
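
For instance, to train an English-French system from two plain-text monolingual corpora (the file and directory names below are placeholders for illustration only), the invocation could look like this:

python3 train.py --src news.en.txt --src-lang en \
                 --trg news.fr.txt --trg-lang fr \
                 --working model-en-fr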

By default, training uses 4 GPUs (with IDs 0, 1, 2 and 3) and takes about one week on our server. Once training is done, you can use the resulting model for translation as follows:

python3 translate.py MODEL-DIR --src SRC --trg TRG < INPUT.TXT > OUTPUT.TXT

In addition, you can also evaluate the model in the same settings as in our paper using the evaluate.py script.

For more details and additional options, run the above scripts with the --help flag.

Bilingual Lexicon Induction

The following command induces a bilingual dictionary starting from a set of cross-lingual word embeddings using the exact same settings described in our paper:

python3 bli/induce-dictionary.py --embeddings SRC.EMB TRG.EMB \
                                 --corpus SRC.TOK.TXT TRG.TOK.TXT \
                                 --working OUTPUT-DIR

The parameters in the above command should be provided as follows:

  • SRC.EMB and TRG.EMB are the input cross-lingual word embeddings. In our paper, these were obtained by training monolingual fastText embeddings and mapping them using the unsupervised mode in VecMap (see the sketch after this list).
  • SRC.TOK.TXT and TRG.TOK.TXT are the source and target language (monolingual) corpora used to train the embeddings above. You should provide the exact same preprocessed version used to train the embeddings.
  • OUTPUT-DIR is the output directory in which to save the induced dictionaries (src2trg.dic and trg2src.dic) as well as the underlying machine translation model.
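
As a concrete sketch with placeholder file names, one could train monolingual fastText embeddings, map them with VecMap's unsupervised mode, and then induce the dictionaries as follows (the fastText and VecMap invocations follow those tools' own interfaces; adjust paths and hyperparameters to your setup):

fasttext skipgram -input corpus.en.tok -output emb.en
fasttext skipgram -input corpus.fr.tok -output emb.fr
python3 third-party/vecmap/map_embeddings.py --unsupervised emb.en.vec emb.fr.vec emb.en.mapped emb.fr.mapped
python3 bli/induce-dictionary.py --embeddings emb.en.mapped emb.fr.mapped \
                                 --corpus corpus.en.tok corpus.fr.tok \
                                 --working output-en-fr

The induced dictionaries would then be found at output-en-fr/src2trg.dic and output-en-fr/trg2src.dic.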

Publications

If you use this software for academic research, please cite the relevant paper(s) as follows (in case of doubt, please cite artetxe2019acl-umt, and/or artetxe2019acl-bli if you use the bilingual lexicon induction code):

@inproceedings{artetxe2019acl-umt,
  author    = {Artetxe, Mikel  and  Labaka, Gorka  and  Agirre, Eneko},
  title     = {An Effective Approach to Unsupervised Machine Translation},
  booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
  month     = {July},
  year      = {2019},
  address   = {Florence, Italy},
  publisher = {Association for Computational Linguistics},
  pages     = {194--203},
  url       = {https://www.aclweb.org/anthology/P19-1019}
}

@inproceedings{artetxe2018emnlp,
  author    = {Artetxe, Mikel  and  Labaka, Gorka  and  Agirre, Eneko},
  title     = {Unsupervised Statistical Machine Translation},
  booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
  month     = {November},
  year      = {2018},
  address   = {Brussels, Belgium},
  publisher = {Association for Computational Linguistics},
  pages     = {3632--3642},
  url       = {https://www.aclweb.org/anthology/D18-1399}
}

@inproceedings{artetxe2019acl-bli,
  author    = {Artetxe, Mikel  and  Labaka, Gorka  and  Agirre, Eneko},
  title     = {Bilingual Lexicon Induction through Unsupervised Machine Translation},
  booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
  month     = {July},
  year      = {2019},
  address   = {Florence, Italy},
  publisher = {Association for Computational Linguistics},
  pages     = {5002--5007},
  url       = {https://www.aclweb.org/anthology/P19-1494}
}

License

Copyright (C) 2018-2020, Mikel Artetxe

Licensed under the terms of the GNU General Public License, either version 3 or (at your option) any later version. A full copy of the license can be found in LICENSE.txt.

The tuning module under training/tuning/zmert/ is based on Z-MERT.


monoses's Issues

Segmentation fault: STEP 7 unsupervised_tuning

Hi, thank you for your great work. I ran this code and encountered a segmentation fault during STEP 7.
More specifically, the error occurs at this line of code:

bash(quote(MOSES + '/scripts/training/mert-moses.pl') +

Part of the log file is attached as log.txt.
Could you give some advice about this problem? Thank you.

Step 8: Probing Table:: Bucket full error

Hey everyone,
I am trying to reproduce your results but cannot get past this point. I am using the WMT English-Polish data and am facing an error in step 8.
Attaching a screenshot of my error (IMG-20200807-WA0003).

Thank you in advance.

Using Phrase2Vec for Bilingual Lexicon Induction

Hi, I'm interested in your paper Bilingual Lexicon Induction through Unsupervised Machine Translation. I understand that the paper is based on the code in this repository, but I think there might be a minor difference. In the paper, frequent n-grams are assigned the centroid of the words they contain, whereas the SMT paper uses phrase2Vec, correct? Is there a difference in performance if phrase2Vec is used for BLI?

Ignoring tokenizer

I got this warning for both Kazakh and Turkish with monoses: WARNING: No known abbreviations for language 'kz', attempting fall-back to English version... This is because Moses does not support Kazakh and Turkish. I do not know how to skip the tokenizer here; could you point me in the right direction?

Divide by zero error

Hi, I got this error while running train.py:

monoses/third-party/vecmap/map_embeddings.py:301: RuntimeWarning: divide by zero encountered in true_divide

It seems to be a divide-by-zero error during the mapping step. Any idea why this happens?

Thank you!

Can it run on GPU?

Hello, I truly appreciate this work. I am using a corpus with 1 million (10 lakh) sentences. In induce-phrase-table.py there is a default option to use CUDA, but when I check GPU usage it is at 0%. Could you please clarify what induce-phrase-table.py actually does, and how I can increase training speed?

Thanks a lot.

Step6_zero_phrase_filtering problem

While training monoses I got an error in Step 7, which is:

Traceback (most recent call last):
File "/home/xyz/monoses/training/tuning/tune.py", line 335, in
main()
File "/home/xyz/monoses/training/tuning/tune.py", line 322, in main
extract_zmert_params(tmp + '/dcfg.txt.ZMERT.final'))
File "/home/xyz/monoses/training/tuning/tune.py", line 73, in extract_zmert_params
with open(path, encoding='utf-8', errors='surrogateescape') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpv1m8y_i1/dcfg.txt.ZMERT.final'
clean-corpus.perl: processing /home/xyz/models/monoses/src-tgt/tmpzbtcque6/train.bt & .trg to /home/xyz/models/monoses/src-tgt/tmpzbtcque6/train-supervised/clean, cutoff 3-80, ratio 9

From the log file and intermediate results, I found that:

  • It successfully generated phrase tables.
  • However, in step 6 it filtered 0% of the phrase pairs, which seems suspicious.

P(f|e) filter limit: 100
Filtering using P(e|f) only. n=100

..................................................[n:500000]
..................................................[n:1000000]
..................................................[n:1500000]
..................................................[n:2000000]
..................................................[n:2500000]
..................................................[n:3000000]
..................................................[n:3500000]
..................................................[n:4000000]
..................................................[n:4500000]
..................................................[n:5000000]
..................................................[n:5500000]
..................................................[n:6000000]
..................................................[n:6500000]
..................................................[n:7000000]
..................................................[n:7500000]
..................................................[n:8000000]
..................................................[n:8500000]
..................................................[n:9000000]
..................................................[n:9500000]
..................................................[n:10000000]

unfiltered phrases pairs: 10000000

 P(f|e) filter [first]: 0   (0%)
   significance filter: 0   (0%)
        TOTAL FILTERED: 0   (0%)

FILTERED phrase pairs: 10000000   (100%)
  • Then, in Step 7, while running the decoder, it printed:

Call to decoder returned 1; was expecting 0.
Z-MERT exiting prematurely (MertCore returned 30)...

Step 7 unsupervised_tuning problem with moses2

Your system is very good and I am learning to use it, but now I have a problem. In step 7 (unsupervised_tuning), I cannot find the moses2 binary in the bin directory. Is the moses2 referenced in train.py the same as bin/moses in the updated Moses system? If not, could you send me a moses2 binary?

Training too slow : Only a few cores used during STEP 4: Map Embeddings

@artetxem Hi Mikel, great work. I am having a problem where, in STEP 4: Map Embeddings, not all the cores are being used. Instead, fewer than half of the available cores are used, despite setting the number of threads to twice the number of available cores. This is when I train on a small corpus with an 8-core CPU.

If I use a system with 96 cores, even the earlier steps (1, 2 and 3) use less than 20% of the available cores.

Do you have any pointers for the same?

Thanks

Features are 0: Step 7 unsupervised_tuning

@artetxem Great work on the project. I am trying to reproduce your results but can't get past step 7.
I am using WMT14 news.2010 French and English text. After the first iteration of step 7, the features in the features list are all 0, which throws the error Illegal division by zero at moses/scripts/training/mert-moses.pl line 1290.

I have tried other datasets but still get the same error.

Step5 induce_phrase_table problem

Thank you for sharing this. I trained up to the fifth step and ran into a problem:

Traceback (most recent call last):
File "monoses/training/induce-phrase-table.py", line 153, in
main()
File "monoses/training/induce-phrase-table.py", line 140, in main
.format((epoch + j/n)/args.epochs, t.detach().cpu().numpy(), loss.detach().cpu().numpy()),
TypeError: unsupported format string passed to numpy.ndarray.format
sort: cannot read: MODEL-DIR/tmpvkx1vuhd/src2trg.phrase-table: No such file or directory

FileNotFoundError

Hi,
Sorry to trouble you, but I had this error:

ERROR: compile contrib/sigtest-filter at /home/zhang/data-short/monoses/third-party/moses/scripts/generic/binarize4moses2.perl line 34.
ERROR: compile contrib/sigtest-filter at /home/zhang/data-short/monoses/third-party/moses/scripts/generic/binarize4moses2.perl line 34.
Using SCRIPTS_ROOTDIR: /home/zhang/data-short/monoses/third-party/moses/scripts
Not executable: /home/zhang/data-short/monoses/third-party/moses/bin/moses2 at /home/zhang/data-short/monoses/third-party/moses/scripts/training/mert-moses.pl line 466.
Traceback (most recent call last):
File "/home/zhang/.pyenv/versions/3.7.2/lib/python3.7/shutil.py", line 563, in move
os.rename(src, real_dst)
FileNotFoundError: [Errno 2] No such file or directory: 'cj-word-smt/tmpou08sq8q/mert/moses.ini' -> 'cj-word-smt/step7/src2trg.it1.moses.ini'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "train.py", line 484, in
main()
File "train.py", line 478, in main
unsupervised_tuning(args)
File "train.py", line 366, in unsupervised_tuning
shutil.move(args.tmp + '/mert/moses.ini', config[(src, trg)])
File "/home/zhang/.pyenv/versions/3.7.2/lib/python3.7/shutil.py", line 577, in move
copy_function(src, real_dst)
File "/home/zhang/.pyenv/versions/3.7.2/lib/python3.7/shutil.py", line 263, in copy2
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/home/zhang/.pyenv/versions/3.7.2/lib/python3.7/shutil.py", line 120, in copyfile
with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: 'cj-word-smt/tmpou08sq8q/mert/moses.ini'

Is this because moses2 was not compiled?

the meaning of Phrase table

Hi, thanks for your great work. I noticed that, in the Induce phrase-table step, the first score is the direct phrase translation probability and the second is the inverse phrase translation probability. However, in Moses's phrase table, the first score is the inverse phrase translation probability and the third is the direct phrase translation probability.
Unfortunately, I am not familiar with statistical machine translation. I look forward to your reply.
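
For reference (with made-up words and scores, shown only to illustrate the column order), a line in a standard Moses phrase table has the form below, where the four scores are, in order, the inverse phrase translation probability P(f|e), the inverse lexical weighting lex(f|e), the direct phrase translation probability P(e|f), and the direct lexical weighting lex(e|f):

das Haus ||| the house ||| 0.8 0.5 0.6 0.4 ||| 0-0 1-1 ||| 2 2 2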

pre-trained models

Hello @artetxem thank you for this work! 💯 🥇
Do you have any pre-trained model (dummy or not) in order to check the decoding before starting a new training from scratch?
Thanks a lot!

compiling Phrase2vec

Hello,

I did not know how to compile Phrase2vec. You pointed to this site, https://github.com/artetxem/phrase2vec, but there are no instructions on how to install it, so I got these errors:

bash: /media/adminonur/bee/monoses/third-party/phrase2vec/word2vec: No such file or directory
head: cannot open 'MODEL-DIR/tmpk7yaqfp0/vocab-full.txt' for reading: No such file or directory
bash: /media/adminonur/bee/monoses/third-party/phrase2vec/word2vec: No such file or directory
bash: /media/adminonur/bee/monoses/third-party/phrase2vec/word2vec: No such file or directory
head: cannot open 'MODEL-DIR/tmpk7yaqfp0/vocab-full.txt' for reading: No such file or directory
bash: /media/adminonur/bee/monoses/third-party/phrase2vec/word2vec: No such file or directory

Could you point me in the right direction?

Compilation bug with moses2 (and my solution)

Hi, thanks for publishing this code -- just wanted to point out that I had issues with compiling moses2. These were strange compilation problems relating to HypothesisColl.h and an unordered set and whether or not to import it from boost or std. It turns out that since the version 4 release of Moses (which is used in this repo), they have rolled back changes in this file relating to this very issue.

I finally fixed the problem by checking out the latest version of moses. (As of writing, that commit hash was: 187a75cb5596c8e4362c66c62de395e2b7d3a64a)
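
A rough sketch of that workaround, assuming third-party/moses is a git checkout whose remote carries the commit mentioned above:

(cd third-party/moses && git fetch origin && git checkout 187a75cb5596c8e4362c66c62de395e2b7d3a64a)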

no evaluate.py script

In the readme it says, "In addition, you can also evaluate the model in the same settings as in our paper using the evaluate.py script." -- note that there is no evaluate.py script in this repo.
