In this project, we develop new deep learning models for bootstrapping language understanding models for languages with no labeled data, using labeled data from other languages.

License: MIT License



Zero-Resource Multilingual Model Transfer

This repo contains the source code for our ACL 2019 paper:

Multi-Source Cross-Lingual Model Transfer: Learning What to Share
Xilun Chen, Ahmed Hassan Awadallah, Hany Hassan, Wei Wang, Claire Cardie
The 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019)
paper, arXiv, bibtex

Introduction

Modern NLP applications have enjoyed a great boost utilizing neural network models. Such deep neural models, however, are not applicable to most human languages due to the lack of annotated training data for various NLP tasks. Cross-lingual transfer learning (CLTL) is a viable method for building NLP models for a low-resource target language by leveraging labeled data from other (source) languages. In this work, we focus on the multilingual transfer setting, where training data in multiple source languages is leveraged to further boost target-language performance.

Unlike most existing methods that rely only on language-invariant features for CLTL, our approach coherently utilizes both language-invariant and language-specific features at the instance level. Our model leverages adversarial networks to learn language-invariant features, and mixture-of-experts models to dynamically exploit the similarity between the target language and each individual source language. This enables our model to learn effectively what to share between various languages in the multilingual setup. Moreover, when coupled with unsupervised multilingual embeddings, our model can operate in a zero-resource setting where neither target-language training data nor cross-lingual resources (e.g. parallel corpora or machine translation systems) are available. Our model achieves significant performance gains over prior art, as shown in an extensive set of experiments over multiple text classification and sequence tagging tasks, including a large-scale industry dataset.
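To make the mixture-of-experts idea concrete, here is a rough PyTorch sketch (an illustration only, with made-up names and shapes; the actual implementation lives in models.py). There is one expert per source language, and a per-instance gate softly weights the experts, so each target example borrows most from the source languages it resembles:

import torch
import torch.nn as nn

class MoESketch(nn.Module):
    # One expert per source language; a gate mixes their outputs per instance.
    def __init__(self, num_source_langs, hidden_dim):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_source_langs)])
        self.gate = nn.Linear(hidden_dim, num_source_langs)

    def forward(self, x):
        # x: (batch, hidden_dim) features for target-language instances
        weights = torch.softmax(self.gate(x), dim=-1)               # (batch, n_experts)
        outputs = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, n_experts, hidden)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)         # per-instance mixture

The adversarially trained shared feature extractor provides the language-invariant half of the representation; a gate like the one above supplies the language-specific half.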

Requirements

  • Python 3.6
  • PyTorch 0.4
  • PyTorchNet (for confusion matrix)
  • tqdm (for progress bar)
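For reference, one possible environment setup (a hedged sketch: the pinned torch version and the torchnet package name for PyTorchNet are assumptions to verify, not the authors' instructions):

pip install torch==0.4.1 tqdm torchnet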

File Structure

.
├── LICENSE
├── README.md
├── conlleval.pl                            (official CoNLL evaluation script)
├── data_prep                               (data processing scripts)
│   ├── bio_dataset.py                      (processing the CoNLL dataset)
│   └── multi_lingual_amazon.py             (processing the Amazon Review dataset)
├── data_processing_scripts                 (auxiliary scripts for dataset pre-processing)
│   └── amazon
│       ├── pickle_dataset.py
│       └── process_dataset.py
├── layers.py                               (lower-level helper modules)
├── models.py                               (higher-level modules)
├── options.py                              (hyper-parameters, a.k.a. all the knobs you may want to turn)
├── scripts                                 (scripts for training and evaluating the models)
│   ├── get_overall_perf_amazon.py          (evaluation script for Amazon Reviews)
│   ├── get_overall_perf_ner.py             (evaluation script for CoNLL NER)
│   ├── train_amazon_3to1.sh                (training script for Amazon Reviews)
│   └── train_conll_ner_3to1.sh             (training script for CoNLL NER)
├── train_cls_man_moe.py                    (training code for text classification)
├── train_tagging_man_moe.py                (training code for sequence tagging)
├── utils.py                                (helper functions)
└── vocab.py                                (building the vocabulary)

Dataset

The CoNLL 2002/2003 and Amazon Reviews datasets, as well as the multilingual word embeddings (MUSE, VecMap, UMWE), are all publicly available online.

Run Experiments

CoNLL Named Entity Recognition

./scripts/train_conll_ner_3to1.sh {exp_name}

The following script prints compiled dev/test F1 scores for all languages:

python scripts/get_overall_perf_ner.py save {exp_name}
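Here {exp_name} is an arbitrary name for the run; judging from the evaluation command above, results appear to be stored under save/{exp_name} (an assumption, as the README does not spell this out). For example, with the hypothetical run name conll_ner_run1:

./scripts/train_conll_ner_3to1.sh conll_ner_run1
python scripts/get_overall_perf_ner.py save conll_ner_run1

The Amazon Reviews scripts below take the same kind of argument.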

Multilingual Amazon Reviews

./scripts/train_amazon_3to1.sh {exp_name}

The following script prints compiled dev/test F1 scores for all languages:

python scripts/get_overall_perf_amazon.py save {exp_name}

Citation

If you find this project useful for your research, please kindly cite our ACL 2019 paper:

@InProceedings{chen-etal-acl2019-multi-source,
    author = {Chen, Xilun and Hassan Awadallah, Ahmed and Hassan, Hany and Wang, Wei and Cardie, Claire},
    title = {Multi-Source Cross-Lingual Model Transfer: Learning What to Share},
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
}

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.


multilingual-model-transfer's Issues

not enough memory

Merry Christmas!

At first I ran the repo on a 1080 Ti with 11 GB of memory and got an out-of-memory error. Then I switched to a Titan RTX with 24 GB of memory, but the situation was the same. After that I tried running on the CPU, and there was not enough memory either.

RuntimeError: $ Torch: not enough memory: you tried to allocate 38GB. Buy new RAM! at /opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/TH/THGeneral.cpp:201

Could you please tell me how much memory this repo needs, and how to reduce its memory consumption, e.g. by changing the batch size or other settings? Thanks.

read_bio_samples latin characters

I'm trying to replicate your code, but I get an error when I load the CoNLL dataset for either ESP or NED: 'utf-8' codec can't decode byte .... in position .....

I can solve this by specifying the encoding in open(), but I am curious whether you did any preprocessing of the CoNLL files such that you don't get the same error. I took the CoNLL 2002 data from the official website https://www.clips.uantwerpen.be/conll2002/ner/

Thank you!
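For reference, a minimal sketch of the workaround described above (the function name and structure here are hypothetical, not the actual reader in data_prep/bio_dataset.py). The CoNLL 2002 files are Latin-1 encoded, so passing an explicit encoding avoids the UnicodeDecodeError:

def read_bio_lines(path):
    # The CoNLL 2002 files (esp.*, ned.*) are Latin-1, not UTF-8, which is
    # why the default decoding fails on accented characters.
    with open(path, encoding='latin-1') as f:
        return [line.rstrip('\n') for line in f]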

about the multilingual word embeddings

Hello, thank you for the source code. I am very interested in your paper and code, but I am still a newbie. I have some questions about downloading the multilingual word embeddings. I have spent a long time trying to download the three sets of multilingual word embeddings (MUSE, VecMap, UMWE), but still without success. If it is convenient for you, could you give me some guidance on how to download them? Grateful!

CUDNN_STATUS_SUCCESS

I managed to collect and construct the data to run the repo. Now I am facing the following issue when I execute ./scripts/train_conll_ner_3to1.sh CoNLLE1:

INFO:__main__:Done Loading Datasets.
Traceback (most recent call last):
  File "train_tagging_man_moe.py", line 512, in <module>
    main()
  File "train_tagging_man_moe.py", line 502, in main
    cv = train(vocabs, char_vocab, tag_vocab, train_sets, dev_sets, test_sets, unlabeled_sets)
  File "train_tagging_man_moe.py", line 148, in train
    F_s, C, D = F_s.to(opt.device) if F_s else None, C.to(opt.device), D.to(opt.device) if D else None
  File "/home/xxx/miniconda2/envs/py36msmmt/lib/python3.6/site-packages/torch/nn/modules/module.py", line 379, in to
    return self._apply(convert)
  File "/home/xxx/miniconda2/envs/py36msmmt/lib/python3.6/site-packages/torch/nn/modules/module.py", line 185, in _apply
    module._apply(fn)
  File "/home/xxx/miniconda2/envs/py36msmmt/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 112, in _apply
    self.flatten_parameters()
  File "/home/xxx/miniconda2/envs/py36msmmt/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 105, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: CuDNN error: CUDNN_STATUS_SUCCESS

Which versions of CUDA and cuDNN do you use? And how did you install PyTorchNet?

Thanks.

Word and character embeddings are not being fixed

We have the options --no_fix_emb and --no_fix_charemb for not fixing the word and character embeddings, respectively. But the argument parser is missing the "dest" option, due to which the embeddings are always fixed even when these options are specified.

Listed below are the corresponding lines in the options file.

parser.add_argument('--fix_charemb', action='store_true', default=True)
parser.add_argument('--no_fix_charemb', action='store_false')

parser.add_argument('--fix_emb', action='store_true', default=True)
parser.add_argument('--no_fix_emb', action='store_false')
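A minimal sketch of a possible fix (our assumption, not a maintainer-confirmed patch): give each negative flag the same dest as its positive counterpart, so that passing --no_fix_emb or --no_fix_charemb actually overrides the default:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--fix_charemb', action='store_true', default=True)
parser.add_argument('--no_fix_charemb', dest='fix_charemb', action='store_false')
parser.add_argument('--fix_emb', action='store_true', default=True)
parser.add_argument('--no_fix_emb', dest='fix_emb', action='store_false')

args = parser.parse_args(['--no_fix_emb'])
assert args.fix_emb is False  # without dest=, this would stay True

Without dest=, --no_fix_charemb writes to args.no_fix_charemb and leaves args.fix_charemb at its default of True, which matches the behavior described above.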

what does '{exp_name}' mean

Hello, the following has been bothering me:
'./scripts/train_conll_ner_3to1.sh {exp_name}'

What is a specific example of '{exp_name}'?
