ysymyth / ec-nl

[ICLR 2022] Linking Emergent and Natural Languages via Corpus Transfer

License: MIT License

Python 97.80% Shell 2.20%
emergent-communication natural-language-processing language-model machine-learning iclr iclr2022

ec-nl's Introduction

EC-NL

Code and data for the paper Linking Emergent and Natural Languages via Corpus Transfer at ICLR 2022 (spotlight).

@inproceedings{yao2022linking,
  title = {Linking Emergent and Natural Languages via Corpus Transfer},
  author = {Yao, Shunyu and Yu, Mo and Zhang, Yang and Narasimhan, Karthik and Tenenbaum, Joshua and Gan, Chuang},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year = {2022},
  html = {https://openreview.net/pdf?id=49A1Y6tRhaq},
}

Dependencies

  • PyTorch 1.8
  • SciPy 1.4
  • Transformers 4.4.2
  • (Optional) Wandb

Data

Google Drive includes

  • image_features: Image features of coco-2014 (coco.pt) and Conceptual Captions (cc.pt) datasets from a pre-trained ResNet, to be used in EC pre-training.

  • lm_corpora: Corpora used for language modeling transfer experiments.

Name            Usage       Comment
cc.pt           pre-train   Emergent language
paren-zipf.pt   pre-train   Regular language of nesting parentheses
wiki-es.pt      pre-train   Spanish (IE-Romance) Wikipedia
wiki-da.pt      fine-tune   Danish (IE-Germanic) Wikipedia
wiki-eu.pt      fine-tune   Basque (Basque) Wikipedia
wiki-ja.pt      fine-tune   Japanese (Japanese) Wikipedia
wiki-ro.pt      fine-tune   Romanian (IE-Romance) Wikipedia
wiki-fi.pt      fine-tune   Finnish (Uralic) Wikipedia
wiki-id.pt      fine-tune   Indonesian (Austronesian) Wikipedia
wiki-kk.pt      fine-tune   Kazakh (Turkic) Wikipedia
wiki-he.pt      fine-tune   Hebrew (Afro-Asiatic) Wikipedia
wiki-ur.pt      fine-tune   Urdu (IE-Indic) Wikipedia
wiki-fa.pt      fine-tune   Persian (IE-Iranian) Wikipedia

Experiments

Emergent Communication (EC) Game

This part generates an emergent language corpus for downstream tasks. Download image_features from Google Drive to ./ec-pretrain/data. To run the emergent communication training,

cd ec-game
python train.py

Some major options:

  • --dataset: use Conceptual Captions (cc) or MS-COCO (coco_2014) dataset.
  • --vocab_size: Vocabulary size (default 4035).
  • --seq_len: Sequence length limit (default 15).
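The options above could be declared with argparse roughly as follows. This is a minimal sketch, not the repository's actual train.py: the function name build_parser is hypothetical, and the default for --dataset is an assumption, since the README only states defaults for --vocab_size and --seq_len.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the documented options; train.py may differ.
    parser = argparse.ArgumentParser(description="EC game training (sketch)")
    parser.add_argument(
        "--dataset",
        choices=["cc", "coco_2014"],
        default="cc",  # assumption: the README does not state the default
        help="Conceptual Captions (cc) or MS-COCO (coco_2014) image features",
    )
    parser.add_argument("--vocab_size", type=int, default=4035,
                        help="vocabulary size")
    parser.add_argument("--seq_len", type=int, default=15,
                        help="sequence length limit")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args(["--dataset", "coco_2014", "--seq_len", "10"])
print(args.dataset, args.vocab_size, args.seq_len)
```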

Game training automatically saves EC agents (e.g. ./ckpt/cc_vocab_4035_seq_15_reset_-1_nlayers_1/run77926/model_90.6_1000_4035.pt) and emergent language corpora (e.g. ./ckpt/cc_vocab_4035_seq_15_reset_-1_nlayers_1/run77926/model_90.6_1000_4035.pt-cc.pt, which can be used in place of lm_corpora/cc.pt from Google Drive) at different training steps. In the example, 90.6_1000_4035 encodes the game accuracy, the number of game training steps, and the game vocabulary size, respectively.
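The file name convention in the example above can be decoded with a small helper when selecting checkpoints. This is a minimal sketch that assumes every saved file follows the model_<accuracy>_<steps>_<vocab>.pt pattern of the single example shown; the helper and class names are hypothetical and not part of the repository.

```python
import re
from pathlib import Path
from typing import NamedTuple


class CheckpointInfo(NamedTuple):
    accuracy: float   # game accuracy
    steps: int        # game training steps
    vocab_size: int   # game vocabulary size


# Matches e.g. "model_90.6_1000_4035.pt" and "model_90.6_1000_4035.pt-cc.pt".
_PATTERN = re.compile(
    r"^model_(?P<acc>\d+(?:\.\d+)?)_(?P<steps>\d+)_(?P<vocab>\d+)\.pt"
)


def parse_checkpoint_name(path):
    """Extract accuracy, steps, and vocabulary size from a checkpoint path."""
    name = Path(path).name
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized checkpoint name: {name}")
    return CheckpointInfo(
        accuracy=float(match["acc"]),
        steps=int(match["steps"]),
        vocab_size=int(match["vocab"]),
    )


info = parse_checkpoint_name(
    "./ckpt/cc_vocab_4035_seq_15_reset_-1_nlayers_1/run77926/model_90.6_1000_4035.pt"
)
print(info)
```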

Language Modeling Transfer

This part aims to reproduce Figure 2 of the paper. Download lm_corpora from Google Drive to ./ec-pretrain/data.

To run the pre-training,

export size=2 # 2,5,10,15,30
export pt_name="wiki-es" # "paren-zipf", "cc"
. pretrain.sh

To run the fine-tuning,

export size=2 # 2,5,10,15,30
export pt_name="wiki-es" # "paren-zipf", "cc"
export ft_name="wiki-ro"
export ckpt=3000
. finetune.sh

Meaning of variables above:

  • size: Size of the pre-training corpus in millions of tokens ([2, 5, 10, 15, 30]).
  • pt_name: Name of pre-training corpus (["wiki-es", "paren-zipf", "cc"]).
  • ft_name: Name of fine-tuning corpus (e.g. ["wiki-ro", "wiki-da"]).
  • ckpt: Which pre-training checkpoint to use for fine-tuning (default 3000).
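When launching many runs, the variables above can be validated and collected into an environment mapping before invoking the script. This is a minimal sketch: the function finetune_env is hypothetical, the allowed values are copied from the list above, and ft_name is left unchecked because the README does not give an exhaustive list.

```python
import os

PT_SIZES = (2, 5, 10, 15, 30)               # millions of tokens
PT_NAMES = ("wiki-es", "paren-zipf", "cc")  # pre-training corpora


def finetune_env(size, pt_name, ft_name, ckpt=3000):
    """Return a copy of the environment with the fine-tuning variables set."""
    if size not in PT_SIZES:
        raise ValueError(f"size must be one of {PT_SIZES}, got {size!r}")
    if pt_name not in PT_NAMES:
        raise ValueError(f"pt_name must be one of {PT_NAMES}, got {pt_name!r}")
    env = dict(os.environ)
    # Environment values must be strings.
    env.update(size=str(size), pt_name=pt_name, ft_name=ft_name, ckpt=str(ckpt))
    return env


env = finetune_env(2, "wiki-es", "wiki-ro")
print({key: env[key] for key in ("size", "pt_name", "ft_name", "ckpt")})
```

The resulting mapping could then be passed to something like subprocess.run(["bash", "finetune.sh"], env=env). Running the script with bash rather than sourcing it, as the commands above do, is an assumption about how finetune.sh behaves.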

Acknowledgements

The EC part of the code is based on ECNMT, which was partly based on Translagent.

The LM part of the code is based on Huggingface run_clm.py.

The datasets for our EC experiments include MS COCO and Conceptual Captions.

The datasets for our LM experiments derive from tilt-transfer.

Please cite these resources accordingly. For any questions, contact Shunyu.

ec-nl's Issues

Error loading coco files.

Hi,

I am trying to replicate your codebase. The README instructions say we should put the files downloaded from the image_features Google Drive folder into ./ec-pretrain/data. I created a directory called ec-pretrain/data and added the files there, but whenever I run train.py for the Emergent Communication Game, I receive the error below:

Entering Main
Traceback (most recent call last):
  File "train.py", line 73, in <module>
    data = torch.load(feat_path)
  File "/home/mogunleye/miniconda3/envs/ec-nl1/lib/python3.6/site-packages/torch/serialization.py", line 579, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/mogunleye/miniconda3/envs/ec-nl1/lib/python3.6/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/mogunleye/miniconda3/envs/ec-nl1/lib/python3.6/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: './data/coco.pt'

Then I thought maybe I should put the data in ./data/ instead of ./ec-pretrain/data but I still get the same error.

I was wondering if I have done something wrong?

Thanks and looking forward to hearing from you.
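The path in the traceback, ./data/coco.pt, is relative, so Python resolves it against the directory the script is launched from. A minimal sketch for checking where such a path actually points before loading it; the helper resolve_feature_path is hypothetical and not part of the repository.

```python
from pathlib import Path


def resolve_feature_path(rel_path="./data/coco.pt"):
    """Resolve a relative path against the current working directory and check it exists."""
    path = Path(rel_path).resolve()
    if not path.is_file():
        raise FileNotFoundError(
            f"expected feature file at {path} (working directory: {Path.cwd()})"
        )
    return path


try:
    print("found:", resolve_feature_path())
except FileNotFoundError as err:
    # The message names the absolute location that was checked.
    print("missing:", err)
```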

Multi-GPU support

How can I train the model (the EC game used for generating the emergent language) on multiple GPUs?

What are the necessary changes that need to be made to the code?

I really appreciate any help you can provide.

Add license to project

Would it be possible to license this code under a free software license (https://opensource.org/licenses)? Currently, there is no license displayed in this project, so it is not clear if incorporating this code into other projects is allowed. Thanks.

Dataset coverage during training

Hello @ysymyth,
If I have understood correctly, in each epoch only a subset of the N images in the training set is used: n (= batch size × number of distracting images) as listener images and n' (= batch size) as speaker images. Hence some images may be used repeatedly while others may never be used. How can we make sure that each image in the training data is used as a speaker image at least once?
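One common way to get the coverage asked about here is to draw speaker images from a shuffled permutation of the dataset instead of sampling them independently. The following is a minimal sketch of that idea, not the repository's sampling code. It assumes images are addressed by integer index, and the function names are hypothetical.

```python
import random


def speaker_batches(num_images, batch_size, rng):
    """Yield index batches that together cover every image exactly once."""
    order = list(range(num_images))
    rng.shuffle(order)
    for start in range(0, num_images, batch_size):
        # The last batch may be smaller when num_images is not a multiple of batch_size.
        yield order[start:start + batch_size]


def sample_distractors(num_images, target, num_distractors, rng):
    """Pick distinct distractor indices for the listener, excluding the target image."""
    candidates = [i for i in range(num_images) if i != target]
    return rng.sample(candidates, num_distractors)


rng = random.Random(0)
seen = []
for batch in speaker_batches(num_images=10, batch_size=4, rng=rng):
    for target in batch:
        distractors = sample_distractors(10, target, num_distractors=3, rng=rng)
        seen.append(target)

# Every image index appears exactly once as a speaker image in this pass.
print(sorted(seen) == list(range(10)))
```

Distractors are still sampled at random in this sketch, so only speaker coverage is guaranteed per pass.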
