Code and data for the paper Linking Emergent and Natural Languages via Corpus Transfer at ICLR 2022 (spotlight).
```bibtex
@inproceedings{yao2022linking,
  title = {Linking Emergent and Natural Languages via Corpus Transfer},
  author = {Yao, Shunyu and Yu, Mo and Zhang, Yang and Narasimhan, Karthik and Tenenbaum, Joshua and Gan, Chuang},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year = {2022},
  html = {https://openreview.net/pdf?id=49A1Y6tRhaq},
}
```
- PyTorch 1.8
- SciPy 1.4
- Transformers 4.4.2
- (Optional) Wandb
The Google Drive folder includes:

- `image_features`: Image features of the MS-COCO 2014 (`coco.pt`) and Conceptual Captions (`cc.pt`) datasets from a pre-trained ResNet, to be used in EC pre-training.
- `lm_corpora`: Corpora used for language modeling transfer experiments.
| Name | Usage | Comment |
|---|---|---|
| cc.pt | pre-train | Emergent language |
| paren-zipf.pt | pre-train | Regular language of nesting parentheses |
| wiki-es.pt | pre-train | Spanish (IE-Romance) Wikipedia |
| wiki-da.pt | fine-tune | Danish (IE-Germanic) Wikipedia |
| wiki-eu.pt | fine-tune | Basque (Basque) Wikipedia |
| wiki-ja.pt | fine-tune | Japanese (Japanese) Wikipedia |
| wiki-ro.pt | fine-tune | Romanian (IE-Romance) Wikipedia |
| wiki-fi.pt | fine-tune | Finnish (Uralic) Wikipedia |
| wiki-id.pt | fine-tune | Indonesian (Austronesian) Wikipedia |
| wiki-kk.pt | fine-tune | Kazakh (Turkic) Wikipedia |
| wiki-he.pt | fine-tune | Hebrew (Afro-Asiatic) Wikipedia |
| wiki-ur.pt | fine-tune | Urdu (IE-Indic) Wikipedia |
| wiki-fa.pt | fine-tune | Persian (IE-Iranian) Wikipedia |
This part generates an emergent language corpus for downstream tasks.
Download `image_features` from Google Drive to `./ec-pretrain/data`.
To run the emergent communication training:

```shell
cd ec-game
python train.py
```
Some major options:

- `--dataset`: use the Conceptual Captions (`cc`) or MS-COCO (`coco_2014`) dataset.
- `--vocab_size`: Vocabulary size (default 4035).
- `--seq_len`: Sequence length limit (default 15).
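As a sketch, the documented flags correspond to an argparse setup like the following (defaults taken from the list above; the actual train.py may define these flags differently and include more options):

```python
import argparse

# Hypothetical mirror of the documented train.py flags; illustrative only,
# with defaults matching the README.
parser = argparse.ArgumentParser()
parser.add_argument("--dataset", choices=["cc", "coco_2014"], default="cc",
                    help="Conceptual Captions or MS-COCO 2014 image features")
parser.add_argument("--vocab_size", type=int, default=4035,
                    help="EC vocabulary size")
parser.add_argument("--seq_len", type=int, default=15,
                    help="EC message length limit")
args = parser.parse_args([])  # empty list: take the defaults
```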
Game training automatically stores EC agents (e.g. `./ckpt/cc_vocab_4035_seq_15_reset_-1_nlayers_1/run77926/model_90.6_1000_4035.pt`) and emergent language corpora (e.g. `./ckpt/cc_vocab_4035_seq_15_reset_-1_nlayers_1/run77926/model_90.6_1000_4035.pt-cc.pt`, which can be used in place of `lm_corpora/cc.pt` from Google Drive) at different training steps. In the example, `90.6_1000_4035` denotes the game accuracy, the number of game training steps, and the game vocabulary size, respectively.
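For reference, a small helper (not part of the repo) can recover these fields from such a checkpoint name, assuming the `model_<acc>_<steps>_<vocab>.pt` pattern described above:

```python
def parse_ckpt_name(fname):
    """Split a name like 'model_90.6_1000_4035.pt' into
    (accuracy, training steps, vocab size). Illustrative only."""
    stem = fname[len("model_"):-len(".pt")]
    acc, steps, vocab = stem.split("_")
    return float(acc), int(steps), int(vocab)
```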
This part reproduces Figure 2 of the paper.

Download `lm_corpora` from Google Drive to `./ec-pretrain/data`.
To run the pre-training:

```shell
export size=2             # 2, 5, 10, 15, 30
export pt_name="wiki-es"  # "paren-zipf", "cc"
. pretrain.sh
```
To run the fine-tuning:

```shell
export size=2             # 2, 5, 10, 15, 30
export pt_name="wiki-es"  # "paren-zipf", "cc"
export ft_name="wiki-ro"
export ckpt=3000
. finetune.sh
```
Meaning of the variables above:

- `size`: Token count (in millions) of the pre-training corpus (`[2, 5, 10, 15, 30]`).
- `pt_name`: Name of the pre-training corpus (`["wiki-es", "paren-zipf", "cc"]`).
- `ft_name`: Name of the fine-tuning corpus (e.g. `"wiki-ro"`, `"wiki-da"`).
- `ckpt`: Which pre-training checkpoint to use for fine-tuning (default 3000).
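Putting the two scripts together, a full sweep over the settings above might look like the sketch below. This is a hypothetical driver loop, not a script from the repo; it assumes `pretrain.sh` and `finetune.sh` read the exported variables as documented, and the sourcing lines are commented out so the loop only counts configurations:

```shell
runs=0
for pt_name in wiki-es paren-zipf cc; do
  for size in 2 5 10 15 30; do
    export pt_name size ft_name="wiki-ro" ckpt=3000
    # . pretrain.sh   # uncomment to actually pre-train
    # . finetune.sh   # uncomment to actually fine-tune
    runs=$((runs + 1))
  done
done
echo "planned $runs pre-train/fine-tune configurations"
```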
The EC part of the code is based on ECNMT, which was partly based on Translagent.
The LM part of the code is based on Huggingface run_clm.py.
The datasets for our EC experiments include MS COCO and Conceptual Captions.
The datasets for our LM experiments derive from tilt-transfer.
Please cite these resources accordingly. For any questions, contact Shunyu.
ec-nl's Issues
Error loading coco files.
Hi,
I am trying to replicate your codebase. The README says to put the files downloaded from the `image_features` Google Drive folder into `./ec-pretrain/data`. I created that directory and added the files there, but whenever I run train.py for the emergent communication game, I receive the error below:
```
Entering Main
Traceback (most recent call last):
  File "train.py", line 73, in <module>
    data = torch.load(feat_path)
  File "/home/mogunleye/miniconda3/envs/ec-nl1/lib/python3.6/site-packages/torch/serialization.py", line 579, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/mogunleye/miniconda3/envs/ec-nl1/lib/python3.6/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/mogunleye/miniconda3/envs/ec-nl1/lib/python3.6/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: './data/coco.pt'
```
Then I thought maybe I should put the data in `./data/` instead of `./ec-pretrain/data`, but I still get the same error.
I was wondering if I have done something wrong?
Thanks and looking forward to hearing from you.
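One way to debug path issues like this is a small standalone check of the candidate locations before launching training. This is a debugging sketch, not repo code; it assumes train.py resolves `./data/coco.pt` relative to the current working directory, so the two candidates below correspond to running from inside `ec-game/`:

```python
import os

def find_feat_path(candidates):
    """Return the first existing path among the candidates, or None.
    A debugging aid, not part of the repo's code."""
    return next((p for p in candidates if os.path.exists(p)), None)

# e.g. run from inside ec-game/:
# find_feat_path(["./data/coco.pt", "../ec-pretrain/data/coco.pt"])
```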
Multi-GPU support
How can I train the model (the EC game used for generating the emergent language) on multiple GPUs?
What are the necessary changes that need to be made to the code?
I really appreciate any help you can provide.
Add license to project
Would it be possible to license this code under a free software license (https://opensource.org/licenses)? Currently, no license is displayed in this project, so it is not clear whether incorporating this code into other projects is allowed. Thanks.
Dataset coverage during training
Hello @ysymyth ,
If I have understood correctly, during each epoch, out of all N images in the training set, only n (= batch size × number of distracting images) are used as listener images here, and n' (= batch size) are used as speaker images here. Hence some images might be used repeatedly while others may never be used. How can we make sure that each image in the training data is used as a speaker image at least once?
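One simple way to guarantee the coverage asked about here is to draw speaker images from a shuffled permutation rather than sampling independently per batch. The sketch below illustrates the idea; it is not the repo's sampler:

```python
import random

def speaker_batches(n_images, batch_size, seed=0):
    """Yield speaker-index batches that cover every image exactly once
    per epoch, by walking a shuffled permutation of all indices."""
    order = list(range(n_images))
    random.Random(seed).shuffle(order)
    for i in range(0, n_images, batch_size):
        yield order[i:i + batch_size]
```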