Implementation of the paper "The Galactic Dependencies Treebanks: Getting More Data by Synthesizing New Languages" by Dingquan Wang and Jason Eisner. TACL 2016
The data release is here, this yields 37 × 38 × 38 = 53,428 languages in 70+G compressed files (700+G after extraction) in total.
-
To download and extract the entire dataset (Please make sure you have enough disk space):
GALACTIC_ROOT=$(pwd) bin/gd-fetch
-
If you want to download only a subset, for example, synthetic languages substrated by English or German without extraction:
GALACTIC_ROOT=$(pwd) bin/gd-fetch --substrate GD_English GD_German --pipeline download
-
For more options, please use:
bin/gd-fetch --help
-
Compile the code from the command line:
mvn compile
-
To build a single jar with all the dependencies included:
mvn compile assembly:single
-
To train a permutation model of, for example, NOUN from a given treebank (toy/sample.conllu):
GALACTIC_ROOT=$(pwd) bin/gd-train-permute --input toy/sample.conllu --node N
-
To permute a given treebank (toy/sample.collu) using the given permuatation models to a synthetic language, for example, en
fr@Nhi@V:GALACTIC_ROOT=$(pwd) bin/gd-translate --input toy/sample.conllu --spec en~fr@N~hi@V
-
For more options, please use:
bin/gd-train-permute --help bin/gd-translate --help
Note: The given model files are generated from a slightly older Pacaya version, which is no longer used in the current release. So the models reproduced from the current version might be slightly different from what are given.
@article{galactic16,
author = {Dingquan Wang and Jason Eisner},
title = {The {G}alactic {D}ependencies Treebanks: Getting More Data by Synthesizing New Languages},
journal = {Transactions of the ACL},
year = {2016},
note = {In review}
}