adversarial-multi-criteria-learning-for-CWS

Implementation of the paper "Adversarial Multi-Criteria Learning for Chinese Word Segmentation" (https://arxiv.org/abs/1704.07556), ACL 2017.

Dependencies

TensorFlow: ==1.0.0

pandas: >=0.18.1

NumPy: >=1.12.1

File Tree

|-- AdvMulti_model.py

|-- AdvMulti_train.py

|-- Baseline_model.py

|-- Baseline_train.py

|-- config.py

|-- data_as

|   |-- dev

|   |-- test

|   |-- test_gold

|   |-- train

|   |-- words

|   `-- words_for_training

|-- data_cityu

|   |-- dev

|   |-- test

|   |-- test_gold

|   |-- train

|   |-- words

|   `-- words_for_training

|-- data_ckip

|   |-- dev

|   |-- test

|   |-- test_gold

|   |-- train

|   |-- words

|   `-- words_for_training

|-- data_ctb

|   |-- dev

|   |-- test

|   |-- test_gold

|   |-- train

|   |-- words

|   `-- words_for_training

|-- data_helpers.py

|-- data_msr

|   |-- dev

|   |-- test

|   |-- test_gold

|   |-- train

|   |-- words

|   `-- words_for_training

|-- data_ncc

|   |-- dev

|   |-- test

|   |-- test_gold

|   |-- train

|   |-- words

|   `-- words_for_training

|-- data_pku

|   |-- dev

|   |-- test

|   |-- test_gold

|   |-- train

|   |-- words

|   `-- words_for_training

|-- data_sxu

|   |-- dev

|   |-- test

|   |-- test_gold

|   |-- train

|   |-- words

|   `-- words_for_training

|-- data_weibo

|   |-- dev

|   |-- test

|   |-- test_gold

|   |-- train

|   |-- words

|   `-- words_for_training

|-- models

|   |-- cws_msr

|   |   `-- checkpoints

|   |-- cws_ncc

|   |   `-- checkpoints

|   |-- cws_sxu

|   |   `-- checkpoints

|   |-- multi_model9

|   |   `-- checkpoints

|   |-- train_words

|   `-- vec100.txt

|-- prepare_data_index.py

|-- prepare_train_words.py

`-- voc.py

Data Format

In each data_* directory, every line of the dev, train, and test files has the following format:

１９９５#<NUM>#B_NT

The first field is the original character (１９９５), the second is the processed character (<NUM>), and the last is the segmentation tag combined with the POS tag (B_NT). The POS information is not used in the paper; it is kept only to make later extensions easier.
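As a rough illustration of the line format described above, the following sketch splits one line into its three fields. The function and class names are hypothetical and not part of this repository, and it assumes that neither the original nor the processed field contains a literal "#" and that the tag and POS are joined by a single underscore.

```python
from typing import NamedTuple, Optional


class TaggedChar(NamedTuple):
    original: str        # text as it appears in the corpus, e.g. "１９９５"
    processed: str       # normalized form, e.g. "<NUM>"
    seg_tag: str         # segmentation tag, e.g. "B"
    pos: Optional[str]   # POS label, e.g. "NT", or None if absent


def parse_tagged_line(line: str) -> TaggedChar:
    """Parse one 'original#processed#TAG_POS' line (hypothetical helper)."""
    fields = line.rstrip("\n").split("#")
    if len(fields) != 3:
        # Assumption: fields never contain '#', so exactly three parts exist.
        raise ValueError("expected 3 '#'-separated fields: %r" % line)
    original, processed, tag = fields
    # Split the combined tag at the first underscore: "B_NT" -> "B", "NT".
    seg_tag, _, pos = tag.partition("_")
    return TaggedChar(original, processed, seg_tag, pos or None)


example = parse_tagged_line("１９９５#<NUM>#B_NT")
print(example)
```

Since the POS part is not needed for segmentation, downstream code could keep only `seg_tag` and ignore `pos`.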

The words file in each data_* directory is a word dictionary:

平定费尔南多·安特萨纳北京索有文化传播有限公司

The words_for_training file in each data_* directory has the following format:

LC 过后 28

LC is the POS tag, ‘过后’ is the extracted bigram, and 28 is its frequency in that dataset. The POS information is not used in the paper; it is kept only to make later extensions easier.
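A minimal sketch of reading one such entry is shown below. The helper name is hypothetical, and it assumes the three fields are separated by whitespace, as in the example line.

```python
def parse_training_word(line):
    """Parse one 'POS word frequency' line (hypothetical helper)."""
    # Assumption: fields are whitespace-separated and the word has no spaces.
    pos, word, freq = line.split()
    return pos, word, int(freq)


entry = parse_training_word("LC 过后 28")
print(entry)
```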

vec100.txt is the embedding file generated by the word2vec toolkit.

Here is the link: https://pan.baidu.com/s/1jHHdzmA
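For orientation, here is a sketch of how an embedding file of this kind might be read. It assumes the usual word2vec text layout (an optional "count dimension" header line, then one token per line followed by space-separated floats); the actual layout of vec100.txt has not been checked here, and the function name is hypothetical. The demo uses a tiny in-memory sample with made-up values rather than the real file.

```python
import io


def load_embeddings(handle):
    """Read word2vec-style text vectors into a {token: [floats]} dict."""
    vectors = {}
    for line_no, line in enumerate(handle):
        parts = line.split()
        # Skip an optional "vocab_size dim" header on the first line.
        if line_no == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        if len(parts) < 2:
            continue  # ignore blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Made-up sample: 2 tokens with 3 dimensions each.
sample = io.StringIO("2 3\n中 0.1 0.2 0.3\n国 0.4 0.5 0.6\n")
embeddings = load_embeddings(sample)
print(len(embeddings), len(embeddings["中"]))
```

With the real file, the same function could be called on `open("models/vec100.txt", encoding="utf-8")`, assuming the file is UTF-8 encoded; the resulting lists can then be converted to a NumPy array.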

Code Usage

prepare_data_index.py produces the .csv files that are used as direct input.

prepare_train_words.py generates the words to be trained in multi-task learning, i.e. those whose frequency exceeds a given threshold.

AdvMulti_model.py & AdvMulti_train.py are the paired model and training files for the adversarial multi-criteria model.

Baseline_model.py & Baseline_train.py are the paired model and training files for the baseline.
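The frequency filtering attributed to prepare_train_words.py above can be pictured with the sketch below. It is not the script's actual code: the function name, the threshold value, the inclusive comparison, and the second sample entry are all assumptions made for illustration.

```python
def select_frequent_words(entries, min_freq):
    """Keep words whose frequency is at least min_freq (hypothetical helper)."""
    # Assumption: the cut-off is inclusive; the real script may use '>'.
    return [word for _pos, word, freq in entries if freq >= min_freq]


# (POS, word, frequency) tuples; the second entry is made up for the demo.
entries = [("LC", "过后", 28), ("NN", "示例", 3)]
selected = select_frequent_words(entries, min_freq=5)
print(selected)
```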

Run

Hyperparameters are defined in config.py and tf.FLAGS.

Once all the necessary files are in place:

To train the baseline:

CUDA_VISIBLE_DEVICES=0 python Baseline_train.py

To train the adversarial multi-task model:

CUDA_VISIBLE_DEVICES=0 python AdvMulti_train.py
