DAZER

The TensorFlow implementation of our ACL 2018 paper:
A Deep Relevance Model for Zero-Shot Document Filtering, Chenliang Li, Wei Zhou, Feng Ji, Yu Duan, Haiqing Chen.
Paper URL: http://aclweb.org/anthology/P18-1214

Requirements

Python 3.5
TensorFlow 1.2
Numpy
Traitlets

Guide To Use

Prepare your dataset: first, prepare your own data. See Data Preparation.

Configure: then, configure the model through the config file. Configurable parameters are listed under Configurations.

See the example: sample.config

In addition, you need to change the zero-shot label settings in get_label.py.

(Make sure that get_label.py and model.py are in the same directory.)

Training: pass the config file, training data, and validation data as

python model.py config-file\
    --train \
    --train_file: path to training data\
    --validation_file: path to validation data\
    --checkpoint_dir: directory to store/load model checkpoints\
    --load_model: True or False (depending on whether a saved model exists). Continue training from the saved model, or start with a new one

See example: sample-train.sh

Testing: pass the config file and testing data as

python model.py config-file\
    --test \
    --test_file: path to testing data\
    --test_size: size of testing data (number of testing samples)\
    --checkpoint_dir: directory to load trained model\
    --output_score_file: file to write document scores to

Relevance scores will be output to output_score_file, one score per line, in the same order as test_file.
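Because the scores are written in the same order as test_file, they can be matched back to the test samples by line position. The following is a minimal sketch, assuming that each score line parses as a single float and that a higher score means a more relevant document (neither is stated explicitly in this README). `rank_by_score` is a hypothetical helper and is not part of this repository.

```python
def rank_by_score(test_lines, score_lines):
    """Pair each test sample with its score and sort by score, highest first.

    Assumption: one float per score line, aligned with test_lines by position.
    """
    if len(test_lines) != len(score_lines):
        raise ValueError("test file and score file must have the same number of lines")
    scored = [(float(s), i) for i, s in enumerate(score_lines)]
    # Sort by descending score; ties keep the original file order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, test_lines[i].rstrip("\n"), score) for score, i in scored]


if __name__ == "__main__":
    tests = ["334,453,768\t123,435\n", "334,453,768\t9,8,7\n"]
    scores = ["0.12\n", "0.87\n"]
    for index, sample, score in rank_by_score(tests, scores):
        print(index, score, sample)
```

The returned tuples keep the original line index, so a ranked document can still be traced back to its position in test_file.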

Data Preparation

All seed words and documents must be mapped into sequences of integer term ids. Term ids start at 1.
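The mapping step could look like the sketch below. The vocabulary shown is made up for illustration, `encode` and `to_field` are hypothetical helpers, and dropping out-of-vocabulary words is an assumption (the repository may treat them differently).

```python
def encode(tokens, word2id):
    """Map tokens to integer term ids.

    Assumption: tokens that are missing from word2id are simply dropped.
    """
    ids = [word2id[t] for t in tokens if t in word2id]
    if any(i < 1 for i in ids):
        raise ValueError("term ids must start at 1")
    return ids


def to_field(ids):
    """Join term ids into the comma-separated form used by the data files."""
    return ",".join(str(i) for i in ids)


if __name__ == "__main__":
    # Hypothetical vocabulary, for illustration only.
    vocab = {"atheist": 334, "christian": 453, "god": 768}
    print(to_field(encode("atheist unknown god".split(), vocab)))
```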

Training Data Format

Each training sample is a tuple of (seed words, positive document, negative document)

seed_words \t positive_document \t negative_document

Example: 334,453,768 \t 123,435,657,878,6,556 \t 443,554,534,3,67,8,12,2,7,9
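To sanity-check a training file before starting a run, each line can be parsed back into its three id lists. This is a minimal sketch that assumes the fields are separated by a tab character and the ids by commas, as in the example above; `parse_training_line` is a hypothetical helper, not a function from this repository.

```python
def parse_ids(field):
    """Turn a comma-separated field such as '334,453,768' into a list of ints."""
    return [int(tok) for tok in field.strip().split(",") if tok.strip()]


def parse_training_line(line):
    """Split one training line into (seed words, positive doc, negative doc)."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3:
        raise ValueError("expected 3 tab-separated fields, got %d" % len(parts))
    seed, positive, negative = (parse_ids(p) for p in parts)
    return seed, positive, negative


if __name__ == "__main__":
    sample = "334,453,768\t123,435,657,878,6,556\t443,554,534,3,67,8,12,2,7,9"
    print(parse_training_line(sample))
```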

Testing Data Format

Each testing sample is a tuple of (seed words, document)

seed_words \t document

Example: 334,453,768 \t 123,435,657,878,6,556
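A test file for one label can be produced by pairing that label's seed words with every candidate document. The sketch below assumes tab-separated fields and comma-separated ids, as in the example above; `make_test_lines` is a hypothetical helper.

```python
def make_test_lines(seed_ids, documents):
    """Build one 'seed_words<TAB>document' line per document.

    seed_ids is a list of term ids; documents is a list of term-id lists.
    """
    seed_field = ",".join(str(i) for i in seed_ids)
    lines = []
    for doc in documents:
        doc_field = ",".join(str(i) for i in doc)
        lines.append(seed_field + "\t" + doc_field)
    return lines


if __name__ == "__main__":
    docs = [[123, 435, 657, 878, 6, 556], [443, 554, 534]]
    for line in make_test_lines([334, 453, 768], docs):
        print(line)
```

The number of lines produced this way is the value to pass as --test_size.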

Validation Data Format

The format is the same as the training data format.

Label Dict File Format

Each line is a tuple of (label_name, seed_words)

label_name/seed_words

Example: alt.atheism/atheist christian atheism god islamic
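A line in this format can be read with a single split on the slash. The sketch below assumes that the label name itself contains no "/" and that seed words are separated by whitespace; `parse_label_line` is a hypothetical helper.

```python
def parse_label_line(line):
    """Split 'label_name/seed words ...' into (label_name, [seed words])."""
    # Split on the first slash only, so later slashes stay in the seed-word part.
    label, sep, words = line.strip().partition("/")
    if not sep or not label:
        raise ValueError("expected 'label_name/seed_words', got %r" % line)
    return label, words.split()


if __name__ == "__main__":
    print(parse_label_line("alt.atheism/atheist christian atheism god islamic"))
```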

Word2id File Format

Each line is a tuple of (word, id)

word id

Example: world 123
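Loading this file into a dictionary might look like the following sketch. It assumes the word and the id are separated by whitespace and that the id is the last token on the line; `load_word2id` is a hypothetical helper that takes an iterable of lines (for example, an open file).

```python
def load_word2id(lines):
    """Build a {word: id} dict from 'word id' lines, skipping blank lines."""
    word2id = {}
    for line in lines:
        if not line.strip():
            continue
        # Split from the right so the id is always the final token.
        word, term_id = line.rsplit(None, 1)
        word2id[word] = int(term_id)
    return word2id


if __name__ == "__main__":
    print(load_word2id(["world 123\n", "hello 7\n"]))
```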

Embedding File Format

Each line is a tuple of (id, embedding)

id embedding

Example: 1 0.3 0.4 0.5 0.6 -0.4 -0.2
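Each embedding line can be checked against the configured embedding dimension before training. The sketch below assumes whitespace-separated values with the term id first; `parse_embedding_line` is a hypothetical helper, and its `embedding_size` argument stands in for the configured embedding dimension.

```python
def parse_embedding_line(line, embedding_size=None):
    """Split 'id v1 v2 ...' into (id, [v1, v2, ...]).

    If embedding_size is given, the vector length is checked against it.
    """
    tokens = line.split()
    if len(tokens) < 2:
        raise ValueError("expected an id followed by at least one value")
    term_id = int(tokens[0])
    vector = [float(tok) for tok in tokens[1:]]
    if embedding_size is not None and len(vector) != embedding_size:
        raise ValueError(
            "id %d has %d values, expected %d" % (term_id, len(vector), embedding_size)
        )
    return term_id, vector


if __name__ == "__main__":
    print(parse_embedding_line("1 0.3 0.4 0.5 0.6 -0.4 -0.2", embedding_size=6))
```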

Configurations

Model Configurations

BaseNN.embedding_size: embedding dimension of word
BaseNN.max_q_len: max query length
BaseNN.max_d_len: max document length
DataGenerator.max_q_len: max query length. Should be the same as BaseNN.max_q_len
DataGenerator.max_d_len: max document length. Should be the same as BaseNN.max_d_len
BaseNN.vocabulary_size: vocabulary size
DataGenerator.vocabulary_size: vocabulary size
BaseNN.batch_size: batch size
BaseNN.max_epochs: max number of epochs to train
BaseNN.eval_frequency: evaluate the model on the validation set every this many epochs
BaseNN.checkpoint_steps: save the model every this many epochs
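Since some DataGenerator values should be the same as their BaseNN counterparts, a quick consistency check can catch mismatches before training. The sketch below is only illustrative: it uses a plain dictionary keyed by "Class.parameter" as a stand-in for the loaded config, the values are made up, and `find_mismatches` is a hypothetical helper that does not read the real config file.

```python
# Parameter pairs that this README says should be the same.
PAIRED_SETTINGS = [
    ("BaseNN.max_q_len", "DataGenerator.max_q_len"),
    ("BaseNN.max_d_len", "DataGenerator.max_d_len"),
]


def find_mismatches(settings, pairs=PAIRED_SETTINGS):
    """Return a message for every pair whose two values differ or are missing."""
    problems = []
    for left, right in pairs:
        if left not in settings or right not in settings:
            problems.append("missing setting: %s or %s" % (left, right))
        elif settings[left] != settings[right]:
            problems.append(
                "%s=%r differs from %s=%r" % (left, settings[left], right, settings[right])
            )
    return problems


if __name__ == "__main__":
    # Made-up values, for illustration only.
    example = {
        "BaseNN.max_q_len": 10,
        "DataGenerator.max_q_len": 10,
        "BaseNN.max_d_len": 500,
        "DataGenerator.max_d_len": 400,
    }
    for problem in find_mismatches(example):
        print(problem)
```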

Data

DAZER.emb_in: path of initial embeddings file
DAZER.label_dict_path: path of label dict file
DAZER.word2id_path: path of word2id file

Training Parameters

DAZER.epsilon: epsilon for Adam Optimizer
DAZER.embedding_size: embedding dimension of word
DAZER.vocabulary_size: vocabulary size of the dataset
DAZER.kernal_width: width of the kernel
DAZER.kernal_num: num of kernel
DAZER.regular_term: weight of L2 loss
DAZER.maxpooling_num: num of K-max pooling
DAZER.decoder_mlp1_num: num of hidden units of first mlp in relevance aggregation part
DAZER.decoder_mlp2_num: num of hidden units of second mlp in relevance aggregation part
DAZER.model_learning_rate: learning rate for the model (as opposed to the adversarial classifier)
DAZER.adv_learning_rate: learning rate for the adversarial classifier
DAZER.train_class_num: num of class in training time
DAZER.adv_term: weight of adversarial loss when updating model's parameters
DAZER.zsl_num: num of zero-shot labels
DAZER.zsl_type: type of zero-shot label setting (you may have multiple zero-shot settings with the same number of zero-shot labels; this indicates which setting you pick for the experiment. See get_label.py for more details)
