
The Tensorflow implementation of accepted ACL 2018 paper "A deep relevance model for zero-shot document filtering", Chenliang Li, Wei Zhou, Feng Ji, Yu Duan, Haiqing Chen, http://aclweb.org/anthology/P18-1214


DAZER

The Tensorflow implementation of our ACL 2018 paper:
A Deep Relevance Model for Zero-Shot Document Filtering, Chenliang Li, Wei Zhou, Feng Ji, Yu Duan, Haiqing Chen.
Paper URL: http://aclweb.org/anthology/P18-1214

Requirements

  • Python 3.5
  • Tensorflow 1.2
  • Numpy
  • Traitlets

Guide To Use

Prepare your dataset: first, prepare your own data as described under Data Preparation.

Configure: then, configure the model through the config file. Configurable parameters are listed under Configurations.

See the example: sample.config

In addition, you need to change the zero-shot label settings in get_label.py.

(Make sure that get_label.py and model.py are in the same directory.)

Training: pass the config file, training data, and validation data as follows:

python model.py config-file \
    --train \
    --train_file: path to training data \
    --validation_file: path to validation data \
    --checkpoint_dir: directory to store/load model checkpoints \
    --load_model: True or False, depending on whether a saved model exists (continue training from it, or start with a new model)

See example: sample-train.sh

Testing: pass the config file and testing data as follows:

python model.py config-file \
    --test \
    --test_file: path to testing data \
    --test_size: size of testing data (number of testing samples) \
    --checkpoint_dir: directory to load the trained model from \
    --output_score_file: file to write document scores to

Relevance scores will be output to output_score_file, one score per line, in the same order as test_file.
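
Because the scores are written one per line in the same order as test_file, they can be joined back to the test samples by position. The following is a minimal sketch, assuming each score line parses as a float and that a higher score means a more relevant document; `rank_by_score` is a hypothetical helper, not part of this repository.

```python
def rank_by_score(test_lines, score_lines):
    """Pair each test sample with its score and sort by descending score."""
    # Skip blank lines so a trailing newline does not count as a sample.
    samples = [line.rstrip("\n") for line in test_lines if line.strip()]
    scores = [float(line) for line in score_lines if line.strip()]
    if len(samples) != len(scores):
        raise ValueError("expected exactly one score per test sample")
    # Assumption: a higher relevance score means a more relevant document.
    return sorted(zip(scores, samples), key=lambda pair: pair[0], reverse=True)


# Illustrative in-memory data; in practice, pass the two open file objects.
ranked = rank_by_score(
    ["334,453,768\t123,435,657\n", "334,453,768\t443,554,534\n"],
    ["0.12\n", "0.87\n"],
)
```

The length check catches the case where test_file and output_score_file have drifted out of sync.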

Data Preparation

All seed words and documents must be mapped into sequences of integer term ids. Term ids start at 1.

Training Data Format

Each training sample is a tuple of (seed words, positive document, negative document)

seed_words \t positive_document \t negative_document

Example: 334,453,768 \t 123,435,657,878,6,556 \t 443,554,534,3,67,8,12,2,7,9
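
A training line in this format could be produced as sketched below. The helper name `format_training_sample` is hypothetical, and the sketch assumes the fields are separated by a single tab character with no surrounding spaces (the spaces in the example above are taken to be for readability only).

```python
def format_training_sample(seed_ids, pos_doc_ids, neg_doc_ids):
    """Serialize one (seed words, positive doc, negative doc) triple."""
    fields = (seed_ids, pos_doc_ids, neg_doc_ids)
    for ids in fields:
        # Term ids start at 1, so anything below 1 is rejected.
        if any(term_id < 1 for term_id in ids):
            raise ValueError("term ids must be >= 1")
    # Comma-separated ids within a field, tab between fields (assumed).
    return "\t".join(",".join(str(i) for i in ids) for ids in fields)


line = format_training_sample(
    [334, 453, 768],
    [123, 435, 657, 878, 6, 556],
    [443, 554, 534, 3, 67, 8, 12, 2, 7, 9],
)
```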

Testing Data Format

Each testing sample is a tuple of (seed words, document)

seed_words \t document

Example: 334,453,768 \t 123,435,657,878,6,556
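
The sketch below parses a testing line back into id lists and counts the samples, which is the value that `--test_size` expects. Both helper names are hypothetical, and a literal tab separator is assumed.

```python
def parse_test_sample(line):
    """Split one testing line into (seed word ids, document ids)."""
    seed_field, doc_field = line.rstrip("\n").split("\t")
    seed_ids = [int(token) for token in seed_field.split(",")]
    doc_ids = [int(token) for token in doc_field.split(",")]
    return seed_ids, doc_ids


def count_samples(lines):
    """Count non-blank lines, i.e. the number of testing samples."""
    return sum(1 for line in lines if line.strip())


lines = ["334,453,768\t123,435,657,878,6,556\n"]
seed_ids, doc_ids = parse_test_sample(lines[0])
test_size = count_samples(lines)
```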

Validation Data Format

The format is the same as the training data format.

Label Dict File Format

Each line is a tuple of (label_name, seed_words)

label_name/seed_words

Example: alt.atheism/atheist christian atheism god islamic
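
A label dict line can be read as sketched below. This assumes the label name contains no "/" (so the line is split at the first "/") and that seed words are separated by whitespace, as in the example; `parse_label_line` is a hypothetical helper.

```python
def parse_label_line(line):
    """Split 'label_name/seed_words' into a label and a list of seed words."""
    # Assumption: the first "/" separates the label name from the seed words.
    label_name, _, seed_field = line.strip().partition("/")
    return label_name, seed_field.split()


label, seeds = parse_label_line("alt.atheism/atheist christian atheism god islamic\n")
```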

Word2id File Format

Each line is a tuple of (word, id)

word id

Example: world 123
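
The word2id file is what turns raw words into the comma-separated id sequences used in the data files. A minimal sketch follows; the helper names are hypothetical, the second vocabulary entry is made up for illustration, and silently dropping out-of-vocabulary words is an assumption rather than documented behaviour.

```python
def load_word2id(lines):
    """Build a {word: id} mapping from 'word id' lines."""
    word2id = {}
    for line in lines:
        if not line.strip():
            continue
        word, term_id = line.split()
        word2id[word] = int(term_id)
    return word2id


def encode(words, word2id):
    """Map words to a comma-separated id string, skipping unknown words."""
    # Assumption: out-of-vocabulary words are dropped.
    return ",".join(str(word2id[w]) for w in words if w in word2id)


# "god 45" is an illustrative entry, not taken from a real vocabulary file.
word2id = load_word2id(["world 123\n", "god 45\n"])
encoded = encode(["god", "world", "unseen"], word2id)
```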

Embedding File Format

Each line is a tuple of (id, embedding)

id embedding

Example: 1 0.3 0.4 0.5 0.6 -0.4 -0.2
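
Each embedding line can be checked against the configured embedding dimension before training. The sketch below uses plain Python lists and a hypothetical helper `parse_embedding_line`; it assumes whitespace-separated values, as in the example.

```python
def parse_embedding_line(line, embedding_size):
    """Parse 'id v1 v2 ...' and verify the vector length."""
    parts = line.split()
    term_id = int(parts[0])
    vector = [float(value) for value in parts[1:]]
    if len(vector) != embedding_size:
        raise ValueError(
            "id %d has %d values, expected %d" % (term_id, len(vector), embedding_size)
        )
    return term_id, vector


term_id, vector = parse_embedding_line("1 0.3 0.4 0.5 0.6 -0.4 -0.2", embedding_size=6)
```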

Configurations

Model Configurations

  • BaseNN.embedding_size: word embedding dimension
  • BaseNN.max_q_len: max query length
  • BaseNN.max_d_len: max document length
  • DataGenerator.max_q_len: max query length. Should be the same as BaseNN.max_q_len
  • DataGenerator.max_d_len: max document length. Should be the same as BaseNN.max_d_len
  • BaseNN.vocabulary_size: vocabulary size
  • DataGenerator.vocabulary_size: vocabulary size
  • BaseNN.batch_size: batch size
  • BaseNN.max_epochs: max number of epochs to train
  • BaseNN.eval_frequency: evaluate the model on the validation set every this many epochs
  • BaseNN.checkpoint_steps: save the model every this many epochs
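
Since several settings are duplicated between BaseNN and DataGenerator, a quick consistency check can catch mismatches before training. The sketch below holds the values in a plain dictionary purely for illustration; it is not the format of the config file (see sample.config for that), the numbers are made up, and requiring vocabulary_size to match is an assumption, as only the two length settings are stated to be equal.

```python
# Settings that appear under both BaseNN and DataGenerator.
SHARED_KEYS = ("max_q_len", "max_d_len", "vocabulary_size")


def mismatched_keys(settings):
    """Return the shared keys whose values differ between the two sections."""
    base, gen = settings["BaseNN"], settings["DataGenerator"]
    return [key for key in SHARED_KEYS if base.get(key) != gen.get(key)]


# Illustrative values only; max_d_len is deliberately inconsistent.
settings = {
    "BaseNN": {"max_q_len": 5, "max_d_len": 1000, "vocabulary_size": 50000},
    "DataGenerator": {"max_q_len": 5, "max_d_len": 500, "vocabulary_size": 50000},
}
problems = mismatched_keys(settings)
```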

Data

  • DAZER.emb_in: path of initial embeddings file
  • DAZER.label_dict_path: path of label dict file
  • DAZER.word2id_path: path of word2id file

Training Parameters

  • DAZER.epsilon: epsilon for the Adam optimizer
  • DAZER.embedding_size: word embedding dimension
  • DAZER.vocabulary_size: vocabulary size of the dataset
  • DAZER.kernal_width: width of the kernel
  • DAZER.kernal_num: number of kernels
  • DAZER.regular_term: weight of the L2 loss
  • DAZER.maxpooling_num: K in K-max pooling
  • DAZER.decoder_mlp1_num: number of hidden units of the first MLP in the relevance aggregation part
  • DAZER.decoder_mlp2_num: number of hidden units of the second MLP in the relevance aggregation part
  • DAZER.model_learning_rate: learning rate for the model (as opposed to the adversarial classifier)
  • DAZER.adv_learning_rate: learning rate for the adversarial classifier
  • DAZER.train_class_num: number of classes at training time
  • DAZER.adv_term: weight of the adversarial loss when updating the model's parameters
  • DAZER.zsl_num: number of zero-shot labels
  • DAZER.zsl_type: type of zero-shot label setting (there may be multiple zero-shot settings for the same number of zero-shot labels; this selects which one is used for the experiment, see get_label.py for more details)
