Document modeling with Generative Adversarial Networks

Setup Guide:

Run the steps below to generate documents using this GAN model.

Requirements

Ensure that Python 3 is installed before installing the remaining dependencies.
Create a virtual environment (venv or conda) to run the solution.

To create a venv virtual environment for the model, use the commands below:

$ python3 -m venv dmgan

$ source dmgan/bin/activate

To create a conda virtual environment instead, use the commands below:

$ conda create -y --name dmgan python=3.6

$ conda activate dmgan

$ conda install pip

Install the remaining dependencies by running the command below:

$ pip install -r requirements.txt

Data Population

Run the command below to prepare the raw input dataset (18,846 documents) and split it into training, validation, and test sets:

$ python prepare.py

Three new files are created in the /data folder: training.csv (13,192 documents), validation.csv (1,884 documents), and test.csv (3,769 documents). In each CSV file, the first column is the label and the second column is the raw text of the document body.
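
As a quick sanity check on the layout described above, the following sketch reads (label, text) pairs from such a CSV with Python's standard csv module. The function name read_labeled_documents is hypothetical, and the sketch assumes comma-separated fields with no header row; adjust it if prepare.py writes the files differently. The sample rows are made up.

```python
import csv
import io


def read_labeled_documents(handle):
    """Yield (label, text) pairs from a two-column CSV file object."""
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        yield row[0], row[1]


# Small in-memory stand-in for data/training.csv (made-up rows).
sample = io.StringIO(
    'sci.space,"The shuttle launch, delayed again."\n'
    "rec.autos,New engine\n"
)
documents = list(read_labeled_documents(sample))
print(documents[0])
```

To read a real file, pass open("data/training.csv", newline="", encoding="utf-8") in place of the in-memory sample.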

Data Preprocessing

Run the command below to preprocess the raw input data into the vectorized format expected by the model:

$ python preprocess.py --input data --output preprocessed_data --vocab data/20newsgroups.vocab

where: input is the path to the input dataset; output is the path to the preprocessed output dataset; vocab is the path to the vocab file

Four new files (training.csv, validation.csv, test.csv, and labels.txt) are created in the /preprocessed_data folder. In each CSV file, the first column is the label and the second column is the vectorized document body. labels.txt lists the 20 groups of the 20 Newsgroups corpus.
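
The exact vectorized format is defined by preprocess.py. The sketch below only illustrates the general idea of turning raw text into a bag-of-words count vector against a fixed vocabulary. The function name vectorize, the whitespace tokenizer, and the tiny vocabulary are assumptions made for illustration, not the repository's implementation.

```python
from collections import Counter


def vectorize(text, vocab):
    """Map text to a sparse {word_index: count} dict.

    Words that are not in the vocabulary are dropped.
    """
    index = {word: i for i, word in enumerate(vocab)}
    counts = Counter(
        index[token] for token in text.lower().split() if token in index
    )
    return dict(sorted(counts.items()))


# Hypothetical vocabulary; the real one is data/20newsgroups.vocab.
vocab = ["engine", "launch", "shuttle"]
vector = vectorize("Shuttle launch delayed shuttle", vocab)
print(vector)
```

A sparse index-to-count mapping keeps the representation small, since any single document uses only a small fraction of the vocabulary.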

Model Training

Run the command below to train the GAN model:

$ python train.py --dataset preprocessed_data --model results

where: dataset is the path to the preprocessed dataset; model is the path to the model output directory

To view TensorBoard graphs and plots, run the command below in a new terminal and open the URL it prints:

$ tensorboard --logdir results/logs/

where: logdir is the path to the results logs directory

To view additional parameters:

$ python train.py --help

Evaluating results

Run the command below to evaluate the retrieval results:

$ python evaluate.py --dataset preprocessed_data --model results

where: dataset is the path to the preprocessed dataset; model is the path to the trained model directory

Extracting document vectors

Run the command below to extract document vectors, which are saved in NumPy text format to the model directory:

$ python vectors.py --dataset preprocessed_data --model results

where: dataset is the path to the preprocessed dataset; model is the path to the trained model directory
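
Once the vectors are extracted, a common use is ranking documents by cosine similarity. The sketch below assumes each line of the saved file holds one document vector as whitespace-separated floats (the layout numpy.savetxt writes by default). The helper names and the sample values are hypothetical, and the output file name is not assumed here. With NumPy installed, numpy.loadtxt would read the same layout directly.

```python
import io
import math


def load_vectors(handle):
    """Parse one vector per non-empty line of whitespace-separated floats."""
    return [[float(x) for x in line.split()] for line in handle if line.strip()]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(query, vectors):
    """Return indices of vectors sorted from most to least similar to query."""
    return sorted(
        range(len(vectors)),
        key=lambda i: cosine(query, vectors[i]),
        reverse=True,
    )


# Made-up stand-in for the saved vectors file: three 2-dimensional vectors.
sample = io.StringIO("1.0 0.0\n0.0 1.0\n0.9 0.1\n")
vectors = load_vectors(sample)
ranking = rank_by_similarity(vectors[0], vectors)
print(ranking)
```

The query document ranks itself first, so a retrieval check would normally skip the first index of the ranking.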

Note: The base code is inspired by https://github.com/AYLIEN/adversarial-document-model
