Coder Social home page Coder Social logo

priest671 / ai-writer_data2doc Goto Github PK

View Code? Open in Web Editor NEW

This project forked from gau820827/ai-writer_data2doc

0.0 2.0 0.0 23.97 MB

PyTorch Implementation of NBA game summary generator.

License: Apache License 2.0

Python 69.88% Lua 29.54% Shell 0.58%

ai-writer_data2doc's Introduction

AI writter - Data2Doc

This is the project for automatically generating summarizatons given NBA game box-score.

Requirements

  1. Python 3.6
  2. PyTorch 0.2

Data set

We use Rotowire dataset for training as in Challenges in Data-to-Document Generation (Wiseman, Shieber, Rush; EMNLP 2017). This dataset consists of (human-written) NBA basketball game summaries aligned with their corresponding box- and line-scores.

Basic Usage

  1. First extract the dataset, using tar -jxvf boxscore-data/rotowire.tar.bz2.
  2. Go to directory train/, using cd train
  3. (Optional) Train the model, using python3 train.py.
  4. Generate some text, using python3 small_evaluate.py

Some pre-trained model files could be found here. Extract them under model/ directory.

  • Sample Generate Summary using BiLSTM encoder
The Chicago Bulls defeated the Orlando Magic , 102 - 92 , at
United Center on Saturday . The Bulls ( 10 - 2 ) were expected
to win this game rather easily and they did n’t disappoint in the
fourth quarter . In fact , they outscored the Magic by 12 points in
the fourth quarter to pull out the win . Defense was key for the
Bulls , as they held the Magic to 37 percent from the field and 25
percent from three - point range . The Magic also dominated the
rebounding , winning that battle , 44 - 32 . The Magic ( 8 - 18 )
had to play this game without Joel Embiid and Nikola Vucevic ,
as they simply did n’t have enough to get out of hand . Jimmy
Butler was the player of the game , as he tallied 23 points , five
rebounds and four assists . Aaron Gordon was the only other
starter with more than 10 points , as he totaled 15 points , five
rebounds and four assists. . . ...

Arguments

The train/train.py accepts the following arguments.

(The default configurations for each argument can be found in train/settings.py.)

  # Parameters for model
  -embed EMBEDDING_SIZE,    the hidden size for embedding,, default = 600
  -lr LR,                   initial learning rate, default = 0.01
  -batch BATCH_SIZE,        batch size, default = 2
  
  # Parameters for model
  -encoder ENCODER_STYLE,   type of encoder NN (LIN, BiLSTM, RNN, BiLSTMMax, HierarchicalRNN,
                            HierarchicalBiLSTM, HierarchicalLIN)
  -decoder DECODER_STYLE,   type of decoder NN (RNN, HierarchcialRNN)
  -copy,                    if apply pointer-generator network(True, False), default = False
 
  # Parameters for training
  -gradclip GRAD_CLIP,      gradient clipping, default = 2
  -pretrain PRETRAIN,       file name of pretrained model (must assign with iternum)
  -iternum ITER_NUM,        file name of pretraiend model (must assign with pretrain)
  -layer LAYER_DEPTH,       the depth of recurrent units, default = 2; no depth for linear units
  -copyplayer COPY_PLAYER,  if include player's information in data, default = False 
  -epoch EPOCH_TIME,        maximum epoch time for training
  -maxlength MAX_LENGTH,    maximum words for each sentence
  -maxsentece MAX_SENTECE,  limit the maximum length for training. Set lower (e.g 5) for faster training speed. If not specify, program will train entire corpus.

  # Parameters for display
  -getloss GET_LOSS,        print out average loss every `GET_LOSS` iterations.
  -epochsave SAVE_MODEL,    save the model every `SAVE_MODEL` epochs.
  -outputfile OUTPUT_FILE,  starting name for saving the model. During training, encoder and decoder would be saved as `[OUTPUT_FILE]_[encoder|decoder]_[iter_time]` every `SAVE_MODEL` epochs.

With above arguments, a variety of configurations could be trained:

python train.py # This will train using default settings in `train/settings.py`

python train.py -embed 300 -lr 0.01 -batch 3 -getloss 20 -encoder HierarchicalRNN 
                -decoder HierarchicalRNN -epochsave 12 -copy True -copyplayer False 
                -gradclip 2 -layer 2 -epoch 3 -outputfile pretrain_copy 
                -pretrain hbilstm -iternum 200
                # -pretrain and -iternum must be specified together
                # the corresponding pretrained model name will be in the format:
                #    [pretrain]_[encoder|decoder|optim]_[iter_num]

python train.py -embed 720 -lr 0.02 -batch 3 -getloss 10 -encoder HierarchicalLIN 
                -decoder HierarchicalRNN -epochsave 5 -copy True -copyplayer True 
                -gradclip 3 -maxsentence 800  -epoch 3

python train.py -embed 512 -lr 0.03 -batch 3 -getloss 10 -encoder BiLSTM 
                -decoder RNN -epochsave 12 -copy True -copyplayer True 
                -gradclip 3 -maxsentence 230 -layer 2 -epoch 3

python train.py -embed 600 -lr 0.02 -batch 3 -getloss 10 -encoder LIN 
                -decoder RNN -epochsave 12 -copy True -copyplayer False 
                -gradclip 5 -maxsentence 800 -layer 2 -epoch 3

More Details

Train

In train/ directory is the part of data2text generation. The files are for this part include:

  • dataprepare.py -- word2index map class, storing the vocabulary and relations
  • model.py -- The file that contains the implementation of several encoder, decoder and embedding model class
  • preprocessing.py -- mainly for read of parse the data
  • train.py -- the file that defines the training processes
  • util.py -- utility functions for time, showing etc.
  • setting.py -- store the hyper-parameter, file location etc.

Evaulate

In evaluate/ directory is the extraction evaluation system, based on Challenges in Data-to-Document Generation (Wiseman, Shieber, Rush; EMNLP 2017) and part of their codes. The files from their repo contains:

  • data_utils.py -- for the data parsing and cleaning
  • extractor.lua -- the main script for relation classifier
  • non_rg_metrics.py -- CS and RO computation

Thanks to the dataset and code from Wiseman et. al.

ai-writer_data2doc's People

Contributors

gau820827 avatar nerokapa avatar o9812 avatar weikaipan avatar yinhsunhuang avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.