Hierarchical Attention Networks

This repository contains an implementation of Hierarchical Attention Networks for Document Classification in keras and another implementation of the same network in tensorflow.

Hierarchical Attention Networks consists of the following parts:

Embedding layer
Word Encoder: word level bi-directional GRU to get rich representation of words
Word Attention:word level attention to get important information in a sentence
Sentence Encoder: sentence level bi-directional GRU to get rich representation of sentences
Sentence Attention: sentence level attention to get important sentence among sentences
Fully Connected layer + Softmax

These models have 2 levels of attention: one at the word level and one at the sentence level thereby allowing the model to pay less or more attention to individual words and sentences accordingly when constructing the represenation of a document.

DataSet:

I have used the IMDB Movies dataset from Kaggle, labeledTrainData.tsv which contains 25000 reviews with labels

Preprocessing on the Data:

I have done minimal preprocessing on the input reviews in the dataset following these basic steps:

Remove html tags
Replace non-ascii characters with a single space
Split each review into sentences

Then I create the character set with a max sentence length of 512 chars and set an upper bound of 15 for the max number of sentences per review. The input X is indexed as (document, sentence, char) and the target y has the corresponding sentiments.

Attention Layer Implementation

Attention mechanism layer which reduces Bi-RNN outputs with Attention vector (adapted from the paper)
Args:
    inputs: The Attention inputs.             
            In case of Bidirectional RNN, this must be a tuple (outputs_fw, outputs_bw) containing 
            the forward and the backward RNN outputs `Tensor`.
                If time_major == False (default),
                    outputs_fw is a `Tensor` shaped:
                    `[batch_size, max_time, cell_fw.output_size]`
                    and outputs_bw is a `Tensor` shaped:
                    `[batch_size, max_time, cell_bw.output_size]`.
                If time_major == True,
                    outputs_fw is a `Tensor` shaped:
                    `[max_time, batch_size, cell_fw.output_size]`
                    and outputs_bw is a `Tensor` shaped:
                    `[max_time, batch_size, cell_bw.output_size]`.
    attention_size: Linear size of the Attention weights.
    time_major: The shape format of the `inputs` Tensors.
        If true, these `Tensors` must be shaped `[max_time, batch_size, depth]`.
        If false, these `Tensors` must be shaped `[batch_size, max_time, depth]`.
        Using `time_major = True` is a bit more efficient because it avoids
        transposes at the beginning and end of the RNN calculation.  However,
        most TensorFlow data is batch-major, so by default this function
        accepts input and emits output in batch-major form.
    return_alphas: Whether to return attention coeef variable along with layer's output.
        Used for visualization purpose.
Returns:
    The Attention output `Tensor`.

maxmarketit / hierarchicalattentionnetworks Goto Github PK