Hierarchical classification of text with capsule networks

Capsule networks have been shown to perform well on structured data in visual inference. This repository makes it possible to apply simple shallow capsule networks to hierarchical multi-label text classification and to compare them with other neural networks, such as CNNs and LSTMs, and with non-neural architectures such as SVMs. For our experiments, we use the established Web of Science (WOS) dataset and introduce a new real-world scenario dataset, the BlurbGenreCollection (BGC).

Our results confirm the hypothesis that capsule networks are especially advantageous for rare events and structurally diverse categories, which we attribute to their ability to combine latent encoded information. Details on the experiments and results as well as an extensive analysis can be found in the following scientific publication:

Rami Aly, Steffen Remus, Chris Biemann (2019): Hierarchical Multi-label Classification of Text with Capsule Networks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, Florence, Italy. Association for Computational Linguistics

The dataset published with this scientific work, the BlurbGenreCollection, consists of book blurbs and their respective hierarchically structured writing genres. The dataset can be downloaded from the Language Technology page of the Universität Hamburg.

If you use the code in this repository, e.g. as a baseline in your experiment or simply want to refer to this work, we kindly ask you to use the following citation:

@inproceedings{aly-etal-2019-hmc-caps,
    title = "Hierarchical Multi-label Classification of Text with Capsule Networks",
    author = {Aly, Rami  and
      Remus, Steffen  and
      Biemann, Chris},
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-2045",
    pages = "323--330"
}

System Requirement

The system was tested on Debian/Ubuntu Linux with a GTX 1080TI and TITAN X.

Installation

Clone repository:

https://github.com/Raldir/BlurbGenreCollection_Classification.git

Install a dataset
1. Either the BlurbGenreCollection-Dataset:
```
cd BlurbGenreCollection_Classification && wget https://fiona.uni-hamburg.de/ca89b3cf/blurbgenrecollectionen.zip && unzip blurbgenrecollectionen.zip -d datasets
```
2. Or install your own Dataset:
  
  To use your own dataset, extend the abstract class loader_abstract with a custom class that loads it, and make sure the return value of each method matches its description. The method load_data_multiLabel() should return a list of three collections: train, dev and test. Each collection is a list of tuples of the form (String, Set of Strings), holding a text and its set of labels.

The method read_relations() only needs to be implemented if a hierarchy exists. It should return two sets: the first contains relation pairs (parent, child) as Strings, and the second contains genres that have neither a parent nor a child. Next, replace the loader class name in line 15 of data_helpers.py with the name of your new loader class. For reference, see loader.py, which loads the BlurbGenreCollection dataset. Finally, read_all_genres caches label co-occurrences in a file to speed up loading; if the dataset changes, adjust the file name so that the correct co-occurrences are loaded (only relevant when a label hierarchy is used).
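The return shapes described above can be illustrated with a minimal sketch. The base class `LoaderAbstractSketch` is a hypothetical stand-in for loader_abstract (whose exact definition is not shown here), and `ToyLoader` with its in-memory data is invented purely for illustration; only the method names and return shapes follow the description.

```python
from abc import ABC, abstractmethod
from typing import List, Set, Tuple

Sample = Tuple[str, Set[str]]  # (text, set of labels)


class LoaderAbstractSketch(ABC):
    """Hypothetical stand-in for the repository's loader_abstract class."""

    @abstractmethod
    def load_data_multiLabel(self) -> List[List[Sample]]:
        """Return [train, dev, test]; each is a list of (text, labels) tuples."""

    def read_relations(self) -> Tuple[Set[Tuple[str, str]], Set[str]]:
        """Return ((parent, child) pairs, labels with neither parent nor child)."""
        return set(), set()


class ToyLoader(LoaderAbstractSketch):
    """Hypothetical loader serving a tiny in-memory dataset."""

    def load_data_multiLabel(self):
        train = [("A wizard attends a school of magic.", {"Fiction", "Fantasy"})]
        dev = [("A detective hunts a killer.", {"Fiction", "Mystery"})]
        test = [("How to cook pasta at home.", {"Cooking"})]
        return [train, dev, test]

    def read_relations(self):
        relations = {("Fiction", "Fantasy"), ("Fiction", "Mystery")}
        singletons = {"Cooking"}  # no parent and no child
        return relations, singletons


train, dev, test = ToyLoader().load_data_multiLabel()
relations, singletons = ToyLoader().read_relations()
```

A real loader would read the texts and labels from disk instead of hard-coding them; the shapes of the returned values are the part that matters.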

Install project packages:

pip install -r code/requirements.txt

Further packages needed:

pip install stop-words
python -m spacy download en
python -m spacy download en_core_web_sm

Install word embeddings for the English language, e.g.:

mkdir resources && cd resources && wget https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec

We recommend putting them into a ./resources folder. If you use different embeddings or a different location, please adjust the path and filename accordingly.
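For orientation, a fastText `.vec` file is plain text: the first line holds the vocabulary size and the vector dimension, and every following line holds a token followed by its vector components. The sketch below reads such a file into an embedding matrix. The function names, the toy data and the convention of reserving row 0 for padding are assumptions made for illustration, not the repository's actual code.

```python
import io


def load_vec(handle, vocabulary):
    """Read fastText .vec text and keep only the tokens found in vocabulary."""
    header = handle.readline().split()
    dim = int(header[1])  # header line: "<vocabulary size> <dimension>"
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if parts[0] in vocabulary and len(parts) == dim + 1:
            vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors, dim


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; unknown rows stay zero."""
    # Assumption: index 0 is reserved for padding, so the matrix has one extra row.
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix


# Toy stand-in for resources/wiki.en.vec
sample = io.StringIO("3 4\nbook 0.1 0.2 0.3 0.4\nnovel 0.5 0.6 0.7 0.8\nthe 0 0 0 0\n")
word_index = {"book": 1, "novel": 2, "blurb": 3}
vectors, dim = load_vec(sample, set(word_index))
matrix = build_embedding_matrix(word_index, vectors, dim)
```

Filtering by vocabulary while reading keeps memory use low, which matters because the full English file contains a very large number of tokens.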

Hierarchical Multi-label Classification

Running main.py in a train mode executes the complete pipeline: loading the data, preprocessing, and training the classifier. The preprocessed data is stored in the resources folder to save time in subsequent runs. The same applies to the embedding matrix, which is computed once and stored for a fixed sequence length.
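The caching behaviour described above follows a common load-or-compute pattern, sketched here with pickle. The helper name `load_or_compute`, the file name and the toy preprocessing function are hypothetical; the repository may store its intermediate results differently.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the object cached at path, computing and storing it on a miss."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as handle:
        pickle.dump(result, handle)
    return result


calls = []


def preprocess():
    calls.append(1)  # count how often the expensive step actually runs
    return [["a", "book", "blurb"]]


cache_dir = tempfile.mkdtemp()  # stands in for the resources folder
# Hypothetical file name that encodes the sequence length the cache was built for
cache_path = os.path.join(cache_dir, "preprocessed_seq100.pkl")
first = load_or_compute(cache_path, preprocess)
second = load_or_compute(cache_path, preprocess)  # served from the cache
```

Encoding the settings a cache depends on (here the sequence length) in the file name avoids silently reusing stale data after a setting changes.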

| Option | Description | Default |
| --- | --- | --- |
| --mode | Run mode, e.g. train and test on the validation set (train_validation) or on the test set (train_test) | train_validation |
| --classifier | Select between CNN, LSTM and capsule | capsule |
| --lang | Dataset to be used | EN |
| --level | Maximum genre level of the hierarchy | 1 |

The level setting can only be used if the program is provided with a hierarchy; otherwise, the networks treat the data as a traditional multi-label classification task.
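One way to picture the effect of a maximum level is to derive each label's depth from the (parent, child) relation pairs and drop deeper labels. The sketch below assumes that every label has at most one parent, that the hierarchy has no cycles, and that top-level genres count as level 1. The function names are hypothetical and the repository's own implementation may differ.

```python
def label_levels(relations, singletons):
    """Map each label to its depth; roots and singletons are level 1."""
    # Assumption: each child has exactly one parent (tree-shaped hierarchy).
    parent_of = {child: parent for parent, child in relations}
    labels = set(singletons)
    for parent, child in relations:
        labels.update((parent, child))

    def depth(label):
        level = 1
        while label in parent_of:  # walk up until a root is reached
            label = parent_of[label]
            level += 1
        return level

    return {label: depth(label) for label in labels}


def restrict_to_level(samples, levels, max_level):
    """Drop labels deeper than max_level from every (text, labels) tuple."""
    return [
        (text, {label for label in labels if levels.get(label, 1) <= max_level})
        for text, labels in samples
    ]


relations = {("Fiction", "Fantasy"), ("Fantasy", "Epic Fantasy")}
levels = label_levels(relations, {"Cooking"})
samples = [("A long quest begins.", {"Fiction", "Fantasy", "Epic Fantasy"})]
restricted = restrict_to_level(samples, levels, max_level=2)
```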

General Settings:

| Option | Description | Default |
| --- | --- | --- |
| --sequence_length | Maximum input sequence length of the text | 100 |
| --epochs | Number of epochs to train the classifier | 60 |
| --use_statc | Whether the embedding layer should not be trainable | False |
| --use_early_stop | Use early stopping during training | False |
| --batch_size | Minibatch size | 32 |
| --learning_rate | Learning rate of the classifier | 0.0005 |
| --learning_decay | Learning rate decay; 1 means no decay, 0 means maximum decay | 1 |
| --init_layer | Whether to initialize the final layer with label co-occurrence | False |
| --iterations | Number of classifiers to train; only relevant for train_n_models_final | 3 |
| --activation_th | Activation threshold of the final layer | 0.5 |
| --adjust_hierarchy | Postprocessing hierarchy correction | None |
| --correction_th | Threshold for the threshold-label correction method | False |

Please note that --init_layer, --correction_th and --adjust_hierarchy can only be used if the hierarchy of the dataset is provided as well.
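To illustrate how an activation threshold and a hierarchy correction can interact, the sketch below first keeps all labels whose score reaches the threshold and then adds any missing ancestors of the predicted labels. Adding ancestors is only one plausible correction strategy, chosen here for illustration. The function names, the use of `>=` for the threshold and the single-parent assumption are not taken from the repository.

```python
def predict_labels(scores, threshold=0.5):
    """Keep every label whose activation reaches the threshold."""
    return {label for label, score in scores.items() if score >= threshold}


def correct_by_extension(predicted, relations):
    """Add all ancestors of each predicted label (assumes one parent per label)."""
    parent_of = {child: parent for parent, child in relations}
    corrected = set(predicted)
    for label in predicted:
        while label in parent_of:  # climb towards the root
            label = parent_of[label]
            corrected.add(label)
    return corrected


relations = {("Fiction", "Fantasy"), ("Fantasy", "Epic Fantasy")}
scores = {"Fiction": 0.3, "Fantasy": 0.2, "Epic Fantasy": 0.9, "Cooking": 0.1}
predicted = predict_labels(scores, threshold=0.5)
corrected = correct_by_extension(predicted, relations)
```

In this toy case the raw prediction contains only the leaf genre, which is inconsistent with the hierarchy; the correction restores a consistent label set by including its parent and grandparent.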

Capsule settings:

| Option | Description | Default |
| --- | --- | --- |
| --dense_capsule_dim | Dimensionality of capsules on final layer | 16 |
| --n_channels | Number of capsules per feature map | 50 |

LSTM settings:

| Option | Description | Default |
| --- | --- | --- |
| --lstm_units | Number of units in the LSTM | 700 |

CNN settings:

| Option | Description | Default |
| --- | --- | --- |
| --num_filters | Number of filters for each window size | 500 |

Example:
python3.5 main.py --mode train_validation --classifier cnn --lang EN --sequence_length 100 --learning_rate 0.001 --learning_decay 1
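To show how the example command maps onto the documented options, here is a minimal argparse sketch covering a subset of them. It is not the parser used in main.py: the option names and defaults are copied from the tables above, while the argument types are assumptions.

```python
import argparse


def build_parser():
    """Sketch of a parser for a subset of the documented options."""
    parser = argparse.ArgumentParser(description="Sketch of the documented options")
    parser.add_argument("--mode", default="train_validation")
    parser.add_argument("--classifier", default="capsule")
    parser.add_argument("--lang", default="EN")
    parser.add_argument("--level", type=int, default=1)
    parser.add_argument("--sequence_length", type=int, default=100)
    parser.add_argument("--epochs", type=int, default=60)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--learning_rate", type=float, default=0.0005)
    parser.add_argument("--learning_decay", type=float, default=1.0)
    parser.add_argument("--activation_th", type=float, default=0.5)
    return parser


example = (
    "--mode train_validation --classifier cnn --lang EN "
    "--sequence_length 100 --learning_rate 0.001 --learning_decay 1"
)
args = build_parser().parse_args(example.split())
```

Options that the command omits, such as --epochs and --batch_size, fall back to their defaults.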

For further inquiries: [email protected]

ValueError: Can not do batch_dot

Hi,

I'm trying to run the capsule network classifier using the command below:
python main.py --mode train_validation --classifier capsule --lang EN --sequence_length 100 --learning_rate 0.001 --learning_decay 1

However, the create_model method throws an exception when constructing the model. The traceback is as follows:

Traceback (most recent call last):
  File "main.py", line 400, in <module>
    main()
  File "main.py", line 299, in main
    run()
  File "main.py", line 321, in run
    model = create_model(dev = True, preload = False)
  File "main.py", line 371, in create_model
    return model_capsule(dev, preload)
  File "main.py", line 258, in model_capsule
    args.dense_capsule_dim, args.n_channels, 3, dev)
  File "/home/daan_vandennest/git/BlurbGenreCollection_Classification/code/networks.py", line 50, in create_model_capsule
    name='digitcaps')(primarycaps)
  File "/home/daan_vandennest/miniconda3/envs/capsnet/lib/python3.6/site-packages/keras/engine/base_layer.py", line 451, in __call__
    output = self.call(inputs, **kwargs)
  File "/home/daan_vandennest/git/BlurbGenreCollection_Classification/code/capsulelayers.py", line 119, in call
    b += K.batch_dot(outputs, inputs_hat, [2, 3])
  File "/home/daan_vandennest/miniconda3/envs/capsnet/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 1261, in batch_dot
    'y.shape[%d] (%d != %d).' % (axes[0], axes[1], d1, d2))
ValueError: Can not do batch_dot on inputs with shapes (None, 131, 131, 2805, 16) and (None, 131, None, 2805, 16) with axes=[2, 3]. x.shape[2] != y.shape[3] (131 != 2805).

I've made no changes to the code. The only difference is that I'm using plain tensorflow rather than tensorflow-gpu.
Do you have any idea what might be causing this?

For completeness' sake I've added the output of pip freeze below:

absl-py==0.9.0
astor==0.8.1
beautifulsoup4==4.6.0
bleach==1.5.0
blis==0.2.4
boto==2.49.0
boto3==1.12.14
botocore==1.15.14
certifi==2019.11.28
chardet==3.0.4
cycler==0.10.0
cymem==2.0.3
cysignals==1.10.2
Cython==0.29.15
decorator==4.4.2
docutils==0.15.2
en-core-web-sm==2.1.0
future==0.18.2
gast==0.3.3
gensim==3.8.0
GPy==1.9.5
GPyOpt==1.2.5
graphviz==0.8.3
grpcio==1.27.2
h5py==2.8.0
html5lib==0.9999999
idna==2.9
jmespath==0.9.5
Keras==2.2.5
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
kiwisolver==1.1.0
Markdown==3.2.1
matplotlib==2.2.2
murmurhash==1.0.2
numpy==1.16.5
pandas==0.23.4
paramz==0.9.4
pathlib==1.0.1
pipenv==2018.11.26
plac==0.9.6
plumbum==1.6.6
preshed==2.0.1
protobuf==3.11.3
pydot==1.2.3
pyfasttext==0.4.5
pyparsing==2.4.6
python-dateutil==2.8.1
pytz==2019.3
PyYAML==5.3
regex==2017.4.5
requests==2.23.0
s3transfer==0.3.3
scikit-learn==0.19.1
scipy==1.1.0
six==1.14.0
smart-open==1.9.0
spacy==2.1.8
spyder==2.3.8
srsly==1.0.2
stop-words==2015.2.23.1
tensorboard==1.7.0
tensorflow==1.7.0
termcolor==1.1.0
thinc==7.0.8
tqdm==4.43.0
treetaggerwrapper==2.2.4
ujson==1.35
urllib3==1.25.8
virtualenv-clone==0.5.3
wasabi==0.6.0
Werkzeug==1.0.0


