
uhh-lt / blurbgenrecollection-hmc


Hierarchical multi-label text classification of the BlurbGenreCollection using capsule networks.

Home Page: https://www.aclweb.org/anthology/P19-2045/

License: Apache License 2.0

Jupyter Notebook 19.29% Python 80.71%
capsule-networks hierarchy datset neural-networks multi-label-classification acl2019 text-classification cnn lstm keras

blurbgenrecollection-hmc's Introduction

Hierarchical classification of text with capsule networks

Capsule networks have been shown to perform well on structured data in the area of visual inference. This repository makes it possible to apply simple shallow capsule networks to hierarchical multi-label text classification and to compare them with traditional neural networks, such as CNNs and LSTMs, and with non-neural architectures such as SVMs. For our experiments, we use the established Web of Science (WOS) dataset and introduce a new real-world scenario dataset, the BlurbGenreCollection (BGC).

Our results confirm the hypothesis that capsule networks are especially advantageous for rare events and structurally diverse categories, which we attribute to their ability to combine latent encoded information. Details on the experiments and results as well as an extensive analysis can be found in the following scientific publication:

Rami Aly, Steffen Remus, Chris Biemann (2019): Hierarchical Multi-label Classification of Text with Capsule Networks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, Florence, Italy. Association for Computational Linguistics

The dataset published with this scientific work, namely BlurbGenreCollection, consists of book blurbs and their respective hierarchically structured writing genres. The dataset can be downloaded from the Language Technology page of the Universität Hamburg.

If you use the code in this repository, e.g. as a baseline in your experiment or simply want to refer to this work, we kindly ask you to use the following citation:

@inproceedings{aly-etal-2019-hmc-caps,
    title = "Hierarchical Multi-label Classification of Text with Capsule Networks",
    author = {Aly, Rami  and
      Remus, Steffen  and
      Biemann, Chris},
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-2045",
    pages = "323--330"
}

System Requirements

The system was tested on Debian/Ubuntu Linux with a GTX 1080TI and TITAN X.

Installation

  1. Clone repository:
git clone https://github.com/Raldir/BlurbGenreCollection_Classification.git
  1. Install a dataset

    1. Either the BlurbGenreCollection-Dataset:

      cd BlurbGenreCollection_Classification && wget https://fiona.uni-hamburg.de/ca89b3cf/blurbgenrecollectionen.zip && unzip blurbgenrecollectionen.zip -d datasets
      
    2. Or install your own Dataset:

      To load your own dataset, extend the abstract class loader_abstract with a custom class and make each method return values that match its description. The method load_data_multiLabel() should return a list of three collections: train, dev and test. Each collection is a list of tuples (String, Set of Strings) holding a text and its set of labels.

The method read_relations() only needs to be implemented if a hierarchy exists. It should return two sets: the first consists of relation pairs (parent, child) as Strings, and the second contains genres that have neither a parent nor a child. Then replace the loader class referenced in data_helpers.py, line 15, with the name of your new loader class. For reference, see loader.py, which loads the BlurbGenreCollection dataset. Finally, read_all_genres stores co_occurences in a file to speed up loading. If the dataset changes, adjust the file name so that the correct co_occurences are loaded (only relevant when a label hierarchy is used).

  1. Install project packages:
pip install -r code/requirements.txt
  1. Further packages needed:
pip install stop-words
python -m spacy download en
python -m spacy download en_core_web_sm
  1. Install word embeddings for the English language, e.g.:
mkdir resources && cd resources && wget https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec

We recommend putting them into a ./resources folder. If you use different embeddings or a different path, please adjust the path and filename accordingly.
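
The custom loader described in the dataset step above can be sketched as follows. This is a minimal illustration, not the repository's code: `LoaderAbstractStandIn` is a hypothetical stand-in for loader_abstract (whose real signatures may differ), and `ToyLoader` with its in-memory samples is invented for the example. Only the method names and return shapes are taken from the description above.

```python
from abc import ABC, abstractmethod
from typing import List, Set, Tuple

Sample = Tuple[str, Set[str]]  # (text, set of labels)


class LoaderAbstractStandIn(ABC):
    """Hypothetical stand-in for the repository's loader_abstract class."""

    @abstractmethod
    def load_data_multiLabel(self) -> List[List[Sample]]:
        """Return [train, dev, test], each a list of (text, labels) tuples."""

    @abstractmethod
    def read_relations(self) -> Tuple[Set[Tuple[str, str]], Set[str]]:
        """Return ((parent, child) pairs, labels without parent or child)."""


class ToyLoader(LoaderAbstractStandIn):
    """Hypothetical loader that serves a tiny in-memory dataset."""

    def load_data_multiLabel(self) -> List[List[Sample]]:
        train = [
            ("A detective hunts a killer in Oslo.", {"Fiction", "Mystery"}),
            ("A guide to sourdough baking.", {"Nonfiction", "Cooking"}),
        ]
        dev = [("A wizard school under siege.", {"Fiction", "Fantasy"})]
        test = [("Essays on modern poetry.", {"Poetry"})]
        return [train, dev, test]

    def read_relations(self) -> Tuple[Set[Tuple[str, str]], Set[str]]:
        relations = {
            ("Fiction", "Mystery"),
            ("Fiction", "Fantasy"),
            ("Nonfiction", "Cooking"),
        }
        singletons = {"Poetry"}  # neither a parent nor a child
        return relations, singletons


loader = ToyLoader()
train, dev, test = loader.load_data_multiLabel()
relations, singletons = loader.read_relations()
```

A real loader would read the splits from disk instead of hard-coding them; the return shapes are the part that matters for the rest of the pipeline.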

Hierarchical Multi-label Classification

Running main.py in train mode runs the complete pipeline: loading the data, preprocessing, and training the classifier. The preprocessed data is stored in the resources folder to save time in subsequent runs. The same applies to the embedding matrix, which is computed once and stored for a fixed sequence length.
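
The store-once, reuse-later behaviour described above can be illustrated with a small caching helper. This is a generic sketch using pickle, not the repository's implementation: `load_or_compute`, `expensive_preprocessing`, and the cache file name are all hypothetical.

```python
import os
import pickle
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_compute(path: str, compute: Callable[[], T]) -> T:
    """Return the pickled object at `path`, computing and storing it on first use."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as handle:
        pickle.dump(result, handle)
    return result


calls = []


def expensive_preprocessing():
    """Hypothetical placeholder for tokenization and cleaning."""
    calls.append(1)
    return [["a", "detective", "story"], ["sourdough", "baking"]]


cache_dir = tempfile.mkdtemp()
# Hypothetical file name; encoding the sequence length keeps caches for
# different settings apart.
cache_file = os.path.join(cache_dir, "preprocessed_seq100.pkl")

first = load_or_compute(cache_file, expensive_preprocessing)   # computes and stores
second = load_or_compute(cache_file, expensive_preprocessing)  # reads the cache
```

With a cache of this kind, a stored file has to be deleted or renamed whenever the dataset or the sequence length changes, otherwise stale data is loaded.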

| Option | Description | Default |
| --- | --- | --- |
| --mode | Mode, e.g. train and test on the validation set, or test on the test set (train_test) | train_validation |
| --classifier | Select between CNN, LSTM and capsule | capsule |
| --lang | Dataset to be used | EN |
| --level | Max genre level of the hierarchy | 1 |

The level setting can only be used if the program is provided with a hierarchy, otherwise the networks handle the data as a traditional multi-label classification task.
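
What restricting labels to a maximum hierarchy level means can be sketched as below. The function names are hypothetical and the code assumes the hierarchy is a forest (each label has at most one parent, no cycles), with top-level genres at level 1. The repository's own handling of --level may differ.

```python
from typing import Dict, Iterable, Set, Tuple


def label_levels(
    relations: Iterable[Tuple[str, str]], singletons: Iterable[str]
) -> Dict[str, int]:
    """Map each label to its depth: roots and singletons are level 1."""
    relations = list(relations)
    parent_of = {child: parent for parent, child in relations}
    labels = {label for pair in relations for label in pair} | set(singletons)

    def level(label: str) -> int:
        depth = 1
        while label in parent_of:  # walk up to the root
            label = parent_of[label]
            depth += 1
        return depth

    return {label: level(label) for label in labels}


def restrict_to_level(labels: Set[str], levels: Dict[str, int], max_level: int) -> Set[str]:
    """Keep only the labels at or above the given hierarchy level."""
    return {label for label in labels if levels[label] <= max_level}


relations = {("Fiction", "Mystery"), ("Mystery", "Noir")}
levels = label_levels(relations, singletons={"Poetry"})
kept = restrict_to_level({"Fiction", "Mystery", "Noir"}, levels, max_level=2)
# kept == {"Fiction", "Mystery"}
```

Without (parent, child) relations there is nothing to compute levels from, which is why the setting only applies when a hierarchy is available.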

General Settings:

| Option | Description | Default |
| --- | --- | --- |
| --sequence_length | Maximum input sequence length of the text | 100 |
| --epochs | Number of epochs to train the classifier | 60 |
| --use_statc | Whether the embedding layer should not be trainable | False |
| --use_early_stop | Uses early stopping during training | False |
| --batch_size | Minibatch size | 32 |
| --learning_rate | The learning rate of the classifier | 0.0005 |
| --learning_decay | Whether to use learning rate decay; 1 indicates no decay, 0 maximum decay | 1 |
| --init_layer | Whether to initialize the final layer with label co-occurrence | False |
| --iterations | Number of classifiers to train; only relevant for train_n_models_final | 3 |
| --activation_th | Activation threshold of the final layer | 0.5 |
| --adjust_hierarchy | Postprocessing hierarchy correction | None |
| --correction_th | Threshold for the threshold-label correction method | False |

Please note that --init_layer, --correction_th and --adjust_hierarchy are only usable if the hierarchy of the dataset is provided as input as well.
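
To make the roles of the activation threshold and of a hierarchy correction concrete, here is a minimal sketch. It is an assumption-laden illustration: `predict_labels` and `add_ancestors` are hypothetical names, the threshold comparison (>=) is assumed, and adding all ancestors of a predicted label is just one possible correction, not necessarily one of the methods selectable with --adjust_hierarchy.

```python
from typing import Dict, Iterable, Set, Tuple


def predict_labels(scores: Dict[str, float], activation_th: float = 0.5) -> Set[str]:
    """Keep every label whose score reaches the activation threshold."""
    return {label for label, score in scores.items() if score >= activation_th}


def add_ancestors(predicted: Set[str], relations: Iterable[Tuple[str, str]]) -> Set[str]:
    """Make a prediction hierarchy-consistent by adding all ancestors."""
    parent_of = {child: parent for parent, child in relations}
    corrected = set(predicted)
    for label in predicted:
        while label in parent_of:  # climb towards the root
            label = parent_of[label]
            corrected.add(label)
    return corrected


relations = {("Fiction", "Mystery"), ("Mystery", "Noir")}
scores = {"Fiction": 0.41, "Mystery": 0.30, "Noir": 0.83, "Poetry": 0.07}

raw = predict_labels(scores)                # {"Noir"}
consistent = add_ancestors(raw, relations)  # {"Noir", "Mystery", "Fiction"}
```

The raw prediction contains a child without its parents, which is inconsistent with the hierarchy. A correction step of this kind needs the (parent, child) relations, so it cannot run on a dataset without one.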

Capsule settings:

| Option | Description | Default |
| --- | --- | --- |
| --dense_capsule_dim | Dimensionality of capsules on final layer | 16 |
| --n_channels | Number of capsules per feature map | 50 |

LSTM settings:

| Option | Description | Default |
| --- | --- | --- |
| --lstm_units | Number of units in the LSTM | 700 |

CNN settings:

| Option | Description | Default |
| --- | --- | --- |
| --num_filters | Number of filters for each window size | 500 |

Example:
python3.5 main.py --mode train_validation --classifier cnn --lang EN --sequence_length 100 --learning_rate 0.001 --learning_decay 1

For further inquiries: [email protected]

blurbgenrecollection-hmc's Contributors

raldir


blurbgenrecollection-hmc's Issues

Add compatibility to TensorFlow versions >=1.13.1

When I run the code following your tutorial, I run into a problem. Can you tell me which TensorFlow version the project uses?

WARNING:
Traceback (most recent call last):
  File "main.py", line 400, in <module>
    main()
  File "main.py", line 299, in main
    run()
  File "main.py", line 321, in run
    model = create_model(dev = True, preload = False)
  File "main.py", line 371, in create_model
    return model_capsule(dev, preload)
  File "main.py", line 258, in model_capsule
    args.dense_capsule_dim, args.n_channels, 3, dev)
  File "/code/BlurbGenreCollection_Classification/code/networks.py", line 37, in create_model_capsule
    input = inputs, use_static = use_static, voc = vocabulary, lang = language, dev = dev)
  File "/code/BlurbGenreCollection_Classification/code/networks.py", line 238, in pre_embedding
    trainable= trainable)(input)
  File "/home/prozx/anaconda3/lib/python3.6/site-packages/keras/engine/base_layer.py", line 430, in __call__
    self.set_weights(self._initial_weights)
  File "/home/prozx/anaconda3/lib/python3.6/site-packages/keras/engine/base_layer.py", line 1051, in set_weights
    'provided weight shape ' + str(w.shape))
ValueError: Layer weight shape (45, 300) not compatible with provided weight shape (100288, 300)

CompQ_Loader

I ran your model but encountered an error. Can you provide this source code?

  File "main.py", line 7, in <module>
    from data_helpers import load_data, extract_hierarchies, remove_genres_not_level
  File "/home/eric/Documents/Experiments/BlurbGenreCollection_Classification/code/data_helpers.py", line 11, in <module>
    from comp_questions_loader import CompQ_Loader
ModuleNotFoundError: No module named 'comp_questions_loader'

ValueError: Can not do batch_dot

Hi,

I'm trying to run the capsulenet classifier using the command below:
python main.py --mode train_validation --classifier capsule --lang EN --sequence_length 100 --learning_rate 0.001 --learning_decay 1

However, the create_model method throws an exception when constructing the model. The traceback is as follows:

Traceback (most recent call last):
  File "main.py", line 400, in <module>
    main()
  File "main.py", line 299, in main
    run()
  File "main.py", line 321, in run
    model = create_model(dev = True, preload = False)
  File "main.py", line 371, in create_model
    return model_capsule(dev, preload)
  File "main.py", line 258, in model_capsule
    args.dense_capsule_dim, args.n_channels, 3, dev)
  File "/home/daan_vandennest/git/BlurbGenreCollection_Classification/code/networks.py", line 50, in create_model_capsule
    name='digitcaps')(primarycaps)
  File "/home/daan_vandennest/miniconda3/envs/capsnet/lib/python3.6/site-packages/keras/engine/base_layer.py", line 451, in __call__
    output = self.call(inputs, **kwargs)
  File "/home/daan_vandennest/git/BlurbGenreCollection_Classification/code/capsulelayers.py", line 119, in call
    b += K.batch_dot(outputs, inputs_hat, [2, 3])
  File "/home/daan_vandennest/miniconda3/envs/capsnet/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 1261, in batch_dot
    'y.shape[%d] (%d != %d).' % (axes[0], axes[1], d1, d2))
ValueError: Can not do batch_dot on inputs with shapes (None, 131, 131, 2805, 16) and (None, 131, None, 2805, 16) with axes=[2, 3]. x.shape[2] != y.shape[3] (131 != 2805).

I've made no changes to the code. The only difference is that I'm not using tensorflow-gpu, but plain tensorflow.
Do you have any idea what might be causing this?

For completeness' sake I've added the output of pip freeze below:

absl-py==0.9.0
astor==0.8.1
beautifulsoup4==4.6.0
bleach==1.5.0
blis==0.2.4
boto==2.49.0
boto3==1.12.14
botocore==1.15.14
certifi==2019.11.28
chardet==3.0.4
cycler==0.10.0
cymem==2.0.3
cysignals==1.10.2
Cython==0.29.15
decorator==4.4.2
docutils==0.15.2
en-core-web-sm==2.1.0
future==0.18.2
gast==0.3.3
gensim==3.8.0
GPy==1.9.5
GPyOpt==1.2.5
graphviz==0.8.3
grpcio==1.27.2
h5py==2.8.0
html5lib==0.9999999
idna==2.9
jmespath==0.9.5
Keras==2.2.5
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
kiwisolver==1.1.0
Markdown==3.2.1
matplotlib==2.2.2
murmurhash==1.0.2
numpy==1.16.5
pandas==0.23.4
paramz==0.9.4
pathlib==1.0.1
pipenv==2018.11.26
plac==0.9.6
plumbum==1.6.6
preshed==2.0.1
protobuf==3.11.3
pydot==1.2.3
pyfasttext==0.4.5
pyparsing==2.4.6
python-dateutil==2.8.1
pytz==2019.3
PyYAML==5.3
regex==2017.4.5
requests==2.23.0
s3transfer==0.3.3
scikit-learn==0.19.1
scipy==1.1.0
six==1.14.0
smart-open==1.9.0
spacy==2.1.8
spyder==2.3.8
srsly==1.0.2
stop-words==2015.2.23.1
tensorboard==1.7.0
tensorflow==1.7.0
termcolor==1.1.0
thinc==7.0.8
tqdm==4.43.0
treetaggerwrapper==2.2.4
ujson==1.35
urllib3==1.25.8
virtualenv-clone==0.5.3
wasabi==0.6.0
Werkzeug==1.0.0
 
