This repository is my research project and also a study of TensorFlow and deep learning.
The main objective of the project is to solve the hierarchical multi-label text classification (HMC) problem. Unlike flat multi-label text classification, HMC assigns each instance (object) to one or more paths in the class hierarchy.
- Python 3.6
- TensorFlow 1.8+
- Numpy
- Gensim
Many real-world applications involve hierarchical multi-label classification and organize their data hierarchically: classes are specialized into subclasses or grouped into superclasses. This structure exposes the characteristics of the data and offers a multidimensional perspective on the problem.
Most types of electronic documents (e.g. web pages, digital libraries, patents, and e-mails) are associated with one or more categories, and these categories are stored hierarchically in a tree or Directed Acyclic Graph (DAG).
The figure shows an example of predefined labels in hierarchical multi-label classification of patent documents.
- Documents are shown as colored rectangles, labels as rounded rectangles.
- Circles in the rounded rectangles indicate that the corresponding document has been assigned the label.
- Arrows indicate hierarchical structure between labels.
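To make the idea of label paths concrete, here is a minimal sketch of how a document's labels can be expanded into their hierarchy paths and grouped by level. It is an illustration only, not the code used in this repository: the `PARENT` map, the label names, and the helper functions `path_to_root` and `labels_by_level` are all hypothetical, and the sketch assumes a tree (one parent per label). A DAG would need a list of parents per label.

```python
from typing import Dict, List, Optional, Set

# Hypothetical child -> parent map for a small patent-style label tree.
# None marks a top-level label. The label names are illustrative only.
PARENT: Dict[str, Optional[str]] = {
    "A": None,
    "A01": "A",
    "A01B": "A01",
    "G": None,
    "G06": "G",
    "G06F": "G06",
    "G06N": "G06",
}


def path_to_root(label: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the path from the top-level label down to `label`."""
    path = []
    current: Optional[str] = label
    while current is not None:
        path.append(current)
        current = parent[current]
    return path[::-1]


def labels_by_level(
    leaf_labels: List[str], parent: Dict[str, Optional[str]]
) -> List[List[str]]:
    """Collect the labels of all paths, grouped by depth in the hierarchy."""
    levels: List[Set[str]] = []
    for leaf in leaf_labels:
        for depth, label in enumerate(path_to_root(leaf, parent)):
            if depth == len(levels):
                levels.append(set())
            levels[depth].add(label)
    return [sorted(level) for level in levels]


if __name__ == "__main__":
    # One document assigned to three leaf labels, i.e. three hierarchy paths.
    doc_labels = ["A01B", "G06F", "G06N"]
    for leaf in doc_labels:
        print(" -> ".join(path_to_root(leaf, PARENT)))
    print(labels_by_level(doc_labels, PARENT))
```

Grouping labels by level is one common way to prepare targets for a model that predicts each level of the hierarchy separately. Whether that matches your setup depends on your data format.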
See the data format in the data folder, which includes the sample data files.
You can use the jieba package if you are going to work with Chinese text data.
This repository can be used with other text classification datasets in two ways:
- Convert your dataset into the same format as the sample.
- Modify the data preprocessing code in data_helpers.py.
Which approach fits best depends on your data and task.
You can pre-train your word vectors (based on your corpus) in several ways:
- Use the gensim package to pre-train word vectors.
- Use the GloVe tools to pre-train word vectors.
- Use a fastText network to pre-train word vectors.
References:
黄威,Randolph
SCU SE Bachelor; USTC CS Master
Email: [email protected]
My Blog: randolph.pro
LinkedIn: randolph's linkedin