Investigating Knowledge Injection Approaches for Field of Research Classification of Scholarly Articles

Description

This repository holds the code for my master's thesis project, which investigates knowledge injection approaches for classifying scholarly articles into research fields. The full thesis can be accessed here for more detailed information.

The models directory contains several models, each using a different set of features from scholarly articles:

Titles + abstracts
Authors
Publishers
Full metadata

The models also differ in how they semantically represent fields of research:

Categorical (baseline)
Using taxonomy labels (the taxonomy used is https://orkg.org/fields)
Linking labels to DBpedia entities and using the text under rdfs:label + rdfs:comment
Linking labels to DBpedia entities and using knowledge graph embeddings (pre-trained embeddings from https://zenodo.org/records/6384728)

Dataset

Download pre-prepared dataset

All data required for running the classifiers is available for download at: https://zenodo.org/records/10245830. After downloading, save all .pt files under /data/classifier so that the models can be trained and tested.
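
Before training, it can help to confirm that the downloaded files ended up where the models expect them. The following is a minimal sketch using only the standard library; the helper name list_pt_files is hypothetical and not part of this repository, and the directory is passed in explicitly rather than assumed.

```python
from pathlib import Path


def list_pt_files(data_dir):
    """Return the sorted names of all .pt files directly under data_dir."""
    directory = Path(data_dir)
    if not directory.is_dir():
        raise FileNotFoundError(f"Data directory not found: {directory}")
    # Only files ending in .pt are reported; anything else is ignored.
    return sorted(p.name for p in directory.glob("*.pt"))
```

Call list_pt_files with the directory where the .pt files were saved; an empty list means the download was placed somewhere else.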

Construct dataset

This repository also contains the code for creating the data linked above (including linking ORKG labels to DBpedia entities) under the data_prep directory. The data is prepared from the nfdi4ds dataset for the field of research classification (FoRC) shared task. The code for creating this dataset can be found here. A link to download the dataset can be provided in order to run the steps below.

Link the ORKG taxonomy to DBpedia entities:

python data_prep/entity_linking/entity_linking.py

Create KGEs of taxonomy labels:

Note that this step relies on a pre-trained DBpedia embeddings dataset from Zenodo (https://zenodo.org/records/6384728) and therefore requires sufficient disk space. The download takes ca. 3 hours, and the code takes ca. 1 hour to run and produce the embeddings. Before running the code, download the dataset from Zenodo by running zenodo_get -d '10.5281/zenodo.6384728', then save it under /data/embeddings.zip.

python data_prep/entity_embeddings/get_kges_pretrained.py
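
Since the download is long and the archive is large, a quick sanity check on the saved file can save a failed run. The sketch below is an illustration only, with a hypothetical helper name; it verifies that the file is a readable ZIP archive and reports how many bytes its contents occupy once extracted, which gives a rough idea of the disk space needed.

```python
import zipfile
from pathlib import Path


def inspect_embeddings_archive(archive_path):
    """Return (member_count, total_uncompressed_bytes) for a ZIP archive."""
    path = Path(archive_path)
    if not path.is_file():
        raise FileNotFoundError(f"Archive not found: {path}")
    if not zipfile.is_zipfile(path):
        raise ValueError(f"Not a valid ZIP archive: {path}")
    with zipfile.ZipFile(path) as archive:
        infos = archive.infolist()
        # file_size is the uncompressed size of each member in bytes.
        return len(infos), sum(info.file_size for info in infos)
```

Pass the location where the Zenodo archive was saved. A ValueError usually indicates an interrupted or incomplete download.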

Alternatively, KGEs can be constructed using pyRDF2Vec. This process will take 2-3 hours and does not need an external dataset. However, the models perform better using the pre-trained embeddings as opposed to the ones constructed using pyRDF2Vec.

python data_prep/entity_embeddings/get_kges_pyrdf2vec.py

Create textual representations from DBpedia of taxonomy labels:

python data_prep/entity_embeddings/get_kg_texts.py
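
To illustrate what "the text under rdfs:label + rdfs:comment" refers to, the sketch below builds a SPARQL query for those two properties of a single DBpedia entity and joins the results into one string. It is not taken from get_kg_texts.py: the function names, the English language filter, and the way label and comment are joined are all assumptions made for this example.

```python
def build_label_comment_query(entity_uri, lang="en"):
    """Build a SPARQL query for the rdfs:label and rdfs:comment of one entity."""
    # Reject characters that would break the <...> IRI syntax in SPARQL.
    if not entity_uri.startswith("http") or any(ch in entity_uri for ch in '<> "'):
        raise ValueError(f"Not a usable entity URI: {entity_uri!r}")
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?label ?comment WHERE {\n"
        f"  <{entity_uri}> rdfs:label ?label ;\n"
        "      rdfs:comment ?comment .\n"
        # Assumption: only one language is wanted for the class text.
        f'  FILTER (lang(?label) = "{lang}" && lang(?comment) = "{lang}")\n'
        "}"
    )


def join_label_and_comment(label, comment):
    """Combine label and comment into one class description (assumed format)."""
    return f"{label.strip()}. {comment.strip()}"


# Illustrative entity URI; the linked entities come from the entity linking step.
query = build_label_comment_query("http://dbpedia.org/resource/Computer_science")
print(query)
```

The resulting query string can be sent to any SPARQL endpoint that serves DBpedia data.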

Create the binary dataset for the classifier:

python data_prep/data_for_classifier.py
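
One common way to frame multi-class classification as a binary task is to pair each article with candidate fields and label the pair 1 if the field is correct and 0 otherwise. The sketch below shows that general idea only; it is an assumption about the shape of such a dataset, not a description of data_for_classifier.py, and the function name and the negatives_per_positive parameter are hypothetical.

```python
import random


def make_binary_pairs(articles, all_fields, negatives_per_positive=3, seed=0):
    """Turn (text, true_field) examples into (text, candidate_field, label) rows."""
    rng = random.Random(seed)  # fixed seed keeps the sampling reproducible
    rows = []
    for text, true_field in articles:
        # Positive pair: the article with its correct field.
        rows.append((text, true_field, 1))
        # Negative pairs: the same article with randomly sampled wrong fields.
        other_fields = [f for f in all_fields if f != true_field]
        k = min(negatives_per_positive, len(other_fields))
        for field in rng.sample(other_fields, k):
            rows.append((text, field, 0))
    return rows
```

In any setup of this kind the negative class outnumbers the positive one, so accuracy alone can look high for a model that never predicts a positive pair; precision, recall and F1 are more informative.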

Create author and publisher embeddings: The scripts below create embeddings for authors and publishers that can be used in the classifiers below. Note that both scripts use SciNCL to create embeddings of each title and abstract in the dataset and thus require sufficient system memory to run. Each script takes ca. 3 hours to run.

python data_prep/authors_data.py

python data_prep/publishers_data.py
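
As a rough intuition for how document embeddings can be turned into author or publisher embeddings, the sketch below averages the vectors of all papers that share a key. Mean pooling is an assumption chosen for illustration; the actual aggregation used in authors_data.py and publishers_data.py may differ, and the function name is hypothetical.

```python
from collections import defaultdict


def mean_embeddings_by_key(paper_vectors, paper_keys):
    """Average paper vectors per key (for example per author or per publisher).

    paper_vectors: list of equal-length lists of floats, one per paper.
    paper_keys: list of lists of keys, one list per paper.
    """
    grouped = defaultdict(list)
    for vector, keys in zip(paper_vectors, paper_keys):
        for key in keys:
            grouped[key].append(vector)
    # zip(*vectors) iterates dimension by dimension across a key's papers.
    return {
        key: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for key, vectors in grouped.items()
    }
```

With real SciNCL vectors the same logic would typically run on tensors rather than Python lists.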

Models

Categorical baseline:

python models/categorical_baseline.py

Pairwise text classifier (class features are either ORKG label text or DBpedia entity text):

python models/text_classifier_trainer.py

KGEs only:

python models/kge-only-classifier.py

Adding author embeddings:

python models/kge-authors-classifier.py

Adding publisher embeddings:

python models/kge-publishers-classifier.py

Full metadata:

python models/kge-authors-publishers-classifier.py

Results

Publication Features	Class Features	Precision	Recall	F1	Accuracy
Baseline
Titles + Abstracts	Categorical Encoder	0.0	0.0	0.0	74.85
Embedding Class Labels with SciNCL
Titles + Abstracts	ORKG Labels Text	93.54	93.80	93.67	96.83
Injecting DBpedia Class Features
Titles + Abstracts	DBpedia Text	93.55	94.11	93.83	96.91
Titles + Abstracts	KGEs	75.83	29.39	42.36	80.00
Titles + Abstracts	DBpedia Text + KGEs	93.18	93.19	93.18	96.60
Adding Publication Metadata
Titles + Abstracts + Authors	DBpedia Text + KGEs	93.20	92.02	92.61	96.32
Titles + Abstracts + Publishers	DBpedia Text + KGEs	92.25	93.52	92.88	96.43
Titles + Abstracts + Authors + Publishers	DBpedia Text + KGEs	93.28	92.51	92.90	96.43

Additional graphs and comparisons between the models can be viewed at: https://api.wandb.ai/links/raya-abu-ahmad/ykbq4ke4.
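
The F1 column in the table above is consistent with the usual harmonic mean of precision and recall. The short sketch below recomputes it from two rows of the table as a cross-check; it assumes that standard definition and is not taken from the evaluation code in this repository.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (class features, precision, recall) copied from the results table.
rows = [
    ("ORKG Labels Text", 93.54, 93.80),
    ("KGEs", 75.83, 29.39),
]
for name, precision, recall in rows:
    print(f"{name}: F1 = {f1_score(precision, recall):.2f}")
```

For these two rows the recomputed values round to 93.67 and 42.36, matching the table. Because the table reports rounded precision and recall, other rows may differ from the recomputed value by 0.01.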
