dfki-nlp / for-classifier

📚 Code for my master's thesis "Investigating Knowledge Injection Approaches for Research Field Classification of Scholarly Articles".

Investigating Knowledge Injection Approaches for Field of Research Classification of Scholarly Articles

Description

This repository holds the code for my master's thesis project, which investigates classifying scholarly articles into research fields by exploring knowledge injection approaches. The full thesis can be accessed here for more detailed information.

There are different models in the models directory that utilise different features from scholarly articles:

  • Titles + abstracts
  • Authors
  • Publishers
  • Full metadata

The models also have different methods to semantically represent fields of research:

  • Categorical (baseline)
  • Using taxonomy labels (the taxonomy used is https://orkg.org/fields)
  • Linking labels to DBpedia entities and using the text under rdfs:label + rdfs:comment
  • Linking labels to DBpedia entities and using knowledge graph embeddings (pre-trained embeddings from https://zenodo.org/records/6384728)
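
To make the third option more concrete, the sketch below builds a SPARQL query that would fetch the English rdfs:label and rdfs:comment of a single DBpedia entity, and joins the two into one class description. It is not taken from this repository: the function names, the English-language filter, the way the two texts are joined, and the example entity URI are all assumptions made for illustration.

```python
def build_label_comment_query(entity_uri: str) -> str:
    """Return a SPARQL query for the English rdfs:label and rdfs:comment of an entity."""
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?label ?comment WHERE {\n"
        f"  <{entity_uri}> rdfs:label ?label ;\n"
        "      rdfs:comment ?comment .\n"
        '  FILTER (lang(?label) = "en" && lang(?comment) = "en")\n'
        "}"
    )


def class_text(label: str, comment: str) -> str:
    """Join label and comment into one description (hypothetical format)."""
    return f"{label}. {comment}".strip()


if __name__ == "__main__":
    # Example URI chosen for illustration only.
    print(build_label_comment_query("http://dbpedia.org/resource/Machine_learning"))
```

The query string could then be sent to a SPARQL endpoint of your choice; the sketch deliberately stops before any network call.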

Dataset

Download pre-prepared dataset

All data required for running the classifiers are available for download at: https://zenodo.org/records/10245830. After downloading, please save all .pt files under /data/classifier in order to be able to train and test the models.
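
Before training, it can help to confirm that the downloaded files landed in the expected place. The following is a minimal sketch using only the standard library; the helper name find_pt_files and the relative path data/classifier (assumed to be relative to the repository root) are assumptions, and the sketch does not check which specific files the models need.

```python
from pathlib import Path
from typing import List


def find_pt_files(data_dir: str) -> List[str]:
    """Return the sorted names of all .pt files directly inside data_dir."""
    directory = Path(data_dir)
    if not directory.is_dir():
        raise FileNotFoundError(f"Data directory not found: {directory}")
    return sorted(p.name for p in directory.glob("*.pt"))


if __name__ == "__main__":
    try:
        # Assumed location relative to the repository root.
        names = find_pt_files("data/classifier")
    except FileNotFoundError as err:
        print(err)
    else:
        print(f"Found {len(names)} .pt file(s): {names}")
```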

Construct dataset

This repository also contains the code for creating the data in the link above (including linking ORKG labels to DBpedia entities) under the data_prep directory. The data is prepared by using the nfdi4ds dataset for the field of research classification (FoRC) shared task. The code for creating this dataset can be found here. A link to download the dataset can be provided in order to run the steps below.

  1. Link the ORKG taxonomy to DBpedia entities:
python data_prep/entity_linking/entity_linking.py
  2. Create knowledge graph embeddings (KGEs) of taxonomy labels:

Note that this step includes downloading a pre-trained DBpedia embeddings dataset from Zenodo (https://zenodo.org/records/6384728) and thus requires enough disk space. It will take ca. 3 hours to download and ca. 1 hour to run the code in order to get the embeddings. Before running the code, download the dataset from Zenodo by running zenodo_get -d '10.5281/zenodo.6384728' and save it under /data/embeddings.zip.

python data_prep/entity_embeddings/get_kges_pretrained.py

Alternatively, KGEs can be constructed using pyRDF2Vec. This process takes 2-3 hours and does not need an external dataset. However, the models perform better with the pre-trained embeddings than with the ones constructed using pyRDF2Vec.

python data_prep/entity_embeddings/get_kges_pyrdf2vec.py
  3. Create textual representations from DBpedia of taxonomy labels:
python data_prep/entity_embeddings/get_kg_texts.py
  4. Create the binary dataset for the classifier:
python data_prep/data_for_classifier.py
  5. Create author and publisher embeddings: The scripts below create embeddings for authors and publishers that can be used in the classifiers below. Note that both scripts use SciNCL to create embeddings of each title and abstract in the dataset and thus require enough system memory to run. Each script takes ca. 3 hours to run.
python data_prep/authors_data.py
python data_prep/publishers_data.py
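
The steps above can be chained in a small driver script. The sketch below is not part of the repository: the names DATA_PREP_STEPS, build_commands and run_all are hypothetical, and it assumes the scripts are run from the repository root in the listed order using the pre-trained embeddings variant of step 2. It defaults to a dry run that only prints the commands, since several steps take hours.

```python
import subprocess
import sys

# Script paths as listed in the steps above (pre-trained KGE variant).
# Swap in data_prep/entity_embeddings/get_kges_pyrdf2vec.py for the alternative.
DATA_PREP_STEPS = [
    "data_prep/entity_linking/entity_linking.py",
    "data_prep/entity_embeddings/get_kges_pretrained.py",
    "data_prep/entity_embeddings/get_kg_texts.py",
    "data_prep/data_for_classifier.py",
    "data_prep/authors_data.py",
    "data_prep/publishers_data.py",
]


def build_commands(steps, python=sys.executable):
    """Turn script paths into argument lists suitable for subprocess.run."""
    return [[python, step] for step in steps]


def run_all(steps, dry_run=True):
    """Print each command and, unless dry_run is set, execute it in order."""
    for command in build_commands(steps):
        print(" ".join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_all(DATA_PREP_STEPS, dry_run=True)
```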

Models

  1. Categorical baseline:
python models/categorical_baseline.py
  2. Pairwise text classifier (class features are either ORKG label text or DBpedia entity text):
python models/text_classifier_trainer.py
  3. KGEs only:
python models/kge-only-classifier.py
  4. Adding author embeddings:
python models/kge-authors-classifier.py
  5. Adding publisher embeddings:
python models/kge-publishers-classifier.py
  6. Full metadata:
python models/kge-authors-publishers-classifier.py
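
A pairwise classifier of the kind named in the list above scores an (article, class description) pair with a binary label. The sketch below shows one generic way such pairs could be built from labelled articles. It is an illustration only and not the logic of data_prep/data_for_classifier.py: the function name build_pairs, the random negative sampling, and the negatives_per_positive parameter are assumptions.

```python
import random
from typing import Dict, List, Tuple


def build_pairs(
    articles: List[Tuple[str, str]],
    class_texts: Dict[str, str],
    negatives_per_positive: int = 1,
    seed: int = 0,
) -> List[Tuple[str, str, int]]:
    """Build (article text, class text, label) triples with label 1 or 0.

    articles: (text, true class name) tuples.
    class_texts: class name -> textual description of the class.
    """
    rng = random.Random(seed)
    pairs = []
    for text, true_label in articles:
        # Positive pair: the article with its own class description.
        pairs.append((text, class_texts[true_label], 1))
        # Negative pairs: the article with randomly chosen other classes.
        others = [name for name in class_texts if name != true_label]
        k = min(negatives_per_positive, len(others))
        for name in rng.sample(others, k):
            pairs.append((text, class_texts[name], 0))
    return pairs


if __name__ == "__main__":
    toy_classes = {"Physics": "Physics", "Biology": "Biology", "History": "History"}
    toy_articles = [("A study of quantum states.", "Physics")]
    for pair in build_pairs(toy_articles, toy_classes, negatives_per_positive=2):
        print(pair)
```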

Results

| Model Group | Publication Features | Class Features | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| Baseline | Titles + Abstracts | Categorical Encoder | 0.0 | 0.0 | 0.0 | 74.85 |
| Embedding Class Labels with SciNCL | Titles + Abstracts | ORKG Labels Text | 93.54 | 93.80 | 93.67 | 96.83 |
| Injecting DBpedia Class Features | Titles + Abstracts | DBpedia Text | 93.55 | 94.11 | 93.83 | 96.91 |
| Injecting DBpedia Class Features | Titles + Abstracts | KGEs | 75.83 | 29.39 | 42.36 | 80.00 |
| Injecting DBpedia Class Features | Titles + Abstracts | DBpedia Text + KGEs | 93.18 | 93.19 | 93.18 | 96.60 |
| Adding Publication Metadata | Titles + Abstracts + Authors | DBpedia Text + KGEs | 93.20 | 92.02 | 92.61 | 96.32 |
| Adding Publication Metadata | Titles + Abstracts + Publishers | DBpedia Text + KGEs | 92.25 | 93.52 | 92.88 | 96.43 |
| Adding Publication Metadata | Titles + Abstracts + Authors + Publishers | DBpedia Text + KGEs | 93.28 | 92.51 | 92.90 | 96.43 |
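
As a reading aid for the table, F1 is the harmonic mean of precision and recall, which is why the KGE-only row, with high precision but low recall, ends up with a low F1. The sketch below recomputes F1 from the precision and recall values copied from the table. The function and variable names are illustrative, and differences in the last digit are expected because the inputs are already rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results table.
REPORTED = [
    (0.0, 0.0, 0.0),
    (93.54, 93.80, 93.67),
    (93.55, 94.11, 93.83),
    (75.83, 29.39, 42.36),
    (93.18, 93.19, 93.18),
    (93.20, 92.02, 92.61),
    (92.25, 93.52, 92.88),
    (93.28, 92.51, 92.90),
]

if __name__ == "__main__":
    for precision, recall, reported in REPORTED:
        computed = f1_score(precision, recall)
        print(f"P={precision:6.2f} R={recall:6.2f} "
              f"reported F1={reported:6.2f} recomputed F1={computed:6.2f}")
```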

Additional graphs and comparisons between the models can be viewed at: https://api.wandb.ai/links/raya-abu-ahmad/ykbq4ke4.

Contributors

ryabhmd
