michaelfaerber / scholarly-entity-usage-detection

Identifying Used Methods and Datasets in Scientific Publications


Scholarly entity usage detection

Abstract

We introduce a new method to extract named entities from scientific publications. Unlike other named entity recognition tasks, we extract only those named entities that have actually been used in a paper, not those that are merely mentioned or proposed. We train our classification model on method and data set names and show that comparably good performance can be achieved for both entity types. We also show that our model can be applied to any entity type with minimal human interaction. Finally, we create an extension to the Microsoft Academic Graph that contains the used entities, and we use it to analyze information about used methods and data sets.

Summary of our approach

Our classification pipeline consists of three stages: named entity recognition using a TSE-NER approach, usage classification using SciBERT, and aggregation of the sentence-level usage classification results to the document level.

[Figure: Classification pipeline in detail]
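
The final aggregation stage can be illustrated with a minimal sketch. The aggregation rule shown here is an assumption for illustration and is not taken from this project: an entity counts as used in a document if at least one sentence mentioning it receives a usage score at or above a threshold. The input format, the function name `aggregate_to_document_level`, and the `threshold` parameter are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical input format: (document_id, entity, usage score of the sentence).
# These names are illustrative and do not come from the project's code.
SentencePrediction = Tuple[str, str, float]


def aggregate_to_document_level(
    predictions: Iterable[SentencePrediction],
    threshold: float = 0.5,
) -> Dict[str, List[str]]:
    """Map each document to the sorted list of entities classified as used.

    Assumed rule (not the project's confirmed method): an entity is "used"
    in a document if any sentence mentioning it scores >= threshold.
    """
    used = defaultdict(set)
    for doc_id, entity, score in predictions:
        if score >= threshold:
            used[doc_id].add(entity)
    return {doc_id: sorted(entities) for doc_id, entities in used.items()}


# Made-up sentence-level scores for two documents.
predictions = [
    ("paper-1", "SVM", 0.91),
    ("paper-1", "SVM", 0.12),
    ("paper-1", "MNIST", 0.30),
    ("paper-2", "MNIST", 0.77),
]
document_level = aggregate_to_document_level(predictions)
print(document_level)  # {'paper-1': ['SVM'], 'paper-2': ['MNIST']}
```

Other rules, such as majority voting over sentences or averaging scores, fit the same interface. The module subdirectories describe what the pipeline actually does.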

Structure of this project

This project is divided into several submodules. A detailed description can be found in the respective module subdirectories.

  • SmartPub-TSE-NER: Trains a CRF for named entity recognition using TSE-NER. This module is a fork of mvallet91/SmartPub-TSENER that uses SciBERT embeddings instead of word2vec embeddings.
  • annotation-set-extraction: Creates the annotation data set used to train our usage classifier.
  • annotators-agreement: Calculates the annotator agreement for the created data set.
  • usage-classificator: Trains four different models that classify whether an entity in a sentence has been used or proposed.
  • classification-pipeline: Applies both the TSE-NER model for named entity recognition and a trained usage classification model to a corpus of documents.
  • studies: Contains several Jupyter notebooks that analyze the results of the classification pipeline.
  • mag-extension: Contains our extensions to the Microsoft Academic Graph.
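
As an illustration of what an annotator agreement calculation can look like, the sketch below computes Cohen's kappa for two annotators. This README does not state which agreement measure the annotators-agreement module uses, so the choice of Cohen's kappa is an assumption. The labels `used` and `not_used`, the sample annotations, and the function name `cohens_kappa` are hypothetical as well.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up annotations, for illustration only.
annotator_1 = ["used", "used", "not_used", "not_used", "used", "not_used"]
annotator_2 = ["used", "not_used", "not_used", "not_used", "used", "used"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.333
```

In this example the annotators agree on 4 of 6 items (observed agreement 0.667) and chance agreement is 0.5, so kappa is about 0.333.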

Contact

The system has been designed and implemented by Michael Färber, Alexander Albers, and Felix Schüber. Feel free to reach out to us:

Michael Färber, [email protected]

How to Cite

Please cite our work as follows:

@inproceedings{Faerber2021SDU,
  author    = {Michael F{\"{a}}rber and
               Alexander Albers and
               Felix Sch{\"{u}}ber},
  title     = "{Identifying Used Methods and Datasets in Scientific Publications}",
  booktitle = "{Proceedings of the AAAI-21 Workshop on Scientific Document Understanding (SDU'21)@AAAI'21}",
  location  = "{Virtual Event}",
  year      = {2021}
}
