
coleridge-initiative / rclc


Rich Context leaderboard competition, including the corpus and current SOTA for required tasks.

Home Page: https://coleridgeinitiative.org/richcontext

License: Creative Commons Zero v1.0 Universal

Python 99.13% Shell 0.87%
nlp competition entity-linking rich-context dataset-ids knowledge-graph corpus leaderboard metadata-extraction

rclc's Introduction

Tracking Progress in Rich Context

The Coleridge Initiative at NYU has been researching Rich Context to enhance search and discovery of datasets used in scientific research – see the Background Info section for more details. Partnering with experts throughout academia and industry, NYU-CI has worked to leverage the closely adjacent fields of NLP/NLU, knowledge graph, recommender systems, scholarly infrastructure, data mining from scientific literature, dataset discovery, linked data, open vocabularies, metadata management, data governance, and so on. Leaderboards are published here on GitHub to track state-of-the-art (SOTA) progress among the top results.


Leaderboard 1

Entity Linking for Datasets in Publications

The first challenge is to identify the datasets used in research publications, initially focused on the problem of entity linking. Research papers generally mention the datasets they've used, although there are limited formal means to describe that metadata in a machine-readable way. The goal here is to predict a set of dataset IDs for each publication. The dataset IDs within the corpus represent the set of all possible datasets which will appear.

Identifying dataset mentions typically requires:

  • extracting text from an open access PDF
  • some NLP parsing of the text
  • feature engineering (e.g., attention to where text is located in a paper)
  • modeling to identify up to 5 datasets per publication
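The last step above, choosing up to 5 datasets per publication, can be sketched as follows. This is a minimal illustration only: the `scores` mapping, the `threshold` parameter, and the function name are hypothetical and not part of the corpus or any baseline.

```python
from typing import Dict, List

MAX_DATASETS = 5  # the task allows up to 5 predicted datasets per publication


def top_dataset_ids(scores: Dict[str, float], threshold: float = 0.0) -> List[str]:
    """Return up to MAX_DATASETS dataset IDs, highest score first.

    `scores` maps a dataset ID to a model score. Both the scoring scheme
    and the `threshold` cutoff are assumptions made for this sketch.
    """
    kept = [(ds_id, score) for ds_id, score in scores.items() if score > threshold]
    # Sort by descending score; break ties by ID so the output is deterministic.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return [ds_id for ds_id, _ in kept[:MAX_DATASETS]]
```

Returning fewer than 5 IDs when few candidates clear the threshold avoids padding a prediction with unlikely datasets.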

See Evaluating Models for Entity Linking with Datasets for details about how the Top5uptoD leaderboard metric is calculated.

Instructions

The use of open source and open standards is especially important to further the cause of effective, reproducible research. We're hosting this competition to focus on the research challenges of specific machine learning use cases encountered within Rich Context – see the Workflow Stages section.

If you have any questions about the Rich Context leaderboard competition – and especially if you identify any problems in the corpus (e.g., data quality, incorrect metadata, broken links, etc.) – please use the GitHub issues for this repo and pull requests to report, discuss, and resolve them.

rclc's People

Contributors

abhi-balaji, ceteri, ernestogimeno, jasonzhangzy1757, philipskokoh, srand525

rclc's Issues

Open access of these publications is no longer available

I just found out that the openAccess links for publications on papers.ssrn.com no longer return PDF files.
Here they are:

[
{"@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#publication-b3712b3852b3c38fdeb7", "openAccess": "https://papers.ssrn.com/sol3/Delivery.cfm/19677.pdf?abstractid=2785275&mirid=1&type=2"},
{"@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#publication-f864820c2ac96be88d5d", "openAccess": "https://papers.ssrn.com/sol3/Delivery.cfm/SSRN_ID2640935_code536749.pdf?abstractid=2023843&mirid=1&type=2"},
{"@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#publication-f45c2734c942ab11ea19", "openAccess": "https://papers.ssrn.com/sol3/Delivery.cfm/3871.pdf?abstractid=2793911&mirid=1&type=2"}
]
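One way to detect links like these is to check whether the downloaded payload really is a PDF rather than, say, an HTML landing page. The sketch below assumes the response body has already been fetched into memory; `looks_like_pdf` is a hypothetical helper, not a function in this repo.

```python
PDF_MAGIC = b"%PDF-"  # every PDF file starts with this header marker


def looks_like_pdf(payload: bytes) -> bool:
    """Return True if the payload starts with the PDF header.

    Leading whitespace is skipped because some servers prepend blank
    lines; that tolerance is a choice made for this sketch.
    """
    return payload.lstrip().startswith(PDF_MAGIC)
```

A downloader could call this on each response body and record the publication ID as missing when it returns False, instead of saving an HTML page under a .pdf name.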

Convert PDFs that are images into txt files

Hi,

I discovered that 2b4497873374d080d359.pdf does not convert properly to .txt: I only get an empty file, although the PDF itself is fine. Talking with @philipskokoh in the repo for his baseline, we found that the PDF of the publication is actually a scanned document, so it is an image rather than text. The tools we currently use to convert PDFs into txt don't work on it because they need real text.

I think we only have one scanned (image) PDF for now, but this could become a limitation in the future and is definitely something to take into account when augmenting the dataset. @ceteri

Best,
Haritz
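A cheap way to flag scanned PDFs like the one described above is to check how much visible text the conversion produced. This is a sketch under assumptions: the function name and the `min_chars` threshold of 200 are hypothetical and would need tuning against the corpus.

```python
def is_probably_scanned(extracted_text: str, min_chars: int = 200) -> bool:
    """Guess whether a PDF was image-only from its extracted text.

    Counts non-whitespace characters only, so a file containing just
    newlines or form feeds is still treated as empty.
    """
    visible = sum(1 for ch in extracted_text if not ch.isspace())
    return visible < min_chars
```

Publications flagged this way could then be routed to an OCR step or listed for manual review, rather than silently producing empty .txt files.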

missing requirements

beautifulsoup4 is missing in requirements.
pdfminer is missing in requirements.
requests-html is missing in requirements.
ray is missing in requirements.
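A check like the following could catch this kind of gap automatically. It is a rough sketch: it parses only simple requirement lines (a name plus optional version specifiers), skips option lines such as `-r other.txt`, and does not handle URL-based requirements. The helper names are hypothetical.

```python
import re
from typing import Iterable, List, Set


def _normalize(name: str) -> str:
    # PEP 503 normalization: runs of "-", "_" and "." compare equal, case-insensitively.
    return re.sub(r"[-_.]+", "-", name).lower()


def declared_packages(requirements_text: str) -> Set[str]:
    """Collect the normalized package names declared in a requirements file."""
    names = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        # The name is the leading run of characters before any specifier.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.add(_normalize(match.group(0)))
    return names


def missing_packages(requirements_text: str, needed: Iterable[str]) -> List[str]:
    """Return the needed package names that the requirements file lacks."""
    declared = declared_packages(requirements_text)
    return sorted(n for n in needed if _normalize(n) not in declared)
```

Running `missing_packages` against the repo's requirements file with the four names listed above would report whichever of them are still undeclared.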

Evaluation metrics

We've had lots of discussions about how to handle the evaluation metrics, and whether to use a simpler TopK approach or the Top5uptoD described here.

The main debate was how feasible the latter would be. Here's sample code, in:
Evaluating Models for Entity Linking with Datasets.

That said, let's keep the discussion going and see whether other metrics work better.
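For comparison, here is one simple reading of a TopK metric: a publication counts as a hit if any of its true dataset IDs appears among the first k predictions. This is an assumption made for illustration and is not the Top5uptoD calculation, which is defined in the linked notebook. The function and argument names are hypothetical.

```python
from typing import Dict, List, Set


def top_k_hit_rate(
    predictions: Dict[str, List[str]],
    truth: Dict[str, Set[str]],
    k: int = 5,
) -> float:
    """Fraction of publications with at least one true dataset in the top k.

    `predictions` maps a publication ID to a ranked list of dataset IDs;
    `truth` maps a publication ID to its set of known dataset IDs.
    """
    if not truth:
        return 0.0
    hits = 0
    for pub_id, true_ids in truth.items():
        ranked = predictions.get(pub_id, [])[:k]  # a missing prediction is a miss
        if true_ids.intersection(ranked):
            hits += 1
    return hits / len(truth)
```

Note that this variant gives no extra credit for finding several datasets in the same publication, which is one of the ways a metric that accounts for the number of true datasets would differ.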

Replace SPv1 with a better PDF parser

Rework the pipeline, after PDF download, so that text gets extracted in a semi-structured way.

Some options to evaluate:

If a package has any dependencies on JVM, then it's best to containerize that part of the workflow with Docker; we don't want to be managing JVM apps. As much as possible, use instances from DockerHub instead of creating new ones.

In the case of grobid there are already several DockerHub instances: https://hub.docker.com/search?q=grobid&type=image

  • which is simplest for us to use?
  • what's the best way to integrate into our RCLC workflow?

Add Unit Test for TextRank Part in the Pipeline

For the next stage, add a simple unit test for the TextRank part of the pipeline, using the example file in example/pub. That way, if any of our imported libraries change, we'll be able to find errors.

A quick suggestion: a first pass at the test would be to run that example file and confirm the top few key-phrases for each section. If anything breaks, the test should fail quickly.

Missing Publication and Dataset Resources

Hi,

I executed python corpus.py corpus.ttl and then python download_corpus_resources.py to download the corpus, but I got the output below. Is this the expected output? It looks like some publications cannot be downloaded.

Number of records in the corpus: 586
Number of research publications: 480
Successfully downloaded 474 pdf files.
Missing publication resources: {'012df4a72af52b038483', 'dca54974ff51a5f7f8ab', '5f48a343cb75195cd646', 'c8f9b19b39e34d98a557','988428e18884e28e037c', '42c2755ec0f983870e62'}
Number of datasets: 106
Successfully downloaded 101 resource files.
Missing dataset resources: {'875ffb2b04b1392cd1f2', 'fe338b5b2f3f6b0d11a4', '53ca68ba0ded95220662', '33b1ce039c67a6658644', '379ff5f518e664ba2353'}

I checked the publication with id "012df4a72af52b038483", and it looks like the link is not broken. Here is the link I got from corpus.jsonld:
https://aasldpubs.onlinelibrary.wiley.com/doi/pdf/10.1002/hep.23220

@ceteri @philipskokoh Do you know why this happens?

Thanks

duplicate publications

Some of the publications are duplicated within the v0.1.7 release.

For example, publication-20d6bb688a5e9b2dd3cd shows up twice.

Need to add a filter to make publications unique.
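Such a filter could look like the sketch below, which keeps the first record seen for each ID and preserves order. It assumes records are dicts keyed by "@id", as in the JSON-LD snippets elsewhere in these issues; the function name is hypothetical.

```python
from typing import Any, Dict, List


def unique_by_id(records: List[Dict[str, Any]], key: str = "@id") -> List[Dict[str, Any]]:
    """Drop records whose ID was already seen, keeping the first occurrence."""
    seen = set()
    unique = []
    for rec in records:
        rec_id = rec.get(key)
        if rec_id is None:
            # Records without an ID cannot be compared, so keep them all.
            unique.append(rec)
            continue
        if rec_id in seen:
            continue
        seen.add(rec_id)
        unique.append(rec)
    return unique
```

Keeping the first occurrence is an arbitrary choice; if duplicate entries can carry different metadata, merging them may be preferable to discarding the later ones.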
