coleridge-initiative / rclc

Rich Context leaderboard competition, including the corpus and current SOTA for required tasks.

Home Page: https://coleridgeinitiative.org/richcontext

License: Creative Commons Zero v1.0 Universal

Python 99.13% Shell 0.87%

nlp competition entity-linking rich-context dataset-ids knowledge-graph corpus leaderboard metadata-extraction

rclc's Introduction

Tracking Progress in Rich Context

The Coleridge Initiative at NYU has been researching Rich Context to enhance search and discovery of datasets used in scientific research – see the Background Info section for more details. Partnering with experts throughout academia and industry, NYU-CI has worked to leverage the closely adjacent fields of NLP/NLU, knowledge graph, recommender systems, scholarly infrastructure, data mining from scientific literature, dataset discovery, linked data, open vocabularies, metadata management, data governance, and so on. Leaderboards are published here on GitHub to track state-of-the-art (SOTA) progress among the top results.

Leaderboard 1

Entity Linking for Datasets in Publications

The first challenge is to identify the datasets used in research publications, initially focused on the problem of entity linking. Research papers generally mention the datasets they've used, although there are limited formal means to describe that metadata in a machine-readable way. The goal here is to predict a set of dataset IDs for each publication. The dataset IDs within the corpus represent the set of all possible datasets which will appear.

Identifying dataset mentions typically requires:

extracting text from an open access PDF
some NLP parsing of the text
feature engineering (e.g., attention to where text is located in a paper)
modeling to identify up to 5 datasets per publication

See Evaluating Models for Entity Linking with Datasets for details about how the Top5uptoD leaderboard metric is calculated.

Instructions

Use of open source and open standards is especially important for effective, reproducible research. We're hosting this competition to focus on the research challenges of specific machine learning use cases encountered within Rich Context – see the Workflow Stages section.

If you have any questions about the Rich Context leaderboard competition – and especially if you identify any problems in the corpus (e.g., data quality, incorrect metadata, broken links, etc.) – please use the GitHub issues for this repo and pull requests to report, discuss, and resolve them.

rclc's People

Contributors

Stargazers

Watchers

Forkers

philipskokoh jasonzhangzy1757 zhengliz justcherie andreajparker kaydoh

rclc's Issues

Open access of these publications is no longer available

Just found out that OpenAccess publications from papers.ssrn.com do not return pdf files.
Here they are:

[
{"@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#publication-b3712b3852b3c38fdeb7", "openAccess": "https://papers.ssrn.com/sol3/Delivery.cfm/19677.pdf?abstractid=2785275&mirid=1&type=2"},
{"@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#publication-f864820c2ac96be88d5d", "openAccess": "https://papers.ssrn.com/sol3/Delivery.cfm/SSRN_ID2640935_code536749.pdf?abstractid=2023843&mirid=1&type=2"},
{"@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#publication-f45c2734c942ab11ea19", "openAccess": "https://papers.ssrn.com/sol3/Delivery.cfm/3871.pdf?abstractid=2793911&mirid=1&type=2"}
]
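One way to catch links like these automatically is to inspect what the server returned instead of trusting the .pdf in the URL. The sketch below is only an illustration, not part of the corpus tooling: the function names are hypothetical, and it assumes the download step can hand over the Content-Type header and the raw response bytes.

```python
def looks_like_pdf(payload: bytes) -> bool:
    """Return True if the payload starts with the PDF magic bytes."""
    # PDF files begin with "%PDF-"; tolerate leading whitespace.
    return payload.lstrip()[:5] == b"%PDF-"


def classify_download(content_type: str, payload: bytes) -> str:
    """Label a downloaded resource as 'pdf', 'html', or 'unknown'."""
    if looks_like_pdf(payload):
        return "pdf"
    # A landing page or login wall usually comes back as HTML.
    if "html" in content_type.lower():
        return "html"
    return "unknown"
```

Any openAccess entry that does not classify as pdf could then be reported as a corpus problem.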

Convert into txt files PDFs that are images

Hi,

I discovered that 2b4497873374d080d359.pdf does not convert properly to .txt: I only get an empty file, although the PDF itself is fine. Talking to @philipskokoh in the repo of his baseline, we found out that the PDF of this publication is actually a scanned document, so its pages are images rather than text. The tools we are using right now to convert PDF to text don't work on it because they need an embedded text layer.

I think we only have one scanned (image) pdf for now. But it can be a limitation in the future and definitely something to take into account when augmenting the dataset. @ceteri

Best,
Haritz
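A simple way to flag such files during conversion is to check whether the extracted text is effectively empty. This is a rough sketch under stated assumptions: the function name is hypothetical and the 200-character threshold is an arbitrary starting point, not a tuned value.

```python
def needs_ocr(extracted_text: str, min_chars: int = 200) -> bool:
    """Guess whether a PDF is a scanned image from its extracted text.

    A text-based PDF normally yields far more than `min_chars` visible
    characters; a scanned one yields nothing or only whitespace.
    """
    visible = sum(1 for ch in extracted_text if not ch.isspace())
    return visible < min_chars
```

Files flagged this way could be routed to an OCR step or listed for manual review.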

missing requirements

beautifulsoup4 is missing in requirements.
pdfminer is missing in requirements.
requests-html is missing in requirements.
ray is missing in requirements.
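A quick check like the following can confirm which of these packages are absent from an environment. It is a sketch only: the mapping pairs each package name from the list above with its top-level import name, and the function name is hypothetical.

```python
from importlib.util import find_spec

# Package name as installed -> top-level module name used in imports.
REQUIRED = {
    "beautifulsoup4": "bs4",
    "pdfminer": "pdfminer",
    "requests-html": "requests_html",
    "ray": "ray",
}


def missing_packages(required):
    """Return the sorted package names whose module cannot be found."""
    return sorted(
        package
        for package, module in required.items()
        if find_spec(module) is None
    )
```

Calling missing_packages(REQUIRED) returns an empty list once all four are installed.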

Submit URLs for entries on the leaderboard

Submit URLs for entries on the leaderboard here.

Run phrase extraction on the text from PDFs

For the next stage to add to this pipeline, run phrase extraction from the JSON files that result from #10.

See https://pypi.org/project/pytextrank/ for a pipeline based on spaCy
with example code at https://github.com/DerwenAI/pytextrank#usage

The objective is to determine the top-ranked phrases for each section of a research paper. If possible, also collect the title of the section -- then save results out to JSON.
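The step could be sketched roughly as below. Several things here are assumptions rather than the project's actual design: the input is taken to be a list of sections with "title" and "text" keys (the JSON layout produced by #10 is not shown here), the helper names are hypothetical, and the pytextrank wiring assumes the pytextrank 3.x API on spaCy 3 with the en_core_web_sm model installed.

```python
import json


def top_phrases_by_section(sections, extract, limit=5):
    """Rank phrases per section and return JSON-serializable results.

    `extract` is any callable mapping text to (phrase, rank) pairs.
    """
    results = []
    for section in sections:
        ranked = sorted(extract(section["text"]), key=lambda p: p[1], reverse=True)
        results.append({
            "title": section.get("title"),
            "phrases": [{"text": t, "rank": r} for t, r in ranked[:limit]],
        })
    return results


def make_pytextrank_extractor(model="en_core_web_sm"):
    """Build an extractor backed by spaCy + pytextrank (imported lazily)."""
    import spacy
    import pytextrank  # noqa: F401  (registers the "textrank" pipe)

    nlp = spacy.load(model)
    nlp.add_pipe("textrank")

    def extract(text):
        return [(p.text, p.rank) for p in nlp(text)._.phrases]

    return extract


def save_results(results, path):
    """Write the per-section phrases out as JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(results, handle, indent=2)
```

Passing the extractor in as a parameter keeps the ranking and JSON logic testable without loading a spaCy model.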

Evaluation metrics

We've had lots of discussions about how to handle the evaluation metrics, between using a simpler TopK approach versus the Top5uptoD described here.

The main debate was how feasible the latter would be. Here's sample code, in:
Evaluating Models for Entity Linking with Datasets.

That said, let's keep the discussion going and see whether other metrics work better.
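To make the trade-off concrete, here is a small sketch contrasting plain TopK precision with a variant that caps k at the number of true datasets D. The capped variant is only one possible reading of "up to D" and is not taken from the linked notebook, which remains the authoritative definition; the function and parameter names are hypothetical.

```python
def top_k_precision(predicted, actual, k=5, cap_at_actual=False):
    """Precision of the top-k predicted dataset IDs for one publication.

    With cap_at_actual=True, k is reduced to min(k, D), where D is the
    number of true datasets, so a publication with a single dataset can
    still score 1.0. This capping rule is an assumption.
    """
    truth = set(actual)
    if not truth:
        return 0.0
    if cap_at_actual:
        k = min(k, len(truth))
    # Drop repeated predictions while keeping their ranked order.
    top = list(dict.fromkeys(predicted))[:k]
    hits = sum(1 for dataset_id in top if dataset_id in truth)
    return hits / k
```

For a publication with one true dataset ranked first, the plain version scores at most 1/5 while the capped version can reach 1.0, which illustrates why the simpler metric penalizes publications that use few datasets.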

Replace SPv1 with a better PDF parser

Rework the pipeline, after PDF download, so that text gets extracted in a semi-structured way.

Some options to evaluate:

Parsr https://github.com/axa-group/Parsr
grobid https://github.com/kermitt2/grobid

If a package has any dependencies on JVM, then it's best to containerize that part of the workflow with Docker; we don't want to be managing JVM apps. As much as possible, use instances from DockerHub instead of creating new ones.

In the case of grobid there are already several DockerHub instances: https://hub.docker.com/search?q=grobid&type=image

Which is simplest for us to use?
What's the best way to integrate it into our RCLC workflow?
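As an illustration of what "semi-structured" extraction could look like downstream, the sketch below turns TEI XML (the format grobid emits for full-text documents) into a list of sections. It assumes the XML has already been fetched from a grobid container and that sections appear as div elements with an optional head and p children; the function name and output layout are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def sections_from_tei(tei_xml: str):
    """Extract [{"title": ..., "text": ...}] from a TEI document body."""
    root = ET.fromstring(tei_xml)
    sections = []
    for div in root.iterfind(".//tei:body//tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        title = head.text.strip() if head is not None and head.text else None
        # itertext() flattens inline markup such as references.
        paragraphs = [
            "".join(p.itertext()).strip()
            for p in div.findall("tei:p", TEI_NS)
        ]
        sections.append({"title": title, "text": "\n".join(paragraphs)})
    return sections
```

Keeping the parser separate from the call to the container means either Parsr or grobid could be swapped in behind the same section structure.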

Add Unit Test for TextRank Part in the Pipeline

For the next stage, add a simple unit test for the TextRank part of the pipeline, using the example file in example/pub. That way, if any of our imported libraries change, we'll be able to find errors.

A quick suggestion: a first pass at the test could run that example file and confirm the top few key phrases for each section. If anything breaks, the test should fail quickly.
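The comparison at the heart of such a test might look like the helper below. It is a sketch with hypothetical names, and it assumes both the pipeline output and the stored expectations map each section title to a ranked list of phrases.

```python
def phrase_drift(actual, expected, top_n=3):
    """Return the sections whose top phrases differ from expectations.

    The result maps section title -> (expected_top, actual_top) and is
    empty when the top `top_n` phrases of every expected section match.
    """
    drift = {}
    for section, want in expected.items():
        got = actual.get(section, [])[:top_n]
        if got != want[:top_n]:
            drift[section] = (want[:top_n], got)
    return drift
```

A unit test would then assert that phrase_drift returns an empty dict for the example publication, and the returned mapping shows exactly which section changed when it does not.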

Missing Publication and Dataset Resources

Hi,

I executed python corpus.py corpus.ttl and then python download_corpus_resources.py to download the corpus but I got this output. Is this the expected output? It looks like some publications cannot be downloaded.

Number of records in the corpus: 586
Number of research publications: 480
Successfully downloaded 474 pdf files.
Missing publication resources: {'012df4a72af52b038483', 'dca54974ff51a5f7f8ab', '5f48a343cb75195cd646', 'c8f9b19b39e34d98a557','988428e18884e28e037c', '42c2755ec0f983870e62'}
Number of datasets: 106
Successfully downloaded 101 resource files.
Missing dataset resources: {'875ffb2b04b1392cd1f2', 'fe338b5b2f3f6b0d11a4', '53ca68ba0ded95220662', '33b1ce039c67a6658644', '379ff5f518e664ba2353'}

I checked the publication with id: "012df4a72af52b038483", and it looks like the link is not broken. Here is the link I got from corpus.jsonld
https://aasldpubs.onlinelibrary.wiley.com/doi/pdf/10.1002/hep.23220

@ceteri @philipskokoh Do you know why this happens?

Thanks
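For anyone reproducing this, the set of missing resources can be recomputed independently of the download script by comparing the expected IDs against the files on disk. This sketch assumes files are saved as the ID plus an extension, as with 2b4497873374d080d359.pdf; the function name and directory argument are hypothetical.

```python
from pathlib import Path


def missing_resources(expected_ids, directory, suffix=".pdf"):
    """Return the IDs that have no `<id><suffix>` file in `directory`."""
    present = {path.stem for path in Path(directory).glob(f"*{suffix}")}
    return set(expected_ids) - present
```

Running it after each download attempt shows whether a retry recovered any of the missing IDs.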

duplicate publications

Some of the publications are duplicated within the v0.1.7 release.

For example, publication-20d6bb688a5e9b2dd3cd shows up twice.

Need to add a filter to make publications unique.
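Such a filter could be as simple as the sketch below, which keeps the first record seen for each identifier. It assumes records are dicts carrying the "@id" key used in the corpus JSON-LD; the function name is hypothetical.

```python
def unique_by_id(records, key="@id"):
    """Drop records whose identifier was already seen, keeping order."""
    seen = set()
    unique = []
    for record in records:
        ident = record[key]
        if ident not in seen:
            seen.add(ident)
            unique.append(record)
    return unique
```

Keeping the first occurrence preserves the original ordering of the corpus, which makes diffs between releases easier to read.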