
Rich Context knowledge graph management

Home Page: https://rc.coleridgeinitiative.org/?radius=3&entity=NOAA

License: Creative Commons Zero v1.0 Universal

Languages: Python 3.83%, Jupyter Notebook 35.87%, HTML 60.31%
Topics: rich-context, metadata, knowledge-graph

rcgraph's Introduction

RCGraph

Manage the Rich Context knowledge graph.

Installation

First, there are two options for creating an environment.

Option 1: use virtualenv to create a virtual environment with the local Python 3.x as the target binary.
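For example, a typical invocation (the directory name venv must match the activation command below):

virtualenv -p python3 venv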

Then activate the virtual environment and upgrade setuptools:

source venv/bin/activate
pip install setuptools --upgrade

Option 2: use conda -- see https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html

Second, clone the repo:

git clone https://github.com/Coleridge-Initiative/RCGraph.git

Third, change into the directory and initialize the local Git configuration for the required submodules:

cd RCGraph
git submodule init
git submodule update
git config status.submodulesummary 1

Given that foundation, load the dependencies:

pip install -r requirements.txt

Fourth, set up the local rc.cfg configuration file and run the unit tests (see below) to confirm that this project has been installed and configured properly.

Submodules

Ontology definitions used for the KG are linked into this project as a submodule:

Git repos exist for almost every entity in the KG, also linked as submodules:

The RCLC leaderboard competition is also linked as a submodule since it consumes from this repo for corpus updates:

Updates

To update the submodules to their latest HEAD commit on the master branch, run:

git submodule foreach "(git fetch; git merge origin/master; cd ..;)"

Then add the updated submodules and commit.

For more info about how to use Git submodules, see:

Workflow

Initial Steps

  • update datasets.json -- datasets are the foundation for the KG
  • add a new partition of publication metadata for each data ingest

Step 1: Graph Consistency Tests

To perform these tests:

coverage run -m unittest discover

Then create GitHub issues in the corresponding submodule repos for any failed tests.

Also, you can generate a coverage report and upload that via:

coverage report
bash <(curl -s https://codecov.io/bash) -t @.cc_token

Test coverage reports can be viewed at https://codecov.io/gh/Coleridge-Initiative/RCGraph

Step 2: Gather the DOIs, etc.

Use title search across the scholarly infrastructure APIs to identify a DOI and other metadata for each publication.

python run_step2.py

Results are organized in partitions within the bucket_stage subdirectory, using the same partition names from the preceding workflow steps, to make errors easier to trace and troubleshoot.

See the misses_step2.json file which reports the title of each publication that failed every API lookup.
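As a hedged illustration of the kind of title lookup this step performs -- the actual implementation relies on the project's scholarly-API integrations (see richcontext.scholapi in the issues below), so treat this as a standalone sketch against the public Crossref API:

# Hypothetical sketch: find DOI candidates for a publication title via the
# public Crossref REST API; run_step2.py's own API calls and scoring differ.
import requests

def doi_candidates(title, rows=5):
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": title, "rows": rows},
        timeout=30,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    # return (title, DOI) pairs for downstream matching against the input title
    return [((item.get("title") or [""])[0], item.get("DOI")) for item in items]

print(doi_candidates("Standing Still or Moving Up?"))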

Step 3: Gather the PDFs, etc.

Use publication lookup with DOIs across the scholarly infrastructure APIs to identify open access PDFs, journals, authors, keywords, etc.

python run_step3.py

Results are organized in partitions in the bucket_stage subdirectory, using the same partition names from the preceding workflow steps.

See the misses_step3.json file which reports the title of each publication that failed every API lookup.

Step 4: Reconcile Journal Entities

This is a manual step.

Scan results from calls to scholarly infrastructure APIs, then apply business logic to reconcile the journal for each publication with the journals.json entity listing.

python run_step4.py

Disputed entity definitions are written to standard output, and suggested additions are written to a new update_journals.json file.

The person running this step must review each suggestion, then determine whether to add the suggested journals to the journals.json entities file -- or make other changes to previously described journal entities. For example, sometimes the metadata returned from discovery APIs has errors and would cause data quality issues within the KG.

Some good tools for manually checking journal metadata via ISSNs include ISSN.org, Crossref, and NCBI. For example, using the ISSN "1531-3204" to look up journal metadata:

Often there will be outdated/invalidated ISSNs or low-info-content defaults (e.g., substituting SSRN) included in API results, which could derail our KG development.

Journal names get used later in the workflow to construct UUIDs for publications, prior to generating the public corpus. This step performs consistency tests and filtering of the API metadata, to avoid data quality issues later.
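As a purely hypothetical illustration of why that matters (the actual identifier logic lives in gen_ttl.py and may differ), a content-derived identifier changes whenever the journal string changes, so inconsistent journal metadata would fragment publications in the KG:

# Hypothetical sketch only -- not the project's actual UUID construction.
# If the journal string differs between runs, the identifier differs too.
import hashlib

def publication_id(title, journal):
    digest = hashlib.sha1(f"{title}::{journal}".encode("utf-8")).hexdigest()
    return "publication-" + digest[:20]

# placeholder journal name, for illustration only
print(publication_id("Standing Still or Moving Up?", "Example Journal of Social Policy"))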

See the misses_step4.json file which reports the title of each publication that doesn't have a journal.

Caveats:

  • If you don't understand what this step performs, don't run it
  • Do not make manual edits to the journals.json file

Step 5: Reconcile Author Lists

This is a manual step.

Scan results from calls to scholarly infrastructure APIs, then apply business logic to reconcile (disambiguate) the author lists for each publication with the authors.json entity listing.

python run_author.py

Lists of authors are parsed from metadata in the bucket_stage then disambiguated.

Results are organized in partitions in the bucket_stage subdirectory, using the same partition names from the preceding workflow steps.

The stage produces two files:

  • authors.json -- list of known authors
  • auth_train.tsv -- training set for self-supervised model

See the misses_author.json file which reports the title of each publication that doesn't have any authors.

Caveats:

  • Do not make manual edits to authors.json or auth_train.tsv

Step 6: Pull Abstracts

This workflow step pulls the abstracts from the results of API calls in previous steps.

python run_abstract.py

Results are organized in partitions in the bucket_stage subdirectory, using the same partition names from the preceding workflow steps.

See the misses_abstract.json file which reports the title of each publication that had no abstract.

Step 7: Parse Keyphrases from Abstracts

This workflow step parses keyphrases from abstracts.

python run_keyphr.py

Results are organized in partitions in the bucket_stage subdirectory, using the same partition names from the preceding workflow steps.

See the misses_keyphr.json file which reports the title of each publication for which no keyphrases could be parsed.
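The issues below mention running the TextRank algorithm over abstracts; as a hedged sketch (run_keyphr.py's actual implementation and models may differ), a PyTextRank pipeline on spaCy looks roughly like this:

# Hedged sketch of TextRank-style key phrase extraction from one abstract;
# the real run_keyphr.py iterates over the bucket_stage partitions instead.
import spacy
import pytextrank  # noqa: F401 -- registers the "textrank" pipeline component

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

abstract = "This study identified the employment and earnings trajectories of welfare recipients over six years ..."
doc = nlp(abstract)

for phrase in doc._.phrases[:5]:
    print(round(phrase.rank, 4), phrase.text)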

Step 9: Finalize Metadata Corrections

This workflow step finalizes the metadata corrections for each publication, including selection of a URL, open access PDF, etc., along with any manual overrides.

python run_final.py

Results are organized in partitions in the bucket_final subdirectory, using the same partition names from the previous workflow step.

See the misses_final.json file which reports the title of each publication that failed every API lookup.

Step 10: Generate Corpus Update

This workflow step generates uuid values (late binding) for both publications and datasets, then serializes the full output as TTL in tmp.ttl and as JSON-LD in tmp.jsonld for a corpus update:

python gen_ttl.py

Afterwards, move the generated tmp.* files into the RCLC repo and rename them:

mv tmp.* rclc
cd rclc
mv tmp.ttl corpus.ttl
mv tmp.jsonld corpus.jsonld

To publish the corpus:

  1. commit and create a new tagged release
  2. run bin/download_resources.py to download PDFs
  3. extract text from PDFs
  4. upload to the public S3 bucket and write manifest

Step 11: Generate UI Web App Update

To update the UI web app:

./gen_ttl.py --full_graph true
cp tmp.jsonld full.jsonld 
cp tmp.ttl full.ttl 
gsutil cp full.jsonld gs://rich-context/

rcgraph's People

Contributors

abhi-balaji, andrewhnorris, ceteri, ernestogimeno, jasonzhangzy1757


rcgraph's Issues

Create a library for “imperfect” string comparisons

We need a library to calibrate different methods for "imperfect" text matching. The idea is to be able to calibrate parameters for different contexts (e.g., dataset titles, publication titles, author names, etc.).
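A minimal sketch of what such a calibrated matcher could look like, using only the standard library (the names and threshold values below are hypothetical):

# Hypothetical sketch: fuzzy string comparison with per-context thresholds,
# so the same matcher can be calibrated separately for dataset titles,
# publication titles, author names, etc.
from difflib import SequenceMatcher

THRESHOLDS = {
    "dataset_title": 0.90,
    "publication_title": 0.85,
    "author_name": 0.75,
}

def fuzzy_match(a, b, context="publication_title"):
    score = SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()
    return score, score >= THRESHOLDS[context]

print(fuzzy_match("Agricultural Resource Management Survey",
                  "agricultural resources management survey",
                  context="dataset_title"))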

Incorporate results from the ML competition

Begin to use models from our ML competition:

  • develop the required training sets
  • train models
  • evaluate models on new data
  • build ensembles (where possible)

Also, iterate to experiment with our own models and publish those:

  1. to help boost the competition
  2. to abstract learnings from the competition into our RCGraph workflow

[Future Work/ Idea] Graph Database to store the data of the RCKG

If I remember correctly, I think @ernestogimeno mentioned to me that the jsonld corpus format has some limitations that may affect scalability, so at some point we may need to use a real database.

A possible option is to use a graph database such as Neo4j. In such a database, the information is stored as a graph, so it might be suitable for our needs.

I just put this thought here for future discussion if we need to upgrade the jsonld format.

analytics: dataset co-occurrence

We need a means to analyze the co-occurrence rates for datasets. In other words, look at the rate at which datasets are used together, based on links from publications in our KG.

Use the full.jsonld instance of the KG as the basis for analysis.

Steps:

  1. results are best packaged as a Jupyter notebook here in the https://github.com/Coleridge-Initiative/RCGraph repo
  2. later we'll move the analysis into an additional workflow step, and show the results in the web app.

USDA would love to have this before the mid-March demo.
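A rough sketch of the analysis (the JSON-LD property names below are assumptions and need to be checked against the actual full.jsonld schema):

# Hypothetical sketch: count how often pairs of datasets are linked from the
# same publication in full.jsonld, then report the most frequent pairs.
import json
from collections import Counter
from itertools import combinations

with open("full.jsonld") as f:
    nodes = json.load(f).get("@graph", [])

cooccur = Counter()
for node in nodes:
    # "dct:references" is an assumed name for publication->dataset links
    linked = node.get("dct:references", [])
    if isinstance(linked, dict):
        linked = [linked]
    ids = sorted(d["@id"] for d in linked if isinstance(d, dict) and "@id" in d)
    for pair in combinations(ids, 2):
        cooccur[pair] += 1

for pair, count in cooccur.most_common(10):
    print(count, pair)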

Add abstracts entities into the KG

Add the publication abstracts into the KG, based on stage3 results from discovery APIs.

Best to use run_stage3.py as a template, although the main part to reuse is how it iterates through partitions in the BUCKET_STAGE.

@ernestogimeno has also worked with this code and can help guide/advise.

Will need to pull the abstract field from metadata, where it exists. The responses from Semantic Scholar tend to have these -- and we may be able to extend other API calls to get abstracts. @lobodemonte can assist on those extensions.

The end goal will be to include abstracts as metadata in the graph -- where available -- and then also run these through a later stage that runs the TextRank algorithm to extract key phrases.

download_resources.py is missing

Hi!
From the readme:

To publish the corpus:
commit and create a new tagged release
run bin/download_resources.py to download PDFs

But I cannot find download_resources.py in the repository. Is it missing?
Thanks @ernestogimeno

analysis: author cliques

We need a means to analyze the co-occurrence rates for authors. In other words, look at the rate at which authors appear together as co-authors on research publications in our KG.

Use the full.jsonld instance of the KG as the basis for analysis.

Since authors are "social" (ostensibly, unlike datasets), it's more interesting to run clique analysis instead of co-occurrence probability rates. A good approach would likely be:

  1. build a graph in networkx from the JSON-LD (similar to the web app in RCServer)
  2. use clique features

Steps:

  1. results are best visualized and packaged as a Jupyter notebook here in the https://github.com/Coleridge-Initiative/RCGraph repo
  2. later we'll move the analysis into an additional workflow step, and show the results in the web app.

ResearchGate is very interested in this feature -- let's aim for late March or sooner.
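A hedged sketch of steps 1 and 2 (again, the JSON-LD property names are assumptions to verify against the corpus):

# Hypothetical sketch: build a co-authorship graph from full.jsonld with
# networkx, then enumerate the largest author cliques.
import json
from itertools import combinations
import networkx as nx

with open("full.jsonld") as f:
    nodes = json.load(f).get("@graph", [])

G = nx.Graph()
for node in nodes:
    # "dct:creator" is an assumed name for publication->author links
    authors = node.get("dct:creator", [])
    if isinstance(authors, dict):
        authors = [authors]
    ids = [a["@id"] for a in authors if isinstance(a, dict) and "@id" in a]
    for a, b in combinations(ids, 2):
        G.add_edge(a, b)

for clique in sorted(nx.find_cliques(G), key=len, reverse=True)[:5]:
    print(len(clique), clique)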

analysis: publisher classifier

We need a means to analyze the "quality" of the more popular journal article publishers. In other words, we need a classifier based on the publisher (ScienceDirect, PubMed, OUP, etc.) for how likely it is that entries in publication partitions will "survive" all the way through our workflow to successful PDF parsing.

Methodology:

  • analyze the distribution of publishers among the entries in partitions in bucket_final by resolving doi fields to URLs, then fetching those to determine their DNS domain
  • what's the distribution of how many fail to have open access PDFs? use a title match on errors/misses_final.txt
  • what's the distribution of how many PDFs fail to download? see rclc/errors.txt
  • what's the distribution of how many PDFs fail to be parsed, for text extraction?
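As a hedged sketch of the first methodology bullet -- resolving doi fields through doi.org redirects to find the publisher's DNS domain (some publishers block HEAD requests, so a GET fallback may be needed):

# Hypothetical sketch: map a DOI to the publisher's DNS domain by following
# the doi.org redirect chain.
from urllib.parse import urlparse
import requests

def publisher_domain(doi):
    resp = requests.head(f"https://doi.org/{doi}", allow_redirects=True, timeout=30)
    return urlparse(resp.url).netloc

# placeholder DOI -- substitute values from the bucket_final partitions
print(publisher_domain("10.1000/xyz123"))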

Delivery:

  1. results are best visualized and packaged as a Jupyter notebook here in the https://github.com/Coleridge-Initiative/RCGraph repo
  2. later we'll move the analysis into an additional workflow step.

Take action on error logs to improve data quality in KG

See the error logs per workflow stage in:

These are errors encountered and logged at each stage.
For each, we must:

  • try to determine root cause for each API lookup failure
  • for legit error cases, update error handling in richcontext.scholapi
  • for a metadata error upstream, make a PR to fix that
  • for a metadata error in API results, make a PR for manual override

Overall, we need to create or improve unit tests to give better coverage where possible, and also improve our API integration.

analysis: scientific paper section classifier

We need to train a text-based classifier model to identify the sections of a parsed PDF for a research paper. Dataset linking ML models typically depend on section as a feature.

The source data is in the S3 bucket; see corpus_docs/pub/txt/*.txt

Some good prior work -- though not necessarily best as starting points for this project:

There's plenty of "research" published in this area, although be careful since most of these published works are horridly out of date, represent disproven practices, and should be mostly ignored:

While there's been much research in this area, using dependency parsing in general plus the more recent transformer approaches, we'll start with weak supervision instead.

Start with use of Snorkel https://www.snorkel.org/use-cases/ to build a set of labeling functions for preparing the training data for the classifier. That way we'll be able to update and rebuild training sets more dynamically, as our corpus changes.
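A minimal, hedged sketch of what such labeling functions might look like (the section labels and heuristics below are placeholders, not the project's actual labeling scheme):

# Hypothetical sketch: Snorkel labeling functions that weakly label paragraphs
# of a parsed PDF by section, as training data for the section classifier.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier

ABSTAIN, METHODS, RESULTS = -1, 0, 1

@labeling_function()
def lf_methods_heading(x):
    return METHODS if x.text.lower().lstrip().startswith(("methods", "data and methods")) else ABSTAIN

@labeling_function()
def lf_results_heading(x):
    return RESULTS if x.text.lower().lstrip().startswith(("results", "findings")) else ABSTAIN

df = pd.DataFrame({"text": [
    "METHODS We use longitudinal administrative data ...",
    "Results indicate substantial diversity in employment patterns ...",
]})
applier = PandasLFApplier(lfs=[lf_methods_heading, lf_results_heading])
print(applier.apply(df=df))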

clean partitions incorrectly added with verified not-links

Review the following partitions

  • 20200311_NSF_SED_and_SDR_part8_publications
  • 20200302_federated2_USDA_AgriculturalResourceManagementSurvey_part28_publications
  • 20200302_USDA_ARMS_website_part1_publications

Replace the data drops in RichContextMetadata and RCPublications, rerun the KG workflow in RCGraph, and update the RCKG jsonld file saved as the latest version.

knowledge graph clean-up

General issues for cleaning up the knowledge graph:

  • workflow steps that call APIs should reuse previous responses when an API call fails
  • have options in gen_ttl.py to generate the public corpus for the ML competition vs. the full corpus which isn't filtered (to use in ADRF, recsys, etc.)
  • filter out publications that lack URLs or open access PDFs at the very latest point possible
  • make sure that publications and other entities going into TTL/JSON-LD are unique
  • pull the ISSN identifiers per publication

Alternative metadata sources from the agencies + libraries

The agencies (NOAA, NASA, USDA, GCIS, USGS, etc.) are providing other sources for metadata which we can import. Mostly we'll be working with the agency libraries.

We'll be working on a case by case basis for these; workflows will probably need to be changed to adapt to these new sources.

Paco leads on these points; we must wait for the CRADA to be signed with NOAA before proceeding.

[Question] Is richcontext.graph.RCGraph used once the workflow is finished?

Hi!
I've seen that RCGraph is used to create the jsonld corpus. For example, it's used in run_step2.py

But it seems that it's not used in corpus.ipynb.
So I wonder if the RCGraph class can be used to explore the data contained in the KG. On the other hand, in corpus.ipynb a graph object is created to visualize the graph:

nxg = nx.Graph()

Maybe RCGraph is not designed to store data, and that's why corpus.ipynb uses the bare jsonld file?

Thanks @ernestogimeno @JasonZhangzy1757

Abstract of publication-2b2480acb1b98d322868 includes extra text that is not part of the abstract

Publication title Standing Still or Moving Up? Evidence from Wisconsin on the Long-Term Employment and Earnings of TANF Participants

Publication UUID (2020-06-02) publication-2b2480acb1b98d322868

The actual abstract is:

Abstract
This study identified the employment and earnings trajectories of welfare recipients over six years for a sample of 14,150 women who entered the Temporary Assistance for Needy Families program (TANF) in Wisconsin in its first year. Wisconsin longitudinal administrative data were used to examine differential patterns of mid-term (three years) and long-term (six years) employment and earnings success. We developed a conceptual approach to categorizing participants' employment and earnings trajectory groups. Results indicate substantial diversity in employment and earnings patterns. Some women have consistently positive outcomes, others show steady improvements over time, and others have inconsistent patterns that end strong. We found that 46% of the sample fit into one of three successful employment trajectories, and 22% fit into one of three successful earnings trajectories. Results also reveal that many women who were successful in the mid-term were not able to sustain their progress. For example, only 56% of those who were earning successes in the mid-term were still successful in the long-term. Finally, logistic regression models were used to compare the factors associated with mid-term and long-term success and with employment success and earnings success. Implications of findings are discussed.

But this other text is also included:

KEY WORDS: employment and earnings; poverty; social policy; TANF; welfare reform ********** Since the passage of the Personal Responsibility and Work Opportunity Reconciliation Act (PRWORA) of 1996, welfare programs have aimed to move low-income women with children from welfare to work. In the past decade, the number of single-mother families receiving traditional cash assistance (now called Temporary Assistance for Needy Families [TANF]) has fallen dramatically, and employment rates have risen. However, much of the literature on the economic well-being of welfare recipients and welfare leavers suggests that many who have moved from welfare to work move in and out of the labor market frequently, are working for low wages, and have insufficient earnings to support a family above the poverty line without receiving public means-tested benefits. (For a discussion of post-TANF economic status, see Blank, 2006; Cancian & Meyer, 2004; Danziger, Heflin, Corcoran, Oltmans, & Wang, 2002; Grogger & Karoly, 2005; Johnson & Corcoran, 2003.) Most of the early studies on economic well-being of welfare recipients after welfare reform have examined employment, earnings, and income after leaving cash benefits over fairly short periods of time. Less is known about whether the short-term economic success (or lack of success) has persisted in the long-term. In this article, we use longitudinal administrative data to examine the employment and earnings trajectories of welfare recipients over six years for a sample of approximately 17,000 women who entered TANF in Wisconsin in its first year. We propose a method to characterize the six-year patterns of employment and earnings and consider differential patterns of mid-term (three years) and long-term (six years) employment and earnings success. We compare the factors associated with mid-term and long-term success. POLICY CONTEXT This article focuses on TANF participants in Wisconsin. Wisconsin's TANF program, called Wisconsin Works (W-2), was instituted in September 1997. W-2 consists of several 'tiers' and is structured to mirror the world of employment. Thus, individuals do not receive a cash payment unless they are working in a community service job, are engaged in a work-like activity (W-2 Transitions), or have a child younger than 13 weeks old (caretaker of newborn). In addition, individuals who are the most work-ready can receive a variety of services without receiving cash (case management). Individuals are expected to begin in the tier that corresponds to their level of work-readiness and to progress up the tiers until they no longer need any W-2 services.

Prototype a data impact factor

We'll follow the example from Bundesbank to estimate metrics for datasets.

One of the benefits of using graphs is that metrics such as a data impact factor can be calculated, which might be prohibitively difficult otherwise.

Paco leads on these points; we need a larger graph and must have other entities represented first.

Depends on: #29 #30 #39 #40 #41
