
Rich Context knowledge graph management

Home Page: https://rc.coleridgeinitiative.org/?radius=3&entity=NOAA

License: Creative Commons Zero v1.0 Universal

Languages: Python 3.83%, Jupyter Notebook 35.87%, HTML 60.31%
Topics: rich-context, metadata, knowledge-graph

rcgraph's Introduction

RCGraph

Manage the Rich Context knowledge graph.

Installation

First, there are two options for creating an environment.

Option 1: use virtualenv to create a virtual environment with the local Python 3.x as the target binary.
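For example, a typical invocation (the directory name venv must match the activation command below):

virtualenv -p python3 venv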

Then activate the virtual environment and upgrade setuptools:

source venv/bin/activate
pip install setuptools --upgrade

Option 2: use conda -- see https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html

Second, clone the repo:

git clone https://github.com/Coleridge-Initiative/RCGraph.git

Third, change into the directory and initialize the local Git configuration for the required submodules:

cd RCGraph
git submodule init
git submodule update
git config status.submodulesummary 1

Given that foundation, load the dependencies:

pip install -r requirements.txt

Fourth, set up the local rc.cfg configuration file and run the unit tests (see below) to confirm that this project has been installed and configured properly.

Submodules

Ontology definitions used for the KG are linked into this project as a submodule:

Git repos exist for almost every entity in the KG, also linked as submodules:

The RCLC leaderboard competition is also linked as a submodule since it consumes from this repo for corpus updates:

Updates

To update the submodules to their latest HEAD commit on the master branch, run:

git submodule foreach "(git fetch; git merge origin/master; cd ..;)"

Then add the updated submodules and commit.

For more info about how to use Git submodules, see:

Workflow

Initial Steps

  • update datasets.json -- datasets are the foundation for the KG
  • add a new partition of publication metadata for each data ingest

Step 1: Graph Consistency Tests

To perform these tests:

coverage run -m unittest discover

Then create GitHub issues in the corresponding submodule repos for any failed tests.

Also, you can generate a coverage report and upload that via:

coverage report
bash <(curl -s https://codecov.io/bash) -t @.cc_token

Test coverage reports can be viewed at https://codecov.io/gh/Coleridge-Initiative/RCGraph

Step 2: Gather the DOIs, etc.

Use title search across the scholarly infrastructure APIs to identify a DOI and other metadata for each publication.

python run_step2.py

Results are organized in partitions within the bucket_stage subdirectory, using the same partition names from the preceding workflow steps, to make errors easier to trace and troubleshoot.

See the misses_step2.json file which reports the title of each publication that failed every API lookup.
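As a hedged illustration of the kind of title lookup this step performs -- the actual implementation relies on the project's scholarly-API integrations (see richcontext.scholapi in the issues below), so treat this as a standalone sketch against the public Crossref API:

# Hypothetical sketch: find DOI candidates for a publication title via the
# public Crossref REST API; run_step2.py's own API calls and scoring differ.
import requests

def doi_candidates(title, rows=5):
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": title, "rows": rows},
        timeout=30,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    # return (title, DOI) pairs for downstream matching against the input title
    return [((item.get("title") or [""])[0], item.get("DOI")) for item in items]

print(doi_candidates("Standing Still or Moving Up?"))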

Step 3: Gather the PDFs, etc.

Use publication lookup with DOIs across the scholarly infrastructure APIs to identify open access PDFs, journals, authors, keywords, etc.

python run_step3.py

Results are organized in partitions in the bucket_stage subdirectory, using the same partition names from the preceding workflow steps.

See the misses_step3.json file which reports the title of each publication that failed every API lookup.

Step 4: Reconcile Journal Entities

This is a manual step.

Scan results from calls to scholarly infrastructure APIs, then apply business logic to reconcile the journal for each publication with the journals.json entity listing.

python run_step4.py

Disputed entity definitions are written to standard output, and suggested additions are written to a new update_journals.json file.

The person running this step must review each suggestion, then determine whether to add the suggested journals to the journals.json entities file -- or make other changes to previously described journal entities. For example, sometimes the metadata returned from discovery APIs has errors and would cause data quality issues within the KG.

Some good tools for manually checking journal metadata via ISSNs include ISSN.org, Crossref, and NCBI. For example, using the ISSN "1531-3204" to look up journal metadata:

Often there will be outdated/invalidated ISSNs or low-info-content defaults (e.g., substituting SSRN) included in API results, which could derail our KG development.

Journal names get used later in the workflow to construct UUIDs for publications, prior to generating the public corpus. This step performs consistency tests and filtering of the API metadata, to avoid data quality issues later.
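As a purely hypothetical illustration of why that matters (the actual identifier logic lives in gen_ttl.py and may differ), a content-derived identifier changes whenever the journal string changes, so inconsistent journal metadata would fragment publications in the KG:

# Hypothetical sketch only -- not the project's actual UUID construction.
# If the journal string differs between runs, the identifier differs too.
import hashlib

def publication_id(title, journal):
    digest = hashlib.sha1(f"{title}::{journal}".encode("utf-8")).hexdigest()
    return "publication-" + digest[:20]

# placeholder journal name, for illustration only
print(publication_id("Standing Still or Moving Up?", "Example Journal of Social Policy"))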

See the misses_step4.json file which reports the title of each publication that doesn't have a journal.

Caveats:

  • If you don't understand what this step performs, don't run it
  • Do not make manual edits to the journals.json file

Step 5: Reconcile Author Lists

This is a manual step.

Scan results from calls to scholarly infrastructure APIs, then apply business logic to reconcile (disambiguate) the author lists for each publication with the authors.json entity listing.

python run_author.py

Lists of authors are parsed from metadata in the bucket_stage then disambiguated.

Results are organized in partitions in the bucket_stage subdirectory, using the same partition names from the preceding workflow steps.

The stage produces two files:

  • authors.json -- list of known authors
  • auth_train.tsv -- training set for self-supervised model

See the misses_author.json file which reports the title of each publication that doesn't have any authors.

Caveats:

  • Do not make manual edits to authors.json or auth_train.tsv

Step 6: Pull Abstracts

This workflow step pulls the abstracts from the results of API calls in previous steps.

python run_abstract.py

Results are organized in partitions in the bucket_stage subdirectory, using the same partition names from the preceding workflow steps.

See the misses_abstract.json file which reports the title of each publication that had no abstract.

Step 7: Parse Keyphrases from Abstracts

This workflow step parses keyphrases from abstracts.

python run_keyphr.py

Results are organized in partitions in the bucket_stage subdirectory, using the same partition names from the preceding workflow steps.

See the misses_keyphr.json file which reports the title of each publication for which no keyphrases could be parsed.
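The issues below mention running the TextRank algorithm over abstracts; as a hedged sketch (run_keyphr.py's actual implementation and models may differ), a PyTextRank pipeline on spaCy looks roughly like this:

# Hedged sketch of TextRank-style key phrase extraction from one abstract;
# the real run_keyphr.py iterates over the bucket_stage partitions instead.
import spacy
import pytextrank  # noqa: F401 -- registers the "textrank" pipeline component

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

abstract = "This study identified the employment and earnings trajectories of welfare recipients over six years ..."
doc = nlp(abstract)

for phrase in doc._.phrases[:5]:
    print(round(phrase.rank, 4), phrase.text)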

Step 9: Finalize Metadata Corrections

This workflow step finalizes the metadata corrections for each publication, including selection of a URL, open access PDF, etc., along with any manual overrides.

python run_final.py

Results are organized in partitions in the bucket_final subdirectory, using the same partition names from the previous workflow step.

See the misses_final.json file which reports the title of each publication that failed every API lookup.

Step 10: Generate Corpus Update

This workflow step generates uuid values (late binding) for both publications and datasets, then serializes the full output as TTL in tmp.ttl and as JSON-LD in tmp.jsonld for a corpus update:

python gen_ttl.py

Afterwards, move the generated tmp.* files into the RCLC repo and rename them:

mv tmp.* rclc
cd rclc
mv tmp.ttl corpus.ttl
mv tmp.jsonld corpus.jsonld

To publish the corpus:

  1. commit and create a new tagged release
  2. run bin/download_resources.py to download PDFs
  3. extract text from PDFs
  4. upload to the public S3 bucket and write manifest

Step 11: Generate UI Web App Update

To update the UI web app:

./gen_ttl.py --full_graph true
cp tmp.jsonld full.jsonld 
cp tmp.ttl full.ttl 
gsutil cp full.jsonld gs://rich-context/

rcgraph's People

Contributors

abhi-balaji, andrewhnorris, ceteri, ernestogimeno, jasonzhangzy1757


rcgraph's Issues

Create a library for “imperfect” string comparisons

We need a library to calibrate different methods for "imperfect" text matching. The idea is to be able to calibrate parameters for different contexts (e.g., dataset titles, publication titles, author names, etc.).
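A minimal sketch of what such a calibrated matcher could look like, using only the standard library (the names and threshold values below are hypothetical):

# Hypothetical sketch: fuzzy string comparison with per-context thresholds,
# so the same matcher can be calibrated separately for dataset titles,
# publication titles, author names, etc.
from difflib import SequenceMatcher

THRESHOLDS = {
    "dataset_title": 0.90,
    "publication_title": 0.85,
    "author_name": 0.75,
}

def fuzzy_match(a, b, context="publication_title"):
    score = SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()
    return score, score >= THRESHOLDS[context]

print(fuzzy_match("Agricultural Resource Management Survey",
                  "agricultural resources management survey",
                  context="dataset_title"))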

Incorporate results from the ML competition

Begin to use models from our ML competition:

  • develop the required training sets
  • train models
  • evaluate models on new data
  • build ensembles (where possible)

Also, iterate to experiment with our own models and publish those:

  1. to help boost the competition
  2. to abstract learnings from the competition into our RCGraph workflow

[Future Work/ Idea] Graph Database to store the data of the RCKG

If I remember correctly, I think @ernestogimeno mentioned to me that the jsonld corpus format has some limitations that may affect scalability, so at some point we may need to use a real database.

A possible option is to use a graph database such as Neo4j. In such a database, the information is stored as a graph, so it might be suitable for our needs.

I just put this thought here for future discussion if we need to upgrade the jsonld format.

analytics: dataset co-occurrence

We need a means to analyze the co-occurrence rates for datasets. In other words, look at the rate at which datasets are used together, based on links from publications in our KG.

Use the full.jsonld instance of the KG as the basis for analysis.

Steps:

  1. results are best packaged as a Jupyter notebook here in the https://github.com/Coleridge-Initiative/RCGraph repo
  2. later we'll move the analysis into an additional workflow step, and show the results in the web app.

USDA would love to have this before the mid-March demo.
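A rough sketch of the analysis (the JSON-LD property names below are assumptions and need to be checked against the actual full.jsonld schema):

# Hypothetical sketch: count how often pairs of datasets are linked from the
# same publication in full.jsonld, then report the most frequent pairs.
import json
from collections import Counter
from itertools import combinations

with open("full.jsonld") as f:
    nodes = json.load(f).get("@graph", [])

cooccur = Counter()
for node in nodes:
    # "dct:references" is an assumed name for publication->dataset links
    linked = node.get("dct:references", [])
    if isinstance(linked, dict):
        linked = [linked]
    ids = sorted(d["@id"] for d in linked if isinstance(d, dict) and "@id" in d)
    for pair in combinations(ids, 2):
        cooccur[pair] += 1

for pair, count in cooccur.most_common(10):
    print(count, pair)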

Add abstracts entities into the KG

Add the publication abstracts into the KG, based on stage3 results from discovery APIs.

Best to use run_stage3.py as a template, although the main part to reuse is how it iterates through partitions in the BUCKET_STAGE.

@ernestogimeno has also worked with this code and can help guide/advise.

Will need to pull the abstract field from metadata, where it exists. The responses from Semantic Scholar tend to have these -- and we may be able to extend other API calls to get abstracts. @lobodemonte can assist on those extensions.

The end goal will be to include abstracts as metadata in the graph -- where available -- and then also run these through a later stage that runs the TextRank algorithm to extract key phrases.

download_resources.py is missing

Hi!
From the readme:

To publish the corpus:
commit and create a new tagged release
run bin/download_resources.py to download PDFs

But I cannot find download_resources.py in the repository. Is it missing?
Thanks @ernestogimeno

analysis: author cliques

We need a means to analyze the co-occurrence rates for authors. In other words, look at the rate at which authors appear together as co-authors on research publications in our KG.

Use the full.jsonld instance of the KG as the basis for analysis.

Since authors are "social" (ostensibly, unlike datasets), it's more interesting to run clique analysis instead of co-occurrence probability rates. A good approach would likely be:

  1. build a graph in networkx from the JSON-LD (similar to the web app in RCServer)
  2. use clique features

Steps:

  1. results are best visualized and packaged as a Jupyter notebook here in the https://github.com/Coleridge-Initiative/RCGraph repo
  2. later we'll move the analysis into an additional workflow step, and show the results in the web app.

ResearchGate is very interested in this feature -- let's aim for late March or sooner.
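A hedged sketch of steps 1 and 2 (again, the JSON-LD property names are assumptions to verify against the corpus):

# Hypothetical sketch: build a co-authorship graph from full.jsonld with
# networkx, then enumerate the largest author cliques.
import json
from itertools import combinations
import networkx as nx

with open("full.jsonld") as f:
    nodes = json.load(f).get("@graph", [])

G = nx.Graph()
for node in nodes:
    # "dct:creator" is an assumed name for publication->author links
    authors = node.get("dct:creator", [])
    if isinstance(authors, dict):
        authors = [authors]
    ids = [a["@id"] for a in authors if isinstance(a, dict) and "@id" in a]
    for a, b in combinations(ids, 2):
        G.add_edge(a, b)

for clique in sorted(nx.find_cliques(G), key=len, reverse=True)[:5]:
    print(len(clique), clique)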

analysis: publisher classifier

We need a means to analyze the "quality" of the more popular journal article publishers. In other words, we need a classifier based on the publisher (ScienceDirect, PubMed, OUP, etc.) for how likely it is that entries in publication partitions will "survive" all the way through our workflow to successful PDF parsing.

Methodology:

  • analyze the distribution of publishers among the entries in partitions in bucket_final by resolving doi fields to URLs, then fetching those to determine their DNS domain
  • what's the distribution of how many fail to have open access PDFs? use a title match on errors/misses_final.txt
  • what's the distribution of how many PDFs fail to download? see rclc/errors.txt
  • what's the distribution of how many PDFs fail to be parsed, for text extraction?
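As a hedged sketch of the first methodology bullet -- resolving doi fields through doi.org redirects to find the publisher's DNS domain (some publishers block HEAD requests, so a GET fallback may be needed):

# Hypothetical sketch: map a DOI to the publisher's DNS domain by following
# the doi.org redirect chain.
from urllib.parse import urlparse
import requests

def publisher_domain(doi):
    resp = requests.head(f"https://doi.org/{doi}", allow_redirects=True, timeout=30)
    return urlparse(resp.url).netloc

# placeholder DOI -- substitute values from the bucket_final partitions
print(publisher_domain("10.1000/xyz123"))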

Delivery:

  1. results are best visualized and packaged as a Jupyter notebook here in the https://github.com/Coleridge-Initiative/RCGraph repo
  2. later we'll move the analysis into an additional workflow step.

Take action on error logs to improve data quality in KG

See the error logs per workflow stage in:

These are errors encountered and logged at each stage.
For each, we must:

  • try to determine root cause for each API lookup failure
  • for legit error cases, update error handling in richcontext.scholapi
  • for a metadata error upstream, make a PR to fix that
  • for a metadata error in API results, make a PR for manual override

Overall, we need to create or improve unit tests to give better coverage where possible, and also improve our API integration.

analysis: scientific paper section classifier

We need to train a text-based classifier model to identify the sections of a parsed PDF for a research paper. Dataset linking ML models typically depend on section as a feature.

The source data is in the S3 bucket; see corpus_docs/pub/txt/*.txt

Some good prior work -- though not necessarily best as starting points for this project:

There's plenty of "research" published in this area, although be careful since most of these published works are horridly out of date, represent disproven practices, and should be mostly ignored:

While there's been much research in this area, using dependency parsing in general plus the more recent transformer approaches, we'll start with weak supervision instead.

Start with use of Snorkel https://www.snorkel.org/use-cases/ to build a set of labeling functions for preparing the training data for the classifier. That way we'll be able to update and rebuild training sets more dynamically, as our corpus changes.
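A minimal, hedged sketch of what such labeling functions might look like (the section labels and heuristics below are placeholders, not the project's actual labeling scheme):

# Hypothetical sketch: Snorkel labeling functions that weakly label paragraphs
# of a parsed PDF by section, as training data for the section classifier.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier

ABSTAIN, METHODS, RESULTS = -1, 0, 1

@labeling_function()
def lf_methods_heading(x):
    return METHODS if x.text.lower().lstrip().startswith(("methods", "data and methods")) else ABSTAIN

@labeling_function()
def lf_results_heading(x):
    return RESULTS if x.text.lower().lstrip().startswith(("results", "findings")) else ABSTAIN

df = pd.DataFrame({"text": [
    "METHODS We use longitudinal administrative data ...",
    "Results indicate substantial diversity in employment patterns ...",
]})
applier = PandasLFApplier(lfs=[lf_methods_heading, lf_results_heading])
print(applier.apply(df=df))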

clean partitions incorrectly added with verified not-links

Review the following partitions

  • 20200311_NSF_SED_and_SDR_part8_publications
  • 20200302_federated2_USDA_AgriculturalResourceManagementSurvey_part28_publications
  • 20200302_USDA_ARMS_website_part1_publications

Replace the data drops in RichContextMetadata and RCPublications, rerun the KG workflow in RCGraph, and update the RCKG jsonld file saved as the latest version.

knowledge graph clean-up

General issues for cleaning up the knowledge graph:

  • workflow steps that call APIs should reuse previous responses when an API call fails
  • have options in gen_ttl.py to generate the public corpus for the ML competition vs. the full corpus which isn't filtered (to use in ADRF, recsys, etc.)
  • filter out publications that lack URLs or open access PDFs at the very latest point possible
  • make sure that publications and other entities going into TTL/JSON-LD are unique
  • pull the ISSN identifiers per publication

Alternative metadata sources from the agencies + libraries

The agencies (NOAA, NASA, USDA, GCIS, USGS, etc.) are providing other sources for metadata which we can import. Mostly we'll be working with the agency libraries.

We'll be working on a case by case basis for these; workflows will probably need to be changed to adapt to these new sources.

Paco leads on these points; we must wait for the CRADA to be signed with NOAA before proceeding.

[Question] Is richcontext.graph.RCGraph used once the workflow is finished?

Hi!
I've seen that RCGraph is used to create the jsonld corpus. For example, it's used in run_step2.py

But it seems that it's not used in corpus.ipynb.
So I wonder if the RCGraph class can be used to explore the data contained in the KG. On the other hand, in corpus.ipynb a graph object is created to visualize the graph:

nxg = nx.Graph()

Maybe RCGraph is not designed to store data, and that's why corpus.ipynb uses the bare jsonld file?

Thanks @ernestogimeno @JasonZhangzy1757

Abstract of publication-2b2480acb1b98d322868 includes extra text that is not part of the abstract

Publication title Standing Still or Moving Up? Evidence from Wisconsin on the Long-Term Employment and Earnings of TANF Participants

Publication UUID (2020-06-02) publication-2b2480acb1b98d322868

The actual abstract is:

Abstract
This study identified the employment and earnings trajectories of welfare recipients over six years for a sample of 14,150 women who entered the Temporary Assistance for Needy Families program (TANF) in Wisconsin in its first year. Wisconsin longitudinal administrative data were used to examine differential patterns of mid-term (three years) and long-term (six years) employment and earnings success. We developed a conceptual approach to categorizing participants' employment and earnings trajectory groups. Results indicate substantial diversity in employment and earnings patterns. Some women have consistently positive outcomes, others show steady improvements over time, and others have inconsistent patterns that end strong. We found that 46% of the sample fit into one of three successful employment trajectories, and 22% fit into one of three successful earnings trajectories. Results also reveal that many women who were successful in the mid-term were not able to sustain their progress. For example, only 56% of those who were earning successes in the mid-term were still successful in the long-term. Finally, logistic regression models were used to compare the factors associated with mid-term and long-term success and with employment success and earnings success. Implications of findings are discussed.

But this other text is also included:

KEY WORDS: employment and earnings; poverty; social policy; TANF; welfare reform ********** Since the passage of the Personal Responsibility and Work Opportunity Reconciliation Act (PRWORA) of 1996, welfare programs have aimed to move low-income women with children from welfare to work. In the past decade, the number of single-mother families receiving traditional cash assistance (now called Temporary Assistance for Needy Families [TANF]) has fallen dramatically, and employment rates have risen. However, much of the literature on the economic well-being of welfare recipients and welfare leavers suggests that many who have moved from welfare to work move in and out of the labor market frequently, are working for low wages, and have insufficient earnings to support a family above the poverty line without receiving public means-tested benefits. (For a discussion of post-TANF economic status, see Blank, 2006; Cancian & Meyer, 2004; Danziger, Heflin, Corcoran, Oltmans, & Wang, 2002; Grogger & Karoly, 2005; Johnson & Corcoran, 2003.) Most of the early studies on economic well-being of welfare recipients after welfare reform have examined employment, earnings, and income after leaving cash benefits over fairly short periods of time. Less is known about whether the short-term economic success (or lack of success) has persisted in the long-term. In this article, we use longitudinal administrative data to examine the employment and earnings trajectories of welfare recipients over six years for a sample of approximately 17,000 women who entered TANF in Wisconsin in its first year. We propose a method to characterize the six-year patterns of employment and earnings and consider differential patterns of mid-term (three years) and long-term (six years) employment and earnings success. We compare the factors associated with mid-term and long-term success. POLICY CONTEXT This article focuses on TANF participants in Wisconsin. Wisconsin's TANF program, called Wisconsin Works (W-2), was instituted in September 1997. W-2 consists of several 'tiers' and is structured to mirror the world of employment. Thus, individuals do not receive a cash payment unless they are working in a community service job, are engaged in a work-like activity (W-2 Transitions), or have a child younger than 13 weeks old (caretaker of newborn). In addition, individuals who are the most work-ready can receive a variety of services without receiving cash (case management). Individuals are expected to begin in the tier that corresponds to their level of work-readiness and to progress up the tiers until they no longer need any W-2 services.

Prototype a data impact factor

We'll follow the example from Bundesbank to estimate metrics for datasets.

One of the benefits of using graphs is that metrics such as a data impact factor can be calculated, which might be prohibitively difficult otherwise.

Paco leads on these points; we need a larger graph and must have other entities represented first.

Depends on: #29 #30 #39 #40 #41
