coleridge-initiative / rcgraph

Rich Context knowledge graph management

Home Page: https://rc.coleridgeinitiative.org/?radius=3&entity=NOAA

License: Creative Commons Zero v1.0 Universal

Python 3.83% Jupyter Notebook 35.87% HTML 60.31%
rich-context metadata knowledge-graph

rcgraph's Issues

Take action on error logs to improve data quality in KG

See the error logs per workflow stage; these are the errors encountered and logged at each stage.
For each, we must:

  • try to determine the root cause of each API lookup failure
  • for legitimate error cases, update the error handling in richcontext.scholapi
  • for a metadata error upstream, submit a PR to fix it there
  • for a metadata error in API results, submit a PR that adds a manual override

Overall, we need to create or improve unit tests to give better coverage where possible, and also improve our API integration.

Prototype a data impact factor

We'll follow the example from Bundesbank to estimate metrics for datasets.

One of the benefits of using graphs is that metrics such as a data impact factor can be calculated, which might otherwise be prohibitively difficult.

Paco leads on these points; we need a larger graph and must have other entities represented first.

Depends on: #29 #30 #39 #40 #41

analysis: publisher classifier

We need means to analyze the "quality" for the more popular journal article publishers. In other words, we need a classifier based on the publisher (ScienceDirect, PubMed, OUP, etc.) for how likely the entries in publication partitions will "survive" all the way through our workflow to successful PDF parsing.

Methodology:

  • analyze the distribution of publishers among the entries in partitions in bucket_final by resolving doi fields to URLs, then fetching those to determine their DNS domain (see the sketch below)
  • what's the distribution of how many fail to have open access PDFs? Use a title match against errors/misses_final.txt
  • what's the distribution of how many PDFs fail to download? See rclc/errors.txt
  • what's the distribution of how many PDFs fail to be parsed for text extraction?
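
As a first cut at the publisher distribution, here is a minimal sketch; the resolve_doi_domain helper is hypothetical (not existing workflow code), and the DOI list is a placeholder for values pulled from the bucket_final partitions:

```python
from collections import Counter
from urllib.parse import urlparse
import urllib.request

def resolve_doi_domain(doi, timeout=30):
    """Follow the doi.org redirect chain and return the publisher's DNS domain."""
    req = urllib.request.Request(f"https://doi.org/{doi}", method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        # geturl() reports the final URL after all redirects
        return urlparse(resp.geturl()).netloc

# placeholder; real values come from the doi fields of partition entries
dois = ["10.1000/xyz123"]

domains = Counter()
for doi in dois:
    try:
        domains[resolve_doi_domain(doi)] += 1
    except Exception:
        domains["<unresolved>"] += 1

print(domains.most_common(20))
```

Note that some publishers reject HEAD requests, so a GET fallback may be needed in practice.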

Delivery:

  1. results are best visualized and packaged as a Jupyter notebook here in the https://github.com/Coleridge-Initiative/RCGraph repo
  2. later we'll move the analysis into an additional workflow step.

analysis: author cliques

We need a means to analyze co-occurrence rates for authors. In other words, we want to look at the rate at which authors appear together as co-authors on research publications in our KG.

Use the full.jsonld instance of the KG as the basis for analysis.

Since authors "social" (ostensibly, unlike datasets) then it's more interested to have clique analysis instead of co-occurrence probability rates. A good approach would likely be

  1. build a graph in networkx from the JSON-LD (similar to the web app in RCServer)
  2. use clique features
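
A minimal sketch of those two steps, assuming a co-authorship graph built from full.jsonld; the "dct:creator" field name is a guess and should be checked against the actual JSON-LD schema:

```python
import json
from itertools import combinations
import networkx as nx

with open("full.jsonld") as f:
    corpus = json.load(f)

G = nx.Graph()
for node in corpus.get("@graph", []):
    # assumption: authors are listed under "dct:creator" with "@id" identifiers
    authors = [a["@id"] for a in node.get("dct:creator", [])]
    for a, b in combinations(sorted(set(authors)), 2):
        # edge weight counts how many publications the pair shares
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)

# maximal cliques of 3+ authors who have all co-authored with one another
cliques = [c for c in nx.find_cliques(G) if len(c) >= 3]
cliques.sort(key=len, reverse=True)
print(cliques[:10])
```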

Steps:

  1. results are best visualized and packaged as a Jupyter notebook here in the https://github.com/Coleridge-Initiative/RCGraph repo
  2. later we'll move the analysis into an additional workflow step, and show the results in the web app.

ResearchGate is very interested in this feature -- let's aim for late March or sooner.

[Future Work/ Idea] Graph Database to store the data of the RCKG

If I remember correctly, @ernestogimeno mentioned to me that the jsonld corpus format has some limitations that may affect scalability, so at some point we may need to use a real database.

A possible option is a graph database such as Neo4j, which stores information natively as a graph and so might suit our needs.

I just put this thought here for future discussion, in case we need to upgrade from the jsonld format.

Add abstracts entities into the KG

Add the publication abstracts into the KG, based on stage3 results from discovery APIs.

Best to use run_stage3.py as a template, although the main part to reuse is how it iterates through the partitions in BUCKET_STAGE.

@ernestogimeno has also worked with this code and can help guide/advise.

Will need to pull the abstract field from metadata, where it exists. The responses from Semantic Scholar tend to have these -- and we may be able to extend other API calls to get abstracts. @lobodemonte can assist on those extensions.
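
A minimal sketch of pulling an abstract by DOI directly from the public Semantic Scholar Graph API; the production workflow would go through richcontext.scholapi instead, so verify the endpoint and field names before relying on this:

```python
import json
import urllib.request

def fetch_abstract(doi, timeout=30):
    """Return the abstract for a DOI from Semantic Scholar, or None if absent."""
    # assumption: current public Graph API endpoint; confirm against scholapi
    url = f"https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}?fields=abstract"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        meta = json.load(resp)
    return meta.get("abstract")
```

Each partition entry that already carries a doi could be enriched this way before the abstract gets written into the publication's metadata.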

The end goal will be to include abstracts as metadata in the graph -- where available -- and then also run these through a later stage that runs the TextRank algorithm to extract key phrases.
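
For that later key-phrase stage, a minimal sketch using spaCy with the pytextrank pipeline component; the add_pipe("textrank") form assumes pytextrank 3.x (older releases register the component differently):

```python
import spacy
import pytextrank  # registers the "textrank" pipeline factory

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

abstract = "This study identified the employment and earnings trajectories of welfare recipients ..."
doc = nlp(abstract)

# top-ranked key phrases extracted by TextRank
for phrase in doc._.phrases[:10]:
    print(round(phrase.rank, 4), phrase.text)
```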

knowledge graph clean-up

General issues for cleaning up the knowledge graph:

  • workflow steps that call APIs should reuse previous responses when an API call fails (see the sketch after this list)
  • have options in gen_ttl.py to generate the public corpus for the ML competition vs. the full, unfiltered corpus (for use in ADRF, recsys, etc.)
  • filter out publications that lack URLs or open access PDFs at the very latest point possible
  • make sure that publications and other entities going into TTL/JSON-LD are unique
  • pull the ISSN identifiers per publication
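
For the first bullet, a minimal sketch of the reuse pattern; the cache path and lookup_fn are hypothetical stand-ins, not the actual workflow code:

```python
import json
from pathlib import Path

CACHE_PATH = Path("cache/api_responses.json")  # hypothetical location

def cached_lookup(key, lookup_fn):
    """Call the API, but fall back to the last good response if the call fails."""
    cache = json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}
    try:
        response = lookup_fn(key)
        cache[key] = response
        CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)
        CACHE_PATH.write_text(json.dumps(cache, indent=2))
        return response
    except Exception:
        # reuse the previous response rather than dropping the record
        return cache.get(key)
```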

analytics: dataset co-occurrence

We need a means to analyze co-occurrence rates for datasets. In other words, we want to look at the rate at which datasets are used together, based on links from publications in our KG.

Use the full.jsonld instance of the KG as the basis for analysis.
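
A minimal sketch of counting dataset pairs per publication from full.jsonld; the "cito:citesAsDataSource" field name is a guess and should be confirmed against the actual JSON-LD schema:

```python
import json
from collections import Counter
from itertools import combinations

with open("full.jsonld") as f:
    corpus = json.load(f)

pair_counts = Counter()
for node in corpus.get("@graph", []):
    # assumption: dataset links live under "cito:citesAsDataSource"
    datasets = sorted(d["@id"] for d in node.get("cito:citesAsDataSource", []))
    pair_counts.update(combinations(datasets, 2))

# most frequently co-used dataset pairs
for (d1, d2), count in pair_counts.most_common(10):
    print(count, d1, d2)
```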

Steps:

  1. results are best packaged as a Jupyter notebook here in the https://github.com/Coleridge-Initiative/RCGraph repo
  2. later we'll move the analysis into an additional workflow step, and show the results in the web app.

USDA would love to have this before the mid-March demo.

[Question] Is richcontext.graph.RCGraph used once the workflow is finished?

Hi!
I've seen that RCGraph is used to create the jsonld corpus; for example, it's used in run_step2.py.

But it seems that it's not used in corpus.ipynb, so I wonder whether the RCGraph class can also be used to explore the data contained in the KG. On the other hand, in corpus.ipynb a networkx graph object is created to visualize the graph:

import networkx as nx

nxg = nx.Graph()

Maybe RCGraph is not designed to store data, and that's why corpus.ipynb uses the bare jsonld file?

Thanks @ernestogimeno @JasonZhangzy1757

clean partitions incorrectly added with verified not-links

Review the following partitions

  • 20200311_NSF_SED_and_SDR_part8_publications
  • 20200302_federated2_USDA_AgriculturalResourceManagementSurvey_part28_publications
  • 20200302_USDA_ARMS_website_part1_publications

Replace the data drops in RichContextMetadata and RCPublications, rerun the KG workflow in RCGraph, and update the RCKG jsonld file saved as the latest version.

analysis: scientific paper section classifier

We need to train a text-based classifier model to identify the sections of a parsed PDF for a research paper. Dataset linking ML models typically depend on section as a feature.

The source data is in the S3 bucket; see corpus_docs/pub/txt/*.txt

Some good prior work exists, though not necessarily the best starting points for this project.

There's plenty of "research" published in this area, although be careful: most of these published works are horribly out of date, represent disproven practices, and should mostly be ignored.

While there's been much research in this area using dependency parsing in general, plus the more recent transformer approaches, we'll start with weak supervision instead.

Start with Snorkel (https://www.snorkel.org/use-cases/) to build a set of labeling functions for preparing the training data for the classifier. That way we'll be able to update and rebuild training sets dynamically as our corpus changes.
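
A minimal sketch of the labeling-function approach, assuming the parsed paper text has already been split into candidate sections with a heading and body text; the label set and heuristics here are illustrative placeholders, not a final scheme:

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier

# illustrative label scheme; the real one should cover all section types
ABSTAIN, METHODS, RESULTS = -1, 0, 1

@labeling_function()
def lf_methods_heading(x):
    # headings such as "Methods" or "Data and Methods"
    return METHODS if "method" in x.heading.lower() else ABSTAIN

@labeling_function()
def lf_results_heading(x):
    return RESULTS if "result" in x.heading.lower() else ABSTAIN

# df_sections would be built from corpus_docs/pub/txt/*.txt; rows here are dummies
df_sections = pd.DataFrame([
    {"heading": "Data and Methods", "text": "..."},
    {"heading": "Results", "text": "..."},
])

applier = PandasLFApplier([lf_methods_heading, lf_results_heading])
L_train = applier.apply(df=df_sections)
print(L_train)
```

The resulting label matrix would then feed Snorkel's label model to produce probabilistic training labels for the classifier.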

Incorporate results from the ML competition

Begin to use models from our ML competition:

  • develop the required training sets
  • train models
  • evaluate models on new data
  • build ensembles (where possible)

Also, iterate to experiment with our own models and publish those:

  1. to help boost the competition
  2. to abstract learnings from the competition into our RCGraph workflow

Create a library for “imperfect” string comparisons

We need a library to calibrate different methods for "imperfect" text matching. The idea is to be able to tune parameters separately for different contexts (e.g. dataset titles, publication titles, author names, etc.).
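
A minimal sketch of the calibration idea using only the standard library; the thresholds are illustrative placeholders to be tuned per context against labeled match/non-match pairs, not measured values:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized "imperfect" similarity between two strings, in [0, 1]."""
    return SequenceMatcher(None, a.casefold().strip(), b.casefold().strip()).ratio()

# per-context thresholds; placeholder values to be calibrated
THRESHOLDS = {
    "dataset_title": 0.90,
    "publication_title": 0.95,
    "author_name": 0.85,
}

def is_match(a: str, b: str, context: str) -> bool:
    return similarity(a, b) >= THRESHOLDS[context]

print(is_match("Survey of Doctorate Recipients", "survey of doctorate recipients ", "dataset_title"))
```

The same interface could later be backed by stronger matchers (e.g. token-based or embedding-based) while keeping the per-context calibration.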

Abstract of publication-2b2480acb1b98d322868 includes extra text that is not part of the abstract

Publication title Standing Still or Moving Up? Evidence from Wisconsin on the Long-Term Employment and Earnings of TANF Participants

Publication UUID (2020-06-02) publication-2b2480acb1b98d322868

The actual abstract is:

Abstract
This study identified the employment and earnings trajectories of welfare recipients over six years for a sample of 14,150 women who entered the Temporary Assistance for Needy Families program (TANF) in Wisconsin in its first year. Wisconsin longitudinal administrative data were used to examine differential patterns of mid-term (three years) and long-term (six years) employment and earnings success. We developed a conceptual approach to categorizing participants' employment and earnings trajectory groups. Results indicate substantial diversity in employment and earnings patterns. Some women have consistently positive outcomes, others show steady improvements over time, and others have inconsistent patterns that end strong. We found that 46% of the sample fit into one of three successful employment trajectories, and 22% fit into one of three successful earnings trajectories. Results also reveal that many women who were successful in the mid-term were not able to sustain their progress. For example, only 56% of those who were earning successes in the mid-term were still successful in the long-term. Finally, logistic regression models were used to compare the factors associated with mid-term and long-term success and with employment success and earnings success. Implications of findings are discussed.

But this other text is also included:

KEY WORDS: employment and earnings; poverty; social policy; TANF; welfare reform ********** Since the passage of the Personal Responsibility and Work Opportunity Reconciliation Act (PRWORA) of 1996, welfare programs have aimed to move low-income women with children from welfare to work. In the past decade, the number of single-mother families receiving traditional cash assistance (now called Temporary Assistance for Needy Families [TANF]) has fallen dramatically, and employment rates have risen. However, much of the literature on the economic well-being of welfare recipients and welfare leavers suggests that many who have moved from welfare to work move in and out of the labor market frequently, are working for low wages, and have insufficient earnings to support a family above the poverty line without receiving public means-tested benefits. (For a discussion of post-TANF economic status, see Blank, 2006; Cancian & Meyer, 2004; Danziger, Heflin, Corcoran, Oltmans, & Wang, 2002; Grogger & Karoly, 2005; Johnson & Corcoran, 2003.) Most of the early studies on economic well-being of welfare recipients after welfare reform have examined employment, earnings, and income after leaving cash benefits over fairly short periods of time. Less is known about whether the short-term economic success (or lack of success) has persisted in the long-term. In this article, we use longitudinal administrative data to examine the employment and earnings trajectories of welfare recipients over six years for a sample of approximately 17,000 women who entered TANF in Wisconsin in its first year. We propose a method to characterize the six-year patterns of employment and earnings and consider differential patterns of mid-term (three years) and long-term (six years) employment and earnings success. We compare the factors associated with mid-term and long-term success. POLICY CONTEXT This article focuses on TANF participants in Wisconsin. Wisconsin's TANF program, called Wisconsin Works (W-2), was instituted in September 1997. W-2 consists of several 'tiers' and is structured to mirror the world of employment. Thus, individuals do not receive a cash payment unless they are working in a community service job, are engaged in a work-like activity (W-2 Transitions), or have a child younger than 13 weeks old (caretaker of newborn). In addition, individuals who are the most work-ready can receive a variety of services without receiving cash (case management). Individuals are expected to begin in the tier that corresponds to their level of work-readiness and to progress up the tiers until they no longer need any W-2 services.

download_resources.py is missing

Hi!
From the readme:

To publish the corpus:
commit and create a new tagged release
run bin/download_resources.py to download PDFs

But I cannot find download_resources.py in the repository. Is it missing?
Thanks @ernestogimeno

Alternative metadata sources from the agencies + libraries

The agencies (NOAA, NASA, USDA, GCIS, USGS, etc.) are providing other sources for metadata which we can import. Mostly we'll be working with the agency libraries.

We'll be working on a case-by-case basis for these; workflows will probably need to be changed to adapt to the new sources.

Paco leads on these points; we must wait for the CRADA to be signed with NOAA before proceeding.
