coleridge-initiative / rcgraph

Rich Context knowledge graph management

Home Page: https://rc.coleridgeinitiative.org/?radius=3&entity=NOAA

License: Creative Commons Zero v1.0 Universal

Python 3.83% Jupyter Notebook 35.87% HTML 60.31%
rich-context metadata knowledge-graph

rcgraph's Issues

Take action on error logs to improve data quality in KG

See the error logs per workflow stage; these are the errors encountered and logged at each stage.
For each, we must:

  • try to determine the root cause of each API lookup failure
  • for legitimate error cases, update the error handling in richcontext.scholapi
  • for a metadata error upstream, submit a PR to fix it there
  • for a metadata error in API results, submit a PR that adds a manual override

Overall, we need to create or improve unit tests to give better coverage where possible, and also improve our API integration.

Prototype a data impact factor

We'll follow the example from Bundesbank to estimate metrics for datasets.

One of the benefits of using graphs is that metrics such as a data impact factor can be calculated, which might otherwise be prohibitively difficult.

Paco leads on these points; we need a larger graph and must have other entities represented first.

Depends on: #29 #30 #39 #40 #41

analysis: publisher classifier

We need means to analyze the "quality" for the more popular journal article publishers. In other words, we need a classifier based on the publisher (ScienceDirect, PubMed, OUP, etc.) for how likely the entries in publication partitions will "survive" all the way through our workflow to successful PDF parsing.

Methodology:

  • analyze the distribution of publishers among the entries in partitions in bucket_final by resolving doi fields to URLs, then fetching those to determine their DNS domain (see the sketch below)
  • what's the distribution of how many fail to have open access PDFs? Use a title match against errors/misses_final.txt
  • what's the distribution of how many PDFs fail to download? See rclc/errors.txt
  • what's the distribution of how many PDFs fail to be parsed for text extraction?
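
As a first cut at the publisher distribution, here is a minimal sketch; the resolve_doi_domain helper is hypothetical (not existing workflow code), and the DOI list is a placeholder for values pulled from the bucket_final partitions:

```python
from collections import Counter
from urllib.parse import urlparse
import urllib.request

def resolve_doi_domain(doi, timeout=30):
    """Follow the doi.org redirect chain and return the publisher's DNS domain."""
    req = urllib.request.Request(f"https://doi.org/{doi}", method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        # geturl() reports the final URL after all redirects
        return urlparse(resp.geturl()).netloc

# placeholder; real values come from the doi fields of partition entries
dois = ["10.1000/xyz123"]

domains = Counter()
for doi in dois:
    try:
        domains[resolve_doi_domain(doi)] += 1
    except Exception:
        domains["<unresolved>"] += 1

print(domains.most_common(20))
```

Note that some publishers reject HEAD requests, so a GET fallback may be needed in practice.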

Delivery:

  1. results are best visualized and packaged as a Jupyter notebook here in the https://github.com/Coleridge-Initiative/RCGraph repo
  2. later we'll move the analysis into an additional workflow step.

analysis: author cliques

We need a means to analyze co-occurrence rates for authors. In other words, we want to look at the rate at which authors appear together as co-authors on research publications in our KG.

Use the full.jsonld instance of the KG as the basis for analysis.

Since authors "social" (ostensibly, unlike datasets) then it's more interested to have clique analysis instead of co-occurrence probability rates. A good approach would likely be

  1. build a graph in networkx from the JSON-LD (similar to the web app in RCServer)
  2. use clique features
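
A minimal sketch of those two steps, assuming a co-authorship graph built from full.jsonld; the "dct:creator" field name is a guess and should be checked against the actual JSON-LD schema:

```python
import json
from itertools import combinations
import networkx as nx

with open("full.jsonld") as f:
    corpus = json.load(f)

G = nx.Graph()
for node in corpus.get("@graph", []):
    # assumption: authors are listed under "dct:creator" with "@id" identifiers
    authors = [a["@id"] for a in node.get("dct:creator", [])]
    for a, b in combinations(sorted(set(authors)), 2):
        # edge weight counts how many publications the pair shares
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)

# maximal cliques of 3+ authors who have all co-authored with one another
cliques = [c for c in nx.find_cliques(G) if len(c) >= 3]
cliques.sort(key=len, reverse=True)
print(cliques[:10])
```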

Steps:

  1. results are best visualized and packaged as a Jupyter notebook here in the https://github.com/Coleridge-Initiative/RCGraph repo
  2. later we'll move the analysis into an additional workflow step, and show the results in the web app.

ResearchGate is very interested in this feature -- let's aim for late March or sooner.

[Future Work/ Idea] Graph Database to store the data of the RCKG

If I remember correctly, @ernestogimeno mentioned to me that the jsonld corpus format has some limitations that may affect scalability, so at some point we may need to use a real database.

A possible option is a graph database such as Neo4j, which stores information natively as a graph and so might suit our needs.

I just put this thought here for future discussion, in case we need to upgrade from the jsonld format.

Add abstracts entities into the KG

Add the publication abstracts into the KG, based on stage3 results from discovery APIs.

Best to use run_stage3.py as a template, although the main part to reuse is how it iterates through the partitions in BUCKET_STAGE.

@ernestogimeno has also worked with this code and can help guide/advise.

Will need to pull the abstract field from metadata, where it exists. The responses from Semantic Scholar tend to have these -- and we may be able to extend other API calls to get abstracts. @lobodemonte can assist on those extensions.
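
A minimal sketch of pulling an abstract by DOI directly from the public Semantic Scholar Graph API; the production workflow would go through richcontext.scholapi instead, so verify the endpoint and field names before relying on this:

```python
import json
import urllib.request

def fetch_abstract(doi, timeout=30):
    """Return the abstract for a DOI from Semantic Scholar, or None if absent."""
    # assumption: current public Graph API endpoint; confirm against scholapi
    url = f"https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}?fields=abstract"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        meta = json.load(resp)
    return meta.get("abstract")
```

Each partition entry that already carries a doi could be enriched this way before the abstract gets written into the publication's metadata.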

The end goal will be to include abstracts as metadata in the graph -- where available -- and then also run these through a later stage that runs the TextRank algorithm to extract key phrases.
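
For that later key-phrase stage, a minimal sketch using spaCy with the pytextrank pipeline component; the add_pipe("textrank") form assumes pytextrank 3.x (older releases register the component differently):

```python
import spacy
import pytextrank  # registers the "textrank" pipeline factory

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

abstract = "This study identified the employment and earnings trajectories of welfare recipients ..."
doc = nlp(abstract)

# top-ranked key phrases extracted by TextRank
for phrase in doc._.phrases[:10]:
    print(round(phrase.rank, 4), phrase.text)
```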

knowledge graph clean-up

General issues for cleaning up the knowledge graph:

  • workflow steps that call APIs should reuse previous responses when an API call fails (see the sketch after this list)
  • have options in gen_ttl.py to generate the public corpus for the ML competition vs. the full, unfiltered corpus (for use in ADRF, recsys, etc.)
  • filter out publications that lack URLs or open access PDFs at the very latest point possible
  • make sure that publications and other entities going into TTL/JSON-LD are unique
  • pull the ISSN identifiers per publication
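
For the first bullet, a minimal sketch of the reuse pattern; the cache path and lookup_fn are hypothetical stand-ins, not the actual workflow code:

```python
import json
from pathlib import Path

CACHE_PATH = Path("cache/api_responses.json")  # hypothetical location

def cached_lookup(key, lookup_fn):
    """Call the API, but fall back to the last good response if the call fails."""
    cache = json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}
    try:
        response = lookup_fn(key)
        cache[key] = response
        CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)
        CACHE_PATH.write_text(json.dumps(cache, indent=2))
        return response
    except Exception:
        # reuse the previous response rather than dropping the record
        return cache.get(key)
```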

analytics: dataset co-occurrence

We need a means to analyze co-occurrence rates for datasets. In other words, we want to look at the rate at which datasets are used together, based on links from publications in our KG.

Use the full.jsonld instance of the KG as the basis for analysis.
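
A minimal sketch of counting dataset pairs per publication from full.jsonld; the "cito:citesAsDataSource" field name is a guess and should be confirmed against the actual JSON-LD schema:

```python
import json
from collections import Counter
from itertools import combinations

with open("full.jsonld") as f:
    corpus = json.load(f)

pair_counts = Counter()
for node in corpus.get("@graph", []):
    # assumption: dataset links live under "cito:citesAsDataSource"
    datasets = sorted(d["@id"] for d in node.get("cito:citesAsDataSource", []))
    pair_counts.update(combinations(datasets, 2))

# most frequently co-used dataset pairs
for (d1, d2), count in pair_counts.most_common(10):
    print(count, d1, d2)
```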

Steps:

  1. results are best packaged as a Jupyter notebook here in the https://github.com/Coleridge-Initiative/RCGraph repo
  2. later we'll move the analysis into an additional workflow step, and show the results in the web app.

USDA would love to have this before the mid-March demo.

[Question] Is richcontext.graph.RCGraph used once the workflow is finished?

Hi!
I've seen that RCGraph is used to create the jsonld corpus; for example, it's used in run_step2.py.

But it seems that it's not used in corpus.ipynb, so I wonder whether the RCGraph class can also be used to explore the data contained in the KG. On the other hand, in corpus.ipynb a networkx graph object is created to visualize the graph:

import networkx as nx

nxg = nx.Graph()

Maybe RCGraph is not designed to store data, and that's why corpus.ipynb uses the bare jsonld file?

Thanks @ernestogimeno @JasonZhangzy1757

clean partitions incorrectly added with verified not-links

Review the following partitions

  • 20200311_NSF_SED_and_SDR_part8_publications
  • 20200302_federated2_USDA_AgriculturalResourceManagementSurvey_part28_publications
  • 20200302_USDA_ARMS_website_part1_publications

Replace the data drops in RichContextMetadata and RCPublications, rerun the KG workflow in RCGraph, and update the RCKG jsonld file saved as the latest version.

analysis: scientific paper section classifier

We need to train a text-based classifier model to identify the sections of a parsed PDF for a research paper. Dataset linking ML models typically depend on section as a feature.

The source data is in the S3 bucket; see corpus_docs/pub/txt/*.txt

Some good prior work exists, though not necessarily the best starting points for this project.

There's plenty of "research" published in this area, although be careful: most of these published works are horribly out of date, represent disproven practices, and should mostly be ignored.

While there's been much research in this area using dependency parsing in general, plus the more recent transformer approaches, we'll start with weak supervision instead.

Start with Snorkel (https://www.snorkel.org/use-cases/) to build a set of labeling functions for preparing the training data for the classifier. That way we'll be able to update and rebuild training sets dynamically as our corpus changes.
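
A minimal sketch of the labeling-function approach, assuming the parsed paper text has already been split into candidate sections with a heading and body text; the label set and heuristics here are illustrative placeholders, not a final scheme:

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier

# illustrative label scheme; the real one should cover all section types
ABSTAIN, METHODS, RESULTS = -1, 0, 1

@labeling_function()
def lf_methods_heading(x):
    # headings such as "Methods" or "Data and Methods"
    return METHODS if "method" in x.heading.lower() else ABSTAIN

@labeling_function()
def lf_results_heading(x):
    return RESULTS if "result" in x.heading.lower() else ABSTAIN

# df_sections would be built from corpus_docs/pub/txt/*.txt; rows here are dummies
df_sections = pd.DataFrame([
    {"heading": "Data and Methods", "text": "..."},
    {"heading": "Results", "text": "..."},
])

applier = PandasLFApplier([lf_methods_heading, lf_results_heading])
L_train = applier.apply(df=df_sections)
print(L_train)
```

The resulting label matrix would then feed Snorkel's label model to produce probabilistic training labels for the classifier.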

Incorporate results from the ML competition

Begin to use models from our ML competition:

  • develop the required training sets
  • train models
  • evaluate models on new data
  • build ensembles (where possible)

Also, iterate to experiment with our own models and publish those:

  1. to help boost the competition
  2. to abstract learnings from the competition into our RCGraph workflow

Create a library for “imperfect” string comparisons

We need a library to calibrate different methods for "imperfect" text matching. The idea is to be able to tune parameters separately for different contexts (e.g. dataset titles, publication titles, author names, etc.).
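
A minimal sketch of the calibration idea using only the standard library; the thresholds are illustrative placeholders to be tuned per context against labeled match/non-match pairs, not measured values:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized "imperfect" similarity between two strings, in [0, 1]."""
    return SequenceMatcher(None, a.casefold().strip(), b.casefold().strip()).ratio()

# per-context thresholds; placeholder values to be calibrated
THRESHOLDS = {
    "dataset_title": 0.90,
    "publication_title": 0.95,
    "author_name": 0.85,
}

def is_match(a: str, b: str, context: str) -> bool:
    return similarity(a, b) >= THRESHOLDS[context]

print(is_match("Survey of Doctorate Recipients", "survey of doctorate recipients ", "dataset_title"))
```

The same interface could later be backed by stronger matchers (e.g. token-based or embedding-based) while keeping the per-context calibration.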

Abstract of publication-2b2480acb1b98d322868 includes extra text that is not part of the abstract

Publication title Standing Still or Moving Up? Evidence from Wisconsin on the Long-Term Employment and Earnings of TANF Participants

Publication UUID (2020-06-02) publication-2b2480acb1b98d322868

The actual abstract is:

Abstract
This study identified the employment and earnings trajectories of welfare recipients over six years for a sample of 14,150 women who entered the Temporary Assistance for Needy Families program (TANF) in Wisconsin in its first year. Wisconsin longitudinal administrative data were used to examine differential patterns of mid-term (three years) and long-term (six years) employment and earnings success. We developed a conceptual approach to categorizing participants' employment and earnings trajectory groups. Results indicate substantial diversity in employment and earnings patterns. Some women have consistently positive outcomes, others show steady improvements over time, and others have inconsistent patterns that end strong. We found that 46% of the sample fit into one of three successful employment trajectories, and 22% fit into one of three successful earnings trajectories. Results also reveal that many women who were successful in the mid-term were not able to sustain their progress. For example, only 56% of those who were earning successes in the mid-term were still successful in the long-term. Finally, logistic regression models were used to compare the factors associated with mid-term and long-term success and with employment success and earnings success. Implications of findings are discussed.

But this other text is also included:

KEY WORDS: employment and earnings; poverty; social policy; TANF; welfare reform ********** Since the passage of the Personal Responsibility and Work Opportunity Reconciliation Act (PRWORA) of 1996, welfare programs have aimed to move low-income women with children from welfare to work. In the past decade, the number of single-mother families receiving traditional cash assistance (now called Temporary Assistance for Needy Families [TANF]) has fallen dramatically, and employment rates have risen. However, much of the literature on the economic well-being of welfare recipients and welfare leavers suggests that many who have moved from welfare to work move in and out of the labor market frequently, are working for low wages, and have insufficient earnings to support a family above the poverty line without receiving public means-tested benefits. (For a discussion of post-TANF economic status, see Blank, 2006; Cancian & Meyer, 2004; Danziger, Heflin, Corcoran, Oltmans, & Wang, 2002; Grogger & Karoly, 2005; Johnson & Corcoran, 2003.) Most of the early studies on economic well-being of welfare recipients after welfare reform have examined employment, earnings, and income after leaving cash benefits over fairly short periods of time. Less is known about whether the short-term economic success (or lack of success) has persisted in the long-term. In this article, we use longitudinal administrative data to examine the employment and earnings trajectories of welfare recipients over six years for a sample of approximately 17,000 women who entered TANF in Wisconsin in its first year. We propose a method to characterize the six-year patterns of employment and earnings and consider differential patterns of mid-term (three years) and long-term (six years) employment and earnings success. We compare the factors associated with mid-term and long-term success. POLICY CONTEXT This article focuses on TANF participants in Wisconsin. Wisconsin's TANF program, called Wisconsin Works (W-2), was instituted in September 1997. W-2 consists of several 'tiers' and is structured to mirror the world of employment. Thus, individuals do not receive a cash payment unless they are working in a community service job, are engaged in a work-like activity (W-2 Transitions), or have a child younger than 13 weeks old (caretaker of newborn). In addition, individuals who are the most work-ready can receive a variety of services without receiving cash (case management). Individuals are expected to begin in the tier that corresponds to their level of work-readiness and to progress up the tiers until they no longer need any W-2 services.

download_resources.py is missing

Hi!
From the readme:

To publish the corpus:
commit and create a new tagged release
run bin/download_resources.py to download PDFs

But I cannot find download_resources.py in the repository. Is it missing?
Thanks @ernestogimeno

Alternative metadata sources from the agencies + libraries

The agencies (NOAA, NASA, USDA, GCIS, USGS, etc.) are providing other sources for metadata which we can import. Mostly we'll be working with the agency libraries.

We'll be working on a case-by-case basis for these; workflows will probably need to be changed to adapt to the new sources.

Paco leads on these points; we must wait for the CRADA to be signed with NOAA before proceeding.
