coleridge-initiative / rcgraph
Rich Context knowledge graph management
Home Page: https://rc.coleridgeinitiative.org/?radius=3&entity=NOAA
License: Creative Commons Zero v1.0 Universal
See the error logs per workflow stage in:
These are errors encountered and logged at each stage.
For each, we must evaluate:
richcontext.scholapi
Overall, we need to create or improve unit tests to give better coverage where possible, and also improve our API integration.
Can (or should) we decouple our internal data dependencies from Git using lazydata?
We'll follow the example from Bundesbank to estimate metrics for datasets.
One of the benefits of using graphs is that metrics such as a data impact factor can be calculated, which might be prohibitively difficult otherwise.
Paco leads on these points; we need a larger graph and must have other entities represented first.
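As a sketch of the kind of metric a graph makes cheap, here is a hypothetical "impact" score for datasets via PageRank over publication-to-dataset links. This assumes networkx (already used elsewhere in our stack); the field names and scoring choice are illustrative assumptions, not the agreed methodology.

```python
import networkx as nx

def dataset_impact(links):
    """links: iterable of (publication_id, dataset_id) pairs.
    Score each dataset by PageRank over the directed linkage graph,
    as a crude stand-in for a 'data impact factor'."""
    g = nx.DiGraph()
    g.add_edges_from(links)  # publication -> dataset edges
    ranks = nx.pagerank(g)
    # keep only nodes that receive links, i.e. the datasets
    return {n: s for n, s in ranks.items() if g.in_degree(n) > 0}
```

A dataset linked from many publications will outrank one linked from few, which is the intuition behind an impact factor.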
We need means to analyze the "quality" for the more popular journal article publishers. In other words, we need a classifier based on the publisher (ScienceDirect, PubMed, OUP, etc.) for how likely the entries in publication partitions will "survive" all the way through our workflow to successful PDF parsing.
Methodology: map doi fields to URLs, then fetch those URLs to determine their DNS domain.
Delivery:
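The domain-mapping step of the methodology above can be sketched with the standard library. The helper names are hypothetical; in practice the doi.org URL would be fetched (following redirects) to reach the publisher's landing page, whose domain feeds the classifier.

```python
from urllib.parse import urlparse

def doi_to_url(doi: str) -> str:
    """Build the canonical resolver URL for a DOI."""
    return f"https://doi.org/{doi}"

def publisher_domain(url: str) -> str:
    """Extract the DNS domain from a landing-page URL, dropping a leading 'www.'."""
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc
```

For example, resolving 10.1016/j.childyouth.2018.12.009 lands on sciencedirect.com, which becomes the publisher feature for that partition entry.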
We need means to analyze the co-occurrence rates for authors. In other words, looking at the rate of authors being co-authors together on research publications in our KG.
Use the full.jsonld instance of the KG as the basis for analysis.
Since authors are "social" (ostensibly, unlike datasets), it's more interesting to run clique analysis than simple co-occurrence probability rates. A good approach would likely be to build a networkx graph from the JSON-LD (similar to the web app in RCServer).
Steps:
ResearchGate is very interested in this feature -- let's aim for late March or sooner.
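The clique analysis above could start from a sketch like this, assuming each publication record carries a list of author identifiers (the "authors" field name is a guess; adjust to the actual JSON-LD schema).

```python
from itertools import combinations
import networkx as nx

def coauthor_graph(publications):
    """Build an undirected co-authorship graph: one node per author,
    one weighted edge per pair appearing together on a publication."""
    g = nx.Graph()
    for pub in publications:
        authors = sorted(set(pub.get("authors", [])))
        for a, b in combinations(authors, 2):
            if g.has_edge(a, b):
                g[a][b]["weight"] += 1
            else:
                g.add_edge(a, b, weight=1)
    return g
```

With the graph built, nx.find_cliques() enumerates the maximal cliques, which is the analysis this issue calls for.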
Dear Ernesto,
Thanks for your question about our API rate limit.
The current rate limit is 100 requests per 5-minute window per IP address. Our API will return 500s if overloaded; if you begin to see these, we request that you slow down.
Please don't hesitate to let me know if you have any more questions about our API.
Sincerely,
Linda Wagner, Semantic Scholar Support
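To stay inside the 100-requests-per-5-minute budget described above, a client-side sliding-window limiter is one option. This is a minimal sketch, not part of richcontext.scholapi; the parameter values mirror the quoted limits but are configurable.

```python
import time
from collections import deque

class RateLimiter:
    """Sliding-window limiter: block until a request slot is free."""

    def __init__(self, max_requests=100, window_seconds=300):
        self.max_requests = max_requests
        self.window = window_seconds
        self.sent = deque()  # monotonic timestamps of recent requests

    def wait(self):
        now = time.monotonic()
        # drop timestamps that have aged out of the window
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) >= self.max_requests:
            # sleep until the oldest request expires from the window
            time.sleep(max(0.0, self.window - (now - self.sent[0])))
        self.sent.append(time.monotonic())
```

Calling limiter.wait() before each API request keeps us under the quota, so we should never trigger the overload 500s in the first place.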
Integrate use of the Open Academic Graph to reconcile metadata about authors.
See https://www.openacademic.ai/oag/
This is a public version of the Microsoft Academic Graph used by teams in our ML competition.
If I remember correctly, @ernestogimeno mentioned to me that the JSON-LD corpus format has some limitations that may affect scalability, so at some point we may need to use a real database.
A possible option is to use a graph database such as Neo4j. In this database, the information is stored in graph format so it might be suitable for our needs.
I just put this thought here for future discussion if we need to upgrade the jsonld format.
Other entities which still must become integrated into the KG:
Paco leads on these points for RCGraph
"The Dimensions Analytics API is limited to 30 requests per IP address per minute." https://docs.dimensions.ai/dsl/api.html
Our richcontext.scholapi package uses a Dimensions library to handle the connections, but it still sleeps for 30 seconds almost every time a request is made once we reach the limit, slowing down the process considerably.
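Rather than bursting to the limit and sleeping 30 seconds, one alternative (a sketch, not what the Dimensions library does) is to pace requests evenly at 60s/30 = 2s apart, so the per-minute cap is never hit.

```python
import time

class Pacer:
    """Space requests evenly so the per-minute limit is never tripped."""

    def __init__(self, per_minute=30):
        self.interval = 60.0 / per_minute  # ~2s for Dimensions' 30/min
        self.last = 0.0

    def wait(self):
        now = time.monotonic()
        elapsed = now - self.last
        if elapsed < self.interval:
            time.sleep(self.interval - elapsed)
        self.last = time.monotonic()
```

The total throughput is the same as the documented limit, but the workflow no longer stalls in 30-second chunks.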
Add the publication abstracts into the KG, based on stage3 results from discovery APIs.
Best to use run_stage3.py as a template, although the main part to reuse is how it iterates through partitions in the BUCKET_STAGE.
@ernestogimeno has also worked with this code and can help guide/advise.
Will need to pull the abstract field from metadata, where it exists. The responses from Semantic Scholar tend to have these -- and we may be able to extend other API calls to get abstracts. @lobodemonte can assist on those extensions.
The end goal will be to include abstracts as metadata in the graph -- where available -- and then also run these through a later stage that runs the TextRank algorithm to extract key phrases.
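For the key-phrase stage, pytextrank on top of spaCy is the likely real choice; as a self-contained illustration of what TextRank does, here is a bare-bones sketch that ranks words by PageRank over a co-occurrence graph (no stemming, stopwords, or phrase merging -- all assumptions for illustration only).

```python
import networkx as nx

def textrank_keywords(text, window=2, top_k=5):
    """Rank words by PageRank over a word co-occurrence graph."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    words = [w for w in words if len(w) > 3]  # crude short-word filter
    g = nx.Graph()
    for i, w in enumerate(words):
        # connect each word to its neighbors within the window
        for j in range(i + 1, min(i + 1 + window, len(words))):
            if w != words[j]:
                g.add_edge(w, words[j])
    ranks = nx.pagerank(g)
    return sorted(ranks, key=ranks.get, reverse=True)[:top_k]
```

Run against an abstract, the top-ranked words approximate its key phrases, which is what we'd attach to the publication node in the KG.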
General issues for cleaning up the knowledge graph:
- gen_ttl.py to generate the public corpus for the ML competition vs. the full corpus which isn't filtered (to use in ADRF, recsys, etc.)
- journals.json based on https://data.globalchange.gov/journal

We need means to analyze the co-occurrence rates for datasets. In other words, looking at the rate of datasets being used together based on links from publications in our KG.
Use the full.jsonld instance of the KG as the basis for analysis.
Steps:
USDA would love to have this before the mid-March demo.
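The dataset co-occurrence rates described above reduce to pair counting over the publication-to-dataset links; a stdlib-only sketch, assuming each publication record carries a "datasets" list (field name is a guess against the actual JSON-LD schema):

```python
from itertools import combinations
from collections import Counter

def dataset_cooccurrence(publications):
    """Rate at which each pair of datasets is linked from the same
    publication: pair count / total number of publications."""
    pair_counts = Counter()
    for pub in publications:
        datasets = sorted(set(pub.get("datasets", [])))
        pair_counts.update(combinations(datasets, 2))
    n = len(publications)
    return {pair: c / n for pair, c in pair_counts.items()}
```

The resulting rates can be reported directly, or thresholded to surface the dataset pairs most often used together.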
There are a few issues with this publication's authors: Cheung, Karen, and not Cheung, Kwok-Fai.
Publication UUID: publication-74219ad6ba9e918165d1
DOI: 10.1016/j.childyouth.2018.12.009
URL: https://www.sciencedirect.com/science/article/pii/S0190740918308430?via%3Dihub
Evaluate GESIS paper about dataset linking.
The natural language and machine learning technology referenced in this paper are relatively dated by now (they don't reflect the major, industry-wide advances since 2018)
Hi!
I've seen that RCGraph is used to create the JSON-LD corpus. For example, it's used in run_step2.py, but it seems it's not used in corpus.ipynb.
So I wonder whether the RCGraph class can be used to explore the data contained in the KG. On the other hand, in corpus.ipynb a graph object is created to visualize the graph:
nxg = nx.Graph()
Maybe RCGraph is not designed to store data, and that's why corpus.ipynb uses the bare JSON-LD file?
Thanks @ernestogimeno @JasonZhangzy1757
Review the following partitions
Replace the data drops in RichContextMetadata and RCPublications, rerun the KG workflow in RCGraph, and update the RCKG JSON-LD file saved as the latest version.
We need to train a text-based classifier model to identify the sections of a parsed PDF for a research paper. Dataset linking ML models typically depend on section as a feature.
The source data is in the S3 bucket; see corpus_docs/pub/txt/*.txt
Some good prior work -- though not necessarily best as starting points for this project:
There's plenty of "research" published in this area, although be careful: most of these published works are horridly out of date, represent disproven practices, and should be mostly ignored.
While there's been much research into this area using dependency parsing in general plus the many transformer approaches which are more recent, we'll start with weak supervision instead.
Start with use of Snorkel https://www.snorkel.org/use-cases/ to build a set of labeling functions for preparing the training data for the classifier. That way we'll be able to update and rebuild training sets more dynamically, as our corpus changes.
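To show the shape of the labeling functions we'd write, here are two plain-Python heuristics for section labeling. In Snorkel each would be wrapped with the @labeling_function() decorator and applied with an LF applier; the label constants, headings, and thresholds below are illustrative guesses, not tuned values.

```python
import re

# hypothetical label space for the section classifier
ABSTAIN, METHODS, REFERENCES = -1, 0, 1

def lf_methods_heading(text):
    """Vote METHODS if the chunk opens with a methods-like heading."""
    head = text.strip().lower()[:40]
    return METHODS if head.startswith(("method", "materials", "data and method")) else ABSTAIN

def lf_references_density(text):
    """Vote REFERENCES if the chunk is dense with (year) citations."""
    hits = len(re.findall(r"\(\d{4}\)", text))
    return REFERENCES if hits >= 3 else ABSTAIN
```

Because the labels come from functions rather than hand annotation, the training set can be regenerated whenever the corpus changes, which is the point of weak supervision here.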
Prototype a reporting web site for USDA
Deadlines:
Begin to use models from our ML competition:
Also, iterate to experiment with our own models and publish those in the RCGraph workflow.
Link other required entities into the KG: keywords
depends on #29
Paco leads on these points for RCGraph
Link other required entities into the KG: affiliations
depends on #29
Paco leads on these points for RCGraph
Would there be any benefits to integrating with or consuming from OAI-PMH http://www.openarchives.org/pmh/ for programmatic metadata harvesting / metadata exchange?
Explore how to leverage CRediT for credentialing authors, with integration into ORCID
We need a library to calibrate different methods for "imperfect" text matching. The idea is to be able to calibrate parameters for different contexts (e.g. dataset titles, publication titles, author names, etc.).
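A minimal sketch of per-context calibration using stdlib difflib; the threshold values are placeholders to be tuned per context, and a real implementation would likely compare several similarity methods, not just one.

```python
from difflib import SequenceMatcher

def match_score(a: str, b: str) -> float:
    """Normalized similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

# per-context thresholds to calibrate (illustrative values, not tuned)
THRESHOLDS = {
    "dataset_title": 0.90,
    "publication_title": 0.95,
    "author_name": 0.85,
}

def is_match(a: str, b: str, context: str) -> bool:
    """Decide a match using the threshold calibrated for this context."""
    return match_score(a, b) >= THRESHOLDS[context]
```

Keeping the thresholds in one table makes them easy to refit as we gather labeled match/non-match pairs for each context.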
Publication title: Standing Still or Moving Up? Evidence from Wisconsin on the Long-Term Employment and Earnings of TANF Participants
Publication UUID (2020-06-02): publication-2b2480acb1b98d322868
The actual abstract is:
Abstract
This study identified the employment and earnings trajectories of welfare recipients over six years for a sample of 14,150 women who entered the Temporary Assistance for Needy Families program (TANF) in Wisconsin in its first year. Wisconsin longitudinal administrative data were used to examine differential patterns of mid-term (three years) and long-term (six years) employment and earnings success. We developed a conceptual approach to categorizing participants' employment and earnings trajectory groups. Results indicate substantial diversity in employment and earnings patterns. Some women have consistently positive outcomes, others show steady improvements over time, and others have inconsistent patterns that end strong. We found that 46% of the sample fit into one of three successful employment trajectories, and 22% fit into one of three successful earnings trajectories. Results also reveal that many women who were successful in the mid-term were not able to sustain their progress. For example, only 56% of those who were earning successes in the mid-term were still successful in the long-term. Finally, logistic regression models were used to compare the factors associated with mid-term and long-term success and with employment success and earnings success. Implications of findings are discussed.
But this other text is also included:
KEY WORDS: employment and earnings; poverty; social policy; TANF; welfare reform ********** Since the passage of the Personal Responsibility and Work Opportunity Reconciliation Act (PRWORA) of 1996, welfare programs have aimed to move low-income women with children from welfare to work. In the past decade, the number of single-mother families receiving traditional cash assistance (now called Temporary Assistance for Needy Families [TANF]) has fallen dramatically, and employment rates have risen. However, much of the literature on the economic well-being of welfare recipients and welfare leavers suggests that many who have moved from welfare to work move in and out of the labor market frequently, are working for low wages, and have insufficient earnings to support a family above the poverty line without receiving public means-tested benefits. (For a discussion of post-TANF economic status, see Blank, 2006; Cancian & Meyer, 2004; Danziger, Heflin, Corcoran, Oltmans, & Wang, 2002; Grogger & Karoly, 2005; Johnson & Corcoran, 2003.) Most of the early studies on economic well-being of welfare recipients after welfare reform have examined employment, earnings, and income after leaving cash benefits over fairly short periods of time. Less is known about whether the short-term economic success (or lack of success) has persisted in the long-term. In this article, we use longitudinal administrative data to examine the employment and earnings trajectories of welfare recipients over six years for a sample of approximately 17,000 women who entered TANF in Wisconsin in its first year. We propose a method to characterize the six-year patterns of employment and earnings and consider differential patterns of mid-term (three years) and long-term (six years) employment and earnings success. We compare the factors associated with mid-term and long-term success. POLICY CONTEXT This article focuses on TANF participants in Wisconsin. 
Wisconsin's TANF program, called Wisconsin Works (W-2), was instituted in September 1997. W-2 consists of several 'tiers' and is structured to mirror the world of employment. Thus, individuals do not receive a cash payment unless they are working in a community service job, are engaged in a work-like activity (W-2 Transitions), or have a child younger than 13 weeks old (caretaker of newborn). In addition, individuals who are the most work-ready can receive a variety of services without receiving cash (case management). Individuals are expected to begin in the tier that corresponds to their level of work-readiness and to progress up the tiers until they no longer need any W-2 services.
Hi!
From the readme:
To publish the corpus:
commit and create a new tagged release
run bin/download_resources.py to download PDFs
But I cannot find download_resources.py in the repository. Is it missing?
Thanks @ernestogimeno
I also needed to do this:
python3 -m spacy download en_core_web_sm
The agencies (NOAA, NASA, USDA, GCIS, USGS, etc.) are providing other sources for metadata which we can import. Mostly we'll be working with the agency libraries.
We'll be working on a case by case basis for these; workflows will probably need to be changed to adapt to these new sources.
Paco leads on these points; we must wait for the CRADA to be signed with NOAA before proceeding.