Comments (5)
After looking at the failures for Step 2, they seem to come from:
- Title differs in special characters (quotes, hyphens, etc.)
- Title with an encoding issue or a cut-off special character
- API source not used (e.g. RePEc, SSRN)
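One way to reduce mismatches from the first two causes is to normalize both titles before comparing them. This is only a minimal sketch, not what the rcgraph workflow does: the `normalize_title` helper and the `PUNCT_MAP` table are hypothetical names, and the set of characters mapped here is an assumption about which quotes and hyphens show up in the data.

```python
import unicodedata

# Hypothetical mapping of typographic quotes and dashes to ASCII equivalents.
PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2010": "-", "\u2012": "-",   # hyphen, figure dash
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
})

def normalize_title(title: str) -> str:
    """Return a comparison key: NFKC-normalized, ASCII punctuation,
    case-folded, with runs of whitespace collapsed to single spaces."""
    text = unicodedata.normalize("NFKC", title).translate(PUNCT_MAP)
    return " ".join(text.casefold().split())

print(normalize_title("Health \u2013 a \u201cNew\u201d  Look"))
```

Two titles would then be treated as equal when their normalized keys are equal. This does not help with titles that were cut off at a special character.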
from rcgraph.
I reran the failed titles from step_2 (763 total) with other APIs and found that by adding 3 more APIs we could decrease the number of failed lookups by about 25%. That still leaves a large chunk of titles with possible typing or encoding-related errors.
Results by API combination (found/total):
- CrossRef: 33/763
- CrossRef + PubMed: 38/763
- CrossRef + DataCite: 109/763
- Core: 95/763
- CrossRef + DataCite + Core: 195/763
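As a quick check of the arithmetic, the counts above can be turned into recovery rates. The dictionary below only restates the numbers from this comment; the `recovery_rates` helper is a hypothetical name used for illustration.

```python
TOTAL_FAILED = 763  # titles that failed in step_2

# Titles found per API combination, copied from the counts above.
found = {
    "CrossRef": 33,
    "CrossRef + PubMed": 38,
    "CrossRef + DataCite": 109,
    "Core": 95,
    "CrossRef + DataCite + Core": 195,
}

def recovery_rates(found_counts, total):
    """Percentage of previously failed titles recovered per combination."""
    return {name: round(100 * n / total, 1) for name, n in found_counts.items()}

rates = recovery_rates(found, TOTAL_FAILED)
for name, pct in rates.items():
    print(f"{name}: {pct}%")

# Titles still unresolved after the best combination.
still_failed = TOTAL_FAILED - found["CrossRef + DataCite + Core"]
print("still failed:", still_failed)
```

The best combination recovers roughly a quarter of the failed titles, which leaves 568 of the 763 unresolved.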
Thank you @lobodemonte.
Sounds like improved title matching (accepting close matches rather than only exact ones) may help here.
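A minimal sketch of what less exact matching could look like, using the standard library's `difflib.SequenceMatcher`. The function names and the 0.9 threshold are assumptions for illustration, not the project's actual implementation, and the threshold would need tuning against real titles.

```python
from difflib import SequenceMatcher

def title_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two titles, ignoring case."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()

def is_probable_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """True when the titles are similar enough to count as the same work.
    The default threshold is an assumed starting point, not a tuned value."""
    return title_similarity(a, b) >= threshold

ours = "Income inequality and health: a review"
theirs = "Income inequality and health - a review"
print(is_probable_match(ours, theirs))
```

A title returned by an API would be accepted when `is_probable_match` is true. A lower threshold catches more typing errors but also raises the risk of accepting a different paper with a similar title.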
Nice work! That's super-helpful to guide how we leverage the APIs in the workflow.
For the typing/encoding-related errors, is it mostly the case that the titles we have are incorrect?
Hi @ceteri,
Yes, most of the rest seem to have some minor difference between the actual title and what we have in our data partitions. There's also a small portion of titles that won't have a DOI (e.g. RePEc publications).