Comments (4)
from cosmos.
Also found on this page, along with figures apparently grouped with the wrong caption.
This isn't a bug in retrieval or the API. It's a user error in ingesting the data. The bottom line is that I accidentally put a small set of PDFs into the queue twice. We don't do duplicate checking on ingest, so I ended up doubling some of the results. Part of the issue is that I was doing separate runs on small piles of documents instead of one big run on the full pile (as the committed code would do it).
The foreign keys don't cascade at the moment, and making that change + yanking out the documents is too big a change for the live set right now.
We could also add pdf name (or pdf checksum) checks to make sure we don't ingest the same document more than once.
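A checksum-based check like the one suggested above could look roughly like the following. This is only a sketch, not the project's ingest code: the function names `pdf_checksum` and `filter_new_pdfs` and the in-memory `seen_checksums` set are hypothetical, and a real pipeline would presumably need to persist the checksums next to the ingested documents so the check also works across separate runs.

```python
import hashlib
from pathlib import Path


def pdf_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large PDFs are never fully loaded in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def filter_new_pdfs(paths, seen_checksums):
    """Return the paths whose content has not been seen yet.

    `seen_checksums` is a set that is updated in place (hypothetical stand-in
    for whatever store the ingest process would actually use).
    """
    fresh = []
    for path in paths:
        checksum = pdf_checksum(path)
        if checksum in seen_checksums:
            # Same bytes already queued or ingested, even under another name.
            continue
        seen_checksums.add(checksum)
        fresh.append(Path(path))
    return fresh
```

Comparing content hashes rather than file names catches the same PDF queued twice under different names, which a name-only check would miss.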
@iross the DOI/GDD id should definitely be carried along all the way through every single thing that any of our applications do!
I know that this is being done, but just a reminder!
Related Issues (20)
- Leverage in-text table mentions to expand context recall HOT 4
- ASKE-ID as a parameter on ES retrieval API HOT 1
- Add SPECTER model functionality
- context enrichment should gracefully handle tableless documents
- replace os.path with pathlib.Path HOT 2
- The parquet files not found in output directory HOT 3
- Document naming question HOT 2
- Worker question HOT 2
- Elasticsearch virtual memory error HOT 4
- PIL Image size limits HOT 2
- NVIDIA driver HOT 4
- SSL certificate expired
- Wrong exception catch when extracted tables don't exist HOT 1
- Allow one, both, or neither extractions HOT 1
- Recognize, capture, and correct underlying PDF parsing issues HOT 1
- Parquet Files not found in Output HOT 1
- XGBoost Error HOT 3
- Timeout after 10 seconds? HOT 1
- Column text mixing in output
- Expecting `.png` but receiving `.jpg` HOT 2