owid / data-api
API for accessing data from our data catalog.
License: MIT License
Random ideas collected at our tech tea
DuckDB doesn't support reading and writing the same database from separate processes at the same time, and they don't have it on their roadmap. We have a single writer (the crawler) and multiple readers (the API opens a connection from each thread).
We could write our own custom locking mechanism, but that could be hard to manage. Perhaps copying the DB, updating the copy and then replacing the original with it could be easier? (The DB should be much smaller if we only keep metadata in it.)
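The copy, update and replace idea could look roughly like the sketch below. It uses only the standard library; `update_via_copy` and the `update` callback are hypothetical names, and it assumes the writer closes its DuckDB connection inside `update` so that nothing is left outside the main database file. Readers that already have the old file open would still need to reconnect to see the new one.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def update_via_copy(db_path: Path, update: Callable[[Path], None]) -> None:
    """Copy the DB, let `update` modify the copy, then swap it in.

    `update` is a hypothetical callback (e.g. the crawler) that opens the
    copy, writes to it and closes it again before returning.
    """
    # Create the temp file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=db_path.parent, suffix=".tmp")
    os.close(fd)
    tmp_path = Path(tmp_name)
    try:
        shutil.copy2(db_path, tmp_path)  # work on a private copy
        update(tmp_path)                 # writer only touches the copy
        os.replace(tmp_path, db_path)    # swap the updated copy in
    except BaseException:
        # Leave the original untouched and clean up the copy on failure.
        tmp_path.unlink(missing_ok=True)
        raise
```

On POSIX the final rename is atomic, so a reader opening the path sees either the old or the new file. On Windows the replace can fail while a reader holds the file open, so this would need testing there.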
You could always just make an integer version number and work with duckdb-<version>.db and then increment it each time the format changes. Then other users know they need to update.
A simple VERSION constant in crawl.py and in data-api could be enough to make this useful.
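A minimal sketch of that versioning scheme, assuming both crawl.py and data-api import the same constant. The value of `VERSION`, the helper names and the `data_dir` argument are illustrative, not taken from the repository.

```python
from pathlib import Path

# Hypothetical value; bump it whenever the DB format changes.
VERSION = 1


def db_path(data_dir: Path, version: int = VERSION) -> Path:
    """Build the versioned file name, e.g. duckdb-1.db."""
    return data_dir / f"duckdb-{version}.db"


def require_current_db(data_dir: Path) -> Path:
    """Return the path for the current VERSION, or fail with a clear message."""
    path = db_path(data_dir)
    if not path.exists():
        raise FileNotFoundError(
            f"{path.name} not found in {data_dir}; the DB format may have "
            "changed, re-run the crawler to rebuild it"
        )
    return path
```

With this, a reader that still has an old duckdb-<version>.db file fails fast instead of querying a database in an outdated format.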
The Int64 type is stored in DuckDB as NULLABLE BIGINT. Calling fetch_df on it converts it to float64 (because it assumes there could be NaNs), which could be confusing for users (weekly covid cases are one such case).
We could either keep it in arrow format, never call fetch_df and let users read it as a feather file, or convert float64 back to Int64 before returning it to the user.
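The second option (converting back before returning) might be sketched as follows in pandas. `restore_int64` is a hypothetical helper, and it assumes the caller already knows which columns were BIGINT in DuckDB, for example from the table schema.

```python
from typing import Iterable

import pandas as pd


def restore_int64(df: pd.DataFrame, columns: Iterable[str]) -> pd.DataFrame:
    """Cast float64 columns back to pandas' nullable Int64 dtype.

    NaN becomes <NA>; the cast raises if a column holds non-integer values.
    """
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype("Int64")
    return out


# Example: what a nullable BIGINT column looks like after fetch_df.
df = pd.DataFrame({"weekly_cases": [10.0, float("nan"), 25.0]})
restored = restore_int64(df, ["weekly_cases"])
```

One caveat: float64 cannot represent every integer above 2^53, so very large values may already have been rounded by the time they are cast back. Keeping the data in arrow format avoids that round trip entirely.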
Right now we replicate all data into a local DuckDB. That has some advantages, like query performance and no traffic between S3 and our server. On the other hand, a few datasets like faostat or SDG are huge and replicating them takes a long time (though we might have to do it only once). I haven't synced all datasets yet, but I assume that our entire database would be ~10GB (there's a lot of room for optimisation though!), which isn't small.
A lot could be optimised, though I'm wondering from a philosophical perspective if we should invest time in it or consider fetching them directly from S3. Perhaps fetching them from S3/R2 is the future and we should go that way?
Might be worth checking how fast and feasible this is.
Let anyone commit and make deploy to see it live quickly, so that it can have lots of little experimental contributions.
Since our analytics server is quite small, this could become an issue, because ETL runs on the same server. There's no reason the API should be using that much memory. How is that possible?