simonw / datasette Goto Github PK
View Code? Open in Web Editor NEWAn open source multi-tool for exploring and publishing data
Home Page: https://datasette.io
License: Apache License 2.0
An open source multi-tool for exploring and publishing data
Home Page: https://datasette.io
License: Apache License 2.0
Some SQLite databases embed images in columns. It would be cool if these had URLs.
/database-name-7sha256/table-name/compound-pk/column
/database-name-7sha256/table-name/compound-pk/column.json
/database-name-7sha256/table-name/compound-pk/column.png
/database-name-7sha256/table-name/compound-pk/column.gif
/database-name-7sha256/table-name/compound-pk/column.txt
The one without an explicit file extension auto-detects the correct extension.
The SQLite bundled with Python 3 doesn't support the FTS5 search extension. It would be nice if the SQLite built by our Dockerfile could support as many modern SQLite features as possible.
https://web.archive.org/web/20170212034155/http://charlesleifer.com/blog/using-the-sqlite-json1-and-fts5-extensions-with-python/ has instructions on building a more recent SQLite and the pysqlite package. Our Dockerfile could carry out an updated version of this process.
At a minimum this will include:
Version one should have the following characteristics:
I'm going to loop through every database file in the app root directory and bundle all of them.
Each one will be accessible at /databasename
Note this is without the file extension, and we will disallow multiple files with the same name but different extensions.
Supported extensions to start with will be .db
and .sqlite
and .sqlite3
/database-name-7sha256/table-name/compound-pk
/database-name-7sha256/table-name/compound-pk.json
Tricky part will be figuring out what the private key is - especially since it could be a compound primary key and it might involve different data types.
Absolutely everything should have a far-future expires header
Part of the URL will be the truncated sha1 hash of the database file itself, calculated at build time
https://pymotw.com/3/asyncio/executors.html
Would be good to have some actual benchmarks so I can evaluate if this is worth it or not.
Probably using YAML because then we get support for multiline strings:
bats:
db: bats.sqlite3
name: "Bat sightings"
queries:
specific_row: |
select * from Bats
where a = 1;
https://ace.c9.io/#nav=embedding looks like a good option
How about data from open elections eg https://github.com/openelections/openelections-data-ca?files=1
https://www.sqlite.org/pragma.html#pragma_foreign_key_list
SQLite has robust support for introspecting foreign keys. I could use that to automatically link to the corresponding record from my tables.
It would be cool if I could figure out a way to generate both the create table statements and the inserts for an individual table or the entire database and then stream them down to the client.
Imagine if this:
Turned into this:
This would involve introducing a new precedent of query string arguments that start with an _ having special meanings. While we're at it, could try adding _fields=x,y,z
Tasks:
Needs to redirect to the location with the hash while retaining the query string. This should also work with the .json extension.
Since we are going to have a metadata file which sets the title/description/etc for each database, why not allow you to run the app in —dev mode which makes the homepage into a WYSIWYG editor that can save to that file format.
Uploading files appears to be undocumented, but I found it in their code here: https://github.com/zeit/now-cli/blob/0ca7d1fe44ebdf460b64fdc38ba543b8e295ac40/src/providers/sh/util/index.js#L291
Full URL design:
/database-name
/database-name.json
/database-name-7sha256
/database-name-7sha256.json
/database-name/table-name
/database-name/table-name.json
/database-name-7sha256/table-name
/database-name-7sha256/table-name.json
/database-name-7sha256/table-name/compound-pk
/database-name-7sha256/table-name/compound-pk.json
Default returns this:
{
“Columns”: [“id”, “name”, “age”],
“Rows”: [
[45, “Simon”, 36]
]
}
.jsono instead returns a list of objects each duplicating the headers in its keys.
They both probably share the same pagination mechanism so it might not be a jsono flat list.
I sch database should have a name, optional description, download link and a summary of the tables
Flights.db
Flights and suchlike blah.
URL? License?
577373 rows across 14 tables
airports, routes, airlines...
Title of the homepage is derived from the databases or can be manually overridden e. “Datasets of Flights, NHS, Blah...” - or if only one database just the title of that.
Implemented in de04d7a
We run the risk of someone attempting a select statement that returns thousands of rows and hence takes several seconds just to JSON encode the response, effectively blocking the event loop and pausing all other traffic.
The Twisted community have a solution for this, can we adapt that in some way? http://as.ynchrono.us/2010/06/asynchronous-json_18.html?m=1
Include instructions on building a local Docker container - currently detailed here: https://gist.github.com/simonw/0ea5c960608c2d876e4637a5e48aa95d (those instructions don't work now that we have removed the Dockerfile in favour of a template generated by datasette publish
)
Since I am already running on Sanic, how hard would it be to add a websocket ebdpoint that lets you talk to sqlite interactively?
Could this be used to efficiently support streaming in answers to giant queries?
ujson is already a dependency of Sanic, and should be quite a bit faster.
Maybe in each request it checks the time and if 5s has elapsed since t last scanned the directory it scans it again
This would allow people with dedicated hosting to run the app there and just upload new datasets whenever they want. It would also be very convenient for development.
e.g
/database/table?name__contains=Simon&age__gte=4
Same format as Django: double underscore as the split.
If you need to match against a column that happens to contain a double underscore in its official name, do this:
/database/table?weird__column__exact=Simon
__exact is the default operation if none is supplied.
Options so far:
Terms to play with:
It would be neat if additional functionality could be opted-in to the system in the form of easy-to-add plugins, hosted as separate packages. First example: a Google Analytics plugin, which adds GA tracking code with your tracking ID to the web interface for your dataset.
This may be an opportunity to experiment with entry points: http://amir.rachum.com/blog/2017/07/28/python-entry-points/
Follow on from #11
?timestamp___date=2017-07-17
- return every item where the timestamp falls on that date?timestamp___year=2017
- return every item where the timestamp falls within 2017?timestamp___month=1
- return every item where the month component is January?timestamp___day=10
- return every item where the day-of-the-month component is 10Follow on from #23
Let's run SQLite operations in threads, so we don't end up blocking our core event loop.
These articles are helpful:
https://github.com/locustio/locust
Needed for #32
Use Sanic support for steaming large files http://sanic.readthedocs.io/en/latest/sanic/response.html#file-streaming
I’m currently operating under the assumption that it’s safe to allow arbitrary SQL statements because we are dealing with an immutable database. But this might not be the case - there are some pretty weird SQLite language extensions (ATTACH, PRAGMA etc) and I’m not certain they cannot be used to break things in a way that would affect future requests to the API.
Solution: provide a “safe mode” option which disables the ?sql= mechanism. This still leaves the URL filter lookups, so I need to make sure that those are “safe”.
In the future I may also implement a whitelist option where datasets can be configured to only allow specific filters against specific columns.
http://localhost:8006/northwind-40d049b/Categories.json 500s right now
The string representation of one of the values looks like this:
b"\x15\x1c/\x00\x02\x00
This is a bytestring from the database which cannot be naively converted to a unicode string.
Might be as simple as: pick he type of chart (bar, line) and then pick the column for the X axis and the column for the Y axis. Maybe also allow a pie chart. It’s up to the user to come up with SQL that gets the right values.
The app will ship with default templates but, just like with the Django admin, you will be able to override them using either explicit configuration settings or just by dropping in templates with certain file names.
Template inheritance should work here, both allowing you to override just the base template and allowing you to customize tiny bits of others.
Every single page of my interface should be implemented as a function that returns JSON.
I can then build my jinja templates on top of the exact data that would be returned by the API version.
Maybe do this using streaming with multiple pagination SQL queries so we can support arbritrarily large exports.
How would this work against a view which doesn’t have an obvious efficient pagination mechanism? Maybe limit views to up to 1000 exported records?
Relates to #5
Use Sanic’s testing mechanism. Test should create a temporary SQLite database file on disk by executing sql that is stored in the test themselves.
For the moment we can just test the JSON API more thoroughly and just sanity check that the HTML output doesn’t throw any errors.
I want to understand how the system could perform under load with many concurrent long-running queries. Can we serve these without blocking the event loop?
The first version needs to take one or more file names or URLs, then generate and deploy an app to Now. It will assume you already have the now command installed and configured.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.