simonw / datasette


9.0K 99.0 636.0 6.3 MB

An open source multi-tool for exploring and publishing data

Home Page: https://datasette.io

License: Apache License 2.0

Python 89.13% HTML 6.66% CSS 1.46% Dockerfile 0.05% JavaScript 2.15% Shell 0.32% C 0.13% Just 0.10%

sqlite python datasets json docker datasette automatic-api asgi csv datasette-io

datasette's Issues

Make individual column values addressable, with smart content types

Some SQLite databases embed images in columns. It would be cool if these had URLs.

/database-name-7sha256/table-name/compound-pk/column
/database-name-7sha256/table-name/compound-pk/column.json
/database-name-7sha256/table-name/compound-pk/column.png
/database-name-7sha256/table-name/compound-pk/column.gif
/database-name-7sha256/table-name/compound-pk/column.txt

The one without an explicit file extension auto-detects the correct extension.

Dockerfile should build more recent SQLite with FTS5 and spatialite support

The SQLite bundled with Python 3 doesn't support the FTS5 search extension. It would be nice if the SQLite built by our Dockerfile could support as many modern SQLite features as possible.

https://web.archive.org/web/20170212034155/http://charlesleifer.com/blog/using-the-sqlite-json1-and-fts5-extensions-with-python/ has instructions on building a more recent SQLite and the pysqlite package. Our Dockerfile could carry out an updated version of this process.

Code that generates compile-time properties about the database

At a minimum this will include:

sha hash of each database file
list of tables with row counts for each database file
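
A minimal sketch of how these two properties could be computed at build time. The function name `inspect_database` and the returned dictionary shape are hypothetical, and SHA-256 is assumed because the URL designs elsewhere in these issues mention `7sha256`.

```python
import hashlib
import sqlite3


def inspect_database(path):
    """Return the SHA-256 hex digest and per-table row counts for one SQLite file."""
    digest = hashlib.sha256()
    with open(path, "rb") as fp:
        # Read in chunks so large database files are never loaded into memory at once.
        for chunk in iter(lambda: fp.read(1024 * 1024), b""):
            digest.update(chunk)

    conn = sqlite3.connect(path)
    try:
        table_names = [
            row[0]
            for row in conn.execute(
                "select name from sqlite_master where type = 'table' order by name"
            )
        ]
        tables = {}
        for name in table_names:
            # Table names cannot be bound as parameters, so quote them as identifiers.
            quoted = '"' + name.replace('"', '""') + '"'
            tables[name] = conn.execute(f"select count(*) from {quoted}").fetchone()[0]
    finally:
        conn.close()
    return {"hash": digest.hexdigest(), "tables": tables}
```

The first seven characters of `hash` could then serve as the truncated hash in the URL.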

Default HTML/CSS needs to look reasonable and be responsive

Version one should have the following characteristics:

Looks OK
Works great on mobile
Loads extremely fast
No JavaScript! At least not in v1.

Support multiple databases

I'm going to loop through every database file in the app root directory and bundle all of them.

Each one will be accessible at /databasename

Note this is without the file extension, and we will disallow multiple files with the same name but different extensions.

Supported extensions to start with will be .db and .sqlite and .sqlite3
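
The rules above could be sketched as follows. `discover_databases` is a hypothetical helper name, and raising `ValueError` on a name clash is one assumed way of "disallowing" duplicates.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".db", ".sqlite", ".sqlite3"}


def discover_databases(root):
    """Map URL name (file name without extension) to path for each database in root."""
    databases = {}
    for path in sorted(Path(root).iterdir()):
        if not path.is_file() or path.suffix not in SUPPORTED_EXTENSIONS:
            continue
        if path.stem in databases:
            # Two files such as flights.db and flights.sqlite would both map to /flights.
            raise ValueError(
                f"Duplicate database name {path.stem!r}: "
                f"{databases[path.stem].name} and {path.name}"
            )
        databases[path.stem] = path
    return databases
```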

Addressable pages for every row in a table

/database-name-7sha256/table-name/compound-pk
/database-name-7sha256/table-name/compound-pk.json

Tricky part will be figuring out what the primary key is - especially since it could be a compound primary key and it might involve different data types.
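
One possible way to detect the key columns is SQLite's `PRAGMA table_info`, which reports the position of each column within the primary key. The helper name `primary_key_columns` is hypothetical; this is a sketch, not the project's implementation.

```python
import sqlite3


def primary_key_columns(conn, table):
    """Return [(column_name, declared_type), ...] in primary key order."""
    quoted = '"' + table.replace('"', '""') + '"'
    rows = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk). pk is 0 for columns
    # outside the key, otherwise the 1-based position within the primary key.
    pk_rows = sorted((row for row in rows if row[5]), key=lambda row: row[5])
    return [(row[1], row[2]) for row in pk_rows]
```

An empty result means the table declares no primary key, in which case the rowid is the likely fallback for addressing a row.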

Idea: colour scheme based on sha256 of db

Make URLs immutable

Absolutely everything should have a far-future expires header

Part of the URL will be the truncated sha1 hash of the database file itself, calculated at build time

Try running SQLite queries in a separate thread

https://pymotw.com/3/asyncio/executors.html

Would be good to have some actual benchmarks so I can evaluate if this is worth it or not.
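
A minimal sketch of the executor approach described in the linked article, using only the standard library. The names `run_query` and `execute` and the pool size of 3 are assumptions for illustration.

```python
import asyncio
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Pool size is an arbitrary choice for this sketch.
executor = ThreadPoolExecutor(max_workers=3)


def run_query(path, sql, params=()):
    # Open a connection per call so no connection object is shared between threads.
    conn = sqlite3.connect(path)
    try:
        return conn.execute(sql, params).fetchall()
    finally:
        conn.close()


async def execute(path, sql, params=()):
    # The blocking SQLite call runs in a worker thread; the event loop stays free.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, run_query, path, sql, params)
```

Timing the same query with and without the executor under concurrent load would be one way to get the benchmark numbers mentioned above.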

Config file with support for defining canned queries

Probably using YAML because then we get support for multiline strings:

bats:
  db: bats.sqlite3
  name: "Bat sightings"
  queries:
    specific_row: |
      select * from Bats
      where a = 1;

SQLite has robust support for introspecting foreign keys. I could use that to automatically link to the corresponding record from my tables.
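
The introspection could look roughly like this, using `PRAGMA foreign_key_list`. The helper name `foreign_keys` and the returned dictionary keys are hypothetical.

```python
import sqlite3


def foreign_keys(conn, table):
    """List the foreign keys declared on a table."""
    quoted = '"' + table.replace('"', '""') + '"'
    rows = conn.execute(f"PRAGMA foreign_key_list({quoted})").fetchall()
    # Row layout: (id, seq, table, from, to, on_update, on_delete, match)
    return [
        {"column": row[3], "other_table": row[2], "other_column": row[4]}
        for row in rows
    ]
```

Each entry gives enough information to turn a cell value into a link to the row it references.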

Endpoint that returns SQL ready to be piped into DB

It would be cool if I could figure out a way to generate both the create table statements and the inserts for an individual table or the entire database and then stream them down to the client.
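
For the whole-database case, Python's `sqlite3` module already provides `Connection.iterdump()`, which yields the create table and insert statements one at a time. A sketch of wrapping it in a generator suitable for a streaming response (the function name `stream_sql_dump` is hypothetical):

```python
import sqlite3


def stream_sql_dump(path):
    """Yield the SQL needed to recreate the database, one statement per chunk."""
    conn = sqlite3.connect(path)
    try:
        for statement in conn.iterdump():
            yield statement + "\n"
    finally:
        conn.close()
```

Dumping a single table would need extra filtering, which this sketch does not attempt.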

?_group_count=country - return counts by specific column(s)

Imagine if this:

https://stateless-datasets-jykibytogk.now.sh/flights-07d1283/airports.jsono?country__contains=gu&_group_count=country

Turned into this:

https://stateless-datasets-jykibytogk.now.sh/flights-07d1283?sql=select%20country,%20count(*)%20as%20group_count_country%20from%20airports%20where%20country%20like%20%27%gu%%27%20group%20by%20country%20order%20by%20group_count_country%20desc

This would involve introducing a new precedent of query string arguments that start with an _ having special meanings. While we're at it, could try adding _fields=x,y,z
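
A rough sketch of the translation from `_group_count` to SQL. The function name `group_count_sql` is hypothetical, and the `where` argument is assumed to be a clause with `?` placeholders produced by the existing filter handling.

```python
def group_count_sql(table, columns, where=""):
    """Build a 'count rows per distinct value' query for the given columns."""

    def quote(identifier):
        return '"' + identifier.replace('"', '""') + '"'

    cols = ", ".join(quote(c) for c in columns)
    alias = quote("group_count_" + "_".join(columns))
    sql = f"select {cols}, count(*) as {alias} from {quote(table)}"
    if where:
        sql += f" where {where}"
    sql += f" group by {cols} order by {alias} desc"
    return sql
```

For the example above this would be called as `group_count_sql("airports", ["country"], "country like ?")` with `"%gu%"` passed as the bound parameter.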

Tasks:

Get initial version working
Refactor code to not just "pretend to be a view"
Get foreign key relationships expanded

/database?sql= should redirect correctly

Needs to redirect to the location with the hash while retaining the query string. This should also work with the .json extension.

Homepage UI for editing metadata file

Since we are going to have a metadata file which sets the title/description/etc. for each database, why not let you run the app in --dev mode, which turns the homepage into a WYSIWYG editor that can save to that file format?

Command line tool for uploading one or more DBs to Now

Uploading files appears to be undocumented, but I found it in their code here: https://github.com/zeit/now-cli/blob/0ca7d1fe44ebdf460b64fdc38ba543b8e295ac40/src/providers/sh/util/index.js#L291

Implement full URL design

Full URL design:

/database-name
/database-name.json
/database-name-7sha256
/database-name-7sha256.json
/database-name/table-name
/database-name/table-name.json
/database-name-7sha256/table-name
/database-name-7sha256/table-name.json
/database-name-7sha256/table-name/compound-pk
/database-name-7sha256/table-name/compound-pk.json

Better JSON response options

Default returns this:

{
    "Columns": ["id", "name", "age"],
    "Rows": [
        [45, "Simon", 36]
    ]
}

.jsono instead returns a list of objects each duplicating the headers in its keys.

They both probably share the same pagination mechanism so it might not be a jsono flat list.
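
The relationship between the two shapes can be sketched in a few lines. The function name `rows_as_objects` is hypothetical; the key names follow the example above.

```python
def rows_as_objects(data):
    """Convert the columns-plus-rows shape into a list of objects keyed by column."""
    columns = data["Columns"]
    return [dict(zip(columns, row)) for row in data["Rows"]]
```

The object form repeats every column name in every row, which is more convenient to consume but larger on the wire.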

Homepage should show summary of databases

Each database should have a name, optional description, download link and a summary of the tables

Flights.db
Flights and suchlike blah.
URL? License?
577373 rows across 14 tables
airports, routes, airlines...

Title of the homepage is derived from the databases or can be manually overridden, e.g. “Datasets of Flights, NHS, Blah...” - or if only one database just the title of that.

Initial proof-of-concept

Implemented in de04d7a

Ability to serialize massive JSON without blocking event loop

We run the risk of someone attempting a select statement that returns thousands of rows and hence takes several seconds just to JSON encode the response, effectively blocking the event loop and pausing all other traffic.

The Twisted community have a solution for this, can we adapt that in some way? http://as.ynchrono.us/2010/06/asynchronous-json_18.html?m=1

Make a proper README

Include instructions on building a local Docker container - currently detailed here: https://gist.github.com/simonw/0ea5c960608c2d876e4637a5e48aa95d (those instructions don't work now that we have removed the Dockerfile in favour of a template generated by datasette publish)

See if I can get a websockets interface working

Since I am already running on Sanic, how hard would it be to add a websocket endpoint that lets you talk to sqlite interactively?

Could this be used to efficiently support streaming in answers to giant queries?

Switch to ujson

ujson is already a dependency of Sanic, and should be quite a bit faster.

While running, server should spot new db files added to its directory

Maybe on each request it checks the time, and if 5s has elapsed since it last scanned the directory it scans it again.

This would allow people with dedicated hosting to run the app there and just upload new datasets whenever they want. It would also be very convenient for development.

Support Django-style filters in querystring arguments

e.g.

/database/table?name__contains=Simon&age__gte=4

Same format as Django: double underscore as the split.

If you need to match against a column that happens to contain a double underscore in its official name, do this:

/database/table?weird__column__exact=Simon

__exact is the default operation if none is supplied.
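
A minimal sketch of how such a querystring could be turned into a parameterised WHERE clause. The function name `build_where` and the set of supported lookups are assumptions; only `contains`, `gte` and `exact` appear in the text above.

```python
from urllib.parse import parse_qsl

# Lookup name -> (SQL template, template applied to the value before binding).
LOOKUPS = {
    "exact": ("{col} = ?", "{}"),
    "contains": ("{col} like ?", "%{}%"),
    "gt": ("{col} > ?", "{}"),
    "gte": ("{col} >= ?", "{}"),
    "lt": ("{col} < ?", "{}"),
    "lte": ("{col} <= ?", "{}"),
}


def build_where(query_string):
    """Return (where_clause, params) for a Django-style filter querystring."""
    clauses, params = [], []
    for key, value in parse_qsl(query_string):
        # Split on the LAST double underscore so column names may contain "__".
        column, sep, lookup = key.rpartition("__")
        if not sep or lookup not in LOOKUPS:
            column, lookup = key, "exact"
        sql_template, value_template = LOOKUPS[lookup]
        quoted = '"' + column.replace('"', '""') + '"'
        clauses.append(sql_template.format(col=quoted))
        params.append(value_template.format(value))
    return " and ".join(clauses), params
```

Values are always bound as parameters rather than interpolated, so the filter values themselves cannot inject SQL.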

Attempting an INSERT or UPDATE should return a sane error message

Pick a name

Options so far:

immutabase
datasite
sqlstatic
dbserve
sqlserve

Terms to play with:

immutable
sqlite
dataset
json
static
serve

Datasette Plugins

It would be neat if additional functionality could be opted-in to the system in the form of easy-to-add plugins, hosted as separate packages. First example: a Google Analytics plugin, which adds GA tracking code with your tracking ID to the web interface for your dataset.

This may be an opportunity to experiment with entry points: http://amir.rachum.com/blog/2017/07/28/python-entry-points/

In development mode, should still pick up new .db files

Follow on from #11

Implement sensible query pagination

Refactor to use class based views

http://sanic.readthedocs.io/en/latest/sanic/class_based_views.html

date, year, month and day querystring lookups

?timestamp__date=2017-07-17 - return every item where the timestamp falls on that date
?timestamp__year=2017 - return every item where the timestamp falls within 2017
?timestamp__month=1 - return every item where the month component is January
?timestamp__day=10 - return every item where the day-of-the-month component is 10
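
These lookups could map onto SQLite's built-in `date()` and `strftime()` functions. The sketch below assumes timestamps are stored as ISO 8601 text; `DATE_LOOKUPS` and `date_clause` are hypothetical names.

```python
import sqlite3

# Lookup name -> SQL fragment; {col} is replaced with the quoted column name.
DATE_LOOKUPS = {
    "date": "date({col}) = ?",
    "year": "cast(strftime('%Y', {col}) as integer) = ?",
    "month": "cast(strftime('%m', {col}) as integer) = ?",
    "day": "cast(strftime('%d', {col}) as integer) = ?",
}


def date_clause(column, lookup, value):
    """Return (sql_fragment, parameter) for one date-based lookup."""
    quoted = '"' + column.replace('"', '""') + '"'
    sql = DATE_LOOKUPS[lookup].format(col=quoted)
    # The date lookup compares text; the others compare integer components.
    param = value if lookup == "date" else int(value)
    return sql, param
```

Wrapping the column in a function means an ordinary index on it cannot be used, so these lookups may be slow on large tables.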

Follow on from #23

Run SQLite operations in a thread pool

Let's run SQLite operations in threads, so we don't end up blocking our core event loop.

These articles are helpful:

Use Sanic configuration mechanism

http://sanic.readthedocs.io/en/latest/sanic/config.html

Use locust for benchmarking and load tests

https://github.com/locustio/locust

Needed for #32

Efficient url for downloading the raw database file

Use Sanic support for streaming large files: http://sanic.readthedocs.io/en/latest/sanic/response.html#file-streaming

Protect against malicious SQL that causes damage even though our DB is immutable

I’m currently operating under the assumption that it’s safe to allow arbitrary SQL statements because we are dealing with an immutable database. But this might not be the case - there are some pretty weird SQLite language extensions (ATTACH, PRAGMA etc) and I’m not certain they cannot be used to break things in a way that would affect future requests to the API.

Solution: provide a “safe mode” option which disables the ?sql= mechanism. This still leaves the URL filter lookups, so I need to make sure that those are “safe”.

In the future I may also implement a whitelist option where datasets can be configured to only allow specific filters against specific columns.
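
One defensive layer that needs only the standard library is to open the file through a read-only URI, so writes fail inside SQLite itself and the error can be reported cleanly. This is a sketch under that assumption, with hypothetical helper names; it does not by itself settle the ATTACH and PRAGMA questions raised above.

```python
import sqlite3
from pathlib import Path


def connect_read_only(path):
    """Open an existing SQLite file so that any write attempt fails."""
    # as_uri() percent-encodes the path; mode=ro asks SQLite for read-only access.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def run_sql(conn, sql):
    """Run a statement and return either its rows or a readable error message."""
    try:
        return {"ok": True, "rows": conn.execute(sql).fetchall()}
    except sqlite3.Error as exc:
        return {"ok": False, "error": str(exc)}
```

An INSERT or UPDATE sent through `run_sql` comes back as `{"ok": False, "error": ...}` rather than an unhandled exception.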

Handle bytestring records encoding to JSON

http://localhost:8006/northwind-40d049b/Categories.json 500s right now

The string representation of one of the values looks like this:

b"\x15\x1c/\x00\x02\x00

This is a bytestring from the database which cannot be naively converted to a unicode string.
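
One possible fix is a custom JSON encoder that base64-encodes any `bytes` value. The class name and the `{"$base64": true, "encoded": ...}` wrapper shape are assumptions chosen for this sketch.

```python
import base64
import json


class BytesAwareEncoder(json.JSONEncoder):
    """JSON encoder that represents bytes values as base64 text."""

    def default(self, obj):
        if isinstance(obj, bytes):
            # Wrap the value so clients can tell it apart from an ordinary string.
            return {
                "$base64": True,
                "encoded": base64.b64encode(obj).decode("ascii"),
            }
        return super().default(obj)
```

It is used by passing `cls=BytesAwareEncoder` to `json.dumps`.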

Ability to plot a simple graph

Might be as simple as: pick the type of chart (bar, line) and then pick the column for the X axis and the column for the Y axis. Maybe also allow a pie chart. It’s up to the user to come up with SQL that gets the right values.

Make it so you can override templates

The app will ship with default templates but, just like with the Django admin, you will be able to override them using either explicit configuration settings or just by dropping in templates with certain file names.

Template inheritance should work here, both allowing you to override just the base template and allowing you to customize tiny bits of others.

Framework whereby every page is JSON plus a template

Every single page of my interface should be implemented as a function that returns JSON.

I can then build my jinja templates on top of the exact data that would be returned by the API version.

Support CSV export with a .csv extension

Maybe do this using streaming with multiple paginated SQL queries so we can support arbitrarily large exports.

How would this work against a view which doesn’t have an obvious efficient pagination mechanism? Maybe limit views to up to 1000 exported records?
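
For ordinary tables, the streaming idea could be sketched with keyset pagination on `rowid`, emitting one CSV chunk per page. The function name `stream_csv` and the default page size are assumptions; views have no rowid, so this sketch does not cover them.

```python
import csv
import io
import sqlite3


def stream_csv(conn, table, page_size=100):
    """Yield CSV text for a table, fetching one page of rows per query."""
    quoted = '"' + table.replace('"', '""') + '"'
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    last_rowid = None
    while True:
        # Keyset pagination: continue after the last rowid seen, not with OFFSET.
        where = "" if last_rowid is None else "where rowid > ?"
        params = (page_size,) if last_rowid is None else (last_rowid, page_size)
        cursor = conn.execute(
            f"select rowid, * from {quoted} {where} order by rowid limit ?", params
        )
        rows = cursor.fetchall()
        if last_rowid is None:
            # First page: emit the header, skipping the leading rowid column.
            writer.writerow([col[0] for col in cursor.description[1:]])
        for row in rows:
            writer.writerow(row[1:])
        if rows:
            last_rowid = rows[-1][0]
        chunk = buffer.getvalue()
        buffer.seek(0)
        buffer.truncate()
        if chunk:
            yield chunk
        if len(rows) < page_size:
            break
```

Only one page of rows is held in memory at a time, whatever the size of the table.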

Relates to #5