simonw / datasette

An open source multi-tool for exploring and publishing data

Home Page: https://datasette.io

License: Apache License 2.0

Python 89.13% HTML 6.66% CSS 1.46% Dockerfile 0.05% JavaScript 2.15% Shell 0.32% C 0.13% Just 0.10%

sqlite python datasets json docker datasette automatic-api asgi csv datasette-io

datasette's Introduction

Datasette is a tool for exploring and publishing data. It helps people take data of any shape or size and publish that as an interactive, explorable website and accompanying API.

Datasette is aimed at data journalists, museum curators, archivists, local governments, scientists, researchers and anyone else who has data that they wish to share with the world.

Explore a demo, watch a video about the project or try it out by uploading and publishing your own CSV data.

datasette.io is the official project website
Latest Datasette News
Comprehensive documentation: https://docs.datasette.io/
Examples: https://datasette.io/examples
Live demo of current main branch: https://latest.datasette.io/
Questions, feedback or want to talk about the project? Join our Discord

Want to stay up-to-date with the project? Subscribe to the Datasette newsletter for tips, tricks and news on what's new in the Datasette ecosystem.

Installation

If you are on a Mac, Homebrew is the easiest way to install Datasette:

brew install datasette

You can also install it using pip or pipx:

pip install datasette

Datasette requires Python 3.8 or higher. We also have detailed installation instructions covering other options such as Docker.

Basic usage

datasette serve path/to/database.db

This will start a web server on port 8001 - visit http://localhost:8001/ to access the web interface.

serve is the default subcommand, so you can omit it if you like.

Use Chrome on OS X? You can run datasette against your browser history like so:

 datasette ~/Library/Application\ Support/Google/Chrome/Default/History --nolock

Now visiting http://localhost:8001/History/downloads will show you a web interface to browse your downloads data.

metadata.json

If you want to include licensing and source information in the generated datasette website you can do so using a JSON file that looks something like this:

{
    "title": "Five Thirty Eight",
    "license": "CC Attribution 4.0 License",
    "license_url": "http://creativecommons.org/licenses/by/4.0/",
    "source": "fivethirtyeight/data on GitHub",
    "source_url": "https://github.com/fivethirtyeight/data"
}

Save this in metadata.json and run Datasette like so:

datasette serve fivethirtyeight.db -m metadata.json

The license and source information will be displayed on the index page and in the footer. They will also be included in the JSON produced by the API.

datasette publish

If you have Heroku or Google Cloud Run configured, Datasette can deploy one or more SQLite databases to the internet with a single command:

datasette publish heroku database.db

Or:

datasette publish cloudrun database.db

This will create a docker image containing both the datasette application and the specified SQLite database files. It will then deploy that image to Heroku or Cloud Run and give you a URL to access the resulting website and API.

See Publishing data in the documentation for more details.

Datasette Lite

Datasette Lite is Datasette packaged using WebAssembly so that it runs entirely in your browser, no Python web application server required. Read more about that in the Datasette Lite documentation.

datasette's Issues

/database?sql= should redirect correctly

Needs to redirect to the location with the hash while retaining the query string. This should also work with the .json extension.

Idea: colour scheme based on sha256 of db

Implement command-line tool interface

The first version needs to take one or more file names or URLs, then generate and deploy an app to Now. It will assume you already have the now command installed and configured.

Homepage UI for editing metadata file

Since we are going to have a metadata file which sets the title/description/etc for each database, why not allow you to run the app in --dev mode, which turns the homepage into a WYSIWYG editor that can save to that file format?

Add a syntax highlighting SQL editor

https://ace.c9.io/#nav=embedding looks like a good option

Code that generates compile-time properties about the database

At a minimum this will include:

sha hash of each database file
list of tables with row counts for each database file
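
As a rough illustration of those two properties, here is a minimal sketch using only the standard library. The function name inspect_database is hypothetical, and SHA-256 is an assumption (the issue only says "sha hash").

```python
import hashlib
import sqlite3


def inspect_database(path):
    """Return a SHA-256 hex digest and per-table row counts for one file."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in chunks so large database files are not read into memory at once.
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)

    conn = sqlite3.connect(path)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "select name from sqlite_master where type = 'table' order by name"
            )
        ]
        counts = {}
        for table in tables:
            # Double any embedded double quotes so the identifier is quoted safely.
            quoted = '"' + table.replace('"', '""') + '"'
            counts[table] = conn.execute(
                f"select count(*) from {quoted}"
            ).fetchone()[0]
    finally:
        conn.close()
    return {"hash": digest.hexdigest(), "tables": counts}
```

Counting rows with count(*) scans each table, so for very large files this step may be slow; that cost is the reason to do it once at build time.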

Ability to plot a simple graph

Might be as simple as: pick the type of chart (bar, line) and then pick the column for the X axis and the column for the Y axis. Maybe also allow a pie chart. It’s up to the user to come up with SQL that gets the right values.

date, year, month and day querystring lookups

?timestamp___date=2017-07-17 - return every item where the timestamp falls on that date
?timestamp___year=2017 - return every item where the timestamp falls within 2017
?timestamp___month=1 - return every item where the month component is January
?timestamp___day=10 - return every item where the day-of-the-month component is 10

Follow on from #23
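
One possible way to map these four lookups onto SQL is to lean on SQLite's built-in date() and strftime() functions. The sketch below is an assumption about how that translation could look, not the project's implementation; DATE_LOOKUPS and date_lookup_clause are hypothetical names, and parsing the query string key into a column and a lookup is left out.

```python
import sqlite3

# Maps each lookup name to a SQL template; {col} is the quoted column name.
DATE_LOOKUPS = {
    "date": "date({col}) = ?",
    "year": "cast(strftime('%Y', {col}) as integer) = ?",
    "month": "cast(strftime('%m', {col}) as integer) = ?",
    "day": "cast(strftime('%d', {col}) as integer) = ?",
}


def date_lookup_clause(column, lookup, value):
    """Return (where_clause, parameter) for one date-style lookup."""
    quoted = '"' + column.replace('"', '""') + '"'
    clause = DATE_LOOKUPS[lookup].format(col=quoted)
    # Dates stay as ISO strings; year, month and day are compared as integers.
    param = value if lookup == "date" else int(value)
    return clause, param


def count_matches(conn, table, column, lookup, value):
    """Run the clause against a table and return the number of matching rows."""
    clause, param = date_lookup_clause(column, lookup, value)
    quoted_table = '"' + table.replace('"', '""') + '"'
    sql = f"select count(*) from {quoted_table} where {clause}"
    return conn.execute(sql, (param,)).fetchone()[0]
```

This assumes timestamps are stored in a format SQLite's date functions understand, such as ISO 8601 text.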

Implement full URL design

Full URL design:

/database-name
/database-name.json
/database-name-7sha256
/database-name-7sha256.json
/database-name/table-name
/database-name/table-name.json
/database-name-7sha256/table-name
/database-name-7sha256/table-name.json
/database-name-7sha256/table-name/compound-pk
/database-name-7sha256/table-name/compound-pk.json
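
To make the URL shapes above concrete, here is a small parsing sketch. It assumes "7sha256" means the first seven lowercase hex characters of the hash, separated from the database name by a hyphen; parse_path is a hypothetical helper, not part of the project.

```python
import re

# Seven lowercase hex characters at the end of the first path segment are
# treated as the truncated hash (an assumption based on "7sha256" above).
_HASH_SUFFIX = re.compile(r"^(?P<name>.+)-(?P<hash>[0-9a-f]{7})$")


def parse_path(path):
    """Split a URL path into database, hash, table, pk and format."""
    as_json = path.endswith(".json")
    if as_json:
        path = path[: -len(".json")]
    segments = [s for s in path.split("/") if s]
    if not 1 <= len(segments) <= 3:
        raise ValueError("expected /database[/table[/pk]]")
    match = _HASH_SUFFIX.match(segments[0])
    if match:
        database, db_hash = match.group("name"), match.group("hash")
    else:
        database, db_hash = segments[0], None
    return {
        "database": database,
        "hash": db_hash,
        "table": segments[1] if len(segments) > 1 else None,
        "pk": segments[2] if len(segments) > 2 else None,
        "format": "json" if as_json else "html",
    }
```

One known ambiguity in this sketch: a database whose own name ends in a hyphen followed by seven hex characters would be misread as carrying a hash, so a real implementation would need to check the candidate against the known database names.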

Make it so you can override templates

The app will ship with default templates but, just like with the Django admin, you will be able to override them using either explicit configuration settings or just by dropping in templates with certain file names.

Template inheritance should work here, both allowing you to override just the base template and allowing you to customize tiny bits of others.

Switch to ujson

ujson is already a dependency of Sanic, and should be quite a bit faster.

Try running SQLite queries in a separate thread

https://pymotw.com/3/asyncio/executors.html

Would be good to have some actual benchmarks so I can evaluate if this is worth it or not.
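
For reference, the executor pattern from that article looks roughly like this when applied to sqlite3. This is a sketch under assumptions: the pool size of 3 is arbitrary, and execute and _run_query are hypothetical names.

```python
import asyncio
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# A small pool of worker threads; the size is an arbitrary choice.
executor = ThreadPoolExecutor(max_workers=3)


def _run_query(db_path, sql, params):
    # Open the connection inside the worker thread: by default a sqlite3
    # connection may only be used from the thread that created it.
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(sql, params).fetchall()
    finally:
        conn.close()


async def execute(db_path, sql, params=()):
    """Run a query in a worker thread without blocking the event loop."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, _run_query, db_path, sql, params)
```

Opening a fresh connection per query keeps the sketch simple; whether that overhead matters is exactly the kind of thing the benchmarks would need to show.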

Config file with support for defining canned queries

Probably using YAML because then we get support for multiline strings:

bats:
  db: bats.sqlite3
  name: "Bat sightings"
  queries:
    specific_row: |
      select * from Bats
      where a = 1;

Ability to serialize massive JSON without blocking event loop

We run the risk of someone attempting a select statement that returns thousands of rows and hence takes several seconds just to JSON encode the response, effectively blocking the event loop and pausing all other traffic.

The Twisted community have a solution for this, can we adapt that in some way? http://as.ynchrono.us/2010/06/asynchronous-json_18.html?m=1
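
A possible asyncio adaptation of the same idea: encode incrementally with json.JSONEncoder.iterencode() and yield to the event loop every so often. This is only a sketch; encode_json_cooperatively and the chunks_per_yield default are hypothetical.

```python
import asyncio
import json


async def encode_json_cooperatively(obj, chunks_per_yield=1000):
    """Encode obj to a JSON string, periodically yielding to the event loop."""
    encoder = json.JSONEncoder()
    parts = []
    for i, chunk in enumerate(encoder.iterencode(obj), start=1):
        parts.append(chunk)
        if i % chunks_per_yield == 0:
            # Hand control back to the event loop so other requests can run.
            await asyncio.sleep(0)
    return "".join(parts)
```

The trade-off is that incremental encoding is likely to be slower overall than a single json.dumps() call, so total time may go up even as the longest pause goes down.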

Efficient url for downloading the raw database file

Use Sanic support for streaming large files http://sanic.readthedocs.io/en/latest/sanic/response.html#file-streaming

Support Django-style filters in querystring arguments

e.g.

/database/table?name__contains=Simon&age__gte=4

Same format as Django: double underscore as the split.

If you need to match against a column that happens to contain a double underscore in its official name, do this:

/database/table?weird__column__exact=Simon

__exact is the default operation if none is supplied.
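
The rules above can be sketched as a small translator from query string arguments to a parameterized WHERE clause. The operator table and the build_where name are illustrative assumptions; only contains, gte and exact appear in the issue itself.

```python
# Operators supported by this sketch, mapped to SQL comparison templates.
OPERATORS = {
    "exact": "{col} = ?",
    "contains": "{col} like ?",
    "gt": "{col} > ?",
    "gte": "{col} >= ?",
    "lt": "{col} < ?",
    "lte": "{col} <= ?",
}


def build_where(args):
    """Turn {'name__contains': 'Simon'} style arguments into (sql, params)."""
    clauses, params = [], []
    for key, value in args.items():
        # Split on the LAST double underscore so that column names which
        # themselves contain "__" still work when an operator is supplied.
        column, sep, op = key.rpartition("__")
        if not sep or op not in OPERATORS:
            # No recognised operator: treat the whole key as the column name.
            column, op = key, "exact"
        quoted = '"' + column.replace('"', '""') + '"'
        clauses.append(OPERATORS[op].format(col=quoted))
        params.append(f"%{value}%" if op == "contains" else value)
    return " and ".join(clauses), params
```

For example, build_where({"weird__column__exact": "Simon"}) treats weird__column as the column name, which is the case the explicit __exact suffix exists for.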

Initial test suite

Unit tests against application itself

Use Sanic’s testing mechanism. Tests should create a temporary SQLite database file on disk by executing SQL that is stored in the tests themselves.

For the moment we can just test the JSON API more thoroughly and just sanity check that the HTML output doesn’t throw any errors.
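
The database-creation half of that plan could look something like the sketch below. It covers only the fixture, not Sanic's test client; the table, its contents and the make_test_database name are made up for illustration.

```python
import os
import sqlite3
import tempfile

# SQL kept alongside the tests; this schema is purely illustrative.
TABLES_SQL = """
create table simple_primary_key (pk integer primary key, content text);
insert into simple_primary_key values (1, 'hello');
insert into simple_primary_key values (2, 'world');
"""


def make_test_database(directory, filename="test_tables.db"):
    """Create a SQLite file in directory from TABLES_SQL and return its path."""
    path = os.path.join(directory, filename)
    conn = sqlite3.connect(path)
    try:
        conn.executescript(TABLES_SQL)
        conn.commit()
    finally:
        conn.close()
    return path


def count_rows(path, table):
    """Helper used by tests to check the fixture was written to disk."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute(f'select count(*) from "{table}"').fetchone()[0]
    finally:
        conn.close()
```

A test would typically call make_test_database() inside a tempfile.TemporaryDirectory() block so the file is cleaned up afterwards.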

While running, server should spot new db files added to its directory

Maybe in each request it checks the time and, if 5s has elapsed since it last scanned the directory, it scans it again.

This would allow people with dedicated hosting to run the app there and just upload new datasets whenever they want. It would also be very convenient for development.
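
A minimal sketch of the "rescan at most every 5 seconds" idea follows. The DatabaseDirectory class is hypothetical, it only looks for .db files, and the clock is injectable so the behaviour can be tested without waiting.

```python
import os
import time


class DatabaseDirectory:
    """Lists .db files in a directory, rescanning at most once per interval."""

    def __init__(self, root, interval=5.0, clock=time.monotonic):
        self.root = root
        self.interval = interval
        self._clock = clock
        self._last_scan = None
        self._files = []

    def files(self):
        # Intended to be called on each request; only touches the disk
        # when the interval has elapsed since the previous scan.
        now = self._clock()
        if self._last_scan is None or now - self._last_scan >= self.interval:
            self._files = sorted(
                name for name in os.listdir(self.root) if name.endswith(".db")
            )
            self._last_scan = now
        return self._files
```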

Experiment with patterns for concurrent long running queries

I want to understand how the system could perform under load with many concurrent long-running queries. Can we serve these without blocking the event loop?

Addressable pages for every row in a table

/database-name-7sha256/table-name/compound-pk
/database-name-7sha256/table-name/compound-pk.json

Tricky part will be figuring out what the primary key is - especially since it could be a compound primary key and it might involve different data types.
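
SQLite can report primary key columns through PRAGMA table_info, which may help here. A sketch, with primary_key_columns as a hypothetical helper name:

```python
import sqlite3


def primary_key_columns(conn, table):
    """Return the primary key column names of a table, in key order."""
    quoted = '"' + table.replace('"', '""') + '"'
    rows = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk). pk is 0 for
    # columns outside the key, otherwise the 1-based position within it.
    keyed = sorted((row[5], row[1]) for row in rows if row[5] > 0)
    return [name for _, name in keyed]
```

An empty list means the table declares no primary key; in that case falling back to the rowid is one option, though that is an assumption rather than something the issue decides.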

Make a proper README

Include instructions on building a local Docker container - currently detailed here: https://gist.github.com/simonw/0ea5c960608c2d876e4637a5e48aa95d (those instructions don't work now that we have removed the Dockerfile in favour of a template generated by datasette publish)

Create neat example database

How about data from open elections eg https://github.com/openelections/openelections-data-ca?files=1

Initial proof-of-concept

Implemented in de04d7a

Handle bytestring records encoding to JSON

http://localhost:8006/northwind-40d049b/Categories.json 500s right now

The string representation of one of the values looks like this:

b"\x15\x1c/\x00\x02\x00

This is a bytestring from the database which cannot be naively converted to a unicode string.
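
One way to avoid the 500 is a custom JSON encoder that handles bytes explicitly. The sketch below tries UTF-8 first and falls back to base64; the shape of the fallback object is an illustrative choice, and BytesFriendlyEncoder is a hypothetical name.

```python
import base64
import json


class BytesFriendlyEncoder(json.JSONEncoder):
    """JSON encoder that can serialize bytes values from SQLite BLOB columns."""

    def default(self, obj):
        if isinstance(obj, bytes):
            try:
                # Bytes that happen to be valid UTF-8 are returned as text.
                return obj.decode("utf-8")
            except UnicodeDecodeError:
                # Anything else is wrapped in a base64 envelope.
                return {
                    "$base64": True,
                    "encoded": base64.b64encode(obj).decode("ascii"),
                }
        # Defer to the base class, which raises TypeError for unknown types.
        return super().default(obj)
```

Usage is json.dumps(rows, cls=BytesFriendlyEncoder). Decoding valid UTF-8 as text is lossy in the sense that a client cannot tell it was originally a BLOB; always using the base64 envelope would avoid that at the cost of readability.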

Protect against malicious SQL that causes damage even though our DB is immutable

I’m currently operating under the assumption that it’s safe to allow arbitrary SQL statements because we are dealing with an immutable database. But this might not be the case - there are some pretty weird SQLite language extensions (ATTACH, PRAGMA etc) and I’m not certain they cannot be used to break things in a way that would affect future requests to the API.

Solution: provide a “safe mode” option which disables the ?sql= mechanism. This still leaves the URL filter lookups, so I need to make sure that those are “safe”.

In the future I may also implement a whitelist option where datasets can be configured to only allow specific filters against specific columns.
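
A complementary layer of defence, sketched below, is to open the file read-only at the SQLite level and reject anything that does not start with SELECT. This is an assumption about one possible approach, not a complete answer to the concern above: the prefix check is crude, and connect_read_only and run_select are hypothetical names.

```python
import sqlite3
from urllib.parse import quote


def connect_read_only(path):
    """Open a SQLite file using a URI with mode=ro so writes are refused."""
    # quote() escapes characters such as "?" and "#" that are special in URIs.
    return sqlite3.connect(f"file:{quote(path)}?mode=ro", uri=True)


def run_select(conn, sql, params=()):
    """Execute sql only if it looks like a SELECT statement."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("Only SELECT statements are allowed")
    return conn.execute(sql, params).fetchall()
```

With a read-only connection, an attempted INSERT or UPDATE raises sqlite3.OperationalError, which the application could catch and turn into a friendlier error response.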

Use locust for benchmarking and load tests

https://github.com/locustio/locust

Needed for #32

Attempting an INSERT or UPDATE should return a sane error message

?_group_count=country - return counts by specific column(s)

Imagine if this:

https://stateless-datasets-jykibytogk.now.sh/flights-07d1283/airports.jsono?country__contains=gu&_group_count=country

Turned into this:

https://stateless-datasets-jykibytogk.now.sh/flights-07d1283?sql=select%20country,%20count(*)%20as%20group_count_country%20from%20airports%20where%20country%20like%20%27%gu%%27%20group%20by%20country%20order%20by%20group_count_country%20desc

This would involve introducing a new precedent of query string arguments that start with an _ having special meanings. While we're at it, could try adding _fields=x,y,z

Tasks:

Get initial version working
Refactor code to not just "pretend to be a view"
Get foreign key relationships expanded
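
The rewrite described above, from a _group_count argument to a GROUP BY query, could be generated along these lines. group_count_sql is a hypothetical helper; unlike the example URL it quotes identifiers and leaves filter values as placeholders.

```python
import sqlite3


def _quote(identifier):
    # Double any embedded double quotes so the identifier is quoted safely.
    return '"' + identifier.replace('"', '""') + '"'


def group_count_sql(table, column, where=None):
    """Build a 'count rows per distinct value' query for one column."""
    alias = _quote("group_count_" + column)
    sql = f"select {_quote(column)}, count(*) as {alias} from {_quote(table)}"
    if where:
        # where is a pre-built clause using ? placeholders for its values.
        sql += f" where {where}"
    sql += f" group by {_quote(column)} order by {alias} desc"
    return sql
```

The where argument is expected to come from the existing filter handling (for example country__contains=gu becoming a LIKE clause), with its values passed separately as query parameters.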

Datasette Plugins

It would be neat if additional functionality could be opted-in to the system in the form of easy-to-add plugins, hosted as separate packages. First example: a Google Analytics plugin, which adds GA tracking code with your tracking ID to the web interface for your dataset.

This may be an opportunity to experiment with entry points: http://amir.rachum.com/blog/2017/07/28/python-entry-points/
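
As a sketch of what entry-point discovery might look like with the standard library's importlib.metadata (Python 3.8+): the group name "datasette_example_plugins" is invented for illustration, and discover_plugins is a hypothetical function.

```python
from importlib import metadata


def discover_plugins(group="datasette_example_plugins"):
    """Return {name: entry_point} for installed packages in the given group."""
    eps = metadata.entry_points()
    if hasattr(eps, "select"):
        # Newer Python versions return an object with a select() method.
        selected = eps.select(group=group)
    else:
        # Older versions return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}
```

A plugin package would advertise itself under that group in its packaging metadata, and the host application would call ep.load() on each discovered entry point to import the plugin object.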

In development mode, should still pick up new .db files

Follow on from #11

Use Sanic configuration mechanism

http://sanic.readthedocs.io/en/latest/sanic/config.html

Homepage should show summary of databases

Each database should have a name, optional description, download link and a summary of the tables:

Flights.db
Flights and suchlike blah.
URL? License?
577373 rows across 14 tables
airports, routes, airlines...

Title of the homepage is derived from the databases or can be manually overridden, e.g. “Datasets of Flights, NHS, Blah...” - or if there is only one database, just the title of that.

Make URLs immutable

Absolutely everything should have a far-future expires header

Part of the URL will be the truncated sha1 hash of the database file itself, calculated at build time

Implement sensible query pagination

Command line tool for uploading one or more DBs to Now

Uploading files appears to be undocumented, but I found it in their code here: https://github.com/zeit/now-cli/blob/0ca7d1fe44ebdf460b64fdc38ba543b8e295ac40/src/providers/sh/util/index.js#L291

Endpoint that returns SQL ready to be piped into DB

It would be cool if I could figure out a way to generate both the create table statements and the inserts for an individual table or the entire database and then stream them down to the client.
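
For the whole-database case, Python's sqlite3 module already offers Connection.iterdump(), which yields the CREATE TABLE and INSERT statements one at a time. A sketch of wrapping it as a generator suitable for streaming (stream_sql_dump is a hypothetical name; per-table dumps are not covered here):

```python
import sqlite3


def stream_sql_dump(path):
    """Yield the SQL statements that recreate the database, one per chunk."""
    conn = sqlite3.connect(path)
    try:
        # iterdump() produces statements lazily, so the full dump never
        # has to be held in memory at once.
        for statement in conn.iterdump():
            yield statement + "\n"
    finally:
        conn.close()
```

The output can be piped straight into the sqlite3 command-line tool or passed to executescript() on another connection.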

Support CSV export with a .csv extension

Maybe do this using streaming with multiple pagination SQL queries so we can support arbitrarily large exports.

How would this work against a view which doesn’t have an obvious efficient pagination mechanism? Maybe limit views to up to 1000 exported records?

Relates to #5
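
A sketch of the "multiple pagination queries" approach for tables, paging by rowid so each query starts where the previous one stopped. It assumes an ordinary rowid table with positive rowids and does not address views; stream_csv and the page size are illustrative.

```python
import csv
import io
import sqlite3


def _to_csv(rows):
    # Render a batch of rows as CSV text using the csv module's quoting rules.
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


def stream_csv(conn, table, page_size=100):
    """Yield CSV text for a table: one header chunk, then one chunk per page."""
    quoted = '"' + table.replace('"', '""') + '"'
    cursor = conn.execute(f"select rowid, * from {quoted} limit 0")
    # Drop the leading rowid column, which is only used for pagination.
    headers = [d[0] for d in cursor.description][1:]
    yield _to_csv([headers])

    last_rowid = 0  # assumes rowids are positive
    while True:
        rows = conn.execute(
            f"select rowid, * from {quoted} where rowid > ? order by rowid limit ?",
            (last_rowid, page_size),
        ).fetchall()
        if not rows:
            break
        yield _to_csv(row[1:] for row in rows)
        last_rowid = rows[-1][0]
```

Because each page is fetched with "rowid > ?" rather than OFFSET, the cost per page stays roughly constant however deep into the table the export gets.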

Set up Travis

Do something neat with foreign keys

https://www.sqlite.org/pragma.html#pragma_foreign_key_list

SQLite has robust support for introspecting foreign keys. I could use that to automatically link to the corresponding record from my tables.
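
The introspection itself is a single PRAGMA call. A sketch, with foreign_keys as a hypothetical helper and the returned dictionary keys chosen for illustration:

```python
import sqlite3


def foreign_keys(conn, table):
    """Return the foreign key relationships declared on a table."""
    quoted = '"' + table.replace('"', '""') + '"'
    rows = conn.execute(f"PRAGMA foreign_key_list({quoted})").fetchall()
    # Row layout: (id, seq, table, from, to, on_update, on_delete, match)
    return [
        {"column": row[3], "other_table": row[2], "other_column": row[4]}
        for row in rows
    ]
```

With that mapping, a value in the listed column can be rendered as a link to the row in other_table whose other_column matches it.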

See if I can get a websockets interface working

Since I am already running on Sanic, how hard would it be to add a websocket endpoint that lets you talk to SQLite interactively?

Could this be used to efficiently support streaming in answers to giant queries?

Better JSON response options

Default returns this:

{
    "Columns": ["id", "name", "age"],
    "Rows": [
        [45, "Simon", 36]
    ]
}

.jsono instead returns a list of objects each duplicating the headers in its keys.

They both probably share the same pagination mechanism so it might not be a jsono flat list.
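
The difference between the two shapes is small enough to show in a couple of lines. In this sketch rows_to_objects is a hypothetical helper that converts the default columns-and-rows form into the list-of-objects form:

```python
def rows_to_objects(columns, rows):
    """Convert parallel columns/rows lists into a list of dicts."""
    # zip pairs each column name with the value at the same position.
    return [dict(zip(columns, row)) for row in rows]
```

For the example above, rows_to_objects(["id", "name", "age"], [[45, "Simon", 36]]) gives a one-element list containing an object with id, name and age keys.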

Dockerfile should build more recent SQLite with FTS5 and spatialite support

The SQLite bundled with Python 3 doesn't support the FTS5 search extension. It would be nice if the SQLite built by our Dockerfile could support as many modern SQLite features as possible.

https://web.archive.org/web/20170212034155/http://charlesleifer.com/blog/using-the-sqlite-json1-and-fts5-extensions-with-python/ has instructions on building a more recent SQLite and the pysqlite package. Our Dockerfile could carry out an updated version of this process.

Pick a name

Options so far:

immutabase
datasite
sqlstatic
dbserve
sqlserve

Terms to play with:

immutable
sqlite
dataset
json
static
serve

Looks OK
Works great on mobile
Loads extremely fast
No JavaScript! At least not in v1.

Support multiple databases

I'm going to loop through every database file in the app root directory and bundle all of them.

Each one will be accessible at /databasename

Note this is without the file extension, and we will disallow multiple files with the same name but different extensions.

Supported extensions to start with will be .db and .sqlite and .sqlite3
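
A sketch of that directory scan, including the "same name, different extension" check. find_databases is a hypothetical function name; the extension list comes from the line above.

```python
import os

SUPPORTED_EXTENSIONS = {".db", ".sqlite", ".sqlite3"}


def find_databases(root):
    """Map database names (file name without extension) to file paths."""
    found = {}
    for filename in sorted(os.listdir(root)):
        name, ext = os.path.splitext(filename)
        if ext not in SUPPORTED_EXTENSIONS:
            continue
        if name in found:
            # Two files such as flights.db and flights.sqlite would both
            # want to be served at /flights, so refuse to continue.
            raise ValueError(f"duplicate database name: {name}")
        found[name] = os.path.join(root, filename)
    return found
```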

Make individual column values addressable, with smart content types

Some SQLite databases embed images in columns. It would be cool if these had URLs.

/database-name-7sha256/table-name/compound-pk/column
/database-name-7sha256/table-name/compound-pk/column.json
/database-name-7sha256/table-name/compound-pk/column.png
/database-name-7sha256/table-name/compound-pk/column.gif
/database-name-7sha256/table-name/compound-pk/column.txt

The one without an explicit file extension auto-detects the correct extension.
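
Auto-detection could start from the well-known file signatures for PNG and GIF. The sketch below is an assumption about one way to do it: guess_extension is a hypothetical name, and falling back to "json" for unrecognised values is an arbitrary choice.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
GIF_SIGNATURES = (b"GIF87a", b"GIF89a")


def guess_extension(value):
    """Guess which extension a column value should be served with."""
    if isinstance(value, bytes):
        # Binary values are sniffed by their leading magic bytes.
        if value.startswith(PNG_SIGNATURE):
            return "png"
        if value.startswith(GIF_SIGNATURES):
            return "gif"
    elif isinstance(value, str):
        return "txt"
    # Numbers, NULLs and unrecognised binary data fall back to JSON.
    return "json"
```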