
Shovel

You version-control your code; you should version-control your datasets too. Many data science workflows can be broken down into working on three stages of data:

  • "input": the dataset as provided to you, a query against Redshift, a query against Postgres, a query against your favourite API...
  • "working": various transformations that you do.
  • "output": various results, such as the accuracy of an ML algorithm on this dataset, summary graphs, etc.

The principle of shovel is to help store and version your "input"; combined with versioned code, all of your results become reproducible. How you manage your "working" and your "output" is out of scope, and up to you. This is the first major goal of shovel: making it easier to reproduce results in the future.

The second major goal is to store our datasets centrally (on S3 for now), so that everyone may access everything. This is good for collaboration. This is also good for organising our datasets, and for backing them up.

Installation

To install,

python setup.py install

(For development, python setup.py develop works.)

If you want to install directly from git, use:

pip install git+https://github.com/lyst/shovel.git#egg=shovel

Shovel reads its config from the environment. As a minimum, you need the following environment variables defined:

  • AWS_ACCESS_KEY_ID - for boto
  • AWS_SECRET_ACCESS_KEY - for boto
  • SHOVEL_DEFAULT_BUCKET - the bucket in which to store your data

In addition:

  • SHOVEL_DEFAULT_ROOT (default: bottomless-pit) - the root prefix for the default Pit your data will be stored in (shovel will always include this prefix when writing to your bucket).
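A minimal environment setup might look like the following; the bucket name and key values are placeholders, not real credentials:

```shell
# Credentials used by boto (placeholders - substitute your own)
export AWS_ACCESS_KEY_ID="your-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
# Bucket and root prefix for your Pit; "my-data-bucket" is a placeholder
export SHOVEL_DEFAULT_BUCKET="my-data-bucket"
export SHOVEL_DEFAULT_ROOT="bottomless-pit"   # optional; this is the default
```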

Fetching datasets from your Pit

shovel requires that datasets live in a namespace PROJECT/DATASET/VERSION.

  • PROJECT is the top-level project a dataset belongs to, e.g. google-ngrams...
  • DATASET is the name of a particular dataset within the project, e.g. eng-all-20120701.
  • VERSION is the version number and should be in the format f"v{n}" for an integer n (e.g. v0, v1). It is intended to be bumped if errors are found in the dataset and it needs updating. It should always make sense to re-run analyses on the latest version of the dataset.
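As an illustration of this layout (not shovel's actual implementation), the S3 key prefix for a dataset can be assembled like so; `key_prefix` is a hypothetical helper:

```python
def key_prefix(root, project, dataset, version):
    """Build the S3 key prefix ROOT/PROJECT/DATASET/vN for a dataset.

    `version` is an integer and is rendered as f"v{version}".
    """
    return f"{root}/{project}/{dataset}/v{version}"

# The google-ngrams example used elsewhere in this README:
print(key_prefix("bottomless-pit", "google-ngrams", "eng-all-20120701", 0))
# bottomless-pit/google-ngrams/eng-all-20120701/v0
```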

You should consider using a pre-existing dataset over creating a new one, if an appropriate one exists.

Using the shovel command-line tool, fetch existing datasets with

shovel dig <LOCAL_DIRECTORY> <PROJECT> <DATASET> <VERSION>

to fetch the dataset into LOCAL_DIRECTORY. For example:

shovel dig ~/google-ngrams/english2012 google-ngrams eng-all-20120701 v0

Or from python

from shovel import dig

dig('~/google-ngrams/english2012', 'google-ngrams', 'eng-all-20120701', 'v0')
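A dig-style fetch essentially downloads every object under the dataset's prefix, mapping each S3 key to a path under the local directory. A sketch of that mapping (`local_path` is a hypothetical helper, not part of shovel's API):

```python
import os

def local_path(key, prefix, dest_dir):
    """Map an S3 key under `prefix` to a path inside `dest_dir`.

    e.g. key "root/proj/ds/v0/data/part-0.csv" with prefix
    "root/proj/ds/v0" lands at "<dest_dir>/data/part-0.csv".
    """
    rel = key[len(prefix):].lstrip("/")
    return os.path.join(os.path.expanduser(dest_dir), rel)
```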

Preparing and pushing datasets to S3

Push a local directory containing a dataset to S3 with

shovel bury ~/google-ngrams/english2012 google-ngrams eng-all-20120701 v0

Or from python

from shovel import bury

bury('~/google-ngrams/english2012', 'google-ngrams', 'eng-all-20120701', 'v0')

bury will fail if the version already exists.
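Because a buried version is immutable, publishing a fix means bumping the version number. A small sketch of picking the next version from the existing ones (`next_version` is a hypothetical helper, not part of shovel):

```python
def next_version(existing):
    """Given existing version strings like ["v0", "v1"], return the next one."""
    if not existing:
        return "v0"
    return f"v{max(int(v.lstrip('v')) for v in existing) + 1}"

print(next_version(["v0", "v1"]))  # v2
```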


shovel's People

Contributors

calvingiles


shovel's Issues

Project Name Collision - Requirement already satisfied

There's already a python project called shovel (currently >600 stars on GitHub)

https://github.com/seomoz/shovel

Which causes a problem for pip install and module imports

pip install git+https://github.com/lyst/shovel.git#egg=shovel
Requirement already satisfied: shovel from git+https://github.com/lyst/shovel.git#egg=shovel

Potential Solutions

https://www.python.org/dev/peps/pep-0423/

  • Rename the package and project from shovel to lyst.shovel
  • Come up with another name that doesn't already correspond to a package in PyPI

Idea: model shovel more closely on git

Git LFS has some nice properties, but doesn't really map well to large datasets used for analysis. A git model of checking in all resources is good for reproducibility, but it is nice to separate the data from the code.

A proposed future direction for shovel is to support shovel <git command>, where shovel intercepts some commands and swaps a bunch of behaviours out. These could likely be done using git hooks, so it may be possible to init those and then use git directly.

One benefit of the shovel model over LFS is that it lets you version datasets separately from a git repo and share them across multiple repos. In that sense, the git hooks would need to inspect the state of the filesystem and manage the dig and bury steps of shovel as part of the hooks.

These thoughts are very undeveloped.

Add ignore_exists argument to bury

It is common to want to re-run a notebook that buries data. It should be possible to leave the command in place and have it not error if it has been done before.
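Until such an argument exists, a wrapper gives the same effect. In the sketch below, `exists` and `bury` are injected stand-ins for a hypothetical existence check and the real bury call, so the sketch stays runnable:

```python
def bury_once(exists, bury, local_dir, project, dataset, version):
    """Call bury only if the version is not already on S3.

    `exists(project, dataset, version)` and `bury(...)` are injected
    callables; in practice they would hit S3 / call shovel.bury.
    """
    if exists(project, dataset, version):
        return False  # already buried; re-running is a no-op
    bury(local_dir, project, dataset, version)
    return True
```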

Read dataset params from a `.shovel` file in the dataset root

To support better git status versioning etc., the actual version of a dataset should live in a `.shovel` file in the root directory. This allows peek to report the version the file system thinks it has, and bury to update the file in a way that shows up in git status.

This is in response to #20
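One way to realise this would be to store the params as a small JSON document; the file format below is an assumption, not a decided design:

```python
import json
import os

def write_shovel_file(root_dir, project, dataset, version):
    """Write dataset params to ROOT_DIR/.shovel so git can track them."""
    path = os.path.join(root_dir, ".shovel")
    with open(path, "w") as f:
        json.dump({"project": project, "dataset": dataset, "version": version}, f)
    return path

def read_shovel_file(root_dir):
    """Read the dataset params back from ROOT_DIR/.shovel."""
    with open(os.path.join(root_dir, ".shovel")) as f:
        return json.load(f)
```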
