quiltdata / quilt Goto Github PK

Quilt is a data mesh for connecting people with actionable data

License: Apache License 2.0

Python 16.46% HTML 0.10% JavaScript 9.12% Dockerfile 0.14% Makefile 0.01% Jupyter Notebook 38.64% Shell 0.09% Less 0.01% TypeScript 35.25% Jinja 0.18%

data data-engineering data-version-control data-versioning python serialization parquet

quilt's Introduction

Quilt is a data mesh for connecting people with actionable data

Python Quick start, tutorials

If you have Python and an S3 bucket, you're ready to create versioned datasets with Quilt. Visit the Quilt docs for installation instructions, a quick start, and more.

Quilt in action

open.quiltdata.com is a petabyte-scale open data portal that runs on Quilt
quiltdata.com includes case studies, use cases, videos, and instructions on how to run a private Quilt instance
Versioning data and models for rapid experimentation in machine learning shows how to use Quilt for real world projects

Who is Quilt for?

Quilt is for data-driven teams and offers features for coders (data scientists, data engineers, developers) and business users alike.

What does Quilt do?

Quilt manages data like code so that teams in machine learning, biotech, and analytics can experiment faster, build smarter models, and recover from errors.

How does Quilt work?

Quilt consists of a Python client, web catalog, lambda functions—all of which are open source—plus a suite of backend services and Docker containers orchestrated by CloudFormation.

The backend services are available under a paid license on quiltdata.com.

Use cases

Share data at scale. Quilt wraps AWS S3 to add simple URLs, web preview for large files, and sharing via email address (no need to create an IAM role).
Understand data better through inline documentation (Jupyter notebooks, markdown) and visualizations (Vega, Vega Lite)
Discover related data by indexing objects in ElasticSearch
Model data by providing a home for large data and models that don't fit in git, and by providing immutable versions for objects and data sets (a.k.a. "Quilt Packages")
Decide by broadening data access within the organization and supporting the documentation of decision processes through audit-able versioning and inline documentation

quilt's People

Contributors

Stargazers

Watchers

Forkers

vdt joannateo vanzaj stimuland elpidabantra stillmatic manzo1991 stethd diwu1989 eode aaronksalmon ouwen meffij nl0 mhassan102 kurlov hsaputra annewallace cscg elgalu afcarl michalaq affineparameter codeaudit yuanjie-ai jbn thbeh priya-gittest diethardsteiner daclarke c-m-hunt ellisonbg knaaptime remilapeyre ploncker ogfunkycold stjordanis sdileep agiza rc-ontruck cloudtwter sampathweb rsyi battyone allencellmodeling sanketsaurav cosmic-byte keshava ishaansharma nathandemaria viveklak muthuuk knut0815 nmra-sreddy neumoratx nguyen-cao laeeth ltirrell fiskus sir-sigurd donovanr ostyk zkan prime-entropy rekon admariner afiqmuzaffar pecigonzalo datalearns hongooi73 akarve muschellij2 bearerpipelinetest eodenyire johnaffolter flying-sheep camilovelezr erdal-pb marcodlk quiltsimon sreenathachakala poplar79 newbision genostack datanadi shabss datnoor georgewshen brunoscaglione

quilt's Issues

Package variables are not identified

I am trying to import the package from the link. But its not accessing the variables in sat_6_full further. Its a data node. I am trying it as

from quilt.data.kmoore import sat4_6_airborne
sat4_6_airborne.sat_6_full.train_x()

Programmatic remove from Local

No way currently to remove a set of packages.

Current process:

$ quilt ls
...
$ quilt rm usr/pkg
$ quilt rm usr/pkg

Would like the ability to remove where condition matches regex:

$ quilt rm ^usr/*

Or some other programmatic interface for removal. Not necessarily regex.

Undeclared Pandas dependency (?) broke any usage of the CLI

Using pip-compile to generate my requirements.txt resulted in pandas==0.19.2 for my project.

Trying to start up quilt's CLI resulted in:

(sourceress) sourceress jason/model-info-amount: quilt
Traceback (most recent call last):
  File "/Users/jasonbenn/.pyenv/versions/sourceress/bin/quilt", line 7, in <module>
    from quilt.tools.main import main
  File "/Users/jasonbenn/.pyenv/versions/3.6.3/envs/sourceress/lib/python3.6/site-packages/quilt/__init__.py", line 83, in <module>
    from .tools.command import (
  File "/Users/jasonbenn/.pyenv/versions/3.6.3/envs/sourceress/lib/python3.6/site-packages/quilt/tools/command.py", line 33, in <module>
    from .build import (build_package, build_package_from_contents, generate_build_file,
  File "/Users/jasonbenn/.pyenv/versions/3.6.3/envs/sourceress/lib/python3.6/site-packages/quilt/tools/build.py", line 11, in <module>
    from pandas.errors import ParserError
ModuleNotFoundError: No module named 'pandas.errors'

This pandas.errors module exists in 0.22.0 but not 0.19.2.

Where is this project's setup.py or requirements.txt?

Summary Stats for Packages

A majority of my packages are quite large so the largest desire is for there to be at least an indication of package size before install on the Quilt registry page.

If this is the current package stats area:

Latest update

date
    1/10/2018, 2:07:10 PM
author
    @aicsjackson
version
    2aed0783417118de00913d2ffe7be9968c881eb990993d6f5fdb957d0f49354

Would be nice to also see items like:

Latest update

date
    1/10/2018, 2:07:10 PM
author
    @aicsjackson
version
    2aed0783417118de00913d2ffe7be9968c881eb990993d6f5fdb957d0f49354
size
    200 GB
files
    tiff | 400
    json | 1600

Would allow for better insight into the package contents prior to downloading.

Searching Doesn't Allow Term Chaining

The current implementation of searching, at least from what I can tell, doesn't allow term chaining.

If I search for myosin, I get the following results back:

These are correct results, however if I wanted to search for differing sets of data with a single search, I cannot. Example being: maybe I am interested in both myosin and junctions so I search myosin junctions (note these terms don't really relate, but results for either myosin or junctions should return in my opinion)

I am returned results that only contain them both, not the individual lines.

To my understanding this is because all search terms must not only all be in the README, node names, package name, etc. But all the search terms must be in the direct order they are given in.

As seen here. I know myosin is in the aics/random_sample package, so I search for random myosin.

random is in the name of the package, and myosin is labeled as a keyword in the README, but because they are not directly ordered like the search, the package aics/random_sample is not returned.

I haven't looked into ts_vector too much but I would believe that ts_vector is per-word not per-document. In which case I think this is an issue of breaking down the search into individual terms?

System-wide local storage

I have a multi-user Linux server that I will be using for a Data Science class. I would like to use quilt to distribute datasets with Jupyter Notebooks to students. But I don't want all students to download their own copies of the data when using quilt. I see that quilt uses appdirs.user_data_dir() to get the directory to use, and that I can set XDG_DATA_HOME to override that location:

https://github.com/ActiveState/appdirs/blob/master/appdirs.py#L92

Will this break quilt?
Will this give me the optimization of each unique dataset only being downloaded a single time for all users?
How will this affect users creating and pushing datasets?

Thanks!

Make All Children Nodes Subscriptable

Currently GroupNodes and DataNodes are callable only by attribute. With an executable function attached if you want the actual data.

from quilt.data.aics import prod

# retrieves the image load node
print(prod.fovs.unique_fov.image.load)

# retrieves the image load node file path
print(prod.fovs.unique_fov.image.load())

This current implementation is great. However it would also be nice to traverse nodes as dict-like.

from quilt.data.aics import prod

# retrieves the image load node
print(prod['fovs']['unique_fov']['image']['load'])

# retrieves the image load node file path
print(prod['fovs']['unique_fov']['image']['load']())

Pretty Printing Group Nodes

Currently, when I: print(pkg.group_node)
I am returned something like this:

<GroupNode>
node_0
node_1
node_2
node_3
node_4
node_5
node_6
node_7
node_8
node_9

Where every sub-node is printed.

However, for packages like mine, it isn't uncommon to have hundreds of nodes nested under each other and thus this printing structure isn't the best for viewing.

Two options in my mind:

When I print a group node, have it print a structure like that of the contents section of the webpage preview of a package.
print(pkg.group_node)
Would return:

<GroupNode>
node_0
   subnode_0
      subsubnode_0
      subsubnode_1
   subnode_1
      subsubnode_0
      subsubnode_1
   ...
node_1
   subnode_0
      subsubnode_0
      subsubnode_1
   subnode_1
      subsubnode_0
      subsubnode_1
   ...
...

Or have it print first n and last n subnodes of the node.
print(pkg.group_node)
Would return:

<GroupNode>
node_0
node_1
node_2
...
node_n-2
node_n-1
node_n

Maybe implement both and have a quilt.set_prints('show_depth') / quilt.set_prints('show_length') to indicate which printing method the user wants.

GitHub login

Per multiple requests from HN, Twitter

Linked Data (RDFa, JSON-LD)

These datasets would be even more discoverable if at least their metadata were described with Linked Data.

Here's an example of adding RDFa to an HTML page (and links to background info and validators for Schema.org, OpenGraph, and Twitter Cards structured data):
https://github.com/CodeForAntarctica/codeforantarctica.github.io/pull/3

There's a lot of linked open data. Here's a view of the LODCloud:

http://lod-cloud.net

This should maybe be another issue.

If we have URIs for column names, datatypes, and physical units (also accuracy and precision), we can find and compare and maybe even concatenate datasets.

subject: http://dataset_uri/#columnname
name/label: "Column Name"@en
datatypes: XSD URIs
physical units:
- QUDT
- https://schema.org/StructuredValue

I think what is needed are dataset headers that map this metadata onto columns in existing formats. CSVW (CSV on the Web) is built on the Tabular Data Model; which specify how this metadata can be expressed as JSONLD (and thus any RDF representation).

SPARQL is one query language for linked data.

I wrote up a bunch of this background information in pandas-dev/pandas#3402 "ENH: Linked Datasets (RDF)". And mentioned the opportunity for CSVW-like headers in Arrow and Parquet to @wesm on Twitter awhile back.

Dataset metadata on dataset HTML pages
Dataset column metadata across the ecosystem

Self-Hosted Private Remotes

Perhaps I missed it but is there/will there be functionality for pushing and installing from self-hosted private data registries?

Inspect a particular version

Could be useful to be able to inspect the contents of a package at a particular point in time, say by passing the hash:

quilt inspect [USER]/[PKG] --hash [HASH]

More concretely:

I have a bunch of versions of the same package

hunan/data                                       2c2caeb3
hunan/data                                       4f7dec20
hunan/data                                       6aa0f6e2
hunan/data                                       7860bf0f
hunan/data                                       a07a0847
hunan/data                  latest               0faf1a74

And would like to inspect the contents of say 4f7dec20.

Thanks

API interaction system adjustment

The Problem

Our current API interaction system is too verbose and not very consistent.
Basically, we just call fetch and pass all the options manually, which is tedious and error-prone.
Also, we need to intercept 401s and log user out automatically.

Possible Solution

A possible solution is to abstract API interaction into a "service" (injected by the provider component configured via props / context) with convenient interface (HOC that accepts some sort of request mappings / queries).

Design goals

1. Declarative configuration

For ease of use and minimal amount of boilerplate, the configuration should be declarative, in a form of passing some sort of request mappings / queries to a HOC constructor.

2. Consistency

We need a consistent way to make requests and handle responses / errors.

3. Centralised request / response handling

We need a way to hook into request and response flow in a single place to inject custom processing (auth hooks, json parsing, header injection, etc).

Implementation options

We have, basically, 3 different options:

a wrapper around react-refetch;
homegrown redux service / provider for abstracting the calls in some way;
GraphQL / Apollo

1. `react-refetch`

Quote:

A simple, declarative, and composable way to fetch data for React components

Example implementation

API connector (provider component and consumer HOC):

export const Provider = composeComponent('APIProvider',
  // get config from props and put it into context
);

export default (mappings) => composeHOC('withAPI',
  // ... inject config from context
  refetch.defaults({
    buildRequest: ({ url, headers, ...options }) => {
      // get config from props
      const { base } = ...;
      // get auth headers
      const authHeaders = ...;
      return new Request(urlJoin(base, url), { headers: { ...authHeaders, headers }, ...options });
    },
    handleResponse: async (resp)  => {
      // handle authentication loss
      if (resp.status === 401) {
        // access `dispatch` in some way
        dispatch(authLost());
        throw new NotAuthenticated();
      }
      // maybe some json processing and another stuff from current `utils/request`
      return resp;
    },
  })(mappings),
);

Usage (Provider should be present somewhere up the hierarchy when rendering this component):

composeComponent('Profile',
  // declarative config, can be factored out if necessary
  withAPI((props) => ({
    packages:  `/profile`,
    updatePayment: () => ({
      updatePaymentResponse: {
        url: `/payment/`,
        method: 'POST',
        body: formData,
      },
    }),
  })),
  ({ packages, updatePayment, updatePaymentResponse }) => (
    <div>
      // quite convenient response state handling
      {renderPromiseState(packages, {
        pending: () => <Loading />,
        rejected: (reason) => <Error error={reason} />,
        fulfilled: (value) => <Packages packages={value} />,
      })}
      <button onClick={updatePayment}>Update payment info</button>
      {renderPromiseState(updatePaymentResponse, {
        pending: () => <p>updating...</p>,
        rejected: (reason) => <p>error updating payment info: {reason}</p>,
        fulfilled: (value) => <p>payment info updated</p>,
      })}
    </div>
  ));

Pros

quite small investment
familiarity (http requests look more familiar than QraphQL queries)

Cons

the library API seems a little inflexible, so it'll require some workarounds to wire things up properly
requests state is kept outside of redux (doesn't seem like a serious problem, but our architecture is based upon redux ecosystem, so it would be nice to use it whenever possible)
the data is not structured / normalized properly

2. Homegrown redux service / provider

Implementation

We'll need a provider that runs the saga that issues requests and handles responses (with reducer that holds the requests state).

Also, we'll need a consumer HOC which

selects proper slice of state and maps it to props based on config
binds component lifecycle to request (re)fetching
binds request functions to component props and arguments

Basically, we'd write our own opinionated version of redux-refetch or something similar.

Pros

we can do anything we want in any way we want
everything will be within our redux-based framework

Cons

large investment

Doesn't seem like a viable option given there are established feature-rich 3rd-party tools that can be used for this.

3. GraphQL / Apollo

GraphQL is a model querying language and Apollo is a set of technologies for working with GraphQL all across the stack (includes client and server tools).

Pros

result is a properly structured and documented API with consistent type system and data normalization capabilities
Apollo stack is mature and very flexible
Apollo client uses redux, so its state and transitions are more or less transparent

Cons

quite large investment
unfamiliarity (compared to http requests)
more complex than the other options

Implementation

To implement a GraphQL API we need several things:

a GraphQL schema of the API;
a configured Apollo provider that connects the client to the API;
processing GraphQL language (queries and possibly schema) by our webpack setup;
a GraphQL API itself, implemented as one of these:
1. a client-side executable schema (considered quite heavyweight, the query execution environment adds ~200kB to the bundle size);
2. an API gateway server (probably using nodejs), which issues requests to the current REST API under the hood;
3. a GraphQL endpoint in the current flask-based registry server;

Connecting component to data looks like this:

// import query from a file
import query from './query.graphql';
// or use inline query
const query = gql`
  query Component__data {
    package { owner name }
  }
`;

composeComponent('ComponentWithData',
  graphql(query),
  ({ data }) => (
    <div>
      {renderData(data, {
        loading: () => <p>loading</p>,
        error: (error) => <p>error: {error}</p>,
        data: ({ package: { name, owner } }) => <p>package: {owner}/{name}</p>,
      })}
    </div>
  ));

1. Schema

Looks like this:

# quilt.graphql

type Package {
  # ...
}

# other types

type Query {
  package(owner: String!, name: String!): Package
  packages(query: String): [Package!]!
  # ...
}

type Mutation {
  deletePackage(owner: String!, name: String!): DeletePackageResult!
  # ...
  signIn(email: String!, password: String!): SignInResult!
  signUp(...): SignUpResult!
}

2. Apollo provider

Nothing unusual, just configure an Apollo client and inject it using ApolloProvider.

3. GraphQL processing

We'd need to use GraphQL loader for webpack to be able to store queries and fragments in separate files.

4. GraphQL API

4.1. Client-side executable schema

To implement executable schema on the client-side, we'd need to provide a resolver map looking like this:

{
  Query: {
    package: (root, params, context) => {
      // make a request and process the data
    },
    packages: (root, params, context) => {
      // make a request and process the data
    },
    // ...
}

The execution environment is quite heavyweight, so this solution should be considered temporary.
The resolvers logic can be easily moved to the server-side when implementing nodejs-based GraphQL API gateway.

Pros:

smaller investment than a server-side API gateway: doesn't require backend changes and new infrastructure for running the gateway server

Cons:

considerable runtime overhead
GraphQL API is not exposed, so it cannot be consumed by arbitrary GraphQL client

4.2. GraphQL gateway over existing REST API

This one is quite straightforward. Suggested stack: koa or express + apollo-server / apollo-tools. The implementation is similar to the client-side solution (writing resolvers for schema that fetch and transform data), but the facade is a separate server that listens for GraphQL requests (via http or ws), makes http requests to the REST API and then responds with collected and transformed data.

Pros:

better performance and smaller overhead compared to the client-side solution
exposed API can be consumed by any GraphQL client
better DX: something like graphiql can be used to explore the API

Cons:

larger investment (infrastructure adjustment required to run a new server)
overhead is larger compared to the native solution (see next section)

4.3. GraphQL endpoint in the current flask-based registry server

Python ecosystem has decent GraphQL support, so we can serve the GraphQL stuff right from our flask app, querying the data directly using SQLA and avoiding unnecessary overhead related to REST requests.

Pros:

best performance and least overhead
smaller investment than the previous option (infrastructure adjustments not required, just add a new endpoint(s) to the existing backend)
all the rest from the previous option

Cons:

possible performance degradation of the current registry functions due to added complexity of GraphQL handling and possible deeply nested queries
larger investment than the client-side option

Alternative (quick-fix style) solution

Alternatively, we can adjust our request functions (utils/request) to inject headers and handle 401s in some way. It is quite quick and easy to implement, but it doesn't solve the problem as stated here in its entirety, just its auth handling aspect. Also, this doesn't seem too good architecture-wise. Nevertheless, it's a viable short-term solution.

`quilt rm` appears to succeed on non-existent package

Group Nodes as Iterators

Currently when I want to loop through a list of nodes contained in a parent node I have to do something like this:

for node_name in pkg.group_node._group_keys():
   actual_node = getattr(pkg.group_node, node_name)
   # do actual work with the node here

A nice touch would be to make GroupNode, and I guess PackageNode as well, work like iterators and when I want to loop through something it gives me the node instead of my having to do a roundabout getattr.

for node in pkg.group_node:
   # do work

Support messages on `quilt push`

Display in quilt log
Display on package landing page

ODBC connector to Quilt packages

This will allow Quilt to connecto Java, Excel, and other environments.

CC @hadim per #632.

Package Creator Detailed Stats

Great work on the summary stats for packages!

I also realized it would be neat to see more detailed insight into my own package data. On the user settings page, a user can see all of their packages in a list. If next to each package name there were details on number of downloads, maybe some sorting mechanism, so instead of seeing the list in alphabetical order, I can order by most or least downloaded. Peak traffic to the package page + peak traffic to download the data? Etc.

Issues with the Registry folder contents

Server Error is returned when trying to push a data repo in the local test environment
After login, navigating to localhost:3000 and clicking log in on the UI, it creates an infinite loop between http://localhost:3000/profile and http://localhost:3000/oauth
On the Dev Environment setup for Ubuntu 16.04.3 on step 4, there is a file referenced that does not exist on the most up-to-date branch
source quilt_server/flask_dev.sh

Has anyone else noticed these issues or have solutions available? Would like to create a private hub to store sensitive data via docker and s3 but have been unable to work around these issues

Anchors in markdown

To have working anchor links in markdown we have, basically, two options: manual anchor insertion and automatic generation (based on the heading text).

Manual

Insertion

Just insert the anchor: ## Section 1 <a name="section1"></a>, -- it's simple and explicit, we just need to enable html in remarkable options.

Usage

To use these anchors, we simply reference them by given names: [Section 1](#section1).

Pros

Simplicity and explicitness
Maintenance: anchors don't change implicitly

Cons

Extra manual markup
Possible security issue: need to enable html support in markdown renderer

Implementation

Enable html in remarkable options

Automatic

Generation

We can generate the anchors with the help of something like markdown-toc (see example implementation).
This will require render customization to expose the generated names.

Usage

The usage is essentially the same as with the first option, but since the names are implicit, we should make the anchors accessible in some way, like anchor icons shown when hovering headings on github.

Pros

Automatic
Stylish and professional (especially with anchor icons shown on hover)
No need to enable html parsing

Cons

Added complexity: requires custom rendering for headings and extra styling for anchors
Maintenance: when heading text changes, anchor changes, so one must adjust the links accordingly

Implementation

Configure renderer to generate and insert anchors
Style the anchors appropriately

Caveats

Hash routing works out-of-the-box if the contents are already rendered (so navigating using TOC is ok), but doesn't work on the cold start (navigating directly to /page#section is not ok), so we need some (probably hacky) workaround (our existing one won't work, because it's react-specific, and markdown rendering is happening outside of react context). Server-side rendering will fix this.

Site-local links

Also, it makes sense to intercept link clicks inside markdown and use react-router's push if the link is local (relative, absolute without schema or if origin matches the current one) to avoid page reload.

Python int too large to convert to C long

I took test_v2.csv from the Kaggle Loan Default Prediction dataset (https://www.kaggle.com/c/loan-default-prediction/data).
It is a 1Gb file. I wanted to upload it to my free quilt account, to test it.

import pandas as pd
df = pd.read_csv('./test_v2.csv')
quilt.build('...', df)
quilt.push('...', is_public=True)

See screenshot. Restricting the dataframe to df[:1000] works just fine.

Support parquet files transform

It would be great to have parquet files recognized natively like csv files and be able to pull them directly into a Pandas DataFrame.

Java support

Are you planning to add support for Java? I only need it to download the data. In the meantime, is that possible to just download the uploaded package from its url?

Add `_parent_node_` Attribute to GroupNode and DataNode Objects

Would allow people to build custom navigation functions that traverse the package tree for them.

Example:

def get_associate(node):
    # return the parent of parent other associated node
    return node._parent_node_._parent_node_.sub_node_1.data

node = dataset.initial.sub_node_0.data
associate = get_associate(node)

This is a simple example, but allowing this functionality means we can write very detailed custom navigation functions if we wanted to add more parameters and such to determine which associate to go to, etc.

I will try to dig through the GroupNode and DataNode creation code and add the attribute unless someone else gets it first.

Link images in package readme

It would be great to be able to link images and/or other files in the package in the readme such that they show up in the UI.

Inspiration: https://bl.ocks.org/mbostock/b2fee5dae98555cf78c9e4c5074b87c3

Display tags on package landing page

Provide truncated view of data from package overview page

Currently, the only ways to see an example row of data are by downloading the package or if the maintainer provided an example in the description. There should be a way to view the head of tabular data from the package file tree or similar mechanism.

It could be a new page that displays df.head(n) for tabular data. Unstructured data could be loaded as-is if below a certain file size.

Package List Filter

As I am an odd use case and I make 20000 packages to build a single package in some cases, I know this is an odd request. However would be great to have some sort of filtering options put on the quilt.ls() function.

# return only packages created under the namespace 'aics'
quilt.ls(owner='aics')

# return all packages that begin with the word 'cell'
quilt.ls(grep='cell*')

Update quilt docs with self-hosting a quilt registry

My first impressions from https://docs.quiltdata.com/ are that there are no self-hosting options since I don't see anything about quilt config or the information in https://github.com/quiltdata/quilt/tree/master/registry. So perhaps it could be updated to reflect instructions about self-hosting from the README docs in quilt/registry.

Instructions for production registry deployment

pyarrow is not support win32 or win64, so quilt is the same ?

Hi,
When I pip install quilt in win7, quilt need pyarrow<0.8.0,>=0.4.0, but pyarrow seems only support linux/mac/win-amd, but my computer is intel-win64~

help!

thanks~
Si

Basic question: how to configure where packages are locally stored?

I apologize for asking such a basic question, but I cannot find an answer in the docs or via quilt help.

I digging around the Allen Cell Explorer dataset, and it is rather large. On my mac laptop, packages are stored at ~/Library/Application Support/QuiltCl. I want to instead use a path on an external drive with plenty of space.

Is there an environment variable I am missing? quilt config seemed promising, but I do not think my answer is there either.

--- Mark

slow quilt(1) startup times due to import pyarrow (and pandas?)

1+ secs to startup quilt, which impacts use as a general purpose *nix tool.

I'm not seeing much on pyarrow but here's some info on pandas:
pandas-dev/pandas#7282
pandas-dev/pandas#16764

Duplicate node names in directory

It seems I have some funky names in my dataset

Failed to build the package: Duplicate node names in directory '/efs/images/961'. 'S06_shift_1.nii.gz' was renamed to 'S06_shift_1_nii_gz', which overlaps with 'S06_shift_-1.nii.gz'

Could there be a better way of escaping "-" than removing it? Or using hashes for ids with names as metadata?

Package contents cannot have a node named 'file'

Attempting to push a package in python on 2.8.4 fails with traceback:

Inferring 'transform: id' for README.md
Copying /home/jovyan/data/README.md...
Traceback (most recent call last):
  File "push_package.py", line 3, in <module>
    quilt.build('aicsjackson/aics_11_test', '/home/jovyan/data/build.yml')
  File "/opt/conda/lib/python3.6/site-packages/quilt/tools/command.py", line 425, in build
    _build_internal(package, path, dry_run, env)
  File "/opt/conda/lib/python3.6/site-packages/quilt/tools/command.py", line 450, in _build_internal
    build_from_path(package, path, dry_run=dry_run, env=env)
  File "/opt/conda/lib/python3.6/site-packages/quilt/tools/command.py", line 522, in build_from_path
    build_package(owner, pkg, path, dry_run=dry_run, env=env)
  File "/opt/conda/lib/python3.6/site-packages/quilt/tools/build.py", line 319, in build_package
    checks_contents=checks_contents, dry_run=dry_run, env=env)
  File "/opt/conda/lib/python3.6/site-packages/quilt/tools/build.py", line 345, in build_package_from_contents
    checks_contents=checks_contents, dry_run=dry_run, env=env)
  File "/opt/conda/lib/python3.6/site-packages/quilt/tools/build.py", line 110, in _build_node
    checks_contents=checks_contents, dry_run=dry_run, env=env, ancestor_args=group_args)
  File "/opt/conda/lib/python3.6/site-packages/quilt/tools/build.py", line 110, in _build_node
    checks_contents=checks_contents, dry_run=dry_run, env=env, ancestor_args=group_args)
  File "/opt/conda/lib/python3.6/site-packages/quilt/tools/build.py", line 121, in _build_node
    path = os.path.join(build_dir, rel_path)
  File "/opt/conda/lib/python3.6/posixpath.py", line 92, in join
    genericpath._check_arg_types('join', a, *p)
  File "/opt/conda/lib/python3.6/genericpath.py", line 149, in _check_arg_types
    (funcname, s.__class__.__name__)) from None
TypeError: join() argument must be str or bytes, not 'dict'

Code:

import quilt

quilt.build('aicsjackson/aics_11_test', '/home/jovyan/data/build.yml')
quilt.push('aicsjackson/aics_11_test', public=True)

Python: 3.6.3
OS: Dockerized Linux (base jupyter/minimal-notebook)

Note: Without README.md file it still terminates the package push.

Package Build.yml:

contents:
  README:
    file: README.md
  data:
    modified_AICS_11_0:
      basic_info:
        file: data/modified-AICS-11-0/basic_info.json
      file:
        file: data/modified-AICS-11-0/file.tif
    modified_AICS_11_1:
      basic_info:
        file: data/modified-AICS-11-1/basic_info.json
      file:
        file: data/modified-AICS-11-1/file.tif
    modified_AICS_11_10:
      basic_info:
        file: data/modified-AICS-11-10/basic_info.json
      file:
        file: data/modified-AICS-11-10/file.tif
    modified_AICS_11_11:
      basic_info:
        file: data/modified-AICS-11-11/basic_info.json
      file:
        file: data/modified-AICS-11-11/file.tif
    modified_AICS_11_12:
      basic_info:
        file: data/modified-AICS-11-12/basic_info.json
      file:
        file: data/modified-AICS-11-12/file.tif
    modified_AICS_11_13:
      basic_info:
        file: data/modified-AICS-11-13/basic_info.json
      file:
        file: data/modified-AICS-11-13/file.tif
    modified_AICS_11_14:
      basic_info:
        file: data/modified-AICS-11-14/basic_info.json
      file:
        file: data/modified-AICS-11-14/file.tif
    modified_AICS_11_15:
      basic_info:
        file: data/modified-AICS-11-15/basic_info.json
      file:
        file: data/modified-AICS-11-15/file.tif
    modified_AICS_11_16:
      basic_info:
        file: data/modified-AICS-11-16/basic_info.json
      file:
        file: data/modified-AICS-11-16/file.tif
    modified_AICS_11_17:
      basic_info:
        file: data/modified-AICS-11-17/basic_info.json
      file:
        file: data/modified-AICS-11-17/file.tif
    modified_AICS_11_18:
      basic_info:
        file: data/modified-AICS-11-18/basic_info.json
      file:
        file: data/modified-AICS-11-18/file.tif
    modified_AICS_11_19:
      basic_info:
        file: data/modified-AICS-11-19/basic_info.json
      file:
        file: data/modified-AICS-11-19/file.tif
    modified_AICS_11_2:
      basic_info:
        file: data/modified-AICS-11-2/basic_info.json
      file:
        file: data/modified-AICS-11-2/file.tif
    modified_AICS_11_20:
      basic_info:
        file: data/modified-AICS-11-20/basic_info.json
      file:
        file: data/modified-AICS-11-20/file.tif
    modified_AICS_11_21:
      basic_info:
        file: data/modified-AICS-11-21/basic_info.json
      file:
        file: data/modified-AICS-11-21/file.tif
    modified_AICS_11_22:
      basic_info:
        file: data/modified-AICS-11-22/basic_info.json
      file:
        file: data/modified-AICS-11-22/file.tif
    modified_AICS_11_23:
      basic_info:
        file: data/modified-AICS-11-23/basic_info.json
      file:
        file: data/modified-AICS-11-23/file.tif
    modified_AICS_11_24:
      basic_info:
        file: data/modified-AICS-11-24/basic_info.json
      file:
        file: data/modified-AICS-11-24/file.tif
    modified_AICS_11_25:
      basic_info:
        file: data/modified-AICS-11-25/basic_info.json
      file:
        file: data/modified-AICS-11-25/file.tif
    modified_AICS_11_26:
      basic_info:
        file: data/modified-AICS-11-26/basic_info.json
      file:
        file: data/modified-AICS-11-26/file.tif
    modified_AICS_11_27:
      basic_info:
        file: data/modified-AICS-11-27/basic_info.json
      file:
        file: data/modified-AICS-11-27/file.tif
    modified_AICS_11_28:
      basic_info:
        file: data/modified-AICS-11-28/basic_info.json
      file:
        file: data/modified-AICS-11-28/file.tif
    modified_AICS_11_29:
      basic_info:
        file: data/modified-AICS-11-29/basic_info.json
      file:
        file: data/modified-AICS-11-29/file.tif
    modified_AICS_11_3:
      basic_info:
        file: data/modified-AICS-11-3/basic_info.json
      file:
        file: data/modified-AICS-11-3/file.tif
    modified_AICS_11_4:
      basic_info:
        file: data/modified-AICS-11-4/basic_info.json
      file:
        file: data/modified-AICS-11-4/file.tif
    modified_AICS_11_5:
      basic_info:
        file: data/modified-AICS-11-5/basic_info.json
      file:
        file: data/modified-AICS-11-5/file.tif
    modified_AICS_11_6:
      basic_info:
        file: data/modified-AICS-11-6/basic_info.json
      file:
        file: data/modified-AICS-11-6/file.tif
    modified_AICS_11_7:
      basic_info:
        file: data/modified-AICS-11-7/basic_info.json
      file:
        file: data/modified-AICS-11-7/file.tif
    modified_AICS_11_8:
      basic_info:
        file: data/modified-AICS-11-8/basic_info.json
      file:
        file: data/modified-AICS-11-8/file.tif
    modified_AICS_11_9:
      basic_info:
        file: data/modified-AICS-11-9/basic_info.json
      file:
        file: data/modified-AICS-11-9/file.tif

Any help and info would be appreciated.

Filtering of packages

There is a real desire from my team to have a MongoDB style filtering mechanism for both downloaded packages and installing packages.

# package previously downloaded
is_cell = {'field': 'node_name',
          'check': 'contains',
          'value': 'cell'}

subset = pkg.filter([is_cell, ..., other_filters])
print(len(pkg))
200
print(len(subset))
30

# package to be installed
quilt.install('organization/pkg', filters=[is_cell, ..., other_filters], rename='subset_of_pkg')

Having a native querying system of packages would allow for better reproducibility between individuals using the same package as currently we are each writing our own filtering functions.

CSVs from empty DataFrames crash build

CSVs generated by calling to_csv(index=False) will crash quilt.build if no kwargs are passed in to guide Pandas parsing. It would be helpful to detect this case and skip over the empty file during build.

DataNode and GroupNode should give access to original names

DataNode should be extended to allow access to the original file name via a .name property

It doesn't seem possible to get back the extension of the original file otherwise, and extensions matter for video files.

the metadata in the json file seem to already have this information

does quilt supports HDFS big data and Spark (pySpark)?

It's not clear from the documentation

Build destination should default to user home and read from an environment variable

Suggest QUILT_BUILD_PATHas env vars. See #1 for details.

package import and location Clarity and Control

From the way I understood quilt to work, I expected to be able to call
"quilt build user/package folder"
from anywhere, and I would end up with an importable data package right away. However, this is working in a very sporadic way for me which I'll try to detail. These all seem to me to be related problems, but I'll be happy to break it up into multiple issues if that is easier to manage.

First, where ever I call the build command is where the quilt package is built. As the default I find this quite weird, as I quickly ended up with data packages all over the place. My expectation was that the packages would be managed in a way similar to python packages: some central place on disk.

Second, that default is also problematic as packages that I have built are not available for import from other locations (with one exception in 3rd) unless I set my current working directory manually. To access the standard library or any installed module, it doesn't matter what my current working directory is, and that is what I expected from the quilt packages. But so far, if I have the quilt packages in anywhere but one spot, the working directory of my code has to be the same as the quilt package folder or else the import fails. This breaks the promise of having data importable like modules.

Third On my macbook, I have found that all the packages in
"Users/username/quilt_packages"
are available for import regardless of my working directory. For the most part, this has served my purposes. However, I now have a need to have some of the quilt packages in another folder, so I am back to manipulating the working directory.

Fourth On my windows 10 laptop, there is no folder at all that allows a package import from anywhere, so I always have to manipulate the working directory in Windows.

Fifth I have also not found a way to use packages in PyCharm yet. Adding the quilt_packages folder as a source folder does not make the data packages visible to PyCharm.

Specified Area in DataNode Class for Custom Load Functions

While supporting dataframes as first class objects, and I have heard rumors of json -> dict and npz -> ndarray are excellent to have. There are all sorts of files that would be hard to support as first class in-memory solutions but would be great to have a standard loading mechanism.

If in the DataNode class there was a simple solution for community members to add/ edit standard loading functions that would be a great addition.

class DataNode():
   # put your functions here community
   def json(self):
      import json
      return json.load(self())

   def tiff(self):
      import tifffile
      return new tifffile.TiffFile(self())

This would now make it so that the below functionality may work.

# current behaviour
print(pkg.data_node())
'/quilt_packages/objs/lkjh21398123jknloaisjdfp81jhiuqsdf'

# json behaviour
print(pkg.data_node.json())
{'hello': 'world', 'foo': 'bar'}

# tiff behaviour
print(pkg.data_node.tiff())
<TiffFile object>

So while having first class support for certain types of objects that can live in memory easily, dataframes, ndarrays, dict, etc. rare file types would have pseudo first class support through helper functions.

Consolidate API and Python doc files

Merge docs/api-python.md and docs/api-cli.md with something like https://github.com/GitbookIO/plugin-codetabs

Problem: will not render nicely on GitHub :-/

Rate limit to build package

I'm trying to build a 1.58 GB csv file, but I have the following result:

$ sudo quilt build alifersales/burj20141 build.yml 
Inferring 'transform: id' for README.md
Registering README.md...
Inferring 'transform: csv' for rj_bu_2014_1.csv
Serializing rj_bu_2014_1.csv...
  1%|▋                                                                                                           | 9.18M/1.58G [00:00<00:27, 56.4MB/s]
Warning: failed fast parse on input rj_bu_2014_1.csv.
Switching to Python engine.
Killed

But I can't understand which is the problem and why the process was killed.

My file is:

$head rj_bu_2014_1.csv
"data_geracao","hora_geracao","codigo_pleito","codigo_eleicao","sigla_uf","codigo_cargo","descricao_cargo","numero_zona","numero_secao","numero_local","numero_partido","nome_partido","codigo_municipio","nome_municipio","data_bu_recebido","qtde_eleitores_aptos","qtde_eleitores_faltosos","qtde_eleitores_comparecimento","codigo_tipo_eleicao","codigo_tipo_urna","descricao_tipo_urna","numero_votavel","nome_votavel","qtde_votos","codigo_tipo_votavel","numero_urna_efetivada","codigo_carga_urna_1","codigo_carga_urna_2","data_carga_urna","codigo_flashcard","cargo_pergunta_secao"
"14-10-2014","13:55:55","157","143","RJ","1","PRESIDENTE","1","1","1295","99","","60011","RIO DE JANEIRO","05-10-2014","426","173","253","1","1","APURADA","96","NULO","17","3","1474540","685.479.799.740.465.658.","223.454","22-09-2014","AC0F2D99","1 - 1"
"14-10-2014","13:55:55","157","143","RJ","1","PRESIDENTE","1","1","1295","99","","60011","RIO DE JANEIRO","05-10-2014","426","173","253","1","1","APURADA","95","BRANCO","9","2","1474540","685.479.799.740.465.658.","223.454","22-09-2014","AC0F2D99","1 - 1"
"14-10-2014","13:55:55","157","143","RJ","1","PRESIDENTE","1","1","1295","50","PSOL","60011","RIO DE JANEIRO","05-10-2014","426","173","253","1","1","APURADA","50","LUCIANA GENRO","9","1","1474540","685.479.799.740.465.658.","223.454","22-09-2014","AC0F2D99","1 - 1"
"14-10-2014","13:55:55","157","143","RJ","1","PRESIDENTE","1","1","1295","45","PSDB","60011","RIO DE JANEIRO","05-10-2014","426","173","253","1","1","APURADA","45","A�CIO NEVES","74","1","1474540","685.479.799.740.465.658.","223.454","22-09-2014","AC0F2D99","1 - 1"
"14-10-2014","13:55:55","157","143","RJ","1","PRESIDENTE","1","1","1295","13","PT","60011","RIO DE JANEIRO","05-10-2014","426","173","253","1","1","APURADA","13","DILMA","76","1","1474540","685.479.799.740.465.658.","223.454","22-09-2014","AC0F2D99","1 - 1"
"14-10-2014","13:55:55","157","143","RJ","1","PRESIDENTE","1","1","1295","40","PSB","60011","RIO DE JANEIRO","05-10-2014","426","173","253","1","1","APURADA","40","MARINA SILVA","64","1","1474540","685.479.799.740.465.658.","223.454","22-09-2014","AC0F2D99","1 - 1"
"14-10-2014","13:55:55","157","143","RJ","1","PRESIDENTE","1","1","1295","28","PRTB","60011","RIO DE JANEIRO","05-10-2014","426","173","253","1","1","APURADA","28","LEVY FIDELIX","1","1","1474540","685.479.799.740.465.658.","223.454","22-09-2014","AC0F2D99","1 - 1"
"14-10-2014","13:55:55","157","143","RJ","1","PRESIDENTE","1","1","1295","21","PCB","60011","RIO DE JANEIRO","05-10-2014","426","173","253","1","1","APURADA","21","MAURO IASI","1","1","1474540","685.479.799.740.465.658.","223.454","22-09-2014","AC0F2D99","1 - 1"
"14-10-2014","13:55:55","157","143","RJ","1","PRESIDENTE","1","1","1295","20","PSC","60011","RIO DE JANEIRO","05-10-2014","426","173","253","1","1","APURADA","20","PASTOR EVERALDO","1","1","1474540","685.479.799.740.465.658.","223.454","22-09-2014","AC0F2D99","1 - 1"

I'm suspecting that it's because a rate limit to build, but it looks strange for me.

Can someone help me?

Stats for DataNodes

This relates to #427.

Has there been thought into adding a similar system like mentioned in the above issue but on the DataNode level.

print(pkg.group_node.data_node.inspect())

object_type: file (json)
original_name: 'this_was_a_file.json'
size: 123 MB
default_load: filepath

Would allow users that want to view the details of a node without opening it up a method to do so.

Quilt.load(subpackage_path) returns top-level package instead of subpackage

Currently, I am using subpackages to store experimental data, labeled by an auto-generated date/timestamp. When reloading a data set it would be great to be able to run:

label = "data_20180806_131131"
data = quilt.load(f"guen/test/{label}")
plot_data(data)

Instead, currently, quilt.load returns the top-level package, which requires the following:

label = "data_20180806_131131"
data = quilt.load(f"guen/test/{label}")
data = getattr(data, label)
plot_data(data)

Automate splitting datasets into training, test and validation sets for ML in build

For data packages used in machine learning, it would be useful for Quilt build to support splitting inputs into fixed sets for model training, and validation. For structured data, the various sets (training, test and validation) could be children of a common parent so that the entire dataset is available (by calling _data on the parent). Thanks to @rhiever for the suggestion.

`quilt.load` should support hashes, tags, versions

e.g. so users can diff across versions

Columns QC check

I have checks that look like the following to check that the desired columns are present in my dataframe. It would be nice if there was a built-in qc function for that.

checks:
  has_common_cols: |
    goal = ['col1', 'col2', 'col3']
    qc.check(sum([current in qc.data.columns for current in goal]) == len(goal))

CLI - Team login with custom registry

The _check_team_exists and get_registry_url functions use a hard-coded string template to build the registry url rather than loading the registry_url from the config, preventing teams logins when using a custom registry.

quiltdata / quilt Goto Github PK

quilt's Introduction

Quilt is a data mesh for connecting people with actionable data

Python Quick start, tutorials

Quilt in action

Who is Quilt for?

What does Quilt do?

How does Quilt work?

Use cases

quilt's People

Contributors

Stargazers

Watchers

Forkers

quilt's Issues

The Problem

Possible Solution

Design goals

1. Declarative configuration

2. Consistency

3. Centralised request / response handling

Implementation options

1. react-refetch

Example implementation

Pros

Cons

2. Homegrown redux service / provider

Implementation

Pros

Cons

3. GraphQL / Apollo

Pros

Cons

Implementation

1. Schema

2. Apollo provider

3. GraphQL processing

4. GraphQL API

4.1. Client-side executable schema

4.2. GraphQL gateway over existing REST API

4.3. GraphQL endpoint in the current flask-based registry server

Alternative (quick-fix style) solution

Manual

Insertion

Usage

Pros

Cons

Implementation

Automatic

Generation

Usage

Pros

Cons

Implementation

Caveats

Site-local links

Recommend Projects

Recommend Topics

Recommend Org

1. `react-refetch`