Quilt is a data mesh for connecting people with actionable data
Home Page: https://quiltdata.com
License: Apache License 2.0
Hi,
I am trying to import the package from the link, but it is not accessing the variables in sat_6_full. It is a data node. I am trying it as:
from quilt.data.kmoore import sat4_6_airborne
sat4_6_airborne.sat_6_full.train_x()
Currently, I am using subpackages to store experimental data, labeled by an auto-generated date/timestamp. When reloading a data set it would be great to be able to run:
label = "data_20180806_131131"
data = quilt.load(f"guen/test/{label}")
plot_data(data)
Instead, quilt.load currently returns the top-level package, which requires the following:
label = "data_20180806_131131"
data = quilt.load(f"guen/test/{label}")
data = getattr(data, label)
plot_data(data)
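A minimal sketch of the proposed single-call behavior, using a stand-in object in place of what quilt.load actually returns (load_subpackage and the sample data are hypothetical, not quilt API):

```python
from types import SimpleNamespace

def load_subpackage(pkg, label):
    # Resolve the timestamp-labeled subpackage from an already-loaded
    # top-level package -- the same getattr workaround, wrapped once.
    return getattr(pkg, label)

# Stand-in for the object quilt.load returns.
pkg = SimpleNamespace(data_20180806_131131=[1, 2, 3])
data = load_subpackage(pkg, "data_20180806_131131")
```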
The current implementation of searching, at least from what I can tell, doesn't allow term chaining.
If I search for myosin, I get the following results back:
These are correct results. However, if I want to search for differing sets of data with a single search, I cannot. For example, maybe I am interested in both myosin and junctions, so I search for "myosin junctions" (these terms don't really relate, but in my opinion results for either myosin or junctions should be returned). I am returned only results that contain both terms, not results matching the individual terms.
To my understanding, this is because all search terms must not only appear in the README, node names, package name, etc., but must also appear in the exact order they are given.
As seen here, I know myosin is in the aics/random_sample package, so I search for "random myosin". random is in the name of the package, and myosin is labeled as a keyword in the README, but because they do not appear in the same order as the search terms, the package aics/random_sample is not returned.
I haven't looked into ts_vector much, but I believe ts_vector is per-word, not per-document, in which case I think this is a matter of breaking the search down into individual terms.
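A sketch of the suggested per-term OR matching (matches_any_term and the sample README text are illustrative, not quilt's actual search code):

```python
def matches_any_term(document_text, query):
    # OR semantics: the document matches if ANY whitespace-separated
    # query term appears in it, regardless of order or adjacency.
    terms = query.lower().split()
    text = document_text.lower()
    return any(term in text for term in terms)

# 'random' is in the package name, 'myosin' in the README keywords:
readme = "aics/random_sample -- keywords: myosin, microscopy"
```

With this, a query like "random myosin" matches even though the terms are neither adjacent nor in order.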
CSVs generated by calling to_csv(index=False) will crash quilt.build
if no kwargs are passed in to guide pandas parsing. It would be helpful to detect this case and skip the empty file during build.
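A sketch of the detection step (is_effectively_empty_csv is a hypothetical helper; treating whitespace-only files as empty is an assumption covering the case where to_csv wrote nothing but a newline):

```python
import tempfile

def is_effectively_empty_csv(path):
    # Treat zero-byte or whitespace-only files as empty so the build
    # can skip them instead of crashing the parser.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return f.read().strip() == ""

# demo: one empty and one populated CSV
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("\n")
    empty_path = f.name
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("a,b\n1,2\n")
    full_path = f.name
```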
Our current API interaction system is too verbose and not very consistent.
Basically, we just call fetch and pass all the options manually, which is tedious and error-prone.
Also, we need to intercept 401s and log user out automatically.
A possible solution is to abstract API interaction into a "service" (injected by a provider component configured via props / context) with a convenient interface. For ease of use and minimal boilerplate, the configuration should be declarative: a HOC constructor that accepts some sort of request mappings / queries.
We need a consistent way to make requests and handle responses / errors.
We need a way to hook into request and response flow in a single place to inject custom processing (auth hooks, json parsing, header injection, etc).
We have, basically, 3 different options:
a wrapper around react-refetch;
homegrown redux service / provider for abstracting the calls in some way;
GraphQL / Apollo
react-refetch
Quote:
A simple, declarative, and composable way to fetch data for React components
API connector (provider component and consumer HOC):
export const Provider = composeComponent('APIProvider',
// get config from props and put it into context
);
export default (mappings) => composeHOC('withAPI',
// ... inject config from context
refetch.defaults({
buildRequest: ({ url, headers, ...options }) => {
// get config from props
const { base } = ...;
// get auth headers
const authHeaders = ...;
return new Request(urlJoin(base, url), { headers: { ...authHeaders, ...headers }, ...options });
},
handleResponse: async (resp) => {
// handle authentication loss
if (resp.status === 401) {
// access `dispatch` in some way
dispatch(authLost());
throw new NotAuthenticated();
}
// maybe some json processing and other stuff from the current `utils/request`
return resp;
},
})(mappings),
);
Usage (Provider should be present somewhere up the hierarchy when rendering this component):
composeComponent('Profile',
// declarative config, can be factored out if necessary
withAPI((props) => ({
packages: `/profile`,
updatePayment: () => ({
updatePaymentResponse: {
url: `/payment/`,
method: 'POST',
body: formData,
},
}),
})),
({ packages, updatePayment, updatePaymentResponse }) => (
<div>
// quite convenient response state handling
{renderPromiseState(packages, {
pending: () => <Loading />,
rejected: (reason) => <Error error={reason} />,
fulfilled: (value) => <Packages packages={value} />,
})}
<button onClick={updatePayment}>Update payment info</button>
{renderPromiseState(updatePaymentResponse, {
pending: () => <p>updating...</p>,
rejected: (reason) => <p>error updating payment info: {reason}</p>,
fulfilled: (value) => <p>payment info updated</p>,
})}
</div>
));
We'll need a provider that runs the saga that issues requests and handles responses (with reducer that holds the requests state).
Also, we'll need a consumer HOC which
Basically, we'd write our own opinionated version of redux-refetch or something similar.
Doesn't seem like a viable option given there are established feature-rich 3rd-party tools that can be used for this.
GraphQL is a model querying language and Apollo is a set of technologies for working with GraphQL all across the stack (includes client and server tools).
To implement a GraphQL API we need several things:
Connecting component to data looks like this:
// import query from a file
import query from './query.graphql';
// or use inline query
const query = gql`
query Component__data {
package { owner name }
}
`;
composeComponent('ComponentWithData',
graphql(query),
({ data }) => (
<div>
{renderData(data, {
loading: () => <p>loading</p>,
error: (error) => <p>error: {error}</p>,
data: ({ package: { name, owner } }) => <p>package: {owner}/{name}</p>,
})}
</div>
));
Looks like this:
# quilt.graphql
type Package {
# ...
}
# other types
type Query {
package(owner: String!, name: String!): Package
packages(query: String): [Package!]!
# ...
}
type Mutation {
deletePackage(owner: String!, name: String!): DeletePackageResult!
# ...
signIn(email: String!, password: String!): SignInResult!
signUp(...): SignUpResult!
}
Nothing unusual, just configure an Apollo client and inject it using ApolloProvider.
We'd need to use GraphQL loader for webpack to be able to store queries and fragments in separate files.
To implement executable schema on the client-side, we'd need to provide a resolver map looking like this:
{
Query: {
package: (root, params, context) => {
// make a request and process the data
},
packages: (root, params, context) => {
// make a request and process the data
},
// ...
  },
}
The execution environment is quite heavyweight, so this solution should be considered temporary.
The resolvers logic can be easily moved to the server-side when implementing nodejs-based GraphQL API gateway.
Pros:
Cons:
This one is quite straightforward. Suggested stack: koa or express + apollo-server / apollo-tools. The implementation is similar to the client-side solution (writing resolvers for a schema that fetch and transform data), but the facade is a separate server that listens for GraphQL requests (via http or ws), makes http requests to the REST API, and then responds with the collected and transformed data.
Pros:
Cons:
Python ecosystem has decent GraphQL support, so we can serve the GraphQL stuff right from our flask app, querying the data directly using SQLA and avoiding unnecessary overhead related to REST requests.
Pros:
Cons:
Alternatively, we can adjust our request functions (utils/request) to inject headers and handle 401s in some way. It is quite quick and easy to implement, but it doesn't solve the problem as stated here in its entirety, just its auth-handling aspect. It also doesn't seem great architecture-wise. Nevertheless, it's a viable short-term solution.
I have checks that look like the following to check that the desired columns are present in my dataframe. It would be nice if there was a built-in qc function for that.
checks:
has_common_cols: |
goal = ['col1', 'col2', 'col3']
qc.check(all(col in qc.data.columns for col in goal))
Suggest QUILT_BUILD_PATH as an env var. See #1 for details.
It's not clear from the documentation
I took test_v2.csv from the Kaggle Loan Default Prediction dataset (https://www.kaggle.com/c/loan-default-prediction/data).
It is a 1 GB file. I wanted to upload it to my free Quilt account to test it.
import pandas as pd
df = pd.read_csv('./test_v2.csv')
quilt.build('...', df)
quilt.push('...', is_public=True)
See screenshot. Restricting the dataframe to df[:1000] works just fine.
To have working anchor links in markdown we have, basically, two options: manual anchor insertion and automatic generation (based on the heading text).
Just insert the anchor: ## Section 1 <a name="section1"></a> -- it's simple and explicit; we just need to enable html in remarkable options.
To use these anchors, we simply reference them by the given names: [Section 1](#section1).
We can generate the anchors with the help of something like markdown-toc (see example implementation).
This will require render customization to expose the generated names.
The usage is essentially the same as with the first option, but since the names are implicit, we should make the anchors accessible in some way, like anchor icons shown when hovering headings on github.
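For illustration, anchor generation in the style of markdown-toc might look like this (github_slug is a hypothetical name, and the exact normalization rules are an assumption):

```python
import re

def github_slug(heading):
    # Lowercase, drop punctuation, replace spaces with hyphens --
    # roughly what GitHub / markdown-toc do to heading text.
    s = heading.strip().lower()
    s = re.sub(r"[^\w\- ]", "", s)
    return s.replace(" ", "-")
```

With this, "Section 1" becomes the implicit anchor "section-1", so [Section 1](#section-1) works without a manual anchor tag.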
Hash routing works out-of-the-box if the contents are already rendered (so navigating using TOC is ok), but doesn't work on the cold start (navigating directly to /page#section
is not ok), so we need some (probably hacky) workaround (our existing one won't work, because it's react-specific, and markdown rendering is happening outside of react context). Server-side rendering will fix this.
Also, it makes sense to intercept link clicks inside markdown and use react-router's push if the link is local (relative, absolute without schema, or if the origin matches the current one) to avoid a page reload.
It would be great to have parquet files recognized natively like csv files and be able to pull them directly into a Pandas DataFrame.
It takes 1+ seconds to start up quilt, which impacts its use as a general-purpose *nix tool.
I'm not seeing much on pyarrow but here's some info on pandas:
pandas-dev/pandas#7282
pandas-dev/pandas#16764
It seems I have some funky names in my dataset
Failed to build the package: Duplicate node names in directory '/efs/images/961'. 'S06_shift_1.nii.gz' was renamed to 'S06_shift_1_nii_gz', which overlaps with 'S06_shift_-1.nii.gz'
Could there be a better way of escaping "-" than removing it? Or using hashes for ids with names as metadata?
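One possible sketch: map every non-word character to an underscore (instead of dropping '-') and de-duplicate on collision. to_identifier and its suffixing scheme are hypothetical, not quilt's actual renaming logic:

```python
import re

def to_identifier(name, seen):
    # Replace every non-word character with '_' so '-1' and '_1'
    # stay distinct, then suffix a counter on any remaining collision.
    base = re.sub(r"\W", "_", name)
    if base[0].isdigit():
        base = "n_" + base  # identifiers can't start with a digit
    candidate, i = base, 1
    while candidate in seen:
        candidate = f"{base}_{i}"
        i += 1
    seen.add(candidate)
    return candidate

seen = set()
a = to_identifier("S06_shift_1.nii.gz", seen)   # 'S06_shift_1_nii_gz'
b = to_identifier("S06_shift_-1.nii.gz", seen)  # 'S06_shift__1_nii_gz'
```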
A majority of my packages are quite large so the largest desire is for there to be at least an indication of package size before install on the Quilt registry page.
If this is the current package stats area:
Latest update
date
1/10/2018, 2:07:10 PM
author
@aicsjackson
version
2aed0783417118de00913d2ffe7be9968c881eb990993d6f5fdb957d0f49354
Would be nice to also see items like:
Latest update
date
1/10/2018, 2:07:10 PM
author
@aicsjackson
version
2aed0783417118de00913d2ffe7be9968c881eb990993d6f5fdb957d0f49354
size
200 GB
files
tiff | 400
json | 1600
Would allow for better insight into the package contents prior to downloading.
This relates to #427.
Has there been any thought about adding a system like the one mentioned in the above issue, but at the DataNode level?
print(pkg.group_node.data_node.inspect())
object_type: file (json)
original_name: 'this_was_a_file.json'
size: 123 MB
default_load: filepath
This would give users who want to view the details of a node a way to do so without opening it.
Using pip-compile to generate my requirements.txt resulted in pandas==0.19.2 for my project.
Trying to start up quilt's CLI resulted in:
(sourceress) sourceress jason/model-info-amount: quilt
Traceback (most recent call last):
File "/Users/jasonbenn/.pyenv/versions/sourceress/bin/quilt", line 7, in <module>
from quilt.tools.main import main
File "/Users/jasonbenn/.pyenv/versions/3.6.3/envs/sourceress/lib/python3.6/site-packages/quilt/__init__.py", line 83, in <module>
from .tools.command import (
File "/Users/jasonbenn/.pyenv/versions/3.6.3/envs/sourceress/lib/python3.6/site-packages/quilt/tools/command.py", line 33, in <module>
from .build import (build_package, build_package_from_contents, generate_build_file,
File "/Users/jasonbenn/.pyenv/versions/3.6.3/envs/sourceress/lib/python3.6/site-packages/quilt/tools/build.py", line 11, in <module>
from pandas.errors import ParserError
ModuleNotFoundError: No module named 'pandas.errors'
This pandas.errors module exists in 0.22.0 but not in 0.19.2.
Where is this project's setup.py or requirements.txt?
My first impression from https://docs.quiltdata.com/ is that there are no self-hosting options, since I don't see anything about quilt config or the information in https://github.com/quiltdata/quilt/tree/master/registry. Perhaps the docs could be updated to include the self-hosting instructions from the README in quilt/registry.
Currently, the only ways to see an example row of data are by downloading the package or if the maintainer provided an example in the description. There should be a way to view the head of tabular data from the package file tree or similar mechanism.
It could be a new page that displays df.head(n) for tabular data. Unstructured data could be loaded as-is if below a certain file size.
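A sketch of the preview logic over raw CSV text using the stdlib csv module (preview_csv is a hypothetical name):

```python
import csv
import io

def preview_csv(text, n=5):
    # Return the header row plus the first n data rows -- the
    # equivalent of df.head(n) without parsing the whole file.
    rows = csv.reader(io.StringIO(text))
    out = []
    for i, row in enumerate(rows):
        if i > n:
            break
        out.append(row)
    return out
```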
There is a real desire from my team to have a MongoDB style filtering mechanism for both downloaded packages and installing packages.
# package previously downloaded
is_cell = {'field': 'node_name',
'check': 'contains',
'value': 'cell'}
subset = pkg.filter([is_cell, ..., other_filters])
print(len(pkg))
200
print(len(subset))
30
# package to be installed
quilt.install('organization/pkg', filters=[is_cell, ..., other_filters], rename='subset_of_pkg')
Having a native querying system of packages would allow for better reproducibility between individuals using the same package as currently we are each writing our own filtering functions.
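A sketch of how the filter spec in the example above could be evaluated (matches and filter_nodes are hypothetical helpers; nodes are modeled as plain dicts):

```python
def matches(node, spec):
    # Evaluate one {'field', 'check', 'value'} spec against a node.
    value = node.get(spec["field"], "")
    if spec["check"] == "contains":
        return spec["value"] in value
    if spec["check"] == "equals":
        return value == spec["value"]
    raise ValueError(f"unknown check: {spec['check']}")

def filter_nodes(nodes, filters):
    # A node survives only if it satisfies every filter (AND semantics).
    return [n for n in nodes if all(matches(n, f) for f in filters)]

is_cell = {"field": "node_name", "check": "contains", "value": "cell"}
nodes = [{"node_name": "cell_01"}, {"node_name": "nucleus_01"}]
```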
From the way I understood quilt to work, I expected to be able to call
"quilt build user/package folder"
from anywhere, and I would end up with an importable data package right away. However, this is working in a very sporadic way for me which I'll try to detail. These all seem to me to be related problems, but I'll be happy to break it up into multiple issues if that is easier to manage.
First, wherever I call the build command is where the quilt package is built. As the default I find this quite weird, as I quickly ended up with data packages all over the place. My expectation was that the packages would be managed similarly to Python packages: in some central place on disk.
Second, that default is also problematic as packages that I have built are not available for import from other locations (with one exception in 3rd) unless I set my current working directory manually. To access the standard library or any installed module, it doesn't matter what my current working directory is, and that is what I expected from the quilt packages. But so far, if I have the quilt packages in anywhere but one spot, the working directory of my code has to be the same as the quilt package folder or else the import fails. This breaks the promise of having data importable like modules.
Third, on my macbook, I have found that all the packages in "Users/username/quilt_packages" are available for import regardless of my working directory. For the most part, this has served my purposes. However, I now need to keep some of the quilt packages in another folder, so I am back to manipulating the working directory.
Fourth, on my Windows 10 laptop, there is no folder at all that allows a package import from anywhere, so I always have to manipulate the working directory on Windows.
Fifth, I have also not found a way to use packages in PyCharm yet. Adding the quilt_packages folder as a source folder does not make the data packages visible to PyCharm.
Great work on the summary stats for packages!
I also realized it would be neat to see more detailed insight into my own package data. On the user settings page, a user can see all of their packages in a list. If next to each package name there were details on number of downloads, maybe some sorting mechanism, so instead of seeing the list in alphabetical order, I can order by most or least downloaded. Peak traffic to the package page + peak traffic to download the data? Etc.
Merge docs/api-python.md and docs/api-cli.md with something like https://github.com/GitbookIO/plugin-codetabs
Problem: will not render nicely on GitHub :-/
Dataframes are supported as first-class objects, and I have heard rumors of json -> dict and npz -> ndarray, which are excellent to have. But there are all sorts of files that would be hard to support as first-class in-memory objects yet would benefit from a standard loading mechanism.
If the DataNode class offered a simple way for community members to add/edit standard loading functions, that would be a great addition.
class DataNode:
    # put your functions here, community
    def json(self):
        import json
        with open(self()) as f:
            return json.load(f)
    def tiff(self):
        import tifffile
        return tifffile.TiffFile(self())
This would now make it so that the below functionality may work.
# current behaviour
print(pkg.data_node())
'/quilt_packages/objs/lkjh21398123jknloaisjdfp81jhiuqsdf'
# json behaviour
print(pkg.data_node.json())
{'hello': 'world', 'foo': 'bar'}
# tiff behaviour
print(pkg.data_node.tiff())
<TiffFile object>
So while having first class support for certain types of objects that can live in memory easily, dataframes, ndarrays, dict, etc. rare file types would have pseudo first class support through helper functions.
I have a multi-user Linux server that I will be using for a Data Science class. I would like to use quilt to distribute datasets with Jupyter Notebooks to students, but I don't want every student to download their own copy of the data when using quilt. I see that quilt uses appdirs.user_data_dir() to get the directory to use, and that I can set XDG_DATA_HOME to override that location:
https://github.com/ActiveState/appdirs/blob/master/appdirs.py#L92
Thanks!
It would be great to be able to link images and/or other files in the package in the readme such that they show up in the UI.
Inspiration: https://bl.ocks.org/mbostock/b2fee5dae98555cf78c9e4c5074b87c3
I apologize for asking such a basic question, but I cannot find an answer in the docs or via quilt help.
I am digging around the Allen Cell Explorer dataset, and it is rather large. On my mac laptop, packages are stored at ~/Library/Application Support/QuiltCl. I want to instead use a path on an external drive with plenty of space.
Is there an environment variable I am missing? quilt config seemed promising, but I do not think my answer is there either.
--- Mark
These datasets would be even more discoverable if at least their metadata were described with Linked Data.
Here's an example of adding RDFa to an HTML page (and links to background info and validators for Schema.org, OpenGraph, and Twitter Cards structured data):
https://github.com/CodeForAntarctica/codeforantarctica.github.io/pull/3
There's a lot of linked open data. Here's a view of the LODCloud:
This should maybe be another issue.
If we have URIs for column names, datatypes, and physical units (also accuracy and precision), we can find and compare and maybe even concatenate datasets.
I think what is needed are dataset headers that map this metadata onto columns in existing formats. CSVW (CSV on the Web) is built on the Tabular Data Model, which specifies how this metadata can be expressed as JSON-LD (and thus any RDF representation).
SPARQL is one query language for linked data.
I wrote up a bunch of this background information in pandas-dev/pandas#3402 "ENH: Linked Datasets (RDF)". And mentioned the opportunity for CSVW-like headers in Arrow and Parquet to @wesm on Twitter awhile back.
Currently, GroupNodes and DataNodes are accessible only by attribute, with a callable attached if you want the actual data.
from quilt.data.aics import prod
# retrieves the image load node
print(prod.fovs.unique_fov.image.load)
# retrieves the image load node file path
print(prod.fovs.unique_fov.image.load())
This current implementation is great. However, it would also be nice to traverse nodes dict-style.
from quilt.data.aics import prod
# retrieves the image load node
print(prod['fovs']['unique_fov']['image']['load'])
# retrieves the image load node file path
print(prod['fovs']['unique_fov']['image']['load']())
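Supporting this is mostly a matter of adding a __getitem__ that delegates to attribute access; a minimal sketch (Node is a stand-in class, not quilt's):

```python
class Node:
    # Dict-style access that delegates to attribute access, so
    # node['child'] and node.child return the same thing.
    def __getitem__(self, key):
        return getattr(self, key)

root = Node()
root.fovs = Node()
root.fovs.image = "file_path"
```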
I'm trying to build a 1.58 GB csv file, but I have the following result:
$ sudo quilt build alifersales/burj20141 build.yml
Inferring 'transform: id' for README.md
Registering README.md...
Inferring 'transform: csv' for rj_bu_2014_1.csv
Serializing rj_bu_2014_1.csv...
1%|▋ | 9.18M/1.58G [00:00<00:27, 56.4MB/s]
Warning: failed fast parse on input rj_bu_2014_1.csv.
Switching to Python engine.
Killed
But I can't understand what the problem is or why the process was killed.
My file is:
$head rj_bu_2014_1.csv
"data_geracao","hora_geracao","codigo_pleito","codigo_eleicao","sigla_uf","codigo_cargo","descricao_cargo","numero_zona","numero_secao","numero_local","numero_partido","nome_partido","codigo_municipio","nome_municipio","data_bu_recebido","qtde_eleitores_aptos","qtde_eleitores_faltosos","qtde_eleitores_comparecimento","codigo_tipo_eleicao","codigo_tipo_urna","descricao_tipo_urna","numero_votavel","nome_votavel","qtde_votos","codigo_tipo_votavel","numero_urna_efetivada","codigo_carga_urna_1","codigo_carga_urna_2","data_carga_urna","codigo_flashcard","cargo_pergunta_secao"
"14-10-2014","13:55:55","157","143","RJ","1","PRESIDENTE","1","1","1295","99","","60011","RIO DE JANEIRO","05-10-2014","426","173","253","1","1","APURADA","96","NULO","17","3","1474540","685.479.799.740.465.658.","223.454","22-09-2014","AC0F2D99","1 - 1"
"14-10-2014","13:55:55","157","143","RJ","1","PRESIDENTE","1","1","1295","99","","60011","RIO DE JANEIRO","05-10-2014","426","173","253","1","1","APURADA","95","BRANCO","9","2","1474540","685.479.799.740.465.658.","223.454","22-09-2014","AC0F2D99","1 - 1"
"14-10-2014","13:55:55","157","143","RJ","1","PRESIDENTE","1","1","1295","50","PSOL","60011","RIO DE JANEIRO","05-10-2014","426","173","253","1","1","APURADA","50","LUCIANA GENRO","9","1","1474540","685.479.799.740.465.658.","223.454","22-09-2014","AC0F2D99","1 - 1"
"14-10-2014","13:55:55","157","143","RJ","1","PRESIDENTE","1","1","1295","45","PSDB","60011","RIO DE JANEIRO","05-10-2014","426","173","253","1","1","APURADA","45","A�CIO NEVES","74","1","1474540","685.479.799.740.465.658.","223.454","22-09-2014","AC0F2D99","1 - 1"
"14-10-2014","13:55:55","157","143","RJ","1","PRESIDENTE","1","1","1295","13","PT","60011","RIO DE JANEIRO","05-10-2014","426","173","253","1","1","APURADA","13","DILMA","76","1","1474540","685.479.799.740.465.658.","223.454","22-09-2014","AC0F2D99","1 - 1"
"14-10-2014","13:55:55","157","143","RJ","1","PRESIDENTE","1","1","1295","40","PSB","60011","RIO DE JANEIRO","05-10-2014","426","173","253","1","1","APURADA","40","MARINA SILVA","64","1","1474540","685.479.799.740.465.658.","223.454","22-09-2014","AC0F2D99","1 - 1"
"14-10-2014","13:55:55","157","143","RJ","1","PRESIDENTE","1","1","1295","28","PRTB","60011","RIO DE JANEIRO","05-10-2014","426","173","253","1","1","APURADA","28","LEVY FIDELIX","1","1","1474540","685.479.799.740.465.658.","223.454","22-09-2014","AC0F2D99","1 - 1"
"14-10-2014","13:55:55","157","143","RJ","1","PRESIDENTE","1","1","1295","21","PCB","60011","RIO DE JANEIRO","05-10-2014","426","173","253","1","1","APURADA","21","MAURO IASI","1","1","1474540","685.479.799.740.465.658.","223.454","22-09-2014","AC0F2D99","1 - 1"
"14-10-2014","13:55:55","157","143","RJ","1","PRESIDENTE","1","1","1295","20","PSC","60011","RIO DE JANEIRO","05-10-2014","426","173","253","1","1","APURADA","20","PASTOR EVERALDO","1","1","1474540","685.479.799.740.465.658.","223.454","22-09-2014","AC0F2D99","1 - 1"
I suspect it's because of a rate limit on builds, but that seems strange to me.
Can someone help me?
Attempting to push a package in python on 2.8.4 fails with traceback:
Inferring 'transform: id' for README.md
Copying /home/jovyan/data/README.md...
Traceback (most recent call last):
File "push_package.py", line 3, in <module>
quilt.build('aicsjackson/aics_11_test', '/home/jovyan/data/build.yml')
File "/opt/conda/lib/python3.6/site-packages/quilt/tools/command.py", line 425, in build
_build_internal(package, path, dry_run, env)
File "/opt/conda/lib/python3.6/site-packages/quilt/tools/command.py", line 450, in _build_internal
build_from_path(package, path, dry_run=dry_run, env=env)
File "/opt/conda/lib/python3.6/site-packages/quilt/tools/command.py", line 522, in build_from_path
build_package(owner, pkg, path, dry_run=dry_run, env=env)
File "/opt/conda/lib/python3.6/site-packages/quilt/tools/build.py", line 319, in build_package
checks_contents=checks_contents, dry_run=dry_run, env=env)
File "/opt/conda/lib/python3.6/site-packages/quilt/tools/build.py", line 345, in build_package_from_contents
checks_contents=checks_contents, dry_run=dry_run, env=env)
File "/opt/conda/lib/python3.6/site-packages/quilt/tools/build.py", line 110, in _build_node
checks_contents=checks_contents, dry_run=dry_run, env=env, ancestor_args=group_args)
File "/opt/conda/lib/python3.6/site-packages/quilt/tools/build.py", line 110, in _build_node
checks_contents=checks_contents, dry_run=dry_run, env=env, ancestor_args=group_args)
File "/opt/conda/lib/python3.6/site-packages/quilt/tools/build.py", line 121, in _build_node
path = os.path.join(build_dir, rel_path)
File "/opt/conda/lib/python3.6/posixpath.py", line 92, in join
genericpath._check_arg_types('join', a, *p)
File "/opt/conda/lib/python3.6/genericpath.py", line 149, in _check_arg_types
(funcname, s.__class__.__name__)) from None
TypeError: join() argument must be str or bytes, not 'dict'
Code:
import quilt
quilt.build('aicsjackson/aics_11_test', '/home/jovyan/data/build.yml')
quilt.push('aicsjackson/aics_11_test', public=True)
Python: 3.6.3
OS: Dockerized Linux (base jupyter/minimal-notebook)
Note: the push still fails even without the README.md file.
Package Build.yml:
contents:
README:
file: README.md
data:
modified_AICS_11_0:
basic_info:
file: data/modified-AICS-11-0/basic_info.json
file:
file: data/modified-AICS-11-0/file.tif
modified_AICS_11_1:
basic_info:
file: data/modified-AICS-11-1/basic_info.json
file:
file: data/modified-AICS-11-1/file.tif
modified_AICS_11_10:
basic_info:
file: data/modified-AICS-11-10/basic_info.json
file:
file: data/modified-AICS-11-10/file.tif
modified_AICS_11_11:
basic_info:
file: data/modified-AICS-11-11/basic_info.json
file:
file: data/modified-AICS-11-11/file.tif
modified_AICS_11_12:
basic_info:
file: data/modified-AICS-11-12/basic_info.json
file:
file: data/modified-AICS-11-12/file.tif
modified_AICS_11_13:
basic_info:
file: data/modified-AICS-11-13/basic_info.json
file:
file: data/modified-AICS-11-13/file.tif
modified_AICS_11_14:
basic_info:
file: data/modified-AICS-11-14/basic_info.json
file:
file: data/modified-AICS-11-14/file.tif
modified_AICS_11_15:
basic_info:
file: data/modified-AICS-11-15/basic_info.json
file:
file: data/modified-AICS-11-15/file.tif
modified_AICS_11_16:
basic_info:
file: data/modified-AICS-11-16/basic_info.json
file:
file: data/modified-AICS-11-16/file.tif
modified_AICS_11_17:
basic_info:
file: data/modified-AICS-11-17/basic_info.json
file:
file: data/modified-AICS-11-17/file.tif
modified_AICS_11_18:
basic_info:
file: data/modified-AICS-11-18/basic_info.json
file:
file: data/modified-AICS-11-18/file.tif
modified_AICS_11_19:
basic_info:
file: data/modified-AICS-11-19/basic_info.json
file:
file: data/modified-AICS-11-19/file.tif
modified_AICS_11_2:
basic_info:
file: data/modified-AICS-11-2/basic_info.json
file:
file: data/modified-AICS-11-2/file.tif
modified_AICS_11_20:
basic_info:
file: data/modified-AICS-11-20/basic_info.json
file:
file: data/modified-AICS-11-20/file.tif
modified_AICS_11_21:
basic_info:
file: data/modified-AICS-11-21/basic_info.json
file:
file: data/modified-AICS-11-21/file.tif
modified_AICS_11_22:
basic_info:
file: data/modified-AICS-11-22/basic_info.json
file:
file: data/modified-AICS-11-22/file.tif
modified_AICS_11_23:
basic_info:
file: data/modified-AICS-11-23/basic_info.json
file:
file: data/modified-AICS-11-23/file.tif
modified_AICS_11_24:
basic_info:
file: data/modified-AICS-11-24/basic_info.json
file:
file: data/modified-AICS-11-24/file.tif
modified_AICS_11_25:
basic_info:
file: data/modified-AICS-11-25/basic_info.json
file:
file: data/modified-AICS-11-25/file.tif
modified_AICS_11_26:
basic_info:
file: data/modified-AICS-11-26/basic_info.json
file:
file: data/modified-AICS-11-26/file.tif
modified_AICS_11_27:
basic_info:
file: data/modified-AICS-11-27/basic_info.json
file:
file: data/modified-AICS-11-27/file.tif
modified_AICS_11_28:
basic_info:
file: data/modified-AICS-11-28/basic_info.json
file:
file: data/modified-AICS-11-28/file.tif
modified_AICS_11_29:
basic_info:
file: data/modified-AICS-11-29/basic_info.json
file:
file: data/modified-AICS-11-29/file.tif
modified_AICS_11_3:
basic_info:
file: data/modified-AICS-11-3/basic_info.json
file:
file: data/modified-AICS-11-3/file.tif
modified_AICS_11_4:
basic_info:
file: data/modified-AICS-11-4/basic_info.json
file:
file: data/modified-AICS-11-4/file.tif
modified_AICS_11_5:
basic_info:
file: data/modified-AICS-11-5/basic_info.json
file:
file: data/modified-AICS-11-5/file.tif
modified_AICS_11_6:
basic_info:
file: data/modified-AICS-11-6/basic_info.json
file:
file: data/modified-AICS-11-6/file.tif
modified_AICS_11_7:
basic_info:
file: data/modified-AICS-11-7/basic_info.json
file:
file: data/modified-AICS-11-7/file.tif
modified_AICS_11_8:
basic_info:
file: data/modified-AICS-11-8/basic_info.json
file:
file: data/modified-AICS-11-8/file.tif
modified_AICS_11_9:
basic_info:
file: data/modified-AICS-11-9/basic_info.json
file:
file: data/modified-AICS-11-9/file.tif
Any help and info would be appreciated.
Per multiple requests from HN and Twitter.
Perhaps I missed it but is there/will there be functionality for pushing and installing from self-hosted private data registries?
As an odd use case, I sometimes make 20,000 packages to build a single package, so I know this is an odd request. However, it would be great to have some filtering options on the quilt.ls() function.
# return only packages created under the namespace 'aics'
quilt.ls(owner='aics')
# return all packages that begin with the word 'cell'
quilt.ls(grep='cell*')
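A sketch of what these filters could do, with packages modeled as 'owner/name' strings and the grep pattern treated as a glob, as in the example (filter_packages is a hypothetical helper, not quilt API):

```python
import fnmatch

def filter_packages(packages, owner=None, grep=None):
    # packages: iterable of 'owner/name' strings.
    out = list(packages)
    if owner is not None:
        out = [p for p in out if p.split("/", 1)[0] == owner]
    if grep is not None:
        out = [p for p in out if fnmatch.fnmatch(p.split("/", 1)[1], grep)]
    return out

pkgs = ["aics/cell_lines", "aics/random_sample", "guen/cellular"]
```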
e.g. so users can diff across versions
No way currently to remove a set of packages.
Current process:
$ quilt ls
...
$ quilt rm usr/pkg
$ quilt rm usr/pkg
Would like the ability to remove all packages matching a regex:
$ quilt rm ^usr/*
Or some other programmatic interface for removal. Not necessarily regex.
Currently, when I: print(pkg.group_node)
I am returned something like this:
<GroupNode>
node_0
node_1
node_2
node_3
node_4
node_5
node_6
node_7
node_8
node_9
Where every sub-node is printed.
However, for packages like mine, it isn't uncommon to have hundreds of nodes nested under each other and thus this printing structure isn't the best for viewing.
Two options in my mind:
When I print a group node, have it print a structure like that of the contents section of the webpage preview of a package.
print(pkg.group_node)
Would return:
<GroupNode>
node_0
subnode_0
subsubnode_0
subsubnode_1
subnode_1
subsubnode_0
subsubnode_1
...
node_1
subnode_0
subsubnode_0
subsubnode_1
subnode_1
subsubnode_0
subsubnode_1
...
...
Or have it print the first n and last n subnodes of the node.
print(pkg.group_node)
Would return:
<GroupNode>
node_0
node_1
node_2
...
node_n-2
node_n-1
node_n
Maybe implement both, with something like quilt.set_prints('show_depth')
/ quilt.set_prints('show_length')
to indicate which printing method the user wants.
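The second option (first n / last n with the middle elided) can be sketched as a small formatting helper. This is illustrative only; summarize_children is a hypothetical name, not Quilt's actual repr logic:

```python
def summarize_children(names, head=3, tail=2):
    """Render a list of child-node names with the middle elided.

    Mirrors the proposed 'show_length' printing mode: short lists print
    in full, long lists show the first `head` and last `tail` entries.
    """
    if len(names) <= head + tail:
        lines = list(names)
    else:
        lines = list(names[:head]) + ["..."] + list(names[-tail:])
    return "\n".join("  " + n for n in lines)

children = [f"node_{i}" for i in range(100)]
print("<GroupNode>\n" + summarize_children(children))
```

The 'show_depth' mode would instead recurse into children with increasing indentation, truncating each level the same way.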
Server Error is returned when trying to push a data repo in the local test environment
After login, navigating to localhost:3000 and clicking log in on the UI, it creates an infinite loop between http://localhost:3000/profile and http://localhost:3000/oauth
On the Dev Environment setup for Ubuntu 16.04.3 on step 4, there is a file referenced that does not exist on the most up-to-date branch
source quilt_server/flask_dev.sh
Has anyone else noticed these issues, or are solutions available? I would like to create a private hub to store sensitive data via Docker and S3 but have been unable to work around these issues.
The _check_team_exists and get_registry_url functions use a hard-coded string template to build the registry URL rather than loading the registry_url from the config, preventing team logins when using a custom registry.
DataNode should be extended to allow access to the original file name via a .name
property.
It doesn't seem possible to get back the extension of the original file otherwise, and extensions matter for video files.
The metadata
in the JSON file seems to already contain this information.
Could be useful to be able to inspect the contents of a package at a particular point in time, say by passing the hash:
quilt inspect [USER]/[PKG] --hash [HASH]
More concretely:
I have a bunch of versions of the same package
hunan/data 2c2caeb3
hunan/data 4f7dec20
hunan/data 6aa0f6e2
hunan/data 7860bf0f
hunan/data a07a0847
hunan/data latest 0faf1a74
And would like to inspect the contents of, say, 4f7dec20.
Thanks
For data packages used in machine learning, it would be useful for Quilt build to support splitting inputs into fixed sets for model training, and validation. For structured data, the various sets (training, test and validation) could be children of a common parent so that the entire dataset is available (by calling _data on the parent). Thanks to @rhiever for the suggestion.
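One way this could work at build time is a deterministic shuffle-and-split over the input records, so that the same seed always yields the same training/validation/test children. A minimal sketch under those assumptions (split_dataset is a hypothetical helper, not a Quilt build feature):

```python
import random

def split_dataset(items, fractions=(0.8, 0.1, 0.1), seed=0):
    """Deterministically shuffle items and split them into train/validation/test lists."""
    assert abs(sum(fractions) - 1.0) < 1e-9, "fractions must sum to 1"
    items = list(items)
    # Seeded RNG so the split is reproducible across builds.
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * fractions[0])
    n_val = int(n * fractions[1])
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```

In the structured-data layout described above, the three returned lists would become sibling children of a common parent node, so calling _data on the parent still yields the full dataset.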
Would allow people to build custom navigation functions that traverse the package tree for them.
Example:
def get_associate(node):
    # Navigate two levels up, then down to an associated sibling node.
    return node._parent_node_._parent_node_.sub_node_1.data

node = dataset.initial.sub_node_0.data
associate = get_associate(node)
This is a simple example, but allowing this functionality means we can write very detailed custom navigation functions if we wanted to add more parameters and such to determine which associate to go to, etc.
I will try to dig through the GroupNode and DataNode creation code and add the attribute unless someone else gets it first.
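For illustration, here is how a parent back-reference could be attached as children are created. The Node class below is a toy stand-in, not Quilt's actual GroupNode/DataNode code; only the _parent_node_ attribute name comes from the proposal above:

```python
class Node:
    """Toy tree node that records a back-reference to its parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self._parent_node_ = parent

    def add_child(self, name):
        # The child both becomes an attribute and remembers its parent.
        child = Node(name, parent=self)
        setattr(self, name, child)
        return child

root = Node("dataset")
initial = root.add_child("initial")
leaf = initial.add_child("sub_node_0")
print(leaf._parent_node_._parent_node_.name)  # dataset
```

The cost is one extra attribute per node plus a reference cycle, which CPython's cycle collector handles, so the change should be cheap.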
Currently when I want to loop through a list of nodes contained in a parent node I have to do something like this:
for node_name in pkg.group_node._group_keys():
    actual_node = getattr(pkg.group_node, node_name)
    # do actual work with the node here
A nice touch would be to make GroupNode, and I guess PackageNode as well, work like iterators, so that looping yields the node itself instead of requiring a roundabout getattr.
for node in pkg.group_node:
# do work
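The request amounts to adding an __iter__ method that yields child nodes rather than their names. A toy sketch of the idea (this GroupNode is a stand-in, not the real Quilt class):

```python
class GroupNode:
    """Toy stand-in showing how iteration could yield child nodes directly."""

    def __init__(self, **children):
        self._children = children
        # Children are also reachable as attributes, as in Quilt.
        self.__dict__.update(children)

    def _group_keys(self):
        return list(self._children)

    def __iter__(self):
        # Yield the child nodes themselves, not just their names,
        # so `for node in group:` needs no getattr round-trip.
        return iter(self._children.values())

group = GroupNode(node_0="a", node_1="b")
for node in group:
    print(node)
```

Since the names remain available via _group_keys(), this would be a backwards-compatible addition.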
Are you planning to add support for Java? I only need it to download the data. In the meantime, is it possible to just download the uploaded package from its URL?
Hi,
When I pip install quilt on Windows 7, quilt requires pyarrow<0.8.0,>=0.4.0, but pyarrow only seems to support linux/mac/win-amd builds, while my machine is an Intel 64-bit Windows system.
Help!
Thanks~
Si