qri-io / qri Goto Github PK

View Code? Open in Web Editor NEW

1.1K 27.0 66.0 44.13 MB

you're invited to a data party!

Home Page: https://qri.io

License: GNU General Public License v3.0

Go 99.79% Makefile 0.07% Dockerfile 0.06% Shell 0.07%

golang service data-science ipfs p2p web3 opendata qri trust dataset

qri's Introduction

qri

Qri CLI

a dataset version control system built on the distributed web

Website | Packages | Contribute | Issues | Docs | Download

Welcome

Question	Answer
"I want to learn about Qri"	Read the official documentation
"I want to download Qri"	Download Qri or `brew install qri-io/qri/qri`
"I have a question"	Create an issue and use the label 'question'
"I found a bug"	Create an issue and use the label 'bug'
"I want to help build the Qri backend"	Read the Contributing guides
"I want to build Qri from source"	Build Qri from source

qri is a global dataset version control system built on the distributed web

Breaking that down:

global so that if anyone, anywhere has published work with the same or similar datasets, you can discover it.
Specific to datasets because data deserves purpose-built tools
version control to keep data in sync, attributing all changes to authors
On the distributed web to make all of the data published on qri simultaneously available, letting peers work on data together.

If you’re unfamiliar with version control, particularly the distributed kind, well you're probably viewing this document on github — which is a version control system intended for code. Its underlying technology – git – popularized some magic sauce that has inspired a generation of programmers and popularized concepts at the heart of the distributed web. Qri is applying that family of concepts to four common data problems:

Discovery Can I find data I’m looking for?
Trust Can I trust what I’ve found?
Friction Can I make this work with my other stuff?
Sync How do I handle changes in data?

Because qri is global and content-addressed, adding data to qri also checks the entire network to see if someone has added it before. Since qri is focused solely on datasets, it can provide meaningful search results. Every change on qri is associated with a peer, creating an audit-able trail you can use to quickly see what has changed and who has changed it. All datasets on qri are automatically described at the time of ingest using a flexible schema that makes data naturally inter-operate. Qri comes with tools to turn all datasets on the network into a JSON API with a single command. Finally, all changes in qri are tracked & synced.

Building From Source

To build qri you'll need the go programming language on your machine.

$ git clone https://github.com/qri-io/qri
$ cd qri
$ make install

If this is your first time building, this command will have a lot of output. That's good! Its means it's working :) It'll take a minute or two to build.

After this is done, there will be a new binary qri in your ~/go/bin directory if using go modules, and $GOPATH/bin directory otherwise. You should be able to run:

$ qri help

and see help output.

Building on Windows

To start, make sure that you have enabled Developer Mode. A library that we depend on needs it enabled in order to properly handle symlinks. If not done, you'll likely get the error message "A required privilege is not held by the client".

You should not need to Run As Administrator to build or run qri. We do not recommend using administrator to run qri.

Shell

For your shell, we recommend using msys2. Other shells, such as cmd, Powershell, or cygwin may also be usable, but msys2 makes it easy to install our required dependencies. IPFS also recommends msys2, and qri is built on top of IPFS.

Dependencies

Building depends upon having git and make installed. If using msys2, you can easily install these by using the package manager "pacman". In a shell, type:

pacman -S git make

Assuming you've also installed go using the official Windows installer linked above, you will also need to add go to your PATH by modifying your environment variable. See the next section on "Environment variables" for more information.

Due to how msys2 treats the PATH variable, you also need to add a new environment variable MSYS2_PATH_TYPE, with the value inherit, using the same procedure.

Once these steps are complete, proceed to building.

Building on Rasberry PI

On a Raspberry PI, you'll need to increase your swap file size in order to build. Normal desktop and server linux OSes should be fine to proceed to building.

One symptom of having not enough swap space is the go install command producing an error message ending with:

link: signal: killed

To increase your swapfile size, first turn off the swapfile:

sudo dphys-swapfile swapoff

Then edit /etc/dphys-swapfile as root and set CONF_SWAPSIZE to 1024.

Finally turn on the swapfile again:

sudo dphys-swapfile swapon

Otherwise linux machines with reduced memory will have other ways to increase their swap file sizes. Check documentation for your particular machine.

Packages

Qri is comprised of many specialized packages. Below you will find a summary of each package.

Package	Go Docs	Go Report Card	Description
`api`			user accessible layer, primarily made for communication with our frontend webapp
`cmd`			our command line interface
`config`			user configuration details, includes peer's profile
`lib`			takes arguments from the cmd and api layer and forms proper requests to call to the action layer
`p2p`			the peer to peer communication layer of qri
`repo`			the repository: saving, removing, and storing datasets, profiles, and the config
`dataset`			the blueprint for a dataset, the atoms that make up qri
`registry`			the blueprint for a registry: the service that allows profiles to be unique and datasets to be searchable
`starlib`			the starlark standard library available for qri transform scripts
`qfs`			"qri file sytem" is Qri's file system abstraction for getting & storing data from different sources
`ioes`			package to handle in, out, and error streams: gives us better control of where we send output and errors
`jsonschema`			used to describe the structure of a dataset, so we can validate datasets and determine dataset interop

Outside Libraries

The following packages are not under Qri, but are important dependencies, so we display their latest versions for convenience.

Package	Version
`ipfs`

This documentation has been adapted from the Cycle.js documentation.

qri's People

Contributors

Stargazers

Watchers

Forkers

dwojdak machawk1 nim23 mafm ebenp hexagon6 dustmop jonnycrunch renesugar leucotic crazcalm aborruso lanzafame il-dar songjiayang rahulsoibam tarsbase spencerx saromanov hhy5277 imanner szhorizon neuroradiology mbrukman dav009 andrew uniblockchain cuulee dolanor-galaxy awesome-p2p willemneal boomsquared helpful-bus saalejo sgammon hackerpilot advanderveer sptkl mr0grog mfdz varsmolta awesomegolang zhutony orblivion gsusin othello1111 jcardoz eltociear suzaku coding-to-music shotechs finhorsley daotlresearch baajarmeh isgasho salespaulo jonhoyc dityara11 shism2 abirdcfly mananal sumonst21 wahello iq-scm erhyfspubf

qri's Issues

naming tables should either throw an error when special characters are used or allow the use of quotes/an unescaping mechanism

for example
this works

qri init \
-f 8d582c89-2a55-4647-922e-a696fe89d908.csv \
-m 8d582c89-2a55-4647-922e-a696fe89d908.json \
-n 8d582c89-2a55-4647-922e-a696fe89d908

but then these fail

qri run "select * from 8d582c89-2a55-4647-922e-a696fe89d908"

qri run "select * from '8d582c89-2a55-4647-922e-a696fe89d908'"

Restore regular SQL syntax

prior version of qri used a strange variation on sql syntax for namespace purposes, need to restore normal sql syntax to the dataset_sql package

'ORDER BY' functionality

Qri ingest pipeline

This is a placeholder issue for thinking about building a robust data ingest pipeline for qri. Things we're interested in:

metadata type detection (eg. "That's a project-open-data file, or "that's a dcat file")
if we do get a recognized metadata type, validate that schema, warn user if invalid
attempt to uncover data download link and auto-acquire data from there
Harvard FITS-based content detection
checking for presence of the data on the network
solid generic fallbacks
we want to be able to retry things
we want a solid set of "Checkpoints" for UX, where users can interrupt the ingest process where sensible
configurable settings for batching, solid defaults for batching, output to logs for batching.
resumable batch import based on logs from previous imports
configurable network-based metadata cross-referencing
metadata title / other field based search / matching
transaction support, should be able to cancel a partially-completed process & have it revert to it's prior state

Resources

Think about library "acession" pipeline, this is well covered territory in the libraries space.

Outstanding questions

where does this run?
how does this integrate with the baseline qri init CLI function, if at all?
do we host this as a central service, but publish the results to the d web?
ML based metadata inference?

Refactored Namespace/Dataset reference functionality

With the paper refactor comes the removal of any globally-accepted notion of "repositories", and the namespace convention that came with it. While this may be reintroduced in the future, for now we need to provide users a plausible way to identify & work with data.

As a first proposal we'll into a concept of "datasets" to the CLI, which are to be thought of as the users's personal collection of datasets. Users can name these datasets whatever they please (so long as names don't overlap), and can add and remove a dataset as needed.

The cli should have a few commands to support:

qri search [query] -> Should search the network for datasets based on keywords or phrases, should display human-readable dataset info. For now this should just be based on a local registry, but display linked metadata.
qri dataset add [name] [resource hash] -> add a dataset to the user's current namespace
qri dataset remove [name] -> remove a dataset from the users current namespce

Investigate feasibility of running qri as an electron app

Need to check if electron gives us a method to call into go code.

[cli] test suite

Our CLI needs a test suite if we're going to avoid shipping regressions. We should set one up that operates out of os.TempDir()

Native integration with local IPFS node

qri should default to interacting with / creating a local IPFS node that it uses to resolve hashes over the network, and to add & pin content to the local node

Export & Download

We should be able to "export" from the CLI in raw data and package formats, we'll then get this to work on the frontend as well

map list of fields from data.gov to populate metadata

determine required fields (check with @b5 )
find available mappings to required fields
create metadata for sample
create metadata for all initial gov datasets

CRUD Dataset Metadata

Need Local capacity to edit the metadata of an existing dataset.

new command: qri ds update [name/hash] -m [metadata file] -f [dataset file]
expose this same functionality as an api command PUT /datasets/ipfs/:hash/metadata & PUT /datasets/:hash/data

Turn this repository into main repo for running a qri node

Fold New server code into this repository
Set up webapp communication
Get the whole thing to run locally with a networked ipfs node

add default no-save option for query execution

Currently all queries are pinned to the IPFS repo, we should make the default not save, and instead provide a --save flag in the CLI.

finish #29
adjust castore interface to accept a pin bool argument in the Put method
add a Pin method to castore interface
modify dataset_sql.Exec to not pin by default
a methods that wrap dataset_sql.Exec should listen to the "save" arg in the QueryExecOptions of a dataset and pin the dataset result hash

Need clear error message when IPFS daemon is already running

right now it errors with "IPFS lockfile blah blah blah", would be much nicer if it just said "hey are you already running an IPFS daemon?"

sql aggregate function support

From dataset_sql source:

// Aggregates is a map of all aggregate functions.
var Aggregates = map[string]bool{
	"avg":          true,
	"bit_and":      true,
	"bit_or":       true,
	"bit_xor":      true,
	"count":        true,
	"group_concat": true,
	"max":          true,
	"min":          true,
	"std":          true,
	"stddev_pop":   true,
	"stddev_samp":  true,
	"stddev":       true,
	"sum":          true,
	"var_pop":      true,
	"var_samp":     true,
	"variance":     true,
}

It'd be great if we could land support for these, as many are commonly-used functions that would make for solid demos

Query History Log

Once #30 lands, we should think about a qri queries command that shows a historical log of queries run. Let's construct the answer to #30 with this in mind.

[cli] can't qri init from outside of current working directory

add tsv support

Basic working skeleton function

To start with, let's just get an "optionless" skeleton function that works when you run qri run that will only do one thing, but prove the model & elucidate the path forward. It should:

Construct an IPFS node with networking deactivated, connected to an fsrepo as a backing store.
Add structured data to the IPFS repo, retrieving a hash
Build a resource using the resulting hash that properly describes the structured data
Construct a Query on that data
Add the Query to the repo, returning the hash
Execute the query
Add the resulting resource & structured data to the repo
Link the query hash to the resulting resource hash in a query lookup table
output the resulting data to the console.

Refactor Qri repository interface & format

repo package has fallen out of step with current thinking, need to have a big revisit on structure & interface.

Prepare an initial set of Default Datasets

qri should ship with default datasets in it's library for demonstration purposes

Initial qri data.gov ingest test

Initial Task list. This should be broken out into issues:

Start with a download of data.gov linked data
Filter for only direct references to csv files
Spot check metadata for those references
de-duplicate?
work with @b5 to understand what qri init does
figure out how to add metadata on qri-init: --meta flag
map list of fields from data.gov entries to qri metadata entries
determine size of ingest (for sanity's sake), possibly filtering out massive datasets
~~[ ] Reduce that set to only epa.gov (for now)~~
build a script that downloads and runs qri init against all identified resources

modifications to the dataset definition

add fields to dataset:

identifier
language (using iso 2 letter language codes--check if that is currently in use)

modifications to dataset:

change license to string
add theme object
add accessUrl, downloadUrl, remove generic 'url'

modifications to json at processsing time:

change keyword to list of strings using the name property of each object
format (check values, then to_lower)
remove any semicolons, whitespace

qri init interprets this incorrect flag as a valid name rather than throwing an error

forgot the second dash on name and ran
qri init -f survey_cleaned.csv -m meta_survey_cleaned.json -name researchTools
which named the dataset 'ame' from 'name' rather than telling me '-name' was an invalid flag

Dataset queries command

We need a way to show the queries that have been run on a given dataset. qri queries [dataset alias or hash] should list queries that have been asked of this dataset, showing the row & col count of their results.

queries shouldn't be abstracted on the way in

queryString is currently being stored in it's abstract form, need to improve on that.

Restore webapp

Need to bring webapp back up to speed post-refactor.

Search should also return results from local datasets as well as from peers

Basic p2p & local Dataset Search

We need a basic search feature for qri, this means first building in the infrastructure to do search�. Later on we'll actually work out sending the search terms themselves across the network or something, but for now keeping a deduplicated list of dataset references seems like a good idea. Dataset histories are going to mess with that a bunch, but we'll cross that bridge later.

p2p dataset list exchange
local dataset caching
regex-based dataset search
CLI-based results display

dataset.Dataset.Save

Things would be greatly simplified if we had a single save function for a dataset in the dataset package that accepted a castore as it's only argument. Let's do that.

refactor castore interface to actually reflect a content-addressed store
create an in-memory castore for testing purposes
refactor castore/ipfs to conform to the new interface
change dataset package to pointer references
add dataset.Dataset.Save method to dataset package, have it properly pull apart the dataset components into references

update to match new dataset, dataset_sql repos

Get codebase running against recently-merged dataset, dataset_sql repos

Need instructions for installing & building

Right now it's pretty difficult to download & build qri, we should:

map the steps to make construct a build
simplify that list if at all possible with things like shell scripts
add installation instructions to the readme

Revised Query Result hashes

Hash comparison of query results was lost in the refactor, need to get 'em back so we can dedupe queries.

Initial Resource Definition, Metadata, and Query registry

We need distributed lookup tables for datasets, metadata, and queries. This concept is currently under-developed, and needs to exist ASAP, so let's start by building a "local only" registry. This'll help think through the needs of the feature, while providing a way to demonstrate the CLI for now.

Once in place the query engine should check hash of query against the registry and avoid extra execution if a result is found, which will be a big win in-and-of itself.

IPFS filestore pin / unpin foreign hashes

Need to implement the Pinner interface on castore/ipfs

Structure datasets as pathable trees

currently castore just writes & pins all components of a datastore tree to the top level /ipfs/, that's silly. We should save datasets as everything in the tree except the data itself, which should be a plain 'ol IPFS path. This'll depend on landing qri-io/cafs#1 & qri-io/cafs#2

qri init should allow user-provided column descriptions

add support for sqlite3 db import

Document qri dataset definitions in relation to other

This page contains a nice example comparison / mapping between POD spec & others. We should do the same for qri's metadata spec.

https://project-open-data.cio.gov/v1.1/metadata-resources/

Refactored qri init

qri init used to be the way to run schema detection & validation on a dataset, and add datasets to their local namespace. We need to refactor this to work with the new "white paper refactored" code. qri init should still run validation & schema detection, but this time successful dataset initialization should add the dataset, resource def & metadata to the local IPFS node & broadcast it's existence onto some sort of distributed dataset registry.

Placeholder SQL fmt

Need a placeholder SQL fmt command that "formats" a parsed SQL AST.

meta issue to write an issue about frame.py

https://github.com/pandas-dev/pandas/blob/64c8a8d6fecacb796da8265ace870a4fcab98092/pandas/core/frame.py

make an issue outlining a side project for kasey/an engineer learning the qri codebase to get an understanding of the Pandas DataFrame implementation (represents 2D tabular data/functionally similar to SQL but with 'pythonic' and numpy-inspired syntax/data structures/conventions) and assess the interoperability of low level functions and data structures of the qri engine/ the difficulty or feasibility of making a python-to-golang wrapper or adapter.

Qri init --meta, or qri init --file

Need to be able to initialize a qri dataset with metadata, will be necessary to complete #13

[cli] pipe data directly into qri commands

it'd be nice to pipe data directly into the qri init function, and get this going as a general pattern these should work:

qri init < data.csv
qri run < query.sql
...
And so on

Happens when trying to search:
http://localhost:3000/search?p=movies&page=1&pageSize=100

This turned out to be qri had allowed something to be added to the dataset that wasn't a dataset.
Need better validation before allowing something to be added as a dataset

check for collisions

no-network flag

need a way to disable all networking for local testing purposes

[cli] ipfs init fails without error message when no arg is provided

(output)

osterbit Desktop $ qri init
panic: runtime error: index out of range

goroutine 1 [running]:
github.com/qri-io/qri/cmd.glob..func8(0x250ab20, 0x255aec0, 0x0, 0x0)
	/Users/b5/go/src/github.com/qri-io/qri/cmd/init.go:45 +0xcb5
github.com/spf13/cobra.(*Command).execute(0x250ab20, 0x255aec0, 0x0, 0x0, 0x250ab20, 0x255aec0)
	/Users/b5/go/src/github.com/spf13/cobra/command.go:651 +0x23a
github.com/spf13/cobra.(*Command).ExecuteC(0x250b3a0, 0x1, 0x0, 0x0)
	/Users/b5/go/src/github.com/spf13/cobra/command.go:726 +0x339
github.com/spf13/cobra.(*Command).Execute(0x250b3a0, 0x0, 0x11)
	/Users/b5/go/src/github.com/spf13/cobra/command.go:685 +0x2b
github.com/qri-io/qri/cmd.Execute()
	/Users/b5/go/src/github.com/qri-io/qri/cmd/root.go:47 +0x2d
main.main()
	/Users/b5/go/src/github.com/qri-io/qri/main.go:20 +0x20