dotmesh-io / dotmesh

dotmesh (dm) is like git for your data volumes (databases, files etc) in Docker and Kubernetes

Home Page: https://dotmesh.com

License: Apache License 2.0

Go 95.20% Shell 4.19% Makefile 0.19% JavaScript 0.04% Dockerfile 0.35% Starlark 0.03%

dotmesh's People

Contributors

alaric-dotmesh, binocarlos, danthebaker, deitch, dependabot[bot], lachie83, lukemarsden, prisamuel, renovate-bot, rusenask


dotmesh's Issues

dm push often panics

When invoked with wrong (or too few) arguments, dm push has a tendency to panic. Fix this by making it give sensible error messages instead.
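A sketch of the kind of up-front validation that would replace the panic; validatePushArgs and the exact usage strings are illustrative, not dotmesh's actual CLI code:

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// validatePushArgs checks arguments up front and returns a descriptive error,
// instead of letting a nil dereference panic deeper in the push path.
func validatePushArgs(args []string) (remote string, err error) {
	if len(args) < 1 {
		return "", errors.New("usage: dm push <remote> - no remote given")
	}
	if len(args) > 2 {
		return "", fmt.Errorf("usage: dm push <remote> [<volume>] - got %d arguments", len(args))
	}
	return args[0], nil
}

func main() {
	if _, err := validatePushArgs(os.Args[1:]); err != nil {
		fmt.Fprintln(os.Stderr, "Error:", err)
		os.Exit(1)
	}
}
```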

Ensure transfers (and any other long-term operations) are rename-safe

A volume could be renamed while transfers are in progress. This should be fine if, after setting up the transfer, everything is done in terms of volume UUIDs, but we're not sure if this is the case.

In general: resolve all volume names at the start of any operation, and use UUIDs thereafter, so that renames can't break in-progress operations.

In particular: audit the existing code to ensure this is already the case.
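The resolve-once pattern can be sketched like this (registry, resolve, and transfer are hypothetical stand-ins for dotmesh's internals):

```go
package main

import (
	"errors"
	"fmt"
)

// registry maps volume names to immutable UUIDs. The point of the sketch:
// resolve the name exactly once, then carry only the UUID, so a concurrent
// rename cannot change which volume the operation targets.
type registry struct {
	byName map[string]string // name -> UUID; renames mutate this map only
}

func (r *registry) resolve(name string) (string, error) {
	id, ok := r.byName[name]
	if !ok {
		return "", errors.New("no such volume: " + name)
	}
	return id, nil
}

// transfer takes a UUID, not a name: renames after this point are harmless.
func transfer(uuid string) string {
	return fmt.Sprintf("transferring filesystem %s", uuid)
}

func main() {
	r := &registry{byName: map[string]string{"mydata": "b21d5469"}}
	id, err := r.resolve("mydata")
	if err != nil {
		panic(err)
	}
	// A rename mid-transfer changes the name table but not our UUID handle.
	r.byName["renamed"] = id
	delete(r.byName, "mydata")
	fmt.Println(transfer(id))
}
```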

provide a kubectl one-liner for installing datamesh

problem: annoyingly, if you try to create an etcd cluster in the same breath as installing the etcd operator, it fails because the operator hasn't started yet

possible solution: use init containers to wait for the etcd operator to start up before trying to create an etcd cluster

downside: this means we need to bundle kubectl or some code which uses the kube API

but we need to do that anyway, because we're writing a controller...

alternative: just document creating the etcd cluster manually, make it less magic and more explicit?

datamesh can provide a bind-mount that isn't a zfs volume sometimes

luke@mashin-1:~$ docker run -ti -v stress-test:/foo --volume-driver dm ubuntu sh -c 'echo HELLO > /foo/WORLD'
luke@mashin-1:~$ logout
Connection to mashin-1 closed.
luke@cube:~$ while true;   do for X in 1 2 3 4; do     echo $X; ssh mashin-$X docker run -v stress-test:/foo       --volume-driver dm ubuntu cat /foo/WORLD;   done; done
1
HELLO
2
cat: /foo/WORLD: No such file or directory
3
cat: /foo/WORLD: No such file or directory
4
cat: /foo/WORLD: No such file or directory
1
HELLO
2
cat: /foo/WORLD: No such file or directory
3
cat: /foo/WORLD: No such file or directory
4
cat: /foo/WORLD: No such file or directory
1

It happens that stress-test pre-existed as a volume on the node, and so was somehow being mounted as an ext4 volume (i.e. a regular bind-mount from the host). Not sure how this happened; it shouldn't. But it was after manually blowing away /var/lib/docker/containers in an attempt to work around #23.

leaving this open.

[good first issue] Common go library for API tools

We should make a library that can be used from everywhere that needs it; it should be a proper Go SDK with the following properties:

  • Nicely object-oriented design, calling API methods on a "Dotmesh Cluster" object
  • Encapsulates the transfer logic, where a transfer request is set up and then its status polled (currently in cmd/dm/pkg); set up a go channel to push in a cancellation, and one which is used to push progress updates back.
  • Types used in the API (such as DotmeshVolume) should be defined in the client library and not duplicated elsewhere in any of our repos (in particular, the RPC server in cmd/datamesh-server/pkg/main/rpc.go should reference the definition from the go library as well).

Existing duplicated code needs to be refactored out into the library, so that we use the go library ourselves. Any code that contains the string DotmeshRPC is quite likely an API user (apart from cmd/datamesh-server/pkg/main/rpc.go...)
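A rough shape for such an SDK, with the cancellation and progress channels the issue describes; all names besides DotmeshVolume are illustrative:

```go
package main

import "fmt"

// DotmeshVolume would be defined here, in the client library, and referenced
// by the RPC server rather than duplicated.
type DotmeshVolume struct {
	Id   string
	Name string
}

// TransferUpdate is a hypothetical progress message pushed back to callers.
type TransferUpdate struct {
	Sent, Size int64
	Status     string
}

// Cluster is the "Dotmesh Cluster" object the issue describes; Transfer
// encapsulates the request-then-poll logic behind two channels.
type Cluster struct{ addr string }

func (c *Cluster) Transfer(cancel <-chan struct{}) (<-chan TransferUpdate, error) {
	updates := make(chan TransferUpdate)
	go func() {
		defer close(updates)
		for i := int64(1); i <= 3; i++ { // stand-in for the real poll loop
			select {
			case <-cancel:
				return
			case updates <- TransferUpdate{Sent: i * 100, Size: 300, Status: "running"}:
			}
		}
	}()
	return updates, nil
}

func main() {
	c := &Cluster{addr: "cloud.dotmesh.io"}
	updates, _ := c.Transfer(make(chan struct{}))
	for u := range updates {
		fmt.Printf("%d/%d %s\n", u.Sent, u.Size, u.Status)
	}
}
```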

startup time is slow when there are many snapshots

1599 snapshots in one filesystem yields a ~2 minute delay between /ux showing an empty list of volumes and it showing the complete list. Oddly, dm list starts responding sooner, but perhaps on a different node.

dm cluster reset breaks extant docker volumes

Uninstalling dotmesh with dm cluster reset while Docker volumes still exist that reference dotmesh volumes leaves Docker in a state where it repeatedly hangs for long periods, looking for the dm plugin with exponential backoff.

Make dm cluster reset enumerate Docker volumes and warn about removing the references before uninstalling dotmesh. It should refuse to proceed unless run with -f.
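The guard could look something like this; checkResetSafe is a hypothetical helper that would be fed the result of enumerating Docker volumes using the dm driver:

```go
package main

import "fmt"

// checkResetSafe refuses to reset while docker volumes still reference the
// dm driver, unless -f was passed.
func checkResetSafe(dmVolumes []string, force bool) error {
	if len(dmVolumes) == 0 || force {
		return nil
	}
	return fmt.Errorf(
		"refusing to reset: %d docker volume(s) still reference dotmesh (%v); "+
			"remove them or re-run with -f", len(dmVolumes), dmVolumes)
}

func main() {
	fmt.Println(checkResetSafe([]string{"mydata"}, false)) // refuses
	fmt.Println(checkResetSafe([]string{"mydata"}, true))  // proceeds
}
```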

commit metadata missing

When following the demo on the website, the commit metadata for the first commit is missing from datamesh cloud. Figure out why and fix it please 😄

maybe get rid of 'dm clone'?

dm clone might get confusing, because it means "pull to a new volume" and not "create a new branch/clone". For folks familiar with the concept of a filesystem clone, it will do the wrong thing.

maybe we could overload dm pull so that it works for both the pull (new commits in an existing volume) and clone (new volume) cases?

push seems to error a few times and then succeed

Tested pushing from macOS to cloud.datamesh.io:

2017/09/11 07:32:54 [updatePollResult] => /datamesh.io/filesystems/transfers/549cc00c-57bf-4ffc-4238-1c289e851eee, serialized: {"TransferRequestId":"549cc00c-57bf-4ffc-4238-1c289e851eee","Peer":"cloud.datamesh.io","User":"lukemarsden","ApiKey":"[redacted]","Direction":"push","LocalFilesystemName":"mydata","LocalCloneName":"","RemoteFilesystemName":"mydata","RemoteCloneName":"","FilesystemId":"b21d5469-0b91-4416-53fe-b0310263b94e","InitiatorNodeId":"3f1d7183af011a60","PeerNodeId":"","StartingSnapshot":"START","TargetSnapshot":"324aa0ad-048a-4470-41ab-ac6e97ffccdf","Index":1,"Total":1,"Status":"finished","NanosecondsElapsed":38859296692,"Size":221573656,"Sent":222119627,"Message":"Attempting to push b21d5469-0b91-4416-53fe-b0310263b94e got \u003cEvent error-pushing-posting: responseBody: Host is master for this filesystem (b21d5469-0b91-4416-53fe-b0310263b94e), can't write to it. State is backoff.\n, statusCode: 404, responseHeaders: map[Date:[Mon, 11 Sep 2017 07:32:00 GMT] Content-Length:[112] Content-Type:[text/plain; charset=utf-8]], requestURL: http://cloud.datamesh.io:6969/filesystems/b21d5469-0b91-4416-53fe-b0310263b94e/START/324aa0ad-048a-4470-41ab-ac6e97ffccdf\u003e"}

it looked like it worked, but there was maybe a 30 second delay after it finished pushing:

luke@starry:~$ dm push cloud
Calculating...
finished 211.31 MB / 211.31 MB [======================] 100.00% 5.45 MiB/s (1/1)
Done!

`dm cluster`: flags should get written to a config file

So that upgrades are seamless and you don't have to remember to pass e.g. --checkpoint-url, --pool-name, or any other flags.

And/or the config file should be updated to support everything currently doable with flags.

extend smoke tests to try pushing

To catch bugs in the macOS implementation that would otherwise remain hidden.

Push to the dothub (and then clean up) instead!

Also make the smoke tests run every hour.

invalid argument

Calculating...
Maximum retry attempts exceeded: &{error-from-send-nP %!s(*main.EventArgs=&map[err:0xc420533d80 output:internal error: Invalid argument

Suspected cause: too-new ZFS kernel modules on macOS, mismatching the ZFS binary bundled in the Docker image.

re-using volume names seems to break things

to reproduce, on macOS at least:

dm cluster init
docker run -v name:...
sudo dm cluster reset
dm cluster init
docker run -v name:...

The second invocation somehow seems to lose the reference; e.g. files written to the volume don't then show up as dirtying the filesystem.

EPIC: Kubernetes support

  • Install Datamesh on Kubernetes with a kubectl one-liner
  • Have Datamesh provide FlexVolume and Dynamic Provisioning
  • Figure out how to expose all Datamesh features through Kube YAML
  • StorageClass or another API object which provides automatic snapshots & replication

Design doc: here.

need to implement VolumeDriver.Get

luke@hackintosh:~/gocode/src/github.com/lukemarsden/datamesh$ docker start db
Error response from daemon: get mydata: VolumeDriver.Get: 404 page not found
Error: failed to start containers: db

also need to implement VolumeDriver.Capabilities
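A sketch of the two missing endpoints following Docker's volume plugin protocol (JSON POST bodies and responses); the mountpoints map stands in for real dotmesh state, and the "global" scope is an assumption:

```go
package main

import (
	"encoding/json"
	"net/http"
)

// Stand-in for real dotmesh volume state: name -> mountpoint.
var mountpoints = map[string]string{"mydata": "/var/dotmesh/mydata"}

type getResponse struct {
	Volume map[string]string
	Err    string
}

// getVolume is the pure core of VolumeDriver.Get, kept separate from the
// HTTP plumbing so it can be tested directly.
func getVolume(name string) getResponse {
	mnt, ok := mountpoints[name]
	if !ok {
		return getResponse{Err: "no such volume: " + name}
	}
	return getResponse{Volume: map[string]string{"Name": name, "Mountpoint": mnt}}
}

func handleGet(w http.ResponseWriter, r *http.Request) {
	var req struct{ Name string }
	_ = json.NewDecoder(r.Body).Decode(&req)
	_ = json.NewEncoder(w).Encode(getVolume(req.Name))
}

func handleCapabilities(w http.ResponseWriter, r *http.Request) {
	// "global" is an assumption here, on the grounds that dotmesh volumes
	// can move between nodes.
	_ = json.NewEncoder(w).Encode(map[string]map[string]string{
		"Capabilities": {"Scope": "global"},
	})
}

func main() {
	http.HandleFunc("/VolumeDriver.Get", handleGet)
	http.HandleFunc("/VolumeDriver.Capabilities", handleCapabilities)
	// http.ListenAndServe on the plugin socket would go here.
}
```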

resetting a cluster doesn't destroy zfs data somehow

docker run ... -v foo:/data busybox touch /data/foo
dm cluster reset
docker run ... -v foo:/data busybox ls -alh /data/foo

on macOS, at least, this shows a file where there should be none! dm cluster reset should totally wipe out the datamesh state.

dogfood datamesh

use datamesh to capture logs in elasticsearch and zipkin traces in mysql and prometheus metrics from acceptance test/soak test/stress test runs

and maybe snapshot/backup etcd. also the etcd behind discovery

dm switch enforces volume naming rules

but docker run doesn't

so:

$ sudo docker run -ti --volume-driver=dm -v β˜ƒ:/bar ubuntu bash
[...]
$ dm list
  VOLUME             SERVER             BRANCH      CONTAINERS
  β˜ƒ                  828634e156efc02c   newbranch
  foo                828634e156efc02c   newbranch
* monday             828634e156efc02c   newbranch
  monday@newbranch   828634e156efc02c   newbranch
$ dm switch β˜ƒ
Error: β˜ƒ is not a valid name

We could relax the restrictions on names in switch and other places that use it.
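Whatever rules we settle on, one shared validator used by both dm switch and the Docker plugin's Create path would stop the two front doors disagreeing; the character set below is illustrative, not dotmesh's actual rule:

```go
package main

import (
	"fmt"
	"regexp"
)

// validVolumeName is a single shared rule, consulted by every entry point
// that accepts a volume name. The exact pattern here is an assumption.
var validVolumeName = regexp.MustCompile(`^[a-zA-Z0-9][a-zA-Z0-9_.-]*$`)

func checkName(name string) error {
	if !validVolumeName.MatchString(name) {
		return fmt.Errorf("%s is not a valid name", name)
	}
	return nil
}

func main() {
	fmt.Println(checkName("☃"))   // rejected at docker run time, not later
	fmt.Println(checkName("foo")) // accepted
}
```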

Avoid the starting race condition when "dm" does a transfer

When a transfer is initiated, the Transfer RPC method injects a transfer request into etcd and returns. PollTransfer then proceeds to call GetTransfer in order to report progress to the user, but it often gets in before the transfer has started, causing this to happen:

Calculating...
Got error, trying again: Response '{"jsonrpc":"2.0","error":{"code":-32000,"message":"No such intercluster transfer 9586f07d-7a5a-4733-41ec-59f16725ca06","data":null},"id":8674665223082153551}
' yields error No such intercluster transfer 9586f07d-7a5a-4733-41ec-59f16725ca06
finished 9.50 KB / 9.50 KB  100.00% 3.15 MiB/s (1/1)
Done!

It would be nice if the error return from GetTransfer had an error type code, so we can distinguish "transfer does not exist" errors from others without needing to parse the string. Then we could make the PollTransfer loop work like this:

  1. Print "Calculating..."
  2. Loop until GetTransfer returns something other than "transfer doesn't exist", with a timeout.
  3. If the result of the final GetTransfer was success, then proceed with the loop as usual (but move the GetTransfer-then-sleep-for-1-second to the bottom of the loop, as we already have an initial GetTransfer result when entering the loop)
  4. If it was an error or a timeout, complain and die.

stale symlinks cause barf

luke@glow:~$ docker run -d -v mydata:/var/lib/mysql \
    --volume-driver=dm --name=db -e MYSQL_ROOT_PASSWORD=secret mysql

gives:
docker: Error response from daemon: symlink /var/lib/docker/datamesh/mnt/dmfs/d446e530-c475-499a-74fe-149b02c730af /var/datamesh/mydata: file exists. See 'docker run --help'.

retire dm cluster join

(old title: mint new credentials )

In dm cluster join, the certificates that were first generated on the server where dm cluster init was run are currently reused.

Instead, we should mint new certificates using the extant CA for two reasons:

  1. Any etcd nodes trying to proxy requests on to the cluster currently have a SPOF: the certificate all nodes are using only has the IP address of the first node.
  2. Security best practice says that different machines should have different certs. And not all machines need the root CA key.

Ideally, these certs should be minted in the discovery service to avoid handing out keys to the kingdom, but a good first step here would be to move to minting them join-side to fix the proxy-SPOF issue.

movement of data within clusters has no locking

node001$ docker run --name x -v foo:/...
node002$ docker run --name x -v foo:/... should fail (because the container is running on node001 and has acquired a lock)

also:
currently running the container in two places at once and then later trying to move it again causes cannot receive incremental stream: destination pool/dmfs/26c9a9ee-6486-4316-420f-2a51c6f8ae9e has been modified in the logs

Something must be off with the handoff logic; it should bail if it can't unmount the filesystem...

There is logic to detect (somewhat asynchronously) which Docker containers are using which volumes, but that's not the only way volumes are used: Procure and MountCommit are available in the API, but there's no API call to say you're finished with a mount. Ideally, Procure and MountCommit (even when called via the Docker volume plugin) should increment a reference count or similar tracking mechanism, and an explicit ReleaseMount API call should be invoked to release it.

The "lock" held by these calls should also inhibit incoming transfers; letting them happen is a race to see whether the transfer drops the new snapshots or the running workload touches a file to make it dirty first.

(atomically?) commit multiple dots at once

As a user of a distributed database using dotmesh for backend storage, I'd like to be able to take consistent snapshots of all the dots used by my database nodes.

"Atomicity" is undefined in a distributed environment; we probably just need to be "as quick as possible" at doing all the snapshots at once, but maybe some atomicity could be arranged by synchronously talking to the database cluster itself, requesting that it prepare a stable global state, waiting for it to confirm it's done so, snapshotting, then telling the DB it no longer needs to maintain that stable global state.
