
oneconcern / datamon


Datamon manages infinite reflections of data

License: MIT License

Go 91.13% Shell 5.82% Makefile 1.48% Dockerfile 0.91% HTML 0.16% Mustache 0.49%
data-management-platform

datamon's Introduction


Artificial Intelligence platform for Disasters


Branch Status

CircleCI build status badges for the master and staging branches.

Local Build Setup

  • Clone this repository
git clone git@github.com:acidjazz/oneconcern.git
  • Install dependencies
yarn install
  • Generate routes and lever job listings
yarn cash
yarn dev -o

Deployment

Continuous deployment is set up using the git-flow workflow with aeonian via CircleCI.

*** Edited 06/01/2018 by @zivagolee

datamon's People

Contributors

aakarshgupta97, casualjim, chandresh-pancholi, dependabot[bot], fredbi, joaoh821c, kerneltime, kostas-theo, ndamclean, ransomw1c



datamon's Issues

Circle CI integration

Problem: Pull request and branch health need to be evaluated automatically.
Success:

  1. All PRs for merging are evaluated via automated testing in line with other repos
  2. WIP PRs are not validated.

Git Commit Authenticity

Problem: We need to have an independent (from GitHub) mechanism to authenticate commits to the repo.
Success:

  1. All existing committers have moved to the new model.
  2. The new model is SaaS-independent: validation can be done locally without relying on a SaaS provider.
  3. The process to onboard new developers is documented.
  4. Validations are enforced on pull requests.
  5. Pull requests are the only way to push changes to any branch (forks are mandatory).

Docs for datamon

We need to articulate the current design and the way forward in the docs folder.

list all the models

Create a command, datamon model ls or datamon model list, to list all the models.
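
Below is a minimal sketch of how such a subcommand could be wired up, assuming a cobra-based CLI; the listModels helper and the sample model names are placeholders, not existing datamon APIs.

package main

import (
    "fmt"
    "os"

    "github.com/spf13/cobra"
)

// listModels is a hypothetical stand-in for a lookup against datamon's
// metadata store; the returned names are placeholders.
func listModels() ([]string, error) {
    return []string{"flood-model", "wind-model"}, nil
}

func main() {
    rootCmd := &cobra.Command{Use: "datamon"}

    modelCmd := &cobra.Command{
        Use:   "model",
        Short: "Commands to manage models",
    }

    modelListCmd := &cobra.Command{
        Use:     "list",
        Aliases: []string{"ls"}, // both `datamon model ls` and `datamon model list` work
        Short:   "List all the models",
        RunE: func(cmd *cobra.Command, args []string) error {
            models, err := listModels()
            if err != nil {
                return err
            }
            for _, m := range models {
                fmt.Println(m)
            }
            return nil
        },
    }

    modelCmd.AddCommand(modelListCmd)
    rootCmd.AddCommand(modelCmd)

    if err := rootCmd.Execute(); err != nil {
        os.Exit(1)
    }
}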

Automated system tests

  • CI/CD needs a way to run automated system tests.
  • Tests need to be written for existing functionality.

Code refactor: Extensible storage

Currently the definition of a bundle is limited to a fuse-mounted FS backed by CAFS stored in an object store. In the future a bundle could be a volume snapshot, a shared-FS snapshot, or a DB snapshot.
The CSI layer can also be made more generic, allowing for mounting and unmounting varying backends.

Remove code that is not in play for MVP

The direction of Datamon is slightly different from the one outlined in the current code.
Also, following the model that master is always shippable and code is "done done", a lot of the code currently in play needs to be archived and removed from the master branch.

Bulk import into datamon

The bulk import use case should:

  1. Authenticate user
  2. Use datamon creds to upload (daemon in k8s?)
  3. Efficiently upload data.
  4. Provide a way to describe the mapping from source path to target repo

Zip files from glob of content

Issue:
The input file has a content array of globs. We need to zip all the matching files and directories.

datamon deploy function --from-file <example.yaml>

Steps

  • Datamon reads the content attribute from example.yaml and zips all the matching files.

  • Example of the content globs from example.yaml:

content:
  - vendor/*
  - scripts/*
  - bin/*
  - requirements.txt
  - app.py

Success Criteria

  • Successfully zip all files matched by the given globs (see the sketch after this list)

  • Unit tests

  • Corner cases, e.g. a check for zero matching files

  • Proper logging statements; Prometheus custom metrics for successfully and unsuccessfully uploaded files
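
Below is a minimal sketch of the zipping step in Go, assuming the content globs have already been parsed out of example.yaml; paths are preserved relative to the working directory and error handling is kept brief.

package main

import (
    "archive/zip"
    "fmt"
    "io"
    "os"
    "path/filepath"
)

// zipGlobs writes every file matched by the given glob patterns into a zip
// archive at dst, preserving relative paths. Directories are skipped; only
// regular files are archived.
func zipGlobs(dst string, globs []string) error {
    out, err := os.Create(dst)
    if err != nil {
        return err
    }
    defer out.Close()

    zw := zip.NewWriter(out)
    defer zw.Close()

    total := 0
    for _, pattern := range globs {
        matches, err := filepath.Glob(pattern)
        if err != nil {
            return err
        }
        for _, m := range matches {
            info, err := os.Stat(m)
            if err != nil {
                return err
            }
            if info.IsDir() {
                continue
            }
            w, err := zw.Create(filepath.ToSlash(m)) // preserve the relative path
            if err != nil {
                return err
            }
            f, err := os.Open(m)
            if err != nil {
                return err
            }
            if _, err := io.Copy(w, f); err != nil {
                f.Close()
                return err
            }
            f.Close()
            total++
        }
    }
    if total == 0 {
        // corner case from the success criteria: no files matched the globs
        return fmt.Errorf("no files matched the given globs")
    }
    return nil
}

func main() {
    globs := []string{"vendor/*", "scripts/*", "bin/*", "requirements.txt", "app.py"}
    if err := zipGlobs("bundle.zip", globs); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}

Logging statements and Prometheus counters from the success criteria would hang off the same loop.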

Custom runtime to execute functions

Developers can get help from the operations team to create custom runtimes to run their functions.
The MVP is going to use the default Kubeless runtimes.

Supported Runtimes are:

python2.7, python3.4, python3.6, nodejs6, nodejs8, ruby2.4, php7.2, go1.10, dotnetcore2.0, java1.8, ballerina0.981.0

Move blobs to their own bucket

Adding a prefix to cafs blobs can cause performance issues when large numbers of blobs are uploaded. Since the cafs blobs can be generated in parallel and at scale, it makes sense to move blobs to their own bucket.

Add licensing details

Need to

  1. Choose license
  2. Complete formalities if any
  3. Add details to GitHub and to Contributing.md

End to end checksum and bit rot.

For upload and download of files we need to ensure we are setting up checksums correctly and detecting any bit rot that might occur.
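
A minimal sketch of a streaming integrity check on download, assuming SHA-256 is the checksum recorded at upload time (datamon's actual hash choice may differ); the same TeeReader pattern works in the upload direction.

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "io"
    "os"
)

// verifiedCopy streams src into dst while hashing the bytes, then compares
// the digest against the checksum recorded at upload time. A mismatch means
// corruption in transit or bit rot at rest.
func verifiedCopy(dst io.Writer, src io.Reader, wantHex string) error {
    h := sha256.New()
    // TeeReader hashes every byte as it is copied to the destination.
    if _, err := io.Copy(dst, io.TeeReader(src, h)); err != nil {
        return err
    }
    got := hex.EncodeToString(h.Sum(nil))
    if got != wantHex {
        return fmt.Errorf("checksum mismatch: got %s, want %s", got, wantHex)
    }
    return nil
}

func main() {
    // Example: verify a local file against a known digest (placeholder value).
    f, err := os.Open("bundle.zip")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    defer f.Close()
    if err := verifiedCopy(io.Discard, f, "expected-hex-digest-here"); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}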

Tracking metrics

We need metrics organized by:

Per Run:

  • Time to download
  • Amount downloaded
  • Size of input and output repo

Datamon-wide numbers:

Run to Bundle generation integration design

At the end of a function run, output data needs to be written to a new bundle. There are some open questions we need to address:

  1. In an ongoing function, what is the handshake to indicate a bundle is done?
  2. When creating the run.json, where should the metadata for the run come from?
  3. How is the output folder maintained?

This could all be done in the filter layer in the kubeless runtime that runs the functions. It can call the datamon CLI to create a new bundle, extract the metadata, and move the output folder aside to provide a fresh output folder (or back up a snapshot of a DB if there is no data output folder).

Upload archive files to S3

Issue
After zipping/archiving the globbed files, push the archive to Amazon S3.

Once datamon zips all the files and directories, it pushes the archive to S3 for storage and code execution.

Success Criteria

  • Successfully upload the archive to S3

  • Unit tests

  • Prometheus metrics such as file size, 99th-percentile latency, and success/failure counts (an upload sketch follows below)
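
A minimal sketch of the upload step using the AWS SDK for Go's s3manager uploader; the region and key layout are assumptions, the bucket name is the dev bucket mentioned elsewhere in these issues, and the metrics hooks are omitted.

package main

import (
    "fmt"
    "os"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3/s3manager"
)

// uploadArchive pushes a local zip archive to S3. s3manager handles
// multipart uploads for large archives automatically.
func uploadArchive(bucket, key, path string) error {
    f, err := os.Open(path)
    if err != nil {
        return err
    }
    defer f.Close()

    // Region is an assumption for illustration only.
    sess, err := session.NewSession(&aws.Config{Region: aws.String("us-west-2")})
    if err != nil {
        return err
    }
    uploader := s3manager.NewUploader(sess)

    _, err = uploader.Upload(&s3manager.UploadInput{
        Bucket: aws.String(bucket),
        Key:    aws.String(key),
        Body:   f,
    })
    return err
}

func main() {
    // Key layout is a placeholder.
    if err := uploadArchive("oneconcern-datamon-dev", "archives/bundle.zip", "bundle.zip"); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}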

Implement the Merkle tree for a commit

Each file is described by a hash, but the entire set of file hashes needs to be hashed as well, effectively giving a Merkle tree. This helps calculate changes between commits.
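
A minimal sketch of rolling the per-file hashes up into a single root, using SHA-256 over a sorted leaf list; datamon's actual hash function and tree layout may differ.

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "sort"
)

// merkleRoot reduces a set of leaf hashes to a single root hash by pairwise
// hashing level by level. An odd leaf at the end of a level is promoted
// unchanged. Sorting the leaves first makes the root independent of the
// order in which files were listed.
func merkleRoot(leaves [][]byte) []byte {
    if len(leaves) == 0 {
        return nil
    }
    sort.Slice(leaves, func(i, j int) bool {
        return string(leaves[i]) < string(leaves[j])
    })
    level := leaves
    for len(level) > 1 {
        var next [][]byte
        for i := 0; i < len(level); i += 2 {
            if i+1 == len(level) {
                next = append(next, level[i])
                continue
            }
            h := sha256.Sum256(append(append([]byte{}, level[i]...), level[i+1]...))
            next = append(next, h[:])
        }
        level = next
    }
    return level[0]
}

func main() {
    // Leaf hashes would normally be the per-file content hashes of a commit.
    files := []string{"data/a.csv", "data/b.csv"}
    var leaves [][]byte
    for _, f := range files {
        h := sha256.Sum256([]byte(f)) // placeholder: hash of the path, not real content
        leaves = append(leaves, h[:])
    }
    fmt.Println(hex.EncodeToString(merkleRoot(leaves)))
}

Two commits with equal roots hold identical content; differing roots can be narrowed down to the changed files by walking the trees level by level.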

Implement upload of a net-new / standalone bundle

Implement bundle upload

  1. The bundle is a net-new or standalone bundle
  2. The bundle is in a default master branch
  3. A run description can be attached as a JSON file

Non Goals:

  1. Caching
  2. Daemon set to monitor runs (once the runtime is nailed down, the best spot for creating a bundle can be finalized and implemented)

Improvements to Key enumeration of store.go

The localfs implementation needs to:

  • Walk the key space more efficiently (faster is better)

  • Allow for pagination in the API, to walk an arbitrarily large number of files

  • Support prefix-based key enumeration (see the interface sketch below)
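
A possible interface shape for the points above, as a sketch only; the type names and page-token scheme are not existing datamon APIs. A localfs implementation could back this with filepath.WalkDir, stopping after pageSize entries and using the last path seen as the continuation token.

package storage

import "context"

// KeyPage is one page of keys plus an opaque continuation token; an empty
// Next token means enumeration is complete.
type KeyPage struct {
    Keys []string
    Next string
}

// KeyEnumerator extends a store with paginated, prefix-filtered key listing,
// so callers can walk arbitrarily large key spaces without holding all keys
// in memory.
type KeyEnumerator interface {
    // KeysWithPrefix returns up to pageSize keys starting at the position
    // encoded by token (an empty token starts from the beginning).
    KeysWithPrefix(ctx context.Context, prefix, token string, pageSize int) (KeyPage, error)
}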

cafs/store: Duplicate blob handling

  • The store needs a predefined error for putting an object that already exists when the precondition requires it to be a new upload.
    CAFS needs to treat this error as a successful case, since the blob is already present (see the sketch below).
  • For duplicate zero-byte blobs, avoid creating an object.
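
A minimal sketch of the sentinel-error handling; ErrObjectExists and the Store interface are hypothetical shapes, not the real datamon/cafs types.

package cafs

import (
    "context"
    "errors"
    "io"
)

// ErrObjectExists is the predefined error a store returns when a Put with a
// "must be new" precondition hits an object that is already present.
var ErrObjectExists = errors.New("store: object already exists")

// Store is the minimal blob-store surface needed here (hypothetical shape).
type Store interface {
    Put(ctx context.Context, key string, body io.Reader) error
}

// putBlob uploads a content-addressed blob. Because keys are derived from
// content, an "already exists" error means the exact same bytes are already
// stored, so CAFS treats it as success (deduplication, not failure).
func putBlob(ctx context.Context, s Store, key string, body io.Reader, size int64) error {
    if size == 0 {
        // For zero-byte blobs, skip creating an object at all.
        return nil
    }
    err := s.Put(ctx, key, body)
    if errors.Is(err, ErrObjectExists) {
        return nil
    }
    return err
}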

High level epic for what is coming next.

Datamon plan going forward.

This is the result of re-evaluating what we should focus on. The list is not sorted by priority or necessity; it is a summary of some of the options ahead.

Goals

  1. Move Flood off EFS and onto GKE
    1. Save cost
    2. Bring insight into the pipeline
    3. Launch datamon
  2. Allow Jupyter notebooks to access Datamon bundles
    1. Allow more convenient access to data sets

Rough tasks

  1. Provide a way for a pod to consume a datamon repo:commit and for the pod to generate a repo:commit
  2. The adoption of datamon should be simple and straightforward. A couple of options:
    1. Model the repo:commit as a volume and write a CSI plugin for datamon
      1. Benefit: Fits well with the Kubernetes storage management model and will work with vanilla pods as well as Argo-based deployments
    2. Custom runtime management of volume mounts. Based on the execution within a pod, mount or unmount commits
      1. Benefits: Works well for kubeless-style long-duration pods that use different commits.
      2. If the mounts and unmounts are initiated from outside of a pod, datamon will have to ship a daemon set that monitors the state of a pod and decides when to initialize new mounts and when to create new commits.
  3. Fuse based mounts vs. bulk download and upload
    1. Allows for low time to first byte.
    2. View of the data can be changed dynamically for the same mount
    3. Better insight into actual usage of data
  4. Address diamond-style scale-out of workflows. A stage in the pipeline might have multiple independent pods that generate data for the same commit. This is doable within Datamon without writing a full-fledged distributed filesystem, as long as the data written by each pod is part of the same commit but isolated from the others. The data written can be aggregated post-run into a single commit.
  5. Add a developer friendly CLI that will allow Jupyter notebooks to work.
    1. TBD: What it takes to integrate with Jupyter
  6. Caching model and scheduler coordination
    1. Customize scheduling of pods based on locality of caching, or
    2. Mechanism to have a lower latency cache within a k8s cluster (shared volumes mounted as read only volumes).

Caching scheme

Blobs for the CAFS should be cached. There are many aspects to caching that we need to decide on:

  • In GKE, is caching done via a disk attached to each node, or is there a common caching disk that a daemon populates on request?
  • When are blobs evicted from the cache?
    • Access time?

Document Design and define MVP in README

Document the current design for Datamon and define the MVP.

Exit Criteria:

  1. Interested developers review the document
  2. Nomenclature, user experience and workflows are finalized

Goals:

  1. Datamon is put to use with a user experience that is expected to be the final user experience.

Non Goals:

  1. MVP is representative of all the parts that will eventually be built out on the backend.

Permission for developers to customise the S3 bucket name

Currently, we are using the "oneconcern-datamon-dev" S3 bucket to publish all the zip files. It would be great if we gave this power to the developers so they can customise where they want to push their code archive.

What are your thoughts?

Data path improvements

For the MVP the only implementation done is a bulk download of chunks and creation of a filesystem.
Going forward, we need to decide if a more stream-friendly implementation is needed or if other optimizations in the data path will better serve the overall performance.

Decide policy around runtime environment

When running datamon, the source and target information needs to be populated.
Source:

  • Bucket for datamon
  • Repo
  • Commit/Bundle

Destination:

  • Bucket
  • Repo
  • Committer's meta information

Depending on the meta information, it can be extracted from the pod YAML or fed in via config files to the pieces of datamon (a possible config shape is sketched below).
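
A sketch of what the source/destination configuration could look like as a Go struct loaded from a config file; all field names here are illustrative assumptions.

package config

// RuntimeConfig captures the source and destination information datamon
// needs at run time. It could be unmarshalled from a config file or
// populated from the pod spec (field names are illustrative only).
type RuntimeConfig struct {
    Source struct {
        Bucket string `yaml:"bucket"` // bucket for datamon
        Repo   string `yaml:"repo"`
        Bundle string `yaml:"bundle"` // commit/bundle to consume
    } `yaml:"source"`
    Destination struct {
        Bucket    string `yaml:"bucket"`
        Repo      string `yaml:"repo"`
        Committer struct {
            Name  string `yaml:"name"`
            Email string `yaml:"email"`
        } `yaml:"committer"`
    } `yaml:"destination"`
}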

Implement bundle download

Implement the Init Container that will

  1. Use the configured bucket:repo:branch:bundle to download the bundle
    1. Folders need to be on the same filesystem and need to be mounted as a single mount
    2. The small files are direct linked to the blobs
    3. Larger files need to be copied over post download.
  2. Configurable path to cache of blobs
  3. Configurable path to the directory to dump the bundle
  4. Use Google Object Store with credentials for the pod
    1. https://cloud.google.com/kubernetes-engine/docs/tutorials/authenticating-to-cloud-platform
  5. Container image published to the gcr registry via an automated process from GitHub
    1. Via Release mechanism?

Non Goals:

  • Speeding up downloads
  • Tracking request processing back to Google bucket logs
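
A minimal sketch of the object download step (item 4 above) using the Google Cloud Storage client, which picks up the pod's credentials via Application Default Credentials; bucket and object names are placeholders, and the blob cache and small/large file handling are omitted.

package main

import (
    "context"
    "fmt"
    "io"
    "os"

    "cloud.google.com/go/storage"
)

// downloadObject copies one object from a GCS bucket to a local path.
// Credentials come from the environment (Application Default Credentials),
// which is how a pod on GKE would authenticate.
func downloadObject(ctx context.Context, bucket, object, dst string) error {
    client, err := storage.NewClient(ctx)
    if err != nil {
        return err
    }
    defer client.Close()

    r, err := client.Bucket(bucket).Object(object).NewReader(ctx)
    if err != nil {
        return err
    }
    defer r.Close()

    out, err := os.Create(dst)
    if err != nil {
        return err
    }
    defer out.Close()

    _, err = io.Copy(out, r)
    return err
}

func main() {
    // Placeholder bucket/object names for illustration only.
    if err := downloadObject(context.Background(), "datamon-bundles", "repo/branch/bundle.json", "/cache/bundle.json"); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}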

Run repo support

Model repo needs to store

  1. Metadata about the model run
  2. Ability to reproduce the container runtime
    1. Store the entire container image
    2. Store enough Metadata to reliably reconstruct the container image

Create config parser for content and entrypoint

User Story: A developer in the flood repo can do the following:

  1. Define a YAML file, as specced below, to describe the code needed and the command to run.
# each entry is a glob pattern to select files you want included
# the paths will be preserved
content:
  - vendor/*
  - scripts/*
  - bin/*
  - requirements.txt
  - app.py
# the command to run
command:
  - python
  - app.py
  2. The developer can run datamon model deploy --config path-to-config-file --context rootpath, which will create a package (zip file, or some other format) based on the YAML and additional logic. The --context flag can be omitted, in which case the current directory is used as the root to look up relative paths in the config file.
  3. The package can be deployed onto kubeless.
  4. The script can be named datamon deploy, as this functionality will eventually be integrated into the datamon CLI. This can be how it is implemented for this issue, or it can be a standalone script for now (a config-parsing sketch follows below).
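
A minimal sketch of parsing the spec above into Go structs with gopkg.in/yaml.v2; the struct names are placeholders, and validation that the globbed files exist would follow.

package main

import (
    "fmt"
    "os"

    yaml "gopkg.in/yaml.v2"
)

// DeployConfig mirrors the YAML spec above: a list of glob patterns for the
// files to package, and the command to run.
type DeployConfig struct {
    Content []string `yaml:"content"`
    Command []string `yaml:"command"`
}

// loadConfig reads and unmarshals the config file given to --config.
func loadConfig(path string) (*DeployConfig, error) {
    data, err := os.ReadFile(path)
    if err != nil {
        return nil, err
    }
    var cfg DeployConfig
    if err := yaml.Unmarshal(data, &cfg); err != nil {
        return nil, err
    }
    if len(cfg.Content) == 0 || len(cfg.Command) == 0 {
        return nil, fmt.Errorf("%s: both content and command must be set", path)
    }
    return &cfg, nil
}

func main() {
    cfg, err := loadConfig("example.yaml")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Printf("packaging %d glob(s), entrypoint: %v\n", len(cfg.Content), cfg.Command)
}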

Exit criteria:

  1. Documentation for developers (flood) for how to use this script
  2. Functioning command in datamon
  3. One sample hello world app that is implemented using the script on kubeless.
  4. Validations are run
    1. All files defined in the YAML are present
    2. Clear error messages that are helpful to developers
    3. Deployment is successful in kubeless; otherwise an appropriate error is reported

Goals:
Developers can start using this script to deploy their apps written to the spec defined here, and their experience will be more or less identical once the datamon backend is ready.

Non Goals:

  1. Functioning datamon backend

Caching of cafs blobs

A caching layer needs to be added to CAFS.
This can be done externally to CAFS, with CAFS providing an interface to interact with it, or internally within CAFS (see the sketch below).
This will be needed for the fuse-based RW FS.
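
A sketch of how an external read-through cache could wrap the blob store without changing CAFS itself; the interface names are hypothetical, and a real implementation would bound the cache size and evict entries.

package cafs

import (
    "context"
    "sync"
)

// BlobStore is the minimal read surface assumed here (hypothetical shape).
type BlobStore interface {
    Get(ctx context.Context, key string) ([]byte, error)
}

// cachedStore is a read-through cache in front of a BlobStore: hits are
// served from memory, misses are fetched from the backing store and kept.
type cachedStore struct {
    backend BlobStore
    mu      sync.Mutex
    blobs   map[string][]byte
}

// NewCachedStore wraps a backing store with the in-memory cache.
func NewCachedStore(backend BlobStore) BlobStore {
    return &cachedStore{backend: backend, blobs: make(map[string][]byte)}
}

func (c *cachedStore) Get(ctx context.Context, key string) ([]byte, error) {
    c.mu.Lock()
    if b, ok := c.blobs[key]; ok {
        c.mu.Unlock()
        return b, nil
    }
    c.mu.Unlock()

    b, err := c.backend.Get(ctx, key)
    if err != nil {
        return nil, err
    }
    c.mu.Lock()
    c.blobs[key] = b
    c.mu.Unlock()
    return b, nil
}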

Trigger Specification

The ability to create a new event/webhook needs to be fleshed out:

  1. K8S CRD spec
  2. Datamon metadata for an event
  3. Registering to other events to be dependent on them.

Look into the performance of the Google Cloud SDK and our usage of it

We need to study the CPU spent on buffer management. This is pre-GA. Though the latency numbers for datamon will not be heavily impacted, the CPU spent could be reduced by improving the overall buffer management. This might mean we have to write our own REST client. Also, for large objects we need to be sure that the SDK, or our use of it, is not bringing the entire object into memory.
Filing this now, as I spent some time digging into the Google Cloud SDK and was not quite happy with the thought put into it.

Event tree

The entire tree of events, and the dependencies between events and data repos, needs to be presentable in order to:

  1. Track the end-to-end flow of an event occurring
  2. Better inspect the pipeline

This can be done via integration with Jaeger/Istio.
Also, all past runs can be inspected to know the $ cost of a run.
