
oneconcern / datamon


Datamon manages infinite reflections of data

License: MIT License

Go 91.13% Shell 5.82% Makefile 1.48% Dockerfile 0.91% HTML 0.16% Mustache 0.49%
data-management-platform

datamon's Introduction


Artificial Intelligence platform for Disasters


Branch Status

CircleCI build status badges for the master and staging branches.

Local Build Setup

  • Clone this repository
git clone git@github.com:acidjazz/oneconcern.git
  • Install dependencies
yarn install
  • Generate routes and lever job listings
yarn cash
yarn dev -o

Deployment

Continuous deployment is set up using the git-flow workflow with aeonian via CircleCI.

*** Edited 06/01/2018 by @zivagolee

datamon's People

Contributors

aakarshgupta97, casualjim, chandresh-pancholi, dependabot[bot], fredbi, joaoh821c, kerneltime, kostas-theo, ndamclean, ransomw1c



datamon's Issues

Circle CI integration

Problem: Pull request and branch health need to be evaluated automatically.
Success:

  1. All PRs for merging are evaluated via automated testing in line with other repos
  2. WIP PRs are not validated.

Git Commit Authenticity

Problem: We need to have an independent (from GitHub) mechanism to authenticate commits to the repo.
Success:

  1. All existing committers have moved to the new model.
  2. The new model is SaaS-independent: validation can be done locally without relying on a SaaS provider.
  3. The process to onboard new developers is documented.
  4. Validations are enforced on pull requests.
  5. Pull requests are the only way to push changes to any branch (forks are mandatory).

Docs for datamon

We need to articulate the current design and the way forward in the docs folder.

list all the models

Create a command, datamon model ls or datamon model list, to list all the models.
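
Below is a minimal sketch of how such a subcommand could be wired up, assuming a cobra-based CLI; the listModels helper and the sample model names are placeholders, not existing datamon APIs.

package main

import (
    "fmt"
    "os"

    "github.com/spf13/cobra"
)

// listModels is a hypothetical stand-in for a lookup against datamon's
// metadata store; the returned names are placeholders.
func listModels() ([]string, error) {
    return []string{"flood-model", "wind-model"}, nil
}

func main() {
    rootCmd := &cobra.Command{Use: "datamon"}

    modelCmd := &cobra.Command{
        Use:   "model",
        Short: "Commands to manage models",
    }

    modelListCmd := &cobra.Command{
        Use:     "list",
        Aliases: []string{"ls"}, // both `datamon model ls` and `datamon model list` work
        Short:   "List all the models",
        RunE: func(cmd *cobra.Command, args []string) error {
            models, err := listModels()
            if err != nil {
                return err
            }
            for _, m := range models {
                fmt.Println(m)
            }
            return nil
        },
    }

    modelCmd.AddCommand(modelListCmd)
    rootCmd.AddCommand(modelCmd)

    if err := rootCmd.Execute(); err != nil {
        os.Exit(1)
    }
}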

Automated system tests

  • CI/CD needs a way to run automated system tests.
  • Tests need to be written for existing functionality.

Code refactor: Extensible storage

Currently the definition of a bundle is limited to a fuse-mounted FS backed by CAFS stored in an object store. In the future a bundle could be a volume snapshot, a shared-FS snapshot, or a DB snapshot.
The CSI layer can also be made more generic, allowing for mounting and unmounting varying backends.

Remove code that is not in play for MVP

The direction of Datamon is slightly different from the one outlined in the current code.
Also, following the model that master is always shippable and code is "done done", a lot of the code currently in play needs to be archived and removed from the master branch.

Bulk import into datamon

The bulk import use case should:

  1. Authenticate user
  2. Use datamon creds to upload (daemon in k8s?)
  3. Efficiently upload data.
  4. Provide a way to describe the mapping from source path to target repo

Zip files from glob of content

Issue:
The input file has a content array of globs. We need to zip all the matching files and directories.

datamon deploy function --from-file <example.yaml>

Steps

  • Datamon reads the content attribute from example.yaml and zips all the matching files.

  • Example of the content globs from example.yaml:

content:
  - vendor/*
  - scripts/*
  - bin/*
  - requirements.txt
  - app.py

Success Criteria

  • Successfully zip all files matched by the given globs (see the sketch after this list)

  • Unit tests

  • Corner cases, e.g. a check for zero matching files

  • Proper logging statements; Prometheus custom metrics for successfully and unsuccessfully uploaded files
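
Below is a minimal sketch of the zipping step in Go, assuming the content globs have already been parsed out of example.yaml; paths are preserved relative to the working directory and error handling is kept brief.

package main

import (
    "archive/zip"
    "fmt"
    "io"
    "os"
    "path/filepath"
)

// zipGlobs writes every file matched by the given glob patterns into a zip
// archive at dst, preserving relative paths. Directories are skipped; only
// regular files are archived.
func zipGlobs(dst string, globs []string) error {
    out, err := os.Create(dst)
    if err != nil {
        return err
    }
    defer out.Close()

    zw := zip.NewWriter(out)
    defer zw.Close()

    total := 0
    for _, pattern := range globs {
        matches, err := filepath.Glob(pattern)
        if err != nil {
            return err
        }
        for _, m := range matches {
            info, err := os.Stat(m)
            if err != nil {
                return err
            }
            if info.IsDir() {
                continue
            }
            w, err := zw.Create(filepath.ToSlash(m)) // preserve the relative path
            if err != nil {
                return err
            }
            f, err := os.Open(m)
            if err != nil {
                return err
            }
            if _, err := io.Copy(w, f); err != nil {
                f.Close()
                return err
            }
            f.Close()
            total++
        }
    }
    if total == 0 {
        // corner case from the success criteria: no files matched the globs
        return fmt.Errorf("no files matched the given globs")
    }
    return nil
}

func main() {
    globs := []string{"vendor/*", "scripts/*", "bin/*", "requirements.txt", "app.py"}
    if err := zipGlobs("bundle.zip", globs); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}

Logging statements and Prometheus counters from the success criteria would hang off the same loop.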

Custom runtime to execute functions

Developers can get help from the operations team to create custom runtimes to run their functions.
The MVP is going to use the default Kubeless runtimes.

Supported Runtimes are:

python2.7, python3.4, python3.6, nodejs6, nodejs8, ruby2.4, php7.2, go1.10, dotnetcore2.0, java1.8, ballerina0.981.0

Move blobs to their own bucket

Adding a prefix to cafs blobs can cause performance issues when large numbers of blobs are uploaded. Since the cafs blobs can be generated in parallel and at scale, it makes sense to move blobs to their own bucket.

Add licensing details

Need to

  1. Choose license
  2. Complete formalities if any
  3. Add details to GitHub and to Contributing.md

End to end checksum and bit rot.

For upload and download of files we need to ensure we are setting up checksums correctly and detecting any bit rot that might occur.
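
A minimal sketch of a streaming integrity check on download, assuming SHA-256 is the checksum recorded at upload time (datamon's actual hash choice may differ); the same TeeReader pattern works in the upload direction.

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "io"
    "os"
)

// verifiedCopy streams src into dst while hashing the bytes, then compares
// the digest against the checksum recorded at upload time. A mismatch means
// corruption in transit or bit rot at rest.
func verifiedCopy(dst io.Writer, src io.Reader, wantHex string) error {
    h := sha256.New()
    // TeeReader hashes every byte as it is copied to the destination.
    if _, err := io.Copy(dst, io.TeeReader(src, h)); err != nil {
        return err
    }
    got := hex.EncodeToString(h.Sum(nil))
    if got != wantHex {
        return fmt.Errorf("checksum mismatch: got %s, want %s", got, wantHex)
    }
    return nil
}

func main() {
    // Example: verify a local file against a known digest (placeholder value).
    f, err := os.Open("bundle.zip")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    defer f.Close()
    if err := verifiedCopy(io.Discard, f, "expected-hex-digest-here"); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}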

Tracking metrics

We need metrics organized by:

Per Run:

  • Time to download
  • Amount downloaded
  • Size of input and output repo

Datamon-wide numbers:

Run to Bundle generation integration design

At the end of a function run, output data needs to be written to a new bundle. There are some open questions we need to address:

  1. In an ongoing function, what is the handshake to indicate a bundle is done?
  2. When creating the run.json, where should the metadata for the run come from?
  3. How is the output folder maintained?

This could all be done in the filter layer in the kubeless runtime that runs the functions. It can call the datamon CLI to create a new bundle, extract the metadata, and move the output folder aside to provide a fresh output folder (or back up a snapshot of a DB if there is no data output folder).

Upload archive files to S3

Issue
After zipping/archiving the globbed files, push the archive to Amazon S3.

Once datamon zips all the files and directories, it pushes the archive to S3 for storage and code execution.

Success Criteria

  • Successfully upload the archive to S3

  • Unit tests

  • Prometheus metrics such as file size, 99th-percentile latency, and success/failure counts (an upload sketch follows below)
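
A minimal sketch of the upload step using the AWS SDK for Go's s3manager uploader; the region and key layout are assumptions, the bucket name is the dev bucket mentioned elsewhere in these issues, and the metrics hooks are omitted.

package main

import (
    "fmt"
    "os"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3/s3manager"
)

// uploadArchive pushes a local zip archive to S3. s3manager handles
// multipart uploads for large archives automatically.
func uploadArchive(bucket, key, path string) error {
    f, err := os.Open(path)
    if err != nil {
        return err
    }
    defer f.Close()

    // Region is an assumption for illustration only.
    sess, err := session.NewSession(&aws.Config{Region: aws.String("us-west-2")})
    if err != nil {
        return err
    }
    uploader := s3manager.NewUploader(sess)

    _, err = uploader.Upload(&s3manager.UploadInput{
        Bucket: aws.String(bucket),
        Key:    aws.String(key),
        Body:   f,
    })
    return err
}

func main() {
    // Key layout is a placeholder.
    if err := uploadArchive("oneconcern-datamon-dev", "archives/bundle.zip", "bundle.zip"); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}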

Implement the Merkle tree for a commit

Each file is described by a hash, but the entire set of file hashes needs to be hashed as well, effectively giving a Merkle tree. This helps calculate changes between commits.
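
A minimal sketch of rolling the per-file hashes up into a single root, using SHA-256 over a sorted leaf list; datamon's actual hash function and tree layout may differ.

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "sort"
)

// merkleRoot reduces a set of leaf hashes to a single root hash by pairwise
// hashing level by level. An odd leaf at the end of a level is promoted
// unchanged. Sorting the leaves first makes the root independent of the
// order in which files were listed.
func merkleRoot(leaves [][]byte) []byte {
    if len(leaves) == 0 {
        return nil
    }
    sort.Slice(leaves, func(i, j int) bool {
        return string(leaves[i]) < string(leaves[j])
    })
    level := leaves
    for len(level) > 1 {
        var next [][]byte
        for i := 0; i < len(level); i += 2 {
            if i+1 == len(level) {
                next = append(next, level[i])
                continue
            }
            h := sha256.Sum256(append(append([]byte{}, level[i]...), level[i+1]...))
            next = append(next, h[:])
        }
        level = next
    }
    return level[0]
}

func main() {
    // Leaf hashes would normally be the per-file content hashes of a commit.
    files := []string{"data/a.csv", "data/b.csv"}
    var leaves [][]byte
    for _, f := range files {
        h := sha256.Sum256([]byte(f)) // placeholder: hash of the path, not real content
        leaves = append(leaves, h[:])
    }
    fmt.Println(hex.EncodeToString(merkleRoot(leaves)))
}

Two commits with equal roots hold identical content; differing roots can be narrowed down to the changed files by walking the trees level by level.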

Implement upload of a net-new / standalone bundle

Implement bundle upload

  1. The bundle is a net-new or standalone bundle
  2. The bundle is in a default master branch
  3. A run description can be attached as a JSON file

Non Goals:

  1. Caching
  2. Daemon set to monitor runs (once the runtime is nailed down, the best spot for creating a bundle can be finalized and implemented)

Improvements to Key enumeration of store.go

The localfs implementation needs to:

  • Walk the key space more efficiently (faster is better)

  • Allow for pagination in the API, to walk an arbitrarily large number of files

  • Support prefix-based key enumeration (see the interface sketch below)
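
A possible interface shape for the points above, as a sketch only; the type names and page-token scheme are not existing datamon APIs. A localfs implementation could back this with filepath.WalkDir, stopping after pageSize entries and using the last path seen as the continuation token.

package storage

import "context"

// KeyPage is one page of keys plus an opaque continuation token; an empty
// Next token means enumeration is complete.
type KeyPage struct {
    Keys []string
    Next string
}

// KeyEnumerator extends a store with paginated, prefix-filtered key listing,
// so callers can walk arbitrarily large key spaces without holding all keys
// in memory.
type KeyEnumerator interface {
    // KeysWithPrefix returns up to pageSize keys starting at the position
    // encoded by token (an empty token starts from the beginning).
    KeysWithPrefix(ctx context.Context, prefix, token string, pageSize int) (KeyPage, error)
}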

cafs/store: Duplicate blob handling

  • The store needs a predefined error for putting an object that already exists when the precondition requires it to be a new upload.
    CAFS needs to treat this error as a successful case, since the blob is already present (see the sketch below).
  • For duplicate zero-byte blobs, avoid creating an object.
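
A minimal sketch of the sentinel-error handling; ErrObjectExists and the Store interface are hypothetical shapes, not the real datamon/cafs types.

package cafs

import (
    "context"
    "errors"
    "io"
)

// ErrObjectExists is the predefined error a store returns when a Put with a
// "must be new" precondition hits an object that is already present.
var ErrObjectExists = errors.New("store: object already exists")

// Store is the minimal blob-store surface needed here (hypothetical shape).
type Store interface {
    Put(ctx context.Context, key string, body io.Reader) error
}

// putBlob uploads a content-addressed blob. Because keys are derived from
// content, an "already exists" error means the exact same bytes are already
// stored, so CAFS treats it as success (deduplication, not failure).
func putBlob(ctx context.Context, s Store, key string, body io.Reader, size int64) error {
    if size == 0 {
        // For zero-byte blobs, skip creating an object at all.
        return nil
    }
    err := s.Put(ctx, key, body)
    if errors.Is(err, ErrObjectExists) {
        return nil
    }
    return err
}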

High level epic for what is coming next.

Datamon plan going forward.

This is the result of re-evaluating what we should focus on. The list is not sorted by priority or necessity; it is a summary of some of the options ahead.

Goals

  1. Move Flood off EFS and onto GKE
    1. Save cost
    2. Bring insight into the pipeline
    3. Launch datamon
  2. Allow Jupyter notebooks to access Datamon bundles
    1. Allow more convenient access to data sets

Rough tasks

  1. Provide a way for a pod to consume a datamon repo:commit and for the pod to generate a repo:commit
  2. The adoption of datamon should be simple and straightforward. A couple of options:
    1. Model the repo:commit as a volume and write a CSI plugin for datamon
      1. Benefit: Fits well with the Kubernetes storage management model and will work with vanilla pods as well as Argo-based deployments
    2. Custom runtime management of volume mounts. Based on the execution within a pod, mount or unmount commits
      1. Benefits: Works well for kubeless-style long-duration pods that use different commits.
      2. If the mounts and unmounts are initiated from outside of a pod, datamon will have to ship a daemon set that monitors the state of a pod and decides when to initialize new mounts and when to create new commits.
  3. Fuse based mounts vs. bulk download and upload
    1. Allows for low time to first byte.
    2. View of the data can be changed dynamically for the same mount
    3. Better insight into actual usage of data
  4. Address diamond-style scale-out of workflows. A stage in the pipeline might have multiple independent pods that generate data for the same commit. This is doable within Datamon without writing a full-fledged distributed filesystem, as long as the data written by each pod is part of the same commit but isolated from the others. The data written can be aggregated post-run into a single commit.
  5. Add a developer friendly CLI that will allow Jupyter notebooks to work.
    1. TBD: What it takes to integrate with Jupyter
  6. Caching model and scheduler coordination
    1. Customize scheduling of pods based on locality of caching, or
    2. Mechanism to have a lower latency cache within a k8s cluster (shared volumes mounted as read only volumes).

Caching scheme

Blobs for the CAFS should be cached. There are many aspects to caching that we need to decide on:

  • In GKE, is caching done via a disk attached to each node, or is there a common caching disk that a daemon populates on request?
  • When are blobs evicted from the cache?
    • Access time?

Document Design and define MVP in README

Document the current design for Datamon and define the MVP.

Exit Criteria:

  1. Interested developers review the document
  2. Nomenclature, user experience and workflows are finalized

Goals:

  1. Datamon is put to use with a user experience that is expected to be the final user experience.

Non Goals:

  1. MVP is representative of all the parts that will eventually be built out on the backend.

Permission for developers to customise the S3 bucket name

Currently, we are using the "oneconcern-datamon-dev" S3 bucket to publish all the zip files. It would be great if we gave this power to the developers so they can customise where they want to push their code archive.

What are your thoughts?

Data path improvements

For the MVP the only implementation done is a bulk download of chunks and creation of a filesystem.
Going forward, we need to decide if a more stream-friendly implementation is needed or if other optimizations in the data path will better serve the overall performance.

Decide policy around runtime environment

When running datamon, the source and target information needs to be populated.
Source:

  • Bucket for datamon
  • Repo
  • Commit/Bundle

Destination:

  • Bucket
  • Repo
  • Committer's meta information

Depending on the meta information, it can be extracted from the pod YAML or fed in via config files to the pieces of datamon (a possible config shape is sketched below).
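
A sketch of what the source/destination configuration could look like as a Go struct loaded from a config file; all field names here are illustrative assumptions.

package config

// RuntimeConfig captures the source and destination information datamon
// needs at run time. It could be unmarshalled from a config file or
// populated from the pod spec (field names are illustrative only).
type RuntimeConfig struct {
    Source struct {
        Bucket string `yaml:"bucket"` // bucket for datamon
        Repo   string `yaml:"repo"`
        Bundle string `yaml:"bundle"` // commit/bundle to consume
    } `yaml:"source"`
    Destination struct {
        Bucket    string `yaml:"bucket"`
        Repo      string `yaml:"repo"`
        Committer struct {
            Name  string `yaml:"name"`
            Email string `yaml:"email"`
        } `yaml:"committer"`
    } `yaml:"destination"`
}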

Implement bundle download

Implement the Init Container that will

  1. Use the configured bucket:repo:branch:bundle to download the bundle
    1. Folders need to be on the same filesystem and need to be mounted as a single mount
    2. The small files are direct linked to the blobs
    3. Larger files need to be copied over post download.
  2. Configurable path to cache of blobs
  3. Configurable path to the directory to dump the bundle
  4. Use Google Object Store with credentials for the pod
    1. https://cloud.google.com/kubernetes-engine/docs/tutorials/authenticating-to-cloud-platform
  5. Container image published to the gcr registry via an automated process from GitHub
    1. Via Release mechanism?

Non Goals:

  • Speeding up downloads
  • Tracking request processing back to Google bucket logs
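
A minimal sketch of the object download step (item 4 above) using the Google Cloud Storage client, which picks up the pod's credentials via Application Default Credentials; bucket and object names are placeholders, and the blob cache and small/large file handling are omitted.

package main

import (
    "context"
    "fmt"
    "io"
    "os"

    "cloud.google.com/go/storage"
)

// downloadObject copies one object from a GCS bucket to a local path.
// Credentials come from the environment (Application Default Credentials),
// which is how a pod on GKE would authenticate.
func downloadObject(ctx context.Context, bucket, object, dst string) error {
    client, err := storage.NewClient(ctx)
    if err != nil {
        return err
    }
    defer client.Close()

    r, err := client.Bucket(bucket).Object(object).NewReader(ctx)
    if err != nil {
        return err
    }
    defer r.Close()

    out, err := os.Create(dst)
    if err != nil {
        return err
    }
    defer out.Close()

    _, err = io.Copy(out, r)
    return err
}

func main() {
    // Placeholder bucket/object names for illustration only.
    if err := downloadObject(context.Background(), "datamon-bundles", "repo/branch/bundle.json", "/cache/bundle.json"); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}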

Run repo support

Model repo needs to store

  1. Metadata about the model run
  2. Ability to reproduce the container runtime
    1. Store the entire container image
    2. Store enough Metadata to reliably reconstruct the container image

Create config parser for content and entrypoint

User Story: A developer in the flood repo can do the following:

  1. Define a YAML file, as specced below, to describe the code needed and the command to run.
# each entry is a glob pattern to select files you want included
# the paths will be preserved
content:
  - vendor/*
  - scripts/*
  - bin/*
  - requirements.txt
  - app.py
# the command to run
command:
  - python
  - app.py
  2. The developer can run datamon model deploy --config path-to-config-file --context rootpath, which will create a package (zip file, or some other format) based on the YAML and additional logic. The --context flag can be omitted, in which case the current directory is used as the root to look up relative paths in the config file.
  3. The package can be deployed onto kubeless.
  4. The script can be named datamon deploy, as this functionality will eventually be integrated into the datamon CLI. This can be how it is implemented for this issue, or it can be a standalone script for now (a config-parsing sketch follows below).
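
A minimal sketch of parsing the spec above into Go structs with gopkg.in/yaml.v2; the struct names are placeholders, and validation that the globbed files exist would follow.

package main

import (
    "fmt"
    "os"

    yaml "gopkg.in/yaml.v2"
)

// DeployConfig mirrors the YAML spec above: a list of glob patterns for the
// files to package, and the command to run.
type DeployConfig struct {
    Content []string `yaml:"content"`
    Command []string `yaml:"command"`
}

// loadConfig reads and unmarshals the config file given to --config.
func loadConfig(path string) (*DeployConfig, error) {
    data, err := os.ReadFile(path)
    if err != nil {
        return nil, err
    }
    var cfg DeployConfig
    if err := yaml.Unmarshal(data, &cfg); err != nil {
        return nil, err
    }
    if len(cfg.Content) == 0 || len(cfg.Command) == 0 {
        return nil, fmt.Errorf("%s: both content and command must be set", path)
    }
    return &cfg, nil
}

func main() {
    cfg, err := loadConfig("example.yaml")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Printf("packaging %d glob(s), entrypoint: %v\n", len(cfg.Content), cfg.Command)
}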

Exit criteria:

  1. Documentation for developers (flood) for how to use this script
  2. Functioning command in datamon
  3. One sample hello world app that is implemented using the script on kubeless.
  4. Validations are run
    1. All files defined in the YAML are present
    2. Clear error messages that are helpful to developers
    3. Deployment is successful in kubeless; otherwise an appropriate error is reported

Goals:
Developers can start using this script to deploy their apps written to the spec defined here, and their experience will be more or less identical once the datamon backend is ready.

Non Goals:

  1. Functioning datamon backend

Caching of cafs blobs

A caching layer needs to be added to CAFS.
This can be done externally to CAFS, with CAFS providing an interface to interact with it, or internally within CAFS (see the sketch below).
This will be needed for the fuse-based RW FS.
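
A sketch of how an external read-through cache could wrap the blob store without changing CAFS itself; the interface names are hypothetical, and a real implementation would bound the cache size and evict entries.

package cafs

import (
    "context"
    "sync"
)

// BlobStore is the minimal read surface assumed here (hypothetical shape).
type BlobStore interface {
    Get(ctx context.Context, key string) ([]byte, error)
}

// cachedStore is a read-through cache in front of a BlobStore: hits are
// served from memory, misses are fetched from the backing store and kept.
type cachedStore struct {
    backend BlobStore
    mu      sync.Mutex
    blobs   map[string][]byte
}

// NewCachedStore wraps a backing store with the in-memory cache.
func NewCachedStore(backend BlobStore) BlobStore {
    return &cachedStore{backend: backend, blobs: make(map[string][]byte)}
}

func (c *cachedStore) Get(ctx context.Context, key string) ([]byte, error) {
    c.mu.Lock()
    if b, ok := c.blobs[key]; ok {
        c.mu.Unlock()
        return b, nil
    }
    c.mu.Unlock()

    b, err := c.backend.Get(ctx, key)
    if err != nil {
        return nil, err
    }
    c.mu.Lock()
    c.blobs[key] = b
    c.mu.Unlock()
    return b, nil
}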

Trigger Specification

The ability to create a new event/webhook needs to be fleshed out:

  1. K8S CRD spec
  2. Datamon metadata for an event
  3. Registering to other events to be dependent on them.

Look into the performance of the Google Cloud SDK and our usage of it

We need to study the CPU spent on buffer management. This is pre-GA. Though the latency numbers for datamon will not be heavily impacted, the CPU spent could be reduced by improving the overall buffer management. This might mean we have to write our own REST client. Also, for large objects we need to be sure that the SDK, or our use of it, is not bringing the entire object into memory.
Filing this now, as I spent some time digging into the Google Cloud SDK and was not quite happy with the thought put into it.

Event tree

The entire tree of events, and the dependencies between events and data repos, needs to be presentable in order to:

  1. Track the end-to-end flow of an event occurring
  2. Better inspect the pipeline

This can be done via integration with Jaeger/Istio.
Also, all past runs can be inspected to know the $ cost of a run.
