
kfdata

Prototype implementation of the KFData proposal - see pachdm.com/kfdata

Design

KFData design

User stories

  1. A user attaches a dataset as a reader to a pipeline spec and gets dataset env vars auto-populated (can use the built-in Minio for this without Pachyderm)
  2. Pachyderm triggers an incremental (per-datum) pipeline run in KFData and gets env vars with capability-like access to the subset of data to be processed

Discussion

How does dataset get specified in a pipeline spec?

KFP pipelines typically specify data as pipeline parameters, normally a reference to a GCS bucket (which, by the way, fails out of the box on Kubeflow).

So, to get us started let's figure out how we'll specify datasets to pipeline components...

Well, KFP pipelines pass InputPath and OutputPath parameters to functions wrapped with func_to_container_op, and it then seems to be KFP's job to read/write those files to object storage. So that's fine for passing data between pipeline steps, but how do we get data into the pipeline in the first place?
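
For reference, a minimal sketch of that mechanism using the KFP v1 SDK; the component itself is made up, but the InputPath/OutputPath annotations and the func_to_container_op wrapping are the pattern being discussed:

```python
from kfp.components import func_to_container_op, InputPath, OutputPath

@func_to_container_op
def tokenize(text_path: InputPath(str), tokens_path: OutputPath(str)):
    # KFP materialises the upstream artifact as a local file at text_path,
    # and uploads whatever we write to tokens_path back to object storage.
    with open(text_path) as f_in, open(tokens_path, 'w') as f_out:
        f_out.write(' '.join(f_in.read().split()))
```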

Hmm, I'm having trouble finding a sensible KFP example!

This is the kind of thing I want, but that doesn't use KFP.

This might work, but is quite GCP-specific.

The typical Iris sample uses TFX, and I sorta don't want to go down the TFX rabbithole right now. (How does TFX/TFJob get work done on Kubeflow? Is it by starting its own pods or something? That will make our proposal hard.)

I want something which is "typical", in the expected sense that:

  • It uses object storage and InputPath/OutputPath parameters (not PVs like the Canonical example).
  • It's not tied to GCP (I'm not using GCP but an "on-prem" Kubeflow with a local minio, which points at the Cisco example, but that doesn't use KFP).
  • It uses real input data. A lot of the ML examples don't really use data at all; they just use toy datasets, e.g. Iris.

Financial time series might be the best bet. Let's see if we can get it to run without GCP.

It does:

It doesn't:

  • Use func_to_container_op, so there are manual docker build steps

Hmm: this looks promising... but it's documented as not working.

Maybe extending the "Data passing in python components" example to read input data from somewhere...

So let's make our own example (see the sketch after this list):

  • pytorch text model
  • inference pipeline which just does sentiment analysis on strings of text
  • kfp func_to_container_op with InputPath and OutputPath annotations, in a python script we can run easily to test KFData
  • make it use minio on the kubeflow cluster
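
A hypothetical sketch of that example with the KFP v1 SDK; the component names, the sample text, and the trivial keyword scorer standing in for the pytorch model are all made up for illustration:

```python
import kfp
from kfp.components import func_to_container_op, InputPath, OutputPath

@func_to_container_op
def fetch_reviews(reviews_path: OutputPath(str)):
    # Stand-in for pulling text from the in-cluster minio: a real version
    # would read the S3 endpoint/credentials from the env vars KFData is
    # expected to inject, and download the dataset with an S3 client.
    with open(reviews_path, 'w') as f:
        f.write('the model was surprisingly easy to deploy\n'
                'the build kept failing and the docs were confusing\n')

@func_to_container_op
def score_sentiment(reviews_path: InputPath(str), scores_path: OutputPath(str)):
    # Placeholder for the pytorch text model: a trivial keyword scorer.
    with open(reviews_path) as f_in, open(scores_path, 'w') as f_out:
        for line in f_in:
            f_out.write(f"{1 if 'easy' in line else -1}\t{line}")

@kfp.dsl.pipeline(name='sentiment-inference')
def sentiment_pipeline():
    reviews = fetch_reviews()
    score_sentiment(reviews.output)

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(sentiment_pipeline, 'sentiment_pipeline.yaml')
```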

Does this help us decide how to attach Datasets as readers/writers to Pipelines, and DatasetSpec to a pipeline run creation request?

Well, for one thing they should be somewhat orthogonal to InputPath and OutputPath.

If KFP itself has code to read and write objects to/from object storage (which "data passing in python components" suggests), we should provide a way to connect Datasets to {Input,Output}Paths somehow... on the "outside" of a pipeline.

So Datasets should be able to connect to the "outside ports" of a KFP. How we provide Dataset attachment to pipeline specs depends on the specifics of how the {Input,Output}Path internals work.

Next step: go read the code for {Input,Output}Path.

Here's one idea: the convention seems to be that you pass gs://foo paths as pipeline parameters, and it's up to the pipeline to list objects, open one, and feed it to further pipeline steps as InputPath and InputTextFile types, etc. (If KFP/Argo uses object storage for data passing, can't it read from it for pipeline inputs too? Clarified in this thread: artifacts are only tracked after they're ingested into the system with e.g. a download_blob component.)

We could invent a new convention that a pipeline parameter kfdata://foo:r attaches KFData dataset foo as a reader, and kfdata://bar:w attaches bar as a writer, populating the appropriate env vars for, say, an S3 client library.
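
A minimal sketch of what that translation could look like; the URI scheme is the one proposed above, but the env var names and the hard-coded connection values are assumptions (a real implementation would resolve them from the Dataset object):

```python
import re

def dataset_env_vars(param: str) -> dict:
    """Translate a kfdata://<name>:<mode> pipeline parameter into the env
    vars an S3 client library would expect."""
    m = re.fullmatch(r'kfdata://([a-z0-9-]+):([rw])', param)
    if not m:
        raise ValueError(f'not a kfdata parameter: {param}')
    name, mode = m.groups()
    # Placeholder values; a real implementation would look the Dataset up
    # in the Kubernetes API and read these from its spec/secrets.
    return {
        'KFDATA_DATASET': name,
        'KFDATA_MODE': 'read' if mode == 'r' else 'write',
        'S3_ENDPOINT': 'minio-service.kubeflow:9000',
        'S3_BUCKET': name,
        'AWS_ACCESS_KEY_ID': 'minio',
        'AWS_SECRET_ACCESS_KEY': 'minio123',
    }

print(dataset_env_vars('kfdata://foo:r'))
```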

There could be a corresponding kfp Python SDK module responsible for implementing the translation as a first pass. It could just go and find the Dataset in the Kubernetes API. Or, to begin with, we could just implement it as a generic component that people can start using in their pipelines! Actually, that won't work unless the KFData component can convert a KFData dataset name into environment variables and output them as outputValues, and then have the next component along take those env vars as inputValues and do something like updating os.environ. This might work for a POC, but we'd want less boilerplate in the ultimate implementation.
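
A sketch of that POC handshake; both component names and the JSON encoding of the env vars are assumptions, not an existing API:

```python
from kfp.components import func_to_container_op

@func_to_container_op
def kfdata_attach(dataset: str) -> str:
    """Generic POC component: turn a Dataset name into connection env vars,
    emitted as a JSON string output value."""
    import json
    # Placeholder values; a real version would look the Dataset up via the
    # Kubernetes API (e.g. with something like dataset_env_vars above).
    return json.dumps({
        'S3_ENDPOINT': 'minio-service.kubeflow:9000',
        'S3_BUCKET': dataset,
    })

@func_to_container_op
def consume_dataset(dataset_env: str):
    """Downstream component: apply the env vars before touching the data."""
    import json, os
    os.environ.update(json.loads(dataset_env))
    # ...an S3 client constructed here now points at the attached Dataset...
```

In a pipeline, consume_dataset(kfdata_attach('foo').output) wires the output value into the input value, which is exactly the boilerplate we'd want to hide later.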

In the future it would be better, of course, if Datasets were a top level object in Kubeflow, creatable/listable in the UI and attachable in the pipeline run builder UI (just like the experiment picker).

NB: in terms of Pachyderm integrating with Kubeflow Artifacts (for intermediate data passing), perhaps we can configure Argo/Kubeflow or whatever to use the Pachyderm S3 gateway as a place to write that intermediate data. And perhaps we can thread through some provenance information so that Pachyderm's provenance can "see through" the indirection of going via KFP Artifacts. This would allow Kubeflow provenance to extend "further back" than where it's ingested.

Might we be able to do better than looking for magic pipeline parameters of a certain kfdata:// form, and instead prototype adding a new Datasets field to the pipeline spec, implemented with an admission controller? There's nothing stopping an admission controller from reading the extra field off a modified pipeline spec and then applying it to the Argo pods that get created, I suppose.
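
To make the admission-controller idea concrete, here's a minimal sketch of a mutating webhook that injects Dataset info into Argo workflow pods. It assumes Flask, a registered MutatingWebhookConfiguration with TLS handled elsewhere, and a made-up kfdata.io/datasets pod annotation carrying the new field:

```python
import base64
import json

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/mutate', methods=['POST'])
def mutate():
    review = request.get_json()
    pod = review['request']['object']
    datasets = pod['metadata'].get('annotations', {}).get('kfdata.io/datasets', '')
    patch = []
    if datasets:
        # Inject the Dataset attachment into every container; assumes each
        # container already has an env list (a real controller would handle
        # the missing-key case and resolve Datasets into concrete env vars).
        for i, _ in enumerate(pod['spec']['containers']):
            patch.append({
                'op': 'add',
                'path': f'/spec/containers/{i}/env/-',
                'value': {'name': 'KFDATA_DATASETS', 'value': datasets},
            })
    return jsonify({
        'apiVersion': 'admission.k8s.io/v1',
        'kind': 'AdmissionReview',
        'response': {
            'uid': review['request']['uid'],
            'allowed': True,
            'patchType': 'JSONPatch',
            'patch': base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    })
```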

Implementation

  • Try using Admission Controllers to minimize changes needed to KFP
    • Create an operator with CRDs for Dataset Types and Datasets; as a v1, an admission controller processes compiled Argo pipelines to add env vars
  • Likely end up using MLMD
    • Treat each Dataset as an Artifact
  • Hopefully reuse work across TFX/KFData efforts
    • We recognize that TFX has some similarities, but don’t want to force users to rewrite their pipelines in TFX DSL to take advantage of KFData
    • NB: added TFX section to proposal doc
  • On October 14, we’ll aim to bring a POC & demo to this WG
    • Target delivery for inclusion and promotion in the next appropriate Kubeflow release, if the KFP WG is supportive

