Prototype implementation of KFData proposal - see pachdm.com/kfdata
- User attaches dataset as reader to a pipeline spec, gets dataset env vars auto-populated (can use built-in Minio for this without Pachyderm)
- Pachyderm triggers an incremental (per-datum) pipeline run in KFData, gets env vars with capability-like access to subset of data to be processed
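The env-var auto-population above can be sketched as a tiny contract. Everything here is an assumption for illustration (variable names, the Minio endpoint, the capability shape), not part of any agreed spec:

```python
import os

# Hypothetical env-var contract for a dataset attached as a reader; every
# name below is a guess for this sketch, not a defined KFData interface.
def attach_reader_env(dataset: str, endpoint: str, bucket: str) -> dict:
    """Env vars a pipeline pod might receive for a read-only attachment
    (capability-like: scoped to a single bucket)."""
    tag = dataset.upper()
    return {
        f"KFDATA_{tag}_ENDPOINT": endpoint,  # e.g. the in-cluster Minio
        f"KFDATA_{tag}_BUCKET": bucket,
        f"KFDATA_{tag}_MODE": "r",           # reader, not writer
    }

env = attach_reader_env("images", "http://minio.kubeflow:9000", "raw-images")
os.environ.update(env)  # what the injection machinery would do for the pod
```

An S3 client library in the pipeline step would then be pointed at these variables rather than at hard-coded credentials.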
How does dataset get specified in a pipeline spec?
KFP pipelines typically specify data as pipeline parameters, normally a reference to a GCS bucket (which, by the way, fails out of the box on Kubeflow).
So, to get us started let's figure out how we'll specify datasets to pipeline components...
Well, KFP pipelines specify InputPath and OutputPath params to func_to_container_op methods.
It then seems that it's KFP's job to read/write files to object storage.
So that's fine for passing data between pipeline steps, but how about getting data into the pipeline in the first place?
Hmm, I'm having trouble finding a sensible KFP example!
This is the kind of thing I want, but that doesn't use KFP.
This might work, but is quite GCP-specific.
The typical Iris sample uses TFX, and I sorta don't want to go down the TFX rabbithole right now. (How does TFX/TFJob get work done on Kubeflow? Is it by starting its own pods or something? That will make our proposal hard.)
I want something which is "typical", in the expected sense that:
- It uses object storage and InputPath/OutputPath parameters (not PVs like the Canonical example).
- It's not tied to GCP (I'm not using GCP, I'm using an "on-prem" Kubeflow with local minio, like the Cisco example, but that example doesn't use KFP).
- It uses real input data. A lot of the ML ones don't use data at all really, they just use toy datasets e.g. Iris.
Financial time series might be the best bet. Let's see if we can get it to run without GCP.
It does:
- Define a Kubeflow pipeline
- Ingest data from a GCS bucket
It doesn't:
- Use func_to_container_op, so there are manual docker build steps
Hmm: this looks promising... but it's documented as not working.
Maybe extending the "Data passing in python components" example to read input data from somewhere...
So let's make our own example.
- pytorch text model
- inference pipeline which just does sentiment analysis on strings of text
- kfp func_to_container_op with InputPath and OutputPath annotations, in a Python script we can run easily to test KFData - make it use minio on the kubeflow cluster
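The inference step for this example could look something like the sketch below; a toy word lexicon stands in for the pytorch text model. Wrapped with kfp.components.func_to_container_op, the path parameters would be annotated as InputPath/OutputPath so KFP handles the object storage I/O; the function body itself only sees local files:

```python
# Sketch of the sentiment-analysis step; the lexicon is a placeholder
# for the real pytorch model, and all names here are illustrative.
def sentiment_op(text_path: str, scores_path: str) -> None:
    positive = {"good", "great", "love"}
    negative = {"bad", "awful", "hate"}
    with open(text_path) as f:
        lines = f.read().splitlines()
    with open(scores_path, "w") as out:
        for line in lines:
            words = set(line.lower().split())
            score = len(words & positive) - len(words & negative)
            out.write(f"{score}\n")
```

The point is the shape: plain string paths in, plain string paths out, no object-storage code inside the component, so the Dataset attachment can live entirely "outside" it.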
Does this help us decide how to attach Datasets as readers/writers to Pipelines, and DatasetSpec to a pipeline run creation request?
Well, for one thing they should be somewhat orthogonal to InputPath and OutputPath.
If KFP itself has code to read and write objects to/from object storage (which "data passing in python components" suggests), we should provide a way to connect Datasets to {Input,Output}Paths somehow... on the "outside" of a pipeline.
So Datasets should be able to connect to the "outside ports" of a KFP.
How we provide Dataset attachment to pipeline specs depends on the specifics of how the {Input,Output}Path internals work.
Next step: go read the code for {Input,Output}Path.
Here's one idea: the convention seems to be that you pass gs://foo paths as pipeline parameters, and it's up to the pipeline to list objects and open one and feed it to further pipeline steps as InputPath and InputTextFile types etc. (If KFP/Argo uses object storage for data passing, can't it read from it for pipeline inputs too? Clarified in this thread: artifacts are only tracked after they're ingested into the system with e.g. a download_blob component.)
We could invent a new convention that a pipeline parameter kfdata://foo:r attaches KFData dataset foo as a reader, and kfdata://bar:w attaches bar as a writer, populating the appropriate env vars for, say, an S3 client library.
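The translation layer for this convention is small enough to sketch. The env-var names are assumptions about what an S3 client library might want; only the kfdata://name:mode parameter shape comes from the proposal above:

```python
# Sketch of the proposed kfdata:// convention. "kfdata://foo:r" attaches
# dataset foo as a reader; "kfdata://bar:w" attaches bar as a writer.
def parse_kfdata_param(param: str) -> dict:
    prefix = "kfdata://"
    if not param.startswith(prefix):
        raise ValueError(f"not a kfdata parameter: {param}")
    name, _, mode = param[len(prefix):].rpartition(":")
    if mode not in ("r", "w"):
        raise ValueError(f"mode must be 'r' or 'w', got {mode!r}")
    return {"dataset": name, "mode": mode}

def dataset_env(attachment: dict, endpoint: str, bucket: str) -> dict:
    """Env vars the translation layer would inject into the pipeline pod
    (names are illustrative, not a fixed interface)."""
    tag = attachment["dataset"].upper()
    return {
        f"{tag}_S3_ENDPOINT": endpoint,
        f"{tag}_S3_BUCKET": bucket,
        f"{tag}_ACCESS_MODE": attachment["mode"],
    }
```

Whether endpoint and bucket come from a Dataset object in the Kubernetes API or from static config is exactly the open question in the next paragraph.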
There could be a corresponding kfp Python SDK module which is responsible for implementing the translation as a first pass. It could just go and find the Dataset in the Kubernetes API. Or, to begin with, we could just implement it as a generic component that people can start using in their pipelines! Actually, that won't work, unless the KFData component can convert a KFData dataset name into environment variables and output them as outputValues, and then have the next component along take those env vars as inputValues and do something like os.environ[...] = .... This might work for a POC but we'd want less boilerplate in the ultimate implementation.
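The POC-style handoff described above can be sketched in plain Python: one component resolves a dataset name to env vars and emits them as an output value, and the next component parses that value and applies it before touching object storage. Component and variable names here are illustrative, and the hard-coded endpoint stands in for a real Kubernetes API lookup:

```python
import json
import os

def kfdata_resolver_op(dataset: str) -> str:
    # In the real component this would look the Dataset up in the
    # Kubernetes API; a stand-in value is hard-coded for the sketch.
    env = {f"{dataset.upper()}_S3_ENDPOINT": "http://minio.kubeflow:9000"}
    return json.dumps(env)          # would become the component's outputValue

def consumer_op(dataset_env_json: str) -> None:
    # The boilerplate the notes mention: apply the env vars by hand
    # before any S3 client code runs.
    for key, value in json.loads(dataset_env_json).items():
        os.environ[key] = value
```

This also shows why the boilerplate is unattractive: every consuming component has to start with the same few lines.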
In the future it would be better, of course, if Datasets were a top level object in Kubeflow, creatable/listable in the UI and attachable in the pipeline run builder UI (just like the experiment picker).
NB: in terms of Pachyderm integrating with Kubeflow Artifacts (for intermediate data passing), perhaps we can configure Argo/Kubeflow or whatever to use the Pachyderm S3 gateway as a place to write that intermediate data. And perhaps we can thread through some provenance information so that Pachyderm's provenance can "see through" the indirection of going via KFP Artifacts. This would allow Kubeflow provenance to extend "further back" than where it's ingested.
Might we be able to do better than looking for magic pipeline parameters of a certain kfdata:// form, and instead prototype adding a new Datasets field to a pipeline spec, implemented with an admission controller? There's nothing stopping an admission controller stripping the new field off a modified pipeline spec and then applying it to the Argo pods created, I suppose.
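The mutation that admission controller would perform is mostly dict surgery on the pod. A minimal sketch, assuming the Datasets field survives compilation as a pod annotation; the annotation key, its comma-separated format, and the env-var names are all guesses for illustration:

```python
# Sketch of the admission-controller idea: strip a hypothetical
# "kfdata.io/datasets" annotation off a compiled Argo pod and inject
# env vars into its containers.
def mutate_pod(pod: dict) -> dict:
    annotations = pod.setdefault("metadata", {}).setdefault("annotations", {})
    spec = annotations.pop("kfdata.io/datasets", "")
    if not spec:
        return pod                          # nothing attached; no-op
    env = []
    for entry in spec.split(","):           # e.g. "foo:r,bar:w"
        name, _, mode = entry.partition(":")
        env.append({"name": f"KFDATA_{name.upper()}_MODE", "value": mode})
    for container in pod["spec"]["containers"]:
        container.setdefault("env", []).extend(env)
    return pod
```

In a real deployment this logic would sit behind a mutating webhook and emit a JSON patch, but the core transformation is this simple, which is what makes the approach attractive for minimizing KFP changes.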
- Try using Admission Controllers to minimize changes needed to KFP
- Create operator for CRDs for Dataset Types, Datasets; admission controller to process compiled Argo pipelines to add env vars as v1
- Likely end up using MLMD
- Treat each Dataset as an Artifact
- Hopefully reuse work across TFX/KFData efforts
- We recognize that TFX has some similarities, but don’t want to force users to rewrite their pipelines in TFX DSL to take advantage of KFData
- NB: added TFX section to proposal doc
- On October 14, we’ll aim to bring a POC & demo to this WG
- Target delivery for inclusion and promotion in next appropriate Kubeflow release -- if the KFP WG is supportive