kubeflow / pipelines

Machine Learning Pipelines for Kubeflow

Home Page: https://www.kubeflow.org/docs/components/pipelines/

License: Apache License 2.0

Topics: kubeflow-pipelines, mlops, kubeflow, machine-learning, kubernetes, pipeline, data-science

pipelines's Introduction


Overview of the Kubeflow pipelines service

Kubeflow is a machine learning (ML) toolkit that is dedicated to making deployments of ML workflows on Kubernetes simple, portable, and scalable.

Kubeflow pipelines are reusable end-to-end ML workflows built using the Kubeflow Pipelines SDK.

The Kubeflow pipelines service has the following goals:

  • End-to-end orchestration: enabling and simplifying the orchestration of end-to-end machine learning pipelines.
  • Easy experimentation: making it easy for you to try numerous ideas and techniques, and to manage your various trials/experiments.
  • Easy re-use: enabling you to re-use components and pipelines to quickly assemble end-to-end solutions, without having to rebuild each time.
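
To make this concrete, here is a minimal sketch of a pipeline defined with the Kubeflow Pipelines SDK (v2-style API); the component and pipeline names are illustrative, not from this README:

# Minimal sketch of a pipeline built with the KFP SDK (v2-style API).
from kfp import compiler, dsl

@dsl.component
def say_hello(name: str) -> str:
    greeting = f"Hello, {name}!"
    print(greeting)
    return greeting

@dsl.pipeline(name="hello-pipeline")
def hello_pipeline(recipient: str = "world"):
    say_hello(name=recipient)

if __name__ == "__main__":
    # Compile to a YAML package that the KFP backend can run.
    compiler.Compiler().compile(hello_pipeline, "hello_pipeline.yaml")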

Installation

  • Install Kubeflow Pipelines using one of the options described in Installation Options for Kubeflow Pipelines.

  • The Docker container runtime has been deprecated on Kubernetes 1.20+. Kubeflow Pipelines switched to the Emissary executor by default as of Kubeflow Pipelines 1.8. The Emissary executor is container-runtime agnostic, meaning you can run Kubeflow Pipelines on a Kubernetes cluster with any container runtime.

Documentation

Get started with your first pipeline and read further information in the Kubeflow Pipelines overview.

See the various ways you can use the Kubeflow Pipelines SDK.

See the Kubeflow Pipelines API doc for API specification.

Consult the Python SDK reference docs when writing pipelines using the Python SDK.

Refer to the versioning policy and feature stages documentation for more information about how we manage versions and feature stages (such as Alpha, Beta, and Stable).

Contributing to Kubeflow Pipelines

Before you start contributing to Kubeflow Pipelines, read the guidelines in How to Contribute. To learn how to build and deploy Kubeflow Pipelines from source code, read the developer guide.

Kubeflow Pipelines Community Meeting

The meeting happens every other Wednesday, 10-11 AM (PST). Calendar Invite or Join Meeting Directly

Meeting notes

Kubeflow Pipelines Slack Channel

#kubeflow-pipelines

Blog posts

Acknowledgments

Kubeflow Pipelines uses Argo Workflows by default under the hood to orchestrate Kubernetes resources. The Argo community has been very supportive, and we are very grateful. A Tekton backend is also available; to use it, refer to the Kubeflow Pipelines with Tekton repository.

pipelines's People

Contributors

ajchili, ark-kun, bobgy, capri-xiyue, chensun, chongyouquan, connor-mccarthy, dependabot[bot], gaoning777, gkcalat, hongye-sun, ironpan, ji-yaqi, jingzhang36, jlyaoyuli, jsondai, kevinbnaughton, linchin, neuromage, nikenano, qimingj, rileyjbauer, rmgogogo, rui5i, sinachavoshi, themichaelhu, tomcli, vicaire, yebrahim, zijianjoy


pipelines's Issues

Give generated exit-handler a more special name

If an exit handler is specified in the DSL, the compiler currently uses a bit of a hack to ensure that it is always called before the pipeline terminates: an extra DAG named something like exit-handler-1 is added to the compiled YAML; it is the only task of the entrypoint DAG and wraps all other steps within the pipeline (they are tasks within the exit handler's DAG).

This works all right as a workaround, but it clutters the UI in a rather unhelpful way, so we currently hide this node by checking whether its name starts with "exit-handler".

This is potentially problematic because that is not an unlikely name for a user to pick, so something like "__exit-handler" might be better until a proper solution to the exit handler problem is found.
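
For reference, a minimal sketch (KFP v1-style DSL; the component is hypothetical) of the pattern that produces the wrapper DAG described above:

import kfp.dsl as dsl

def echo_op(msg: str):
    # Hypothetical component: a container that just echoes a message.
    return dsl.ContainerOp(
        name="echo",
        image="alpine:3.12",
        command=["echo", msg],
    )

@dsl.pipeline(name="exit-handler-demo")
def demo_pipeline():
    exit_task = echo_op("cleanup")
    # Everything inside this block compiles into the exit handler's DAG,
    # which becomes the sole task of the entrypoint DAG.
    with dsl.ExitHandler(exit_task):
        echo_op("work")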

How is this project proceeding?

I'd like to contribute to this project; are there any milestones or actual code for pipelines yet?

It looks like there are two possible approaches:

  1. Move argoproj/argo into here. This probably requires permission from the argoproj members.
  2. Implement the pipeline module from scratch using Kubernetes APIs (e.g. CRDs), like Argo does.

Please reply if you don't mind.

ScheduledWorkflow CRD: CLI

Will this repository provide CLIs for managing the ScheduledWorkflow controller, like argo does? Or should subcommands be added to argo instead?

Date pickers in NewRun do not handle invalid days well

For example, if a user sets the start date to 2/31/2018, the form will show that date, but the component will set the start date to undefined.

We should either add constraints on the days ourselves, or at least show an error message indicating that the date is invalid.
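
As a sketch of the constraint (shown here in Python, though the form itself is frontend code), an impossible calendar day can be rejected before the form accepts it:

from datetime import datetime

def is_valid_date(text: str) -> bool:
    # strptime rejects impossible calendar days such as 2/31/2018.
    try:
        datetime.strptime(text, "%m/%d/%Y")
        return True
    except ValueError:
        return False

print(is_valid_date("2/31/2018"))  # False
print(is_valid_date("2/28/2018"))  # True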

Our test code and our test image code are not always the same.

This issue has manifested itself several times, most recently today when the Prow tests were fixed.

Our test images only use the branch code:

git clone https://github.com/kubeflow/pipelines /ml
cd /ml
git checkout 6e96b054fb2585f3577155fa92dd107c6e1b5dd2

But the tests that Prow runs are taken from the result of merging the base branch (master) with the PR branch.

I1103 23:36:06.094] Checkout: /workspace/github.com/googleprivate/ml master:296b540cd724fed645e1652f12428462fd5375ed,1532:5afe507591f58f76a12c9f0f3b6659a30b657060 to /workspace/github.com/googleprivate/ml
I1103 23:36:06.094] Call:  git init github.com/googleprivate/ml
I1103 23:36:06.101] Call:  git clean -dfx
I1103 23:36:06.105] Call:  git reset --hard
I1103 23:36:06.110] Call:  git config --local user.name 'K8S Bootstrap'
I1103 23:36:06.116] Call:  git config --local user.email k8s_bootstrap@localhost
I1103 23:36:06.122] Call:  git fetch --quiet --tags [email protected]:googleprivate/ml master +refs/pull/1532/head:refs/pr/1532
I1103 23:36:10.765] Call:  git checkout -B test 296b540cd724fed645e1652f12428462fd5375ed
I1103 23:36:11.199] Call:  git show -s --format=format:%ct HEAD
I1103 23:36:11.204] Call:  git merge --no-ff -m 'Merge +refs/pull/1532/head:refs/pr/1532' 5afe507591f58f76a12c9f0f3b6659a30b657060

This effectively means that the test code is taken from the merge with master, while the test image code is taken from the branch alone, so the two may be out of sync.

We should also do something like:

git clone https://github.com/kubeflow/pipelines
cd pipelines
git merge --no-ff 321ca814db4955b3950b0fac06a2d289fe4db39a -m "Merged PR"

SDK should require kubernetes client lib

It doesn't look like this install:

pip3 install https://storage.googleapis.com/ml-pipeline/release/0.1.2/kfp.tar.gz --upgrade

includes the kubernetes client lib, which I think is intended to be included, since the %%docker "magic" requires it.
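
A minimal sketch of the fix, assuming the SDK declares its dependencies in setup.py (the package list here is an assumption, not the actual file):

from setuptools import find_packages, setup

setup(
    name="kfp",
    packages=find_packages(),
    install_requires=[
        # Assumption: declaring the Kubernetes client so that
        # `pip install kfp` pulls it in for the %%docker magic.
        "kubernetes",
    ],
)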

Pipeline input cleansing

It might make sense to cleanse pipeline inputs.
For example, if a parameter requires a GCS path, a stray space before the path (e.g. " gs://pipeline/input-bucket") will cause the pipeline to fail. This type of bug is hard to detect.
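
A minimal sketch of the kind of cleansing suggested (the GCS-path check and function name are illustrative):

def cleanse_gcs_path(value: str) -> str:
    # Strip incidental whitespace before validating the parameter.
    value = value.strip()
    if not value.startswith("gs://"):
        raise ValueError(f"expected a GCS path, got {value!r}")
    return value

cleanse_gcs_path(" gs://pipeline/input-bucket")  # -> 'gs://pipeline/input-bucket'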

Compare experience – UX changes

  1. Rows selected for comparison should not display selection styling in the 'Overview' section. The checkboxes remain active and rows aren't highlighted. See below for row styling.

(screenshot omitted)

  2. Run overview section cannot be collapsed. Remove the collapse action.

  3. Parameters section table style below: runs (objects) are shown as rows, and parameters (attributes) are shown as columns. Match the table font style with the 'Overview' table (Roboto, 14px, for cell content).

(screenshot omitted)

  4. Metrics should be its own section, not part of the overview section.

(screenshot omitted)

  5. The title of the aggregate view should read "All selected runs".

  6. Vertically top-align all charts in a section.

Better render pipeline description

The sample now has a link to the source code, but the text is cropped.

(screenshot omitted)

We should probably also render the URL as a hyperlink, so users don't have to copy the path.

Remember the page I was on

In the All runs list, I click on a run, then use the browser's back button. This does not return me to the same page.

Run list perf optimizations

This is a quick analysis of the experiment details page performance. Most of the time is spent making multiple consecutive requests to load all the information we need. We currently do this:

  1. Call getExperiment API to get the details (name, description.. etc).
  2. Call listJobs API to get all recurring jobs in this experiment.
  3. Call listRuns API to show the first page of runs in this experiment.
  4. For each run (in parallel), call its getRun API to get its details (name, status, duration... etc).
  5. For each run (in parallel), call getPipeline on its pipeline ID, in order to show the pipeline name.
  6. For each run (in parallel), call getExperiment on its experiment ID, if any, to show the experiment name. This is not needed when listing runs of a given experiment, but it's technical debt we accumulated, since we're using the same component to list runs everywhere.

Some low-hanging perf improvements can be obtained by doing the following:

  • There is no need to do the first three steps in sequence; they're not interdependent (see the sketch below).
  • There is no need to do the last three steps in sequence; they all use the same metadata fields.
  • We can render the list of runs as we get them, before we get the details of each run and its pipeline. This will show their names and statuses, but not their run times or pipeline names; subsequent requests then re-render to fill out these fields.
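
A conceptual sketch of the first improvement (Python asyncio stands in for the frontend's request logic; the client object and its methods are hypothetical stand-ins for the getExperiment, listJobs, and listRuns APIs above):

import asyncio

async def load_experiment_page(client, experiment_id: str):
    # The three calls are independent, so issue them concurrently
    # instead of one after another.
    experiment, jobs, runs_page = await asyncio.gather(
        client.get_experiment(experiment_id),
        client.list_jobs(experiment_id),
        client.list_runs(experiment_id, page_size=10),
    )
    return experiment, jobs, runs_page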

Can't read full text in "Choose a pipeline" dialog

The visible parts of the Pipeline name and Description columns are too short to decide which pipeline to choose.

(screenshot omitted)

Maybe allow text wrapping with a max row height, and/or narrow the timestamp column?

(screenshot omitted)

Pipeline API Server Swagger Client (Go) for Pipeline Upload returns incomplete output

The Pipeline API Server Swagger Client (Go) returns incomplete output for the pipeline.upload method.

For instance, using the CLI in the directory:

kubeflow/pipelines/backend/src/cmd/ml

With the command:

go run main.go pipeline --namespace kubeflow upload ./samples/hello-world.yaml --name pipleline87 -o json

We get the output:

{
"created_at": "0001-01-01T00:00:00.000Z",
"parameters": null
}

feature request: restore the client method for creating a pipeline

In addition to run_pipeline (which doesn't actually create a pipeline object in the UI), bring back the client method for actually creating a pipeline.
People might want to share pipelines, or later run other experiments based on a pipeline definition when they no longer have the original notebook to hand, etc.
(Bradley has the context on this.)

Experiment list title should not change

When switching between the "Experiments" and "Runs" tabs, the page title changes and the "Create experiment" button is hidden. This is incorrect: the title and the actions above the tabs should not change based on tab selection, since that breaks the design system rules.

Also, the Experiments link in the nav must remain highlighted as long as the user is in that section.

(screenshots omitted)

Create a sample notebook

Our SDK samples need a notebook to demonstrate the ability to create components and pipelines.

Embeddable run view page

I'd like to insert the run view page into a notebook when I submit the run.

For this I'd like to have a minimal page without left or top navigation, with only the run view and the refresh button.
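
A hedged sketch of how such a minimal view could be embedded from a notebook, assuming the view existed at a dedicated URL (the URL pattern and query parameter below are hypothetical, not a documented KFP endpoint):

from IPython.display import IFrame

run_id = "my-run-id"  # hypothetical run id returned at submission time
# Embed the (hypothetical) minimal run view in the notebook output cell.
IFrame(f"https://kfp.example.com/#/runs/details/{run_id}?minimal=true",
       width=900, height=600)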

Doesn't remove old containers (> maxHistory)

Does the maxHistory parameter also remove old Argo workflows, or does it just define the number of records kept in workflowHistory?

I have the following configuration (kept short for simplicity):

apiVersion: kubeflow.org/v1alpha1
kind: ScheduledWorkflow
metadata:
  name: iris-trainer
  namespace: playground
spec:
  enabled: true
  maxHistory: 5
  trigger:
    cronSchedule:
      cron: "@hourly"
  workflow:
    spec:
    
      # argo workflow declaration    
      entrypoint: iris-train
      onExit: exit-handler

      arguments:
        parameters:
        - name: learning-rate
          value: "0.01"
        - name: num-boost-round
          value: "100"

      templates:

      - name: iris-train

In workflowHistory I see the 5 most recent records:

 trigger:
    LastIndex: 47
    LastTriggeredTime: 2018-09-03T03:00:00Z
    NextTriggeredTime: 2018-09-03T04:00:00Z
  workflowHistory:
    completed:
    - Phase: Succeeded
      createdAt: 2018-09-03T03:00:08Z
      finishedAt: 2018-09-03T03:00:28Z
      index: 47
      name: iris-trainer-47-991149392
      namespace: playground
      scheduledAt: 2018-09-03T03:00:00Z
      selfLink: /apis/argoproj.io/v1alpha1/namespaces/playground/workflows/iris-trainer-47-991149392
      startedAt: 2018-09-03T03:00:08Z
      uid: 7702b405-af25-11e8-a9d4-06bcfad5caf4
    - Phase: Succeeded
      createdAt: 2018-09-03T02:00:07Z
      finishedAt: 2018-09-03T02:00:27Z
      index: 46
      name: iris-trainer-46-1007927011
      namespace: playground
      scheduledAt: 2018-09-03T02:00:00Z
      selfLink: /apis/argoproj.io/v1alpha1/namespaces/playground/workflows/iris-trainer-46-1007927011
      startedAt: 2018-09-03T02:00:07Z
      uid: 1516d4a3-af1d-11e8-a9d4-06bcfad5caf4
    - Phase: Succeeded
      createdAt: 2018-09-03T01:00:07Z
      finishedAt: 2018-09-03T01:00:27Z
      index: 45
      name: iris-trainer-45-1024704630
      namespace: playground
      scheduledAt: 2018-09-03T01:00:00Z
      selfLink: /apis/argoproj.io/v1alpha1/namespaces/playground/workflows/iris-trainer-45-1024704630
      startedAt: 2018-09-03T01:00:07Z
      uid: b35d925e-af14-11e8-a9d4-06bcfad5caf4
    - Phase: Succeeded
      createdAt: 2018-09-03T00:00:08Z
      finishedAt: 2018-09-03T00:00:26Z
      index: 44
      name: iris-trainer-44-1041482249
      namespace: playground
      scheduledAt: 2018-09-03T00:00:00Z
      selfLink: /apis/argoproj.io/v1alpha1/namespaces/playground/workflows/iris-trainer-44-1041482249
      startedAt: 2018-09-03T00:00:08Z
      uid: 51fcd7bb-af0c-11e8-a9d4-06bcfad5caf4
    - Phase: Succeeded
      createdAt: 2018-09-02T23:00:08Z
      finishedAt: 2018-09-02T23:00:28Z
      index: 43
      name: iris-trainer-43-1058259868
      namespace: playground
      scheduledAt: 2018-09-02T23:00:00Z
      selfLink: /apis/argoproj.io/v1alpha1/namespaces/playground/workflows/iris-trainer-43-1058259868
      startedAt: 2018-09-02T23:00:08Z
      uid: f01968c7-af03-11e8-a9d4-06bcfad5caf4

Unfortunately, the old workflows weren't removed:

argo -n playground list
NAME                         STATUS      AGE    DURATION
iris-trainer-47-991149392    Succeeded   33m    20s
iris-trainer-46-1007927011   Succeeded   1h     20s
iris-trainer-45-1024704630   Succeeded   2h     20s
iris-trainer-44-1041482249   Succeeded   3h     18s
iris-trainer-43-1058259868   Succeeded   4h     20s
iris-trainer-42-1075037487   Succeeded   5h     18s
iris-trainer-41-1091815106   Succeeded   6h     19s
iris-trainer-40-1108592725   Succeeded   7h     19s
iris-trainer-39-3373026837   Succeeded   8h     19s
iris-trainer-38-3356249218   Succeeded   9h     19s
iris-trainer-37-3406582075   Succeeded   10h    20s
iris-trainer-36-3389804456   Succeeded   11h    20s
iris-trainer-35-3440137313   Succeeded   12h    21s
iris-trainer-34-3423359694   Succeeded   13h    18s
iris-trainer-33-3473692551   Succeeded   14h    18s
iris-trainer-32-3456914932   Succeeded   15h    19s
iris-trainer-31-3507247789   Succeeded   16h    19s
iris-trainer-30-3490470170   Succeeded   17h    19s
iris-trainer-29-3171842504   Succeeded   18h    18s
iris-trainer-28-3188620123   Succeeded   19h    18s
iris-trainer-27-3138287266   Succeeded   20h    18s
iris-trainer-26-3155064885   Succeeded   21h    18s
iris-trainer-25-3104732028   Succeeded   22h    19s
iris-trainer-24-3121509647   Succeeded   23h    18s
iris-trainer-23-3071176790   Succeeded   1d     19s
iris-trainer-22-3087954409   Succeeded   1d     18s
iris-trainer-21-3037621552   Succeeded   1d     19s
iris-trainer-20-3054399171   Succeeded   1d     20s
iris-trainer-19-1023865987   Succeeded   1d     18s
iris-trainer-18-1007088368   Succeeded   1d     20s
iris-trainer-17-1258752653   Succeeded   1d     19s
iris-trainer-16-1241975034   Succeeded   1d     18s
iris-trainer-15-1225197415   Succeeded   1d     21s
iris-trainer-14-1208419796   Succeeded   1d     18s
iris-trainer-13-1191642177   Succeeded   1d     20s
iris-trainer-12-1174864558   Succeeded   1d     19s
iris-trainer-11-1158086939   Succeeded   1d     20s
iris-trainer-10-1141309320   Succeeded   1d     19s
iris-trainer-9-3808217808    Succeeded   1d     20s
iris-trainer-8-3824995427    Succeeded   1d     19s
iris-trainer-7-4043104474    Succeeded   1d     18s
iris-trainer-6-4059882093    Succeeded   1d     20s
iris-trainer-5-4009549236    Succeeded   1d     18s
iris-trainer-4-4026326855    Succeeded   1d     19s
iris-trainer-3-3975993998    Succeeded   1d     25s
iris-trainer-2-3992771617    Succeeded   1d     20s
iris-trainer-1-3942438760    Succeeded   1d     21s
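
For reference, a conceptual sketch (in Python, though the actual controller is written in Go) of the pruning behavior the reporter expects maxHistory to perform:

def prune(workflows, max_history=5):
    # Keep the newest max_history workflows by trigger index;
    # everything older is a candidate for deletion.
    ordered = sorted(workflows, key=lambda w: w["index"], reverse=True)
    return ordered[:max_history], ordered[max_history:]

workflows = [{"name": f"iris-trainer-{i}", "index": i} for i in range(1, 48)]
keep, delete = prune(workflows)
# keep matches the five entries shown in workflowHistory (indexes 47..43);
# delete corresponds to the 42 older workflows still listed by `argo list`.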

Support cloning run started from notebook

Currently, runs have two ways of telling which pipeline was used to start them:

  • Either a pipeline was uploaded to the system first, in which case the run will include its id.
  • Or the run was started from a notebook (or the CLI), in which case it will (can) embed the entire pipeline spec; a sketch of this path follows.
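
For context, a minimal sketch of the notebook path, assuming a pipeline function is defined elsewhere and a KFP endpoint is reachable:

import kfp

client = kfp.Client()  # assumes a reachable Kubeflow Pipelines endpoint
# Submitting this way embeds the compiled pipeline spec in the run itself;
# no pipeline object is uploaded, so the run carries no pipeline id.
client.create_run_from_pipeline_func(
    my_pipeline,  # assumed: a @dsl.pipeline-decorated function defined elsewhere
    arguments={"param": "value"},  # hypothetical pipeline parameters
)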

The UI should pass that spec when cloning a run that does not have a pipeline id. We also need to figure out the UX, since a user might change their mind after starting a clone from a run and want to switch to another pipeline.

@ajayalfred any thoughts here?

Unsupported Scan Error While Listing the Jobs of an Experiment

Here is the raw HTTP:

GET /api/v1/namespaces/kubeflow/services/ml-pipeline:8888/proxy/apis/v1beta1/jobs?page_size=10&resource_reference_key.id=9333ecee-28b2-4c53-807d-bbfd2a45423f&resource_reference_key.type=EXPERIMENT HTTP/1.1
Host: 35.224.113.48
User-Agent: Go-http-client/1.1
Accept: application/json
Accept-Encoding: gzip

HTTP/2.0 500 Internal Server Error
Connection: close
Audit-Id: a32bebb7-3520-4819-9f2f-1003a0d39977
Content-Type: application/json
Date: Fri, 09 Nov 2018 08:55:32 GMT

{"error":"Failed to list jobs.: List jobs failed.: List data model failed.: InternalServerError: Failed to list jobs: sql: Scan error on column index 0, name "UUID": unsupported Scan, storing driver.Value type \u003cnil\u003e into type *string: sql: Scan error on column index 0, name "UUID": unsupported Scan, storing driver.Value type \u003cnil\u003e into type *string","code":13,"details":[{"@type":"type.googleapis.com/api.Error","error_message":"Internal Server Error","error_details":"Failed to list jobs.: List jobs failed.: List data model failed.: InternalServerError: Failed to list jobs: sql: Scan error on column index 0, name "UUID": unsupported Scan, storing driver.Value type \u003cnil\u003e into type *string: sql: Scan error on column index 0, name "UUID": unsupported Scan, storing driver.Value type \u003cnil\u003e into type *string"}]}
Raw error from the service: Failed to list jobs.: List jobs failed.: List data model failed.: InternalServerError: Failed to list jobs: sql: Scan error on column index 0, name "UUID": unsupported Scan, storing driver.Value type into type *string: sql: Scan error on column index 0, name "UUID": unsupported Scan, storing driver.Value type into type *string (code: 13)

Show results of the entire workflow in one view

Today the user needs to click on every step of the workflow to see its output. This is not convenient for complex reusable workflows. We need to support a mode where the user can see the output of the entire graph in a single feed.

Persist pod logs after they finish

This is needed so that pods can be garbage collected after they've finished, and to remove the dependency on the cluster state.

Currently, we're already seeing issues when a cluster is resized where the frontend can't find pods started by some of the runs.
