kubeflow/pipelines
Machine Learning Pipelines for Kubeflow
Home Page: https://www.kubeflow.org/docs/components/pipelines/
License: Apache License 2.0
Here is the raw HTTP:
GET /api/v1/namespaces/kubeflow/services/ml-pipeline:8888/proxy/apis/v1beta1/jobs?page_size=10&resource_reference_key.id=9333ecee-28b2-4c53-807d-bbfd2a45423f&resource_reference_key.type=EXPERIMENT HTTP/1.1
Host: 35.224.113.48
User-Agent: Go-http-client/1.1
Accept: application/json
Accept-Encoding: gzip
HTTP/2.0 500 Internal Server Error
Connection: close
Audit-Id: a32bebb7-3520-4819-9f2f-1003a0d39977
Content-Type: application/json
Date: Fri, 09 Nov 2018 08:55:32 GMT
{"error":"Failed to list jobs.: List jobs failed.: List data model failed.: InternalServerError: Failed to list jobs: sql: Scan error on column index 0, name \"UUID\": unsupported Scan, storing driver.Value type \u003cnil\u003e into type *string: sql: Scan error on column index 0, name \"UUID\": unsupported Scan, storing driver.Value type \u003cnil\u003e into type *string","code":13,"details":[{"@type":"type.googleapis.com/api.Error","error_message":"Internal Server Error","error_details":"Failed to list jobs.: List jobs failed.: List data model failed.: InternalServerError: Failed to list jobs: sql: Scan error on column index 0, name \"UUID\": unsupported Scan, storing driver.Value type \u003cnil\u003e into type *string: sql: Scan error on column index 0, name \"UUID\": unsupported Scan, storing driver.Value type \u003cnil\u003e into type *string"}]}
Raw error from the service: Failed to list jobs.: List jobs failed.: List data model failed.: InternalServerError: Failed to list jobs: sql: Scan error on column index 0, name "UUID": unsupported Scan, storing driver.Value type &lt;nil&gt; into type *string: sql: Scan error on column index 0, name "UUID": unsupported Scan, storing driver.Value type &lt;nil&gt; into type *string (code: 13)
Currently, runs have two ways of indicating which pipeline was used to start them: a pipeline ID, or an embedded pipeline spec.
The UI should pass that spec when cloning a run that does not have a pipeline ID. We also need to figure out the UX, since a user might change their mind after starting a clone from a run and then want to switch to another pipeline.
@ajayalfred any thoughts here?
Our SDK samples need a notebook to demonstrate the ability to create components and pipelines.
I'd like to contribute to/help with this project; are there any milestones or actual code for pipelines?
It looks like there are two procedures.
Please reply if you don't mind.
It looks like it's showing the total # of jobs across all experiments, instead of just that experiment.
Does the maxHistory parameter cause old Argo workflows to be removed, or does it just define the number of records kept in workflowHistory?
I have the following configuration (kept short for simplicity):
apiVersion: kubeflow.org/v1alpha1
kind: ScheduledWorkflow
metadata:
  name: iris-trainer
  namespace: playground
spec:
  enabled: true
  maxHistory: 5
  trigger:
    cronSchedule:
      cron: "@hourly"
  workflow:
    spec:
      # argo workflow declaration
      entrypoint: iris-train
      onExit: exit-handler
      arguments:
        parameters:
        - name: learning-rate
          value: "0.01"
        - name: num-boost-round
          value: "100"
      templates:
      - name: iris-train
In workflowHistory I see the last 5 records:
trigger:
  LastIndex: 47
  LastTriggeredTime: 2018-09-03T03:00:00Z
  NextTriggeredTime: 2018-09-03T04:00:00Z
workflowHistory:
  completed:
  - Phase: Succeeded
    createdAt: 2018-09-03T03:00:08Z
    finishedAt: 2018-09-03T03:00:28Z
    index: 47
    name: iris-trainer-47-991149392
    namespace: playground
    scheduledAt: 2018-09-03T03:00:00Z
    selfLink: /apis/argoproj.io/v1alpha1/namespaces/playground/workflows/iris-trainer-47-991149392
    startedAt: 2018-09-03T03:00:08Z
    uid: 7702b405-af25-11e8-a9d4-06bcfad5caf4
  - Phase: Succeeded
    createdAt: 2018-09-03T02:00:07Z
    finishedAt: 2018-09-03T02:00:27Z
    index: 46
    name: iris-trainer-46-1007927011
    namespace: playground
    scheduledAt: 2018-09-03T02:00:00Z
    selfLink: /apis/argoproj.io/v1alpha1/namespaces/playground/workflows/iris-trainer-46-1007927011
    startedAt: 2018-09-03T02:00:07Z
    uid: 1516d4a3-af1d-11e8-a9d4-06bcfad5caf4
  - Phase: Succeeded
    createdAt: 2018-09-03T01:00:07Z
    finishedAt: 2018-09-03T01:00:27Z
    index: 45
    name: iris-trainer-45-1024704630
    namespace: playground
    scheduledAt: 2018-09-03T01:00:00Z
    selfLink: /apis/argoproj.io/v1alpha1/namespaces/playground/workflows/iris-trainer-45-1024704630
    startedAt: 2018-09-03T01:00:07Z
    uid: b35d925e-af14-11e8-a9d4-06bcfad5caf4
  - Phase: Succeeded
    createdAt: 2018-09-03T00:00:08Z
    finishedAt: 2018-09-03T00:00:26Z
    index: 44
    name: iris-trainer-44-1041482249
    namespace: playground
    scheduledAt: 2018-09-03T00:00:00Z
    selfLink: /apis/argoproj.io/v1alpha1/namespaces/playground/workflows/iris-trainer-44-1041482249
    startedAt: 2018-09-03T00:00:08Z
    uid: 51fcd7bb-af0c-11e8-a9d4-06bcfad5caf4
  - Phase: Succeeded
    createdAt: 2018-09-02T23:00:08Z
    finishedAt: 2018-09-02T23:00:28Z
    index: 43
    name: iris-trainer-43-1058259868
    namespace: playground
    scheduledAt: 2018-09-02T23:00:00Z
    selfLink: /apis/argoproj.io/v1alpha1/namespaces/playground/workflows/iris-trainer-43-1058259868
    startedAt: 2018-09-02T23:00:08Z
    uid: f01968c7-af03-11e8-a9d4-06bcfad5caf4
Unfortunately, the old workflows weren't removed:
argo -n playground list
NAME STATUS AGE DURATION
iris-trainer-47-991149392 Succeeded 33m 20s
iris-trainer-46-1007927011 Succeeded 1h 20s
iris-trainer-45-1024704630 Succeeded 2h 20s
iris-trainer-44-1041482249 Succeeded 3h 18s
iris-trainer-43-1058259868 Succeeded 4h 20s
iris-trainer-42-1075037487 Succeeded 5h 18s
iris-trainer-41-1091815106 Succeeded 6h 19s
iris-trainer-40-1108592725 Succeeded 7h 19s
iris-trainer-39-3373026837 Succeeded 8h 19s
iris-trainer-38-3356249218 Succeeded 9h 19s
iris-trainer-37-3406582075 Succeeded 10h 20s
iris-trainer-36-3389804456 Succeeded 11h 20s
iris-trainer-35-3440137313 Succeeded 12h 21s
iris-trainer-34-3423359694 Succeeded 13h 18s
iris-trainer-33-3473692551 Succeeded 14h 18s
iris-trainer-32-3456914932 Succeeded 15h 19s
iris-trainer-31-3507247789 Succeeded 16h 19s
iris-trainer-30-3490470170 Succeeded 17h 19s
iris-trainer-29-3171842504 Succeeded 18h 18s
iris-trainer-28-3188620123 Succeeded 19h 18s
iris-trainer-27-3138287266 Succeeded 20h 18s
iris-trainer-26-3155064885 Succeeded 21h 18s
iris-trainer-25-3104732028 Succeeded 22h 19s
iris-trainer-24-3121509647 Succeeded 23h 18s
iris-trainer-23-3071176790 Succeeded 1d 19s
iris-trainer-22-3087954409 Succeeded 1d 18s
iris-trainer-21-3037621552 Succeeded 1d 19s
iris-trainer-20-3054399171 Succeeded 1d 20s
iris-trainer-19-1023865987 Succeeded 1d 18s
iris-trainer-18-1007088368 Succeeded 1d 20s
iris-trainer-17-1258752653 Succeeded 1d 19s
iris-trainer-16-1241975034 Succeeded 1d 18s
iris-trainer-15-1225197415 Succeeded 1d 21s
iris-trainer-14-1208419796 Succeeded 1d 18s
iris-trainer-13-1191642177 Succeeded 1d 20s
iris-trainer-12-1174864558 Succeeded 1d 19s
iris-trainer-11-1158086939 Succeeded 1d 20s
iris-trainer-10-1141309320 Succeeded 1d 19s
iris-trainer-9-3808217808 Succeeded 1d 20s
iris-trainer-8-3824995427 Succeeded 1d 19s
iris-trainer-7-4043104474 Succeeded 1d 18s
iris-trainer-6-4059882093 Succeeded 1d 20s
iris-trainer-5-4009549236 Succeeded 1d 18s
iris-trainer-4-4026326855 Succeeded 1d 19s
iris-trainer-3-3975993998 Succeeded 1d 25s
iris-trainer-2-3992771617 Succeeded 1d 20s
iris-trainer-1-3942438760 Succeeded 1d 21s
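For reference, the pruning behavior I expected from maxHistory could be sketched like this (delete_workflow is a hypothetical stand-in for the actual Argo delete call, not a real API):

```python
def prune_workflows(workflows, max_history, delete_workflow):
    """Keep only the newest max_history completed workflows; delete the rest.

    `workflows` is a list of dicts with `index` and `name` keys;
    `delete_workflow` is a hypothetical callback standing in for the
    real Argo workflow deletion.
    """
    newest_first = sorted(workflows, key=lambda w: w["index"], reverse=True)
    # Everything beyond the first max_history entries is stale history.
    for workflow in newest_first[max_history:]:
        delete_workflow(workflow["name"])
    return [w["name"] for w in newest_first[:max_history]]

deleted = []
workflows = [{"index": i, "name": f"iris-trainer-{i}"} for i in range(1, 48)]
kept = prune_workflows(workflows, 5, deleted.append)
# kept holds iris-trainer-47 down to iris-trainer-43; the 42 older ones are deleted
```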
When switching between the "Experiments" and "Runs" tabs, the page title changes and the "Create experiment" button is hidden. This is incorrect: the title and the actions above the tabs should not change based on tab selection, since that breaks the design system rules.
Also, the Experiments link in the nav must remain highlighted as long as the user is in that section.
The API throws an invalid-input error if the experiment name is a duplicate. The UI needs to handle this properly.
Currently, the ScheduledWorkflow CRD reliably starts Argo workflows, but does not monitor that they complete successfully. It relies on retries embedded in the Argo workflow itself.
The ScheduledWorkflow CRD could provide a retry functionality.
It's probably going to be the most frequently used tab. Especially in cases where the user only has one or two experiments.
See https://github.com/golang/go/wiki/Modules
Argo has just switched recently: https://github.com/argoproj/argo/pull/1071/files
Hey. I would like to try this project out on minikube without GKE. Can't really find any docs around this.
This is a quick analysis of the experiment details page performance. Most of the time is spent because we have to make multiple consecutive requests to load all the information we need. We currently do this:
- getExperiment API to get the details (name, description, etc.).
- listJobs API to get all recurring jobs in this experiment.
- listRuns API to show the first page of runs in this experiment.
- getRun API to get each run's details (name, status, duration, etc.).
- getPipeline on its pipeline ID, in order to show the pipeline name.
- getExperiment on its experiment ID, if any, to show the experiment name. This is not needed when listing runs of a given experiment, but it's technical debt we accumulated, since we're using the same component to list runs everywhere.
This is needed by the UI to show the total number of resources when paging through them. It's also needed when showing the total number of recurring runs in an experiment.
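Since getExperiment, listJobs, and listRuns do not depend on one another, one possible mitigation is to issue them concurrently instead of sequentially, so page latency approaches the slowest call rather than the sum of all calls. A minimal sketch of the idea; the fetch functions here are hypothetical stand-ins, not the real API client:

```python
import asyncio

# Hypothetical async fetchers standing in for the real API calls.
async def get_experiment(experiment_id):
    await asyncio.sleep(0.01)  # simulate network latency
    return {"id": experiment_id, "name": "my-experiment"}

async def list_jobs(experiment_id):
    await asyncio.sleep(0.01)
    return [{"job": "recurring-1"}]

async def list_runs(experiment_id):
    await asyncio.sleep(0.01)
    return [{"run": "run-1"}]

async def load_experiment_page(experiment_id):
    # The three calls are independent, so run them concurrently:
    # total latency ~= the slowest call instead of the sum of all three.
    return await asyncio.gather(
        get_experiment(experiment_id),
        list_jobs(experiment_id),
        list_runs(experiment_id),
    )

experiment, jobs, runs = asyncio.run(load_experiment_page("abc-123"))
```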
It might make sense to sanitize the pipeline input.
For example, if there is a parameter that requires a GCS path, a stray leading space before the path (e.g. " gs://pipeline/input-bucket") will cause the pipeline to fail. Bugs of this type are hard to detect.
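A minimal sketch of the kind of pre-submission cleanup this suggests; sanitize_params is a hypothetical helper, not part of the actual SDK:

```python
def sanitize_params(params):
    """Strip leading/trailing whitespace from string parameter values.

    Hypothetical cleanup step applied before submitting a run, so that
    a stray space in a GCS path doesn't fail the pipeline.
    """
    return {
        name: value.strip() if isinstance(value, str) else value
        for name, value in params.items()
    }

cleaned = sanitize_params({"input-path": " gs://pipeline/input-bucket", "epochs": 10})
# cleaned["input-path"] == "gs://pipeline/input-bucket"; non-strings pass through
```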
If an exit handler is specified in the DSL, the compiler currently uses a bit of a hack to ensure that it is always called before the pipeline terminates: an extra DAG named something like exit-handler-1 is added to the compiled YAML. It is the only task of the entrypoint DAG, and it wraps all other steps within the pipeline (they are tasks within the exit handler's DAG).
This works alright as a workaround, but it clutters the UI in a rather unhelpful way, so we currently hide this node by checking whether its name starts with "exit-handler".
This is potentially problematic, as that is not the most unlikely name for a user to pick, so perhaps a prefix like "__exit-handler" would be better until a proper solution to the exit handler problem is found.
Currently, we have hardcoded the release version as the image tag in the samples. We need to make it easy to update these image tags during releases.
At best, we can avoid double releases.
Feature request: in the SDK, support get_or_create_experiment() in addition to create_experiment() (Bradley has the context).
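The intended semantics could be sketched as below; the client and its method names here are assumptions for illustration, not the real SDK surface:

```python
class _FakeClient:
    """Stand-in for the pipelines client; the method names are assumptions."""

    def __init__(self):
        self._experiments = {}
        self._next_id = 1

    def get_experiment_by_name(self, name):
        return self._experiments.get(name)

    def create_experiment(self, name):
        exp = {"id": str(self._next_id), "name": name}
        self._next_id += 1
        self._experiments[name] = exp
        return exp

def get_or_create_experiment(client, name):
    # Return the existing experiment if one with this name exists,
    # otherwise create it -- making the call idempotent.
    existing = client.get_experiment_by_name(name)
    return existing if existing is not None else client.create_experiment(name)

client = _FakeClient()
first = get_or_create_experiment(client, "my-exp")
second = get_or_create_experiment(client, "my-exp")
# second call reuses the first experiment instead of erroring on the duplicate
```

This would also sidestep the duplicate-name error described above, since repeated calls with the same name are safe.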
This is needed so that pods can be garbage collected after they've finished, and to remove the dependency on the cluster state.
Currently, we're already seeing issues when a cluster is resized where the frontend can't find pods started by some of the runs.
Will this repository provide a CLI for ScheduledWorkflows to manage the controller, like argo's CLI? Or will subcommands be added to argo?
The Pipeline API Server Swagger Client (Go) returns incomplete output for the pipeline.upload method.
For instance, using the CLI in the directory:
kubeflow/pipelines/backend/src/cmd/ml
With the command:
go run main.go pipeline --namespace kubeflow upload ./samples/hello-world.yaml --name pipleline87 -o json
We get the output:
{
"created_at": "0001-01-01T00:00:00.000Z",
"parameters": null
}
In the All runs list, I click on a run, then use the browser's back button. This does not return me to the same page.
Today the user needs to click on every step of the workflow to see its output. This is not convenient for complex reusable workflows. We need to support a mode where the user can see the output of the entire graph in a single feed.
- Run overview section cannot be collapsed. Remove the collapse action.
- Parameters section: use the table style below. Runs (objects) are shown as rows, and parameters (attributes) are shown as columns. Match the table font style with the 'Overview' table (Roboto, 14px, for cell content).
- The title of the aggregate view should read "All selected runs".
- Vertically top-align all charts in a section.
I'd like to insert the run view page into a notebook when I submit the run.
For this I'd like to have a minimal page without left or top navigation, with only the run view and the refresh button.
In addition to run_pipeline (which doesn't actually create a pipeline object in the UI), bring back the client method for actually creating the pipeline.
People might want to share pipelines, later run other experiments based on that pipeline definition but not have the original notebook to hand, etc.
(Bradley has the context on this.)
Let's remove dsl.python_op in favor of dsl.python_component.
The Experiments tab will be used much more frequently than the Pipelines tab.
The Experiments tab also gives much more useful information.
It doesn't look like this install:
pip3 install https://storage.googleapis.com/ml-pipeline/release/0.1.2/kfp.tar.gz --upgrade
includes the kubernetes client lib, which I think is intended to be included, as the %%docker "magic" requires it.
This issue has manifested itself several times. The latest was today, when the Prow tests were fixed.
Our test images only use the branch code:
git clone https://github.com/kubeflow/pipelines /ml
git checkout 6e96b054fb2585f3577155fa92dd107c6e1b5dd2
But the tests that Prow runs are taken from the result of merging the base branch (master) with the PR branch.
I1103 23:36:06.094] Checkout: /workspace/github.com/googleprivate/ml master:296b540cd724fed645e1652f12428462fd5375ed,1532:5afe507591f58f76a12c9f0f3b6659a30b657060 to /workspace/github.com/googleprivate/ml
I1103 23:36:06.094] Call: git init github.com/googleprivate/ml
I1103 23:36:06.101] Call: git clean -dfx
I1103 23:36:06.105] Call: git reset --hard
I1103 23:36:06.110] Call: git config --local user.name 'K8S Bootstrap'
I1103 23:36:06.116] Call: git config --local user.email k8s_bootstrap@localhost
I1103 23:36:06.122] Call: git fetch --quiet --tags [email protected]:googleprivate/ml master +refs/pull/1532/head:refs/pr/1532
I1103 23:36:10.765] Call: git checkout -B test 296b540cd724fed645e1652f12428462fd5375ed
I1103 23:36:11.199] Call: git show -s --format=format:%ct HEAD
I1103 23:36:11.204] Call: git merge --no-ff -m 'Merge +refs/pull/1532/head:refs/pr/1532' 5afe507591f58f76a12c9f0f3b6659a30b657060
This effectively means that the test code is always taken from master while test image code is taken from the branch and may be out of sync.
We should also do something like:
git clone https://github.com/kubeflow/pipelines
cd pipelines
git merge --no-ff 321ca814db4955b3950b0fac06a2d289fe4db39a -m "Merged PR"
For example, if a user sets the start date to 2/31/2018, the form will show that date, but the component will set the start date to undefined.
We should either add constraints on the days ourselves, or at least show an error message indicating that the date is invalid.
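The validation itself is simple: a round-trip through a strict date parser rejects impossible calendar dates. A minimal sketch of the constraint the form could apply (the actual UI component is not shown here):

```python
from datetime import datetime

def is_valid_date(date_str):
    """Return True only if date_str is a real calendar date in M/D/YYYY form.

    strptime rejects dates like 2/31/2018 ("day is out of range for month"),
    which is exactly the constraint the form is missing.
    """
    try:
        datetime.strptime(date_str, "%m/%d/%Y")
        return True
    except ValueError:
        return False

print(is_valid_date("2/28/2018"))  # True
print(is_valid_date("2/31/2018"))  # False: February has no 31st
```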
I think it should be "/pipelines", because our project is called that. But we could also make /pipeline redirect to it.
-- EDIT
"Last 5 runs" - is the last one (the rightmost) the real last run?
(Note: not sure if this is already tracked).
With the current way the Pipeline API Server swagger client (Go) is implemented, it does not seem possible to specify a "name" in the "Pipeline Create" API call.