getindata / kedro-vertexai Goto Github PK

View Code? Open in Web Editor NEW

32.0 14.0 9.0 3.07 MB

Kedro Plugin to support running workflows on GCP Vertex AI Pipelines

Home Page: https://kedro-vertexai.readthedocs.io

License: Apache License 2.0

Python 100.00%

machinelearning mlops kedro kedro-plugin vertexai googlecloudplatform

kedro-vertexai's Introduction

Kedro Vertex AI Plugin

About

The main purpose of this plugin is to enable running kedro pipeline on Google Cloud Platform - Vertex AI Pipelines. It supports translation from Kedro pipeline DSL to kfp (pipelines SDK) and deployment to Vertex AI service with some convenient commands.

The plugin can be used together with kedro-docker to simplify preparation of docker image for pipeline execution.

Documentation

For detailed documentation refer to https://kedro-vertexai.readthedocs.io/

Usage guide

Usage: kedro vertexai [OPTIONS] COMMAND [ARGS]...

  Interact with Google Cloud Platform :: Vertex AI Pipelines

Options:
  -e, --env TEXT  Environment to use.
  -h, --help      Show this message and exit.

Commands:
  compile         Translates Kedro pipeline into JSON file with Kubeflow...
  init            Initializes configuration for the plugin
  list-pipelines  List deployed pipeline definitions
  run-once        Deploy pipeline as a single run within given experiment.
  ui              Open VertexAI Pipelines UI in new browser tab

Configuration file

kedro init generates configuration file for the plugin, but users may want to adjust it to match the run environment requirements. Check documentation for details - kedro-vertexai.readthedocs.io

kedro-vertexai's People

Contributors

Stargazers

Watchers

Forkers

szczeles michalbrys doxenix responsibleaiml millsks adrienpl neuron1c aranjandev mariusz89016

kedro-vertexai's Issues

[feature] Code upload instead of docker push workflow

This idea is borrowed from Azure ML (and this PR getindata/kedro-azureml#15 ) - where you define an Environment, which is a docker image which runs your image, but the code is not part of the image (only dependencies are present in the image).
The workflow for that will make Data Science iterations faster, as they will not have to build the docker image every time they want to run / debug something in Vertex AI. This issue itself will be partially addressed by #81 , but this would be a next iteration on that.

General workflow would work like this:

Docker image with dependencies is uploaded to the container registry.
User runs kedro vertexai run-once with some flag (or maybe we should have kedro vertexai run for docker and kedro vertexai run-once for this flow 💡)
The code of Kedro project is copied to GCS (first packaged and compressed) and the job is started within the container. The container should have a modified entrypoint which will first download the code from GCS and then execute it.

Please discuss the design with @em-pe and @szczeles before implemeting.

Set up Dependabot and CodeQL

Set up Dependabot and CodeQL scans to enable automated dependency updates and security alerts.

Add `wait_for_completion` option to pipeline config.

Add wait_for_completion boolean flag to config. When set true, after submitting the job process should query in loop for the pipeline status and upon completion log status message and return relevant exit code.

Implement a generic solution for external services authentication

We have cases where pipeline talks to external services like mlflow and this communication requires some kind of authentication. We've already done that in a bit hacky way in (src/kedro_vertexai/auth.py) but it's not very extensible and elegant, plus there are multiple scenarios to cover and we cannot foresee all of them so my idea is to a generic solution:

We provide an interface AuthProvider responsible for obtaining auth tokens (get_mlflow_tracking_token is a sample method to override)
We extend the configuration file with auth section:

auth: 
   provider: kedro_vertexai.auth.GoogleOAuthProvider
   params: <map>

Where provider is a subclass of AuthProvider implementing all the necessary methods and params is the map of constructor params for the class.
3. When pipeline runs provider is instantiated dynamically and injects auth tokens to all the necessary places.

Difficulty Defining CPU and GPU Machine Types in Kedro-Vertex (vertexai.yml)

Problem:
I'm encountering difficulty in defining the CPU and GPU machine types with respect to nodes and pipelines in vertexai.yml within the Kedro-Vertex framework.

Expected Behavior:
I expect to be able to specify the CPU and GPU machine types for nodes and pipelines in vertex.yml to effectively utilize CPU and GPU resources as needed.

Current Behavior:
I've searched through the documentation and codebase but haven't found clear instructions on how to achieve this. This makes it challenging to optimize the resource utilization for my specific workflow.

Steps to Reproduce:

Create a Kedro-Vertex project.
Attempt to define CPU and GPU types for nodes and pipelines in vertexai.yml.
Encounter difficulties or confusion in the process.

Additional Information:

I've reviewed the official documentation, but the guidance on this specific aspect seems to be lacking.
I've also searched for relevant examples or discussions on forums and GitHub issues but haven't found any direct solutions.

Environment:

Kedro version: 0.18.14
Kedro-Vertex version: 0.9.1
Python version: 3.8.18
Operating System: Mac

Suggested Solution:
It would be helpful to provide more detailed documentation or examples on how to define CPU and GPU machine types for nodes and pipelines in vertex.yml. Alternatively, if this feature is not yet supported, it would be great to know the current status and any workarounds.

Notes:
vertexai.yml is generated by command kedro vertexai init

This issue aims to improve resource management and clarity within Kedro-Vertex, making it easier for users to define CPU and GPU machine types for their nodes and pipelines. Your attention to this matter is greatly appreciated.

Refactor config to something less verbose

Config class is really verbose, a lot of repeatability could probably be avoided by using some library, e.g. pydantic.

[feature] Add --env-var to run-once

For feature parity with other plugins.

Assign pipelines to experiments

In Vertex AI there is a concept of experiments that helps to track pipeline runs metadata. We could add a new field to plugin config, allowing to specify experiment name, and use the value (if provided) to assign pipelines to this experiment.

Update documentation, installation instructions

Make it clearer that the plugin only reports commands when inside initialized kedro project. Add extra step to make a test project (like spaceflights instance) or point to the docs of kedro for steps to do that.

Fix issues with data catalog namespacing

When spaceflights starter in version 0.17.7 is used, running it on Vertex AI results in

ValueError: Pipeline input(s) {'data_processing.preprocessed_shuttles', 'data_processing.preprocessed_companies'} not found in the DataCatalog

Plug-in should be able to handle latest versions of the starters and pipeline namespacing.

Bumping version to python 3.11 leads to OverrideNeeded/Assertion error

error.txt

Figure out what blocks python 3.11. Probably could be related to kfp, but the error hints to jinja/mlflow.

Googling leads to not related issue: python-poetry/poetry#5791

Use Dynamic Workload Scheduler

Google launched Dynamic Workload Scheduler at the end of last year. It optimizes the allocation and utilization of AI/ML resources like GPUs and TPUs. It operates in two modes:

Flex Start Mode

Allows flexible and cost-effective access to GPUs/TPUs by requesting capacity as needed, with no minimum duration. This mode is ideal for shorter, less predictable tasks.

Calendar Mode

Enables users to reserve GPU/TPU capacity for specific future dates, ensuring availability for scheduled workloads, suitable for tasks with defined start times and durations.

Is this feature usable with the Kedro VertexAI plugin?

Unable to Define and Use Different Docker Images for Different Tasks in Kedro-Vertex (vertexai.yml)

Problem:
I am facing difficulties in Kedro-Vertex when trying to define and use different Docker images for distinct tasks, such as data preprocessing and model training, model inference.

Expected Behavior:
I expect to be able to specify and use separate Docker images for various tasks within my Kedro-Vertex workflow. This flexibility is crucial for optimizing resource utilization and dependencies for different stages of my vertex pipeline.

Current Behavior:
I have scoured the documentation and explored the codebase but have not found clear instructions on how to achieve this feature. As a result, I'm uncertain about how to implement different Docker images for different tasks.

Steps to Reproduce:

Set up a Kedro-Vertex project.
Attempt to define different Docker images for preprocessing and training tasks.
Encounter challenges or confusion during the process.

Additional Information:

I have reviewed the official documentation, but there is a lack of guidance on defining and using multiple Docker images for distinct tasks.
I have searched for relevant examples or discussions on forums and GitHub issues but have not found any direct solutions.

Environment:

Kedro version: 0.18.14
Kedro-Vertex version: 0.9.1
Kedro-Docker version: 0.4.0
Python version: 3.8.18
Operating System: Mac

Suggested Solution:
It would be incredibly valuable to provide documentation or examples demonstrating how to define and use different Docker images for various tasks within a Kedro-Vertex project. If this feature is not currently supported, it would be helpful to know its status and any potential workarounds.

Notes:
vertexai.yml is generated by command kedro vertexai init

This issue is aimed at improving the flexibility and resource management in Kedro-Vertex by allowing users to define and use different Docker images for different tasks. Your attention to this matter is greatly appreciated.

Add support for python 3.9 and 3.10

Make sure plugin can work with python versions 3.8, 3.9 and 3.10. Modify test process to verify all three versions as working as expected (matrix builds).

Solve FutureWarning: AIPlatformClient will be deprecated in v2.0.0

/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/kfp/v2/google/client/client.py:169: FutureWarning: AIPlatformClient will be deprecated in v2.0.0. Please use PipelineJob https://googleapis.dev/python/aiplatform/latest/_modules/google/cloud/aiplatform/pipeline_jobs.html in Vertex SDK. Install the SDK using "pip install google-cloud-aiplatform"
warnings.warn(

Evaluate use of defaultdict for resources section handling in config.

What about using from collections import defaultdict with
resources: Optional[Dict[str, ResourcesConfig]] = defaultdict(lambda: ResourcesConfig(cpu="500m", memory="1024Mi"))

it would also simplify (or even make not needed) the resources_for function.

Originally posted by @marrrcin in #11 (comment)

Investigate overriding default MemoryDataSet

Right now the plug-in requires all intermediate datasets to be stored in GCS, which forces the pipeline authors to remember that all intermediate datasets need to be explicitly defined in the data catalog.
Consider overriding the create_default_data_set in the ThreadRunner https://github.com/kedro-org/kedro/blob/74fbd752b6bf77c409f016ed73e60a4ea38d6a95/kedro/runner/thread_runner.py#L51 or find other extension point to make the use of intermediate datasets transparent to the users (we could store them in GCS, as we already have paths to bucket and unique run id in the plugin).

Params of type list not accepted in VertexAI

When trying to use parameters of type list vertexAI pipeline fails with the error.

INPUT in params.yml:

signature_columns:
  - email
  - created_at
  - creation_country
  - has_company_name
  - partner_id
  - account_type
  - followers_count
  - posts_count
  - is_private
  - is_verified
  - risk_level

Generated step that fails:

Error:

Add CMEK support

Allow to specify customer-managed encryption key for the Vertex pipeline execution.

Passing in runtime parameters seems to fail.

When trying to do a, as indicated in the docstring with '='

kedro vertexai run-once --image test:latest --pipeline default --param date_param=20240800

i always end up with this value error

ValueError: The pipeline parameter date_param=20240800 is not found in the pipeline job input definitions.

When using the ':' like described in the kedro documentation

kedro vertexai run-once --image test:latest --pipeline default --param date_param:20240800

I get

ValueError: The pipeline parameter date_param is not found in the pipeline job input definitions.

The parameter is set within the base config -> model_name -> data-preparation -> parameters.yaml
It's defined as:

param1: value1
param2: value2
data_preparation:
  training_month: "${runtime_params:date_param, 20230900}"
  param3: value3
  ...

Within the pipeline code it's called as:

...
# Create training and prediction DFs
def create_pipeline(**kwargs) -> Pipeline:
    """Function to create the data prep pipeline."""
  return pipeline(
      [
            ...
            node(
                func=nodes_data_prep.prepare_df,
                inputs={
                    "some_df": "some_df",
                    "training_month": "params:data_preparation.training_month",
                  },
                  outputs="model_input.master_df",
                  name="create_master_df",
                  tags=["data_preparation_prod", "group.data_preparation"],
            ),
            ...
      ] 
  )

I'm sure, i'm missing something, but i can't figure out how the --param argument passes in runtime parameters and how to define them correctly.
Param definitions tried:

--param date_param=20240800
--param training_month=20240800
--param date_param:20240800
--param training_month:20240800

E2E test fix unauthorized pull from docker repository

Error response from daemon: unauthorized: You don't have the needed permissions to perform this operation, and you may have invalid credentials. To authenticate your request, follow the steps in: https://cloud.google.com/container-registry/docs/advanced-authentication

Update kfp dependencies

Update plugin dependencies to recent versions:

kfp - 2.*

Allow specifying vertex service account in config file.

Currently service account can be passed only with env variable

kedro-vertexai/kedro_vertexai/client.py

Line 90 in 6710c40

service_account=os.getenv("SERVICE_ACCOUNT"),

ideally I should be able to use config file for this.

Auto docker-build on "run-once"

Having to build the docker image on every code change, before user runs kedro vertexai run-once is exhausting and error prone. Introduce --auto-build flag which will use Python Docker SDK to build the image and push it to GCR before executing the pipeline.
A few implementation notes:

If there is no dependency on docker sdk right now, make it as extra and warn about missing packages when --auto-build flag is used. If the dependency exist, just leave it as is.
--auto-build should warn the user if the image tag specified in the config is not :latest, because it means it will overwrite the image. Maybe an interactive prompt [y/n] will be enough here. Use click for that.

Please discuss other design decisions with @em-pe and @szczeles .

Update kedro dependencies

Update plugin dependencies to recent versions:

kedro - 0.18.*

Dependencies issues pydantic

Hello,

With kedro-vertexai 0.11.0, we can't use the plugin with kedro-viz or other pydantic dependent package since kedro-vertexai is locked to pydantic <2.0.0 and kedro-viz for kedro 19 > need pydantic >= 2.0.0

Is it possible to upgrade pydantic dependencie ?

Regards,

Extend E2E test scenario with mlflow support.

Current E2E tests don't cover use of kedro-mlflow. The idea is to extend tests coverage our own MlFlow instance.

installation fails with python 3.10 and 3.11

Hi,

I am having issues installing this plugin with any python version with pip install kedro-vertexai

with python 3.10

AttributeError: cython_sources
[end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

cause pyyaml pinned version and cython doesn't live happily together.

with python 3.11 installation fails cause pinned grpcio=1.44 doesn't support python 3.11 but I guess that's fair since that version is not officially supported.

Any suggestion?

Thanks

grcpio dependency update to >=1.53

There was a dependabot alert for gRPC being vulnerable for grcpio <1.53. I tried to update our environment, but we could not because kedro-vertexai has a requirement of grcpio >=1.44, <1.45.

$ poetry lock
Updating dependencies
Resolving dependencies... (5.4s)

Because no versions of kedro-vertexai match >0.7.0,<0.8.0 || >0.8.0,<0.8.1 || >0.8.1,<0.9.0 || >0.9.0,<1.0.0
 and kedro-vertexai (0.7.0) depends on grpcio (>=1.44.0,<1.45.0), kedro-vertexai (>=0.7.0,<0.8.0 || >0.8.0,<0.8.1 || >0.8.1,<0.9.0 || >0.9.0,<1.0.0) requires grpcio (>=1.44.0,<1.45.0).
And because kedro-vertexai (0.8.0) depends on grpcio (>=1.44.0,<1.45.0), kedro-vertexai (>=0.7.0,<0.8.1 || >0.8.1,<0.9.0 || >0.9.0,<1.0.0) requires grpcio (>=1.44.0,<1.45.0).
And because kedro-vertexai (0.8.1) depends on grpcio (>=1.44.0,<1.45.0)
 and kedro-vertexai (0.9.0) depends on grpcio (>=1.44.0,<1.45.0), kedro-vertexai (>=0.7.0,<1.0.0) requires grpcio (>=1.44.0,<1.45.0).
So, because django-kedro depends on both kedro-vertexai (>=0.7.0,<1.0.0) and grpcio (>=1.53.0,<2.0.0), version solving failed.

Solve FutureWarning: APIs imported from the v1 namespace

Right now we're using mostly v1 APIs from kfp.dsl, which causes a warning:

FutureWarning: APIs imported from the v1 namespace (e.g. kfp.dsl, kfp.components, etc) will not be supported by the v2 compiler since v2.0.0

We should solve those issues to make our plug-in up-to-date. There is a namespace with similar classes:

from kfp.v2.components.experimental.structures import ComponentSpec # and others...

But the API is not 1:1 compatible, so right now it's a breaking change for us.

Utilise kfp Artifacts

In kfp there is a concept of Artifacts that integrate with Vertex AI and help to track metadata. We could implement custom Kedro dataset that would utilise this feature.

Incorrect inheritance in class MLFlowGoogleOAuthCredentialsProvider

class MLFlowGoogleOAuthCredentialsProvider in gcp.py file should inheritance from RequestHeaderProviderWithKedroContext

Based on documentation - doc
If we will choose OAuth authorization and do like is desribed right now:

settings.py

DISABLE_HOOKS_FOR_PLUGINS = ("kedro-mlflow",)
from kedro_vertexai.auth.mlflow_request_header_provider_hook import MLFlowRequestHeaderProviderHook
from kedro_vertexai.auth.gcp import MLFlowGoogleOAuthCredentialsProvider
from kedro_mlflow.framework.hooks import MlflowHook
HOOKS = (MlflowHook(), MLFlowRequestHeaderProviderHook(MLFlowGoogleOAuthCredentialsProvider), )

we get error:
AssertionError: Provider class needs to be a subclass of RequestHeaderProviderWithKedroContext

Adding intermediate component Artifacts

Hi, in one of the projects I need to create and display a Kubeflow Artifact between the components to quickly and conveniently get the location of the data exchanged by the nodes. I haven't checked that yet but if the plugin saves the data in a binary format (or any other format that's not really human readable), I might have to implement a custom Dataset or some changes might be required in the plugin internals.

Would you be interested in such contribution?

Investigate scheduled_run_name

It seems like this value is outdated/no longer used. In the code experiment_name is the config parameter for run name in vertex.

Possibility to passs runtime parameters to compiled pipeline

Hello,

Im am just enquiring to find out if the is an option to configure runtime parameters for the compiled vertex pipeline? I have a requirement to set seperate run names at runtime of my vertex pipeline.

I can't quite figure out from the documentation if this is possible using the plugin. Thanks for your help

Templated configuration does not work with globals

When trying to use a global parameter saved in globals.yml, the following error is returned:

ValueError: Failed to format pattern '${data_root_path}': no config value found, no default provided

After updating hooks.py and adding a variable KEDRO_VERTEXAI_DISABLE_CONFIG_HOOK=true, global parameters start to be recognized.

Has developmet stopped?

Last commit seems to be from 1 year ago, which makes it risky to use this plug-in in production, without an ongoing maintenance.

Restore scheduler functionality

Scheduler was temporarily disabled due to slight issues with MLFlow integrations (token obtaining) and not working re-scheduling (old schedules were not disabled).
Restore this functionality in future releases.