exasol / sagemaker-extension

An Exasol extension to interact with AWS SageMaker from inside the database

License: MIT License

Python 57.06% Shell 12.12% Lua 30.04% Dockerfile 0.28% Java 0.50%

exasol-integration exasol sagemaker machine-learning data-science

sagemaker-extension's Issues

Update dependencies

Update typeguard = "^2.11.1", since the latest versions lead to the following error:
TypeError: typechecked() got an unexpected keyword argument 'always'

Validate jobname and endpointname inputs

endpointname:

should match the SQL variable name pattern
should match the AutoMLJob name pattern [1]

jobname:

should match the AutoMLJob name pattern [1]

[1] https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateAutoMLJob.html
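A minimal sketch of such a validation. The job name pattern and the 32-character limit are taken from the AutoMLJobName description in [1] as it read at the time of writing and should be re-checked there. The SQL identifier pattern and both function names are assumptions made for illustration.

```python
import re

# AutoMLJobName constraints as documented in [1] (verify against current docs).
AUTOML_JOB_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9]){0,31}")
AUTOML_JOB_NAME_MAX_LENGTH = 32

# Assumption: a conservative pattern for unquoted SQL identifiers.
SQL_IDENTIFIER_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_job_name(name: str) -> bool:
    """Check a name against the AutoMLJob name pattern and length limit."""
    return (len(name) <= AUTOML_JOB_NAME_MAX_LENGTH
            and AUTOML_JOB_NAME_PATTERN.fullmatch(name) is not None)


def is_valid_endpoint_name(name: str) -> bool:
    """An endpoint name has to satisfy both patterns at once."""
    return (is_valid_job_name(name)
            and SQL_IDENTIFIER_PATTERN.fullmatch(name) is not None)
```

Under these assumptions the two patterns only overlap on purely alphanumeric names that start with a letter, because hyphens are not valid in the SQL pattern and underscores are not valid in the AutoMLJob pattern.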

Add tutorial link to README

Add the link to the prepared SME tutorial to the README of the SME project

Add Python3 UDF to orchestrate the SageMaker Autopilot training from Exasol

Background

We want to start the training of SageMaker Autopilot from Exasol
For that, we can use a Python3 UDF which can use either boto3 or the SageMaker SDK
- boto3 is fairly low-level, but it is installed by default in the script-language containers
- the SageMaker SDK is easier to use and more capable than boto3, but you would need to add it to the script-language container, see exasol/script-languages-release#336
The following notebook shows at the end what you need to call from the SageMaker Python SDK
- https://github.com/exasol/data-science-examples/blob/e85ce663a0474c60bd8d8700c9ae60c05c38f1e3/tutorials/machine-learning/python/sagemaker/sagemaker_autopilot.ipynb

Tips

To get a project-specific container, follow the example in the data-science-utils-python repo
- build script: https://github.com/exasol/data-science-utils-python/blob/main/build_language_container.sh
- container-context (needed to build the container): https://github.com/exasol/data-science-utils-python/tree/main/language_container
- container-flavor: https://github.com/exasol/data-science-utils-python/tree/main/language_container/exasol-data-science-utils-python-container
- build language container from pytest: https://github.com/exasol/data-science-utils-python/blob/main/tests/integration_tests/fixtures/build_language_container_fixture.py
- here is a tutorial for the script language containers https://github.com/exasol/data-science-examples/blob/main/tutorials/script-languages/script-languages.ipynb
The following notebook shows how you can tell the UDFs to use the instance-level credentials, see comment #2 (comment)
The Exasol DB needs the permissions to access SageMaker and S3
Use pytest for testing
Separate UDF specific parts of the code from core logic for better testability
Use #1 for testing
Use udf-mock-python for testing

Acceptance Criteria

A Python3 UDF which starts a SageMaker training, waits for its completion, and returns the training id
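The core logic behind this criterion could be sketched as follows, kept separate from the UDF wrapper so that it can be tested with a fake client. It assumes a boto3 SageMaker client (create_auto_ml_job and describe_auto_ml_job are boto3 calls); the function name train_and_wait, the job_config dictionary, and the polling interval are hypothetical.

```python
import time
from typing import Any, Callable, Dict

# Statuses after which the job will not change anymore.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def train_and_wait(client: Any, job_name: str, job_config: Dict[str, Any],
                   poll_seconds: float = 30.0,
                   sleep: Callable[[float], None] = time.sleep) -> str:
    """Start an Autopilot job, block until it ends, and return its name.

    `client` is expected to behave like boto3.client("sagemaker");
    `job_config` holds the remaining CreateAutoMLJob arguments
    (InputDataConfig, OutputDataConfig, RoleArn, ...).
    """
    client.create_auto_ml_job(AutoMLJobName=job_name, **job_config)
    while True:
        description = client.describe_auto_ml_job(AutoMLJobName=job_name)
        status = description["AutoMLJobStatus"]
        if status in TERMINAL_STATUSES:
            break
        sleep(poll_seconds)
    if status != "Completed":
        raise RuntimeError(f"Autopilot job {job_name} ended with status {status}")
    return job_name
```

Injecting the client and the sleep function keeps the loop testable without AWS access, in line with the tip to separate UDF-specific parts from the core logic.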

Add probability score associated with the prediction

For classification problems, the probability of the predicted result can be presented as a second target column.
As in AutoML, we provide problem_type as an optional parameter in this extension.
We need to get the problem_type from the model. While deploying the endpoint, the problem_type should be determined and the PredictionUDF should be prepared accordingly.

/bin/sh: amalg.lua: command not found

Dear team,
while trying to install sagemaker extension, I got the attached issue.

Please let me know if you require any additional details.
databaseProductVersion 7.1.3

Fix python version in release yaml file

specify python version in release_droid_upload_github_release_assets.yml

Enhancements in deployment of the extension

Continuation of the issue #13

make refinements in the implementation
refactor the modules

Add Lua Function for a Lua Script to export a query to S3 as CSV files

Background

We want to train models with SageMaker
SageMaker expects the data in an S3 bucket
Exasol can export the data to S3 as CSV files via the EXPORT command
We need a Lua Function for a Lua Script which exports a SQL query to S3
https://docs.exasol.com/7.1/database_concepts/scripting.htm
Lua Scripts support the Exasol extension called pquery, which allows them to run queries in the same transaction in which the script was called
- https://docs.exasol.com/7.1/database_concepts/scripting/db_interaction.htm

Tips

This project will be a mixed Lua/Python project. However, I recommend setting it up as a Python project with poetry as the build tool and dependency manager. For the Lua parts I recommend following the blog articles in our community https://community.exasol.com/t5/database-features/exasol-loves-lua-part-1-how-to-use-eclipse-ide-support-for/ta-p/752
- There should also be a Lua plugin for PyCharm if you prefer that
Write the Lua function in a way that pquery is injected into it, to allow local development and testing with a mock
- See here to bundle Lua modules https://community.exasol.com/t5/database-features/exasol-loves-lua-part-3-handling-modules/ta-p/2134
For integration tests you need to use https://github.com/exasol/terraform-aws-exasol-test-setup and for continuous integration you need https://github.com/exasol/ci-isolation-aws
Use pytest with pyexasol for integration tests
Check here to see how you need to export the data for SageMaker Autopilot: https://github.com/exasol/data-science-examples/blob/e85ce663a0474c60bd8d8700c9ae60c05c38f1e3/tutorials/machine-learning/python/sagemaker/sagemaker_autopilot.ipynb

Acceptance Criteria

We have a Lua function which exports a query in parallel as CSV files
We have unit tests with a Mock for pquery
We have integration tests
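The injection idea from the tips can be illustrated with a short sketch. It is written in Python rather than Lua only to keep it runnable next to the pytest-based tests; the `execute` argument plays the role of pquery. The function names, the file naming scheme, and the assumption that several FILE clauses in one EXPORT statement produce a parallel export are illustrative and should be checked against the Exasol EXPORT documentation.

```python
from typing import Callable, List


def build_export_statement(query: str, connection_name: str,
                           s3_prefix: str, parallelism: int) -> str:
    """Build an EXPORT statement that writes one CSV file per FILE clause."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    files = "\n".join(
        f"FILE '{s3_prefix}/part_{index}.csv'" for index in range(parallelism)
    )
    return f"EXPORT ({query})\nINTO CSV AT {connection_name}\n{files}"


def export_to_s3(execute: Callable[[str], object], query: str,
                 connection_name: str, s3_prefix: str,
                 parallelism: int = 4) -> str:
    """Run the export through the injected executor (pquery in Lua)."""
    statement = build_export_statement(query, connection_name,
                                       s3_prefix, parallelism)
    execute(statement)
    return statement


if __name__ == "__main__":
    executed: List[str] = []
    # A list's append method serves as a trivial mock for the executor.
    export_to_s3(executed.append, "SELECT * FROM T", "S3_CONN", "train", 2)
    print(executed[0])
```

Because the executor is passed in, a unit test can replace it with a mock and assert on the generated statement, which is the same property the Lua function gets from an injected pquery.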

Remove unnecessary release droid files

Background

Release droid only needs release_droid_upload_github_release_assets.yml
The following configs can be removed: release_droid_prepare_original_checksum.yml, release_droid_print_quick_checksum.yml

Acceptance Criteria

The two yaml files are removed

Background

The repository is more or less empty
We want to treat the repository as a Python project
For that we need
- poetry setup
- githooks (see https://github.com/exasol/bucketfs-utils-python/tree/main/githooks)
- github workflows (see https://github.com/exasol/bucketfs-utils-python/tree/main/.github/workflows, except Github pages)
- prepare repository structure

Acceptance Criteria

the above listed items are present and work in the repository

Create Python CLI Tool to deploy the extension

Background

We need a way to deploy the CREATE SCRIPT statements.
All scripts need to be created in a schema

Acceptance Criteria

User can provide host of the DB
User can provide credentials for the DB
User can provide a schema where the scripts get installed to
The CLI installs all necessary CREATE SCRIPT statements to run the sagemaker extension.
The CLI can print out all CREATE SCRIPT statements, so that the user can copy them and run them via another SQL client
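A minimal sketch of a CLI covering these criteria, using argparse from the standard library. The options --host, --port, --user, --pass and --schema follow the deploy_cli invocation quoted in the TLS issue on this page; the --print flag, the default port, and the deploy function are hypothetical.

```python
import argparse
from typing import Callable, List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Deploy the SageMaker extension scripts (sketch)")
    parser.add_argument("--host", required=True, help="host of the database")
    parser.add_argument("--port", type=int, default=8563)
    parser.add_argument("--user", required=True)
    parser.add_argument("--pass", dest="password", required=True)
    parser.add_argument("--schema", required=True,
                        help="schema the scripts get installed to")
    # Hypothetical flag: only print the statements instead of running them.
    parser.add_argument("--print", dest="print_only", action="store_true")
    return parser


def deploy(args: argparse.Namespace, statements: List[str],
           execute: Callable[[str], object]) -> Optional[str]:
    """Either return all statements for copying or execute them one by one."""
    if args.print_only:
        return "\n".join(statements)
    for statement in statements:
        execute(statement)
    return None


if __name__ == "__main__":
    cli_args = build_parser().parse_args()
    output = deploy(cli_args, ["-- CREATE SCRIPT statements go here"], print)
    if output is not None:
        print(output)
```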

Support temporary AWS session credentials

Background

Many users secure their AWS accounts with MFA, and CI jobs commonly assume a role. In both cases you get temporary session credentials instead of using a user with permanent credentials
The extension should support the usage of those temporary session credentials

Acceptance Criteria

The extension should support permanent credentials and temporary session credentials
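One way to support both credential kinds is to treat the session token as optional, sketched below. boto3.Session does accept aws_access_key_id, aws_secret_access_key, aws_session_token and region_name; the helper names are hypothetical, and how the extension actually stores the credentials is not specified here.

```python
from typing import Dict, Optional


def build_session_kwargs(access_key: str, secret_key: str, region: str,
                         session_token: Optional[str] = None) -> Dict[str, str]:
    """Collect boto3.Session arguments; the token is only set if given."""
    kwargs = {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
        "region_name": region,
    }
    if session_token:
        # Temporary credentials (MFA, assumed role) come with a session token.
        kwargs["aws_session_token"] = session_token
    return kwargs


def create_session(access_key: str, secret_key: str, region: str,
                   session_token: Optional[str] = None):
    """Create a boto3 session for permanent or temporary credentials."""
    import boto3  # imported lazily so the helper above has no dependencies
    return boto3.Session(
        **build_session_kwargs(access_key, secret_key, region, session_token))
```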

Setup ci-isolation for integration test

An integration test has been performed using the AWS emulator Localstack. However, it is also necessary to run CI against real AWS.

Setup ci-isolation environment (https://github.com/exasol/ci-isolation-aws)
Specify external configuration to pytest so that it is possible to switch between AWS and Localstack integrations.
In the Localstack integration, Localstack is on the same network as the Exasol container and uses a fixed IP address. Update this configuration, taking into account the risk of overlapping IP addresses.
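The switch between the two backends could be exposed as a pytest command-line option in conftest.py, roughly as sketched here. pytest_addoption and parser.addoption are regular pytest hooks; the option names, the default host, and resolve_endpoint_url are assumptions. Port 4566 is the default Localstack edge port.

```python
from typing import Optional

LOCALSTACK_PORT = 4566  # default edge port of Localstack


def pytest_addoption(parser):
    """conftest.py hook: register the (hypothetical) backend options."""
    parser.addoption("--aws-backend", action="store", default="localstack",
                     choices=("localstack", "aws"),
                     help="run integration tests against Localstack or AWS")
    parser.addoption("--localstack-host", action="store", default="localhost",
                     help="host name or IP address of the Localstack container")


def resolve_endpoint_url(backend: str,
                         localstack_host: str = "localhost") -> Optional[str]:
    """Return the endpoint URL for AWS clients; None selects real AWS."""
    if backend == "aws":
        return None
    if backend == "localstack":
        return f"http://{localstack_host}:{LOCALSTACK_PORT}"
    raise ValueError(f"unknown backend: {backend}")
```

A fixture could then read the values with request.config.getoption("--aws-backend") and pass the resulting URL as endpoint_url when creating the AWS clients. Passing the Localstack host as an option instead of hard-coding an IP address is one way to reduce the risk of address clashes.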

Update to Lua 5.4 and the newest exaerror version

This project seems to depend on an exaerror version which doesn't support Lua 5.4. However, newer Exasol versions use Lua 5.4, so we should test with that.

Get details of the polled Autopilot job status

The current polling implementation only provides general job statuses such as FeatureEngineering, Training, ...
Get details of the job status, e.g. the reason in case of a failure, by following the steps below:

call DescribeTrainingJobResponse
check FailureReason field
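As a sketch of the second step, the following function extracts the details from a describe response. It assumes the response dictionary of boto3's describe_auto_ml_job, which carries AutoMLJobStatus, AutoMLJobSecondaryStatus and, for failed jobs, FailureReason; the function name and the shape of the returned summary are hypothetical.

```python
from typing import Any, Dict


def summarize_job_status(description: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce a describe response to the status fields of interest."""
    summary = {
        "status": description["AutoMLJobStatus"],
        "secondary_status": description.get("AutoMLJobSecondaryStatus"),
    }
    if summary["status"] == "Failed":
        # FailureReason is only expected to be present for failed jobs.
        summary["failure_reason"] = description.get("FailureReason", "unknown")
    return summary
```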

Update Developer Guide

in Developer Guide

explain how to add AWS role for CI

the other updates:

update change log for recent changes
put credentials of the CI user into the keeper

Add Python3 UDF to poll Autopilot training status

Autopilot consists of three main steps: Analyzing Data, Feature Engineering, Model Tuning. In addition, it has five different job statuses: Completed, InProgress, Failed, Stopped, Stopping.

We need to observe these steps and statuses of training models. For this, the following steps can be implemented as a UDF script:

Create a table including metadata of training models like model_name
Retrieve status of a given model which trains on Autopilot and insert it into a table

Parallel execution of CI Tests of SME

The CI tests of SME were implemented in issue #34.
These tests take too long (~2.5 h)
Parallelize their execution

Prepare developer guide for SME

Prepare a developer guide for SME that

describes the regeneration and development setup
shows how the project is built and tested

Add option for SSL certificate handling to Deployer classes

Background

In #83 we temporarily fixed the connection problems with v8 by disabling the SSL verification
However, a proper fix needs to let the user choose whether to use the verification or not
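Such an option could be passed down to the connection setup roughly as follows. The sketch assumes pyexasol.connect with its dsn, user, password, encryption and websocket_sslopt arguments; the function name and the verify_certificate flag are hypothetical, and the actual Deployer classes may be structured differently.

```python
import ssl
from typing import Any, Dict


def build_connection_kwargs(host: str, port: int, user: str, password: str,
                            verify_certificate: bool = True) -> Dict[str, Any]:
    """Build keyword arguments for pyexasol.connect with TLS enabled."""
    kwargs: Dict[str, Any] = {
        "dsn": f"{host}:{port}",
        "user": user,
        "password": password,
        "encryption": True,  # V8 only accepts TLS connections
    }
    if not verify_certificate:
        # Explicit opt-out, e.g. for self-signed certificates in test setups.
        kwargs["websocket_sslopt"] = {"cert_reqs": ssl.CERT_NONE}
    return kwargs
```

A deployer would then call pyexasol.connect(**build_connection_kwargs(...)), with verification enabled unless the user disables it.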

Background

With poetry 1.4.0, poetry build no longer creates the setup.py https://github.com/python-poetry/poetry/releases/tag/1.4.0
We currently use poetry build to generate the setup.py
However, the setup.py isn't needed with newer pip versions or if there are releases to PyPI or as wheels
For that reason, we can remove the setup.py and the githook that generates it from this repo

Acceptance Criteria

Update workflows to poetry 1.4.0
Remove setup.py Github Workflow
Remove setup.py githook
Remove setup.py file

Background

This project needs Lua for testing and development
Currently, the project relies on the system Lua which could have the wrong Lua version

Update release version for the release 0.3.0

Since the release version is not updated, the release 0.3.0 has assets with older version number 0.2.2

Implement PredictionUDF using api-endpoint

Background

The extension only becomes useful with prediction from Exasol
Question:
1. Do we create a UDF per model / training job? (preferred)
- it is possible to hard-code model-specific things, like the name of the model connection object
  - the connection object can be changed without changing the UDF
- you can do proper typing, i.e. proper types for input and output variables
2. Or do we use a generic UDF with variadic parameters?
Option 1 requires a Lua script which creates the UDF, the connection object, and the API endpoint
How to shut down and resume the endpoint?

Investigate sagemaker serializer for prediction

Currently using CSVSerializer.
Check out the others and determine the best one in terms of cost/efficiency
https://sagemaker.readthedocs.io/en/stable/api/inference/serializers.html#sagemaker.serializers.NumpySerializer

Make batch_size configurable in PredictionUDF

keep batch_size in model_connection_obj to make it configurable
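A sketch of how this could look, assuming the model connection object stores its settings as a JSON string. The JSON layout, the default value, and both function names are hypothetical.

```python
import json
from typing import Iterable, Iterator, List, Tuple

DEFAULT_BATCH_SIZE = 100  # hypothetical fallback


def read_batch_size(connection_address: str) -> int:
    """Read batch_size from the JSON payload of the connection object."""
    config = json.loads(connection_address)
    batch_size = int(config.get("batch_size", DEFAULT_BATCH_SIZE))
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    return batch_size


def iter_batches(rows: Iterable[Tuple], batch_size: int) -> Iterator[List[Tuple]]:
    """Group input rows into lists of at most batch_size rows."""
    batch: List[Tuple] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # last, possibly smaller batch
```

Each yielded batch would then be sent to the endpoint in one request, so the value can be tuned per model without changing the UDF code.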

Add Autopilot endpoint information into Metadata table

Keep the SageMaker Autopilot endpoint information in the metadata row of the relevant model, so that users can keep track of which model is deployed and of endpoint details such as the endpoint name.
Discuss how to update a row when an endpoint is created/deleted.

Cannot connect to Exasol V8 over TLS

The command

python3 -m exasol_sagemaker_extension.deployment.deploy_cli --host=w.x.y.y --port=8563 --user=xxx --pass=yyy --schema=RETAIL

returns an error, that it cannot connect via Non-TLS connections. V8 only allows TLS connections:

pyexasol.exceptions.ExaRequestError:
(
message => Connection exception - Only TLS connections are allowed.
dsn => www.xxx.yyy.zzz:8563
user => xxxxxxxxxxx
schema =>
session_id =>
code => 08004
)

Needs to be fixed for V8

Save polled Autopilot training status into a log table

Add an option which saves polled statuses into a log table at a given interval

Training might take long, so it might produce too many status logs. It might be better to restrict the polling interval.
The log table should be deleted after training is finished.

Add Lua Script which combines export and training

Background

In this feature we want to call the following features from a Lua Script to have a single point of usage for the user
- #2
- #1

Acceptance Criteria

Lua Script which first exports the data and then starts the training of SageMaker Autopilot
Write a User Guide including deployment and running it

Currently the aws tests fail with
botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the CreateAutoMLJob operation: Could not assume role ***. Please ensure that the role exists and allows principal 'sagemaker.amazonaws.com' to assume the role.

Evaluate in-database prediction for trained Auto-ML models

Background

SageMaker autopilot stores the models in S3 and they look like we might be able to run them in the UDFs
We need to test if we can run some in python outside the DB and inside
Check if different models and preprocessing steps have somewhat consistent interface
How do we find the best candidate?

Implement GetCandidateListUDF script

This script returns a list of the names of the trained models.
If users want to run a model other than the best model, they can take the name of the model from this list.
Update the endpoint deployment script to take the model name as an input argument
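The lookup behind such a script could be sketched like this. It assumes a boto3 SageMaker client, whose list_candidates_for_auto_ml_job call returns a Candidates list and an optional NextToken; the function name is hypothetical.

```python
from typing import Any, Dict, List


def get_candidate_names(client: Any, job_name: str) -> List[str]:
    """Collect the names of all candidates of an Autopilot job."""
    names: List[str] = []
    kwargs: Dict[str, Any] = {"AutoMLJobName": job_name}
    while True:
        response = client.list_candidates_for_auto_ml_job(**kwargs)
        names.extend(c["CandidateName"] for c in response["Candidates"])
        token = response.get("NextToken")
        if not token:
            return names
        kwargs["NextToken"] = token  # fetch the next page of results
```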

Add documentation folder

add doc folder and prepare a structure for it

Prepare release to PyPi

Background

We need to remove the setup.py, before that, we need to release the project to pypi
Here is an example workflow for this https://github.com/exasol/error-reporting-python/blob/79aca5e435de66792150da6e365eee3d0778a237/.github/workflows/release.yaml#L25
@redcatbear or @kaklakariada have to add the organization secret to this project

Acceptance Criteria

Extend the release workflow
Add the organization secret

Run real tests sequentially

Real tests, which communicate with AWS services, depend on each other's results.
In real tests, the flow is as follows: train -> (poll) -> deploy -> predict -> delete
Note that the training part runs asynchronously. In a sequential run, its completion has to be checked by polling.

Fix CI setup

This project uses AWS CI isolation to run its integration tests with SageMaker
However, SageMaker needs a role with specific permissions (SageMakerFullAccess and S3FullAccess)
In general, we could create this role once in the CDK description and make it protected, so that it doesn't get cleaned up
The issue with this is that the CI isolation master setup doesn't allow a CI user to pass or assume a protected role
- https://github.com/exasol/ci-isolation-aws/blob/977ea37602b1efd56ae376e8846555d49fa8fc9c/src/main/java/com/exasol/ciisolation/aws/cleanup/AccountCleanupStack.java#L87
- https://github.com/exasol/ci-isolation-aws/blob/977ea37602b1efd56ae376e8846555d49fa8fc9c/src/main/java/com/exasol/ciisolation/aws/cleanup/AccountCleanupStack.java#L94
- Solutions
  - Introduce a role which is not removed, but which may be assumed and passed
  - Create the role while running the tests
Needed role and policies

aws iam create-role --role-name sagemaker-role --assume-role-policy-document '{
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Allow",
                        "Principal": {
                            "Service": "sagemaker.amazonaws.com"
                        },
                        "Action": "sts:AssumeRole"
                    }
                ]
            }'

aws iam create-policy --policy-name "sagemaker-s3-access" --policy-document '{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:*"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "iam:CreatePolicy",
        "iam:CreateRole",
        "iam:AttachRolePolicy",
        "iam:DeleteRole",
        "iam:DeletePolicy",
        "iam:DetachRolePolicy"
      ],
      "Resource": "*"
    }
  ]
}'

aws iam attach-role-policy --role-name sagemaker-role --policy-arn arn:aws:iam::166283903643:policy/sagemaker-s3-access
aws iam attach-role-policy --role-name sagemaker-role --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess

Improve error handling by validating inputs and outputs of functions

Background

At some point we got the following error message

[Code: 0, SQL State: 43000] "attempt to index a nil value (field 'integer index')" caught in script "IDA"."SME_DEPLOY_SAGEMAKER_AUTOPILOT_ENDPOINT" at line 402 (Session: 1775016774761775104)
This error is caused by something which is nil but shouldn't be. We should check this before accessing it and throw proper error messages.
It is likely that the following line caused the error

sagemaker-extension/exasol_sagemaker_extension/resources/lua/outputs/create_statement_autopilot_endpoint_deployment_lua_script.sql, line 403 in 3b93869:

local target_column = metadata_row[1][1]
- and that metadata_row was nil, because the db_metadata_reader didn't find the metadata row for this particular job.
- To get a better error message, we should have checked in the db_metadata_reader whether the row was nil or not

Name the return column as predictions

The prediction return column is named "predictions",
but the Scania trucks classification returns predictions in the target column rather than in a "predictions" column

Implement UDF scripts for endpoint operations

Implement Create/Delete endpoint udf scripts

CreateEndpoint Lua Script
- Create the Endpoint for model
- Create or Update Model Connection (with the endpoint information)
- Create UDF Function (if not exists)
DeleteEndpoint Lua Script
- Delete the Endpoint
- Update the connection object with "Not running"

Update udf-mock import and poetry

Prepare release 0.1.0

Check consistency of version numbers (poetry, changelog)
Complete release letter
Configure release droid for the project

Handle prediction of uncompleted Autopilot job

If a job is interrupted before completion because one of the "max_runtime" stopping criteria took effect, Autopilot might have several candidates but no best candidate.
We can throw an exception saying that there is no best candidate.
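A minimal sketch of such a check, assuming the response dictionary of boto3's describe_auto_ml_job, in which the BestCandidate entry is absent when no best candidate has been selected. The exception class and the function name are hypothetical.

```python
from typing import Any, Dict


class NoBestCandidateError(Exception):
    """Raised when an Autopilot job has not produced a best candidate."""


def get_best_candidate_name(description: Dict[str, Any]) -> str:
    """Return the best candidate's name or fail with a clear message."""
    best_candidate = description.get("BestCandidate")
    if not best_candidate:
        job_name = description.get("AutoMLJobName", "<unknown>")
        status = description.get("AutoMLJobStatus", "<unknown>")
        raise NoBestCandidateError(
            f"Job {job_name} (status: {status}) has no best candidate")
    return best_candidate["CandidateName"]
```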

Add pytest-xdist to speed up test

xdist distributes tests across multiple CPUs
This allows us to speed up the test runs

Add static code analysis for Lua

Add following tools for CI

luacheck for linting and static code analysis
luacov for coverage analysis

Check poethepoet to add these in the pyproject.toml

Add CREATE SCRIPT statements for deployment of the training UDF

Background

We need CREATE SCRIPT statements for all UDF and LUA Scripts
For Lua we already have, see https://github.com/exasol/sagemaker-extension/blob/main/scripts/create_statement_template.sql
For the Python UDF we still need them
- They will look similar to the udf_wrapper in the UDFMock tests, except without the mock functions
- Example: Assuming the udf_wrapper in sagemaker-extension/tests/test_autopilot_training_udf_mock.py (line 18 in cf0b415)

def udf_wrapper():
   from exasol_udf_mock_python.udf_context import UDFContext
   from exasol_sagemaker_extension.autopilot_training_udf import AutopilotTrainingUDF

   def mocked_training_method(**kwargs):
       return "test_job_name"

   udf = AutopilotTrainingUDF(exa, training_method=mocked_training_method)

   def run(ctx: UDFContext):
       udf.run(ctx)

the CREATE SCRIPT statement would look like

CREATE PYTHON3 SET SCRIPT AutopilotTrainingUDF(model_name VARCHAR(23), ....)
EMITS (model_name VARCHAR(32)) AS
    from exasol_sagemaker_extension.autopilot_training_udf import AutopilotTrainingUDF

    udf = AutopilotTrainingUDF(exa)

    def run(ctx):
        udf.run(ctx)
/

Acceptance Criteria

Create CREATE SCRIPT statement for AutopilotTrainingUDF and AutopilotTrainingStatusUDF

Use Click for the deployment cli script of SME

Prepare the CLI that deploys the scripts required for the SME installation with Click, in order to

provide a better user experience
simplify the implementation

Update to Python 3.8

Background

Python 3.6 is EOL, and for Python 3.7 we don't have an officially supported language container
We are going to switch all other projects to Python 3.8 as well

Acceptance Criteria

The supported python version was changed to 3.8
The packages are updated accordingly
The language container is also moved to python 3.8 minimal
