
exasol / data-science-utils-python


This project provides utilities for developing data science integrations for Exasol.

License: MIT License

Python 96.14% Shell 3.71% Dockerfile 0.15%
data-science exasol-integration

data-science-utils-python's People

Contributors

marlenekress79789, nicoretti, redcatbear, tkilias, umitbuyuksahin


data-science-utils-python's Issues

Add docstrings

Background

  • Parts of the code don't contain docstrings

Acceptance Criteria

  • All relevant classes and functions contain docstrings

Add UDFs which combine the model iterators with the BucketFS

Background

  • The model train iterators don't work alone; we need UDFs which combine them with uploading/downloading models from the BucketFS

Acceptance Criteria

  • Write a UDF which combines the training with the upload of the model to the BucketFS
  • Write a UDF which combines the prediction with the download of the model from the BucketFS
  • Write a UDF which merges multiple models read from the BucketFS and writes the merged model back into the BucketFS

Add first version of iterators to train and predict scikit-learn models

Background

  • Writing scalable code for model training and prediction for UDFs is hard
  • Scikit-learn is one of the most used ML Frameworks

Acceptance Criteria

  • Write an iterator-based abstraction to train and predict certain scikit-learn models
  • Write a mechanism to combine multiple models into an ensemble (see the sketch below)
    • for general models
    • for random forests

Preliminary work https://github.com/exasol/data-science-utils-python/tree/model_utils
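
As a rough illustration of the intended abstraction (the names below are hypothetical, not the project's actual API), a partial_fit-capable scikit-learn estimator can be trained from an iterator of batches, and several such base models can be combined into a simple averaging ensemble:

```python
# Hypothetical sketch: batch-wise training via partial_fit plus a simple
# prediction-averaging ensemble over several fitted base models.
from typing import Iterator, List, Tuple

import numpy as np
from sklearn.linear_model import SGDRegressor


def train_on_batches(batches: Iterator[Tuple[np.ndarray, np.ndarray]]) -> SGDRegressor:
    """Feed an iterator of (X, y) batches into a partial_fit-capable estimator."""
    model = SGDRegressor(random_state=0)
    for X, y in batches:
        model.partial_fit(X, y)
    return model


def ensemble_predict(models: List[SGDRegressor], X: np.ndarray) -> np.ndarray:
    """Combine fitted base models by averaging their predictions."""
    return np.mean([model.predict(X) for model in models], axis=0)
```

For general regressors this averaging corresponds to what scikit-learn's VotingRegressor does; for random forests, the trees of the base models could instead be concatenated into one larger forest.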

Add experiment name parameter to the Train UDFs

Background

  • We need to store the results of the training runs somewhere; at the moment we store them in two tables
  • The first table stores the paths of the base estimators
  • The second table stores the path to the final estimator
  • However, it is likely that we need to do multiple training runs and want to store all their results

Acceptance Criteria

  • Add experiment name parameter to TrainUDF
  • Add a column in the tables for this parameter
  • Use the parameter to extend the bucketfs path

Improve SQLExecutor

Background

  • The SQLExecutor is needed in the https://github.com/exasol/advanced-analytics-framework
  • However, when we used it there, we noticed that the columns function returns a Dict instead of a column object, and that the mock implementations are only available in this project's tests although they are also needed in other projects

Acceptance Criteria

  • Let the columns method of the SQLExecutor return a column object
  • Move the MockSQLExecutor to a testing package which will be included in the package
  • Implement the methods that are not yet implemented (see the sketch below)
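
A hedged sketch of the intended shape (class and method names are illustrative, not the project's actual API): the executor returns a result whose columns() yields column objects rather than dicts, and the mock implementation lives in a module that other projects can reuse in their tests.

```python
# Illustrative sketch only: an SQLExecutor abstraction whose results expose
# Column objects, plus a mock implementation usable from other projects' tests.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Column:
    name: str
    sql_type: str


class ResultSet(ABC):
    @abstractmethod
    def columns(self) -> List[Column]: ...

    @abstractmethod
    def fetchall(self) -> List[Tuple]: ...


class SQLExecutor(ABC):
    @abstractmethod
    def execute(self, sql: str) -> ResultSet: ...


class MockResultSet(ResultSet):
    def __init__(self, columns: List[Column], rows: List[Tuple]):
        self._columns, self._rows = columns, rows

    def columns(self) -> List[Column]:
        return self._columns

    def fetchall(self) -> List[Tuple]:
        return self._rows


class MockSQLExecutor(SQLExecutor):
    """In-memory stand-in that records the executed statements."""

    def __init__(self, result: MockResultSet):
        self._result = result
        self.executed: List[str] = []

    def execute(self, sql: str) -> ResultSet:
        self.executed.append(sql)
        return self._result
```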

Minor Refactoring of TrainRunner and TrainUDF

Background

  • The input and target column lists are not needed anymore; we only need a single column list as parameter
  • The names are not appropriate, because both classes only work for partial-fit regressions

Acceptance Criteria

  • Rename to PartialFitRegressor*
  • Replace input and target column list with a single column list

Add first version of a sql based preprocessing library

Background

  • Many preprocessing steps for Machine Learning can be expressed in SQL
  • In Exasol, SQL is usually much faster than UDFs or external processing

Acceptance Criteria

  • Add a SQL generator for a min-max scaler
  • Add a SQL-based dictionary generator for categorical columns
  • Use an interface similar to scikit-learn's, with fit and apply (see the sketch below)
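
A minimal sketch of the fit/apply split for a SQL-based min-max scaler (the function names and table layout are assumptions, not the library's actual API): fit generates the query that computes the statistics, apply generates the query that rescales the column using them.

```python
# Illustrative only: generate the SQL for fitting and applying a min-max scaler.
def fit_query(table: str, column: str, params_table: str) -> str:
    """Query that materializes the min/max statistics for the column."""
    return (
        f'CREATE OR REPLACE TABLE {params_table} AS '
        f'SELECT MIN("{column}") AS col_min, MAX("{column}") AS col_max '
        f'FROM {table}'
    )


def apply_query(table: str, column: str, params_table: str) -> str:
    """Query that rescales the column; yields NULL if min equals max."""
    return (
        f'SELECT ("{column}" - p.col_min) / NULLIF(p.col_max - p.col_min, 0) '
        f'AS "{column}_scaled" FROM {table} CROSS JOIN {params_table} p'
    )
```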

Let the Table and ColumnPreprocessor execute SQL queries

Background

  • For some preprocessing we actually need the results of some SQL queries to generate the next queries

Acceptance Criteria

  • Table and ColumnPreprocessor accept a SQLExecutor, which is an abstraction around a mechanism to execute SQL
  • Table and ColumnPreprocessor use it to run the queries themselves
  • The interface changes to fit and transform instead of fit_queries and transform_queries (see the sketch below)
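
Reusing the query generators sketched above, the interface change might look roughly like this (hypothetical class, not the project's actual one): the preprocessor receives the executor and runs its queries itself, exposing fit and transform.

```python
# Illustrative only: a column preprocessor that executes its own SQL via an
# injected SQLExecutor, reusing fit_query/apply_query from the sketch above.
class MinMaxColumnPreprocessor:
    def __init__(self, sql_executor, table: str, column: str, params_table: str):
        self._executor = sql_executor
        self._table = table
        self._column = column
        self._params_table = params_table

    def fit(self) -> None:
        # The result of this query can drive the generation of the next queries.
        self._executor.execute(fit_query(self._table, self._column, self._params_table))

    def transform(self) -> None:
        self._executor.execute(apply_query(self._table, self._column, self._params_table))
```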

Fix hash and eq function in schema

Background

Our current Schema classes have some issues with their implementation of __hash__. They are, in fact, only immutable structs, which makes them a perfect fit for dataclasses (see the sketch below).

The DBObjectName classes should be fine and should be normal classes.
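
A minimal sketch of the intended fix, assuming the schema classes are plain immutable value objects (the class names below are illustrative): frozen dataclasses generate value-based __eq__ and __hash__ automatically.

```python
# Illustrative only: frozen dataclasses give value-based equality and hashing.
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnName:
    name: str


@dataclass(frozen=True)
class Column:
    name: ColumnName
    sql_type: str


# Equal values now hash equally, which identity-based hashing would not provide.
assert hash(Column(ColumnName("ID"), "DECIMAL(18,0)")) == \
    hash(Column(ColumnName("ID"), "DECIMAL(18,0)"))
```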

Add retry and wait to CombineToVotingRegressorUDF

Background

  • We upload and download the base models from the BucketFS
  • The BucketFS synchronization is asynchronous, so it can happen that we try to download a base model in CombineToVotingRegressorUDF before it has been synchronized to the particular node where the UDF runs

Acceptance Criteria

  • Add a retry-and-wait mechanism and parameter to CombineToVotingRegressorUDF (see the sketch below)
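
A minimal retry-and-wait sketch; download_model stands in for whatever BucketFS download call the UDF uses, the parameter names are only illustrative, and the exception type to retry on depends on the actual download mechanism.

```python
# Illustrative only: retry a BucketFS download until the file has been
# synchronized to the node this UDF instance runs on.
import time


def download_with_retry(download_model, path: str,
                        retries: int = 10, wait_seconds: float = 3.0):
    last_error = None
    for _ in range(retries):
        try:
            return download_model(path)
        except FileNotFoundError as error:   # model not yet synced to this node
            last_error = error
            time.sleep(wait_seconds)
    raise last_error
```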

Refactor schema classes and add missing classes

Background

  • We use the schema classes in https://github.com/exasol/advanced-analytics-framework
  • Currently, these classes are concrete implementations; however, in the AAF we need new implementations of them which work as proxies for the temporary db objects managed by the QueryHandlerContext
  • Furthermore, the AAF also needs classes for UDFs and ConnectionObjects

Acceptance Criteria

  • Extract Interfaces from the dbobject name classes
  • Add a static create method to the interfaces which creates the default implementation
  • Remove the old Builder classes, because they are not very useful in Python with its keyword arguments
  • Add UDFs and ConnectionObjects

Add documentation

Background

Acceptance Criteria

  • Add documentation generation and publishing
  • Add user guide
    • Usage
    • Overview of the training process
  • Add Readme
  • Add changelog
  • Add Developer Guide

Add retry and wait to PartialFitUDF

Background

  • In PartialFitTrainRunner, we upload a model prototype to the BucketFS and directly afterwards start the PartialFitUDF
  • The BucketFS syncs asynchronously and the PartialFitUDF runs on each node; for that reason, we need to retry and wait in the PartialFitUDF

Acceptance Criteria

  • PartialFitUDF waits and retries downloading the prototype model
  • The user can configure via a parameter how long to wait and retry

Add bucketfs path to the Model UDFs

Background

  • Currently, we use the base path specified by the model_connection to store the models
  • However, it is likely that we can't create BucketFS connections at will for security reasons; maybe someone creates one for us with the credentials and grants us its usage

Acceptance Criteria

  • Add a BucketFS path parameter to the model UDFs and use it to store the models

Update Pillow version

There is a Dependabot alert for upgrading the Pillow library to version 9.1.1:

When reading a TGA file with RLE packets that cross scan lines, Pillow reads the information past the end of the first line without deducting that from the length of the remaining file data. This vulnerability was introduced in Pillow 9.1.0, and can cause a heap buffer overflow.

Opening an image with a zero or negative height has been found to bypass a decompression bomb check. This will now raise a SyntaxError instead, in turn raising a PIL.UnidentifiedImageError.

Move BucketFS Location to bucketfs-utils-python

TrainUDF should work with multiple groups

Background:

  • TrainUDF, as a UDF, could get multiple groups as input
  • Each group could be seen as a configuration
  • If we can train models for multiple configurations, we could implement hyperparameter optimization
  • If we have multiple groups, it can happen that we either have multiple configurations on the same UDF instance or that we have multiple instances

Acceptance Criteria

  • TrainUDF and TrainRunner can handle multiple groups (see the sketch below)
  • Each group trains a model with its own id
  • All groups share a job id, such that we can identify which models were trained together
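
A hedged sketch of the idea (function and column names are illustrative, not the project's API): every group trains its own model with its own id, while all groups of one run share a job id.

```python
# Illustrative only: train one model per group; all models of a run share a job id.
import uuid

import pandas as pd
from sklearn.linear_model import SGDRegressor


def train_per_group(df: pd.DataFrame, group_column: str,
                    feature_columns: list, target_column: str) -> list:
    job_id = uuid.uuid4()                        # shared by all groups of this run
    results = []
    for group_value, group_df in df.groupby(group_column):
        model = SGDRegressor(random_state=0)
        model.partial_fit(group_df[feature_columns].to_numpy(),
                          group_df[target_column].to_numpy())
        results.append({"job_id": job_id, "model_id": uuid.uuid4(),
                        "group": group_value, "model": model})
    return results
```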

Fix hash function of schema classes

Background

  • We can't use id() for computing hashes, because we serialize and deserialize dicts and sets that contain these objects
  • The default hash function uses id() (see the demonstration below)
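
A short demonstration of the problem: with the default id()-based __hash__ and __eq__, an object that went through serialization no longer matches the original in sets or dict keys (the class below is a stand-in, not one of the project's schema classes).

```python
# Demonstration: default hash/eq are identity-based, so lookups break after
# a serialization round trip.
import pickle


class SchemaName:            # stand-in for a schema class without __hash__/__eq__
    def __init__(self, name: str):
        self.name = name


original = SchemaName("TEST")
known = {original}
restored = pickle.loads(pickle.dumps(original))

assert restored not in known   # fails to match, although the value is identical
```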

Save Checkpoints in PartialFitRegressorUDF

Background

  • Training can take long
  • It would be nice to write regular checkpoints to get a feeling for the progress of the training
  • Furthermore, checkpoints could be used for continuing training if it fails early

Acceptance Criteria

  • Regularly save checkpoints to the BucketFS during training in PartialFitRegressorUDF (see the sketch below)
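
A hedged sketch of the checkpointing loop; the checkpoint directory stands in for the BucketFS location the UDF would actually use, and all names are illustrative.

```python
# Illustrative only: write a checkpoint every N batches so progress is visible
# and a failed run can be resumed.
import pickle
from pathlib import Path

from sklearn.linear_model import SGDRegressor


def train_with_checkpoints(batches, checkpoint_dir: Path,
                           every_n_batches: int = 100) -> SGDRegressor:
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    model = SGDRegressor(random_state=0)
    for i, (X, y) in enumerate(batches, start=1):
        model.partial_fit(X, y)
        if i % every_n_batches == 0:
            (checkpoint_dir / f"checkpoint_{i}.pkl").write_bytes(pickle.dumps(model))
    return model
```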

Refactor ColumnPreprocessorCreator

Background

  • Currently, the ColumnPreprocessorCreator is more or less hard-coded
  • It combines the SQLPreprocessor with the creation of the ColumnTransformers in a hard-coded way
  • It uses a fixed strategy to decide which preprocessing is applied to which column
  • It only hard-codes the preprocessors MaxMinScaler and OneHotEncoding
  • It currently assumes that you use the source table as input for training

Acceptance Criteria

  • Build an abstraction which combines the creation of SQLPreprocessors and ColumnTransformers
  • Make the ColumnPreprocessorCreator configurable via a mapping between column selectors and this abstraction
  • Make the ColumnPreprocessorCreator return the table to use for further processing, to allow training on globally modified data (which is necessary for target encoding)
    -> This requires that the ColumnTransformers can have two modes, one for fit and one for transform
    -> This requires that the PartialFitIterator switches the mode to fit and the other iterators to transform

Add BucketFS Abstraction for UDF

Background

  • We usually can't inject objects into UDF functions
  • For that reason, we need an abstraction at least for our other functions
  • We also need a factory which can generate either a MockBucketFS backed by the file system or a real BucketFS-backed object

Acceptance Criteria

  • Add abstract class for BucketFS
  • Add implementation for BucketFS
  • Add mock implementation backed by file system
  • Add a factory which can generate a BucketFS from a connection (see the sketch below)
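
A hedged sketch of the proposed abstraction (class and method names are illustrative, not the project's actual API): an abstract BucketFS location, a mock backed by the local file system, and a factory that would build the real implementation from a connection.

```python
# Illustrative only: BucketFS abstraction with a file-system-backed mock.
from abc import ABC, abstractmethod
from pathlib import Path


class AbstractBucketFSLocation(ABC):
    @abstractmethod
    def upload_bytes(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def download_bytes(self, path: str) -> bytes: ...


class LocalFSMockBucketFSLocation(AbstractBucketFSLocation):
    """Mock implementation backed by a directory on the local file system."""

    def __init__(self, base_path: Path):
        self._base_path = base_path

    def upload_bytes(self, path: str, data: bytes) -> None:
        target = self._base_path / path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)

    def download_bytes(self, path: str) -> bytes:
        return (self._base_path / path).read_bytes()


def bucketfs_location_from_connection(connection=None) -> AbstractBucketFSLocation:
    """Factory: a real implementation would be built from the connection's
    address and credentials; without one, fall back to the mock."""
    if connection is None:
        return LocalFSMockBucketFSLocation(Path("/tmp/mock_bucketfs"))
    raise NotImplementedError("real BucketFS-backed implementation goes here")
```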

Add experiment name to the SQLTablePreprocessor

Background

  • The SQLTablePreprocessor creates new tables from the source table whose names include the source table name
  • Often you have many experiment runs and don't want to overwrite previous runs

Acceptance Criteria

  • SQLTablePreprocessor accepts an experiment name and includes it in all generated table names
  • TrainRunner, TrainUDF and TablePreprocessor accept the parameter as well and forward it to the SQLPreprocessor

Use a factory for PartialFitRegressorUDF in TrainRunner and TrainUDF

Background

  • Currently, the PartialFitRegressorUDF is hard-coded in TrainRunner and TrainUDF
  • However, there might be different implementations of it; for example, one with configurable epochs (which already exists), early stopping, ...

Acceptance Criteria

  • Use a factory for the creation of the PartialFitRegressorUDF in TrainRunner and TrainUDF (see the sketch below)
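
A hedged sketch of the factory idea (names are illustrative): the runner receives a callable that builds the UDF, so variants with configurable epochs or early stopping can be plugged in without changing TrainRunner or TrainUDF.

```python
# Illustrative only: inject a factory instead of hard-coding the UDF class.
from typing import Callable, Protocol


class PartialFitUDFLike(Protocol):
    def run(self, ctx) -> None: ...


class TrainRunnerSketch:
    def __init__(self, udf_factory: Callable[[], PartialFitUDFLike]):
        self._udf_factory = udf_factory

    def run(self, ctx) -> None:
        udf = self._udf_factory()   # the concrete UDF variant is chosen by the factory
        udf.run(ctx)
```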

Add ReservoirShuffle

Background

  • For SGD-based algorithms, you need to reshuffle the data each epoch
  • However, shuffling usually needs the whole dataset, which is not feasible in the UDF
  • A non-perfect shuffle (usually enough for SGD) can be done with limited memory
  • It works similarly to reservoir sampling, except that when the random oracle decides to replace a value in memory with the current new value, you replace the old value but also yield it to the iterator user

Acceptance Criteria

  • Implement a ReservoirShuffleIterator with limited memory which can be used for batches or for rows in the form of pandas DataFrames (see the sketch below)
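
A hedged sketch of the described algorithm (names are illustrative): keep a fixed-size buffer; when the random oracle decides to put the incoming item into the buffer, yield the evicted occupant instead; otherwise yield the incoming item directly. At the end, flush the buffer in random order. The items can be single rows or whole pandas DataFrame batches.

```python
# Illustrative only: memory-bounded, non-perfect shuffle of a stream of items.
import random
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def reservoir_shuffle(items: Iterable[T], buffer_size: int,
                      rng: Optional[random.Random] = None) -> Iterator[T]:
    rng = rng or random.Random()
    buffer: list = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)                  # fill the reservoir first
            continue
        index = rng.randrange(buffer_size + 1)   # the "random oracle"
        if index < buffer_size:
            buffer[index], item = item, buffer[index]   # evict old, keep new
        yield item                               # evicted value or the new item
    rng.shuffle(buffer)
    yield from buffer                            # flush remaining buffered items
```

A buffer of a few thousand rows or batches keeps memory bounded while mixing the data enough for SGD-style training, in line with the non-perfect-shuffle idea described above.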
