
exasol / data-science-utils-python


This project provides utilities for developing data science integrations for Exasol.

License: MIT License

Python 96.14% Shell 3.71% Dockerfile 0.15%
data-science exasol-integration

data-science-utils-python's People

Contributors

marlenekress79789, nicoretti, redcatbear, tkilias, umitbuyuksahin


data-science-utils-python's Issues

Add docstrings

Background

  • Parts of the code don't contain docstrings

Acceptance Criteria

  • All relevant classes and functions contain docstrings

Add UDFs which combine the model iterators with the BucketFS

Background

  • The model train iterators don't work alone; we need UDFs which combine them with uploading/downloading models from the BucketFS

Acceptance Criteria

  • Write a UDF which combines the training with the upload of the model to the BucketFS
  • Write a UDF which combines the prediction with the download of the model from the BucketFS
  • Write a UDF which merges multiple models read from the BucketFS and writes the merged model back into the BucketFS

Add first version of iterators to train and predict scikit-learn models

Background

  • Writing scalable code for model training and prediction for UDFs is hard
  • Scikit-learn is one of the most used ML Frameworks

Acceptance Criteria

  • Write an iterator-based abstraction to train and predict certain scikit-learn models
  • Write a mechanism to combine multiple models into an ensemble (see the sketch below)
    • for general models
    • for random forests

Preliminary work https://github.com/exasol/data-science-utils-python/tree/model_utils
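
As a rough illustration of the intended abstraction (the names below are hypothetical, not the project's actual API), a partial_fit-capable scikit-learn estimator can be trained from an iterator of batches, and several such base models can be combined into a simple averaging ensemble:

```python
# Hypothetical sketch: batch-wise training via partial_fit plus a simple
# prediction-averaging ensemble over several fitted base models.
from typing import Iterator, List, Tuple

import numpy as np
from sklearn.linear_model import SGDRegressor


def train_on_batches(batches: Iterator[Tuple[np.ndarray, np.ndarray]]) -> SGDRegressor:
    """Feed an iterator of (X, y) batches into a partial_fit-capable estimator."""
    model = SGDRegressor(random_state=0)
    for X, y in batches:
        model.partial_fit(X, y)
    return model


def ensemble_predict(models: List[SGDRegressor], X: np.ndarray) -> np.ndarray:
    """Combine fitted base models by averaging their predictions."""
    return np.mean([model.predict(X) for model in models], axis=0)
```

For general regressors this averaging corresponds to what scikit-learn's VotingRegressor does; for random forests, the trees of the base models could instead be concatenated into one larger forest.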

Add experiment name parameter to the Train UDFs

Background

  • We need to store the results of the training runs somewhere; at the moment we store them in two tables
  • The first table stores the paths of the base estimators
  • The second table stores the path to the final estimator
  • However, it is likely that we need to do multiple training runs and want to store all their results

Acceptance Criteria

  • Add experiment name parameter to TrainUDF
  • Add a column in the tables for this parameter
  • Use the parameter to extend the bucketfs path

Improve SQLExecutor

Background

  • The SQLExecutor is needed in the https://github.com/exasol/advanced-analytics-framework
  • However, when we used it there, we noticed that the columns function returns a Dict instead of a column object, and that the mock implementations are only available in this project's tests although they are also needed in other projects

Acceptance Criteria

  • Let the columns method of the SQLExecutor return a column object
  • Move the MockSQLExecutor to a testing package which will be included in the package
  • Implement the methods that are not yet implemented (see the sketch below)
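
A hedged sketch of the intended shape (class and method names are illustrative, not the project's actual API): the executor returns a result whose columns() yields column objects rather than dicts, and the mock implementation lives in a module that other projects can reuse in their tests.

```python
# Illustrative sketch only: an SQLExecutor abstraction whose results expose
# Column objects, plus a mock implementation usable from other projects' tests.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Column:
    name: str
    sql_type: str


class ResultSet(ABC):
    @abstractmethod
    def columns(self) -> List[Column]: ...

    @abstractmethod
    def fetchall(self) -> List[Tuple]: ...


class SQLExecutor(ABC):
    @abstractmethod
    def execute(self, sql: str) -> ResultSet: ...


class MockResultSet(ResultSet):
    def __init__(self, columns: List[Column], rows: List[Tuple]):
        self._columns, self._rows = columns, rows

    def columns(self) -> List[Column]:
        return self._columns

    def fetchall(self) -> List[Tuple]:
        return self._rows


class MockSQLExecutor(SQLExecutor):
    """In-memory stand-in that records the executed statements."""

    def __init__(self, result: MockResultSet):
        self._result = result
        self.executed: List[str] = []

    def execute(self, sql: str) -> ResultSet:
        self.executed.append(sql)
        return self._result
```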

Minor Refactoring of TrainRunner and TrainUDF

Background

  • The input and target column lists are not needed anymore; we only need a single column list as parameter
  • The names are not appropriate, because both classes only work for partial-fit regressions

Acceptance Criteria

  • Rename to PartialFitRegressor*
  • Replace input and target column list with a single column list

Add first version of a sql based preprocessing library

Background

  • Many preprocessing steps for Machine Learning can be expressed in SQL
  • In Exasol, SQL is usually much faster than UDFs or external processing

Acceptance Criteria

  • Add a SQL generator for a min-max scaler
  • Add a SQL-based dictionary generator for categorical columns
  • Use an interface similar to scikit-learn's, with fit and apply (see the sketch below)
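
A minimal sketch of the fit/apply split for a SQL-based min-max scaler (the function names and table layout are assumptions, not the library's actual API): fit generates the query that computes the statistics, apply generates the query that rescales the column using them.

```python
# Illustrative only: generate the SQL for fitting and applying a min-max scaler.
def fit_query(table: str, column: str, params_table: str) -> str:
    """Query that materializes the min/max statistics for the column."""
    return (
        f'CREATE OR REPLACE TABLE {params_table} AS '
        f'SELECT MIN("{column}") AS col_min, MAX("{column}") AS col_max '
        f'FROM {table}'
    )


def apply_query(table: str, column: str, params_table: str) -> str:
    """Query that rescales the column; yields NULL if min equals max."""
    return (
        f'SELECT ("{column}" - p.col_min) / NULLIF(p.col_max - p.col_min, 0) '
        f'AS "{column}_scaled" FROM {table} CROSS JOIN {params_table} p'
    )
```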

Let the Table and ColumnPreprocessor execute SQL queries

Background

  • For some preprocessing we actually need the results of some SQL queries to generate the next queries

Acceptance Criteria

  • Table and ColumnPreprocessor accept a SQLExecutor, which is an abstraction around a mechanism to execute SQL
  • Table and ColumnPreprocessor use it to run the queries themselves
  • The interface changes to fit and transform instead of fit_queries and transform_queries (see the sketch below)
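
Reusing the query generators sketched above, the interface change might look roughly like this (hypothetical class, not the project's actual one): the preprocessor receives the executor and runs its queries itself, exposing fit and transform.

```python
# Illustrative only: a column preprocessor that executes its own SQL via an
# injected SQLExecutor, reusing fit_query/apply_query from the sketch above.
class MinMaxColumnPreprocessor:
    def __init__(self, sql_executor, table: str, column: str, params_table: str):
        self._executor = sql_executor
        self._table = table
        self._column = column
        self._params_table = params_table

    def fit(self) -> None:
        # The result of this query can drive the generation of the next queries.
        self._executor.execute(fit_query(self._table, self._column, self._params_table))

    def transform(self) -> None:
        self._executor.execute(apply_query(self._table, self._column, self._params_table))
```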

Fix hash and eq function in schema

Background

Our current Schema classes have some issues with their implementation of __hash__. They are, in fact, only immutable structs, which makes them a perfect fit for dataclasses (see the sketch below).

The DBObjectName classes should be fine and should be normal classes.
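
A minimal sketch of the intended fix, assuming the schema classes are plain immutable value objects (the class names below are illustrative): frozen dataclasses generate value-based __eq__ and __hash__ automatically.

```python
# Illustrative only: frozen dataclasses give value-based equality and hashing.
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnName:
    name: str


@dataclass(frozen=True)
class Column:
    name: ColumnName
    sql_type: str


# Equal values now hash equally, which identity-based hashing would not provide.
assert hash(Column(ColumnName("ID"), "DECIMAL(18,0)")) == \
    hash(Column(ColumnName("ID"), "DECIMAL(18,0)"))
```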

Add retry and wait to CombineToVotingRegressorUDF

Background

  • We upload and download the base models from the BucketFS
  • The BucketFS synchronization is asynchronous, so it can happen that we try to download a base model in CombineToVotingRegressorUDF before it has been synchronized to the particular node where the UDF runs

Acceptance Criteria

  • Add a retry-and-wait mechanism and parameter to CombineToVotingRegressorUDF (see the sketch below)
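
A minimal retry-and-wait sketch; download_model stands in for whatever BucketFS download call the UDF uses, the parameter names are only illustrative, and the exception type to retry on depends on the actual download mechanism.

```python
# Illustrative only: retry a BucketFS download until the file has been
# synchronized to the node this UDF instance runs on.
import time


def download_with_retry(download_model, path: str,
                        retries: int = 10, wait_seconds: float = 3.0):
    last_error = None
    for _ in range(retries):
        try:
            return download_model(path)
        except FileNotFoundError as error:   # model not yet synced to this node
            last_error = error
            time.sleep(wait_seconds)
    raise last_error
```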

Refactor schema classes and add missing classes

Background

  • We use the schema classes in https://github.com/exasol/advanced-analytics-framework
  • Currently, these classes are concrete implementations; however, in the AAF we need new implementations of them which work as proxies for the temporary db objects managed by the QueryHandlerContext
  • Furthermore, the AAF also needs classes for UDFs and ConnectionObjects

Acceptance Criteria

  • Extract Interfaces from the dbobject name classes
  • Add a static create method to the interfaces which creates the default implementation
  • Remove the old Builder classes, because they are not very useful in Python with its keyword arguments
  • Add UDFs and ConnectionObjects

Add documentation

Background

Acceptance Criteria

  • Add documentation generation and publishing
  • Add user guide
    • Usage
    • Overview of the training process
  • Add Readme
  • Add changelog
  • Add Developer Guide

Add retry and wait to PartialFitUDF

Background

  • In PartialFitTrainRunner, we upload a model prototype to the BucketFS and directly afterwards start the PartialFitUDF
  • The BucketFS syncs asynchronously and the PartialFitUDF runs on each node; for that reason, we need to retry and wait in the PartialFitUDF

Acceptance Criteria

  • PartialFitUDF waits and retries downloading the prototype model
  • The user can configure via a parameter how long to wait and retry

Add bucketfs path to the Model UDFs

Background

  • Currently, we use the base path specified by the model_connection to store the models
  • However, it is likely that we can't create BucketFS connections at will for security reasons; maybe someone creates one for us with the credentials and grants us its usage

Acceptance Criteria

  • Add a BucketFS path parameter to the model UDFs and use it to store the models

Update Pillow version

There is a Dependabot alert for upgrading the Pillow library to version 9.1.1:

When reading a TGA file with RLE packets that cross scan lines, Pillow reads the information past the end of the first line without deducting that from the length of the remaining file data. This vulnerability was introduced in Pillow 9.1.0, and can cause a heap buffer overflow.

Opening an image with a zero or negative height has been found to bypass a decompression bomb check. This will now raise a SyntaxError instead, in turn raising a PIL.UnidentifiedImageError.

Move BucketFS Location to bucketfs-utils-python

TrainUDF should work with multiple groups

Background:

  • TrainUDF, as a UDF, could get multiple groups as input
  • Each group could be seen as a configuration
  • If we can train models for multiple configurations, we could implement hyperparameter optimization
  • If we have multiple groups, it can happen that we either have multiple configurations on the same UDF instance or that we have multiple instances

Acceptance Criteria

  • TrainUDF and TrainRunner can handle multiple groups (see the sketch below)
  • Each group trains a model with its own id
  • All groups share a job id, such that we can identify which models were trained together
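
A hedged sketch of the idea (function and column names are illustrative, not the project's API): every group trains its own model with its own id, while all groups of one run share a job id.

```python
# Illustrative only: train one model per group; all models of a run share a job id.
import uuid

import pandas as pd
from sklearn.linear_model import SGDRegressor


def train_per_group(df: pd.DataFrame, group_column: str,
                    feature_columns: list, target_column: str) -> list:
    job_id = uuid.uuid4()                        # shared by all groups of this run
    results = []
    for group_value, group_df in df.groupby(group_column):
        model = SGDRegressor(random_state=0)
        model.partial_fit(group_df[feature_columns].to_numpy(),
                          group_df[target_column].to_numpy())
        results.append({"job_id": job_id, "model_id": uuid.uuid4(),
                        "group": group_value, "model": model})
    return results
```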

Fix hash function of schema classes

Background

  • We can't use id() for computing hashes, because we serialize and deserialize dicts and sets that contain these objects
  • The default hash function uses id() (see the demonstration below)
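
A short demonstration of the problem: with the default id()-based __hash__ and __eq__, an object that went through serialization no longer matches the original in sets or dict keys (the class below is a stand-in, not one of the project's schema classes).

```python
# Demonstration: default hash/eq are identity-based, so lookups break after
# a serialization round trip.
import pickle


class SchemaName:            # stand-in for a schema class without __hash__/__eq__
    def __init__(self, name: str):
        self.name = name


original = SchemaName("TEST")
known = {original}
restored = pickle.loads(pickle.dumps(original))

assert restored not in known   # fails to match, although the value is identical
```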

Save Checkpoints in PartialFitRegressorUDF

Background

  • Training can take long
  • It would be nice to write regular checkpoints to get a feeling for the progress of the training
  • Furthermore, checkpoints could be used for continuing training if it fails early

Acceptance Criteria

  • Regularly save checkpoints to the BucketFS during training in PartialFitRegressorUDF (see the sketch below)
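
A hedged sketch of the checkpointing loop; the checkpoint directory stands in for the BucketFS location the UDF would actually use, and all names are illustrative.

```python
# Illustrative only: write a checkpoint every N batches so progress is visible
# and a failed run can be resumed.
import pickle
from pathlib import Path

from sklearn.linear_model import SGDRegressor


def train_with_checkpoints(batches, checkpoint_dir: Path,
                           every_n_batches: int = 100) -> SGDRegressor:
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    model = SGDRegressor(random_state=0)
    for i, (X, y) in enumerate(batches, start=1):
        model.partial_fit(X, y)
        if i % every_n_batches == 0:
            (checkpoint_dir / f"checkpoint_{i}.pkl").write_bytes(pickle.dumps(model))
    return model
```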

Refactor ColumnPreprocessorCreator

Background

  • Currently, the ColumnPreprocessorCreator is more or less hard-coded
  • It combines the SQLPreprocessor with the creation of the ColumnTransformers in a hard-coded way
  • It uses a fixed strategy to decide which preprocessing is applied to which column
  • It only hard-codes the preprocessors MaxMinScaler and OneHotEncoding
  • It currently assumes that you use the source table as input for training

Acceptance Criteria

  • Build an abstraction which combines the creation of SQLPreprocessors and ColumnTransformers
  • Make the ColumnPreprocessorCreator configurable via a mapping between column selectors and this abstraction
  • Make the ColumnPreprocessorCreator return the table to use for further processing, to allow training on globally modified data (which is necessary for target encoding)
    -> This requires that the ColumnTransformers can have two modes, one for fit and one for transform
    -> This requires that the PartialFitIterator switches the mode to fit and the other iterators to transform

Add BucketFS Abstraction for UDF

Background

  • We usually can't inject objects into UDF functions
  • For that reason, we need an abstraction at least for our other functions
  • We also need a factory which can generate either a MockBucketFS backed by the file system or a real BucketFS-backed object

Acceptance Criteria

  • Add abstract class for BucketFS
  • Add implementation for BucketFS
  • Add mock implementation backed by file system
  • Add a factory which can generate a BucketFS from a connection (see the sketch below)
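
A hedged sketch of the proposed abstraction (class and method names are illustrative, not the project's actual API): an abstract BucketFS location, a mock backed by the local file system, and a factory that would build the real implementation from a connection.

```python
# Illustrative only: BucketFS abstraction with a file-system-backed mock.
from abc import ABC, abstractmethod
from pathlib import Path


class AbstractBucketFSLocation(ABC):
    @abstractmethod
    def upload_bytes(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def download_bytes(self, path: str) -> bytes: ...


class LocalFSMockBucketFSLocation(AbstractBucketFSLocation):
    """Mock implementation backed by a directory on the local file system."""

    def __init__(self, base_path: Path):
        self._base_path = base_path

    def upload_bytes(self, path: str, data: bytes) -> None:
        target = self._base_path / path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)

    def download_bytes(self, path: str) -> bytes:
        return (self._base_path / path).read_bytes()


def bucketfs_location_from_connection(connection=None) -> AbstractBucketFSLocation:
    """Factory: a real implementation would be built from the connection's
    address and credentials; without one, fall back to the mock."""
    if connection is None:
        return LocalFSMockBucketFSLocation(Path("/tmp/mock_bucketfs"))
    raise NotImplementedError("real BucketFS-backed implementation goes here")
```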

Add experiment name to the SQLTablePreprocessor

Background

  • The SQLTablePreprocessor creates new tables from the source table whose names include the source table name
  • Often you have many experiment runs and don't want to overwrite previous runs

Acceptance Criteria

  • SQLTablePreprocessor accepts an experiment name and includes it in all generated table names
  • TrainRunner, TrainUDF and TablePreprocessor accept the parameter as well and forward it to the SQLPreprocessor

Use a factory for PartialFitRegressorUDF in TrainRunner and TrainUDF

Background

  • Currently, the PartialFitRegressorUDF is hard-coded in TrainRunner and TrainUDF
  • However, there might be different implementations of it; for example, one with configurable epochs (which already exists), early stopping, ...

Acceptance Criteria

  • Use a factory for the creation of the PartialFitRegressorUDF in TrainRunner and TrainUDF (see the sketch below)
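
A hedged sketch of the factory idea (names are illustrative): the runner receives a callable that builds the UDF, so variants with configurable epochs or early stopping can be plugged in without changing TrainRunner or TrainUDF.

```python
# Illustrative only: inject a factory instead of hard-coding the UDF class.
from typing import Callable, Protocol


class PartialFitUDFLike(Protocol):
    def run(self, ctx) -> None: ...


class TrainRunnerSketch:
    def __init__(self, udf_factory: Callable[[], PartialFitUDFLike]):
        self._udf_factory = udf_factory

    def run(self, ctx) -> None:
        udf = self._udf_factory()   # the concrete UDF variant is chosen by the factory
        udf.run(ctx)
```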

Add ReservoirShuffle

Background

  • For SGD-based algorithms, you need to reshuffle the data each epoch
  • However, shuffling usually needs the whole dataset, which is not feasible in the UDF
  • A non-perfect shuffle (usually enough for SGD) can be done with limited memory
  • It works similarly to reservoir sampling, except that when the random oracle decides to replace a value in memory with the current new value, you replace the old value but also yield it to the iterator user

Acceptance Criteria

  • Implement a ReservoirShuffleIterator with limited memory which can be used for batches or for rows in the form of pandas DataFrames (see the sketch below)
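
A hedged sketch of the described algorithm (names are illustrative): keep a fixed-size buffer; when the random oracle decides to put the incoming item into the buffer, yield the evicted occupant instead; otherwise yield the incoming item directly. At the end, flush the buffer in random order. The items can be single rows or whole pandas DataFrame batches.

```python
# Illustrative only: memory-bounded, non-perfect shuffle of a stream of items.
import random
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def reservoir_shuffle(items: Iterable[T], buffer_size: int,
                      rng: Optional[random.Random] = None) -> Iterator[T]:
    rng = rng or random.Random()
    buffer: list = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)                  # fill the reservoir first
            continue
        index = rng.randrange(buffer_size + 1)   # the "random oracle"
        if index < buffer_size:
            buffer[index], item = item, buffer[index]   # evict old, keep new
        yield item                               # evicted value or the new item
    rng.shuffle(buffer)
    yield from buffer                            # flush remaining buffered items
```

A buffer of a few thousand rows or batches keeps memory bounded while mixing the data enough for SGD-style training, in line with the non-perfect-shuffle idea described above.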
