exasol / advanced-analytics-framework

Framework for building complex data analysis algorithms with Exasol

License: MIT License

Python 94.11% Shell 1.76% Lua 3.99% Dockerfile 0.13% Jinja 0.01%
data-science exasol exasol-integration

advanced-analytics-framework's People

Contributors

ckunki, dejanmihajlovic, marlenekress79789, nicoretti, tkilias, umitbuyuksahin

advanced-analytics-framework's Issues

Add Python Event-Handler Framework

Background

  • One part of the framework will be a Python event handler. Usually, the user implements the logic of the event handler, but we need to implement the framework that executes it.
  • The framework is responsible for loading and storing the state, for calling the user-provided event handler, and for managing temporary tables.
  • To test event handlers, a Python implementation of the event loop is useful, because debugging with the Lua event loop is probably difficult.
  • Users need to be able to specify the class which implements the event handler logic.
    • Alternatively, they might define a UDF from which this class is imported. This is for ad hoc algorithms, when we don't want to build a Python package or language container for the algorithm.
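
One possible way to let users specify the handler class is to pass a module name and a class name and import the class dynamically. The following is a minimal sketch under that assumption; the function name load_event_handler_class is hypothetical, and a standard-library class stands in for a user-provided handler.

```python
import importlib


def load_event_handler_class(module_name: str, class_name: str) -> type:
    """Import `module_name` and return the class `class_name` defined in it."""
    module = importlib.import_module(module_name)
    handler_class = getattr(module, class_name)
    if not isinstance(handler_class, type):
        raise TypeError(f"{module_name}.{class_name} is not a class")
    return handler_class


# A standard-library class stands in for a user-provided event handler class.
handler_class = load_event_handler_class("collections", "OrderedDict")
handler = handler_class()
```

The framework would then only need the two strings in its configuration, and the package containing the class has to be importable inside the UDF.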

Acceptance Criteria

  • Create Python Event-Handler Framework
  • Create Python Event Loop for Testing

QueryHandlerRunnerUDF doesn't serialize the final_result to json

Background

  • The ResultType of the UDFQueryHandler assumes a Dict[str,Any] as return value, and it expects that the QueryHandlerRunnerUDF serializes this value to JSON
  • However, currently, we transform the result into a string and return that instead
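
The difference between the two behaviours can be illustrated with plain Python. This is only a sketch; serialize_final_result is a hypothetical helper, not the actual code of the QueryHandlerRunnerUDF.

```python
import json
from typing import Any, Dict


def serialize_final_result(final_result: Dict[str, Any]) -> str:
    """Serialize the handler result to a JSON string instead of calling str() on it."""
    return json.dumps(final_result)


result = {"rows": 3, "status": "ok"}
as_str = str(result)                       # Python repr with single quotes, not valid JSON
as_json = serialize_final_result(result)   # can be parsed again with json.loads
```

A consumer that calls json.loads on the string produced by str() fails, while the JSON string round-trips to the original dictionary.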

Add UDF Local Discovery

Background:

Some algorithms need communication between the UDF instances of a UDF call, for example ML algorithms or some HPC algorithms. To achieve this, we first need to discover all UDF instances of a UDF call. We can have many instances in a cluster, so we first need to do a local discovery per node.

Cleanup in Error Handling

  • If any error occurs during the execution, temporarily created
    -- bucketfs files
    -- tables
    should be cleaned up, unless the option to keep them is selected.

  • don't let Lua do sys.exit() when an error occurs

Store bundled Lua scripts as resource

  • in the current installation, the bundled Lua script is generated while installing the project

  • however, this installation requires Lua and its dependencies to be installed on the user's machine.

  • in order to break this dependency, we should store the bundled Lua scripts in the wheel.

  • add functionality to save the bundled script

  • provide a CLI option to choose whether to re-bundle or not

  • run the re-bundling on each commit to make sure that the bundled script is saved

_ScopeQueryHandlerContextBase._invalidate might not invalidate all child contexts if a grand-child context wasn't released

Background

    def _invalidate(self):
        self._check_if_valid()
        self._invalid_object_proxies = self._invalid_object_proxies.union(self._valid_object_proxies)
        self._valid_object_proxies = set()
        self._owned_object_proxies = set()
        self._is_valid = False
        child_context_were_not_released = False
        for child_query_handler_context in self._child_query_handler_context_list:
            if child_query_handler_context._is_valid:
                child_context_were_not_released = True
                child_query_handler_context._invalidate()
        if child_context_were_not_released:
            raise RuntimeError("Child contexts are not released.")
  • child_query_handler_context._invalidate() can raise an exception, which would stop the for loop. We need to catch the exception.
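
A possible fix could look like the following sketch. ScopeContext is a hypothetical, reduced stand-in for _ScopeQueryHandlerContextBase that keeps only the validity flag and the child list. The point it illustrates is catching the exception per child, so that the remaining siblings are still invalidated.

```python
from typing import List, Optional


class ScopeContext:
    """Hypothetical, reduced stand-in for _ScopeQueryHandlerContextBase."""

    def __init__(self) -> None:
        self._is_valid = True
        self._children: List["ScopeContext"] = []

    def add_child(self) -> "ScopeContext":
        child = ScopeContext()
        self._children.append(child)
        return child

    def release(self) -> None:
        self._invalidate()

    def _invalidate(self) -> None:
        self._is_valid = False
        child_contexts_were_not_released = False
        for child in self._children:
            if child._is_valid:
                child_contexts_were_not_released = True
                try:
                    child._invalidate()
                except RuntimeError:
                    # The child reported unreleased grand-children. Keep going,
                    # so that the remaining siblings are invalidated as well.
                    pass
        if child_contexts_were_not_released:
            raise RuntimeError("Child contexts are not released.")


root = ScopeContext()
first, second = root.add_child(), root.add_child()
grand_child = first.add_child()   # never released explicitly
error: Optional[str] = None
try:
    root.release()
except RuntimeError as e:
    error = str(e)
```

In this sketch the error is still reported once at the top, but second is invalidated even though invalidating first raised.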

Store stacktrace during the creation of the query_handler_context

Background

  • It is difficult to identify which query_handler_context wasn't released, if we get an Exception. To allow better diagnostics, we should store the stack trace during the constructor call of the query_handler_context as a string inside the query_handler_context.
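
With the standard traceback module this could be sketched as follows. TrackedContext and its attribute names are hypothetical; only traceback.format_stack is an existing standard-library call.

```python
import traceback


class TrackedContext:
    """Hypothetical context that remembers where it was constructed."""

    def __init__(self) -> None:
        # format_stack() returns a list of strings, one entry per stack frame.
        self.creation_stack_trace = "".join(traceback.format_stack())
        self.released = False

    def describe_leak(self) -> str:
        return (
            "Context was not released. It was created at:\n"
            + self.creation_stack_trace
        )


def create_context_somewhere() -> TrackedContext:
    return TrackedContext()


context = create_context_somewhere()
message = context.describe_leak()
```

Storing the trace as a string, as the issue suggests, keeps the context picklable, which would not be the case for frame objects.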

Implement MockEventContext

Background

  • currently, we need a UDF mock or a UDF to test the EventHandler
  • for complex EventHandlers, this makes testing unnecessarily hard
  • we should implement a MockEventContext which we can create from a column definition and a list of tuples
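
A minimal sketch of such a mock is shown below. The interface (attribute access per column, next() returning a bool, rowcount()) is an assumption modelled on how a UDF context is typically used; the real EventContext interface may differ.

```python
from typing import Any, List, Sequence, Tuple


class MockEventContext:
    """Hypothetical mock: serves rows from a list of tuples, columns by name."""

    def __init__(self, columns: Sequence[Tuple[str, type]], rows: List[Tuple[Any, ...]]):
        self._column_index = {name: i for i, (name, _) in enumerate(columns)}
        self._rows = rows
        self._position = 0

    def __getattr__(self, name: str) -> Any:
        # Only called for attributes that are not found normally, i.e. column names.
        column_index = self.__dict__.get("_column_index", {})
        if name not in column_index:
            raise AttributeError(name)
        return self._rows[self._position][column_index[name]]

    def next(self) -> bool:
        """Advance to the next row; return False if there is none (assumed semantics)."""
        if self._position + 1 < len(self._rows):
            self._position += 1
            return True
        return False

    def rowcount(self) -> int:
        return len(self._rows)


context = MockEventContext([("id", int), ("name", str)], [(1, "a"), (2, "b")])
names = [context.name]
while context.next():
    names.append(context.name)
```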

Make Lua parts easier to test

Background

  • We noticed in downstream projects that mocking some parts was really difficult
  • Specifically, mocking the parts that are specific to Exasol Lua Scripts, such as pquery
  • For that reason, we decided to inject these parts at the top

Remove setup.py

Background

  • with poetry 1.4.0, poetry no longer creates the setup.py https://github.com/python-poetry/poetry/releases/tag/1.4.0
  • we currently use poetry build to generate the setup.py
  • however, the setup.py isn't needed with newer pip versions, or if there are releases to PyPI or as wheels
  • for that reason, we can remove the setup.py and the githook that generates it from this repo

Acceptance Criteria

  • #94
  • Update workflows to poetry 1.4.0
  • Remove setup.py Github Workflow
  • Remove setup.py githook
  • Remove setup.py file

Add ZMQ wrapper which injects faults by losing messages

Background

  • In real life, we have to expect that messages sometimes get lost. For that reason, we need to design and test our protocols in a way that they can survive this
  • ZMQ ROUTER and DEALER use TCP, but failure scenarios still exist, especially because the send operation is asynchronous and we don't get informed about reconnects
  • This abstraction helps us to test our protocols with lost messages

Acceptance

  • Add a ZMQ wrapper which injects faults by losing messages at random
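
The idea can be sketched without depending on ZMQ itself. In the sketch below, Sender is a hypothetical minimal interface with a single send method, RecordingSender stands in for a real socket, and the random generator is injected so that tests can seed it. None of these names come from the repository.

```python
import random
from typing import List, Protocol


class Sender(Protocol):
    def send(self, message: bytes) -> None: ...


class FaultInjectionSender:
    """Wraps a sender and silently drops messages with a given probability."""

    def __init__(self, wrapped: Sender, drop_probability: float, rng: random.Random):
        self._wrapped = wrapped
        self._drop_probability = drop_probability
        self._rng = rng

    def send(self, message: bytes) -> None:
        if self._rng.random() < self._drop_probability:
            return  # simulate a lost message
        self._wrapped.send(message)


class RecordingSender:
    """Stand-in for a real socket; records what would have been sent."""

    def __init__(self) -> None:
        self.sent: List[bytes] = []

    def send(self, message: bytes) -> None:
        self.sent.append(message)


recorder = RecordingSender()
lossy = FaultInjectionSender(recorder, drop_probability=0.5, rng=random.Random(0))
for i in range(100):
    lossy.send(str(i).encode())
```

A protocol under test would receive the lossy sender instead of the real socket and has to cope with the messages missing from recorder.sent.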

Add MockQueryHandlerRunner

Background

  • To test SQL-based query handlers, we need to actually run the query handlers against a database. However, using the Lua-based QueryHandlerRunner requires building the language container
  • To avoid this, we need a Python-based runner which can send the queries to the database via a SQL client
  • Furthermore, by abstracting the SQL client, we can also mock it and compare the generated queries with our expectations in unit tests.
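
The abstraction of the SQL client could be sketched like this. SQLExecutor, RecordingSQLExecutor and run_two_step_handler are hypothetical names chosen for illustration; a real implementation of the executor would forward execute to a database driver.

```python
from typing import Any, List, Protocol, Tuple


class SQLExecutor(Protocol):
    def execute(self, sql: str) -> List[Tuple[Any, ...]]: ...


class RecordingSQLExecutor:
    """Mock executor: records queries and returns canned results in order."""

    def __init__(self, canned_results: List[List[Tuple[Any, ...]]]):
        self.queries: List[str] = []
        self._canned_results = list(canned_results)

    def execute(self, sql: str) -> List[Tuple[Any, ...]]:
        self.queries.append(sql)
        return self._canned_results.pop(0)


def run_two_step_handler(executor: SQLExecutor) -> int:
    """Hypothetical handler logic: create a table, then count its rows."""
    executor.execute("CREATE TABLE TMP_T AS SELECT 1 AS X")
    rows = executor.execute("SELECT COUNT(*) FROM TMP_T")
    return rows[0][0]


executor = RecordingSQLExecutor([[], [(1,)]])
count = run_two_step_handler(executor)
```

A unit test can then compare executor.queries with the expected SQL, without a database or a language container.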

Add README

Background:

I saw we are missing a README in this project.

Acceptance Criteria

  • Add a README in the usual style, but add that this repository is in early development

Add Lua Event Loop

Background

  • The framework consists of an Event Loop and an Event Handler
  • The Event Loop calls the Event Handler and executes the queries the Event Handler returned
  • The Event Loop needs to be written in Lua, such that we can run it in a Lua Script and use pquery to run the queries
  • The Event Loop needs to receive the configuration and must forward it to the Event Handler
Acceptance Criteria

  • Implement Event-Loop as Lua Module
  • Implement Lua Script
  • Implement deployment cli

Initial Setup of the Project

  • add GitHub workflows
  • add githooks
  • a short README.rst
  • doc folder including changes folder, guides... etc
  • poetry setup
  • subdirectories such as lua/src, lua/test

Add integration test for EventLoop

  • We need a custom EventHandler class inside the language container to test the EventLoop.
  • However, we don't want to put test code into the release package.
  • Therefore, we need an additional language container flavor for testing purposes.
  • The repo architecture could be as follows:
- pyproject.toml
- exasol_advanced_analytics_framework
- tests/test_package/
	- pyproject.toml
		- include: ../event_handlers or if this doesn't work we create symlink ../event_handlers -> test_package
- tests/event_handlers
	- test_event_handler1.py
	- test_event_handler2.py
- language_container
	- exasol_advanced_analytics_framework
		- flavor_base
			- release
				- install exasol_advanced_analytics_framework
			- test depends on release
				- install test_package

Use new DBObjectName interfaces for temporary DBObjectNames in QueryHandlerContext

Background

  • exasol/data-science-utils-python#60 extracted the interfaces from the DBObjectName, such as TableName, ...
  • For more complex garbage collection of DBObjects, we need to handle the DBObjectProxies like normal DBObjectNames, such that we can combine them with the DBObject classes and track dependencies between them.

Acceptance Criteria

  • Rename DBObjectProxies to DBObjectNameProxies
  • Let DBObjectNameProxies implement the DBObjectName Interfaces

Potential bug in _ScopeQueryHandlerContextBase._transfer_object_to

Background

Update to Lua 5.4

Background

  • currently we still use Lua 5.1, because some parts of the dev environment use Ubuntu 20.04

Acceptance Criteria

  • update dev environment (CI, vagrant)
  • update code (use const)
  • update exaerror

Add an event-handler framework to emulate imperative programming

Background

  • Most algorithms are formulated as imperative programs with data flow parts
    • This means the high-level structure of the algorithm consists of for-loops, while-loops, if-statements and sequences of statements.
    • The single statements are then often data flows
  • Data flows are usually represented in this framework as SQL queries
  • Because only Lua Scripts can run dynamic SQL queries in the same transaction and we wanted to have the core logic in Python, we designed this framework as event-driven, where the event handler returns the queries which Lua will execute
  • However, implementing complex algorithms with an event handler is fairly difficult. State machines easily get very complex, and hierarchical state machines don't help much.
  • To simplify the implementation of algorithms, we need a framework which allows implementing an imperative program where the statements are event handlers. This allows us to write the main logic of the algorithms in a very familiar paradigm, and only at the lowest level do we need to care about the event handling.

Design Idea

class Variables:

    def __init__(self, parent_scope_variables):
        """
        Variables are scoped and typed and need to be declared before use. Whenever you enter a block (e.g. in an IfStatement), your variables get a new scope. Variables you create in this scope are only visible in this scope or its sub-scopes, but not in the parent scopes.
        """
        self.parent_scope_variables = parent_scope_variables
        self.local_variables: Dict[str, Variable] = {}

    def declare(self, name, type, value): ...

    def get(self, name): ...

    def set(self, name, value): ...


class Statement:
    """
    We implement objects for the usual imperative constructs, such as statements, if-statements, loops, ... Except that our statements consist of an init_handler method, which is called first, and a handle_event method, which can execute SQL queries via the Continue result.
    """
    def init_handler(self, variables: Variables): ...
    def handle_event(self, event_context, event_handler_context) -> Union[Continue, Finished]: ...

class StatementChain(Statement):

    def __init__(self, event_handler_factories: List[StatementFactory]):
        self.event_handler_factories = event_handler_factories

    def init_handler(self, variables: Variables):
        # Keep the variables, because the next statements are initialized in handle_event.
        self.variables = variables
        self.event_handlers = create_event_handler(self.event_handler_factories)
        self.current_event_handlers_idx = 0
        self.event_handlers[self.current_event_handlers_idx].init_handler(variables)

    def handle_event(self, event_context, event_handler_context) -> Union[Continue, Finished]:
        result = self.event_handlers[self.current_event_handlers_idx].handle_event(event_context, event_handler_context)
        if isinstance(result, Continue):
            return result
        else:
            if self.current_event_handlers_idx < len(self.event_handlers) - 1:
                # Open point of this sketch: nothing is returned after advancing to the next statement.
                self.current_event_handlers_idx += 1
                self.event_handlers[self.current_event_handlers_idx].init_handler(self.variables)
            else:
                return result

class ForEachStatement(Statement): ...

class WhileStatement(Statement): ...

class IfStatement(Statement): ...
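
The scoping rules described in the docstring of Variables can be made concrete with a small runnable sketch. It is only one possible reading of the design idea: the type check via isinstance, the local_types dictionary and the helper _scope_of are assumptions that the issue does not specify.

```python
from typing import Any, Dict, Optional


class Variables:
    """Hypothetical scoped, typed variable store."""

    def __init__(self, parent_scope_variables: Optional["Variables"] = None) -> None:
        self.parent_scope_variables = parent_scope_variables
        self.local_variables: Dict[str, Any] = {}
        self.local_types: Dict[str, type] = {}

    def declare(self, name: str, type_: type, value: Any) -> None:
        if name in self.local_variables:
            raise KeyError(f"Variable {name!r} is already declared in this scope")
        if not isinstance(value, type_):
            raise TypeError(f"Variable {name!r} expects {type_.__name__}")
        self.local_types[name] = type_
        self.local_variables[name] = value

    def _scope_of(self, name: str) -> "Variables":
        # Walk from the current scope up through the parent scopes.
        scope: Optional["Variables"] = self
        while scope is not None:
            if name in scope.local_variables:
                return scope
            scope = scope.parent_scope_variables
        raise KeyError(f"Variable {name!r} is not declared")

    def get(self, name: str) -> Any:
        return self._scope_of(name).local_variables[name]

    def set(self, name: str, value: Any) -> None:
        scope = self._scope_of(name)
        if not isinstance(value, scope.local_types[name]):
            raise TypeError(f"Variable {name!r} expects {scope.local_types[name].__name__}")
        scope.local_variables[name] = value


outer = Variables()
outer.declare("iteration", int, 0)
inner = Variables(outer)            # e.g. the block of an IfStatement
inner.declare("tmp_table", str, "TMP_1")
inner.set("iteration", 1)           # parent variables are visible in the sub-scope
```

Variables declared in inner stay invisible to outer, so a lookup of tmp_table on outer raises a KeyError.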

Make project more reusable

Background

  • BundleLuaScripts uses hard-coded parameters
  • query_loop_main.lua contains code which should be part of query_loop.lua, because it is specific to the query loop and is needed in other projects if they use the query loop
  • We would like to reuse it in other projects

Acceptance Criteria

  • Refactor BundleLuaScripts such that we can reuse it
  • Move _prepare_init_query and dependent functions to query_loop.lua

Global UDF Discovery

Background

  • With #79 we implemented the UDF Discovery on a single node.
  • Now we need to implement the discovery between nodes.

Add user guide

Background

We need a User Guide for this repo. It should contain the following sections:

  • Setup
  • Usage
  • Implementation of algorithms

Fix __hash__ DBObjectNameProxy

Background

  • This is, currently, the hash function of DBObjectNameProxy
    def __hash__(self):
        self._check_if_valid()
        return hash(id(self))

  • It has two issues
  • First, we use self._check_if_valid(), which fails when the DBObject was released. If we store DBObjects as keys in dicts or in sets, this can mean that we can't remove these objects from the container anymore.
    • __eq__ has the same issue
  • Second, it uses the id, which might change after a pickle and unpickle round trip, which is a common case for us.
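
One way to address both issues is to base hashing and equality on a stable value and to skip the validity check there. The sketch below assumes that the proxy has a unique, immutable name, which the issue does not state. DBObjectNameProxySketch is a hypothetical stand-in, and copy.deepcopy is used in place of a pickle round trip, since both yield a new object with a different id().

```python
import copy


class DBObjectNameProxySketch:
    """Hypothetical proxy whose hash and equality do not depend on validity or id()."""

    def __init__(self, name: str) -> None:
        self._name = name
        self._is_valid = True

    def release(self) -> None:
        self._is_valid = False

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, DBObjectNameProxySketch):
            return NotImplemented
        return self._name == other._name

    def __hash__(self) -> int:
        # No validity check and no id(): the hash stays usable after release
        # and is the same for a copy of the object.
        return hash(self._name)


proxy = DBObjectNameProxySketch("TMP_TABLE_1")
container = {proxy}
proxy.release()
duplicate = copy.deepcopy(proxy)   # new object, different id()
found = duplicate in container
container.remove(proxy)            # still works after release
```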

Extend the EventHandlerContext to a scope-based system for handling temporary objects

Background

  • Currently, the EventHandlerContext only provides an interface to create and load temporary bucketfs files, but doesn't yet have a way to clean them up
  • Implementing the EventHandlerContext as a scope-based system gives us finer control over when temporary objects get cleaned up. This is important in case we execute a long sequence of operations where each operation can create temporary objects. At some point, some temporary objects are not needed anymore and can be released. Without that, we would potentially accumulate many large tables.
  • In the scope-based EventHandlerContext, a parent event handler creates a ScopeEventHandlerContext and forwards it to a child event handler. The child event handler uses the EventHandlerContext for creating temporary objects. When the child event handler finishes, the parent releases the ScopeEventHandlerContext and, with that, all temporary objects created by the child event handler.
  • We can, of course, build the scopes recursively, where all intermediate event handlers get a ScopeEventHandlerContext from which they can create new ChildEventHandlerContexts.
  • On the top layer, we have a special EventHandlerContext which can finally clean up the released objects after the event handler has returned
  • Furthermore, we need to be able to transfer temporary objects from one scope to another, in case an event handler returns them or gets them as input. For this, a temporary object needs to be owned by exactly one ScopeEventHandlerContext, and we need a way to transfer ownership between ScopeEventHandlerContexts (parent, child, sibling)
  • Also, if a ScopeEventHandlerContext forgets to release its temporary objects, they should get released by its parent ScopeEventHandlerContext. This ensures that we clean up at the end of the event handler at the latest.
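
The ownership rules above can be sketched in a few lines. ScopeContextSketch and its method names are hypothetical. Temporary objects are represented by their names only, and the shared list released_objects plays the role of the top-level context that performs the actual cleanup later.

```python
from typing import List, Set


class ScopeContextSketch:
    """Hypothetical scope that owns temporary object names and child scopes."""

    def __init__(self, released_objects: List[str]) -> None:
        self._released_objects = released_objects  # shared with the top-level context
        self._owned: Set[str] = set()
        self._children: List["ScopeContextSketch"] = []
        self._released = False

    def create_child(self) -> "ScopeContextSketch":
        child = ScopeContextSketch(self._released_objects)
        self._children.append(child)
        return child

    def create_temporary_table(self, name: str) -> str:
        self._owned.add(name)
        return name

    def transfer_to(self, name: str, target: "ScopeContextSketch") -> None:
        # Exactly one scope owns an object at any time.
        self._owned.remove(name)
        target._owned.add(name)

    def release(self) -> None:
        if self._released:
            return
        self._released = True
        for child in self._children:
            child.release()  # also covers children that forgot to release
        self._released_objects.extend(sorted(self._owned))
        self._owned.clear()


to_clean_up: List[str] = []
top = ScopeContextSketch(to_clean_up)
child = top.create_child()
child.create_temporary_table("TMP_A")
result = child.create_temporary_table("TMP_RESULT")
child.transfer_to(result, top)     # the child returns TMP_RESULT to its parent
child.release()
after_child = list(to_clean_up)
top.release()
```

After the child is released, only TMP_A is marked for cleanup. TMP_RESULT survives until its new owner, the parent, is released.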

Evaluate Event-Handler overhead

Background

  • For the framework, we plan a Lua loop which executes SQL queries and runs an Event Handler which consumes their results and issues new SQL queries
  • The Event Handler should, if possible, be written in Python. For that, we need to start a Python UDF, which can add a bit of overhead.
  • Furthermore, the Event Handler is stateful. That means in each call to the UDF we need to store the event handler in the BucketFS and load it during the next call.
  • The total overhead should ideally be below 0.5 seconds, or at least below 1 second
  • We also need to know how fast the distribution of the event handler state in the BucketFS is for clusters with more cores

Unify is_valid and released for QueryHandlerContext

Background

Methods called in BucketFSLocationProxy.cleanup can fail and stop the cleanup

Background:

    def cleanup(self):
        if self._is_valid:
            raise Exception("Cleanup of BucketFSLocationProxy only allowed after release.")
        for file in self._bucketfs_location.list_files_in_bucketfs(""):
            self._bucketfs_location.delete_file_in_bucketfs(file)
  • list_files_in_bucketfs can fail if the location wasn't used for uploading
  • delete_file_in_bucketfs can fail, but this shouldn't prevent us from deleting the other files
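
A more tolerant cleanup could collect errors instead of stopping at the first one. The following sketch uses an in-memory stand-in for the BucketFS location. Only the two method names are taken from the snippet above, everything else, including the decision to return the collected errors, is an assumption.

```python
from typing import Iterable, List


class InMemoryLocation:
    """Stand-in for a BucketFS location with the two methods used by cleanup."""

    def __init__(self, files: Iterable[str], failing: Iterable[str] = ()) -> None:
        self.files = list(files)
        self.failing = set(failing)

    def list_files_in_bucketfs(self, path: str) -> List[str]:
        return list(self.files)

    def delete_file_in_bucketfs(self, file: str) -> None:
        if file in self.failing:
            raise IOError(f"Could not delete {file}")
        self.files.remove(file)


def cleanup(location) -> List[Exception]:
    """Delete as many files as possible and return the errors that occurred."""
    try:
        files = location.list_files_in_bucketfs("")
    except Exception as e:
        # Listing can fail, for example if nothing was ever uploaded.
        return [e]
    errors: List[Exception] = []
    for file in files:
        try:
            location.delete_file_in_bucketfs(file)
        except Exception as e:
            errors.append(e)  # remember the failure, but keep deleting
    return errors


location = InMemoryLocation(["a", "b", "c"], failing=["b"])
errors = cleanup(location)
```

The caller can then decide whether to log the collected errors or to raise a single combined exception after all deletions were attempted.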

Add a fault injection wrapper for the SocketFactory

Background

  • Our UDF communication algorithms need to work under message loss. ZMQ ROUTER and DEALER use TCP, but failure scenarios still exist, especially because the send operation is asynchronous and we don't get informed about reconnects
  • To test the fault tolerance of our algorithms, we need to inject send faults

Acceptance Criteria

  • Add a wrapper for the SocketFactory which can inject send faults
  • Use the wrapper in the tests for the peer handshake

Hide generated files in Github

Background

  • We have huge generated files for the Lua script checked into the repo, because we want to be able to install it without having Lua amalg available
  • However, they make reviewing unnecessarily hard
  • GitHub seems to provide a way to hide these files https://github.com/github/linguist/blob/master/docs/overrides.md#summary
  • It is likely that the following line in the .gitattributes file helps (please check which file extension the jinja sql templates have)
*.sql linguist-generated

Refactor EventHandler Interface

Background

  • currently, we only have the handle_event method, which accepts an EventContext and an EventHandlerContext.
  • however, for the first iteration of the EventHandler, there exists no input for the EventContext, because we didn't execute a ReturnQuery yet
  • for the first iteration, we only have the input parameters
  • for that reason, it is probably better to use two different methods for these cases
def handle_start(parameters,EventHandlerContext)->EventHandlerResult

def handle_query_result(EventContext)->EventHandlerResult

  • we don't insert the parameters via the constructor, because we might need it to construct complex nested EventHandlers, see #23
  • We should also provide two different versions of EventHandlers, one with a ScopeEventHandlerContext and one with an EventHandlerContext
  • it is probably also a good idea to be able to specify the input types and result types via generic types
Add get_connection to query_handler_context

Background

  • Currently, query_handler can't access the connection object to retrieve credentials.
  • UDFs provide the get_connection function for this, we need to forward this to the query_handlers
  • Because we already inject the query_handler_context and it is more or less responsible for the communication with the outside, it is probably the best place to add this functionality.

Fix parameter and query_result issues of QueryHandlerRunnerUDF

Background

  • In #53 we moved the serialization and deserialization into a wrapper QueryHandler, but it seems we forgot to remove the json.loads in QueryHandlerRunnerUDF
  • It also looks like the AAF_RUN_QUERY_HANDLER UDF, which runs the QueryHandlerRunnerUDF, is a scalar UDF and as such doesn't allow next, which is used by the query_result

UDFEventContext __next__ returns bool instead of row

Background

The EventContext is an abstraction around the UDFContext or any Exasol driver (e.g. pyexasol). We simplified its interface such that either of them can implement it. However, in contrast to the UDFContext, its __next__ method is supposed to return the row, not a bool indicating whether there are more rows.
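
The intended behaviour can be sketched as an iterator adapter. FakeUDFContext is a stand-in whose next() returns a bool, and current_row is a hypothetical helper, since a real UDF context exposes columns as attributes. The sketch also assumes that the context is positioned on the first row and contains at least one row.

```python
from typing import Any, Iterator, List, Tuple

Row = Tuple[Any, ...]


class FakeUDFContext:
    """Stand-in for a UDF context whose next() returns a bool (assumed semantics)."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = rows
        self._position = 0

    def current_row(self) -> Row:
        return self._rows[self._position]

    def next(self) -> bool:
        if self._position + 1 < len(self._rows):
            self._position += 1
            return True
        return False


class UDFEventContextSketch:
    """Adapter whose __next__ returns the row and signals the end with StopIteration."""

    def __init__(self, ctx: FakeUDFContext) -> None:
        self._ctx = ctx
        self._started = False
        self._exhausted = False

    def __iter__(self) -> Iterator[Row]:
        return self

    def __next__(self) -> Row:
        if self._exhausted:
            raise StopIteration
        # The context already points at the first row, so only advance afterwards.
        if self._started and not self._ctx.next():
            self._exhausted = True
            raise StopIteration
        self._started = True
        return self._ctx.current_row()


rows = list(UDFEventContextSketch(FakeUDFContext([(1, "a"), (2, "b")])))
```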

Acceptance Test

  • fix bug
  • add type hints to EventContext
  • add tests
