zinchse / hero

Python 100.00%

hero's Introduction

I'm inspired by the development of Algorithms and Data Structures that are optimal for specific tasks.
🎨 In free time I like to play sports, listen to music and solve algorithmic problems.

hero's Issues

improve project structure

check configuration files in all project branches (.pylintrc, requirements.txt, etc.).
remove all unnecessary packages from the requirements.txt file
...

add costs

add cost values in the plan representation

Cost model vs NN

Compare the learned NN with the cost model on the plan ranking problem (in generalisation mode!): "What is the probability that two randomly chosen plans will be ordered correctly by comparing their costs (or NN predictions)?"
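The question above can be estimated empirically as the fraction of plan pairs whose order by score (cost or NN prediction) matches their order by measured execution time. The following is a minimal sketch under assumptions: the function name `pairwise_ordering_accuracy` and the input layout (two parallel lists) are hypothetical and not taken from the hero code base.

```python
from itertools import combinations


def pairwise_ordering_accuracy(scores, times):
    """Fraction of plan pairs ordered the same way by `scores` and by `times`.

    `scores` holds cost-model values or NN predictions (hypothetical input),
    `times` holds measured execution times for the same plans.
    """
    correct = 0
    total = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # a tie in the ground truth carries no ordering information
        total += 1
        # The pair is ordered correctly when both differences have the same sign.
        if (scores[i] - scores[j]) * (times[i] - times[j]) > 0:
            correct += 1
    return correct / total if total else float("nan")


times = [1.0, 2.0, 3.0, 4.0]
cost_scores = [10, 20, 15, 40]  # one inverted pair: plans 1 and 2
print(pairwise_ordering_accuracy(cost_scores, times))
```

Running the same function once with cost values and once with NN predictions on held-out plans gives two directly comparable numbers.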

Task

Investigate possible learning algorithms (i.e. exploration strategies for finding good transitions).

Context

We have realized that the robustness problem is quite acute even on commonly used benchmarks.
A natural way to deal with it is to a) switch to offline learning and b) check how similar the custom plan and its estimated cardinalities are to the experience recorded in the history.
To guarantee the safety of any prediction of the model M, the transitions obtained with it must already have been explored. But in that case we don't need a prediction model at all, because we can take the times directly from the history!
This means that offline learning reduces to applying a smart strategy for filling the history with the most useful transitions, i.e., we must explore queries and hintsets in such a way that we find the transitions with the highest speedup as quickly as possible.
So hintsets become just a way to obtain the desired transition, and inference reduces to searching, for a given default plan, for hintsets that could lead to a good and already confirmed transition. We will see later why this is an extremely important feature of the model.
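The idea of inference as a lookup over confirmed transitions can be sketched as follows. This is an illustration under assumptions, not the hero implementation: the history layout (a dict keyed by a default-plan signature and a hintset id), the function name `choose_hintset`, and the `min_speedup` parameter are all hypothetical.

```python
def choose_hintset(history, default_plan, default_hintset=0, min_speedup=1.0):
    """Pick the best already-confirmed hintset for a default plan.

    `history` maps (default_plan_signature, hintset) -> observed execution time.
    Only transitions present in the history are considered, so no prediction
    model is involved; unknown plans fall back to the default hintset.
    """
    baseline = history.get((default_plan, default_hintset))
    if baseline is None:
        return default_hintset  # nothing confirmed for this plan: stay safe
    best_hintset, best_time = default_hintset, baseline
    for (plan, hintset), observed in history.items():
        if plan != default_plan:
            continue
        if observed < best_time and baseline / observed >= min_speedup:
            best_hintset, best_time = hintset, observed
    return best_hintset


history = {
    ("plan_a", 0): 10.0,  # default execution
    ("plan_a", 3): 4.0,
    ("plan_a", 5): 2.0,   # best confirmed transition
    ("plan_b", 0): 1.0,
}
print(choose_hintset(history, "plan_a"))
print(choose_hintset(history, "plan_c"))
```

Because the answer only ever comes from recorded executions, the quality of the result depends entirely on which transitions the exploration strategy chose to fill the history with.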

Main Questions

For various load configurations, answer the following questions:
- When can the application of a NN in an online scenario be beneficial?
- How many resources will be needed for this?
- How much more effective is the hero approach?
Consider a) planning time, b) training time (hero), and c) regression from predictions in an online scenario.
Investigate the dependency of the achieved performance gain and required resources on the search space (only hintset / only dop / hintset and dop).

Scenarios of Interest

A scenario in emulation is determined by two components: the data available for model training and the workload.

- Data = all default plans, workload = all queries.
  Goal: to test the ability to generalise knowledge based on the history of standard plans without changing the workload.
- Data = results of the execution of plans previously selected by the NN, workload = all queries; the process of training models, executing the workload, and collecting data is repeated until convergence to the optimum.
  Goal: to measure the resources needed to achieve a beneficial outcome using the classical approach.
- Data = plans of all fast queries, workload = long queries (and vice versa).
  Goal: to test the ability to generalise to a workload with changes in the distribution of query execution times.
- Data = plans of part of the queries with the structure of the standard tree X, workload = remaining queries with the same structure X.
  Goal: to test the ability to generalise knowledge from a partial history to a workload with changes only in the statistics of standard plans.
- Data = plans of part of the queries with the standard tree X, workload = remaining queries with the same standard tree X.
  Goal: to test the ability to generalise knowledge from a partial history to a workload without changes in standard plans.

add experiment artifacts

add archives with experiment artifacts:

model weights
loss curves, and
processed stratified metrics

explore the prediction modes

TL;DR
Compare different hint prediction modes: a) by template, b) by logical plan, and c) by full plan (with estimations).

Goals

Find answers to the following questions:

"Is it possible to make robust template-based hint prediction?"
"Is the logical plan enough to make robust hint prediction?"
"What is the worst case for these types of predictions?"

Exploring the possibilities of hint-based optimization

Description

Investigate extreme cases of query behavior (both regression and acceleration) when using hints and the query_dop parameter.

`sequential-all` dataset

Dataset Description

Collect the results of sequential calls of the EXPLAIN (format json) and EXPLAIN (analyze, format json) commands for all queries from 3 common benchmarks (JOB, TPCH, sample_queries) under different environment settings (all combinations of 7 hints and 3 parallel modes). To reduce the collection time, plans that duplicate an already executed plan do not need to be executed again.
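The enumeration of settings and the skipping of duplicated plans can be sketched as below. Several things here are assumptions rather than facts from the dataset description: "all combinations of 7 hints" is read as every on/off subset (2^7 = 128 hintsets), the hint names and the three `dop` values are placeholders, and `explain` is a hypothetical callable that returns a plan signature for a setting (in practice it would wrap an EXPLAIN (format json) call).

```python
from itertools import product

HINTS = [f"hint_{i}" for i in range(7)]  # placeholder names, not the real hints
DOPS = [1, 2, 4]  # hypothetical degree-of-parallelism values


def enumerate_settings(hints=HINTS, dops=DOPS):
    """Yield every (hintset bitmask, dop) combination as a dict."""
    for mask, dop in product(range(2 ** len(hints)), dops):
        active = [h for i, h in enumerate(hints) if (mask >> i) & 1]
        yield {"hintset": mask, "active_hints": active, "dop": dop}


def settings_to_execute(settings, explain):
    """Keep one setting per distinct plan so duplicates are not re-executed.

    `explain(setting)` must return a hashable plan signature (assumption).
    """
    first_seen = {}
    for setting in settings:
        first_seen.setdefault(explain(setting), setting)
    return list(first_seen.values())


all_settings = list(enumerate_settings())
print(len(all_settings))
```

Only the settings returned by `settings_to_execute` would then need the more expensive EXPLAIN (analyze, format json) run.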

check TCNN abilities

to do:

reimplement TCNN
check its ability to avoid regressions
try a neighbor prioritization approach during local search using TCNN prediction sorting
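The last item, neighbor prioritization during local search, could look roughly like the sketch below. It is only an illustration under assumptions: hintsets are modeled as bitmasks whose neighbors differ in one bit, and `predict` (standing in for a TCNN prediction) and `execute` (a real measurement) are hypothetical callables, not part of any existing API.

```python
def local_search(start, n_hints, predict, execute, max_steps=10):
    """Greedy local search over hintset bitmasks.

    Neighbors are sorted by predicted time, so the most promising ones are
    executed first; the search moves to the first neighbor that is actually
    faster and stops when no unexplored neighbor improves.
    """
    current, current_time = start, execute(start)
    explored = {start: current_time}
    for _ in range(max_steps):
        neighbors = [current ^ (1 << i) for i in range(n_hints)]
        neighbors = [n for n in neighbors if n not in explored]
        neighbors.sort(key=predict)  # prioritize by predicted execution time
        improved = False
        for candidate in neighbors:
            measured = execute(candidate)
            explored[candidate] = measured
            if measured < current_time:
                current, current_time = candidate, measured
                improved = True
                break
        if not improved:
            break
    return current, current_time, explored


def toy_time(hintset):
    """Toy ground truth: every enabled hint saves one time unit."""
    return 10 - bin(hintset).count("1")


best, best_time, explored = local_search(0, 3, predict=toy_time, execute=toy_time)
print(best, best_time, sorted(explored))
```

With a reasonably accurate predictor, good neighbors are tried early and fewer real executions are spent on unpromising hintsets; since every move is confirmed by a measurement, a wrong prediction costs time but does not produce an unverified choice.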