
ulysseb / telamon


A framework to find good combinations of optimizations for computational kernels on GPUs.

Home Page: https://ulysseb.github.io/telamon/telamon

License: Apache License 2.0

Rust 92.05% C 6.34% Vim Script 0.04% Python 0.38% Lex 0.25% C++ 0.06% Shell 0.71% Emacs Lisp 0.15% Makefile 0.01%
compilation cuda gpu performance-modeling rust

telamon's People

Contributors

adjivas, andidr, bors[bot], elarnon, nicotolly, samkaufman, superfluffy, ulysseb


telamon's Issues

Vector dimensions cannot be MERGED

Vector dimensions must be declared in the access pattern of every memory instruction nested inside them. This forbids merging vector dimensions with other dimensions, since the merged dimension would not be declared in the access pattern.

This might not be a problem, as vector dimensions cannot contain more than one instruction anyway. But if it is a problem, we should make the restriction explicit.

model::Size evaluation overflows in some cases

model::Size::bound triggers an overflow when the numerator and denominator are too big. The obvious solution is to use big integers. However, we might be able to do better by carefully ordering the computations.
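As an illustration of the reordering idea (a sketch with hypothetical function names, not model::Size's actual code), dividing out the GCD before multiplying keeps intermediates small enough to fit in a machine word:

```rust
// Sketch only: `bound_naive`/`bound_reduced` are stand-ins for the kind
// of numerator/denominator computation `model::Size::bound` performs.

fn gcd(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let t = a % b;
        a = b;
        b = t;
    }
    a
}

/// Naive ordering: the intermediate product can overflow u64.
fn bound_naive(num: u64, denom: u64, scale: u64) -> Option<u64> {
    num.checked_mul(scale).map(|p| p / denom)
}

/// Careful ordering: reduce by the GCD first so the product stays small.
/// (This can still overflow in pathological cases; big integers remain
/// the fallback.)
fn bound_reduced(num: u64, denom: u64, scale: u64) -> u64 {
    let g = gcd(num, denom);
    let (num, denom) = (num / g, denom / g);
    let g = gcd(scale, denom);
    (num * (scale / g)) / (denom / g)
}
```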

Add an explicit implication in exh syntax

Exh files are currently very hard to read. In particular, implications A => B are generally encoded as B || not A, even though this is not stated explicitly anywhere. This can be confusing, especially when there is more than one term on the left-hand side of the implication. It can also be mistaken for an ordinary disjunction. We could add significant clarity just by adding an implication syntax that allows writing A => B directly.
Possible syntaxes:
A => B (mathematical notation)
B :- A (Prolog)
Others?
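For reference, the encoding in question is just material implication; a two-line Rust check of the equivalence:

```rust
/// Material implication: `a => b` is exactly `b || !a`.
fn implies(a: bool, b: bool) -> bool {
    b || !a
}
```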

☂️ Improvements to statistical exploration

This is a tracking issue for various possible improvements to Telamon's statistical exploration procedures. Ideas include:

  • Some approaches (e.g. [1]) propose to learn a performance model instead of specifying one manually. One of Telamon's strengths here is also a weakness: the need to predict performance for partially-specified candidates makes it harder to learn a meaningful cost function, and we need the performance model to keep predicting an actual lower bound on the performance. However, we also use the performance model to guide the exploration in the monte-carlo phase, and we could use a ranking-based model (as in [1]) to predict, for instance, the most interesting candidate to explore during a monte-carlo descent directly from features extracted from the performance model, instead of using only the bound.
  • Experiment with alternatives to the monte-carlo search. This includes experimenting with different hyper-parameters, different strategies for the multi-armed bandit problem, and completely different approaches to the exploration, which could be based on genetic algorithms or simulated annealing. We need to think more about how those would work with partial candidates :)
  • Enable search across different choices. This would probably require an efficient way to compare candidates (since the search tree would become a search DAG and we want to merge identical candidates). One possibility is to perform the search across all possible decisions instead of limiting ourselves to a single choice, but that incurs a high computational cost. We could also try to dynamically predict good choices to make, or perform meta-analysis across different kernels to improve the existing static choice order.

1: TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
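On the third point, merging identical candidates needs a cheap equality or hash on candidates. A minimal sketch (the candidate representation here is hypothetical; Telamon candidates are not just decision lists):

```rust
use std::collections::HashSet;

/// Deduplicate candidates reached through different decision orders,
/// using an order-insensitive key over the decisions made so far.
fn dedup_candidates(candidates: Vec<Vec<u32>>) -> Vec<Vec<u32>> {
    let mut seen = HashSet::new();
    candidates
        .into_iter()
        .filter(|c| {
            let mut key = c.clone();
            // Two prefixes with the same decisions in a different order
            // denote the same candidate, so sort before hashing.
            key.sort_unstable();
            seen.insert(key)
        })
        .collect()
}
```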

Python bindings

We want Python bindings to Telamon in order to ease experimentation with statistical exploration approaches and integration with experiment reporting platforms such as mlflow.

The proposed approach is to use cbindgen to create a thin C API, then bundle it through milksnake to create a Python package. The initial goal is to expose a search function to drive Telamon's search from Python, as well as a couple of helpers to create kernels. Additional capabilities will be added as needed.

Improve the accuracy and cost of evaluations on the GPU

We currently run each candidate 20 times on the GPU if its execution time is within 3x of the current best candidate. Better accuracy and performance could be achieved by exiting after a few evaluations if the performance is too bad, and evaluating more if the performance is close to the best candidate (e.g. within 3%).
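The proposed scheme could look like the following sketch (the thresholds and the `run` closure are illustrative, not Telamon's actual evaluator):

```rust
/// Evaluate a candidate with an adaptive number of runs: bail out early
/// when it is clearly worse than `best`, and spend extra runs only when
/// it is within ~3% of `best`. Returns `None` for discarded candidates.
fn evaluate(mut run: impl FnMut() -> f64, best: f64) -> Option<f64> {
    let mut min = f64::INFINITY;
    for i in 0..20 {
        min = min.min(run());
        // Early exit: hopeless candidates don't deserve all 20 runs.
        if i >= 2 && min > 3.0 * best {
            return None;
        }
    }
    // Near-best candidates get extra evaluations for accuracy.
    if min < 1.03 * best {
        for _ in 0..20 {
            min = min.min(run());
        }
    }
    Some(min)
}
```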

Create a C API for Telamon

In order to link with other frameworks, we need to export a C API. This API should allow:

  • creating kernels,
  • specifying actions,
  • creating a context, and
  • configuring and launching a search.

The goal is not yet to expose the full domain, as it is not stable enough.

Context creation will be specialized per device. We might also need to specialize some functions operating on contexts, as they are parametrized in Rust.

For kernel creation, we can either expose the builder or directly the IR. Before deciding, we need to analyse the needs of potential users.

Action creation should be handled directly by Telamon-gen.
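A sketch of what the thin layer could look like (all names are hypothetical; the real API surface is exactly what this issue has to decide). cbindgen would generate the matching C header from declarations of this shape:

```rust
// Hypothetical C API surface; opaque handles hide the Rust types from C.

pub struct Kernel {
    actions: Vec<u32>,
}

#[no_mangle]
pub extern "C" fn telamon_kernel_new() -> *mut Kernel {
    Box::into_raw(Box::new(Kernel { actions: Vec::new() }))
}

#[no_mangle]
pub extern "C" fn telamon_kernel_add_action(kernel: *mut Kernel, action: u32) {
    // SAFETY: the caller must pass a pointer obtained from `telamon_kernel_new`.
    unsafe { (*kernel).actions.push(action) };
}

#[no_mangle]
pub extern "C" fn telamon_kernel_free(kernel: *mut Kernel) {
    // SAFETY: takes back ownership and drops the kernel.
    unsafe { drop(Box::from_raw(kernel)) };
}
```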

Dump an event log

In order to better devise new ways of exploring the search space, it is useful to have access to statistics on what happens during the search. For instance, we may want to collect statistics on how discriminating some choices were. Currently we have to compute those statistics online while the search is running, but it would be useful to dump an event log from which we could replay the whole search procedure and compute any statistics we didn't think of at the time of the run.

Replaying the search should be significantly faster than re-running it, especially since we would neither run on the accelerator device nor perform the (expensive) propagation steps during the monte-carlo descents. It would also let us gather statistics after the fact that we didn't compute during the run (maybe because we didn't know we wanted them at the time). The proposed implementation is to dump that information into a protobuf stream. One issue is that we need to generate .protos from telamon-gen, and we need a marker to ensure that we only load the protos if the search space has not changed.
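The replay side could be as simple as folding over the stream. The `Event` variants below are invented for illustration; the issue proposes protobuf for the actual encoding:

```rust
/// Illustrative event type; a real log would be a protobuf stream
/// generated from telamon-gen's description of the search space.
#[derive(Debug)]
enum Event {
    Choice { choice_id: u32, value: u32 },
    Rollback,
    Evaluation { time_ns: u64 },
}

/// A statistic computed after the fact, without re-running the search:
/// how many decisions were taken per evaluation.
fn choices_per_evaluation(log: &[Event]) -> f64 {
    let choices = log.iter().filter(|e| matches!(e, Event::Choice { .. })).count();
    let evals = log.iter().filter(|e| matches!(e, Event::Evaluation { .. })).count();
    choices as f64 / evals.max(1) as f64
}
```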

Building with mppa feature is broken

I fixed some issues in #171 but others remain. I realized this while trying to ensure the CI tests as much of the code as possible, including code behind feature gates.

Expose values and variables in the IR

Data flow is currently implicit in our representation. This causes some problems:

  • we can't measure the register pressure
  • it is hard to measure the size of memory blocks. This leads to an O(n^3) search space representation.
  • we don't know when to introduce broadcasts
  • reductions are hard to manipulate and we cannot represent parallel reductions
  • we do not know when to privatise shared memory

Below are the modifications necessary to implement the new value system:

  • Allow accessing the value produced by an instruction with a Value object. This lays the groundwork for the rest of the modifications. #63
    - Define a Value struct, along with accessors.
    - Allow storing the result of an instruction in a value
  • Expose values in code generation
    - Expose a specified value struct
    - Name values
  • Expose values in the .exh file
  • Add a new operand kind to read a value
    - Expose the points where a value is created and used
    - Enforce ordering constraints from the exh.
  • Make the performance model value-aware
    - Take dependencies between values uses and defs into account.
  • Handle the last operator
  • Handle dimensions mapping
  • Handle the fby operator
  • Update the builder to use values and remove inst and reduce operands.
  • Handle Lowering
  • Remove the old operand system.
    • remove instruction type

Find a more efficient representation for `NumericSet`.

NumericSet allocates arrays of 16 u32 on the stack. Instead we could either:

  • only keep a reference to the universe and tick available possibilities in a bitset. The universe should, in this case, be stored in the signature.
  • pass the universe as an argument each time we perform a computation.
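A sketch of the first option (illustrative, not the real NumericSet API): the universe is stored once, and each set is a 16-bit mask over it:

```rust
/// Each set is a bitmask over a shared universe of at most 16 values,
/// which would live in the signature instead of in every set.
#[derive(Clone, Copy, PartialEq)]
struct NumericSet(u16);

impl NumericSet {
    /// The full set over a non-empty universe of at most 16 values.
    fn all(universe: &[u32]) -> Self {
        assert!(!universe.is_empty() && universe.len() <= 16);
        NumericSet(((1u32 << universe.len()) - 1) as u16)
    }

    /// Keep only the values satisfying `keep`.
    fn restrict(&mut self, universe: &[u32], keep: impl Fn(u32) -> bool) {
        for (i, &v) in universe.iter().enumerate() {
            if !keep(v) {
                self.0 &= !(1 << i);
            }
        }
    }

    /// Enumerate the remaining values.
    fn values(self, universe: &[u32]) -> Vec<u32> {
        universe
            .iter()
            .enumerate()
            .filter(|&(i, _)| self.0 & (1 << i) != 0)
            .map(|(_, &v)| v)
            .collect()
    }
}
```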

Add an option to change the order of choices

Currently, the order of choices is not customizable; it is just provided by the API. For the sake of experimenting, we would like the possibility to alter this order, as we want to highlight the fact that this change can have a significant impact on the exploration time.
In order to do so, we have to make the following modifications:

  • Modify telamon-gen to expose a list of all possible choices (we currently have a list of all actions, which are just instantiated choices - choices with a value).
  • Modify config.rs to expose this possibility to the user.
  • Maybe implement a shuffle, or anything custom, for the order of the choices we want to make.

Also, we want to allow a special kind of exploration that would not necessarily go down to the leaves of the tree, but rather explore, for example, the first 4 or 5 levels exhaustively. This implies that we would not run anything on the device. These benchmarks could be launched with cargo bench.
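The bounded exploration could be a plain depth-limited enumeration; in this sketch, the `children` expansion function is a hypothetical stand-in for applying the next choice:

```rust
/// Exhaustively enumerate every prefix of `max_depth` decisions,
/// without ever evaluating anything on the device.
fn explore(
    prefix: &mut Vec<u32>,
    max_depth: usize,
    children: &impl Fn(&[u32]) -> Vec<u32>,
    visit: &mut impl FnMut(&[u32]),
) {
    if prefix.len() == max_depth {
        visit(prefix);
        return;
    }
    for c in children(prefix) {
        prefix.push(c);
        explore(prefix, max_depth, children, visit);
        prefix.pop();
    }
}
```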

Separate common config options in `CommonConfig`

Right now we have a bunch of options in Config which make sense for all search algorithms, but which we don't pass to them in favour of the more specific BanditConfig etc.

This somewhat simplifies things because we can't end up with a wrong config (e.g. a BoundOrder config when we are in bandit code), but it is also annoying because we don't have access to those common config options.

A better way of doing this would be to have a CommonConfig struct with the common config options which we can pass along, and #[serde(flatten)] it into the Config struct to keep the existing fields.

Remove Cargo.lock from the index

Cargo.lock has been added to the index so we can pin the rustlex dependencies. This forces us to compile with rustc version nightly-2018-04-02. Once #4 is solved, we can remove Cargo.lock and compile with any rustc version.

Allow thread dimensions with a size other than a power of 2.

This limitation is arbitrary and prevents us from optimizing some cases that are not well supported by existing libraries.

  • remove the ir constraints.
  • update kernels so we adapt tiling to the divisors
  • ensure ir::device::cuda::mem_model still works

Use a typed template framework

We currently use handlebars for templating the output. This makes things hard to debug since it is not statically typed. Instead we should use quote!.

Improve caching in travis

Travis currently re-compiles many dependencies at each build. My guess is that part of the cache is overwritten by parallel jobs. We need to find out how the cache is scoped per job and make use of that.

Discard the Rustlex dependency

Rustlex uses libsyntax and thus cannot run on stable. It must be ported each time the compiler internals change. The available solutions are:

  • to use the Lalrpop tokenizer, but it is limited and cannot count lines for error reporting
  • to use flex and generate a C tokenizer
  • to use a parser combinator, such as nom, to create tokens.

Use structopt

As noted in #178 we should use structopt for argument parsing, as that provides a much more user-friendly interface than getopt. There is currently some getopt argument parsing in the explorer configuration which should be converted to structopt. In particular, explorer::config::Config should #[derive(StructOpt)].

Report errors during Telamon-gen parsing

Currently, Telamon-gen generates a backtrace when it encounters a user error. This makes it hard to use. It should report errors with line and column numbers instead.

Remaining actions:

  • Separate declaration and definition
  • Sets type checking and declaration
  • Generic choices type checking
  • Enum type checking and declaration
  • Integer type checking and declaration
  • NumericSet type checking and declaration
  • Counter type checking and declaration
  • Constraint type checking and declaration
  • Trigger type checking and declaration
  • Quotient sets type checking and declaration

Update README.md

The readme should be updated to reflect recent developments.

  • move the code map to CONTRIBUTING.md
  • move contribution guidelines to CONTRIBUTING.md
  • explain how to run kernel tests (mention the --release flag for correctness tests; otherwise they are too slow).

Tracking issue for types/objects/concepts that should be renamed

Comment to propose types, objects or concepts that should be renamed, or to propose new names for the names listed below.

In Telamon:

  • BasicBlock => Statement
  • MemBlock => MemoryRegion

In Telamon-gen

  • Action could be renamed to Decision
  • counter value => incr_amount
  • ir::CounterKind => ir::CounterOperator

Clean up Function architecture (frozen functions etc.)

We want to distinguish functions being created from functions which are fully created ("frozen"). It should not be possible to add instructions or other objects to frozen functions. This is because we want to have a complete view of all possible choices (including choices induced by e.g. lowering) even before starting the exploration.

Currently this is implemented with a somewhat ugly hack using a "Lowering" type, which is unit for non-frozen functions and contains lowering information (a mapping from initial dimensions to lowered dimensions) for frozen functions. This was done to minimize code churn relative to the previous code (which handled only non-frozen functions), but it is very counterintuitive.

We should refactor the Function / Instruction / etc. architecture (possibly using a common trait and specialized sub-traits, or by somehow wrapping things with lowering information) in order to make the distinctions as well as the common elements more explicit.
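One possible refactoring direction (a typestate sketch with illustrative names, not a proposal for the exact design): parameterize Function by a marker type so that frozen functions statically reject mutation instead of carrying a "Lowering" flag:

```rust
use std::marker::PhantomData;

// Marker types for the two states of a function.
struct Building;
struct Frozen;

struct Function<State> {
    insts: Vec<String>,
    _state: PhantomData<State>,
}

impl Function<Building> {
    fn new() -> Self {
        Function { insts: Vec::new(), _state: PhantomData }
    }

    /// Only functions under construction accept new instructions.
    fn add_inst(&mut self, inst: &str) {
        self.insts.push(inst.to_string());
    }

    fn freeze(self) -> Function<Frozen> {
        Function { insts: self.insts, _state: PhantomData }
    }
}

impl Function<Frozen> {
    // No `add_inst` here: frozen functions are read-only by construction.
    fn num_insts(&self) -> usize {
        self.insts.len()
    }
}
```

Lowering information could then live only in `Function<Frozen>`, making the distinction explicit in the types rather than hidden in a unit-or-data field.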

Stabilize measurements on the GPU when the CPU is limiting

When the CPU is limiting and the GPU underused, measurements are inaccurate: #84 shows a difference of up to a factor of 2 between the time measured when the CPU is limiting and when the GPU is limiting.

Ideally, the fix should not impact performance: more evaluations are needed only when the CPU is limiting. Below are several ideas for obtaining stable measurements:

  • launch a dummy kernel when the GPU is idle to keep it busy
  • find another approach to pin the GPU clock
    • not available on ficus
  • use events to detect when the measured and actual time do not match
    • some incorrect times may go undetected if the two are too close
  • launch each kernel multiple times.
    • exit early if the performance is too bad

Add strip-mining decisions

Tracking issue for strip-mining decisions.

  • express sizes as decisions
  • expose logical dimensions
  • simplify size creation with dedicated methods for const and fixed size
  • limit the size of inner BLOCK dimensions
  • update the model to depend on bounds/lcm/gcd
  • change ir::size so it references the dimension size
  • expose mapped dimensions
  • allow dimension size to vary

Block dimensions are not checked against CUDA maximal block dim size.

CUDA limits the size of block dimensions. For x, the maximal size is 2^31-1, but for y and z it is 2^16-1. Thus we can easily hit the limit on the y and z axes.

The problem is that we do not statically know the size of potential block dimensions. Thus, we will need to find a way to express the maximal size they can take.
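The check itself is trivial once we can bound the sizes; as noted, the hard part is expressing those bounds. A sketch of the per-axis check, using the limits from this issue:

```rust
/// CUDA per-axis limits on block dimensions as stated in this issue:
/// x can go up to 2^31 - 1, y and z only up to 2^16 - 1.
const MAX_BLOCK_DIMS: [u64; 3] = [(1 << 31) - 1, (1 << 16) - 1, (1 << 16) - 1];

/// Check an upper bound on each block dimension against the limits.
fn block_dims_fit(upper_bounds: [u64; 3]) -> bool {
    upper_bounds
        .iter()
        .zip(&MAX_BLOCK_DIMS)
        .all(|(dim, max)| dim <= max)
}
```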

Handle numeric decisions

To support tiling, we need to expose decisions over a finite set of numeric values provided by the user at runtime. These decisions should be handled like enums, but with different constraints.

Stall in test with new_node_order = api

When using the new_node_order = api option in the config, the following tests stall (this happens to be all tests starting from max_thread_on_setkind):

max_thread_on_setkind
max_thread_on_addinst
block_dims
temporary_memory_gen_simple
nested_thread_dims
unroll_dims
inst_dim_order
Two_thread_dim_map
reduce_dim_invariants
vector_dims

Allow executors to fail

Currently the available executors are defined based on feature flags and the complete function/type definitions are erased when the feature is disabled.

It would be useful to keep a thin API layer for interoperability even when the feature is disabled, and to make it so that the API can fail. This would allow:

  • Writing generic code which dynamically checks for a feature (e.g. using cfg!(feature) instead of #[cfg(feature)] and #[cfg(not(feature))] pairs).
  • As a second step, allowing features to be compiled in and fail gracefully (i.e. without panicking) if no hardware accelerator is found on the current machine.
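The first point could look like this sketch (the `cuda` feature name and the error type are illustrative):

```rust
/// Always present, whatever the enabled features: callers get a
/// `Result` instead of a symbol that vanishes when the feature is off.
fn try_cuda_executor() -> Result<(), String> {
    if cfg!(feature = "cuda") {
        // The real constructor would probe for a device here, and could
        // still fail gracefully if no accelerator is found.
        Ok(())
    } else {
        Err("telamon was built without the `cuda` feature".to_string())
    }
}
```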
