ulysseb / telamon Goto Github PK
A framework to find good combinations of optimizations for computational kernels on GPUs.
Home Page: https://ulysseb.github.io/telamon/telamon
License: Apache License 2.0
Vector dimensions must be declared in the access pattern of every memory instruction nested inside them. This forbids vector dimensions from being merged with other dimensions, as the merged dimensions would not be declared in the access pattern.
This might not be a problem, as vector dimensions cannot contain more than one instruction anyway. But if that is the case, we should make it explicit.
model::Size::bound triggers an overflow when the numerator and denominator are too big. The obvious solution is to use big integers. However, we might be able to do better by carefully ordering computations.
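One careful-ordering option is to cancel common factors before multiplying. Below is a minimal sketch, assuming the bound computation reduces to an exact value * numerator / denominator scaling; the function names are hypothetical, not Telamon's actual API:

```rust
// Hypothetical sketch, not Telamon's actual code.
fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd(b, a % b) }
}

// Computes value * num / denom while keeping intermediates small by
// cancelling common factors before multiplying. Assumes the result is
// exact (denom divides value * num), so after the two reductions below
// denom is necessarily 1.
fn scale_exact(value: u64, num: u64, denom: u64) -> u64 {
    let g = gcd(value, denom);
    let (value, denom) = (value / g, denom / g);
    let g = gcd(num, denom);
    let (num, denom) = (num / g, denom / g);
    debug_assert_eq!(denom, 1);
    value * num
}
```

For inexact bounds the same idea applies, but a rounding direction has to be chosen explicitly; big integers remain the fallback when the reduced factors still overflow.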
TelamonGen does not enforce the constraints created when adding a new object to a set if the constraint does not apply to the new object.
Exh files are currently very hard to read. In particular, implications A => B are generally encoded as B || not A, even though this is not stated anywhere explicitly. This can be confusing, especially when there is more than one term on the left-hand side of the implication. In addition, it can be confused with a normal disjunction. We could add significant clarity just by adding an implication syntax that would allow writing A => B directly.
Possible syntaxes:
- A => B (mathematical)
- B :- A (Prolog)
- others?
If a dependency is carried by loops d0 and d1, it is accounted for d0*d1-1 times if d0 is not scheduled with d1, but only d0*(d1-1) times if d0 is scheduled outside of d1.
This is a tracking issue for various possible improvements to Telamon's statistical exploration procedures. Ideas include:
1: TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
We want to have Python bindings to Telamon in order to ease experimentation with statistical exploration approaches and integration with experiment reporting platforms such as mlflow.
The proposed approach is to use cbindgen to create a thin C API, then bundle it through milksnake to create a Python package. The initial goal is to expose a search function to drive Telamon's search from Python, as well as a couple of helpers to create kernels. Additional capabilities will be added as needed.
We currently run each candidate 20 times on the GPU if its execution time is within 3x of the current best candidate. Better accuracy and performance could be achieved by exiting after a few evaluations if the performance is too bad, and by evaluating more if the performance is close to the best candidate (e.g. within 3%).
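An adaptive policy could look like the following sketch; the thresholds and counts (3x cut-off, 3% proximity, 20 vs 40 runs) are illustrative, not a settled design:

```rust
// Hypothetical sketch of an adaptive evaluation-count policy.
// `runtime` is the candidate's first measured time, `best` the current
// best candidate's time; both in the same unit.
fn evaluations_needed(runtime: f64, best: f64) -> usize {
    if runtime > 3.0 * best {
        1 // clearly worse: a single run is enough to discard it
    } else if runtime < 1.03 * best {
        40 // within 3% of the best: measure it precisely
    } else {
        20 // default number of runs
    }
}
```

In practice the decision would be re-evaluated after each run, so a candidate that degrades mid-way can still be abandoned early.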
In order to link with other frameworks we need to export a C API. This API should allow:
Context creation will be specialized per device. We might also need to specialize some functions operating on contexts, as they are parameterized in Rust.
For kernel creation, we can either expose the builder or directly the IR. Before deciding this, we need to analyse the needs of potential users.
Actions creation should be handled directly by Telamon-gen.
In order to better devise new ways of exploring the search space, it is useful to have access to statistics on what happens during the search. For instance, we may want to collect statistics on how discriminating some choices were. Currently we must compute those statistics online while the search is running, but it would be useful to dump an event log from which we could replay the whole search procedure and compute any new statistics we didn't compute at the time of the run.
Replaying the search should be significantly faster than re-running it, especially since we would not run on the accelerator device nor perform the (expensive) propagation steps during the Monte Carlo descents. It would also allow gathering statistics after the fact that we didn't compute during the run (perhaps because we didn't know we wanted them at the time). The proposed implementation is to dump that information into a protobuf stream. One issue is that we need to generate .protos from telamon-gen, and we need a marker to ensure that we can only load the protos if the search space has not changed.
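To illustrate the shape such a log could take, here is a sketch using a plain Rust enum; the issue proposes a protobuf stream, and every variant and field name here is hypothetical:

```rust
// Hypothetical event-log sketch; the real proposal serializes to protobuf.
#[derive(Debug, Clone)]
enum Event {
    NodeExpanded { node_id: u64, num_children: usize },
    ActionApplied { node_id: u64, action: String },
    Evaluated { node_id: u64, runtime_ns: u64 },
}

// Replaying is a pure pass over the log: no accelerator runs and no
// propagation steps, so computing a new statistic after the fact is cheap.
fn best_runtime(log: &[Event]) -> Option<u64> {
    log.iter()
        .filter_map(|e| match e {
            Event::Evaluated { runtime_ns, .. } => Some(*runtime_ns),
            _ => None,
        })
        .min()
}
```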
I have fixed some issues in #171 but others remain. I realized this while trying to ensure that the CI tests as much of the code as possible, including code behind feature gates.
Data flow is currently implicit in our representation. This causes some problems:
Below are the modifications necessary to implement the new value system:
- Value object. This lays out the groundwork for the rest of the modifications. (#63)
- Value struct, along with accessors.

We assume loads and stores use the same bottlenecks on CUDA devices. Axpy shows this is not the case.
NumericSet allocates arrays of 16 u32 on the stack. Instead we could either:
Currently, the order of choices is not customizable; it is just provided by the API. For the sake of experimenting, we would like the possibility to alter this order, as we want to highlight the fact that this change can have a significant impact on the exploration time.
In order to do so, we have to make the following modifications:
Also, we want to allow a special kind of exploration that would not necessarily go to the leaves of the tree, but rather explore, for example, the first 4 or 5 levels exhaustively. This implies that we would not run anything on the device. These benchmarks could be launched with cargo bench.
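A depth-limited exhaustive walk could be sketched as follows; `children_at` is a hypothetical stand-in for the real candidate-expansion step:

```rust
// Hypothetical sketch: exhaustively visit only the first `max_depth` levels
// of the search tree, counting the nodes reached at that depth, without
// evaluating anything on the device.
fn count_at_depth(depth: usize, max_depth: usize, children_at: &impl Fn(usize) -> usize) -> usize {
    if depth == max_depth {
        return 1; // stop here instead of descending to a leaf
    }
    (0..children_at(depth))
        .map(|_| count_at_depth(depth + 1, max_depth, children_at))
        .sum()
}
```

Since no kernel is ever executed, such a walk measures only the propagation cost of the generated code, which is exactly what a cargo bench harness would want to isolate.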
Right now we have a bunch of options in Config which make sense for all search algorithms, but which we don't pass to them in favour of the more specific BanditConfig etc.
This somewhat simplifies things because we can't end up with a wrong config (e.g. a BoundOrder config when we are in bandit code), but it is also annoying because we don't have access to those common config options.
A better way of doing this would be to have a CommonConfig struct with the common config options which we can pass along, and #[serde(flatten)] it into the Config struct to keep the existing fields.
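The shape could be roughly as follows; this is a sketch depending on the serde crate, and the field names are illustrative, not Telamon's actual options:

```rust
use serde::Deserialize;

// Hypothetical field names, for illustration only.
#[derive(Deserialize)]
struct CommonConfig {
    output_dir: String,
    timeout_secs: Option<u64>,
}

#[derive(Deserialize)]
struct Config {
    // `flatten` keeps the existing flat field layout in the configuration
    // file, while letting us pass `common` along to every algorithm.
    #[serde(flatten)]
    common: CommonConfig,
    algorithm: String,
}
```

With this layout, existing configuration files keep working unchanged, and each algorithm-specific config only receives a &CommonConfig alongside its own options.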
This method might be useful to remove some decisions we tried from the initial search space.
Cargo.lock has been added to the index so we can pin the rustlex dependencies. This forces us to compile with rustc version nightly-2018-04-02. Once #4 is solved, we can remove Cargo.lock to compile with any rustc version.
This limitation is arbitrary and forbids us from optimizing some cases that are not well supported in existing libraries.
ir::device::cuda::mem_model still works.

We currently use handlebars for templating the output. This makes things hard to debug since it is not statically typed. Instead we should use quote!.
Travis currently recompiles many dependencies at each build. My guess is that part of the cache is overwritten by parallel jobs. We need to find out how the cache is privatized and use it.
Rustlex uses libsyntax and thus cannot run on stable. It must be ported each time the compiler internals change. The available solutions are:
For some kernels, the code generated by telamon-gen is limiting the speed of the exploration. We need to precisely assess the performance of telamon-gen, both for Telamon and for dedicated benchmarks.
As noted in #178, we should use structopt for argument parsing, as it provides a much more user-friendly interface than getopt. There is currently some getopt argument parsing in the explorer configuration which should be converted to structopt. In particular, explorer::config::Config should #[derive(StructOpt)].
Switch back to the crates.io version of binary_heap_plus after sekineh/binary-heap-plus-rs#1 is merged and published.
Currently, Telamon-gen generates a backtrace when it encounters a user error. This makes it hard to use. It should report errors with line and column numbers instead.
Remaining actions:
The readme should be updated to reflect recent developments.
Comment to propose types, objects or concepts that should be renamed, or to propose new names for the names listed below.
In Telamon:
- BasicBlock => Statement
- MemBlock => MemoryRegion

In Telamon-gen:
- Action could be renamed to Decision
- value => incr_amount
- ir::CounterKind => ir::CounterOperator
We want to distinguish functions being created from functions which are fully created ("frozen"). It should not be possible to add instructions or other objects to frozen functions. This is because we want to have a complete view of all possible choices (including choices induced by e.g. lowering) even before starting the exploration.
Currently this is implemented with a somewhat ugly hack using a "Lowering" type, which is unit for non-frozen functions and contains lowering information (a mapping from initial dimensions to lowered dimensions) for frozen functions. This was done to minimize the code churn relative to the previous code (which handled only non-frozen functions), but it is very counterintuitive.
We should refactor the Function / Instruction / etc. architecture (possibly using a common trait and specialized sub-traits, or by somehow wrapping things with lowering information) in order to make the distinctions as well as the common elements more explicit.
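One possible direction is the typestate pattern, where a type parameter distinguishes the two phases so that mutation is only available while building. A minimal sketch with hypothetical names:

```rust
// Hypothetical typestate sketch, not Telamon's actual types.
struct Building;
struct Frozen; // would carry the dimension-lowering map in practice

struct Function<State> {
    insts: Vec<String>,
    state: State,
}

impl Function<Building> {
    fn new() -> Self {
        Function { insts: Vec::new(), state: Building }
    }

    // Adding instructions is only available while building.
    fn add_inst(&mut self, inst: &str) {
        self.insts.push(inst.to_string());
    }

    // Freezing consumes the builder; a frozen function can no longer grow.
    fn freeze(self) -> Function<Frozen> {
        Function { insts: self.insts, state: Frozen }
    }
}

impl Function<Frozen> {
    fn num_insts(&self) -> usize {
        self.insts.len()
    }
}
```

Code shared by both phases would live in a blanket impl over all states (or in a common trait), keeping the shared elements explicit rather than hidden behind the current unit-type trick.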
When the CPU is limiting and the GPU underused, measurements are inaccurate: #84 shows a difference that can reach a factor of 2 between the time measured when the CPU is limiting and when the GPU is limiting.
Ideally, the fix should not impact performance: more evaluations are needed only in the case where the CPU was limiting. Below are multiple ideas to obtain stable measurements:
Tracking issue for strip-mining decisions.
CUDA limits the size of block dimensions. For x, the maximal size is 2^31-1, but for y and z it is 2^16-1. Thus we can easily hit the limit on the y and z axes.
The problem is that we do not statically know the size of potential block dimensions. Thus, we will need to find a way to express the maximal size they can take.
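For reference, a check against the limits quoted above could look like this sketch; the helper name is hypothetical:

```rust
// Hypothetical helper encoding the limits quoted in this issue:
// x up to 2^31 - 1, y and z up to 2^16 - 1.
fn fits_block_dim_limits(x: u64, y: u64, z: u64) -> bool {
    let max_x = (1u64 << 31) - 1;
    let max_yz = (1u64 << 16) - 1;
    x <= max_x && y <= max_yz && z <= max_yz
}
```

Since dimension sizes are only known at runtime, such a predicate would have to be expressed as a constraint on the symbolic sizes rather than evaluated on concrete values.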
Telamon-gen currently fails if it must generate an or operator between two ranges. We should either make it work or make sure this behavior does not happen.
Shared memory should be privatized per thread when it is not repeated across all thread dimensions.
To support tiling, we need to expose decisions over a finite set of numeric values, provided by the user at runtime. These decisions should be handled like enums, but with different constraints.
When using the new_node_order = api option in the config, the following tests stall - this happens to be all tests starting from max_thread_on_setkind:
max_thread_on_setkind
max_thread_on_addinst
block_dims
temporary_memory_gen_simple
nested_thread_dims
unroll_dims
inst_dim_order
Two_thread_dim_map
reduce_dim_invariants
vector_dims
Currently the available executors are defined based on feature flags and the complete function/type definitions are erased when the feature is disabled.
It would be useful to keep a thin API layer for interoperability even when the feature is disabled, and to make it so that the API can fail. This would allow:
- using cfg!(feature) instead of #[cfg(feature)] and #[cfg(not(feature))] pairs.
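A sketch of such a thin, always-present API; the feature name "cuda" and the type name are illustrative, not Telamon's actual identifiers:

```rust
// Hypothetical sketch: the executor type stays visible even when the
// backend feature is disabled; construction fails at runtime instead of
// the type being erased at compile time.
pub struct CudaExecutor;

impl CudaExecutor {
    pub fn try_new() -> Result<Self, String> {
        if cfg!(feature = "cuda") {
            // The real implementation would initialize the device here.
            Ok(CudaExecutor)
        } else {
            Err("Telamon was compiled without the `cuda` feature".to_string())
        }
    }
}
```

Downstream crates can then link against a stable API surface and handle the error path, instead of sprouting their own #[cfg] pairs.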