
ulysseb / telamon


A framework to find good combinations of optimizations for computational kernels on GPUs.

Home Page: https://ulysseb.github.io/telamon/telamon

License: Apache License 2.0

Rust 92.05% C 6.34% Vim Script 0.04% Python 0.38% Lex 0.25% C++ 0.06% Shell 0.71% Emacs Lisp 0.15% Makefile 0.01%
compilation cuda gpu performance-modeling rust

telamon's People

Contributors

adjivas, andidr, bors[bot], elarnon, nicotolly, samkaufman, superfluffy, ulysseb


telamon's Issues

Vector dimensions cannot be MERGED

Vector dimensions must be declared in the access pattern of every memory instruction nested inside them. This forbids merging vector dimensions with other dimensions, since the merged dimension would not be declared in the access pattern.

This might not be a problem, as vector dimensions cannot contain more than one instruction anyway. But if it is a problem, we should make the restriction explicit.

model::Size evaluation overflows in some cases

model::Size::bound triggers an overflow when the numerator and denominator are too big. The obvious solution is to use big integers. However, we might be able to do better by carefully ordering the computations.
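As an illustration of the reordering idea (a sketch with hypothetical function names, not model::Size's actual code), dividing out the GCD before multiplying keeps intermediates small enough to fit in a machine word:

```rust
// Sketch only: `bound_naive`/`bound_reduced` are stand-ins for the kind
// of numerator/denominator computation `model::Size::bound` performs.

fn gcd(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let t = a % b;
        a = b;
        b = t;
    }
    a
}

/// Naive ordering: the intermediate product can overflow u64.
fn bound_naive(num: u64, denom: u64, scale: u64) -> Option<u64> {
    num.checked_mul(scale).map(|p| p / denom)
}

/// Careful ordering: reduce by the GCD first so the product stays small.
/// (This can still overflow in pathological cases; big integers remain
/// the fallback.)
fn bound_reduced(num: u64, denom: u64, scale: u64) -> u64 {
    let g = gcd(num, denom);
    let (num, denom) = (num / g, denom / g);
    let g = gcd(scale, denom);
    (num * (scale / g)) / (denom / g)
}
```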

Add an explicit implication in exh syntax

Exh files are currently very hard to read. In particular, implications A => B are generally encoded as B || not A, even though this is not stated explicitly anywhere. This can be confusing, especially when there is more than one term on the left-hand side of the implication. It can also be mistaken for an ordinary disjunction. We could add significant clarity just by adding an implication syntax that allows writing A => B directly.
Possible syntaxes:
A => B (mathematical notation)
B :- A (Prolog)
Others?
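For reference, the encoding in question is just material implication; a two-line Rust check of the equivalence:

```rust
/// Material implication: `a => b` is exactly `b || !a`.
fn implies(a: bool, b: bool) -> bool {
    b || !a
}
```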

☂️ Improvements to statistical exploration

This is a tracking issue for various possible improvements to Telamon's statistical exploration procedures. Ideas include:

  • Some approaches (e.g. [1]) propose to learn a performance model instead of specifying one manually. One of Telamon's strengths here is also a weakness: the need to predict performance for partially-specified candidates makes it harder to learn a meaningful cost function, and we need the performance model to keep predicting an actual lower bound on the performance. However, we also use the performance model to guide the exploration in the monte-carlo phase, and we could use a ranking-based model (as in [1]) to predict, for instance, the most interesting candidate to explore during a monte-carlo descent directly from features extracted from the performance model, instead of using only the bound.
  • Experiment with alternatives to the monte-carlo search. This includes experimenting with different hyper-parameters, different strategies for the multi-armed bandit problem, and completely different approaches to the exploration, which could be based on genetic algorithms or simulated annealing. We need to think more about how those would work with partial candidates :)
  • Enable search across different choices. This would probably require an efficient way to compare candidates (since the search tree would become a search DAG and we want to merge identical candidates). One possibility is to perform the search across all possible decisions instead of limiting ourselves to a single choice, but that incurs a high computational cost. We could also try to dynamically predict good choices to make, or perform meta-analysis across different kernels to improve the existing static choice order.

1: TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
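On the third point, merging identical candidates needs a cheap equality or hash on candidates. A minimal sketch (the candidate representation here is hypothetical; Telamon candidates are not just decision lists):

```rust
use std::collections::HashSet;

/// Deduplicate candidates reached through different decision orders,
/// using an order-insensitive key over the decisions made so far.
fn dedup_candidates(candidates: Vec<Vec<u32>>) -> Vec<Vec<u32>> {
    let mut seen = HashSet::new();
    candidates
        .into_iter()
        .filter(|c| {
            let mut key = c.clone();
            // Two prefixes with the same decisions in a different order
            // denote the same candidate, so sort before hashing.
            key.sort_unstable();
            seen.insert(key)
        })
        .collect()
}
```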

Python bindings

We want Python bindings to Telamon in order to ease experimentation with statistical exploration approaches and integration with experiment reporting platforms such as mlflow.

The proposed approach is to use cbindgen to create a thin C API, then bundle it through milksnake to create a Python package. The initial goal is to expose a search function to drive Telamon's search from Python, as well as a couple of helpers to create kernels. Additional capabilities will be added as needed.

Improve the accuracy and cost of evaluations on the GPU

We currently run each candidate 20 times on the GPU if its execution time is within 3x of the current best candidate. Better accuracy and performance could be achieved by exiting after a few evaluations if the performance is too bad, and evaluating more if the performance is close to the best candidate (e.g. within 3%).
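The proposed scheme could look like the following sketch (the thresholds and the `run` closure are illustrative, not Telamon's actual evaluator):

```rust
/// Evaluate a candidate with an adaptive number of runs: bail out early
/// when it is clearly worse than `best`, and spend extra runs only when
/// it is within ~3% of `best`. Returns `None` for discarded candidates.
fn evaluate(mut run: impl FnMut() -> f64, best: f64) -> Option<f64> {
    let mut min = f64::INFINITY;
    for i in 0..20 {
        min = min.min(run());
        // Early exit: hopeless candidates don't deserve all 20 runs.
        if i >= 2 && min > 3.0 * best {
            return None;
        }
    }
    // Near-best candidates get extra evaluations for accuracy.
    if min < 1.03 * best {
        for _ in 0..20 {
            min = min.min(run());
        }
    }
    Some(min)
}
```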

Create a C API for Telamon

In order to link with other frameworks, we need to export a C API. This API should allow:

  • creating kernels,
  • specifying actions,
  • creating a context, and
  • configuring and launching a search.

The goal is not yet to expose the full domain, as it is not stable enough.

Context creation will be specialized per device. We might also need to specialize some functions operating on contexts, as they are parametrized in Rust.

For kernel creation, we can either expose the builder or directly the IR. Before deciding, we need to analyse the needs of potential users.

Action creation should be handled directly by Telamon-gen.
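A sketch of what the thin layer could look like (all names are hypothetical; the real API surface is exactly what this issue has to decide). cbindgen would generate the matching C header from declarations of this shape:

```rust
// Hypothetical C API surface; opaque handles hide the Rust types from C.

pub struct Kernel {
    actions: Vec<u32>,
}

#[no_mangle]
pub extern "C" fn telamon_kernel_new() -> *mut Kernel {
    Box::into_raw(Box::new(Kernel { actions: Vec::new() }))
}

#[no_mangle]
pub extern "C" fn telamon_kernel_add_action(kernel: *mut Kernel, action: u32) {
    // SAFETY: the caller must pass a pointer obtained from `telamon_kernel_new`.
    unsafe { (*kernel).actions.push(action) };
}

#[no_mangle]
pub extern "C" fn telamon_kernel_free(kernel: *mut Kernel) {
    // SAFETY: takes back ownership and drops the kernel.
    unsafe { drop(Box::from_raw(kernel)) };
}
```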

Dump an event log

In order to better devise new ways of exploring the search space, it is useful to have access to statistics on what happens during the search. For instance, we may want to collect statistics on how discriminating some choices were. Currently we have to compute those statistics online while the search is running, but it would be useful to dump an event log from which we could replay the whole search procedure and compute any statistics we didn't think of at the time of the run.

Replaying the search should be significantly faster than re-running it, especially since we would neither run on the accelerator device nor perform the (expensive) propagation steps during the monte-carlo descents. It would also let us gather statistics after the fact that we didn't compute during the run (maybe because we didn't know we wanted them at the time). The proposed implementation is to dump that information into a protobuf stream. One issue is that we need to generate .protos from telamon-gen, and we need a marker to ensure that we only load the protos if the search space has not changed.
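The replay side could be as simple as folding over the stream. The `Event` variants below are invented for illustration; the issue proposes protobuf for the actual encoding:

```rust
/// Illustrative event type; a real log would be a protobuf stream
/// generated from telamon-gen's description of the search space.
#[derive(Debug)]
enum Event {
    Choice { choice_id: u32, value: u32 },
    Rollback,
    Evaluation { time_ns: u64 },
}

/// A statistic computed after the fact, without re-running the search:
/// how many decisions were taken per evaluation.
fn choices_per_evaluation(log: &[Event]) -> f64 {
    let choices = log.iter().filter(|e| matches!(e, Event::Choice { .. })).count();
    let evals = log.iter().filter(|e| matches!(e, Event::Evaluation { .. })).count();
    choices as f64 / evals.max(1) as f64
}
```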

Building with mppa feature is broken

I fixed some issues in #171 but others remain. I realized this while trying to ensure the CI tests as much of the code as possible, including code behind feature gates.

Expose values and variables in the IR

Data flow is currently implicit in our representation. This causes some problems:

  • we can't measure the register pressure
  • it is hard to measure the size of memory blocks. This leads to an O(n^3) search space representation.
  • we don't know when to introduce broadcasts
  • reductions are hard to manipulate and we cannot represent parallel reductions
  • we do not know when to privatise shared memory

Below are the modifications necessary to implement the new value system:

  • Allow accessing the value produced by an instruction with a Value object. This lays the groundwork for the rest of the modifications. #63
    - Define a Value struct, along with accessors.
    - Allow storing the result of an instruction in a value
  • Expose values in code generation
    - Expose a specified value struct
    - Name values
  • Expose values in the .exh file
  • Add a new operand kind to read a value
    - Expose the points where a value is created and used
    - Enforce ordering constraints from the exh.
  • Make the performance model value-aware
    - Take dependencies between values uses and defs into account.
  • Handle the last operator
  • Handle dimensions mapping
  • Handle the fby operator
  • Update the builder to use values and remove inst and reduce operands.
  • Handle Lowering
  • Remove the old operand system.
    • remove instruction type

Find a more efficient representation for `NumericSet`.

NumericSet allocates arrays of 16 u32 on the stack. Instead we could either:

  • only keep a reference to the universe and tick available possibilities in a bitset. The universe should, in this case, be stored in the signature.
  • pass the universe as an argument each time we perform a computation.
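A sketch of the first option (illustrative, not the real NumericSet API): the universe is stored once, and each set is a 16-bit mask over it:

```rust
/// Each set is a bitmask over a shared universe of at most 16 values,
/// which would live in the signature instead of in every set.
#[derive(Clone, Copy, PartialEq)]
struct NumericSet(u16);

impl NumericSet {
    /// The full set over a non-empty universe of at most 16 values.
    fn all(universe: &[u32]) -> Self {
        assert!(!universe.is_empty() && universe.len() <= 16);
        NumericSet(((1u32 << universe.len()) - 1) as u16)
    }

    /// Keep only the values satisfying `keep`.
    fn restrict(&mut self, universe: &[u32], keep: impl Fn(u32) -> bool) {
        for (i, &v) in universe.iter().enumerate() {
            if !keep(v) {
                self.0 &= !(1 << i);
            }
        }
    }

    /// Enumerate the remaining values.
    fn values(self, universe: &[u32]) -> Vec<u32> {
        universe
            .iter()
            .enumerate()
            .filter(|&(i, _)| self.0 & (1 << i) != 0)
            .map(|(_, &v)| v)
            .collect()
    }
}
```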

Add an option to change the order of choices

Currently, the order of choices is not customizable; it is just provided by the API. For the sake of experimenting, we would like the possibility to alter this order, as we want to highlight the fact that this change can have a significant impact on the exploration time.
In order to do so, we have to make the following modifications:

  • Modify telamon-gen to expose a list of all possible choices (we currently have a list of all actions, which are just instantiated choices - choices with a value).
  • Modify config.rs to expose this possibility to the user.
  • Maybe implement a shuffle, or anything custom, for the order of the choices we want to make.

Also, we want to allow a special kind of exploration that would not necessarily go down to the leaves of the tree, but rather explore, for example, the first 4 or 5 levels exhaustively. This implies that we would not run anything on the device. These benchmarks could be launched with cargo bench.
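The bounded exploration could be a plain depth-limited enumeration; in this sketch, the `children` expansion function is a hypothetical stand-in for applying the next choice:

```rust
/// Exhaustively enumerate every prefix of `max_depth` decisions,
/// without ever evaluating anything on the device.
fn explore(
    prefix: &mut Vec<u32>,
    max_depth: usize,
    children: &impl Fn(&[u32]) -> Vec<u32>,
    visit: &mut impl FnMut(&[u32]),
) {
    if prefix.len() == max_depth {
        visit(prefix);
        return;
    }
    for c in children(prefix) {
        prefix.push(c);
        explore(prefix, max_depth, children, visit);
        prefix.pop();
    }
}
```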

Separate common config options in `CommonConfig`

Right now we have a bunch of options in Config which make sense for all search algorithms, but which we don't pass to them in favour of the more specific BanditConfig etc.

This somewhat simplifies things because we can't end up with a wrong config (e.g. a BoundOrder config when we are in bandit code), but it is also annoying because we don't have access to those common config options.

A better way of doing this would be to have a CommonConfig struct with the common config options which we can pass along, and #[serde(flatten)] it into the Config struct to keep the existing fields.

Remove Cargo.lock from the index

Cargo.lock has been added to the index so we can pin the rustlex dependencies. This forces us to compile with rustc version nightly-2018-04-02. Once #4 is solved, we can remove Cargo.lock and compile with any rustc version.

Allow thread dimensions with a size other than a power of 2.

This limitation is arbitrary and prevents us from optimizing some cases that are not well supported by existing libraries.

  • remove the ir constraints.
  • update kernels so we adapt tiling to the divisors
  • ensure ir::device::cuda::mem_model still works

Use a typed template framework

We currently use handlebars for templating the output. This makes things hard to debug since it is not statically typed. Instead we should use quote!.

Improve caching in travis

Travis currently re-compiles many dependencies at each build. My guess is that part of the cache is overwritten by parallel jobs. We need to find out how the cache is scoped per job and make use of that.

Discard the Rustlex dependency

Rustlex uses libsyntax and thus cannot run on stable. It must be ported each time the compiler internals change. The available solutions are:

  • to use the Lalrpop tokenizer, but it is limited and cannot count lines for error reporting
  • to use flex and generate a C tokenizer
  • to use a parser combinator, such as nom, to create tokens.

Use structopt

As noted in #178 we should use structopt for argument parsing, as that provides a much more user-friendly interface than getopt. There is currently some getopt argument parsing in the explorer configuration which should be converted to structopt. In particular, explorer::config::Config should #[derive(StructOpt)].

Report errors during Telamon-gen parsing

Currently, Telamon-gen generates a backtrace when it encounters a user error. This makes it hard to use. It should report errors with line and column numbers instead.

Remaining actions:

  • Separate declaration and definition
  • Sets type checking and declaration
  • Generic choices type checking
  • Enum type checking and declaration
  • Integer type checking and declaration
  • NumericSet type checking and declaration
  • Counter type checking and declaration
  • Constraint type checking and declaration
  • Trigger type checking and declaration
  • Quotient sets type checking and declaration

Update README.md

The readme should be updated to reflect recent developments.

  • move the code map to CONTRIBUTING.md
  • move contribution guidelines to CONTRIBUTING.md
  • explain how to run kernel tests (mention the --release flag for correctness tests; otherwise they are too slow).

Tracking issue for types/objects/concepts that should be renamed

Comment to propose types, objects or concepts that should be renamed, or to propose new names for the names listed below.

In Telamon:

  • BasicBlock => Statement
  • MemBlock => MemoryRegion

In Telamon-gen

  • Action could be renamed to Decision
  • counter value => incr_amount
  • ir::CounterKind => ir::CounterOperator

Clean up Function architecture (frozen functions etc.)

We want to distinguish functions being created from functions which are fully created ("frozen"). It should not be possible to add instructions or other objects to frozen functions. This is because we want to have a complete view of all possible choices (including choices induced by e.g. lowering) even before starting the exploration.

Currently this is implemented with a somewhat ugly hack using a "Lowering" type, which is unit for non-frozen functions and contains lowering information (a mapping from initial dimensions to lowered dimensions) for frozen functions. This was done to minimize code churn relative to the previous code (which handled only non-frozen functions), but it is very counterintuitive.

We should refactor the Function / Instruction / etc. architecture (possibly using a common trait and specialized sub-traits, or by somehow wrapping things with lowering information) in order to make the distinctions as well as the common elements more explicit.
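One possible refactoring direction (a typestate sketch with illustrative names, not a proposal for the exact design): parameterize Function by a marker type so that frozen functions statically reject mutation instead of carrying a "Lowering" flag:

```rust
use std::marker::PhantomData;

// Marker types for the two states of a function.
struct Building;
struct Frozen;

struct Function<State> {
    insts: Vec<String>,
    _state: PhantomData<State>,
}

impl Function<Building> {
    fn new() -> Self {
        Function { insts: Vec::new(), _state: PhantomData }
    }

    /// Only functions under construction accept new instructions.
    fn add_inst(&mut self, inst: &str) {
        self.insts.push(inst.to_string());
    }

    fn freeze(self) -> Function<Frozen> {
        Function { insts: self.insts, _state: PhantomData }
    }
}

impl Function<Frozen> {
    // No `add_inst` here: frozen functions are read-only by construction.
    fn num_insts(&self) -> usize {
        self.insts.len()
    }
}
```

Lowering information could then live only in `Function<Frozen>`, making the distinction explicit in the types rather than hidden in a unit-or-data field.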

Stabilize measurements on the GPU when the CPU is limiting

When the CPU is limiting and the GPU underused, measurements are inaccurate: #84 shows a difference of up to a factor of 2 between the time measured when the CPU is limiting and when the GPU is limiting.

Ideally, the fix should not impact performance: more evaluations are needed only when the CPU is limiting. Below are several ideas for obtaining stable measurements:

  • launch a dummy kernel when the GPU is idle to keep it busy
  • find another approach to pin the GPU clock
    • not available on ficus
  • use events to detect when the measured and actual time do not match
    • some incorrect times may go undetected if the two are too close
  • launch each kernel multiple times.
    • exit early if the performance is too bad

Add strip-mining decisions

Tracking issue for strip-mining decisions.

  • express sizes as decisions
  • expose logical dimensions
  • simplify size creation with dedicated methods for const and fixed size
  • limit the size of inner BLOCK dimensions
  • update the model to depend on bounds/lcm/gcd
  • change ir::size so it references the dimension size
  • expose mapped dimensions
  • allow dimension size to vary

Block dimensions are not checked against CUDA maximal block dim size.

CUDA limits the size of block dimensions. For x, the maximal size is 2^31-1, but for y and z it is 2^16-1. Thus we can easily hit the limit on the y and z axes.

The problem is that we do not statically know the size of potential block dimensions. Thus, we will need to find a way to express the maximal size they can take.
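The check itself is trivial once we can bound the sizes; as noted, the hard part is expressing those bounds. A sketch of the per-axis check, using the limits from this issue:

```rust
/// CUDA per-axis limits on block dimensions as stated in this issue:
/// x can go up to 2^31 - 1, y and z only up to 2^16 - 1.
const MAX_BLOCK_DIMS: [u64; 3] = [(1 << 31) - 1, (1 << 16) - 1, (1 << 16) - 1];

/// Check an upper bound on each block dimension against the limits.
fn block_dims_fit(upper_bounds: [u64; 3]) -> bool {
    upper_bounds
        .iter()
        .zip(&MAX_BLOCK_DIMS)
        .all(|(dim, max)| dim <= max)
}
```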

Handle numeric decisions

To support tiling, we need to expose decisions over a finite set of numeric values provided by the user at runtime. These decisions should be handled like enums, but with different constraints.

Stall in test with new_node_order = api

When using the new_node_order = api option in the config, the following tests stall (this happens to be all tests starting from max_thread_on_setkind):

max_thread_on_setkind
max_thread_on_addinst
block_dims
temporary_memory_gen_simple
nested_thread_dims
unroll_dims
inst_dim_order
Two_thread_dim_map
reduce_dim_invariants
vector_dims

Allow executors to fail

Currently the available executors are defined based on feature flags and the complete function/type definitions are erased when the feature is disabled.

It would be useful to keep a thin API layer for interoperability even when the feature is disabled, and to make it so that the API can fail. This would allow:

  • Writing generic code which dynamically checks for a feature (e.g. using cfg!(feature) instead of #[cfg(feature)] and #[cfg(not(feature))] pairs).
  • As a second step, allowing features to be compiled in and fail gracefully (i.e. without panicking) if no hardware accelerator is found on the current machine.
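The first point could look like this sketch (the `cuda` feature name and the error type are illustrative):

```rust
/// Always present, whatever the enabled features: callers get a
/// `Result` instead of a symbol that vanishes when the feature is off.
fn try_cuda_executor() -> Result<(), String> {
    if cfg!(feature = "cuda") {
        // The real constructor would probe for a device here, and could
        // still fail gracefully if no accelerator is found.
        Ok(())
    } else {
        Err("telamon was built without the `cuda` feature".to_string())
    }
}
```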
