tiramisu-compiler / tiramisu

A polyhedral compiler for expressing fast and portable data parallel algorithms

Home Page: http://tiramisu-compiler.org

License: MIT License

Languages: C++ 48.08%, C 4.15%, Shell 0.69%, Python 8.86%, CMake 0.54%, Makefile 0.17%, MATLAB 0.05%, Jupyter Notebook 37.04%, Cuda 0.41%, C# 0.01%
Topics: compiler, library, optimization, linear-algebra, tensors, deep-neural-networks, machine-learning, code-generation, tensor-algebra

tiramisu's Introduction

MIT licensed

Overview

Tiramisu is a compiler for expressing fast and portable data parallel computations. It provides a simple C++ API for expressing algorithms (Tiramisu expressions) and how these algorithms should be optimized by the compiler. Tiramisu can be used in areas such as linear and tensor algebra, deep learning, image processing, stencil computations and machine learning.

The Tiramisu compiler is based on the polyhedral model, and can thus express a large set of loop optimizations and data layout transformations. It currently targets (1) multicore x86 CPUs, (2) Nvidia GPUs, (3) Xilinx FPGAs (Vivado HLS), and (4) distributed machines (using MPI). It is designed to enable easy integration of code generators for new architectures.

Example

The following is an example of a Tiramisu program specified using the C++ API.

// C++ code with a Tiramisu expression.
#include "tiramisu/tiramisu.h"
using namespace tiramisu;

void generate_code()
{
    // Specify the name of the function that you want to create.
    tiramisu::init("foo");

    // Declare two iterator variables (i and j) such that 0<=i<100 and 0<=j<100.
    var i("i", 0, 100), j("j", 0, 100);

    // Declare a Tiramisu expression (algorithm) that is equivalent to the following C code
    // for (i=0; i<100; i++)
    //   for (j=0; j<100; j++)
    //     C(i,j) = 0;
    computation C({i,j}, 0);
    
    // Specify optimizations
    C.parallelize(i);
    C.vectorize(j, 4);
    
    buffer b_C("b_C", {100, 100}, p_int32, a_output);
    C.store_in(&b_C);

    // Generate code
    C.codegen({&b_C}, "generated_code.o");
}
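
Compiling this file and calling generate_code() produces generated_code.o. The sketch below shows one way a host program could call the generated function; it assumes, as Tiramisu's test wrappers do, that the generated object exposes a C function named after the tiramisu::init() argument ("foo" here) taking one halide_buffer_t* per buffer passed to codegen(). The file name wrapper.cpp is illustrative.

// wrapper.cpp -- a minimal host-program sketch; link against generated_code.o
// and the Halide runtime.
#include "Halide.h"
#include <cstdio>

// Assumed signature of the generated function: one halide_buffer_t* per buffer.
extern "C" int foo(halide_buffer_t *b_C);

int main()
{
    // Matches buffer b_C("b_C", {100, 100}, p_int32, a_output) above.
    Halide::Buffer<int32_t> b_C(100, 100);
    foo(b_C.raw_buffer());
    printf("C(0,0) = %d\n", (int) b_C(0, 0)); // expected: 0
    return 0;
}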

Building Tiramisu from Sources

This section provides a description of how to build Tiramisu. The installation instructions below have been tested on Linux Ubuntu (18.04) and MacOS (13.0.1) but should work on other Linux and MacOS versions.

Prerequisites

Required
  1. CMake: version 3.22 or greater.

  2. Autoconf and libtool.

  3. Ninja.

Optional
  1. OpenMPI and OpenSSH: if you want to generate and run distributed code (MPI).

  2. CUDA Toolkit: if you want to generate and run CUDA code.

  3. Python 3.8 or higher, if you want to use the Python bindings (along with Pybind 2.10.2, Cython, and NumPy).

Build Methods

There are 3 ways to build Tiramisu:

  1. From spack, which will build everything from source for you.
  2. From source, but using system package managers for dependencies.
  3. Purely from source with our install script.

The last two differ only in how they set up the dependencies.

Method 1: Build from spack

Install spack and then run:

spack install tiramisu

Method 2: Build from source but install dependencies using system package managers

There are two steps:

  1. Install the dependencies (either using Homebrew or using Apt).
  2. Use CMake to build Tiramisu.
Install the dependencies
Install the dependencies using Homebrew

If you are on MacOS and using Homebrew, you can run the following commands to set up the dependencies:

brew install cmake
brew install llvm@14
brew install halide
brew install isl
brew link halide
brew link isl

If any of these ask you to update your path, do so. For example, using the following command, you can find the isl include and library directories:

brew info isl
ISL_INCLUDE_DIRECTORY=..
ISL_LIB_DIRECTORY=..
Install the dependencies using Apt

If you are on Ubuntu/Debian, you can use apt to set up the dependencies:

wget https://apt.llvm.org/llvm.sh
chmod +x llvm.sh
sudo ./llvm.sh 14 all
sudo apt-get install liblld-14-dev llvm-14-runtime
sudo apt-get install libllvm14 llvm-14-dev
sudo apt-get install llvm14-*
sudo apt-get install halide
sudo apt-get install libisl-dev

Using the following command, you can find the isl include and library directories:

dpkg -L libisl-dev
ISL_INCLUDE_DIRECTORY=..
ISL_LIB_DIRECTORY=..
Building Tiramisu with CMake
  1. Get Tiramisu
git clone https://github.com/Tiramisu-Compiler/tiramisu.git
cd tiramisu
mkdir build
  2. Set up configure.cmake. In particular, choose whether you want a GPU or MPI setup, whether you want the Python bindings, and whether you want the autoscheduler. You may need to add other options to support these.

  3. Configure:

cmake . -B build -DISL_LIB_DIRECTORY=$ISL_LIB_DIRECTORY -DISL_INCLUDE_DIRECTORY=$ISL_INCLUDE_DIRECTORY -DPython3_EXECUTABLE=`which python3`

If you want to install, add CMAKE_INSTALL_PREFIX. If you are installing the Python bindings, add Tiramisu_INSTALL_PYTHONDIR to tell Tiramisu where to place the Python package. You will need to add these install locations to the relevant path variables, such as PYTHONPATH and LD_LIBRARY_PATH.

  4. Build:
cmake --build build

You can also install, if you want, via cmake --install build.

Method 3: Build from source, but install dependencies using our script

There are two steps:

  1. Install the dependencies using our script.
  2. Use CMake to build Tiramisu.
Building Dependencies via Script
  1. Get Tiramisu
git clone https://github.com/Tiramisu-Compiler/tiramisu.git
cd tiramisu
  2. Get and install the Tiramisu submodules (ISL, LLVM and Halide). This step may take from a few minutes to a few hours (downloading and compiling LLVM is time-consuming).
./utils/scripts/install_submodules.sh <TIRAMISU_ROOT_DIR>
- Note: Make sure `<TIRAMISU_ROOT_DIR>` is an absolute path!
  3. Optional: configure the Tiramisu build by editing configure.cmake. This is needed only if you want to generate MPI or GPU code, run the BLAS benchmarks, or build the autoscheduler module. A description of each variable and how it should be set is provided in comments in configure.cmake.

    • To use the GPU backend, set USE_GPU to TRUE. If the CUDA library is not found automatically while building Tiramisu, the user will be prompted to provide the path to the CUDA library.
    • To use the distributed backend, set USE_MPI to TRUE. If the MPI library is not found automatically, set the following variables: MPI_INCLUDE_DIR, MPI_LIB_DIR, and MPI_LIB_FLAGS.
    • To build the autoscheduler module, set USE_AUTO_SCHEDULER to TRUE.
  4. Add Halide's cmake to the CMAKE_PREFIX_PATH:

export CMAKE_PREFIX_PATH=<TIRAMISU_ROOT_DIR>/3rdParty/Halide/build/:$CMAKE_PREFIX_PATH
  5. Build the main Tiramisu library:
mkdir build
cd build
cmake ..
cmake --build .
  6. If you want to build the autoscheduler module, set USE_AUTO_SCHEDULER to TRUE in configure.cmake, and after building Tiramisu run:
make tiramisu_auto_scheduler

Old Tiramisu on a Virtual Machine

Users can use the Tiramisu virtual machine disk image. The image was created using VirtualBox (5.2.12) and has Tiramisu already pre-compiled and ready to use. It was compiled using the same instructions as in this README file.

Once you download the image, unzip it and use VirtualBox to open the file 'TiramisuVM.vbox'.

Once the virtual machine has started, open a terminal, then go to the Tiramisu directory

cd /home/b/tiramisu/

If asked for a username/password

Username: b
Password: b

Getting Started

Run Tests

To run all the tests, assuming you are in the build/ directory

make test

or

ctest

To run only one test (test_01 for example)

ctest -R 01

This will compile and run the code generator and then the wrapper.

To view the output of a test, pass the --verbose option to ctest.

tiramisu's People

Contributors

abdouskamel, abeakkas, asmabalamane, bhafsa, dinataklit, emanueledelsozzo, gueroudjiamal, hbenyamina, ihiaadj, isramekki, jrayzero, khadidjarabea, linanina, malekbr, mascinissa, medchik, mekrache, michaeldickkk, mlwagman, nadir199, nadjisouilamas, nassimtchoulak, nedjimabelgacem, psuriana, radjahachilif, rbaghdadi, solleer, thinhinaneihadadene, wraith1995, ychen306


tiramisu's Issues

tiramisu_core

src/tiramisu_core.cpp has over 8800 lines, which makes it very difficult to navigate. It would be convenient to split it into per-class files such as tiramisu_computation.cpp, tiramisu_buffer.cpp, etc. This might be difficult to do in one shot since the system has many interconnected parts, so we can start by moving things around function by function.

Add gpu_tile overloads

gpu_tile is missing overloads that let you specify inner and outer iterator variables, like:
gpu_tile(i, j, bsize, bsize, i0, j0, i1, j1); as in the CPU tile.

Bug with tag_unroll_level?

In the following code: https://github.com/abdouskamel/tiramisu/blob/master/benchmarks/DNN/blocks/Resize-Conv/resize_conv_generator_tiramisu.cpp, I tag the loop level fin of the computation resize to be unrolled (line 79). But when I generate the code, it is the loop level fin of the computation init_resized_input that gets unrolled. Here is the generated code:

parallel (c1, 0, 32) {
    for (c3, 0, 226) {
        for (c5, 0, 226) {
            unrolled (c7, 0, 3) { // Loop level that gets unrolled
                input_resized_buf[(((c7 + int32((int64(c5)*(int64)3))) + int32((int64(c3)*(int64)678))) + int32((int64(c1)*(int64)153228)))] = 0.000000f
            }
        }
    }
    for (c3, 0, 224) {
        for (c5, 0, 28) {
            vectorized (c7, 0, 8) {
                for (c9, 0, 3) { // Loop level that should be unrolled
                    let t27.s = int32(floor_f32(((float32(((c5*8) + c7))*2.678571f) + 0.839286f)))
                    let t26.s = int32(floor_f32(((float32(c3)*1.785714f) + 0.392857f)))
                    let t25.s = t27.s
                    let t24 = t26.s
                    let t23 = t25.s
                    let t22.s = t24
                    let t21 = t25.s
                    let t20 = t24

                    input_resized_buf[(((c9 + int32(((int64(((c5*8) + c7))*(int64)3) + (int64)3))) + int32(((int64(c3)*(int64)678) + (int64)678))) + int32((int64(c1)*(int64)153228)))] = ((((_c_input_b0[(((c9 + (t25.s*3)) + (t24*1800)) + (c1*720000))]*((floor_f32(((float32(c3)*1.785714f) + 0.392857f)) - (float32(c3)*1.785714f)) + 0.607143f)) + (_c_input_b0[((((c9 + (t25.s*3)) + (t24*1800)) + (c1*720000)) + 1800)]*(((float32(c3)*1.785714f) - floor_f32(((float32(c3)*1.785714f) + 0.392857f))) + 0.392857f)))*((floor_f32(((float32(((c5*8) + c7))*2.678571f) + 0.839286f)) - (float32(((c5*8) + c7))*2.678571f)) + 0.160714f)) + (((_c_input_b0[((((c9 + (t25.s*3)) + (t24*1800)) + (c1*720000)) + 3)]*((floor_f32(((float32(c3)*1.785714f) + 0.392857f)) - (float32(c3)*1.785714f)) + 0.607143f)) + (_c_input_b0[((((c9 + (t25.s*3)) + (t24*1800)) + (c1*720000)) + 1803)]*(((float32(c3)*1.785714f) - floor_f32(((float32(c3)*1.785714f) + 0.392857f))) + 0.392857f)))*(((float32(((c5*8) + c7))*2.678571f) - floor_f32(((float32(((c5*8) + c7))*2.678571f) + 0.839286f))) + 0.839286f)))
                }
            }
        }
    }
    for (c3, 0, 4) {
        for (c5, 0, 224) {
            for (c7, 0, 224) {
                for (c9, 0, 8) {
                    output_buf[((((c9 + int32((int64(c7)*(int64)8))) + int32((int64(c5)*(int64)1792))) + int32((int64(c3)*(int64)401408))) + int32((int64(c1)*(int64)1605632)))] = _conv_bias_b2[(c9 + (c3*8))]
                }
            }
            for (c7, 0, 3) {
                for (c9, 0, 3) {
                    for (c11, 0, 224) {
                        for (c13, 0, 3) {
                            vectorized (c15, 0, 8) {
                                output_buf[((((c15 + int32((int64(c11)*(int64)8))) + int32((int64(c5)*(int64)1792))) + int32((int64(c3)*(int64)401408))) + int32((int64(c1)*(int64)1605632)))] = (output_buf[((((c15 + int32((int64(c11)*(int64)8))) + int32((int64(c5)*(int64)1792))) + int32((int64(c3)*(int64)401408))) + int32((int64(c1)*(int64)1605632)))] + (input_resized_buf[(((c13 + int32((int64((c9 + c11))*(int64)3))) + int32((int64((c5 + c7))*(int64)678))) + int32((int64(c1)*(int64)153228)))]*_conv_filter_b1[((((c15 + (c13*8)) + (c9*24)) + (c7*72)) + (c3*216))]))
                            }
                        }
                    }
                }
            }
        }
    }
}

I don't know if it's a bug or if there's an error in my code.

Can Multi Convs or MatMuls be fused?

A Conv or MatMul can easily be fused with elementwise operations. I want to know whether two Convs, or a Conv and a MatMul, can be fused. Fusion can reduce the memory footprint, so it would benefit some tensor accelerators (e.g. TVA) a lot if cascaded Convs could be fused. If this kind of fusion is available, can anyone provide an example?
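
For context, the basic fusion mechanism in Tiramisu is to order computations at a shared loop level with after(), which places both statements in the same loop nest. The sketch below is a minimal elementwise example of that mechanism (the names "fused", f, and g are illustrative); whether it is sufficient to fuse two Convs or a Conv and a MatMul, which also involve reduction loops and intermediate buffers, is exactly what this question asks.

// A minimal sketch of loop fusion via after(); not a Conv/MatMul fusion.
#include "tiramisu/tiramisu.h"
using namespace tiramisu;

void generate_code()
{
    tiramisu::init("fused");
    var i("i", 0, 100), j("j", 0, 100);

    computation f({i, j}, i + j);
    computation g({i, j}, f(i, j) * 2);

    // Placing g after f at level j nests both statements in the same i/j loops.
    g.after(f, j);

    buffer b_f("b_f", {100, 100}, p_int32, a_temporary);
    buffer b_g("b_g", {100, 100}, p_int32, a_output);
    f.store_in(&b_f);
    g.store_in(&b_g);

    tiramisu::codegen({&b_g}, "generated_code.o");
}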

Does Tiramisu have legality check of scheduling?

I tried the following example in Tiramisu master (c40b004).

#include <tiramisu/tiramisu.h>

using namespace tiramisu;

int main(int argc, char** argv) {
    tiramisu::init("test_after");

    static const int N = 48;
    static const int M = 48;
    
    var x("x", 0, N);
    var y("y", 0, M);
    
    computation f("f", {y, x}, x + y);
    computation g("g", {y, x}, f(y, 48 - 1 - x));

    if (std::getenv("AFTER")) g.after(f, x);
    else g.after(f, computation::root);

    buffer buf_f("buf_f", {M, N}, p_int32, a_temporary);
    buffer buf_g("buf_g", {M, N}, p_int32, a_output);

    f.store_in(&buf_f, {y, x});
    g.store_in(&buf_g, {y, x});

    tiramisu::codegen({&buf_g}, "test_after.o");

    return 0;
}

I expected this to cause an error when "AFTER=" is set, because the schedule g.after(f, x) breaks the dependency between f and g. However, it compiles and returns a broken result.
The Tiramisu paper "Tiramisu: A Code Optimization Framework for High Performance Systems" says that "TIRAMISU does not have this restriction since it checks transformation legality using dependence analysis [18]", but I cannot find a way to check transformation legality.

How can we check the legality of transformation?

Some Travis Tests Fail

When Travis tests are enabled in:

# - make test

some tests fail every now and then (e.g. test_10). The issue is probably AVX2 support on the Travis machines, so we might need to specify the architecture in the .travis.yml file.

CUDA Block Size

Add an error when the thread block size is bigger than 1024. Currently the compiler fails silently, which makes it hard to figure out what's wrong.

Name in codegen

The name of the function is defined in init:

tiramisu::init("sgemm");
..
tiramisu::codegen({&arg1, &arg2}, "build/generated_fct_sgemm.o");

but it would be more convenient if we provided the name in codegen along with the file name:

tiramisu::init();
..
tiramisu::codegen("sgemm", {&arg1, &arg2}, "build/generated_fct_sgemm.o");

This also seems to be the Halide convention (with opposite order):

Res.compile_to_object("build/generated_fct_sgemm.o", {Img}, "sgemm");

CI build is broken

The CI build is failing. It seems to have something to do with Halide's ABI not matching LLVM's; I'm not sure how to fix that.

Julia interface?

Your paper mentions Julia interface. Is it going to be released?

Constants in store_in

There is no store_in equivalent of:
A.set_access("[offset]->{A[i]->buf_A[i + offset]}");
where offset is a tiramisu::constant. The following statement:
A.store_in(buf_A, {i + offset})
generates the ISL syntax {A[i]->buf_A[i + offset]}, which gives an error since offset is not defined. We need a way to infer or pass the initial part ([offset]->) to the map.

Halide Buffer Index Problem

Halide buffer indices are reversed compared to the C convention, and thus to Tiramisu. For example, if we define a buffer buf(10, 20), it is passed as an array of shape buf[20][10] to Tiramisu. Consequently, the Halide statement buf(x, y) = z sets buf[y][x] instead of buf[x][y]. This creates unexpected problems.

The issue does not come up in the tutorials because either the outputs are constant, so the index flip does not matter, or we don't check the outputs for validity at all.

I will fix the tutorials and use one of them to explain this behavior.
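
A small standalone illustration of the convention described above, using a plain Halide::Buffer (this is generic Halide behavior, not Tiramisu-specific): the first Halide index is the stride-1 dimension, so a buffer created with extents (10, 20) is, viewed from C, an array of shape [20][10].

#include "Halide.h"
#include <cassert>

int main()
{
    // In Halide, the first index is the fastest-varying (stride-1) dimension.
    Halide::Buffer<int> buf(10, 20); // extents: x = 10, y = 20
    buf(3, 5) = 42;

    // Viewed as a C array, the same storage has shape [20][10], so the element
    // written as buf(x = 3, y = 5) lives at data[y * 10 + x], i.e. arr[5][3].
    const int *data = buf.data();
    assert(data[5 * 10 + 3] == 42);
    return 0;
}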

Bug in constant declaration

Declaring a constant using the following constructor

tiramisu::constant N("N", tiramisu::expr((int32_t) size), p_int32);

makes the constant non-function-wide, while the following constructor

tiramisu::constant N_CONST("N", tiramisu::expr((int32_t) size));

declares the constant to be function-wide. The two behaviors are inconsistent: either they should be made consistent, or a different class should be created for non-function-wide constants.

ISL Debugging

When there is an ISL syntax error we get something like syntax error (1, 53): unknown identifier and there is no clue which ISL statement is causing the error. We should print the erroneous ISL statement or the line number.
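
A minimal sketch of the kind of helper that would address this (the function name is illustrative, not existing Tiramisu code): wrap the ISL parsing call and print the offending string when parsing fails.

#include <isl/ctx.h>
#include <isl/map.h>
#include <iostream>
#include <string>

// Parse an ISL map and, on failure, report the string that caused the error.
isl_map *parse_map_or_report(isl_ctx *ctx, const std::string &str)
{
    isl_map *m = isl_map_read_from_str(ctx, str.c_str());
    if (m == nullptr)
        std::cerr << "ISL syntax error while parsing: " << str << std::endl;
    return m;
}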

CUDA backend Warnings

When compiling Tiramisu, the CUDA backend emits warnings:

make tiramisu

....
....
....

/data/scratch/baghdadi/tiramisu/src/tiramisu_codegen_cuda.cpp: In member function ‘bool tiramisu::cuda_ast::compiler::compile(const string&) const’:
/data/scratch/baghdadi/tiramisu/src/tiramisu_codegen_cuda.cpp:1757:25: warning: ignoring return value of ‘char* getcwd(char*, size_t)’, declared with attribute warn_unused_result [-Wunused-result]
getcwd(cwd, 500);
^

Skewing problem

The skewing operation doesn't seem to have any effect. Say you have a nested loop:

for i in [0, 4):
  for j in [0, 4):
    A[i, j] = x

This traverses the loop range as in:

 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15

After the skewing operation A.skew(i, j, 1), one would expect diagonal traversal:

 0  1  3  6
 2  4  7 10
 5  8 11 13
 9 12 14 15

But the resulting Tiramisu program still has the previous traversal. One can see this in the debug output as well:

Generated Halide stmt:
produce  {
  for (c1, 0, 4) {
    for (c3, c1, 4) {
      A[((c3 - c1) + (c1*4))] = x
    }
  }
}

Skew only shifts the inner loop by c1, and subtracts c1 back in the access. So the resulting program is equivalent to the one without skewing.

pytorch integration mentioned in paper

Thanks for the great work!

I read the paper TIRAMISU: A Polyhedral Compiler for Dense and Sparse Deep Learning from the LearningSys workshop, and found the following statement:

TIRAMISU is integrated in high level deep learning frameworks such as Pytorch and therefore can be used transparently by end-users

Is this part going to be open-sourced? I checked the official PyTorch repo but found nothing. Also, there is a repo under Tiramisu-Compiler named tiramisu-pytorch, but it is empty.

Thanks a lot!

Vectorize an inner loop that depends on the outer loop

Hello,

Actually this is not an issue but a question about Tiramisu. I don't know if this is the right place for asking questions, so I'm sorry if it's inconvenient.

I want to implement a computation that looks like this with Tiramisu:

for (i = 0; i < N; ++i)
    for (j = 0; j < i; ++j)
        S0
    S1

I have done it following the tutorials, but now I want to apply vectorization to the loop over j with vector length v. The problem is that, as stated in the documentation:

the vectorization of a loop that has less than v iterations is not correct.

Thus, for the first iterations where i < v, the vectorization will be incorrect because the loop over j has fewer than v iterations. I think this can be fixed by splitting the outer loop like this:

for (i = 0; i < v; ++i)
    for (j = 0; j < i; ++j)
        S0
    S1

for (i = v; i < N; ++i)
    for (j = 0; j < i; ++j)
        S0
    S1

and applying vectorization only to the second loop. This can be done easily in the declaration of the algorithm (layer 1), but I think I should resolve it in layer 2 because it is an issue related to the optimization of the algorithm.

I searched in the documentation but didn't find a way to do that, so I am asking for your help.

Thank you.

CUDA Unrolling

Unrolling is not implemented in the CUDA AST. The loop is split, but the "#pragma unroll" directive is not added.

CUDA Wrapper Bug

cuda_ast::statement_ptr{new cuda_ast::host_function{cuda_ast::kernel::wrapper_return_type,
kernel->get_wrapper_name(), kernel->get_arguments(),
std::static_pointer_cast<cuda_ast::statement>(wrapper_block)}}

Host code generation for CUDA has a subtle bug. The wrapper copies the kernel's arguments as its own (kernel->get_arguments() above). However, if the kernel drops an unused argument while the wrapper still uses it for the block dimensions, the generated code becomes faulty:

int32_t _kernel_0_wrapper(int32_t K, int32_t N) {
  dim3 blocks(((M / 96) + 1), ((N / 256) + 1), 1);
  dim3 threads((15 + 1), (15 + 1), 1);
  _kernel_0<<<blocks, threads>>>(K, N);
  return 0;
}

M is undefined above and should have been passed to the wrapper as an argument. I'll work on this ASAP since it blocks the GEMM benchmark.

Some tests of skewing command failing

I've run the tests and the following failed: 133, 134, 135, 139, 140. I checked them and they are skewing tests, giving output different from the expected one.
I tried running them individually a few times, and they don't always fail, which means that perhaps a few other skewing tests could fail as well.

Debug Macro

The __FILE__ and __LINE__ macros point to this file (tiramisu_debug.cpp:41) instead of to where the function is called. We should implement it as a macro instead of a function.

void error(const std::string &str, bool exit_program)
{
    std::cerr << "Error in " << __FILE__ << ":"
              << __LINE__ << " - " << str << std::endl;
    if (exit_program)
    {
        exit(1);
    }
}
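
A minimal sketch of the proposed change (the macro name is illustrative, not the actual Tiramisu API): because a macro expands at the call site, __FILE__ and __LINE__ then refer to the caller instead of tiramisu_debug.cpp.

#include <cstdlib>
#include <iostream>
#include <string>

// Expanding at the call site makes __FILE__/__LINE__ point to the caller.
#define TIRAMISU_ERROR(str, exit_program)                           \
    do {                                                            \
        std::cerr << "Error in " << __FILE__ << ":" << __LINE__     \
                  << " - " << (str) << std::endl;                   \
        if (exit_program)                                           \
            std::exit(1);                                           \
    } while (0)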

Any suggestions for understanding the ISL code in Tiramisu?

Hello Tiramisu developers:

I am a software developer interested in the Tiramisu project.

Currently I'm reading the Tiramisu code and find its usage of ISL for polyhedral analysis complex.

I wonder how you developers mastered ISL so fluently; searching the Internet, I only found a few documents about ISL, with very few demos and examples of the API.

After reading this documentation, I still find it hard to understand the ISL usage in the Tiramisu project.

Can you suggest some materials for mastering the ISL APIs?

Errors compiling benchmarks/halide/heat2d_dist_tiramisu.cpp

This seems to be an error due to MPI, maybe because this benchmark is not disabled when WITH_MPI is not used.

Assuming that

WITH_MPI=false

If the user runs

make benchmarks

We get the following error messages

/Users/b/Documents/src/MIT/tiramisu/benchmarks/halide/heat2d_dist_tiramisu.cpp:165:32: error: too few arguments to function call, single argument 'level' was not specified
init_even.drop_rank_iter();
~~~~~~~~~~~~~~~~~~~~~~~~ ^
/Users/b/Documents/src/MIT/tiramisu/include/tiramisu/core.h:2981:5: note: 'drop_rank_iter' declared here
void drop_rank_iter(var level);
^
/Users/b/Documents/src/MIT/tiramisu/benchmarks/halide/heat2d_dist_tiramisu.cpp:166:31: error: too few arguments to function call, single argument 'level' was not specified
init_odd.drop_rank_iter();
~~~~~~~~~~~~~~~~~~~~~~~ ^
/Users/b/Documents/src/MIT/tiramisu/include/tiramisu/core.h:2981:5: note: 'drop_rank_iter' declared here
void drop_rank_iter(var level);
^
/Users/b/Documents/src/MIT/tiramisu/benchmarks/halide/heat2d_dist_tiramisu.cpp:167:36: error: too few arguments to function call, single argument 'level' was not specified
out_comp_even.drop_rank_iter();
~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^
/Users/b/Documents/src/MIT/tiramisu/include/tiramisu/core.h:2981:5: note: 'drop_rank_iter' declared here
void drop_rank_iter(var level);
^
/Users/b/Documents/src/MIT/tiramisu/benchmarks/halide/heat2d_dist_tiramisu.cpp:168:35: error: too few arguments to function call, single argument 'level' was not specified
out_comp_odd.drop_rank_iter();
~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^
/Users/b/Documents/src/MIT/tiramisu/include/tiramisu/core.h:2981:5: note: 'drop_rank_iter' declared here
void drop_rank_iter(var level);
^
/Users/b/Documents/src/MIT/tiramisu/benchmarks/halide/heat2d_dist_tiramisu.cpp:191:12: error: use of undeclared identifier 'TOTAL_ITERATIONS'
assert(TOTAL_ITERATIONS % 2 == 0); // so that the output buffer is the last odd computation
^
/Users/b/Documents/src/MIT/tiramisu/benchmarks/halide/heat2d_dist_tiramisu.cpp:261:21: error: 'lift_dist_comps' is a protected member of 'tiramisu::function'
heat2d_tiramisu.lift_dist_comps(); // MUST go before gen_isl_ast
^
/Users/b/Documents/src/MIT/tiramisu/include/tiramisu/core.h:680:10: note: declared protected here
void lift_dist_comps();
^

Add an argument type for input-output

We should add a new argument type for buffers that are used as both input and output. The documentation for the buffer constructor also needs to be updated, and the code generator should be modified.

In the CPU code generator that uses Halide, there is no difference between input, output, and input/output buffers; they are all treated the same way. But for generating GPU code we want to be able to differentiate between them, so that we can decide which buffers to copy to the GPU and which to copy back from it.

* - a_input: for inputs of the function,
* - a_output: for outputs of the function,
* - a_temporary: for buffers used as temporary buffers within
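
A hypothetical sketch of how the proposed argument type might look in use; a_input_output does not exist in the current API and the name is only illustrative. With such a type, the GPU backend would know to copy the buffer both to and from the device.

// Hypothetical: a_input_output is the proposed (not yet existing) argument type.
#include "tiramisu/tiramisu.h"
using namespace tiramisu;

void generate_code()
{
    tiramisu::init("increment");
    var i("i", 0, 100);

    // Read the current contents of the buffer ...
    input A("A", {i}, p_int32);
    // ... and write the updated values back to the same buffer.
    computation Inc({i}, A(i) + 1);

    buffer b_A("b_A", {100}, p_int32, a_input_output); // proposed argument type
    A.store_in(&b_A);
    Inc.store_in(&b_A);

    tiramisu::codegen({&b_A}, "generated_code.o");
}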

RNN example

I would be very interested in an RNN implementation example using Tiramisu.

Unfortunately due to #217 I cannot explore that myself at the moment.

Also, on the tiramisu.github.io website you claim that Halide cannot represent RNNs, but time is just another loop/tensor dimension. Case in point, this seems to be an actual LSTM implementation in Halide: https://github.com/misaka-10032/Halstm/blob/master/src/layer/lstm.cpp.

One thing I would be very interested in is the wavefront optimization on stacked RNNs, as mentioned in Nvidia's blog post (optimization 3).

They even provide the CUDA source code, which can serve as a reference benchmark.

Build instructions/scripts need to be updated. Unable to build.

I tried to follow the instructions on an up-to-date Arch Linux distro with GCC 8.3.0 and Clang/LLVM 8.0.0, but I had several issues with both LLVM and Halide and had to give up on trying Tiramisu in the end.

Note that I can successfully build Halide from source. My last build was as of Feb 9 (https://github.com/halide/Halide/commits/d02247b3021549fde4bec8e600dced90f5d9a87c).

Build

I tried to use the install submodule script at https://github.com/Tiramisu-Compiler/tiramisu/blob/2ee529439fbfccf82f7351ee2e3c01f10387af26/utils/scripts/install_submodules.sh

LLVM

It fails for LLVM after reaching 100%.

[100%] Built target libclang
[100%] Built target c-arcmt-test
[100%] Built target c-index-test
In file included from /home/beta/Programming/DSL_Compilers_for_Tensors/tiramisu/3rdParty/llvm/include/llvm/CodeGen/LinkAllCodegenComponents.h:20,
                 from /home/beta/Programming/DSL_Compilers_for_Tensors/tiramisu/3rdParty/llvm/tools/lli/lli.cpp:22:
/home/beta/Programming/DSL_Compilers_for_Tensors/tiramisu/3rdParty/llvm/include/llvm/CodeGen/SchedulerRegistry.h: In constructor ‘llvm::RegisterScheduler::RegisterScheduler(const char*, const char*, llvm::RegisterScheduler::FunctionPassCtor)’:
/home/beta/Programming/DSL_Compilers_for_Tensors/tiramisu/3rdParty/llvm/include/llvm/CodeGen/SchedulerRegistry.h:40:52: warning: cast between incompatible function types from ‘llvm::RegisterScheduler::FunctionPassCtor’ {aka ‘llvm::ScheduleDAGSDNodes* (*)(llvm::SelectionDAGISel*, llvm::CodeGenOpt::Level)’} to ‘llvm::MachinePassCtor’ {aka ‘void* (*)()’} [-Wcast-function-type]
   : MachinePassRegistryNode(N, D, (MachinePassCtor)C)
                                                    ^
In file included from /home/beta/Programming/DSL_Compilers_for_Tensors/tiramisu/3rdParty/llvm/tools/lli/lli.cpp:30:
/home/beta/Programming/DSL_Compilers_for_Tensors/tiramisu/3rdParty/llvm/include/llvm/ExecutionEngine/Orc/OrcRemoteTargetClient.h: In member function ‘llvm::Expected<std::vector<char> > llvm::orc::remote::OrcRemoteTargetClient<ChannelT>::readMem(char*, llvm::JITTargetAddress, uint64_t)’:
/home/beta/Programming/DSL_Compilers_for_Tensors/tiramisu/3rdParty/llvm/include/llvm/ExecutionEngine/Orc/OrcRemoteTargetClient.h:722:26: error: could not convert ‘((llvm::orc::remote::OrcRemoteTargetClient<ChannelT>*)this)->callB<llvm::orc::remote::OrcRemoteTargetRPCAPI::ReadMem>(Src, Size)’ from ‘Expected<vector<unsigned char,allocator<unsigned char>>>’ to ‘Expected<vector<char,allocator<char>>>
     return callB<ReadMem>(Src, Size);

Halide

The script launches the default compiler, GCC 8 on my machine (note that this works for the current Halide master branch). But GCC 8 introduced a lot of new warnings that need to be ignored with:
-Wno-stringop-truncation (/home/beta/Programming/DSL_Compilers_for_Tensors/tiramisu/3rdParty/Halide/src/Introspection.cpp:36:12: error: ‘char* strncpy(char*, const char*, size_t)’ specified bound 2048 equals destination size [-Werror=stringop-truncation])
-Wno-catch-value (some polymorphic exception caught by value)
-Wno-format (/home/beta/Programming/DSL_Compilers_for_Tensors/tiramisu/3rdParty/Halide/test/correctness/extern_producer.cpp:47:12: error: format ‘%d’ expects argument of type ‘int’, but argument 2 has type ‘int64_t’ {aka ‘long int’} [-Werror=format=])
-fpermissive

It would be good to force the use of Clang 5.0 if that is the only supported compiler.

Object files

This part of the Halide Makefile deletes the objects that are supposed to be put into the static library, and the make command then fails with 'file not found': https://github.com/jrayzero/Halide/blob/0e2cac7a2e3982d1a51d82932b185f75af05f4c2/Makefile#L670-L697

$(BUILD_DIR)/llvm_objects/list: $(OBJECTS) $(INITIAL_MODULES)
	# Determine the relevant object files from llvm with a dummy
	# compilation. Passing -t to the linker gets it to list which
	# object files in which archives it uses to resolve
	# symbols. We only care about the libLLVM ones.
	@mkdir -p $(@D)
	$(CXX) -o /dev/null -shared $(OBJECTS) $(INITIAL_MODULES) -Wl,-t $(LLVM_STATIC_LIBS) $(COMMON_LD_FLAGS) | egrep "libLLVM" > $(BUILD_DIR)/llvm_objects/list.new
	# if the list has changed since the previous build, or there
	# is no list from a previous build, then delete any old object
	# files and re-extract the required object files
	cd $(BUILD_DIR)/llvm_objects; \
	if cmp -s list.new list; \
	then \
	echo "No changes in LLVM deps"; \
	touch list; \
	else \
	rm -f llvm_*.o*; \
	cat list.new | sed = | sed "N;s/[()]/ /g;s/\n /\n/;s/\([0-9]*\)\n\([^ ]*\) \([^ ]*\)/ar x \2 \3; mv \3 llvm_\1_\3/" | bash -; \
	mv list.new list; \
	fi

$(LIB_DIR)/libHalide.a: $(OBJECTS) $(INITIAL_MODULES) $(BUILD_DIR)/llvm_objects/list
	# Archive together all the halide and llvm object files
	@mkdir -p $(@D)
	@rm -f $(LIB_DIR)/libHalide.a
	# ar breaks on MinGW with all objects at the same time.
	echo $(OBJECTS) $(INITIAL_MODULES) $(BUILD_DIR)/llvm_objects/llvm_*.o* | xargs -n200 ar q $(LIB_DIR)/libHalide.a
	ranlib $(LIB_DIR)/libHalide.a

LLVMHeaders

When building Halide, it doesn't look for the LLVM headers in their proper place, and I had to explicitly include 3rdParty/llvm/include.

CuDNN variable

There is a CUDNN_LOCATION variable here

set(CUDNN_LOCATION /data/scratch/akkas/cudnn7)

but during the build I got a warning that CUDNN_LIBRARIES was not set.

Looking into this code:

tiramisu/CMakeLists.txt

Lines 41 to 42 in 11395ca

find_library(CUDNN_LIBRARIES cudnn PATHS ${CUDNN_LOCATION}/lib64 NO_DEFAULT_PATH)
set(CUDNN_INCLUDE_DIRECTORY ${CUDNN_LOCATION}/include)

either CUDNN_LOCATION should be split into LIB/INCLUDE variables, similar to ISL and Halide:

tiramisu/configure.cmake

Lines 40 to 46 in 2ee5294

# ISL paths
set(ISL_INCLUDE_DIRECTORY "3rdParty/isl/build/include/")
set(ISL_LIB_DIRECTORY "3rdParty/isl/build/lib/")
# Halide Paths
set(HALIDE_SOURCE_DIRECTORY "3rdParty/Halide")
set(HALIDE_LIB_DIRECTORY "3rdParty/Halide/lib")

or the CUDNN_LOCATION description ("Change with the cudnn library location") should make it clearer that it is the parent directory of lib64/libcudnn.so and include/cudnn.h, since people will naively set it to /usr/lib or /usr/lib64 instead of /usr.

Regression issue? Many Halide tests slower with Tiramisu on Mac, or crashing

The following tests are significantly slower with Tiramisu vs Halide:

  • blurxy
  • convolution, convolution_layer
  • gaussian
  • vgg
  • warp-affine

Others fail:

  • recfilter — segfault
  • heat2d, heat3d — crash: out-of-bounds access
  • laplacian — doesn't build
  • optical_flow — crash: out-of-bounds access
  • resize — crash: name not in scope

This is on a MacBook Pro 2018 with macOS Mojave, the latest Xcode, and Homebrew up to date.

rectfilter benchmark segfaults on Linux

For some reason, the rectfilter benchmark segfaults on Linux. It was fine on Mac. The segfault seems to happen after the benchmark is run. Commenting out the save_image call seems to make the issue disappear, although I am not sure whether it is the direct cause of the segfault.

Here is the error message:

Kernel : Tiramisu ; Halide ;
recfilter : 13.795732 ; 12.503011 ;
/bin/sh: line 1: 70723 Segmentation fault (core dumped) LD_LIBRARY_PATH=:/usr/local/Halide/lib:/usr/local/isl/installed/lib:Halide/lib:/usr/local/lib:/usr/local/tiramisu/build/ DYLD_LIBRARY_PATH=:Halide/lib:/usr/local/tiramisu/build/ ${tt}

Docker image is broken?

I downloaded the docker image from http://groups.csail.mit.edu/commit/software/TiramisuVM.zip to try it out, and when I unzip it I get the following:

bash-3.2$ unzip TiramisuVM.zip
Archive:  TiramisuVM.zip
warning [TiramisuVM.zip]:  4294967296 extra bytes at beginning or within zipfile
  (attempting to process anyway)
file #1:  bad zipfile offset (local header sig):  4294967296
  (attempting to re-compensate)
   creating: TiramisuVM/
  inflating: TiramisuVM/.DS_Store
   creating: __MACOSX/
   creating: __MACOSX/TiramisuVM/
  inflating: __MACOSX/TiramisuVM/._.DS_Store
   creating: TiramisuVM/Logs/
  inflating: TiramisuVM/Logs/VBox.log
  inflating: TiramisuVM/TiramisuVM.vbox
  inflating: TiramisuVM/TiramisuVM.vbox-prev
  inflating: TiramisuVM/TiramisuVM.vdi
  error:  invalid compressed data to inflate

Could you please provide a working docker image?

New API Scheduling Issue

There is a scheduling issue in the new API that causes all computations to be placed in the innermost loop nest. Check the generated code of tutorial_04A.

Are there any methods in Tiramisu for parallelizing or loop tiling that automatically resolve data dependencies?

I am quite new to the polyhedral model and may still be unfamiliar with related concepts, so please point out any mistakes I make.

I would like to know whether there are any methods for parallelizing or loop tiling that automatically resolve data dependencies. To be more specific, consider the following one-dimensional stencil computation:

for (t = 1; t < T; t += 1)
  for (i = 1; i < N - 1; i += 1)
    A[t][i] = 0.25 * (A[t - 1][i + 1] - 2.0 * A[t - 1][i] + A[t - 1][i - 1]);

Since computing A[t][i] needs to read A[t - 1][i + 1], the statement instance (t, i) has to be executed after the statement instance (t - 1, i + 1). So the two-level loop cannot simply be tiled, otherwise the data dependency would be violated.

However, the computation::tile function in Tiramisu does not seem to make any effort to preserve the data dependency:

#include <tiramisu/tiramisu.h>

using namespace tiramisu;

int main() {
  tiramisu::init("stencil");

  const int SIZE_T = 200, SIZE_N = 100;
  constant T("T", expr(SIZE_T)), N("N", expr(SIZE_N));

  var t("t", 1, T), i("i", 1, N - 1);

  computation A("A", {t, i}, p_float32);
  A.set_expression((A(t - 1, i + 1) - A(t - 1, i) * 2.0f + A(t - 1, i - 1)) * 0.25f);
//  var t0("t0"), i0("i0"), t1("t1"), i1("i1");
//  A.tile(t, i, 32, 32, t0, i0, t1, i1);
  
  buffer b_A("b_A", {expr(SIZE_T), expr(SIZE_N)}, p_float32, a_input);
  A.store_in(&b_A);

  tiramisu::codegen({&b_A}, "stencil.o");
}

Uncomment the two lines related to loop tiling, and the output Halide IR changes from:

Generated Halide IR:
assert((reinterpret(uint64, b_A.buffer) != (uint64)0), halide_error_buffer_argument_is_null("b_A"))
let b_A = _halide_buffer_get_host(b_A.buffer)
produce  {
  allocate _A_b0[float32 * 98 * 199]
  for (c1, 1, 199) {
    for (c3, 1, 98) {
      b_A[(c3 + int32((int64(c1)*(int64)100)))] = (((b_A[(int32((int64(c3) + (int64)1)) + int32(((int64(c1)*(int64)100) + (int64)-100)))] - (b_A[(c3 + int32(((int64(c1)*(int64)100) + (int64)-100)))]*2.000000f)) + b_A[(int32((int64(c3) + (int64)-1)) + int32(((int64(c1)*(int64)100) + (int64)-100)))])*0.250000f)
    }
  }
}

to:

Generated Halide IR:
assert((reinterpret(uint64, b_A.buffer) != (uint64)0), halide_error_buffer_argument_is_null("b_A"))
let b_A = _halide_buffer_get_host(b_A.buffer)
produce  {
  allocate _A_b0[float32 * 98 * 199]
  for (c1, 0, 7) {
    for (c3, 0, 4) {
      for (c5, (1 - min((c1*32), 1)), ((min((c1*32), 1) - max((c1*32), 168)) + 199)) {
        for (c7, (1 - min((c3*32), 1)), ((min((c3*32), 1) - max((c3*32), 67)) + 98)) {
          b_A[(((c3*32) + c7) + int32((int64(((c1*32) + c5))*(int64)100)))] = (((b_A[(int32((int64(((c3*32) + c7)) + (int64)1)) + int32(((int64(((c1*32) + c5))*(int64)100) + (int64)-100)))] - (b_A[(((c3*32) + c7) + int32(((int64(((c1*32) + c5))*(int64)100) + (int64)-100)))]*2.000000f)) + b_A[(int32((int64(((c3*32) + c7)) + (int64)-1)) + int32(((int64(((c1*32) + c5))*(int64)100) + (int64)-100)))])*0.250000f)
        }
      }
    }
  }
}

where the data dependency is violated, and the test fails as well.

I have heard that the Pluto algorithm can be adopted in such a scenario; it automatically skews the iteration domain to resolve the data dependency:

for (t = 1; t < T; t += 1)
  for (i = 1 + t; i < N - 1 + t; i += 1)
    A[t][i - t] = 0.25 * (A[t - 1][i - t + 1] - 2.0 * A[t - 1][i - t] + A[t - 1][i - t - 1]);

and the loop can be safely tiled. It is also possible to skew the iteration domain and perform loop tiling in Tiramisu:

#include <tiramisu/tiramisu.h>

using namespace tiramisu;

int main() {
  tiramisu::init("stencil");

  const int SIZE_T = 200, SIZE_N = 100;
  constant T("T", expr(SIZE_T)), N("N", expr(SIZE_N));

  var t("t", 1, T), i("i", 1, N - 1);

  computation A("A", {t, i}, p_float32);
  A.set_expression((A(t - 1, i + 1) - A(t - 1, i) * 2.0f + A(t - 1, i - 1)) * 0.25f);
  var nt("nt"), ni("ni");
  A.skew(t, i, 1, nt, ni);
  var t0("t0"), i0("i0"), t1("t1"), i1("i1");
  A.tile(nt, ni, 32, 32, t0, i0, t1, i1);

  buffer b_A("b_A", {expr(SIZE_T), expr(SIZE_N)}, p_float32, a_output);
  A.store_in(&b_A);

  tiramisu::codegen({&b_A}, "stencil.o");
}

It will pass the test, but it requires me to observe the loop patterns to make such a transformation. Moreover, if I want to parallelize the loop, it requires not only skewing but also synchronization and communication between parallel computation units. This seems complicated to me, but it can theoretically be automated through the Pluto algorithm. This is why I would like to know: are there any methods in Tiramisu for parallelizing or loop tiling that automatically resolve data dependencies?
