jingpu / halide-hls

HLS branch of Halide

License: Other

CMake 1.04% Makefile 1.87% Shell 0.52% C++ 85.80% Java 1.23% C 5.00% MATLAB 0.05% Objective-C 0.04% HTML 0.02% Objective-C++ 0.45% Python 2.39% Batchfile 0.08% LLVM 1.41% Verilog 0.03% Tcl 0.01% Ruby 0.06%

halide fpga hls image-processing dsl fpga-soc

halide-hls's Introduction

Halide to CPU/FPGA

The current compiler is based on Halide release 2017/05/03 (https://github.com/halide/Halide/releases).

Instructions for building the examples can be found on the wiki page: https://github.com/jingpu/Halide-HLS/wiki

A paper is available at https://arxiv.org/abs/1610.09405

If you want to build the compiler in other settings, please refer to the original readme: https://github.com/jingpu/Halide-HLS/blob/HLS/README.orig.md

For more detail about what Halide is, see http://halide-lang.org.

Updates

2017/06/14 merge Halide nightly 2017/06/13 ( d55ebd5b806110cd77e7c4f2616ddb9ffbf2e99e ), picking up aotcpp_generators tests.
2017/06/13 merge Halide release 2017/05/03.

halide-hls's Issues

Add HLS AOT generator test

We should have at least one AOT-CPP generator test for the HLS backend.

Blocked by #7

Rewrite example apps using Generator

Blocked by #8

Example Usage of HLS in Application

I've managed to get HLS synthesis working and integrated into a project; however, I'm not sure how to format my images so that the accelerator can actually process them.

In the 'run.cpp' code the BufferMinimal type is used, but I don't believe I can simply serialize that over to the FPGA since the hls_target expects an hls::stream<AXIPackedStencil<T...>> as an argument.

Do you have any examples of apps formatting data for use by the hls block? Or can you recommend how I can use halide libraries to convert my image into a stream of AXIPackedStencils?

N.B. I'm attempting this with the Gaussian example.

Thanks.
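
As a rough illustration of the question above, the sketch below serializes a row-major 8-bit image into a FIFO, one pixel per beat, and marks the final beat. Everything in it is an assumption made for illustration: `PixelBeat` and `PixelStream` are hypothetical stand-ins, not the `AXIPackedStencil` or `hls::stream` types from the Halide-HLS and Vivado HLS headers, and the 1x1 stencil size, row-major order, and end-of-frame flag may not match what a given hls_target expects.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical stand-in for one stream beat: a pixel plus an end-of-frame flag.
// This is NOT the AXIPackedStencil type from the Halide-HLS headers.
struct PixelBeat {
    uint8_t value;  // payload of an assumed 1x1 uint8_t stencil
    bool last;      // set only on the final beat of the frame
};

// Hypothetical stand-in for hls::stream<...>: a plain software FIFO.
typedef std::queue<PixelBeat> PixelStream;

// Push a row-major width x height image into the stream, one pixel per beat.
inline void image_to_stream(const std::vector<uint8_t> &pixels,
                            int width, int height, PixelStream &out) {
    assert(pixels.size() == static_cast<std::size_t>(width) * height);
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            PixelBeat beat;
            beat.value = pixels[static_cast<std::size_t>(y) * width + x];
            beat.last = (y == height - 1) && (x == width - 1);
            out.push(beat);
        }
    }
}
```

Calling `image_to_stream(img, 3, 2, s)` on a six-pixel image leaves six beats in `s`, with only the last one flagged. A real testbench would replace the stand-in types with the ones declared in the generated hls_target.h.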

Lookup Tables for Trigonometric Functions

Dear all,

Regarding constant function optimization, the Halide-HLS paper (Programming Heterogeneous Systems from an Image Processing DSL) states that:

Enabling more design optimizations, the compiler can also statically evaluate constant functions (e.g. lookup tables), and generate the code that later synthesizes to ROMs.

There are some examples in the hls_examples directory of Halide-HLS (e.g. unsharp_hls) that use an exponential function in a reduction domain, and the values of that exponential function are replaced with constants in the generated HLS code.

I am trying to infer lookup tables for sin(x) and cos(y) in the following code:

Input(x,y) = Input_Image(x, y);

fy_f(y) = cos(y);
fx_f(x) = sin(x);

fy(y) = Halide::cast<uint16_t>(fy_f(y) * 65535);
fx(x) = Halide::cast<uint16_t>(fx_f(x) * 65535);

fxy(x, y) = fx(x) * fy(y);
	
hw_output(x,y) = Input(x,y) * fxy(x, y);    
output(x,y) = hw_output(x,y);

The generated HLS code contains sin_f32() and cos_f32() functions that receive their arguments from loop indexes, and Vivado HLS does not use lookup tables for those functions, even though the loop indexes are known.

I know we can replace those functions in the Halide code with constant arrays that hold sin() and cos() evaluated at the corresponding indexes. But I wonder whether the Halide-HLS compiler can generate lookup tables directly for those functions, not just in the reduction-domain manner used in the unsharp_hls example. Is there a Halide primitive that can be used in this situation?

Thanks!
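
The constant-array workaround mentioned above could start from tables computed on the host, along these lines. This is only a sketch under stated assumptions: the helper names are made up, the index is taken directly as an angle in radians (as in `sin(x)` above), and the values are scaled by 32767 into `int16_t` rather than by 65535 into `uint16_t`, so that negative sine and cosine values stay representable. It does not show any Halide-HLS mechanism for inferring ROMs.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper: table of sin(i) for i = 0 .. n-1 (radians),
// scaled by 32767 and rounded to the nearest integer.
inline std::vector<int16_t> make_sin_lut(int n) {
    std::vector<int16_t> lut(n);
    for (int i = 0; i < n; i++) {
        lut[i] = static_cast<int16_t>(
            std::lround(std::sin(static_cast<double>(i)) * 32767.0));
    }
    return lut;
}

// Hypothetical helper: the same table for cos(i).
inline std::vector<int16_t> make_cos_lut(int n) {
    std::vector<int16_t> lut(n);
    for (int i = 0; i < n; i++) {
        lut[i] = static_cast<int16_t>(
            std::lround(std::cos(static_cast<double>(i)) * 32767.0));
    }
    return lut;
}
```

The resulting values can then be pasted or loaded into the pipeline as constant data; whether the HLS tool maps them to a ROM depends on how the generated code indexes them.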

Possible bug in generated HLS

Hi, I am trying to use Halide-HLS to write a simple HLS program, but I found that in some cases the tool produces incorrect HLS results.
Here is the Halide C++ code:

MyPipeline() : input(UInt(64), 1, "input"), A("A"), B("B"), C("C"), hw_output("hw_output")

{

    // define the algorithm

    A = BoundaryConditions::repeat_edge(input);

    Expr constant(((uint64_t)(274877906943)));

    output(x) =  A(x) & constant;

    // define common schedule: tile output

    args.push_back(input);

}

This program is supposed to clear the 39th and 40th bits of A. However, the generated HLS code is:

 for (int _p2_output_s0_x = _21; _p2_output_s0_x < _21 + _137; _p2_output_s0_x++)

  {

   int32_t _138 = _14 + _15;

   int32_t _139 = _138 + -1;

   int32_t _140 = min(_p2_output_s0_x, _139);

   int32_t _141 = _140 - _14;

   int32_t _142 = max(_141, 0);

   uint64_t _143 = ((const uint64_t *)_input)[_142];

   uint64_t _144 = _143 & 63;

   int32_t _145 = _p2_output_s0_x - _21;

   ((uint64_t *)_p2_output)[_145] = _144;

  } // for _p2_output_s0_x

  int32_t _146 = _136 - _132;

  for (int _p2_output_s0_x = _132; _p2_output_s0_x < _132 + _146; _p2_output_s0_x++)

  {

   int32_t _147 = _p2_output_s0_x - _14;

   uint64_t _148 = ((const uint64_t *)_input)[_147];

   uint64_t _149 = _148 & 63;

   int32_t _150 = _p2_output_s0_x - _21;

   ((uint64_t *)_p2_output)[_150] = _149;

  } // for _p2_output_s0_x

Here the constant expression has become 63 instead of the original 274877906943, which produces incorrect results.
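
To make the difference concrete, the following sketch compares the two masks in plain C++. The helper names are made up for illustration; the only facts used are the two constants quoted above and that 274877906943 equals 2^38 - 1.

```cpp
#include <cassert>
#include <cstdint>

// Mask from the Halide source above; equal to 2^38 - 1.
static const uint64_t kExpectedMask = 274877906943ULL;
// Mask that appears in the generated HLS code.
static const uint64_t kGeneratedMask = 63ULL;

// What the Halide program asks for.
inline uint64_t mask_expected(uint64_t a) { return a & kExpectedMask; }

// What the generated code computes.
inline uint64_t mask_generated(uint64_t a) { return a & kGeneratedMask; }
```

For a 40-bit input of all ones (0xFFFFFFFFFF), the expected result is 0x3FFFFFFFFF, while the generated code yields 0x3F.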

hls_stream.h: unknown type name 'type_info'

Hi @jingpu ,

I get an error when I try to synthesize the generated HLS C++ code using Vivado HLS 2016.4.
It seems that compiling with -std=c++0x or -std=c++11 causes this error.

Pragma processor failed: In file included from hls_target.cpp:1:
In file included from ./hls_target.h:7:
In file included from C:/cadappl/Vivado/2016.4/Vivado_HLS/2016.4/include\hls_stream.h:79:
In file included from C:/cadappl/Vivado/2016.4/Vivado_HLS/2016.4/win64/tools/clang/bin\..\lib\clang\3.1/../../../include/c++/4.5.2\queue:60:
In file included from C:/cadappl/Vivado/2016.4/Vivado_HLS/2016.4/win64/tools/clang/bin\..\lib\clang\3.1/../../../include/c++/4.5.2\deque:61:
In file included from C:/cadappl/Vivado/2016.4/Vivado_HLS/2016.4/win64/tools/clang/bin\..\lib\clang\3.1/../../../include/c++/4.5.2\bits/allocator.h:48:
In file included from C:/cadappl/Vivado/2016.4/Vivado_HLS/2016.4/win64/tools/clang/bin\..\lib\clang\3.1/../../../include/c++/4.5.2/x86_64-w64-mingw32\bits/c++allocator.h:34:
In file included from C:/cadappl/Vivado/2016.4/Vivado_HLS/2016.4/win64/tools/clang/bin\..\lib\clang\3.1/../../../include/c++/4.5.2\ext/new_allocator.h:33:
In file included from C:/cadappl/Vivado/2016.4/Vivado_HLS/2016.4/win64/tools/clang/bin\..\lib\clang\3.1/../../../include/c++/4.5.2\new:41:
In file included from C:/cadappl/Vivado/2016.4/Vivado_HLS/2016.4/win64/tools/clang/bin\..\lib\clang\3.1/../../../include/c++/4.5.2\exception:150:
C:/cadappl/Vivado/2016.4/Vivado_HLS/2016.4/win64/tools/clang/bin\..\lib\clang\3.1/../../../include/c++/4.5.2\exception_ptr.h:132:13: error: unknown type name 'type_info'
const type_info*
^
In file included from hls_target.cpp:1:

When I removed -std=c++0x, I got an error saying that static_assert() is not supported.
I don't know whether this happens only in Vivado HLS 2016.4.
Would it be better to replace the static_assert with something else? Or is there a configuration I missed that is needed to compile hls_stream.h properly?

Thanks.

reinterpret<>() for hls_target.cpp

Hello,

I found that reinterpret<> can appear in the generated hls_target.cpp, where it causes compile errors because Vivado HLS does not know how to handle it. In the following example, the constant '1' in 'Pos & 1' is reinterpreted unnecessarily.

  Param<uint8_t> Pos;

  MyPipeline():
          input(Int(32), 2),
          hw_output("hw_output"),
          output("output")
  {
    padded = BoundaryConditions::constant_exterior(input, 0);

    Expr xOffset = Pos & 1;
    Expr yOffset = (Pos >> 1)&1;

    hw_output(x, y) = padded(x + xOffset, y + yOffset);

    output(x, y) = hw_output(x, y);

    args = {input, Pos};
  }

I think it should be dealt with in CodeGen_HLS_Target::CodeGen_HLS_C rather than in CodeGen_C.

bad_alloc

Hello Jing,

Here is a small test case that runs into a bad_alloc error:

#include "Halide.h"
#include <stdio.h>
using namespace Halide;
int main(int argc, char **argv) {
    Func e, f, g;
    Var x;
    e(x) =x;
    f(x) = e(x);
    g(x) = f(x); 

    Var xi, xo;
    g.split(x, xo, xi, 16).accelerate({e}, xi, xo);
    f.linebuffer();

    Image<int> out = g.realize(100);//, target);
    return 0;
}

Halide-HLS]$ g++ -std=c++11 -g -fno-omit-frame-pointer -fno-rtti -Wall -Werror -Wno-unused-function -Wcast-qual -Wignored-qualifiers -Wno-comment -Wsign-compare -O3 test/correctness/gpu_dynamic_shared.cpp -Iinclude -Lbin -lHalide -lpthread -ldl -lz -rdynamic -Wl,--rpath=/home/hrong/Halide-HLS/bin -o t

Halide-HLS]$ ./t
Warning at test/correctness/gpu_dynamic_shared.cpp:34:
No linebuffer inserted after function f.
Warning at test/correctness/gpu_dynamic_shared.cpp:34:
No linebuffer inserted after function .
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)

Thanks!
Hongbo

Accelerating last function (not hw_output) not possible?

Hello,

Is it impossible to accelerate output itself, instead of a hw_output placed just in front of output as in the other examples? The following code generates an incorrect pipeline_hls.cpp that fails to compile with "_p2_output was not declared in this scope". I'm curious whether this is a bug or an unsupported feature.

#include "Halide.h"
#include <stdio.h>

using namespace Halide;

Var x,y,xo,yo,xi,yi;

class MyPipeline {
public:
ImageParam in;
Func in2;
Func output;

std::vector<Argument> args;

MyPipeline() : in(UInt(8), 2)
{
    in2(x,y) = in(x,y);
    output(x, y) =  in2(x, y) + 2 * in2(x+1, y) + 2 * in2(x+2, y) + in2(x+3, y);
    // Arguments
    args = {in};
}

void compile_hls() {
    std::cout << "\ncompiling HLS code..." << std::endl;

    output.tile(x, y, xo, yo, xi, yi, 480, 640);

    in2.compute_root();

    output.accelerate({in2}, xi, xo);
    // Create the target for HLS simulation
    Target hls_target = get_target_from_environment();
    hls_target.set_feature(Target::CPlusPlusMangling);
    output.compile_to_hls("pipeline_hls.cpp", args, "pipeline_hls", hls_target);
    output.compile_to_header("pipeline_hls.h", args, "pipeline_hls", hls_target);
}

};

int main(int argc, char **argv) {
MyPipeline p2;
p2.compile_hls();

return 0;

}

Calculation range fail

Dear all,
I tried to modify the conv_hls example. make pipeline_hls.cpp passes, but make out.png fails. The error message is:

Error: Input buffer input is accessed at 0, which is beyond the max (-1) in dimension 2
make: *** [out.png] Aborted (core dumped)

However, I do not expect max (-1) from my code. Is there anything wrong in my code?
My pipeline.cpp and run.cpp follow. Thanks.

pipeline.cpp

#include "Halide.h"
#include <string.h>

using namespace Halide;
using std::string;

Var x("x"), y("y"), c("c");
Var xo("xo"), xi("xi"), yi("yi"), yo("yo");

class MyPipeline {
    ImageParam input;
    ImageParam weight;
    Func clamped;
    Func input_buf;
    Func output;
    Func hw_output;
    std::vector<Argument> args;

    Func convolve55_rd(Func in) {
        Func local_sum, res;
        RDom r(-1, 3, -1, 3); 

        local_sum(x, y, c) = 0;

        local_sum(x, y, c) += cast<uint16_t>(in(x+r.x, y+r.y, c)) * weight(r.x+2, r.y+2);
        res(x, y, c) = cast<uint8_t>(local_sum(x, y, c) >> 8); 

        // unroll the reduction
        local_sum.update(0).unroll(r.x).unroll(r.y);
        return res;
    }   

public:
    MyPipeline() : input(UInt(8), 3, "input"),
                   weight(UInt(8), 2, "weight"),
                   output("output"), hw_output("hw_output")
                   {
        float sigma = 1.5f;

        // define the algorithm
        clamped = BoundaryConditions::repeat_edge(input);
        input_buf(x,y,c) = clamped(x,y,c);

        hw_output = convolve55_rd(clamped);
        output(x, y, c) = cast<uint8_t>(hw_output(x, y, c));

        // constraints
        output.bound(c, 0, 3); 

        weight.dim(0).set_bounds(0, 3); 
        weight.dim(1).set_bounds(0, 3); 
        weight.dim(0).set_stride(1);
        weight.dim(1).set_stride(3);
        args.push_back(input);
        args.push_back(weight);
    }  

    void compile_cpu() {
        std::cout << "\ncompiling cpu code..." << std::endl;

        //output.print_loop_nest();
        output.compile_to_lowered_stmt("pipeline_native.ir.html", args, HTML);
        output.compile_to_header("pipeline_native.h", args, "pipeline_native");
        output.compile_to_object("pipeline_native.o", args, "pipeline_native");
    }

    void compile_gpu() {
        std::cout << "\ncompiling gpu code..." << std::endl;

        output.compute_root();

        Target target = get_target_from_environment();
        target.set_feature(Target::CUDA);
        output.compile_to_lowered_stmt("pipeline_cuda.ir.html", args, HTML, target);
        output.compile_to_header("pipeline_cuda.h", args, "pipeline_cuda", target);
        output.compile_to_object("pipeline_cuda.o", args, "pipeline_cuda", target);
    }

    void compile_hls() {
        std::cout << "\ncompiling HLS code..." << std::endl;

        clamped.compute_root(); // prepare the input for the whole image

        // HLS schedule: make a hw pipeline producing 'hw_output', taking
        // inputs of 'clamped', buffering intermediates at (output, xo) loop
        // level
        hw_output.compute_root();
        hw_output.tile(x, y, xo, yo, xi, yi, 1, 1);
        hw_output.accelerate({clamped}, x, x);  // define the inputs and the output

        Target hls_target = get_target_from_environment();
        hls_target.set_feature(Target::CPlusPlusMangling);
        output.compile_to_lowered_stmt("pipeline_hls.ir.html", args, HTML, hls_target);
        output.compile_to_hls("pipeline_hls.cpp", args, "pipeline_hls", hls_target);
        output.compile_to_header("pipeline_hls.h", args, "pipeline_hls", hls_target);
    }
};


int main(int argc, char **argv) {
    MyPipeline p1;
    p1.compile_cpu();

    MyPipeline p2;
    p2.compile_hls();

    MyPipeline p3;
    p3.compile_gpu();
    return 0;
}

run.cpp

#include <cstdio>
#include <cstdlib>
#include <cassert>
#include <math.h>

#include "pipeline_hls.h"
#include "pipeline_native.h"

#include "BufferMinimal.h"
#include "halide_image_io.h"

using Halide::Runtime::HLS::BufferMinimal;
using namespace Halide::Tools;

int main(int argc, char **argv) {
    BufferMinimal<uint8_t> input = load_image(argv[1]);
    BufferMinimal<uint8_t> weight(3,3);
//    BufferMinimal<uint8_t> input(5,5,3), weight(3,3);
    BufferMinimal<uint8_t> out_native(input.width(),input.height());
    BufferMinimal<uint8_t> out_hls(input.width(), input.height());

    printf("start.\n");

    pipeline_native(input, weight, out_native);
//    save_image(out_native, "out.png");

    printf("finish running native code\n");
    pipeline_hls(input, weight, out_native);

    printf("finish running HLS code\n");

    return 0;
}
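
One thing that can be checked by hand in the pipeline.cpp above is the index range used for weight. In Halide, both RDom(min, extent) and set_bounds(min, extent) cover the indices min through min + extent - 1. The sketch below does that interval arithmetic in plain C++; the `Interval` type and helper names are made up for illustration. It is not claimed to explain the reported error, which concerns dimension 2 of input.

```cpp
#include <cassert>

// Closed integer interval [min, max].
struct Interval {
    int min;
    int max;
};

// Indices covered by a (min, extent) pair: min .. min + extent - 1.
inline Interval from_min_extent(int min, int extent) {
    Interval r = {min, min + extent - 1};
    return r;
}

// Interval of (index + offset) for every index in r.
inline Interval shift(Interval r, int offset) {
    Interval s = {r.min + offset, r.max + offset};
    return s;
}

// True if every index in inner also lies in outer.
inline bool contains(Interval outer, Interval inner) {
    return outer.min <= inner.min && inner.max <= outer.max;
}
```

With `RDom r(-1, 3, -1, 3)`, r.x spans [-1, 1], so `weight(r.x+2, r.y+2)` reads indices [1, 3], while `set_bounds(0, 3)` declares [0, 2]. An offset of 1 instead of 2 would keep the reads inside the declared bounds.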

C++11 feature is not included in makefile

I tried to compile my own Halide code, but there is a compilation error; the error output is below.
The errors involve C++11 features, so I think something is wrong in the makefile.

/users/student/mr105/cylin/Halide-HLS/apps/hls_examples/conv_hls/../../../include/HalideBuffer.h:101:27: error: a brace-enclosed initializer is not allowed here before ‘{’ token
/users/student/mr105/cylin/Halide-HLS/apps/hls_examples/conv_hls/../../../include/HalideBuffer.h:101:29: sorry, unimplemented: non-static data member initializers
/users/student/mr105/cylin/Halide-HLS/apps/hls_examples/conv_hls/../../../include/HalideBuffer.h:101:29: error: in-class initialization of static data member ‘buf’ of non-literal type
/users/student/mr105/cylin/Halide-HLS/apps/hls_examples/conv_hls/../../../include/HalideBuffer.h:108:31: sorry, unimplemented: non-static data member initializers
/users/student/mr105/cylin/Halide-HLS/apps/hls_examples/conv_hls/../../../include/HalideBuffer.h:108:31: error: ‘constexpr’ needed for in-class initialization of static data member ‘alloc’ of non-integral type
/users/student/mr105/cylin/Halide-HLS/apps/hls_examples/conv_hls/../../../include/HalideBuffer.h:111:47: sorry, unimplemented: non-static data member initializers
/users/student/mr105/cylin/Halide-HLS/apps/hls_examples/conv_hls/../../../include/HalideBuffer.h:111:47: error: ‘constexpr’ needed for in-class initialization of static data member ‘dev_ref_count’ of non-integral type
/users/student/mr105/cylin/Halide-HLS/apps/hls_examples/conv_hls/../../../include/HalideBuffer.h:118:5: error: expected unqualified-id before ‘using’
/users/student/mr105/cylin/Halide-HLS/apps/hls_examples/conv_hls/../../../include/HalideBuffer.h:122:11: error: expected nested-name-specifier before ‘not_void_T’
/users/student/mr105/cylin/Halide-HLS/apps/hls_examples/conv_hls/../../../include/HalideBuffer.h:122:11: error: using-declaration for non-member at class scope
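
The messages above mention non-static data member initializers and `using` alias declarations, both of which are C++11 features. A small probe like the one below can confirm whether a given compiler invocation has C++11 enabled. It is a sketch, not part of Halide-HLS: the type and function names are made up, and the `__cplusplus` check assumes a compiler that reports the standard version accurately.

```cpp
#include <cassert>

#if __cplusplus < 201103L
#error "C++11 is not enabled (with g++, try adding -std=c++11)."
#endif

// Non-static data member initializer (C++11).
struct Counter {
    int count = 0;
};

// Alias declaration (C++11).
using index_t = int;

// Uses both features so that the file only compiles in C++11 mode.
inline int cxx11_probe() {
    Counter c;
    index_t i = 41;
    return c.count + i + 1;
}
```

If this file compiles only after adding a flag such as -std=c++11 to the command line, the same flag is likely missing from the build rule that compiles the code including HalideBuffer.h.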

"make run_hls" fails inside Vivado HLS

Hi @jingpu,

I followed the instructions provided in the Wiki, then tried to run an example from Halide-HLS/apps/hls_examples (in my case, the conv_hls and Gaussian_hls examples). The cpp code is generated and the project is created in Vivado HLS, but compilation fails when running C-simulation. Here is the first error:

/XXXXXX/XXXXX/Halide-HLS/apps/hls_examples/conv_hls/../../../include/HalideBuffer.h:74:22: error: function definition does not declare parameters

I also tried to open the project and run it in Vivado HLS, and observed the same problem. In addition, I tried to run "C-synthesis" without doing the C-simulation, and it worked.

Here is the information about my system and software:
Linux: Ubuntu 14.04.5 LTS (Release: 14.04)
gcc version 4.8.4
Vivado HLS version 2015.4

Thank you in advance!

FFT in Halide-HLS

Hi @jingpu

I am trying to implement 2D FFT in Halide-HLS.

My first approach is to use the Halide FFT (provided in apps/fft), even though it works in floating point.
I am starting with the 1D FFT, so I wrote a function named "My_fft2d_r2c", which is basically the first part of the "fft2d_r2c" function of the Halide FFT in the "fft.cpp" file. The "My_fft2d_r2c" function is attached below.

I changed all occurrences of "target.natural_vector_size()" to 1.
I also removed the following scheduling code from the last lines of the "fft_dim1" function in the "fft.cpp" file.

    for (size_t i = 0; i + 1 < stages.size(); i++) {
        Func stage = stages[i].first;
        stage.compute_at(x, group).update().vectorize(n0);
    }

My "pipeline.cpp" is also attached below. Since the ".accelerate" method acts on Funcs only (NOT ComplexFuncs), the "re()" operator is used for "hw_output()" (otherwise there is an error complaining about the stream size).

The PROBLEM is:
When I run "make pipeline_hls.cpp", Halide-HLS reports this error:

Internal error at /<Halide-HLS-directory path>/src/StreamOpt.cpp:408
Condition failed: produce && consume && produce->name == consume->name
Aborted (core dumped)

I believe it is caused by this line of code in the "fft_dim1" function:

exchange(A({n0, n1}, args)) = undef_z(V.output_types()[0]);

If I try to bypass the error by changing that line as follows:

exchange(A({n0, n1}, args)) = x(A({n0, n1}, args));

it reports another error:

Internal error at /<Halide-HLS-directory path>/src/ExtractHWKernelDAG.cpp:275 triggered by user code at ./pipeline.cpp:89:
Condition failed: extent_int
stencil window extent (((max((hw_output.s0.y.yi.base + hw_output.s0.y.yi), 15) - min((hw_output.s0.y.yi.base + hw_output.s0.y.yi), 0)) + 1)) is not a const.
Aborted (core dumped)

It reports that error even though the extent of yi is specified in the tile command:

hw_output.tile(x, y, xo, yo, xi, yi, 4, 4).accelerate({In_f},xi, xo);

Can you please help me with this?
Thanks!

Attachments:

Attachment No.1:
Definition of "My_fft2d_r2c" function: (which is basically the first parts of "fft2d_r2c" function of halide fft in "fft.cpp" file)

ComplexFunc My_fft2d_r2c(Func r,
                      const vector<int> &R0,
                      const vector<int> &R1,
                      const Target& target,
                      const Fft2dDesc& desc) {

string prefix = desc.name.empty() ? "r2c_" : desc.name + "_";

    vector<Var> args(r.args());
    Var n0(args[0]), n1(args[1]);
    args.erase(args.begin());
    args.erase(args.begin());

    // Get the innermost variable outside the FFT.
    Var outer = Var::outermost();
    if (!args.empty()) {
        outer = args.front();
    }

    int N0 = product(R0);
    int N1 = product(R1);

    // Cache of twiddle factors for this FFT.
    TwiddleFactorSet twiddle_cache;

    // The gain requested of the FFT.
    Expr gain = desc.gain;

    ComplexFunc zipped(prefix + "zipped");
    int zip_width = desc.vector_width;
    if (zip_width <= 0) {
        zip_width = 1;
    }
    // Ensure the zip width divides the zipped extent.
    zip_width = gcd(zip_width, N0 / 2);
    Expr zip_n0 = (n0 / zip_width) * zip_width * 2 + (n0 % zip_width);
    zipped(A({n0, n1}, args)) =
        ComplexExpr(r(A({zip_n0, n1}, args)),
                    r(A({zip_n0 + zip_width, n1}, args)));

    // DFT down the columns first.
    ComplexFunc dft1;
    dft1 = fft_dim1(zipped,
                                R1,
                                -1,  // sign
                                std::min(zip_width, N0 / 2),  // extent of dim 0
                                1.0f,
                                false,  // We parallelize unzipped below instead.
                                prefix,
                                target,
                                &twiddle_cache);    
    
    return dft1;
}

ComplexFunc My_fft2d_r2c(Func r,
                      int N0, int N1,
                      const Target& target,
                      const Fft2dDesc& desc) {
    return My_fft2d_r2c(r, radix_factor(N0), radix_factor(N1), target, desc);
}

Attachment No.2:
The content of "pipeline.cpp" file:

#include "Halide.h"
#include "fft.h"
#include "complex.h"


using namespace Halide;

Var x("x"), y("y"), z("z"), c("c");
Var xo("xo"), yo("yo"), xi("xi"), yi("yi");

class MyPipeline {
public:
    ImageParam Input_Image;
	Func In_f;
    Func output, hw_output;
    ComplexFunc tmpfunc;
    std::vector<Argument> args;

    MyPipeline()
        : Input_Image(UInt(8), 2),
		  In_f("In_f"),
		  tmpfunc("tmpfunc"),
          hw_output("hw_output"),
		  output("output")
    {
	Target target = get_jit_target_from_environment();
	
	Fft2dDesc fwd_desc;
    
	In_f(x,y) = Halide::cast<float>(Input_Image(x,y));
	tmpfunc = My_fft2d_r2c(In_f, 16, 16, target, fwd_desc);
	hw_output(x,y) = re(tmpfunc(x,y));
	output(x,y) = hw_output(x,y);

   // Arguments
   args = {Input_Image};

    }

    void compile_hls() {
        std::cout << "\ncompiling HLS code..." << std::endl;

		output.tile(x, y, xo, yo, xi, yi, 4, 4);
		hw_output.compute_at(output, xo);
		In_f.compute_root();
        hw_output.tile(x, y, xo, yo, xi, yi, 4, 4).accelerate({In_f},xi, xo);
		
        // Create the target for HLS simulation
        Target hls_target = get_target_from_environment();
        hls_target.set_feature(Target::CPlusPlusMangling);
std::cout << "\ncompiling HLS1 code..." << std::endl;
        output.compile_to_lowered_stmt("pipeline_hls.ir.html", args, HTML, hls_target);
std::cout << "\ncompiling HLS2 code..." << std::endl;
        output.compile_to_hls("pipeline_hls.cpp", args, "pipeline_hls", hls_target);
std::cout << "\ncompiling HLS3 code..." << std::endl;
        output.compile_to_header("pipeline_hls.h", args, "pipeline_hls", hls_target);

        std::vector<Target::Feature> features({Target::Zynq});
        Target target(Target::Linux, Target::ARM, 32, features);
        output.compile_to_zynq_c("pipeline_zynq.c", args, "pipeline_zynq", target);
        output.compile_to_header("pipeline_zynq.h", args, "pipeline_zynq", target);

        output.compile_to_object("pipeline_zynq.o", args, "pipeline_zynq", target);
        output.compile_to_lowered_stmt("pipeline_zynq.ir.html", args, HTML, target);
        output.compile_to_assembly("pipeline_zynq.s", args, "pipeline_zynq", target);
        
    }
};

int main(int argc, char **argv) {

    MyPipeline p2;
    p2.compile_hls();

    return 0;
}

Support out-of-source build with proper cmake file

Seg fault when using the packed arguments calling convention

After merging with release 2016/10/25, calling an HLS pipeline through the packed-argument calling convention causes a seg fault at runtime.

To reproduce, uncomment the statements in the following source, and run make in the folder.

Halide-HLS/apps/hls_examples/conv_hls/run.cpp

Line 50 in d387094

FIXME: the following calling convention causes Seg fault.

Support zero-copy buffer on Zynq

Blocked by #6

hls_target.cpp without any content

Dear all,
I tried my own example by modifying pipeline.cpp. No errors are reported in the first two steps, but when I execute make run_hls, it says there is no top function.
After tracing, I found that the hls_target.cpp generated by the pipeline_hls.cpp step has no content. My pipeline.cpp follows. Is there anything wrong in my code?

pipeline.cpp

#include "Halide.h"
#include <string.h>

using namespace Halide;
using std::string;

Var x("x"), y("y"), c("c");

class MyPipeline {
    ImageParam image;
    ImageParam sobel;
    Func grey, sobel_hor, sobel_ver;
    Func output;
    Func hw_output;
    std::vector<Argument> args;

    RDom r;

public:
    MyPipeline() : image(UInt(8), 3, "input"), sobel(UInt(8), 2, "weight"),
                   sobel_hor("sobel_hor"), sobel_ver("sobel_ver"),
                   output("output"), hw_output("hw_output"),
                   r(0, 3, 0, 3) {
        // Algorithm
        grey(x,y) = (float)0.1*image(x,y,0) + (float)0.6*image(x,y,1) + (float)0.3*image(x,y,2);
        Expr x_sob_hor = clamp(x+r.x-1,0,4);
        Expr y_sob_hor = clamp(y+r.y-1,0,4);
        sobel_hor(x,y) = sum( grey(x_sob_hor,y_sob_hor) * sobel(r.x,r.y) );
        Expr x_sob_ver = clamp(x+r.x-1,0,4);
        Expr y_sob_ver = clamp(y+r.y-1,0,4);
        sobel_ver(x,y) = sum( grey(x_sob_ver,y_sob_ver) * sobel(r.y,r.x) );
        hw_output(x,y) = sobel_hor(x,y)+sobel_ver(x,y);
        output(x,y)    = hw_output(x,y);

        args.push_back(image);
        args.push_back(sobel);
    }   

    void compile_cpu() {
        std::cout << "\ncompiling cpu code..." << std::endl;

        //output.print_loop_nest();
        output.compile_to_lowered_stmt("pipeline_native.ir.html", args, HTML);
        output.compile_to_header("pipeline_native.h", args, "pipeline_native");
        output.compile_to_object("pipeline_native.o", args, "pipeline_native");
    }

    void compile_gpu() {
        std::cout << "\ncompiling gpu code..." << std::endl;

        Target target = get_target_from_environment();
        target.set_feature(Target::CUDA);
        output.compile_to_lowered_stmt("pipeline_cuda.ir.html", args, HTML, target);
        output.compile_to_header("pipeline_cuda.h", args, "pipeline_cuda", target);
        output.compile_to_object("pipeline_cuda.o", args, "pipeline_cuda", target);
    }

    void compile_hls() {
        std::cout << "\ncompiling HLS code..." << std::endl;

        //output.print_loop_nest();
        Target hls_target = get_target_from_environment();
        hls_target.set_feature(Target::CPlusPlusMangling);
        output.compile_to_lowered_stmt("pipeline_hls.ir.html", args, HTML, hls_target);
        output.compile_to_hls("pipeline_hls.cpp", args, "pipeline_hls", hls_target);
        output.compile_to_header("pipeline_hls.h", args, "pipeline_hls", hls_target);
    }
};


int main(int argc, char **argv) {
    MyPipeline p1;
    p1.compile_cpu();

    MyPipeline p2;
    p2.compile_hls();

    MyPipeline p3;
    p3.compile_gpu();

    return 0;
}

Use Halide::Target to infer HLS code generation

Instead of having compile_to_hls, we should reuse compile_to_c and infer HLS code generation from a Halide::Target object.

'HalideRuntime.h' file not found

I have built hls_prj successfully.
However, I can't run C synthesis in Vivado HLS.
Is there anything wrong in my settings?

static_assert in header files

It seems Vivado HLS (v2016.4) does not recognize static_assert in the header files Stencil.h and LineBuffer.h. It gives the following warning:

In file included from ./hls_target.h:8:
../../../include\Stencil.h:56:5: error: use of undeclared identifier 'static_assert'; did you mean 'static_cast'?
    static_assert(sizeof(T) * 8 == N, "bitcast_to_type parameters are incorrect.\n");
    ^
In file included from hls_target.cpp:3:
../../../include\Linebuffer.h:20:5: error: use of undeclared identifier 'static_assert'; did you mean 'static_cast'?
    static_assert(IMG_EXTENT_0 >= OUT_EXTENT_0, "image extent not is larger than output.");
    ^

and the errors:

Pragma processor failed: In file included from hls_target.cpp:1:
In file included from ./hls_target.h:8:
../../../include\Stencil.h:56:5: error: use of undeclared identifier 'static_assert'; did you mean 'static_cast'?
    static_assert(sizeof(T) * 8 == N, "bitcast_to_type parameters are incorrect.\n");
    ^
In file included from hls_target.cpp:3:
../../../include\Linebuffer.h:20:5: error: use of undeclared identifier 'static_assert'; did you mean 'static_cast'?
    static_assert(IMG_EXTENT_0 >= OUT_EXTENT_0, "image extent not is larger than output.");
    ^

Is this due to my version of Vivado_HLS or halide_hls? Help would be much appreciated! Thanks.
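
One common way around a front end that lacks static_assert is a compile-time check built from an array typedef. The sketch below wraps both forms behind a single macro. The macro and function names are hypothetical and are not part of Stencil.h or LineBuffer.h; it also assumes that `__cplusplus` reflects the language mode the front end actually implements, which may not hold for every tool version.

```cpp
#include <cassert>

#if __cplusplus >= 201103L
// C++11 and later: use the language feature directly.
#define HLS_STATIC_ASSERT(cond, msg) static_assert(cond, msg)
#else
// Pre-C++11 fallback: a negative array size is a compile error when cond is false.
#define HLS_SA_CONCAT_(a, b) a##b
#define HLS_SA_CONCAT(a, b) HLS_SA_CONCAT_(a, b)
#define HLS_STATIC_ASSERT(cond, msg) \
    typedef char HLS_SA_CONCAT(hls_static_assert_line_, __LINE__)[(cond) ? 1 : -1]
#endif

// Example use, modelled on the size check quoted above: T must be N bits wide.
template <typename T, int N>
inline int checked_bit_width() {
    HLS_STATIC_ASSERT(sizeof(T) * 8 == N, "bitcast_to_type parameters are incorrect.");
    return N;
}
```

`checked_bit_width<unsigned char, 8>()` compiles and returns 8, whereas `checked_bit_width<unsigned char, 16>()` is rejected at compile time in either mode. The fallback loses the custom message, so the error points only at the offending line.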

Adopt device interface

We should use Halide's device interface for generating HLS kernel code (device) and testbench wrapper code (host) and the Zynq host code.

Reduction Operation in Halide-HLS

I am trying to implement a simple algorithm in Halide-HLS that requires a reduction sum over the whole image to compute the average. The "pipeline.cpp" file is attached below.

Running "make pipeline_hls.cpp" results in the following error:

Internal error at /Halide-HLS directory path/src/ExtractHWKernelDAG.cpp:307 triggered by user code at ./pipeline.cpp:69:
Condition failed: consumer_stencils.size() > 0
Aborted (core dumped)

Changing the definition of the pipeline from:

Input(x,y) = Input_Image(x, y);

image_mean() = Halide::cast<uint32_t>(0);
image_mean() += Halide::cast<uint32_t>(Input(win.x, win.y));
        
image_mean() = image_mean() >> (W_2_Power + H_2_Power);
        
hw_output(x,y) = Input(x,y) - (Halide::cast<uint8_t>(image_mean())); 
output(x,y) = hw_output(x,y);

to:

Input(x,y) = Input_Image(x, y);

image_mean(x,y) = Halide::cast<uint32_t>(0);
image_mean(x,y) += Halide::cast<uint32_t>(Input(win.x, win.y));
        
image_mean(x,y) = image_mean(x,y) >> (W_2_Power + H_2_Power);
        
hw_output(x,y) = Input(x,y) - (Halide::cast<uint8_t>(image_mean(x,y))); 
output(x,y) = hw_output(x,y);

results in the following error:

Internal error at  /Halide-HLS directory path/src/ExtractHWKernelDAG.cpp:275 triggered by user code at ./pipeline.cpp:69:
Condition failed: extent_int
stencil window extent (((max((hw_output$1.s0.x.xi.base + hw_output$1.s0.x.xi), 255) - min((hw_output$1.s0.x.xi.base + hw_output$1.s0.x.xi), 0)) + 1)) is not a const.
Aborted (core dumped)

But changing this line:

image_mean(x,y) += Halide::cast<uint32_t>(Input(win.x, win.y));

to:

image_mean(x,y) += Halide::cast<uint32_t>(Input(x+win.x, y+win.y));

works, but the extracted "hls_target" function is not efficient, since it computes the average again for each pixel.

Any ideas on how to compute and use the image average efficiently?

Thanks!
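
For reference, the intended computation (one pass to get the mean, a second pass to subtract it) looks like this in plain C++. This is only a software model of the arithmetic described above, with made-up function names, a fixed 256x256 image, and the same shift of W_2_Power + H_2_Power = 16. It does not show how to express that schedule in Halide-HLS.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Mean of a 256x256 8-bit image: sum every pixel once, then shift right by 16.
// The largest possible sum, 255 * 65536, fits in uint32_t.
inline uint8_t image_mean_256x256(const std::vector<uint8_t> &img) {
    assert(img.size() == 256u * 256u);
    uint32_t sum = 0;
    for (std::size_t i = 0; i < img.size(); i++) {
        sum += img[i];
    }
    return static_cast<uint8_t>(sum >> 16);
}

// Subtract the mean, computed once, from every pixel.
// The uint8_t result wraps modulo 256 when a pixel is below the mean.
inline std::vector<uint8_t> subtract_mean(const std::vector<uint8_t> &img) {
    const uint8_t mean = image_mean_256x256(img);
    std::vector<uint8_t> out(img.size());
    for (std::size_t i = 0; i < img.size(); i++) {
        out[i] = static_cast<uint8_t>(img[i] - mean);
    }
    return out;
}
```

The mean is evaluated exactly once per image here, which is the behaviour the question is trying to obtain from the generated hls_target function.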

Attachment:
Content of "pipeline.cpp" file:

#include "Halide.h"
#include <stdio.h>

#define Image_Width 256
#define W_2_Power 8
#define Image_Height 256
#define H_2_Power 8

using namespace Halide;

Var x("x"), y("y"), z("z"), c("c");
Var xo("xo"), yo("yo"), xi("xi"), yi("yi");


class MyPipeline {
public:
    ImageParam Input_Image;
    Func output;
    Func hw_output;
    std::vector<Argument> args;
    Func Input;
    Func image_mean;
    RDom win;

    MyPipeline()
        : Input_Image(UInt(8), 2),
          hw_output("hw_output"),
          output("output"),
          win(0, Image_Width, 0, Image_Height)
    {
        Input(x,y) = Input_Image(x, y);

        image_mean() = Halide::cast<uint32_t>(0);
        image_mean() += Halide::cast<uint32_t>(Input(win.x, win.y));

        image_mean() = image_mean() >> (W_2_Power + H_2_Power);

        hw_output(x,y) = Input(x,y) - (Halide::cast<uint8_t>(image_mean()));
        output(x,y) = hw_output(x,y);

        // Arguments
        args = {Input_Image};
    }


	
    void compile_cpu() {
        std::cout << "\ncompiling cpu code..." << std::endl;

        output.tile(x, y, xo, yo, xi, yi, Image_Width, Image_Height);
        output.compile_to_header("pipeline_native.h", args, "pipeline_native");
        output.compile_to_object("pipeline_native.o", args, "pipeline_native");
    }

    void compile_hls() {
        std::cout << "\ncompiling HLS code..." << std::endl;
		
        output.tile(x, y, xo, yo, xi, yi, Image_Width, Image_Height);
        hw_output.compute_at(output, xo);
        Input.compute_at(output, xo);
        hw_output.tile(x, y, xo, yo, xi, yi, Image_Width, Image_Height).accelerate({Input}, xi, xo);
        Input.fifo_depth(hw_output, Image_Width * Image_Height);
		
        output.print_loop_nest();
        // Create the target for HLS simulation
        Target hls_target = get_target_from_environment();
        hls_target.set_feature(Target::CPlusPlusMangling);
        output.compile_to_lowered_stmt("pipeline_hls.ir.html", args, HTML, hls_target);
        output.compile_to_hls("pipeline_hls.cpp", args, "pipeline_hls", hls_target);
        output.compile_to_header("pipeline_hls.h", args, "pipeline_hls", hls_target);

        std::vector<Target::Feature> features({Target::Zynq});
        Target target(Target::Linux, Target::ARM, 32, features);
        output.compile_to_zynq_c("pipeline_zynq.c", args, "pipeline_zynq", target);
        output.compile_to_header("pipeline_zynq.h", args, "pipeline_zynq", target);


        output.compile_to_object("pipeline_zynq.o", args, "pipeline_zynq", target);
        output.compile_to_lowered_stmt("pipeline_zynq.ir.html", args, HTML, target);
        output.compile_to_assembly("pipeline_zynq.s", args, "pipeline_zynq", target);
    }
};

int main(int argc, char **argv) {
    MyPipeline p1;
    p1.compile_cpu();

    MyPipeline p2;
    p2.compile_hls();

    return 0;
}

Fixed-point arithmetic

Hi,

Does the current version of Halide-HLS support fixed-point arithmetic?
Is it generated automatically by the compiler, or do I still need to specify it in the Halide source code?
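
To make "specify it in the Halide source code" concrete: the convolution example elsewhere in these issues does fixed-point by hand, widening with cast<uint16_t>(...), multiplying by an 8-bit weight, and shifting right by 8. A plain C++ sketch of that same arithmetic follows. The helper name scale_q8 and the reading of the weight as weight / 256 are assumptions for illustration, and the sketch does not show whether the compiler can generate this automatically.

```cpp
#include <cstdint>

// Hypothetical helper: multiply an 8-bit pixel by an 8-bit weight that is
// interpreted as weight_q8 / 256 (8 fractional bits).
uint8_t scale_q8(uint8_t pixel, uint8_t weight_q8) {
    // Widen to 16 bits; the product is at most 255 * 255 = 65025.
    uint16_t product = static_cast<uint16_t>(pixel * weight_q8);
    // Drop the 8 fractional bits, truncating the result.
    return static_cast<uint8_t>(product >> 8);
}
```

With this convention a weight of 128 stands for 0.5, so scale_q8(200, 128) yields 100.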

Thank you.

Error occurs when removing schedule

Dear all,
I tried to modify the conv_hls example to create Halide code with a minimal schedule. However, after I removed hw_output.tile(x, y, xo, yo, xi, yi, 1, 1); from compile_hls() in my code, make pipeline_hls.cpp fails with this error message:

Internal error at /users/student/mr105/cylin/Halide-HLS/src/StreamOpt.cpp:656
Condition failed: dag.loop_vars.count(op->name)
Aborted (core dumped)

I do not use the Halide variables xo, xi, yo, yi anywhere else. Is there anything wrong in my code?
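
For reference, this is the loop structure that a tile(x, y, xo, yo, xi, yi, tx, ty) call describes, modelled in plain C++: x is split into xo and xi, y into yo and yi, with the inner loops running inside each tile. The function tiled_order is a made-up illustration, it assumes the image size is a multiple of the tile size, and it does not establish whether Halide-HLS needs these loop variables to exist.

```cpp
#include <utility>
#include <vector>

// Hypothetical model of a tiled loop nest: x = xo * tx + xi, y = yo * ty + yi.
// Assumes width and height are multiples of tx and ty.
std::vector<std::pair<int, int>> tiled_order(int width, int height, int tx, int ty) {
    std::vector<std::pair<int, int>> visited;
    for (int yo = 0; yo < height / ty; ++yo)          // outer loop over tile rows
        for (int xo = 0; xo < width / tx; ++xo)       // outer loop over tile columns
            for (int yi = 0; yi < ty; ++yi)           // rows inside one tile
                for (int xi = 0; xi < tx; ++xi)       // columns inside one tile
                    visited.emplace_back(xo * tx + xi, yo * ty + yi);
    return visited;
}
```

With 1 x 1 tiles the visiting order is the same as a plain row-major scan, but the loops named xo, yo, xi and yi still exist in the nest.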
The following is my pipeline.cpp. Thanks.

pipeline.cpp

#include "Halide.h"
#include <iostream>
#include <string>

using namespace Halide;
using std::string;

Var x("x"), y("y"), c("c");
Var xo("xo"), xi("xi"), yi("yi"), yo("yo");

class MyPipeline {
    ImageParam input;
    ImageParam weight;
    Func clamped;
    Func input_buf;
    Func output;
    Func hw_output;
    std::vector<Argument> args;

    Func convolve55_rd(Func in) {
        Func local_sum, res;
        RDom r(-1, 3, -1, 3); 

        local_sum(x, y, c) = 0;

        local_sum(x, y, c) += cast<uint16_t>(in(x+r.x, y+r.y, c)) * weight(r.x+2, r.y+2);
        res(x, y, c) = cast<uint8_t>(local_sum(x, y, c) >> 8); 

        // unroll the reduction
        local_sum.update(0).unroll(r.x).unroll(r.y);
        return res;
    }   

public:
    MyPipeline() : input(UInt(8), 3, "input"),
                   weight(UInt(8), 2, "weight"),
                   output("output"), hw_output("hw_output")
                   {
        float sigma = 1.5f;

        // define the algorithm
        clamped = BoundaryConditions::repeat_edge(input);
        input_buf(x,y,c) = clamped(x,y,c);

        hw_output = convolve55_rd(clamped);
        output(x, y, c) = cast<uint8_t>(hw_output(x, y, c));

        // constraints
        output.bound(c, 0, 3); 

        weight.dim(0).set_bounds(0, 3); 
        weight.dim(1).set_bounds(0, 3); 
        weight.dim(0).set_stride(1);
        weight.dim(1).set_stride(3);
        args.push_back(input);
        args.push_back(weight);
    }  

    void compile_cpu() {
        std::cout << "\ncompiling cpu code..." << std::endl;

        //output.print_loop_nest();
        output.compile_to_lowered_stmt("pipeline_native.ir.html", args, HTML);
        output.compile_to_header("pipeline_native.h", args, "pipeline_native");
        output.compile_to_object("pipeline_native.o", args, "pipeline_native");
    }

    void compile_gpu() {
        std::cout << "\ncompiling gpu code..." << std::endl;

        output.compute_root();

        Target target = get_target_from_environment();
        target.set_feature(Target::CUDA);
        output.compile_to_lowered_stmt("pipeline_cuda.ir.html", args, HTML, target);
        output.compile_to_header("pipeline_cuda.h", args, "pipeline_cuda", target);
        output.compile_to_object("pipeline_cuda.o", args, "pipeline_cuda", target);
    }

    void compile_hls() {
        std::cout << "\ncompiling HLS code..." << std::endl;

        clamped.compute_root(); // prepare the input for the whole image

        // HLS schedule: make a hw pipeline producing 'hw_output', taking
        // inputs of 'clamped', buffering intermediates at (output, xo) loop
        // level
        hw_output.compute_root();
        hw_output.tile(x, y, xo, yo, xi, yi, 1, 1);
        hw_output.accelerate({clamped}, x, x);  // define the inputs and the output

        Target hls_target = get_target_from_environment();
        hls_target.set_feature(Target::CPlusPlusMangling);
        output.compile_to_lowered_stmt("pipeline_hls.ir.html", args, HTML, hls_target);
        output.compile_to_hls("pipeline_hls.cpp", args, "pipeline_hls", hls_target);
        output.compile_to_header("pipeline_hls.h", args, "pipeline_hls", hls_target);
    }
};


int main(int argc, char **argv) {
    MyPipeline p1;
    p1.compile_cpu();

    MyPipeline p2;
    p2.compile_hls();

    MyPipeline p3;
    p3.compile_gpu();
    return 0;
}

Understanding Linebuffer in Halide-HLS

Dear all,

I have implemented a simple windowing operation in Halide-HLS and it works fine. The pipeline code is attached below.
The generated HLS code ("hls_target.cpp") contains a linebuffer instance with dimensions 66 by 66:

linebuffer<66, 66>

I have some questions about this linebuffer, which are:

1- Why 66 x 66 and not 64 x 64? I do not think it is the result of using the Halide boundary condition, since data is streamed to the pipeline after the boundary condition is applied.

2- Why instantiate a line buffer this big? For a 3 x 3 windowing operation on a tile in a streaming manner, just 3 rows (64 pixels each) of data would be sufficient to start windowing.

3- The Vivado-HLS synthesis report shows that it infers 2 instances of 18Kb block RAM for the line buffer. I cannot relate this amount of memory to the instantiated linebuffer. Any idea?
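
The numbers in these questions can be checked with a little arithmetic. The sketch below is one possible accounting, not confirmed behaviour of Halide-HLS: it assumes the template arguments give the extent of the streamed input region, and that the line buffer stores (taps - 1) full rows. The helper names input_extent and row_storage_bits are made up for illustration.

```cpp
// Hypothetical helpers for the arithmetic only; not part of Halide-HLS.

// Input extent needed along one axis to produce `tile` outputs with a
// `taps`-wide stencil (3 taps reach one pixel to each side).
constexpr int input_extent(int tile, int taps) {
    return tile + (taps - 1);
}

// Bits held if a line buffer keeps (taps - 1) full rows. This storage
// scheme is an assumption, not a documented fact.
constexpr int row_storage_bits(int row_width, int taps, int bits_per_pixel) {
    return (taps - 1) * row_width * bits_per_pixel;
}
```

Under these assumptions, a 64 x 64 output tile with the 3 x 3 neighborhood read by sum_x and sum_xy needs input_extent(64, 3) = 66 input pixels per axis, and row_storage_bits(66, 3, 16) gives 2112 bits for two rows of 16-bit pixels. The two block RAMs in the report might then correspond to one per buffered row, though that mapping is a guess.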

Thanks!

Attachment:

Content of "pipeline.cpp" file:

#include "Halide.h"
#include <iostream>

using namespace Halide;

Var x("x"), y("y"), z("z"), c("c");
Var xo("xo"), yo("yo"), xi("xi"), yi("yi");


class MyPipeline {
public:
    ImageParam Input_Image;
    Param<uint16_t> hotPixelStrength;
    Func output;
    Func hw_output;
    std::vector<Argument> args;
    Func Input;
    Func sum_x;
    Func sum_xy;
    Func Conv;

    MyPipeline()
        : Input_Image(UInt(16), 2),
          hw_output("hw_output"),
          output("output")
    {
        Input = Halide::BoundaryConditions::constant_exterior(Input_Image, 0);

        sum_x(x, y) = (Input(x - 1, y) + Input(x, y) + Input(x + 1, y));
        sum_xy(x, y) = (sum_x(x, y-1) + sum_x(x, y) + sum_x(x, y+1) - Input(x, y));

        Conv(x, y) = (sum_xy(x, y)) >> 3;

        hw_output(x,y) = select(Input(x,y) > hotPixelStrength, Conv(x, y), Input(x,y));
        output(x,y) = hw_output(x,y);

        // Arguments
        args = {Input_Image, hotPixelStrength};
    }


	
    void compile_cpu() {
        std::cout << "\ncompiling cpu code..." << std::endl;

        output.tile(x, y, xo, yo, xi, yi, 64, 64);
        output.compile_to_header("pipeline_native.h", args, "pipeline_native");
        output.compile_to_object("pipeline_native.o", args, "pipeline_native");
    }

    void compile_hls() {
        std::cout << "\ncompiling HLS code..." << std::endl;
		
        output.tile(x, y, xo, yo, xi, yi, 64, 64);
        hw_output.compute_at(output, xo);
        Input.compute_at(output, xo);
        hw_output.tile(x, y, xo, yo, xi, yi, 64, 64).accelerate({Input}, xi, xo);
		
        //output.print_loop_nest();
        // Create the target for HLS simulation
        Target hls_target = get_target_from_environment();
        hls_target.set_feature(Target::CPlusPlusMangling);
        output.compile_to_lowered_stmt("pipeline_hls.ir.html", args, HTML, hls_target);
        output.compile_to_hls("pipeline_hls.cpp", args, "pipeline_hls", hls_target);
        output.compile_to_header("pipeline_hls.h", args, "pipeline_hls", hls_target);

        std::vector<Target::Feature> features({Target::Zynq});
        Target target(Target::Linux, Target::ARM, 32, features);
        output.compile_to_zynq_c("pipeline_zynq.c", args, "pipeline_zynq", target);
        output.compile_to_header("pipeline_zynq.h", args, "pipeline_zynq", target);


        output.compile_to_object("pipeline_zynq.o", args, "pipeline_zynq", target);
        output.compile_to_lowered_stmt("pipeline_zynq.ir.html", args, HTML, target);
        output.compile_to_assembly("pipeline_zynq.s", args, "pipeline_zynq", target);
    }
};

int main(int argc, char **argv) {
    MyPipeline p1;
    p1.compile_cpu();

    MyPipeline p2;
    p2.compile_hls();

    return 0;
}
