halide / halide
A language for fast, portable data-parallel computation
Home Page: https://halide-lang.org
License: Other
Halide does not have an XOR operator. In my algorithm I can replace it with an inequality operator, but I've done some tests and (at least on my processor) the XOR operator is about 25% faster.
It would be nice to see an XOR operator in Halide. :)
It correctly fails, but does not issue any kind of useful error message. E.g.:
f.vectorize(x, 4).vectorize(y, 4);
Right now it just fails an assertion inside vectorize.ml
These are almost entirely duplicated functionality. They should be refactored into a single list.
They can be added trivially (with potentially poor performance) as simple loops. The longer-term plan is to rely on syrah.
Until it moves to the new NVPTX LLVM backend, the legacy PTX backend is missing most math lib functions (transcendentals, pow, etc.) and some other expected standard library features.
This would be valuable in a bunch of paths in the LLVM codegen, and potentially later for platform-specific optimization pre-passes.
I'm using the latest Halide release to implement a dilation algorithm in RGBA float, but it segfaults. If I remove the vectorize schedule, it works.
Uniform<int> radius = 20;
RDom dom(-radius, 2*radius+1, -radius, 2*radius+1);
structEl(x, y) = select(x*x + y*y <= radius, 1, 0);
dilation(x, y, c) = select(c < 3,
    maximum(select(structEl(dom.x, dom.y) == 1,
                   input(x + dom.x - radius, y + dom.y - radius, c),
                   0.0f)),
    input(x, y, 3));

// schedule
structEl.root();
dilation.vectorize(x, 8);
Is there any chance that Halide will generate OpenCL kernels for use on GPUs sometime in the future?
I want to use Halide in my desktop computer-graphics (photo processing) app, but many of my users have AMD cards, not NVIDIA.
The stateful nature of the LLVM OCaml bindings is nasty. It would be much nicer to have a thin ADT layer above this, much like the C types used in the C backend.
This could be started as part of the llvalue IR node, which would need to be constructable without a current "builder" context.
We are currently relying on a hacked branch of LLVM 3.1svn from around the SIGGRAPH deadline, which we had to patch to fix a variety of codegen bugs for ARM and add features to the PTX backend. This should be updated to 3.2svn. The limiting factor is that the PTX codegen needs to be updated to work with the new conventions of NVPTX instead of the older independent PTX target.
Hi,
If I try to write a reduction of the form:
RDom r(0, input);
f(x) += r;
and supply an input value of 0, the reduction fails to terminate instead of outputting the initial value.
Any thoughts?
Cheers
Known issues:
One of the things which makes the build difficult for first-timers is the need to install OCaml libraries on which we depend. On some platforms this is easy (modern Ubuntu tends to have many up-to-date OCaml packages in apt), while on others it involves a long chain of manual download/configure/make/installs which is an unnecessary distraction.
The build bootstrap process should have functionality which automatically fetches, builds, and installs the relevant libraries in a project-local path, and makes these discoverable to all subsequent build steps. odb is a straightforward option.
The challenge is to do this while also using the system packages when they exist and are sufficient, to avoid too much bloat.
Automatic GPU device selection should be overridable via an HL_GPU_DEVICE environment variable.
Halide works fine with the CPU backend. However, as soon as I try to use the CUDA backend with
HL_TARGET=ptx ./executable
I get an error like this:
...
%f0.v0_nextvar = add i32 %f0.v0, 1
%55 = icmp ne i32 %f0.v0_nextvar, %38
br i1 %55, label %f0.v0_loop, label %f0.v0_afterloop
f0.v0_afterloop: ; preds = %f0.v0_loop
call void @__free_buffer(%struct.buffer_t* %f0.f5_buf)
call void @fast_free(i8* %f0.f5)
ret void
}
LLVM ERROR: Program used external function 'cuCtxSynchronize' which could not be resolved!
Just add an unordered flag to Block and throw them into the task pool.
There is logic in src/myocamlbuild.ml to do this, but it seems to have stopped working correctly.
During autotuner debugging we found the following schedule for blur triggers a segfault rather than an error:
blur_y.tile(x, y, xi, yi, 2, 2)
blur_y.vectorize(xi, 8)
Identified by Victor Oliveira on the halide-dev list:
RDom r (-5, 11);
Func box_x("box_x");
box_x(x,y,c) += (clamped(c, x + r, y));
Func box_y("box_y");
box_y(x,y,c) += (box_x(x, y + r, c));
box_x.root().update().reorder(r,c,x,y).cudaTile(x,y,16,16);
box_y.root().update().reorder(r,c,x,y).cudaTile(x,y,16,16);
Options:
--use-local-llvm / --use-system-llvm
--use-local-clang / --use-system-clang
--use-local-ocaml-libs
The binary distribution of the Halide compiler is a 404 - Not Found.
Jim's interpolation algorithm test runs ~2x slower than gcc-4.6's optimized C result (on x86-64/Mac). We need to look into the generated code to sniff out why.
Would be similar to the ptx backend. Helpful for current-gen cell phones.
Currently, quite a few user errors result in assertions. These have been improved to be reasonably informative, but they should still be pushed off to a separate path from the implementation error asserts, specific to input program warnings and errors.
Hi,
Running bootstrap fails at that point:
--------------------------
Test: building halide.cmxa
--------------------------
Traceback (most recent call last):
File "util/bootstrap.py", line 80, in <module>
print ocamlbuild('-use-ocamlfind', 'halide.cmxa')
File "/home/hamstah/repos/Halide/util/pbs.py", line 352, in __call__
return RunningCommand(command_ran, process, call_args, actual_stdin)
File "/home/hamstah/repos/Halide/util/pbs.py", line 136, in __init__
if rc != 0: raise get_rc_exc(rc)(self.command_ran, self._stdout, self._stderr)
pbs.ErrorReturnCode_10:
Ran: '/usr/bin/ocamlbuild -use-ocamlfind halide.cmxa'
STDOUT:
ocamlfind ocamlopt -I /usr/lib/ocaml/ocamlbuild unix.cmxa /usr/lib/ocaml/ocamlbuild/ocamlbuildlib.cmxa myocamlbuild.ml /usr/lib/ocaml/ocamlbuild/ocamlbuild.cmx -o myocamlbuild
+ ocamlfind ocamlopt -I ... (214 more, please see e.stdout)
STDERR:
Running the command manually gives:
$ /usr/bin/ocamlbuild -use-ocamlfind halide.cmxa
Solver failed:
Ocamlbuild cannot find or build halide.ml. A file with such a name would usually be a source file. I suspect you have given a wrong target name to Ocamlbuild.
Backtrace:
- Failed to build the target halide.cmxa
- Building halide.cmxa:
- Failed to build all of these:
- Building halide.cmx:
- Failed to build all of these:
- Building halide.ml:
- Failed to build all of these:
- Building halide.mly
- Building halide.mll
- Building halide.mlpack
- Building halide.mllib
Compilation unsuccessful after building 0 targets (0 cached) in 00:00:00.
The .cmxa file seems to be missing:
$ find . -name "*.cmxa"
./llvm/Release+Asserts/lib/ocaml/llvm_bitwriter.cmxa
./llvm/Release+Asserts/lib/ocaml/llvm_analysis.cmxa
./llvm/Release+Asserts/lib/ocaml/llvm_bitreader.cmxa
./llvm/Release+Asserts/lib/ocaml/llvm_target.cmxa
./llvm/Release+Asserts/lib/ocaml/llvm_ipo.cmxa
./llvm/Release+Asserts/lib/ocaml/llvm_scalar_opts.cmxa
./llvm/Release+Asserts/lib/ocaml/llvm.cmxa
./llvm/Release+Asserts/lib/ocaml/llvm_executionengine.cmxa
./llvm/bindings/ocaml/target/Release+Asserts/llvm_target.cmxa
./llvm/bindings/ocaml/bitreader/Release+Asserts/llvm_bitreader.cmxa
./llvm/bindings/ocaml/llvm/Release+Asserts/llvm.cmxa
./llvm/bindings/ocaml/transforms/ipo/Release+Asserts/llvm_ipo.cmxa
./llvm/bindings/ocaml/transforms/scalar/Release+Asserts/llvm_scalar_opts.cmxa
./llvm/bindings/ocaml/executionengine/Release+Asserts/llvm_executionengine.cmxa
./llvm/bindings/ocaml/bitwriter/Release+Asserts/llvm_bitwriter.cmxa
./llvm/bindings/ocaml/analysis/Release+Asserts/llvm_analysis.cmxa
bootstrap also got stuck on the git submodule update for some reason; it might be because of older Python/modules, but running the command manually and then bootstrap again solved it.
The existing tuples support is awkward, limited, and confusing.
If we add debugging metadata to the LLVM backend, generated code will be able to show sane state in gdb.
I am trying to build Halide on VS 2010 and need some help. I have successfully compiled the code, but the problem is:
error LNK2019: unresolved external symbol "public: class Halide::DynImage __thiscall Halide::Func::realize(int)" (?realize@Func@Halide@@QAE?AVDynImage@2@H@Z) referenced in function _main
I know this is early, but I and I'm sure others would be interested in Windows support. I'm taking a guess that VC10 and maybe even VC11 don't have enough C++11 features implemented that would allow Halide to compile, so mingw is probably the way to go. Maybe someone more knowledgeable than I can explain some of the challenges that will be faced in porting to Windows.
Currently, bootstrap errors tend to be in the form of long log spew from a failed LLVM build or similar. In this case, the script tends to print the start of the log, rather than the end of it, where the actual information is.
Major issues:
- size_t strides[MAX_DIMS]: allow inputs with padded scanlines
- size_t offset[MAX_DIMS]: allow execution over a sub-region of a host buffer
- int dims field
Also of note: pointer sizes vary between architectures, which means that the size and offsets of the structure itself vary. This probably remains the right answer, but we need to be careful on 32-bit architectures.
This is a bug migrated from the trello list
Bug in the C++ layer. There should be a check for this. Possibly non-trivial, because reductions may recursively reference themselves.
TBB and Grand Central Dispatch would be valuable. This should be doable (almost?) entirely as alternative standard libraries.
When the c dimension moves from outside to inside, the initialization code radically changes (becomes incorrect) - inspect the generated code.
I'm using Halide (the precompiled libs) for an HDR fusion program and I need to create a Gaussian pyramid. There is an example in the local_laplacian code, but it doesn't use JIT (which is what I want).
I have this code:
Image<int> subsample()
{
    Func downx, downy;
    Var x, y;
    downx(x, y) = ((*this->image)(2*x-1, y) + 2 * (*this->image)(2*x, y) + (*this->image)(2*x+1, y)) / 4;
    downy(x, y) = (downx(x, 2*y-1) + 2 * downx(x, 2*y) + downx(x, 2*y+1)) / 4;
    int width = this->image->width() / 2, height = this->image->height() / 2;
    Image<int> out = downy.realize(width-1, height-1);
    return out;
}
This code obviously fails because it starts at x=0, y=0 and the indices have to be positive numbers.
How can I set the limits for "realize" on downy? Any help, please?
PS: I'm just starting with C++ because of Halide, you may think my code is horrible. :)
A very common pattern in Halide code loads from an image using a clamped index:
clamped(x, y) = input(clamp(x, 0, input.width() - 1), clamp(y, 0, input.height() - 1));
In the current backend, this generates unnecessarily conservative code when vectorized. A better strategy would be to generate a dynamic branch which detects if the index vector is near the edge of the clamp range, and if not, removes the clamp and generates a simple dense aligned vector load.
In py_bindings/test_blur.py, this schedule fails:
blur_y.root().tile(y,c,_c0,_c1,64,8)
It would be nice for users if we didn't require llc/opt to actually codegen and assemble a statically compiled pipeline. This will require either plumbing llvm-c's LLVMTargetMachineEmitToFile(...) through to the OCaml standard interface, or making our own shim in src/cllutil.c.
It's quite confusing. We need some better way to handle loop nesting order.
We don't currently do any bounds checking on input images, which causes segfaults. One subtle way this triggers is if you vectorize something which accesses the input image but the input image is not a multiple of the vector width.
See test/cpp/input_image_bounds_check/test.cpp for code that triggers this bug
We should add asserts at the function preamble that check this (conservatively).
Desired format:
<test_name>: {compile|link|run}
....E..
This was a bug listed on the trello board. It needs to be tested to see if it's still a bug and fixed if so.
Hi!
I've been trying to implement the SURF descriptor algorithm in Halide. I've run into the problem that the piece of code below (one of the parts of SURF) takes a really long time to compile (compileJIT and compileToFile) and it exits when memory completely fills up (4GB). It's really hard to know the source of the problem since there are no error messages.
I tried to reduce code size (removing some calculations), and sometimes it compiles (though very slow).
FUNC and VAR are just macros to set unique object names.
Here is the function:
Func getResponse(UniformImage resp, UniformImage src)
{
    Func FUNC(f);
    Var VAR(x), VAR(y);
    Expr scale = cast<float>(resp.width()) / cast<float>(src.width());
    Expr xx = cast<int>(scale * cast<float>(x));
    Expr yy = cast<int>(scale * cast<float>(y));
    f(x, y) = resp(clamp(xx, 0, resp.width()), clamp(yy, 0, resp.height()));
    return f;
}
Func nonMaxSuppression
    (UniformImage t, Uniform<int> t_step, Uniform<int> t_filter,
     UniformImage m, Uniform<int> m_step, Uniform<int> m_filter,
     UniformImage b, Uniform<int> b_step, Uniform<int> b_filter,
     UniformImage laplacian,
     Uniform<float> threshold)
{
    Var VAR(x), VAR(y);

    /* clamp parameters */
    Func FUNC(clamped_t), FUNC(clamped_m), FUNC(clamped_b);
    clamped_t(x,y) = t(clamp(x, 0, t.width()), clamp(y, 0, t.height()));
    clamped_m(x,y) = m(clamp(x, 0, m.width()), clamp(y, 0, m.height()));
    clamped_b(x,y) = b(clamp(x, 0, b.width()), clamp(y, 0, b.height()));

    Func response_m_t = getResponse(m, t);
    Func response_b_t = getResponse(b, t);

    Expr layerBorder = (t_filter + 1) / (2 * t_step);
    Expr validBounds = ( y > layerBorder
                      && y < t.height() - layerBorder
                      && x > layerBorder
                      && x < t.width() - layerBorder);

    Expr candidate = response_m_t(x, y);
    Expr aboveThreshold = candidate >= threshold;

    RDom r(-1, 2, -1, 2);
    Expr max_t = maximum(clamped_t(r.x+x, r.y+y));
    Expr max_m = maximum(clamped_m(r.x+x, r.y+y));
    Expr max_b = maximum(clamped_b(r.x+x, r.y+y));

    Func FUNC(GreaterNeigh);
    GreaterNeigh(x,y) = max(max_t, max(max_m, max_b));
    Expr isExt = validBounds && aboveThreshold && (GreaterNeigh(x,y) <= candidate);

    // ---------------------------------------
    // Step 1: Calculate the 3D derivative
    // ---------------------------------------
    Func FUNC(dx), FUNC(dy), FUNC(ds);
    dx(x,y) = (response_m_t(x+1, y) - response_m_t(x-1, y)) / 2.0f;
    dy(x,y) = (response_m_t(x, y+1) - response_m_t(x, y-1)) / 2.0f;
    ds(x,y) = (clamped_t(x, y) - clamped_b(x, y)) / 2.0f;

    // ---------------------------------------
    // Step 2: Calculate the inverse Hessian
    // ---------------------------------------
    Expr v;
    Func FUNC(dxx), FUNC(dyy), FUNC(dss), FUNC(dxy), FUNC(dxs), FUNC(dys);
    v = response_m_t(x, y);
    dxx(x,y) = response_m_t(x + 1, y) + m(x - 1, y) - 2 * v;
    dyy(x,y) = response_m_t(x, y + 1) + m(x, y - 1) - 2 * v;
    dss(x,y) = clamped_t(x, y) + response_b_t(x, y) - 2 * v;
    dxy(x,y) = ( response_m_t(x + 1, y + 1)
               - response_m_t(x - 1, y + 1)
               - response_m_t(x + 1, y - 1)
               + response_m_t(x - 1, y - 1) ) / 4.0;
    dxs(x,y) = ( clamped_t(x + 1, y)
               - clamped_t(x - 1, y)
               - response_b_t(x + 1, y)
               + response_b_t(x - 1, y) ) / 4.0;
    dys(x,y) = ( clamped_t(x, y + 1)
               - clamped_t(x, y - 1)
               - response_b_t(x, y + 1)
               + response_b_t(x, y - 1) ) / 4.0;

    Expr H[3][3] = {{dxx(x,y), dxy(x,y), dxs(x,y)},
                    {dxy(x,y), dyy(x,y), dys(x,y)},
                    {dxs(x,y), dys(x,y), dss(x,y)}};

    Func FUNC(invDet);
    invDet(x,y) = 1.0 /
        (H[0][0]*(H[1][1]*H[2][2]-H[2][1]*H[1][2]) -
         H[0][1]*(H[1][0]*H[2][2]-H[1][2]*H[2][0]) +
         H[0][2]*(H[1][0]*H[2][1]-H[2][2]*H[2][0]));

    Expr invH[3][3] =
        {{ (H[1][1]*H[2][2]-H[2][1]*H[1][2])*invDet(x,y), -(H[1][0]*H[2][2]-H[1][2]*H[2][0])*invDet(x,y),  (H[1][0]*H[2][1]-H[2][0]*H[1][1])*invDet(x,y)},
         {-(H[0][1]*H[2][2]-H[0][2]*H[2][1])*invDet(x,y),  (H[0][0]*H[2][2]-H[0][2]*H[2][0])*invDet(x,y), -(H[0][0]*H[2][1]-H[2][0]*H[0][1])*invDet(x,y)},
         { (H[0][1]*H[1][2]-H[0][2]*H[1][1])*invDet(x,y), -(H[0][0]*H[1][2]-H[1][0]*H[0][2])*invDet(x,y),  (H[0][0]*H[1][1]-H[1][0]*H[0][1])*invDet(x,y)}};

    // ---------------------------------------
    // Step 3: Multiply derivative and Hessian
    // ---------------------------------------
    Expr cx = (invH[0][0] * dx(x,y) * -1.0) + (invH[0][1] * dy(x,y) * -1.0) + (invH[0][2] * ds(x,y) * -1.0);
    Expr cy = (invH[1][0] * dx(x,y) * -1.0) + (invH[1][1] * dy(x,y) * -1.0) + (invH[1][2] * ds(x,y) * -1.0);
    Expr ci = (invH[2][0] * dx(x,y) * -1.0) + (invH[2][1] * dy(x,y) * -1.0) + (invH[2][2] * ds(x,y) * -1.0);

    Expr isClose = (abs(cx) < 0.5 && abs(cy) < 0.5 && abs(ci) < 0.5);
    Expr posx = cast<float>((x + cx)*t_step);
    Expr posy = cast<float>((y + cy)*t_step);
    Expr det_scale = cast<float>((0.1333)*(m_filter + (ci* (m_filter - b_filter))));

    Func FUNC(laplacianF);
    laplacianF = getResponse(laplacian, t);

    Var VAR(c);
    Func FUNC(out);
    out(x,y,c) = select(c==0, isExt && isClose,
                 select(c==1, posx,
                 select(c==2, posy,
                 select(c==3, det_scale,
                 select(c==4, laplacianF(x,y), 0.0)))));

    // schedule
    GreaterNeigh.root();
    invDet.root();

    return out;
}
int main()
{
    UniformImage t(Float(32), 2),
                 m(Float(32), 2),
                 b(Float(32), 2),
                 laplacian(Float(32), 2);
    Uniform<int> t_step, t_filter,
                 m_step, m_filter,
                 b_step, b_filter;
    Uniform<float> threshold;
    Func nms = nonMaxSuppression(t, t_step, t_filter,
                                 m, m_step, m_filter,
                                 b, b_step, b_filter,
                                 laplacian,
                                 threshold);
    nms.compileJIT(); /* error */
    return 0;
}
I am trying to find the cosine of a set of numbers; I used values from 0 to 10. My code compiled, but I get this error while running:
Halide::DynImage::Contents::Contents(const Halide::Type&, int): Assertion `a > 0 && "Images must have positive sizes\n"' failed.
Aborted (core dumped)
I understood the second part. How can I make it work for negative numbers?
It seems all that's needed for the stock Lion compiler is to remove the initializer list support for images.
I am working on CUDA and Halide. I have compiled and run a few examples. When I opened my working directory, I found a file "kernel.ptx". I opened it and found this.
Does Halide support only devices with compute capability 2.0?
I have run the code given in Getting Started and the test folder. It ran fine with g++. But when I include the shell command for CUDA, i.e.,
"g++-4.6 -std=c++0x hello_halide.cpp -L /usr/local/cuda/lib64 -lcuda halide -lHalide -ldl -lpthread -o hello_halide"
it shows:
"/usr/bin/ld: cannot find halide: File format not recognized
/usr/bin/ld: cannot find -lHalide
collect2: ld returned 1 exit status"
What should I do?
For example, the following code accesses the output image incorrectly.
f(x) = x
f.root().split(x, xo, x, 2)
f.realize(...)
A failing test has been checked in as split_reuse_inner_name_bug