tullio.jl's Introduction

Tullio.jl

Tullio is a very flexible einsum macro. It understands many array operations written in index notation -- not just matrix multiplication and permutations, but also convolutions, stencils, scatter/gather, and broadcasting. For example:

@tullio M[x,y,c] := N[x+i, y+j,c] * K[i,j]     # sum over i,j, and create M

@tullio S[x] = P[x,y] * log(Q[x,y] / R[y])     # sum over y, and write into S

@tullio A[i,j] += B[i,k,l] * C[l,j] * D[k,j]   # sum over k,l, and add to values in A

@tullio (*) Z[j] := X[ind[k],j] * exp(-Y[k])   # product over k

Used by itself the macro writes ordinary nested loops much like Einsum.@einsum. One difference is that it can parse more expressions, and infer ranges for their indices. Another is that it will use multi-threading (via Threads.@spawn) and recursive tiling, on large enough arrays. But it also co-operates with various other packages, provided they are loaded before the macro is called:

  • It uses LoopVectorization.@avx to speed many things up. (Disable with keyword avx=false; a short sketch follows this list.) On a good day this will match the speed of OpenBLAS for matrix multiplication.

  • It uses KernelAbstractions.@kernel to make a GPU version. (Disable with cuda=false.) This is somewhat experimental, and may not be fast.
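For example (a sketch), the same definition simply picks up LoopVectorization if it is loaded first, and avx=false opts out for a single call:

using Tullio, LoopVectorization
A, B = rand(100,100), rand(100,100)
fast(A, B)  = @tullio C[i,k] := A[i,j] * B[j,k]             # uses @avx, since LV is loaded
plain(A, B) = @tullio C[i,k] := A[i,j] * B[j,k]  avx=false  # ordinary loops
fast(A, B) ≈ A * B   # true
plain(A, B) ≈ A * B  # true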

The macro also tries to provide a gradient for use with Tracker or (via ChainRules) for Zygote, Yota, etc. (Disable with grad=false, or nograd=A.) This is done in one of two ways:

  • By default it takes a symbolic derivative of the right hand side expression. This works for reductions over + or min/max. The functions as typed must be known, mostly from DiffRules.

  • The option grad=Dual instead uses ForwardDiff to differentiate the right hand side (only for reductions over +). This allows for more complicated expressions. (Both modes are sketched below.)
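A minimal sketch of the two modes, assuming Zygote and ForwardDiff are installed:

using Tullio, Zygote, ForwardDiff

f_sym(x)  = @tullio s := x[i]^2              # default: symbolic derivative of the RHS
f_dual(x) = @tullio s := x[i]^2  grad=Dual   # ForwardDiff dual numbers instead

x = rand(5)
Zygote.gradient(f_sym, x)[1]  ≈ 2 .* x  # true
Zygote.gradient(f_dual, x)[1] ≈ 2 .* x  # true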

The entire right hand side is summed over the full possible range of any indices not appearing on the left. Pipe operators |> or <| indicate functions to be performed outside the sum, for example:

@tullio lse[j] := log <| exp(mat[i,j])   # vec(log.(sum(exp.(mat), dims=1)))

The option @tullio verbose=true will cause it to print index ranges, symbolic derivatives, and notices when it is unable to use the packages mentioned above. And verbose=2 will print everything.

If it's useful in academic work, you can cite it with the DOI given by the badge in the repository README.

Notation

Index notation for some simple functions:

using Pkg; Pkg.add("Tullio")
using Tullio, Test
M = rand(1:20, 3, 7)

@tullio S[1,c] := M[r,c]  # sum over r ∈ 1:3, for each c ∈ 1:7
@test S == sum(M, dims=1) 

@tullio Q[ρ,c] := M[ρ,c] + sqrt(S[1,c])  # loop over ρ & c, no sum -- broadcasting
@test Q ≈ M .+ sqrt.(S)

mult(M,Q) = @tullio P[x,y] := M[x,c] * Q[y,c]  # sum over c ∈ 1:7 -- matrix multiplication
@test mult(M,Q) ≈ M * transpose(Q)

R = [rand(Int8, 3, 4) for δ in 1:5]

@tullio T[j,i,δ] := R[δ][i,j] + 10im  # three nested loops -- concatenation
@test T == permutedims(cat(R...; dims=3), (2,1,3)) .+ 10im

@tullio (max) X[i] := abs2(T[j,i,δ])  # reduce using max, over j and δ
@test X == dropdims(maximum(abs2, T, dims=(1,3)), dims=(1,3))

dbl!(M, S) = @tullio M[r,c] = 2 * S[1,c]  # write into existing matrix, M .= 2 .* S
dbl!(M, S)
@test all(M[r,c] == 2*S[1,c] for r ∈ 1:3, c ∈ 1:7)

More complicated examples:

using Tullio
A = [abs2(i - 11) for i in 1:21]

# Downsample -- range of i is that allowed by both terms:
@tullio B[i] := (A[2i] + A[2i+1])/2  # 1:10 == intersect(1:10, 0:10)

# Shifts -- range of i calculated in terms of that given for j:
@tullio M[i,j] := A[i+j-1]  (j in 1:15)  # i in 1:7
@tullio M[i+_,j] := A[i+j]  (j in 1:15)  # i in 0:6, automatic shift "i+_"

using OffsetArrays # Convolve a filter:
K = OffsetArray([1,-1,2,-1,1], -2:2)
@tullio C[i] := A[i+j] * K[j]    # j ∈ -2:2 implies i ∈ 3:19

# Index by the values in K
@tullio D[i,j] := A[2K[j]+i] ÷ K[j] # extrema(K)==(-1,2) implies i ∈ 3:17

# Wrapped & padded:
@tullio M[i,j] := A[mod(i+j)]  (j in 1:15, i in 1:15)   # wraps around, mod(i+j, axes(A,1))
@tullio M[i,j] := A[clamp(i+j)]  (j in 1:15, i in 1:15) # instead repeats "100"
@tullio M[i+_,j] := A[pad(i+j, 3)]  (j in 1:15)         # fills with zeros

using FFTW # Functions of the indices are OK:
S = [0,1,0,0, 0,0,0,0]
fft(S) ≈ @tullio F[k] := S[x] * exp(-im*pi/8 * (k-1) * x)  (k ∈ axes(S,1))

# Finalisers <| or |> are applied after sum (the two are equivalent):
@tullio N2[j] := sqrt <| M[i,j]^2     # N2 ≈ map(norm, eachcol(M))
@tullio n3[_] := A[i]^3  |> (_)^(1/3) # n3[1] ≈ norm(A,3), with _ anon. func.

# Reduction over any function:
@tullio (*) P[i] := A[i+k]  (k in 0:2) # product
@tullio (max) X[i,_] := D[i,j]         # maximum(D, dims=2), almost

min1(x,y) = ifelse(first(x) < first(y), x, y); # findmin(D, dims=1), almost:
@tullio (min1) Ts[j+_] := (D[i,j], (i,j))  init=(typemax(Int), (0,0))

# Access to fields & arrays -- this uses j ∈ eachindex(first(N).c)
N = [(a=i, b=i^2, c=fill(i^3,3)) for i in 1:10]
@tullio T[i,j] := (N[i].a // 1, N[i].c[j])

# Functions which create arrays are evaluated once:
@tullio R[i,j] := abs.((rand(Int8, 5)[i], rand(Int8, 5)[j]))

using NamedDims, AxisKeys # Dimension names, plus pretty printing:
@tullio M[row=i, col=j, z=k] := A[i+j-1]  (j in 1:15, k in 1:2)
@tullio S[i] := M[col=j-i, z=k, row=i+1] # sum over j,k

Fast & Slow

When used with LoopVectorization, on straightforward matrix multiplication of real numbers, @tullio tends to be about as fast as OpenBLAS, depending on the size and on your computer. (A speed comparison plot is linked from the repository README.)

This race is a useful diagnostic, but isn't really the goal. There is little point avoiding BLAS libraries if what you want is precisely what they are optimised to give you. One of the things @tullio is often very fast at is weird tensor contractions, for which you would otherwise need permutedims:

using Tullio, LoopVectorization, NNlib, BenchmarkTools

# Batched matmul, with batch index first in B:
bmm_rev(A, B) = @tullio C[i,k,b] := A[i,j,b] * B[b,k,j]  # (sum over j)

A = randn(20,30,500); B = randn(500,40,30);
bmm_rev(A, B) ≈ NNlib.batched_mul(A, permutedims(B, (3,2,1)))  # true

@btime bmm_rev($A, $B);  # 317.526 μs, same speed as un-permuted
@btime NNlib.batched_mul($A, permutedims($B, (3,2,1)));  # 1.478 ms, with MKL

Complex numbers aren't handled by LoopVectorization, so operations on complex arrays will be much slower.
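They still work, though; for example (a sketch), a complex matmul runs but without the @avx speedup:

Ac, Bc = randn(ComplexF64, 100, 100), randn(ComplexF64, 100, 100);
@tullio Cc[i,k] := Ac[i,j] * Bc[j,k]  # works, but expect plain-loop speed
Cc ≈ Ac * Bc  # true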

Chained multiplication is also very slow, because it doesn't know there's a better algorithm. Here it just writes 4 nested loops, instead of multiplying sequentially: roughly 30^4 instead of 2 * 30^3 operations:

M1, M2, M3 = randn(30,30), randn(30,30), randn(30,30);
@btime $M1 * $M2 * $M3;                                   #  3.525 μs
@btime @tullio M4[i,l] := $M1[i,j] * $M2[j,k] * $M3[k,l]; # 30.401 μs

Or slightly less obviously:

M, Σ = randn(100,100), randn(100,100);
@tullio R4[i, j] := (M[μ, i] - M[μ,j])' * Σ[μ,ν] * (M[ν, i] - M[ν, j]);
begin
  S = M' * Σ * M  # two N^3 operations, instead of one N^4
  @tullio R3[i,j] := S[i,i] + S[j,j] - S[i,j] - S[j,i]
end;
R3 ≈ R4

Another thing Tullio can be very fast at is broadcast reductions, where it can avoid large allocations. Here LoopVectorization is speeding up log, and Tullio is handling tiled memory access and multi-threading:

sum_opp(X, Y=X) = @tullio s := X[i,j] * log(Y[j,i])
sum_part(X, Y=X) = @tullio S[i] := X[i,j] * log(Y[j,i])

X = rand(1000,1000);
@btime sum_opp($X)                    #   499.814 μs (93 allocations: 3.97 KiB)
@btime sum($X .* log.(transpose($X))) # 8.759 ms (2 allocations: 7.63 MiB)

@btime sum_part($X)'                           #  1.599 ms (not the same computer!)
@btime sum($X .* log.(transpose($X)), dims=2)  # 13.292 ms

At present indices using pad, clamp or mod are also slow. These result in extra checks or operations at every iteration, not just around the edges:

conv1(x,k) = @tullio y[i+_, j+_] := x[i+a, j+b] * k[a,b]
conv2(x,k) = @tullio y[i+_, j+_] := x[2i-a, 2j-b] * k[a,b]
conv3(x,k) = @tullio y[i+_, j+_] := x[pad(i-a,3), pad(j-b,3)] * k[a,b]

x100 = rand(100,100); k7 = randn(7,7);
@btime conv1($x100, $k7); #  25.574 μs
@btime conv2($x100, $k7); #  44.590 μs
@btime conv3($x100, $k7); #  86.228 μs

using Flux
x104 = reshape(x100,(100,100,1,1)); k74 = reshape(k7,(7,7,1,1)); 
conv1(x100, k7) ≈ @btime CrossCor($k74, false)($x104)       # 586.694 μs
conv2(x100, k7) ≈ @btime Conv($k74, false, stride=2)($x104) # 901.573 μs
conv3(x100, k7) ≈ @btime Conv($k74, false, pad=3)($x104)    # 932.658 μs

using DSP
@btime DSP.conv($x100, $k7); # 198.331 μs

Gradients & GPU

using Tullio
mul(A, B) = @tullio C[i,k] := A[i,j] * B[j,k]

A = rand(3,40); B = rand(40,500);
A * B ≈ mul(A, B) # true

using Tracker # or Zygote
ΔA = Tracker.gradient((A,B) -> sum(mul(A, B)), A, B)[1]
ΔA ≈ ones(3,500) * B' # true

using CUDA, KernelAbstractions # Now defined with a GPU version:
mul(A, B) = @tullio C[i,k] := A[i,j] * B[j,k]

cu(A * B) ≈ mul(cu(A), cu(B)) # true

cu(ΔA) ≈ Tracker.gradient((A,B) -> sum(mul(A, B)), cu(A), cu(B))[1] # true

# Reduction over min/max:
Tracker.gradient(x -> (@tullio (max) res := x[i]^3), [1,2,3,-2,-1,3])[1]

Some warnings are in order:

  • Complete reductions to a number will not work on the GPU at present. They were extremely slow, and a re-organisation of multi-threading for the CPU case killed them, sorry.
  • Gradients are not calculated for scalars, only arrays. Thus for example gradient(a -> (@tullio _ := $a * A[i]), 3.14) will be zero.
  • When using grad=Dual, the right hand side is evaluated a second time during the backward pass. This avoids needing memory to store partials, but if the function is expensive, it may be slow.

Larger Expressions

The expression need not be just one line, for example:

@tullio out[x, y] := @inbounds(begin  # sum over k
        a,b = off[k]
        mat[mod(x+a), mod(y+b)]
    end) (x in axes(mat,1), y in axes(mat,2)) grad=Dual nograd=off

Here the macro cannot infer the range of the output's indices x,y, so they must be provided explicitly. (If writing into an existing array, with out[x,y] = begin ... or +=, then ranges would be taken from there.) Because it sees assignment being made, it does not attempt to sum over a,b, and it assumes that indices could go out of bounds so does not add @inbounds for you. (Although in fact mod(x+a) == mod(x+a, axes(mat,1)) is safe.) It will also not be able to take a symbolic derivative, but dual numbers will work fine.

More examples:

using Tullio, OffsetArrays

# A convolution with cyclic indices
mat = zeros(10,10,1); mat[2,2] = 101; mat[10,10] = 1;
@tullio kern[i,j] := 1/(1+i^2+j^2)  (i in -3:3, j in -3:3)

@tullio out[x,y,c] := begin
    xi = mod(x+i, axes(mat,1)) # xi = ... means that it won't be summed,
    # yj = mod(y+j, axes(mat,2))
    @inbounds trunc(Int, mat[xi, mod(y+j), c] * kern[i,j]) # and disables automatic @inbounds,
end (x in 1:10, y in 1:10) # and prevents range of x from being inferred.

# A stencil?
offsets = [(a,b) for a in -2:2 for b in -2:2 if a>=b] # vector of tuples

@tullio out[x,y,1] = begin
        a,b = offsets[k]
        i = clamp(x+a, extrema(axes(mat,1))...)
        # j = clamp(y+b, extrema(axes(mat,2))...) # can be written clamp(y+b)
        @inbounds mat[i, clamp(y+b), 1] * 10
    end # ranges of x,y read from out[x,y,1]

# Applying a vector of functions
fs = [sin, cos, tan]
xs = randn(3,100)
@tullio ys[r,c] := (fs[r])(xs[r,c])

using Zygote, ForwardDiff
rowmap(fs, xs) = @tullio ys[r,c] := (fs[r])(xs[r,c]) grad=Dual nograd=fs
Zygote.gradient(sum∘rowmap, fs, ones(3,2))
[f'(1) for f in fs] # agrees

Keyword Options

The default setting is: @tullio threads=true fastmath=true avx=true tensor=true cuda=256 grad=Base verbose=false A[i,j] := ...

  • threads=false turns off threading, while threads=64^3 sets a threshold size at which to divide the work (replacing the macro's best guess).
  • avx=false turns off the use of LoopVectorization, while avx=4 inserts @avx unroll=4 for i in ....
  • grad=false turns off gradient calculation, and grad=Dual switches it to use ForwardDiff (which must be loaded).
  • nograd=A turns off the gradient calculation just for A, and nograd=(A,B,C) does this for several arrays.
  • tensor=false turns off the use of TensorOperations.
  • Assignment xi = ... removes xi from the list of indices: its range is not calculated, and it will not be summed over. It also disables @inbounds, since this is now up to you.
  • verbose=true prints things like the index ranges inferred, and gradient calculations. verbose=2 prints absolutely everything.
  • A[i,j] := ... makes a new array, while A[i,j] = ... and A[i,j] += ... write into an existing one. A[row=i, col=j] := ... makes a new NamedDimsArray.
  • @tullio (*) A[i,j] := ... is a product, as is @tullio A[i,j] *= .... For other reductions, @tullio (f) A[i,j] ^= ... is an in-place update.
  • init=0.0 gives the initial value for reductions. For +, *, min, max, &, | it has sensible defaults; for other reductions it uses zero. (A combined example follows this list.)
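For instance, a sketch combining several of the options above:

A = rand(10, 10)
@tullio (max) B[i] := A[i,j]  init=0.0 threads=false avx=false
B ≈ vec(maximum(A, dims=2))  # true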

Implicit:

  • Indices without shifts must have the same range everywhere they appear, but those with shifts (even A[i+0]) run over the intersection of possible ranges.
  • Shifted output indices must start at 1, unless OffsetArrays is visible in the calling module.
  • The use of @avx, and the calculation of gradients, are switched off by sufficiently complex syntax (such as arrays of arrays).
  • Gradient hooks are attached for any or all of ReverseDiff, Tracker & Zygote. These packages need not be loaded when the macro is run.
  • Gradients are only defined for reductions over (+) (default) and min, max.
  • GPU kernels are only constructed when both KernelAbstractions and CUDA are visible. The default cuda=256 is passed to kernel(CUDA(), 256).
  • The CPU kernels from KernelAbstractions are called only when threads=false; they are not at present very fast, but perhaps useful for testing.

Extras:

  • A[i] := i^2 (i in 1:10) is how you specify a range for indices when this can't be inferred.
  • A[i] := B[i, $col] - C[i, 2] is how you fix one index to a constant (to prevent col being summed over).
  • A[i] := $d * B[i] is the preferred way to include other constants. Note that no gradient is calculated for d. (A runnable sketch of these follows this list.)
  • Within indexing, A[mod(i), clamp(j)] both maps i & j to lie within axes(A), and disables inference of their ranges from A.
  • Similarly, A[pad(i,3)] extends the range of i, inserting zeros outside of A. Instead of zero, pad=NaN uses this value as padding. The implementation of this (and mod, clamp) is not very fast at present.
  • On the left, when making a new array, an underscore like A[i+_] := inserts whatever shift is needed to make A one-based.
  • Tullio.@printgrad (x+y)*log(x/z) x y z prints out how symbolic derivatives will be done.
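A runnable sketch of the $-interpolation items above:

using Tullio
M = rand(4, 3); col = 2; d = 10
@tullio v[i] := M[i, $col]     # col is held fixed, not summed over
v == M[:, 2]                   # true
@tullio w[i] := $d * M[i, j]   # sum over j; no gradient is taken for d
w ≈ d .* vec(sum(M, dims=2))   # true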

Macros:

  • Tullio.@tensor is a macro which uses TensorOperations to evaluate expressions, but provides gradient definitions. (Previously this was automatic behaviour, when TensorOperations.jl was loaded & the expression was suitable.)
  • Tullio.@einsum is a variant with a few changes, to allow the running of Einsum.jl's tests.

How it Works

The following three macros all end up calling the same functions as does C = A * B:

@tensor C[i,j] := A[i,k] * B[k,j]         # TensorOperations.jl
@ein C[i,j] := A[i,k] * B[k,j]            # OMEinsum.jl
@matmul C[i,j] := sum(k) A[i,k] * B[k,j]  # TensorCast.jl

But this one writes its own for-loops:

@einsum C[i,j] := A[i,k] * B[k,j]         # Einsum.jl

expanding out to roughly this:

T = promote_type(eltype(A), eltype(B))
C = Array{T}(undef, size(A,1), size(B,2))
@inbounds for j in 1:size(B,2)
    for i in 1:size(A,1)
        acc = zero(T)
        for k in 1:size(A,2)
            acc += A[i,k] * B[k,j]
        end
        C[i,j] = acc
    end
end

Tullio does something similar, but working through a few functions. Taking a slightly more complicated example, this:

@tullio C[i,j] := tanh <| A[i,k] * B[k,j]

expands to roughly this:

function act!(::Type, C::AbstractArray{T}, A, B, ax_i, ax_j, ax_k, keep=nothing, final=true) where T
    @inbounds @fastmath for i in ax_i
        for j in ax_j
            acc = isnothing(keep) ? zero(T) : C[i,j]
            for k in ax_k
                acc += A[i,k] * B[k,j]
            end
            C[i,j] = isnothing(final) ? acc : tanh(acc)
        end
    end
end

function make(A, B)
    ax_i = axes(A,1)
    ax_j = axes(B,2)
    ax_k = axes(A,2) # and check this is == axes(B,1)
    rhs(A,B,i,j,k) = tanh(A[i,k] * B[k,j])
    T = Core.Compiler.return_type(rhs, eltype.((A,B,1,1,1))) # plus a fallback
    C = similar(A, T, (ax_i, ax_j))
    Tullio.threader(act!, Array{T}, C, (A,B), (ax_i,ax_j), (ax_k,), +, 64^3)
    return C
end

C = Tullio.Eval(make, ∇make)(A, B)

This division allows it to dispatch to other methods of act!: one generated with @avx if LoopVectorization is loaded, and one for ::CuArray if KernelAbstractions is loaded.

It also allows threader to divide the work, calling act! many times, from different threads, on small tiles made by dividing the longest axis (say ax_i) in half, repeatedly. If it divides up ax_k, these are done sequentially, with keep=true on all ranges except the first, and final=nothing on all except the last. But ax_i and ax_j are safe to do in parallel.
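In pseudocode, the parallel part of this looks roughly like the following (a much-simplified sketch, not the real Tullio.threader; the sequential handling of ax_k with keep/final, and the per-eltype block size, are omitted):

halves(r) = (first(r):first(r)+length(r)÷2-1, first(r)+length(r)÷2:last(r))

function threader_sketch(act!, C, As, ax_i, ax_j, ax_k; block=64^3)
    if length(ax_i) * length(ax_j) * length(ax_k) <= block || max(length(ax_i), length(ax_j)) < 2
        act!(Array, C, As..., ax_i, ax_j, ax_k)        # small enough: one serial call
    elseif length(ax_i) >= length(ax_j)                # halve the longest output axis
        lo, hi = halves(ax_i)
        task = Threads.@spawn threader_sketch(act!, C, As, lo, ax_j, ax_k; block=block)
        threader_sketch(act!, C, As, hi, ax_j, ax_k; block=block)
        wait(task)                                     # i-tiles write disjoint parts of C
    else
        lo, hi = halves(ax_j)
        task = Threads.@spawn threader_sketch(act!, C, As, ax_i, lo, ax_k; block=block)
        threader_sketch(act!, C, As, ax_i, hi, ax_k; block=block)
        wait(task)
    end
end

# e.g. with the act! defined above: threader_sketch(act!, C, (A,B), axes(A,1), axes(B,2), axes(A,2))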

Finally, Eval exists to give Zygote and friends somewhere to attach themselves. The gradient calculation looks roughly like this:

@adjoint function (e::Eval)(AB...)
    C = e.fwd(AB...)
    C, ΔC -> e.rev(ΔC, C, AB...)
end

function ∇act!(::Type, ΔC, ΔA, ΔB, C, A, B, ax_i, ax_j, ax_k, keep)
    for k in ax_k, i in ax_i, j in ax_j
        ex = ΔC[i,j] * (1 - C[i,j]^2)
        ΔA[i,k] += ex * B[k,j]
        ΔB[k,j] += A[i,k] * ex
    end
end

function ∇make(ΔC, C, A, B)
    ΔA = similar(A) .= 0
    ΔB = similar(B) .= 0
    ax_i, ax_k = axes(A); ax_j = axes(B,2)
    Tullio.∇threader(∇act!, Array{T}, (ax_k,), (ax_i, ax_j), nothing)
    return (ΔA, ΔB)
end

In this case, it is the loop over k which can be safely broken among different threads, since both ΔA and ΔB have this index. Both ΔA and ΔB are filled in at once.

Notice that the derivative of y = tanh(z) has been written in terms of y (the final result of the forward pass) but free of z (the result of the sum, which was not saved). If this is not possible, it will fail.
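For tanh, that identity is d/dz tanh(z) = 1 - tanh(z)^2 = 1 - y^2, which is easy to check:

using ForwardDiff
z = 0.7; y = tanh(z)
ForwardDiff.derivative(tanh, z) ≈ 1 - y^2  # true: the form used by ∇act! above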

If using grad=Dual, the gradient kernel looks different. This method cannot handle finalisers like tanh above, but for plain @tullio C[i,j] := A[i,k] * B[k,j] it would read:

function ∇act!(::Type, ΔC, ΔA, ΔB, C, A, B, ax_i, ax_j, ax_k, keep)
    eps1 = ForwardDiff.Dual(0, (1,0))
    eps2 = ForwardDiff.Dual(0, (0,1))
    for k in ax_k, i in ax_i, j in ax_j
        res = (A[i,k] + eps1) * (B[k,j] + eps2)
        ΔA[i,k] += ForwardDiff.partials(res, 1) * ΔC[i,j]
        ΔB[k,j] += ForwardDiff.partials(res, 2) * ΔC[i,j]
    end
end

Writing @tullio verbose=2 will print all of these functions out.

Scalar reductions, such as @tullio s := A[i,j] * log(B[j,i]), are slightly different in that the act! function simply returns the sum, i.e. the variable acc above.
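For example (a quick sketch):

A = rand(3, 4); B = rand(4, 3);
@tullio s := A[i,j] * log(B[j,i])
s ≈ sum(A .* log.(transpose(B)))  # true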

Elsewhere

The repository README also links to related packages, grouped as back-end friends & relatives, front-end near-lookalikes, and things you can't run.

tullio.jl's People

Contributors: carlolucibello, chriselrod, dilumaluthge, jishnub, johnnychen94, kristofferc, maximilian-gelbrecht, mcabbott, n3n5, simeonschaub, vpuri3

tullio.jl's Issues

Warning when loaded with Reversediff

Executing using ReverseDiff, Tullio in REPL gives the warning

julia> using ReverseDiff, Tullio
┌ Warning: Error requiring ReverseDiff from Tullio:
│ LoadError: LoadError: MethodError: no method matching gensym(::Expr)
│ Closest candidates are:
│   gensym(!Matched::Symbol) at expr.jl:15
│   gensym(!Matched::String) at expr.jl:12
│   gensym() at expr.jl:10
│   ...
│ Stacktrace:
│  [1] @grad(::LineNumberNode, ::Module, ::Any) at /home/ritesh/.julia/packages/ReverseDiff/jFRo1/src/macros.jl:182
│  [2] include(::Function, ::Module, ::String) at ./Base.jl:380
│  [3] include at ./Base.jl:368 [inlined]
│  [4] include(::String) at /home/ritesh/.julia/packages/Tullio/oeP9x/src/Tullio.jl:1
│  [5] top-level scope at none:1
│  [6] eval at ./boot.jl:331 [inlined]
│  [7] eval at /home/ritesh/.julia/packages/Tullio/oeP9x/src/Tullio.jl:1 [inlined]
│  [8] (::Tullio.var"#156#160")() at /home/ritesh/.julia/packages/Requires/jmibq/src/require.jl:85
│  [9] err(::Any, ::Module, ::String) at /home/ritesh/.julia/packages/Requires/jmibq/src/require.jl:42
│  [10] (::Tullio.var"#155#159")() at /home/ritesh/.julia/packages/Requires/jmibq/src/require.jl:84
│  [11] withpath(::Any, ::String) at /home/ritesh/.julia/packages/Requires/jmibq/src/require.jl:32
│  [12] (::Tullio.var"#154#158")() at /home/ritesh/.julia/packages/Requires/jmibq/src/require.jl:83
│  [13] listenpkg(::Any, ::Base.PkgId) at /home/ritesh/.julia/packages/Requires/jmibq/src/require.jl:15
│  [14] macro expansion at /home/ritesh/.julia/packages/Requires/jmibq/src/require.jl:81 [inlined]
│  [15] (::Tullio.var"#153#157")() at /home/ritesh/.julia/packages/Requires/jmibq/src/init.jl:11
│  [16] __init__() at /home/ritesh/.julia/packages/Requires/jmibq/src/init.jl:18
│  [17] _include_from_serialized(::String, ::Array{Any,1}) at ./loading.jl:697
│  [18] _require_search_from_serialized(::Base.PkgId, ::String) at ./loading.jl:782
│  [19] _require(::Base.PkgId) at ./loading.jl:1007
│  [20] require(::Base.PkgId) at ./loading.jl:928
│  [21] require(::Module, ::Symbol) at ./loading.jl:923
│  [22] eval(::Module, ::Any) at ./boot.jl:331
│  [23] eval_user_input(::Any, ::REPL.REPLBackend) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:134
│  [24] repl_backend_loop(::REPL.REPLBackend) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:195
│  [25] start_repl_backend(::REPL.REPLBackend, ::Any) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:180
│  [26] run_repl(::REPL.AbstractREPL, ::Any; backend_on_current_task::Bool) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:292
│  [27] run_repl(::REPL.AbstractREPL, ::Any) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:288
│  [28] (::Base.var"#806#808"{Bool,Bool,Bool,Bool})(::Module) at ./client.jl:399
│  [29] #invokelatest#1 at ./essentials.jl:710 [inlined]
│  [30] invokelatest at ./essentials.jl:709 [inlined]
│  [31] run_main_repl(::Bool, ::Bool, ::Bool, ::Bool, ::Bool) at ./client.jl:383
│  [32] exec_options(::Base.JLOptions) at ./client.jl:313
│  [33] _start() at ./client.jl:506
│ in expression starting at /home/ritesh/.julia/packages/Tullio/oeP9x/src/grad/reverse.jl:8
│ in expression starting at /home/ritesh/.julia/packages/Tullio/oeP9x/src/grad/reverse.jl:8
└ @ Requires ~/.julia/packages/Requires/jmibq/src/require.jl:44

My Config:

ReverseDiff v1.4.3
Tullio v0.2.7

Looks like the warning is related to ReverseDiff.@grad but Tracker.@grad doesn't give any such warning.

I'm trying to build a sysimg for my project (which uses ReverseDiff and Tullio) and I guess this warning wouldn't allow Tullio to be added to precompile trace (I'm not sure if it would be added).

Is there a way to get rid of this warning?

Potential for race conditions with shifted indices

This doesn’t look thread safe in the gradient, as one thread’s i+1 may be another’s i:

julia> using Tullio

julia> reg(x) = @tullio res = sqrt(abs2(x[i, j] - x[i+1, j]) +abs2(x[i, j] - x[i, j+1])) verbose=2
...
┌ Info: symbolic gradients
│   inbody =3-element Array{Any,1}:
│     :(𝛥x[i, j] = 𝛥x[i, j] + 𝛥ℛ[1] * conj((((x[i, j] - x[i + 1, j]) + (x[i, j] - x[i + 1, j])) + ((x[i, j] - x[i, j + 1]) + (x[i, j] - x[i, j + 1]))) * (inv(sqrt(abs2(x[i, j] - x[i + 1, j]) + abs2(x[i, j] - x[i, j + 1]))) / 2)))
│     :(𝛥x[i + 1, j] = 𝛥x[i + 1, j] + 𝛥ℛ[1] * conj(-(((x[i, j] - x[i + 1, j]) + (x[i, j] - x[i + 1, j]))) * (inv(sqrt(abs2(x[i, j] - x[i + 1, j]) + abs2(x[i, j] - x[i, j + 1]))) / 2)))
└     :(𝛥x[i, j + 1] = 𝛥x[i, j + 1] + 𝛥ℛ[1] * conj(-(((x[i, j] - x[i, j + 1]) + (x[i, j] - x[i, j + 1]))) * (inv(sqrt(abs2(x[i, j] - x[i + 1, j]) + abs2(x[i, j] - x[i, j + 1]))) / 2)))
┌ store.
│   arrays = [:x]

│   sharedind = [:i, :j]
│   shiftedind = [:i, :j]

Example from https://discourse.julialang.org/t/fast-gpu-kernels-differentiable-with-zygote/56756 (mostly about scalar reductions on GPU)

@tullio seems slower than equivalent for loop?

The speed gap seems large. The benchmark is shown below:

julia> a = randn(4000,4000);

julia> b = randn(4000);

julia> @btime func($a,$b);
  175.183 ms (0 allocations: 0 bytes)

julia> @btime func_tullio($a,$b);
  266.097 ms (0 allocations: 0 bytes)

julia> @btime func_avx($a,$b);
  45.472 ms (0 allocations: 0 bytes)

julia> @btime func_tullio_avx($a,$b);
  150.239 ms (0 allocations: 0 bytes)

where the functions' definition are:

func_tullio(a,b) = @tullio threads = false avx = false b[i] = sin(a[i,j]);

func_tullio_avx(a,b) = @tullio threads = false b[i] = sin(a[i,j]);

func(a,b) = begin
    ndims(a) == 2 || error("~~")
    ndims(b) == 1 || error("~~")
    size(b,1) == size(a,1) || error("~~")
    b .= 0
    for j in 1:size(a,2)  
        for i in 1:size(a,1) # no @inbounds @simd
            b[i] += sin(a[i,j])
        end
    end
end

func_avx(a,b) = begin
    ndims(a) == 2 || error("~~")
    ndims(b) == 1 || error("~~")
    size(b,1) == size(a,1) || error("~~")
    b .= 0
    for j in 1:size(a,2)
        @avx for i in 1:size(a,1)  # add @avx
            b[i] += sin(a[i,j])
        end
    end
end

Allow `div` in range inference?

This thread provides an example: https://discourse.julialang.org/t/sequential-multiple-hcat-s-of-a-range-of-rows/49852

julia> A = reshape(0:9, :,2) .+ 0
5×2 Matrix{Int64}:
 0  5
 1  6
 2  7
 3  8
 4  9

julia> multihcat(A, steps) = reduce(hcat, A[i+1:end-(steps-i), :] for i in 0:steps);

julia> multihcat(A, 2)  # desired output
3×6 Matrix{Int64}:
 0  5  1  6  2  7
 1  6  2  7  3  8
 2  7  3  8  4  9

julia> @tullio B[r,c] := begin
           i=r+(c-1)÷2
           A[i, mod(c)]
       end  (r in 1:3, c in 1:6)
3×6 Matrix{Int64}:
 0  5  1  6  2  7
 1  6  2  7  3  8
 2  7  3  8  4  9

julia> @tullio B[r,c] := A[r+(c-1)÷2, mod(c)]  r in 1:3
ERROR: LoadError: "don't know how to handle (c - 1) ÷ 2, sorry"

Support CUDA.jl

Since CuArrays.jl is being replaced with CUDA.jl, it'd be good if Tullio could move its support to CUDA.

Bugs in range inference with `div`

Some bugs I think with range inference and ÷, exposed by trying to solve this: https://discourse.julialang.org/t/efficient-reshaping-of-2d-array-by-pairs-of-columns/52953

julia> using Tullio, OffsetArrays, TensorCast

julia> A = [ 1  2
             3  4
             5  6
             7  8
             9 10
            11 12
            13 14
            15 16
            17 18 ];  # input

julia> B = [ 1  2   7   8  13  14
             3  4   9  10  15  16
             5  6  11  12  17  18 ];  # desired output

julia> @cast B2[a,(c,b)] := A[(a,b),c]  a:3  # easy!
3×6 reshape(PermutedDimsArray(::Array{Int64, 3}, (1, 3, 2)), 3, 6) with eltype Int64:
 1  2   7   8  13  14
 3  4   9  10  15  16
 5  6  11  12  17  18

julia> @tullio B3[r, c] := A[3r+c÷3, 3r+c÷3+1]  r in 1:3  # wrong formula!
3×7 OffsetArray(::Matrix{Int64}, 1:3, -6:0) with eltype Int64 with indices 1:3×-6:0:
 2  5512452232  5512452232  5512452232  5372075888  5372075888  5372075888
 0          11          11          11  4618052352  4618052352  4618052352
 0          11          11          11  4618334208  4618334208  4618334208

julia> @tullio B4[r, c] := A[r + 3((c-1)÷2), mod(c)]  r in 1:3
3×5 OffsetArray(::Matrix{Int64}, 1:3, 1:5) with eltype Int64 with indices 1:3×1:5:
 1  2   7   8  13
 3  4   9  10  15
 5  6  11  12  17

Default Thread settings (threads=true) cause bad performance on CPU

Hey,

I'm on a 4 core machine. I started Julia with 4 threads and observe (relatively) poor performance with the default Tullio settings

using Zygote, Tullio
x = randn((2000, 2000));
f(x) = (@tullio y =  (x[i+1, j] - x[i-1, j])^2 + (x[i, j+1] - x[i, j-1])^2)
ft(x) = (@tullio threads=false y = (x[i+1, j] - x[i-1, j])^2 + (x[i, j+1] - x[i, j-1])^2)

@btime Zygote.gradient(f, x)[1];
@btime Zygote.gradient(ft, x)[1]

producing:

  100.746 ms (173 allocations: 30.53 MiB)
  11.788 ms (26 allocations: 30.52 MiB)

What's actually the reason for that? I could observe that none of the 4 Julia Threads achieved 100% CPU Usage. Rather 50 - 80% in the main thread, and around 20% in the other three.

Thanks,

Felix

Massive performance degredation between v0.2.8 and v0.2.9 with LoopVectorization.jl

I think something is not being communicated correctly to LoopVectorization.jl on version 0.2.9:

julia> using Tullio, LoopVectorization
[ Info: Precompiling Tullio [bc48ee85-29a4-5162-ae0b-a64e1601d4bc]

julia> tmul!(C, A, B) = @tullio C[i, j] = A[i, k] * B[k, j]
tmul! (generic function with 1 method)

julia> foreach((2, 10, 50, 100)) do N
           A, B = rand(N, N + 1), rand(N + 1, N + 2)
           @show N
           @btime tmul!(C, $A, $B) setup=(C=zeros($N, $N+2)) # Matmul with Tullio.jl
           @btime  mul!(C, $A, $B) setup=(C=zeros($N, $N+2)) # Matmul with OpenBLAS
       end
N = 2
  51.804 ns (0 allocations: 0 bytes)
  123.709 ns (0 allocations: 0 bytes)
N = 10
  210.035 ns (0 allocations: 0 bytes)
  371.549 ns (0 allocations: 0 bytes)
N = 50
  10.550 μs (0 allocations: 0 bytes)
  13.939 μs (0 allocations: 0 bytes)
N = 100
  25.340 μs (49 allocations: 3.19 KiB)
  39.860 μs (0 allocations: 0 bytes)

(@v1.5) pkg> st Tullio LoopVectorization
Status `~/.julia/environments/v1.5/Project.toml`
  [bdcacae8] LoopVectorization v0.8.26
  [bc48ee85] Tullio v0.2.8

Now, restarting julia,

(@v1.5) pkg> add Tullio@v0.2.9
   Updating registry at `~/.julia/registries/General`
   Updating git-repo `https://github.com/JuliaRegistries/General.git`
  Resolving package versions...
Updating `~/.julia/environments/v1.5/Project.toml`
  [bc48ee85] ↑ Tullio v0.2.8 ⇒ v0.2.9
Updating `~/.julia/environments/v1.5/Manifest.toml`
  [bc48ee85] ↑ Tullio v0.2.8 ⇒ v0.2.9

julia> using Tullio, LoopVectorization
[ Info: Precompiling Tullio [bc48ee85-29a4-5162-ae0b-a64e1601d4bc]

julia> tmul!(C, A, B) = @tullio C[i, j] = A[i, k] * B[k, j]
tmul! (generic function with 1 method)

julia> foreach((2, 10, 50, 100)) do N
           A, B = rand(N, N + 1), rand(N + 1, N + 2)
           @show N
           @btime tmul!(C, $A, $B) setup=(C=zeros($N, $N+2)) # Matmul with Tullio.jl
           @btime  mul!(C, $A, $B) setup=(C=zeros($N, $N+2)) # Matmul with OpenBLAS
       end
N = 2
  51.125 ns (0 allocations: 0 bytes)
  129.749 ns (0 allocations: 0 bytes)
N = 10
  847.338 ns (0 allocations: 0 bytes)
  372.568 ns (0 allocations: 0 bytes)
N = 50
  111.719 μs (0 allocations: 0 bytes)
  13.920 μs (0 allocations: 0 bytes)
N = 100
  261.787 μs (49 allocations: 3.19 KiB)
  38.220 μs (0 allocations: 0 bytes)

(@v1.5) pkg> st Tullio LoopVectorization
Status `~/.julia/environments/v1.5/Project.toml`
  [bdcacae8] LoopVectorization v0.8.26
  [bc48ee85] Tullio v0.2.9

Unsafe scatter with GPU arrays

From here: https://discourse.julialang.org/t/increment-elements-of-array-by-index/49694/5

Tullio was detecting that i is unsafe to multi-thread, but wasn't passing this to KernelAbstractions:

using CUDA, Tullio, KernelAbstractions

let A = CUDA.zeros(4), I = cu([1,3,1]), v = cu([1,2,3])
    @tullio A[I[i]] += v[i]
end

#+RESULTS:
: 4-element CuArray{Float32,1}:
:  1.0
:  0.0
:  2.0
:  0.0

After my attempt at a fix in 4d30a1f (which runs on the CPU):

julia> let A = CUDA.zeros(4), I = cu([1,3,1]), v = cu([1,2,3])
           @tullio A[I[i]] += v[i]
       end
ERROR: AssertionError: length(__workgroupsize) <= length(ndrange)
Stacktrace:
 [1] partition at /home/user/.julia/packages/KernelAbstractions/jAutM/src/nditeration.jl:103 [inlined]
 [2] partition(::KernelAbstractions.Kernel{CUDADevice,KernelAbstractions.NDIteration.StaticSize{(256,)},KernelAbstractions.NDIteration.DynamicSize,var"#gpu_##🇨🇺#253#3"}, ::Tuple{}, ::Nothing) at /home/user/.julia/packages/KernelAbstractions/jAutM/src/KernelAbstractions.jl:385

Besides not running, this is also too cautious, as it would apply to @tullio A[I[i]] = v[i] where it would be more justifiable to over-write -- there is no guaranteed order, but accumulation ought to be right.

@dfdx points out that ScatterNNlib.jl has an approach using atomic_add which may be worth trying to copy:

https://github.com/yuehhua/ScatterNNlib.jl/blob/74622bf40437f16e10468e8a35dae824527c83cf/src/cuda/cuarray.jl#L6-L17

Use with ReverseDiff

I think this is still blocked by JuliaDiff/ReverseDiff.jl#135, here's the error:

julia> Base.gensym(ex::Expr) = gensym(string(ex)); # without this, LoadError: MethodError: no method matching gensym(::Expr) instead

julia> using ReverseDiff, Tullio
┌ Warning: Error requiring ReverseDiff from Tullio:
│ LoadError: UndefVarError: ev not defined
│ Stacktrace:
│  [1] top-level scope at /Users/me/.julia/packages/ReverseDiff/Thhqg/src/macros.jl:190
│  [2] include(::Function, ::Module, ::String) at /Applications/Julia-1.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
│  [3] include at ./Base.jl:368 [inlined]
│  [4] include(::String) at /Users/me/.julia/packages/Tullio/HGzih/src/Tullio.jl:1
│  [5] top-level scope at none:1
│  [6] eval at ./boot.jl:331 [inlined]
│  [7] eval at /Users/me/.julia/packages/Tullio/HGzih/src/Tullio.jl:1 [inlined]
│  [8] (::Tullio.var"#150#154")() at /Users/me/.julia/packages/Requires/qy6zC/src/require.jl:85
...

julia> ReverseDiff.gradient(x -> sum(@tullio y[i] := x[i]), rand(3))
ERROR: MethodError: no method matching track(::Tullio.Eval{var"#ℳ𝒶𝓀ℯ#10"{var"#𝒜𝒸𝓉!#7"},var"#20#∇ℳ𝒶𝓀ℯ#9"{var"#∇𝒜𝒸𝓉!#8"}}, ::ReverseDiff.TrackedArray{Float64,Float64,1,Array{Float64,1},Array{Float64,1}})
Stacktrace:
 [1] (::Tullio.Eval{var"#ℳ𝒶𝓀ℯ#10"{var"#𝒜𝒸𝓉!#7"},var"#20#∇ℳ𝒶𝓀ℯ#9"{var"#∇𝒜𝒸𝓉!#8"}})(::ReverseDiff.TrackedArray{Float64,Float64,1,Array{Float64,1},Array{Float64,1}}) at /Users/me/.julia/packages/Tullio/HGzih/src/grad/reverse.jl:4
 [2] (::var"#5#6")(::ReverseDiff.TrackedArray{Float64,Float64,1,Array{Float64,1},Array{Float64,1}}) at ./REPL[4]:1
 [3] ReverseDiff.GradientTape(::var"#5#6", ::Array{Float64,1}, ::ReverseDiff.GradientConfig{ReverseDiff.TrackedArray{Float64,Float64,1,Array{Float64,1},Array{Float64,1}}}) at /Users/me/.julia/packages/ReverseDiff/Thhqg/src/api/tape.jl:199
 [4] gradient(::Function, ::Array{Float64,1}, ::ReverseDiff.GradientConfig{ReverseDiff.TrackedArray{Float64,Float64,1,Array{Float64,1},Array{Float64,1}}}) at /Users/me/.julia/packages/ReverseDiff/Thhqg/src/api/gradients.jl:22 (repeats 2 times)

What the macro does:

using TensorCast, ReverseDiff
id(x) = x;
@pretty ReverseDiff.@grad function id(x)
           ReverseDiff.value(x), Δ -> (Δ,)
       end

Gradient missing for `begin ... end`

julia> f1(x) = sum(@tullio res[i, j] := begin           
                       x[i+2, j] - 2 * x[i+1, j] + x[i, j]
                   end)
f1 (generic function with 1 method)

julia> gradient(f1, x)
ERROR: no gradient definition here!
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] (::Tullio.var"#215#216"{Tullio.Eval{var"#ℳ𝒶𝓀ℯ#11"{var"#𝒜𝒸𝓉!#10"}, Nothing}, Tuple{Matrix{Float64}}, Matrix{Float64}})(Δ::FillArrays.Fill{Float64, 2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}})
   @ Tullio ~/.julia/packages/Tullio/bgqFi/src/grad/zygote.jl:7
 [3] (::Tullio.var"#85#back#217"{Tullio.var"#215#216"{Tullio.Eval{var"#ℳ𝒶𝓀ℯ#11"{var"#𝒜𝒸𝓉!#10"}, Nothing}, Tuple{Matrix{Float64}}, Matrix{Float64}}})(Δ::FillArrays.Fill{Float64, 2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}})
   @ Tullio ~/.julia/packages/ZygoteRules/OjfTt/src/adjoint.jl:59
 [4] Pullback
   @ ./REPL[39]:1 [inlined]
 [5] (::typeof((f1)))(Δ::Float64)
   @ Zygote ~/.julia/packages/Zygote/KpME9/src/compiler/interface2.jl:0
 [6] (::Zygote.var"#41#42"{typeof((f1))})(Δ::Float64)
   @ Zygote ~/.julia/packages/Zygote/KpME9/src/compiler/interface.jl:40
 [7] gradient(f::Function, args::Matrix{Float64})
   @ Zygote ~/.julia/packages/Zygote/KpME9/src/compiler/interface.jl:49
 [8] top-level scope
   @ REPL[40]:1

julia> f1(x) = sum(@tullio res[i, j] := x[i+2, j] - 2 * x[i+1, j] + x[i, j])
f1 (generic function with 1 method)

julia> gradient(f1, x)
([1.0 1.0 … 1.0 1.0; -1.0 -1.0 … -1.0 -1.0; … ; -1.0 -1.0 … -1.0 -1.0; 1.0 1.0 … 1.0 1.0],)

I encountered this because my equation got quite lengthy.

Pass LoopVectorization static size information

Minimal example:

using Tullio, LoopVectorization
@tullio out[i] := 0.1*data[i+o]  o in 0:9

Part of the expansion is:

local 𝒶𝓍o = 0:9
local 𝒶𝓍i = (Tullio.subranges)((eachindex)(data), 𝒶𝓍o)
local 𝓇𝒽𝓈(data, i, o) = begin
    #= /home/chriselrod/.julia/packages/Tullio/u3myB/src/macro.jl:788 =#
    identity(0.1 * data[i + o])
end
begin
    #= /home/chriselrod/.julia/packages/Tullio/u3myB/src/macro.jl:797 =#
    local 𝒯1 = Core.Compiler.return_type(𝓇𝒽𝓈, (typeof)((data, (first)(𝒶𝓍i), (first)(𝒶𝓍o))))
    #= /home/chriselrod/.julia/packages/Tullio/u3myB/src/macro.jl:798 =#
    local 𝒯2 = if Base.isconcretetype(𝒯1)
        #= /home/chriselrod/.julia/packages/Tullio/u3myB/src/macro.jl:799 =#
        𝒯1
    else
        #= /home/chriselrod/.julia/packages/Tullio/u3myB/src/macro.jl:801 =#
        nothing
        #= /home/chriselrod/.julia/packages/Tullio/u3myB/src/macro.jl:802 =#
        (typeof)(𝓇𝒽𝓈(data, (first)(𝒶𝓍i), (first)(𝒶𝓍o)))
    end
end
local 𝒯3 = 𝒯2
local 𝒯 = 𝒯3
(first)(𝒶𝓍i) == 1 || (throw)("to allow indices not starting at 1, OffsetArrays must be visible in the caller's module. Otherwise write `A[i+_] := ...` to remove offset")
local out = similar(parent(data), 𝒯, tuple(Base.OneTo(𝒶𝓍i)))
begin
    (Tullio.threader)(𝒜𝒸𝓉!, (Tullio.storage_type)(out, data), out, tuple(data), tuple(𝒶𝓍i), tuple(𝒶𝓍o), +, 262144, nothing)
    #= /home/chriselrod/.julia/packages/Tullio/u3myB/src/macro.jl:956 =#
    out
end

Making the ranges statically sized, e.g.

using ArrayInterface
StaticInt(0):StaticInt(9)

would let LoopVectorization specialize on the 0:9.

using Tullio, LoopVectorization, BenchmarkTools

data = randn(1000);
out1 = similar(data, length(data) - 9);
out2 = similar(data, length(data) - 9);

function rollingmean10_lv!(out, data)
    @avx for i  eachindex(out)
        outᵢ = zero(eltype(out))
        for j  0:9
            outᵢ += data[i+j]
        end
        out[i] = 0.1 * outᵢ
    end
    out
end
rollingmean10_tullio!(out, data) = @tullio out[i] = 0.1*data[i+o]  o in 0:9 threads=false

rollingmean10_lv!(out1, data) ≈ rollingmean10_tullio!(out2, data)

@benchmark rollingmean10_lv!($out1, $data)
@benchmark rollingmean10_tullio!($out2, $data)

With a 10980xe, yields:

julia> rollingmean10_lv!(out1, data) ≈ rollingmean10_tullio!(out2, data)
true

julia> @benchmark rollingmean10_lv!($out1, $data)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     280.072 ns (0.00% GC)
  median time:      285.438 ns (0.00% GC)
  mean time:        285.265 ns (0.00% GC)
  maximum time:     465.455 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     292

julia> @benchmark rollingmean10_tullio!($out2, $data)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     297.058 ns (0.00% GC)
  median time:      297.394 ns (0.00% GC)
  mean time:        297.760 ns (0.00% GC)
  maximum time:     448.108 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     259

The performance difference is a little larger on my laptop (tiger lake):

julia> @benchmark rollingmean10_lv!($out1, $data)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     280.783 ns (0.00% GC)
  median time:      295.303 ns (0.00% GC)
  mean time:        301.026 ns (0.00% GC)
  maximum time:     778.703 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     290

julia> @benchmark rollingmean10_tullio!($out2, $data)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     313.283 ns (0.00% GC)
  median time:      344.108 ns (0.00% GC)
  mean time:        345.672 ns (0.00% GC)
  maximum time:     522.862 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     240

As I found out by looking at the asm, the difference is that Tullio uses fma instructions, which are a little slower on Tiger Lake.

With the static 10, it'll completely unroll the loop:

L80:
        vmovupd zmm1, zmmword ptr [rdi + 8*rcx - 136]
        vmovupd zmm2, zmmword ptr [rdi + 8*rcx - 64]
        vaddpd  zmm1, zmm1, zmmword ptr [rdi + 8*rcx - 128]
        vaddpd  zmm2, zmm2, zmmword ptr [rdi + 8*rcx - 72]
        vaddpd  zmm1, zmm1, zmmword ptr [rdi + 8*rcx - 120]
        vaddpd  zmm3, zmm2, zmmword ptr [rdi + 8*rcx - 56]
        vaddpd  zmm1, zmm1, zmmword ptr [rdi + 8*rcx - 112]
        vaddpd  zmm3, zmm3, zmmword ptr [rdi + 8*rcx - 48]
        vaddpd  zmm1, zmm1, zmmword ptr [rdi + 8*rcx - 104]
        vaddpd  zmm3, zmm3, zmmword ptr [rdi + 8*rcx - 40]
        vaddpd  zmm1, zmm1, zmmword ptr [rdi + 8*rcx - 96]
        vaddpd  zmm3, zmm3, zmmword ptr [rdi + 8*rcx - 32]
        vaddpd  zmm1, zmm1, zmmword ptr [rdi + 8*rcx - 88]
        vaddpd  zmm3, zmm3, zmmword ptr [rdi + 8*rcx - 24]
        vaddpd  zmm1, zmm1, zmmword ptr [rdi + 8*rcx - 80]
        vaddpd  zmm3, zmm3, zmmword ptr [rdi + 8*rcx - 16]
        vaddpd  zmm3, zmm3, zmmword ptr [rdi + 8*rcx - 8]
        vaddpd  zmm1, zmm1, zmm2
        vaddpd  zmm2, zmm3, zmmword ptr [rdi + 8*rcx]
        vmulpd  zmm1, zmm1, zmm0
        vmovupd zmmword ptr [rdx], zmm1
        vmulpd  zmm1, zmm2, zmm0
        vmovupd zmmword ptr [rdx + 64], zmm1
        add     rcx, 16
        sub     rdx, -128
        cmp     rcx, r11
        jl      L80

Versus without it:

L176:
        vfmadd231pd     zmm1, zmm0, zmmword ptr [rbx + 8*rdi - 192] # zmm1 = (zmm0 * mem) + zmm1
        vfmadd231pd     zmm2, zmm0, zmmword ptr [rbx + 8*rdi - 128] # zmm2 = (zmm0 * mem) + zmm2
        vfmadd231pd     zmm3, zmm0, zmmword ptr [rbx + 8*rdi - 64] # zmm3 = (zmm0 * mem) + zmm3
        vfmadd231pd     zmm4, zmm0, zmmword ptr [rbx + 8*rdi] # zmm4 = (zmm0 * mem) + zmm4
        inc     rdi
        cmp     rsi, rdi
        jne     L176

This version unrolls the i loop by 4 and doesn't unroll the o loop at all.
It also uses fmas, instead of adds with a multiply at the end. Which is faster is architecture dependent.
On some computers, adds and fmas are equally fast, meaning you wouldn't want the multiply at the end. Bizarrely, on Haswell, fma is actually faster than add, but more commonly fma has a lower throughput.
That's an optimization LoopVectorization (or LLVM, given associativity) should thus be making. Just observing.

In general, LoopVectorization should do better with more static size info. It can also sometimes help compile times, e.g. it can make a big difference with 2d convolutions.

CI is failing (timing out) on Julia nightly

On the most recent commit to master (58c1fbb), CI passed on Julia 1.4 and passed on Julia 1.5 but failed (timed out) on Julia nightly. Links to the logs:

GitHub Actions will kill a job if it runs for more than six hours.

We need to figure out how to make the CI jobs run for less than six hours.

Here is one option: we separate the test suite into "groups". E.g. if currently the test suite looks like this:

using Tullio
using Test

@test f() == "a"

@test g() == "b"

@test h() == "c"

We instead divide the test suite into three groups as such:

using Tullio
using Test

test_group = get(ENV, "TULLIO_TEST_GROUP", "all")

if test_group in ["all", "1"]
    @test f() == "a"
end

if test_group in ["all", "2"]
    @test g() == "b"
end

if test_group in ["all", "3"]
    @test h() == "c"
end

And then we set up the matrix in .github/workflows/ci.yml file as such:

strategy:
  fail-fast: false
  matrix:
    JULIA_NUM_THREADS:
      - '1'
      - '6'
    TULLIO_TEST_GROUP:
      - '1'
      - '2'
      - '3'
    version:
      - '1.4'
      - '1' # automatically expands to the latest stable 1.x release of Julia
    os:
      - ubuntu-latest
    arch:
      - x64

@mcabbott @chriselrod Any thoughts?

GPU performance

I wasn't sure whether to open this issue here or on KernelAbstractions, so I figured I'd open it here first for discussion, and then we can move it to KernelAbstractions later if need be.

Right now, the GPU performance doesn't seem to scale very well on large matrices.

Here's an example plot for Matrix{Float32}=Matrix{Float16}×Matrix{Float16}, generated on a Titan V.

(Benchmark plot: TFLOPS vs. matrix size for CUBLAS, GemmKernels, and Tullio.)

Here's the code used to generate it:

using BLASBenchmarksGPU
import CUDA
bench_result = BLASBenchmarksGPU.runbench(:CUDA, Float16, Float16, Float32)
import PyPlot
BLASBenchmarksGPU.plotbench(bench_result, "plot.png")
CUDA.versioninfo()

And here's the output, including all of the TFLOPS values:

julia> using BLASBenchmarksGPU

julia> import CUDA

julia> bench_result = BLASBenchmarksGPU.runbench(:CUDA, Float16, Float16, Float32)
Progress: 100%|███████████████████████████████████████████████████| Time: 0:08:24
  Size:         16384
  CUBLAS:       62.9 TFLOPS
  GemmKernels:  66.19 TFLOPS
  Tullio:       0.29 TFLOPS
Bennchmark Result of Matrix{Float32}=Matrix{Float16}×Matrix{Float16}
24×4 DataFrame
 Row │ Size   Library      TFLOPS      Time
     │ Int64  Symbol       Float64     Float64
─────┼─────────────────────────────────────────────────
   1 │   128  CUBLAS        0.148099    28321.0
   2 │   128  GemmKernels   0.0332214  126253.0
   3 │   128  Tullio        0.0540789   77559.0
   4 │   256  CUBLAS        0.642928    52190.0
   5 │   256  GemmKernels   0.268756   124851.0
   6 │   256  Tullio        0.310169   108181.0
   7 │   512  CUBLAS        4.36686     61471.0
   8 │   512  GemmKernels   2.02769    132385.0
   9 │   512  Tullio        0.677953   395950.0
  10 │  1024  CUBLAS       25.168       85326.0
  11 │  1024  GemmKernels  14.2662     150529.0
  12 │  1024  Tullio        0.850125        2.52608e6
  13 │  2048  CUBLAS       59.5005     288735.0
  14 │  2048  GemmKernels  36.4345     471527.0
  15 │  2048  Tullio        0.76592         2.24304e7
  16 │  4096  CUBLAS       84.8186          1.62039e6
  17 │  4096  GemmKernels  57.559           2.38779e6
  18 │  4096  Tullio        0.393097        3.49631e8
  19 │  8192  CUBLAS       90.1617          1.21949e7
  20 │  8192  GemmKernels  59.9127          1.83519e7
  21 │  8192  Tullio        0.324503        3.38829e9
  22 │ 16384  CUBLAS       62.9003          1.39842e8
  23 │ 16384  GemmKernels  66.1909          1.3289e8
  24 │ 16384  Tullio        0.294184        2.98999e10


julia> import PyPlot

julia> BLASBenchmarksGPU.plotbench(bench_result, "plot.png")

julia> CUDA.versioninfo()
CUDA toolkit 11.1.1, artifact installation
CUDA driver 11.2.0
NVIDIA driver 460.27.4

Libraries:
- CUBLAS: 11.3.0
- CURAND: 10.2.2
- CUFFT: 10.3.0
- CUSOLVER: 11.0.1
- CUSPARSE: 11.3.0
- CUPTI: 14.0.0
- NVML: 11.0.0+460.27.4
- CUDNN: 8.0.4 (for CUDA 11.1.0)
- CUTENSOR: 1.2.1 (for CUDA 11.1.0)

Toolchain:
- Julia: 1.6.0-beta1
- LLVM: 11.0.0
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

Environment:
- JULIA_CUDA_VERBOSE: true

1 device:
  0: TITAN V (sm_70, 658.500 MiB / 11.784 GiB available)

Migrate GPU CI to Buildkite?

Right now, Tullio.jl uses the GitLab runners for running GPU CI in the JuliaGPU group.

If I understand correctly, JuliaGPU is transitioning from GitLab to Buildkite, with the plan of eventually completely discontinuing the GitLab runners.

It would be great to migrate the GPU CI for Tullio.jl from the GitLab runners to the Buildkite runners.

@maleadt Can you provide instructions for migrating to the new Buildkite JuliaGPU setup?

syntax error when array and index shares the same symbol

This is the minimal example of the error.

a, b = randn(ComplexF64, 10, 10), randn(ComplexF64, 10, 10)
@tullio # (verbose = false, fastmath = true, threads = true, grad = :Base, avx = true, cuda = 256, tensor = true)
@tullio out[a, b] := a[a, c] * b[c, b] # works
@tullio out[A, B] := a[A, C] * conj(b[C, B]) # works
@tullio out[a, b] := a[a, c] * conj(b[c, b]) # syntax error

When @tullio replaced by @tensor in TensorOperations.jl, the last expression works well. Is this a bug of the @tullio macro?

besselj0 error

Code that used to work on 1.5.3 and still works with e.g. sin

k = LinRange(.16,37.8,Np) |> collect
rv = LinRange(0,15,1000) |> collect
gvec = zero(rv)

@tullio gvec[i] = rv[i]*sin(k[j]*rv[i])

and used to work with besselj0 now fails

@tullio gvec[i] = rv[i]*besselj0(k[j]*rv[i])
ERROR: LoadError: StackOverflowError:
Stacktrace:
 [1] besselj0(x::VectorizationBase.Vec{4, Float64})
   @ SpecialFunctions ~/.julia/packages/SpecialFunctions/24S26/src/bessel.jl:204
in expression starting at /Users/abradley/Dropbox/Research/Projects/GPESpectra/scripts/2d_trap_correlation_spectra.jl:760

Banded matrix multiplication does not LoopVectorize

I have the slightly unusual use-case dense matrix = dense matrix * banded matrix (actually more complicated than this, but this shows the issue). With the help of @mcabbott I managed to implement the following:

using LinearAlgebra
using BandedMatrices
using Tullio
using LoopVectorization

function clear_off_bands!(A::BandedMatrix{T}) where T
    m,n = size(A.data)
    l,u = bandwidths(A)

    for j = 1:u
        A.data[1:u-j+1,j] .= zero(T)
    end
    mpn = m+n
    for j = n-l+1:n
        A.data[mpn-l-j+1:end,j] .= zero(T)
    end
    A
end

function my_mul!(w, v, A::BandedMatrix)
    # This implementation assumes that the off-band values, i.e. the
    # upper-left and lower-right corners, are zero, cf. clear_off_bands!
    l,u = bandwidths(A)
    m,n = size(A)
    @tullio verbose=2 w[i,j] = begin
        k = min(max(h + j - u - 1, 1), m)
        v[i,k] * A.data[h,j]
    end
    w
end

n = 4000
v = Matrix((1.0+0im)*I, n, n) # Complex identity matrix
w = similar(v)
w2 = similar(v)
w3 = similar(v)

A = clear_off_bands!(brand(ComplexF64, n, n, 5, 5))
At = BandedMatrix(transpose(A))

@time mul!(w, v, A) # This is exceedingly slow, since this uses generic dense multiplication
@time mul!(transpose(w2), At, transpose(v)) # This will BLAS
@time my_mul!(w3, v, A)

@show norm(w-w2), norm(w-w3), norm(A-w3)

which gives a warning

┌ Warning: LoopVectorization failed
│   err =
│    LoadError: ArgumentError: Collection has multiple elements, must contain exactly 1 element
│    in expression starting at /Users/jagot/.julia/packages/Tullio/bgqFi/src/macro.jl:1092
└ @ Tullio ~/.julia/packages/Tullio/bgqFi/src/macro.jl:1118

but other than that seems to work well. On my laptop I get the following timings:

  • Naïve dense: 76.651751 s
  • BLAS with transposes: 1.152771 seconds
  • Tullio w/o AVX: 0.529101 seconds 🎉

cc @chriselrod

Make sure to avoid race conditions with nontrivial LHS indexing

The following is valid Tullio code:

@tullio C[i ÷ 2, k] += A[i, j] * B[j, k]

but a naive implementation could incur a race condition (as different values of i end up writing to the same slot in C). Following a conversation on slack, the default should be "safe but slow", so avoid multithreading on unsafe indices.

Bounds checking too conservative on sub-arrays

The following minimal example complains about the indices of a being out of range even though 1 <= a[1, j] <= 3. I wonder if bounds checking could be modified to handle scenarios like this?

using Tullio

a = [1 2 3
     4 5 6]
b = [1, 2, 3]
@tullio x := b[a[1, j]]

# "extrema of index a[1, j] must fit within b"

Spurious printing of "DiffRules._abs_deriv" when LoopVectorization.jl loaded

julia> using Tullio

julia> let v = [-1, 0, 1, 2]
           @tullio out := abs(v[i])
       end
4

julia> using LoopVectorization

julia> let v = [-1, 0, 1, 2]
           @tullio out := abs(v[i])
       end
DiffRules._abs_deriv
4

julia> typeof(ans)
Int64

julia> let v = [-1, 0, 1, 2]
           @tullio out := abs(v[i])
       end
DiffRules._abs_deriv
4

Is this maybe some println debugging that didn't get removed?

VSCode debug with @einsum stops in macro.jl

Dear Michael,

I'm using @einsum from Tullio and I get an unexpected behavior when debugging a julia program in VSCode, https://www.julia-vscode.org/. The debugger always stops at the same line in macro.jl.
Say, I have a program like

using Tullio: @einsum

v = [1; 2; 3];
w = [4; 5; 6];       # <== 1st breakpoint
@einsum a[i,j] := v[i] * w[j]; 
println("a=", a); 
b  = 4 * a;          # <== 2nd breakpoint 
println("b=", b);

After I start the debugger with F5 VSCode shows correct behavior at the 1st breakpoint, see the attached file 2_debugger_at_1st_BP.PNG. When I continue with F5 the debugger stops in line 826 of macro.jl. See attached file 3_debugger_stops_in_macro_jl.PNG. Another continue with F5 leads me to the 2nd BP, see file 4_debugger_at_2n_BP.PNG.

In case I have several @einsum statements the debugger stops several times always in macro.jl at line 826. That can be disturbing if I want to debug code after many @einsum statements.

This behavior may be also related to VSCode but maybe you have a hint?

Thank you

Ralf

After starting the debugger with F5 at the 1st breakpoint (as expected)

2_debugger_at_1st_BP

Continue F5 and the debugger stops in macro.jl (unexpected behavior)

3_debugger_stops_in_macro_jl

Another Continue F5 and the debugger stops at the 2nd break point (as expected)

4_debugger_at_2nd_BP

When supplied range conflicts with un-shifted index...

Hey,

the following one:

julia> using Tullio

julia> y = ^C

julia> y = (1:10) .* (1:10)'
10×10 Matrix{Int64}:
  1   2   3   4   5   6   7   8   9   10
  2   4   6   8  10  12  14  16  18   20
  3   6   9  12  15  18  21  24  27   30
  4   8  12  16  20  24  28  32  36   40
  5  10  15  20  25  30  35  40  45   50
  6  12  18  24  30  36  42  48  54   60
  7  14  21  28  35  42  49  56  63   70
  8  16  24  32  40  48  56  64  72   80
  9  18  27  36  45  54  63  72  81   90
 10  20  30  40  50  60  70  80  90  100

julia> f(x) = (@tullio res[i, j] := x[i, j+1] (i in 1:5, j in 1:5))
f (generic function with 1 method)

julia> f(y)
ERROR: "range of index i must agree"
Stacktrace:
 [1] ℳ𝒶𝓀ℯ
   @ ~/.julia/packages/Tullio/bgqFi/src/macro.jl:968 [inlined]
 [2] (::Tullio.Eval{var"#ℳ𝒶𝓀ℯ#180"{var"#𝒜𝒸𝓉!#177"}, var"#735#∇ℳ𝒶𝓀ℯ#179"{var"#∇𝒜𝒸𝓉!#178"}})(args::Matrix{Int64})
   @ Tullio ~/.julia/packages/Tullio/bgqFi/src/eval.jl:20
 [3] f(x::Matrix{Int64})
   @ Main ./REPL[119]:1
 [4] top-level scope
   @ REPL[120]:1

julia> f(x) = (@tullio res[i, j] := x[i+0, j+1] (i in 1:5, j in 1:5))
f (generic function with 1 method)

julia> f(y)
5×5 Matrix{Int64}:
  2   3   4   5   6
  4   6   8  10  12
  6   9  12  15  18
  8  12  16  20  24
 10  15  20  25  30

What is the reason that we need the artificial +0? My goal was to calculate only a subset of the initial x.

Thanks,

Felix

Idea: on Buildkite, only run the GPU-related tests?

We now have Buildkite working 🎉

On CPU, the test suite is quite long - on Julia 1.5 the CPU tests take e.g. 2 hours and 12 minutes (https://github.com/mcabbott/Tullio.jl/runs/1557475138).

There's no need to run all the CPU-only tests on Buildkite, since we will already be running them on GitHub Actions.

We can e.g. set an environment variable on Buildkite, e.g. BUILDKITE="true". And then in the test suite, we will do:

is_buildkite = parse(Bool, get(ENV, "BUILDKITE", "false"))

And then if is_buildkite is true, we only run the GPU-related tests.
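A sketch of what that gate could look like in runtests.jl (the file names here are hypothetical):

is_buildkite = parse(Bool, get(ENV, "BUILDKITE", "false"))

if is_buildkite
    include("gpu_tests.jl")   # hypothetical file: only the GPU-related tests
else
    include("cpu_tests.jl")   # hypothetical file: the full CPU suite
end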

InexactError when the intermediate reduction type differs from the final type

I came into an InexactError error when using Tullio to perform reduction and then abs2 over complex vectors.

The MWE might be easier to explain the issue:

using FFTW, Tullio
x = rand(120) .- .5
y = rand(120) .- .5
u = rand(500) .- .5
v = rand(500) .- .5
a = rand(120)
k = 2π/.01

function tfunc(u,v,a,k,x,y)
    @tullio vals[i] := exp(im * (k * (x[j]*u[i] + y[j]*v[i]) + a[j])) |> abs2
end

tfunc(u,v,a,k,x,y)

Which results in

ERROR: LoadError: InexactError: Float64(-13.863000778992978 + 2.707549484306247im)

In my example the issue can easily be resolved by applying abs2 directly to the vals array returned by the @tullio macro.
I briefly discussed this with @mcabbott on Slack, and it appears to happen only when threading is enabled in @tullio.
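
The workaround mentioned above, as a sketch using the same variables as the MWE (tfunc2 is a hypothetical name): keep the reduction complex and apply abs2 afterwards.

function tfunc2(u,v,a,k,x,y)
    @tullio vals[i] := exp(im * (k * (x[j]*u[i] + y[j]*v[i]) + a[j]))  # complex reduction
    return abs2.(vals)                                                 # abs2 outside the macro
end

tfunc2(u,v,a,k,x,y)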

How to implement n-dimensional functions?

Hey,

I'm trying to figure out how we could use Tullio to write some functions that work in n dimensions.
Say I've got an n-dimensional array and I want to use Tullio on it. At the moment I need to treat each dimensionality separately.

using Tullio
using Random

Random.seed!(42)

function spatial_grad_square_1D(arr)
    resi = let 
         @tullio resi = abs2(arr[i + 1] - arr[i - 1])
    end
    return resi
end

function spatial_grad_square_2D(arr)
    resi = let 
         @tullio resi = abs2(arr[i + 1, j] - arr[i - 1, j])
    end
    resj = let 
         @tullio resj = abs2(arr[i, j + 1] - arr[i, j - 1])
    end
    return resi + resj
end

arr_1D = randn((10))
arr_2D = randn((10, 10))
spatial_grad_square_1D(arr_1D) 
spatial_grad_square_2D(arr_2D)

Do you have any hints on how to achieve this in a more elegant way? I couldn't figure it out from the Julia metaprogramming documentation 😞

Thanks,

Felix
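
One dimension-agnostic sketch (an editor's suggestion, not from the thread; spatial_grad_square is a hypothetical name): skip the per-dimension @tullio calls and compute the same interior central-difference sum with views, building the shifted ranges with ntuple.

function spatial_grad_square(arr::AbstractArray{T,N}) where {T,N}
    total = zero(T)
    for d in 1:N
        # shifted interior ranges along dimension d, full range along all other dimensions
        fwd = ntuple(i -> i == d ? (3:size(arr, i))     : (1:size(arr, i)), N)
        bwd = ntuple(i -> i == d ? (1:size(arr, i) - 2) : (1:size(arr, i)), N)
        total += sum(abs2, view(arr, fwd...) .- view(arr, bwd...))
    end
    return total
end

spatial_grad_square(arr_1D)  # matches spatial_grad_square_1D(arr_1D)
spatial_grad_square(arr_2D)  # matches spatial_grad_square_2D(arr_2D)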

A case where TrackedReal escapes?

julia> using Tracker, Zygote, Tullio, LinearAlgebra

julia> function gg(x)
         res = 0.0
         n = [norm(x)]
         for i in 1:100
            res += @tullio res_ := sin(exp(x[j]) - n[1])
         end
         atan(res)
       end;

julia> Zygote.gradient(gg, rand(10))
([-0.3105827909434481, -1.435298638568034, -0.22031585984529584, -0.03951256409987808, 0.14018859477924406, -0.49862889658929765, -0.15284449100831132, -0.8243820635907274, -0.06310467526562058, -0.1548104899931962],)

julia> Tracker.gradient(gg, rand(10))
ERROR: MethodError: no method matching Float64(::Tracker.TrackedReal{Float64})
Closest candidates are:
  Float64(::Real, ::RoundingMode) where T<:AbstractFloat at rounding.jl:200
  Float64(::T) where T<:Number at boot.jl:716
  Float64(::BigInt) at gmp.jl:451
  ...
Stacktrace:
 [1] convert(::Type{Float64}, ::Tracker.TrackedReal{Float64}) at ./number.jl:7
 [2] setindex!(::Array{Float64,1}, ::Tracker.TrackedReal{Float64}, ::Int64) at ./array.jl:847
 [3] macro expansion at /Users/me/.julia/dev/Tullio/src/symbolic.jl:53 [inlined]
 [4] ∇𝒜𝒸𝓉! at /Users/me/.julia/dev/Tullio/src/macro.jl:1207 [inlined]

Segmentation fault for consecutive Tullio statements

Hey,

the following code leads (roughly 30–50% of the time) to a segmentation fault.

using Tullio
using Zygote

arr = [1 2; 3 4]
function f(arr)
    @tullio res1 = arr[i, k] - arr[i - 1, k]
    @tullio res2 = arr[i, k] - arr[i, k + 1]
    return res1 + res2
end

f(arr)
print(gradient(f, arr))

It seems to be caused by the addition of the two results. Furthermore, this only happens once Zygote is loaded and gradient is called.
The output is then:

julia> include("tullio_bug.jl")
([1 -1; 2 -1],)
julia> include("tullio_bug.jl")

signal (11): Segmentation fault
in expression starting at <path>/tullio_bug.jl:12
jl_gc_pool_alloc at /usr/bin/../lib/libjulia.so.1 (unknown line)
jl_gc_alloc at /usr/bin/../lib/libjulia.so.1 (unknown line)
jl_alloc_array_1d at /usr/bin/../lib/libjulia.so.1 (unknown line)
unknown function (ip: 0x7eff849fc672)
unknown function (ip: 0x7eff84a11a38)
unknown function (ip: 0x7eff84a1234a)
jl_uncompress_ast at /usr/bin/../lib/libjulia.so.1 (unknown line)
_uncompressed_ast at ./reflection.jl:950 [inlined]
uncompressed_ast at ./reflection.jl:946
#meta#2 at /home/fxw/.julia/packages/IRTools/BpoqK/src/reflection/reflection.jl:112
meta at /home/fxw/.julia/packages/IRTools/BpoqK/src/reflection/reflection.jl:101 [inlined]
_lookup_grad at /home/fxw/.julia/packages/Zygote/YeCEW/src/compiler/emit.jl:99
#s3320#1912 at /home/fxw/.julia/packages/Zygote/YeCEW/src/compiler/interface2.jl:21 [inlined]
#s3320#1912 at ./none:0
unknown function (ip: 0x7eff776bd06e)
unknown function (ip: 0x7eff84a3edab)
jl_code_for_staged at /usr/bin/../lib/libjulia.so.1 (unknown line)
unknown function (ip: 0x7eff7761066a)
unknown function (ip: 0x7eff77615cca)
unknown function (ip: 0x7eff776aaad7)
unknown function (ip: 0x7eff776ab3d4)
unknown function (ip: 0x7eff776ab57f)
unknown function (ip: 0x7eff849ed555)
unknown function (ip: 0x7eff849edf04)
jl_apply_generic at /usr/bin/../lib/libjulia.so.1 (unknown line)
macro expansion at /home/fxw/.julia/packages/Tullio/xESL5/src/macro.jl:617 [inlined]
f at /home/fxw/Documents/Uni/FSU/2semester/Internship/DeconvOptim.jl/tullio_bug.jl:6 [inlined]
Pullback at /home/fxw/.julia/packages/Zygote/YeCEW/src/compiler/interface2.jl:0
#36 at /home/fxw/.julia/packages/Zygote/YeCEW/src/compiler/interface.jl:46
unknown function (ip: 0x7eff0cb171b2)
gradient at /home/fxw/.julia/packages/Zygote/YeCEW/src/compiler/interface.jl:55
unknown function (ip: 0x7eff84a02d85)
unknown function (ip: 0x7eff84a02a0e)
unknown function (ip: 0x7eff84a04670)
unknown function (ip: 0x7eff84a050e0)
unknown function (ip: 0x7eff84a210a1)
unknown function (ip: 0x7eff849f76e6)
jl_load at /usr/bin/../lib/libjulia.so.1 (unknown line)
unknown function (ip: 0x7eff77955c1f)
unknown function (ip: 0x7eff84a02d85)
unknown function (ip: 0x7eff84a02a0e)
unknown function (ip: 0x7eff84a04670)
unknown function (ip: 0x7eff84a050e0)
unknown function (ip: 0x7eff84a210a1)
unknown function (ip: 0x7eff84a21748)
jl_toplevel_eval_in at /usr/bin/../lib/libjulia.so.1 (unknown line)
unknown function (ip: 0x7eff77707df4)
unknown function (ip: 0x7eff77935179)
unknown function (ip: 0x7eff77935474)
unknown function (ip: 0x7eff84a08cc9)
unknown function (ip: (nil))
Allocations: 80635502 (Pool: 80625319; Big: 10183); GC: 66
[1]    395611 segmentation fault (core dumped)  julia

Could this be related to fastmath?

Thanks,

Felix

Edit: The following code seems to work. Notice the minor change in res2:

using Tullio
using Zygote

arr = [1 2; 3 4]
function f(arr)
    @tullio res1 = arr[i, k] - arr[i - 1, k]
    @tullio res2 = arr[i, k] - arr[i, k]
    return res1 + res2
end

f(arr)
print(gradient(f, arr))

Error when taking gradient of gpu expression

Running this code gives an error; with CUDA.allowscalar(false) set, it also shows that scalar operations are being performed.
On the CPU everything works as expected.

using CUDA, KernelAbstractions, Tullio, Flux

f(x) = @tullio out[m, n, batch] := W[m, i] * x[i, n, batch] + b[m]
a = cu(rand(3, 5, 6)); W = cu(rand(10, 3)); b = cu(rand(10))

Flux.gradient(Flux.Params(W)) do
    sum(f(a))
end
AssertionError: length(__workgroupsize) <= length(ndrange)

Stacktrace:
 [1] partition at C:\Users\jules\.julia\packages\KernelAbstractions\yw9SF\src\nditeration.jl:103 [inlined]
 [2] partition(::KernelAbstractions.Kernel{CUDADevice,KernelAbstractions.NDIteration.StaticSize{(256,)},KernelAbstractions.NDIteration.DynamicSize,var"#gpu_##🇨🇺#421#156"}, ::Tuple{}, ::Nothing) at C:\Users\jules\.julia\packages\KernelAbstractions\yw9SF\src\KernelAbstractions.jl:368
 [3] (::KernelAbstractions.Kernel{CUDADevice,KernelAbstractions.NDIteration.StaticSize{(256,)},KernelAbstractions.NDIteration.DynamicSize,var"#gpu_##🇨🇺#421#156"})(::CuArray{Float32,2,Nothing}, ::Vararg{Any,N} where N; ndrange::Tuple{}, dependencies::Nothing, workgroupsize::Nothing, progress::Function) at C:\Users\jules\.julia\packages\KernelAbstractions\yw9SF\src\backends\cuda.jl:196
 [4] ∇𝒜𝒸𝓉! at C:\Users\jules\.julia\packages\Tullio\YtOlX\src\macro.jl:814 [inlined]
 [5] (::var"#∇𝒜𝒸𝓉!#154")(::Type{CuArray{Float32,N,Nothing} where N}, ::CuArray{Float32,2,Nothing}, ::CuArray{Float32,3,Nothing}, ::CuArray{Float32,1,Nothing}, ::CuArray{Float32,3,Nothing}, ::CuArray{Float32,3,Nothing}, ::CuArray{Float32,2,Nothing}, ::CuArray{Float32,3,Nothing}, ::CuArray{Float32,1,Nothing}, ::Base.OneTo{Int64}, ::Base.OneTo{Int64}, ::Base.OneTo{Int64}, ::Base.OneTo{Int64}) at C:\Users\jules\.julia\packages\Tullio\YtOlX\src\macro.jl:811
 [6] ∇threader(::Function, ::Type{CuArray{Float32,N,Nothing} where N}, ::Tuple{CuArray{Float32,2,Nothing},CuArray{Float32,3,Nothing},CuArray{Float32,1,Nothing},CuArray{Float32,3,Nothing},CuArray{Float32,3,Nothing},CuArray{Float32,2,Nothing},CuArray{Float32,3,Nothing},CuArray{Float32,1,Nothing}}, ::Tuple{}, ::NTuple{4,Base.OneTo{Int64}}; block::Int64) at C:\Users\jules\.julia\packages\Tullio\YtOlX\src\eval.jl:74
 [7] #750#∇ℳ𝒶𝓀ℯ at C:\Users\jules\.julia\packages\Tullio\YtOlX\src\macro.jl:908 [inlined]
 [8] #217 at C:\Users\jules\.julia\packages\Tullio\YtOlX\src\grad\zygote.jl:8 [inlined]
 [9] (::Tullio.var"#632#back#219"{Tullio.var"#217#218"{Tullio.Eval{var"#ℳ𝒶𝓀ℯ#159"{var"#𝒜𝒸𝓉!#150"},var"#750#∇ℳ𝒶𝓀ℯ#158"{var"#∇𝒜𝒸𝓉!#154"}},Tuple{CuArray{Float32,2,Nothing},CuArray{Float32,3,Nothing},CuArray{Float32,1,Nothing}},CuArray{Float32,3,Nothing}}})(::CuArray{Float32,3,Nothing}) at C:\Users\jules\.julia\packages\ZygoteRules\6nssF\src\adjoint.jl:49
 [10] Affine at .\In[23]:11 [inlined]
 [11] (::typeof(∂(λ)))(::CuArray{Float32,3,Nothing}) at C:\Users\jules\.julia\packages\Zygote\EjVY4\src\compiler\interface2.jl:0
 [12] #237 at .\In[88]:4 [inlined]
 [13] (::typeof(∂(#237)))(::Float32) at C:\Users\jules\.julia\packages\Zygote\EjVY4\src\compiler\interface2.jl:0
 [14] (::Zygote.var"#54#55"{Zygote.Params,Zygote.Context,typeof(∂(#237))})(::Float32) at C:\Users\jules\.julia\packages\Zygote\EjVY4\src\compiler\interface.jl:177
 [15] gradient(::Function, ::Zygote.Params) at C:\Users\jules\.julia\packages\Zygote\EjVY4\src\compiler\interface.jl:54
 [16] top-level scope at In[88]:3

Upon removing the b[m] term, the code doesn't error anymore but scalar iteration seems to take place.

Not sure what to do with x + a + c

I tried to run the following code:

@tullio dP[x,y,z] -= -Diff[a+2, b+2] * Diff[c+2, d+2] * P[mod(x+a+c), mod(y+b+d), z] * P[mod(x+a),mod(y+b),z] (a in -1:1, b in -1:1, c in -1:1, d in -1:1, z in 1:3, x in 1:n, y in 1:n)

I could decompose it into:

V[x,y,z] = Diff[c+2,d+2] * P[mod(x+c), mod(y+d), z] (c in -1:1, d in -1:1, x in 1:n, y in 1:n, z in 1:3)
dP[x,y,z] -= Diff[a+2, b+2] * V[mod(x+a), mod(y+b), z] * P[mod(x+a), mod(y+b), z] (a in -1:1, b in -1:1, x in 1:n, y in 1:n, z in 1:3) 

Is there a good reason why mod(x+a+c) is not supported?

Error requiring LoopVectorization from Tullio

I get the following warning on Tullio v0.2.10 when loading LoopVectorization v0.9.1:

┌ Warning: Error requiring LoopVectorization from Tullio:
│ UndefVarError: SVec not defined
...

I believe it is due to the internal struct SVec from previous versions of VectorizationBase being renamed to Vec in current versions.

I'm not familiar enough with your internals to make a PR to fix this, but naively it looks like a simple find and replace.

Scalar reductions

As pointed out by @mforets, there is a surprisingly large overhead for small scalar reductions:

using LinearAlgebra, Tullio
v = rand(10);

td(v,w) = @tullio d := v[i] * w[i]
@btime td($v, $v)  # 2 μs
@btime dot($v, $v) # 20 ns

ts(v) = @tullio s := v[i]
@btime ts($v)  # 2 μs
@btime sum($v) # 4 ns

Perhaps this is mostly from threading (or rather, thinking whether to multi-thread). Disabling that brings it closer to the array-output case:

td1(v,w) = @tullio d := v[i] * w[i]  threads=false
@btime td1($v, $v)  # 60 ns

tt(v) = @tullio vv[i] := 2*v[i];
@btime tt($v)  # 50 ns
@btime 2 .* $v # 40 ns

tt1(v) = @tullio vv[i] := 2*v[i] threads=false;
@btime tt1($v) # 36 ns

The threaded reduction may also use init a number of times, and ideally it would check that this is compatible with the reduction, as suggested #24 (comment).

Finally, I think that @tullio d := sqrt <| v[i] * v[i] should be disallowed, as there's no reason to have the macro handle this.

Slowdown when multiple variables looped

I didn't see a warning about this anywhere, so I thought I'd open an issue. When I have multiple variables in the inner loop it takes much longer to finish; I'm assuming it isn't contracting indices intelligently (though I could be wrong). Here is a test case where I multiply 3 matrices; maybe there is a better way to write it within the Tullio framework:

MWE:

using BenchmarkTools
using Tullio

function mul3(A, B, C) 
    A * (B * C)
end

function mul3_tullio(A, B, C) 
    @tullio Y[i, j] := A[i, k] * B[k, l] * C[l, j]
end

A, B, C = rand(100, 100), rand(100, 100), rand(100, 100)
@btime mul3($A, $B, $C)
#   89.030 μs (4 allocations: 156.41 KiB)

@btime mul3_tullio($A, $B, $C)
#   111.093 ms (3 allocations: 78.25 KiB)

mul3(A, B, C)  mul3_tullio(A, B, C)
# true

This is on Julia 1.4.1, Tullio version 0.2.7
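
A sketch of the usual workaround (an editor's note, not from the original report; mul3_tullio_pairwise is a hypothetical name): contract pairwise, so each @tullio call is a single matrix multiplication (two O(N^3) passes) instead of one quadruple loop over i, j, k, l (O(N^4)).

using Tullio

function mul3_tullio_pairwise(A, B, C)
    @tullio T[k, j] := B[k, l] * C[l, j]   # contract l first
    @tullio Y[i, j] := A[i, k] * T[k, j]   # then contract k
end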

Error on Julia 1.6 with LoopVectorization on macOS

julia> using LoopVectorization

julia> using Tullio

julia> a=randn(10,10);

julia> b=randn(10,10);

julia> c=randn(10,10);

julia> @tullio c[i,j]=a[i,k]b[k,j]
ERROR: Module IR does not contain specified entry function
Stacktrace:
 [1] assume
   @ ~/.julia/packages/SIMDPirates/EVSvY/src/llvm_utils.jl:308 [inlined]
 [2] macro expansion
   @ ~/.julia/packages/LoopVectorization/pHMnJ/src/reconstruct_loopset.jl:503 [inlined]
 [3] _avx_!(::Val{(0, 0, -1, 4)}, ::Type{Tuple{:LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x01), Symbol(""), :isnothing, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000001, LoopVectorization.compute, 0x00, 0x02), :numericconstant, Symbol("##zero#257"), LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x03), :LoopVectorization, :conditionalload, LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000000, 0x0000000000000000, 0x0000000000000005, LoopVectorization.memload, 0x01, 0x04), :LoopVectorization, :~, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000002, LoopVectorization.compute, 0x00, 0x05), :LoopVectorization, :vifelse, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000020304, LoopVectorization.compute, 0x00, 0x06), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000023, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x07), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000031, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x03, 0x08), :numericconstant, Symbol("##reductzero#266"), LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000003, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x09), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x0000000000000231, 0x0000000000000003, 0x0000000000000000, 0x0000000000070809, LoopVectorization.compute, 0x00, 0x09), :LoopVectorization, :reduced_add, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000003, 0x0000000000000000, 0x0000000000000a06, LoopVectorization.compute, 0x00, 0x06), :LoopVectorization, :setindex!, LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000000, 0x0000000000000000, 0x000000000000000b, LoopVectorization.memstore, 0x01, 0x0a)}}, ::Type{Tuple{LoopVectorization.ArrayRefStruct{:ℛ, Symbol("##vptr##_ℛ")}(0x0000000000000101, 0x0000000000000201, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:a, Symbol("##vptr##_a")}(0x0000000000000101, 0x0000000000000203, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:b, Symbol("##vptr##_b")}(0x0000000000000101, 0x0000000000000301, 0x0000000000000000)}}, ::Type{Tuple{0, Tuple{}, Tuple{1}, Tuple{}, Tuple{}, Tuple{(3, LoopVectorization.IntOrFloat), (9, LoopVectorization.IntOrFloat)}, Tuple{}}}, ::Type{Tuple{:j, :i, :k}}, ::Tuple{UnitRange{Int64}, UnitRange{Int64}, UnitRange{Int64}}, ::VectorizationBase.PackedStridedPointer{Float64, 1}, ::VectorizationBase.PackedStridedPointer{Float64, 1}, ::VectorizationBase.PackedStridedPointer{Float64, 1}, ::Nothing, ::typeof(isnothing))
   @ LoopVectorization ~/.julia/packages/LoopVectorization/pHMnJ/src/reconstruct_loopset.jl:503
 [4] 𝒜𝒸𝓉!
   @ ~/.julia/packages/Tullio/oeP9x/src/macro.jl:932 [inlined]
 [5] tile_halves(fun!::var"#𝒜𝒸𝓉!#1", ::Type{Matrix{Float64}}, As::Tuple{Matrix{Float64}, Matrix{Float64}, Matrix{Float64}}, Is::Tuple{UnitRange{Int64}, UnitRange{Int64}}, Js::Tuple{UnitRange{Int64}}, keep::Nothing, final::Bool)
   @ Tullio ~/.julia/packages/Tullio/oeP9x/src/threads.jl:142
 [6] tile_halves
   @ ~/.julia/packages/Tullio/oeP9x/src/threads.jl:139 [inlined]
 [7] threader(fun!::var"#𝒜𝒸𝓉!#1", ::Type{Matrix{Float64}}, Z::Matrix{Float64}, As::Tuple{Matrix{Float64}, Matrix{Float64}}, I0s::Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, J0s::Tuple{Base.OneTo{Int64}}, redfun::Function, block::Int64, keep::Nothing)
   @ Tullio ~/.julia/packages/Tullio/oeP9x/src/threads.jl:68
 [8] top-level scope
   @ ~/.julia/packages/Tullio/oeP9x/src/macro.jl:802
julia> versioninfo()
Julia Version 1.6.0-DEV.1093
Commit 28330a2fef (2020-09-30 14:46 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.5.0)
  CPU: Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-10.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 4

Different results: for loop vs Tullio

Hey,

the code below produces different results with Tullio than with plain for loops. Once I use Float64 the results are very close to each other. Is this due to @fastmath? If so, can we disable it?

Thanks,

Felix

using TestImages
using Tullio
img = convert(Array{Float32}, testimage("fabio_gray_512"))
x = reshape(img, 512, 512)

@tullio res_t = (x[i + 1, j] + x[i - 1, j])^2 (i in 2:size(x)[1] - 1)
print(res_t, "\n")

@tullio res_t = (x[i + 1, j] + x[i - 1, j])^2
print(res_t, "\n")

res_f = 0
for i = 2:size(x)[1] -1
    for j = 1:size(x)[2]
        global res_f += (x[i+1, j] + x[i - 1, j])^2
    end
end
print(res_f)

output:

219891.22
219891.22
219914.31
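
If @fastmath is the suspect, it can be switched off per call with the fastmath keyword (assuming a Tullio version that accepts it); part of the discrepancy may also simply come from the different (tiled/threaded) summation order at Float32 precision. A sketch:

@tullio fastmath=false res_t = (x[i + 1, j] + x[i - 1, j])^2
print(res_t, "\n")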

Wrong derivative reducing over `min`, when using KernelAbstractions

MWE:

julia> using Tullio, CUDA, KernelAbstractions, Zygote, FiniteDifferences

julia> g(μ) = ifelse(μ == 4, 1, -1)
g (generic function with 1 method)

julia> push!(Tullio.BASE_NOGRAD, :g);

julia> f2(w, k) = @tullio min f[m] := w[j, m] * g(μ) * (k[j, μ] - k[m, μ])^2
f2 (generic function with 1 method)

julia> using Random; Random.seed!(1);

julia> k = rand(Float32, 2, 4)
2×4 Array{Float32,2}:
 0.547994  0.567737  0.27934   0.8135
 0.819285  0.557336  0.777828  0.00389743

julia> w = rand(Float32, 2, 2)
2×2 Array{Float32,2}:
 0.699683  0.903188
 0.568097  0.416541

julia> Δ = rand(Float32, 2)
2-element Array{Float32,1}:
 0.51531243
 0.13594759

julia> Zygote.pullback(f2, w, k)[2](Δ)[2]
2×4 Array{Float32,2}:
 0.0  0.0   0.414277  0.0
 0.0  0.0  -0.414277  0.0

julia> Zygote.pullback(f2, cu(w), cu(k))[2](cu(Δ))[2]
2×4 CuArray{Float32,2}:
 0.0  0.0   0.414277  0.0
 0.0  0.0  -0.122415  0.0

julia> j′vp(central_fdm(5, 1), f2, Δ, w, k)[2]
2×4 Array{Float32,2}:
 5.59755f-9  5.59755f-9   0.414277  5.59755f-9
 5.59755f-9  5.59755f-9  -0.414277  5.59755f-9

Seems fine if I use + instead of min as the reducer.

Slower performance on GPU than CPU with Zygote

I'm experiencing slower performance when taking gradients with CuArrays than with normal CPU Arrays, using Zygote. Here's an MWE (the Tullio operations implement a mixture linear regression):

using Tullio
using Zygote
using CUDA, KernelAbstractions

CUDA.allowscalar(false)

function prediction(V, W, r)
    @tullio mixture[a, b, d] := V[a, b, c] * W[d, c] 
    @tullio r̂[a, d] := mixture[a, b, d] * r[b, d] 
    return r̂
end

### CuArray version
V = CuArray(randn(15, 15, 3))
W = CuArray(randn(150, 3))


@time begin 
    grad = gradient(Params([V])) do 
        l = 0
        for t = 1:5
            r = CuArray(randn(15, 150))
            r̂ = prediction(V, W, r) 
            l += sum(r - r̂)
        end
        l
    end
end
# output:   0.304601 seconds (516.80 k allocations: 15.125 MiB)

If I replace all the arrays with CPU arrays, I get 0.151464 seconds (209.86 k allocations: 13.321 MiB).

I'm wondering if this is expected? Or am I forgetting something when using CUDA arrays that slows down the performance?

Unexpected behaviour when summing with different signs

Hey,

I noticed the following today. The only difference between the correct one and the wrong one is the signs.

julia> using Tullio
julia> arr = [1 2; 1 2; 3 4; 3 4]
4×2 Array{Int64,2}:
 1  2
 1  2
 3  4
 3  4
julia> @tullio x2[i1, i2] := (arr[2i1 + 1, 2i2 + 1] + arr[2i1 + 0, 2i2 + 1] + arr[2i1 + 1, 2i2 + 0] + arr[2i1 + 0, 2i2 + 0])
1×0 Array{Int64,2}
julia> @tullio x2[i1, i2] := (arr[2i1 - 1, 2i2 - 1] + arr[2i1 + 0, 2i2 - 1] + arr[2i1 - 1, 2i2 + 0] + arr[2i1 + 0, 2i2 + 0])
2×1 Array{Int64,2}:
  6
 14

I didn't expect this because your downsample example is also formulated in a "+" way.

Thanks,

Felix

Edit: I think I understand it now. The range of each index is the intersection of the ranges allowed by every term, so on the second axis (length 2) the constraints from 2i2 + 1 and 2i2 + 0 have an empty intersection, which gives the 1×0 result.
So is that the intended behaviour?

Issues with offset indices and GPU

Trying to implement a slightly weird "half-convolution" as follows:

using NNlib, Tullio

_tullio_conv(u, v) = @tullio y[i+_] := u[i+a] * v[a]


function tullio_conv(u::AbstractVector, v::AbstractVector)
    # Left pad
    upad = NNlib.pad_zeros(u, (length(v) - 1, 0))
    v_rev = @view v[end:-1:1]

    return _tullio_conv(upad, v_rev)
end

which results in "KernelAbstractions can't handle OffsetArrays here" when running tullio_conv(a, b) with a and b being CuArrays of the same size. It works fine on the CPU though.

Not sure if this is Tullio.jl's or KernelAbstractions.jl's "fault".

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

Macro hygiene issue `ndims`

Observe this initially baffling bit of code:

julia> using Tullio

julia> let s = [1,2,3,4]
           @tullio x[i] := s[i]
       end
4-element Array{Int64,1}:
 1
 2
 3
 4

julia> let s = [1,2,3,4], ndims=maximum(s)
           @tullio x[i] := s[i]
       end
ERROR: MethodError: objects of type Int64 are not callable
Stacktrace:
 [1] top-level scope at /home/mason/.julia/packages/Tullio/Q2RDS/src/macro.jl:970
 [2] top-level scope at REPL[3]:2

I think somewhere in the macro expansion you need to interpolate the Tullio.ndims function; otherwise a user having a variable called ndims causes trouble.
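
For illustration only (a toy sketch, not Tullio's actual code): the problem appears when a macro escapes an expression that mentions ndims by name, and the fix is to splice in the function object (or a fully qualified name) so a caller's local binding cannot shadow it.

macro nd_bad(A)
    esc(:(ndims($A)))            # `ndims` is looked up in the caller's scope
end

macro nd_good(A)
    :($(Base.ndims)($(esc(A))))  # splice the function object itself; shadow-proof
end

let s = [1,2,3,4], ndims = maximum(s)
    # @nd_bad s   # ERROR: objects of type Int64 are not callable
    @nd_good s    # returns 1
end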

Error in scalar reduction on the GPU

I hope I'm not doing it wrong but I've been playing around with Tullio.jl (super neat package!) and I can't seem to get this simple expression to work on the GPU:

julia> using CUDA, KernelAbstractions, Tullio

julia> A = rand(5, 5) |> CuArray;

julia> B = rand(5, 5) |> CuArray;

julia> @tullio C = A[i, j] / 2 + B[i, j] / 3

produces this stacktrace

ERROR: MethodError: Cannot `convert` an object of type Nothing to an object of type Float64
Closest candidates are:
  convert(::Type{T}, ::T) where T<:Number at number.jl:6
  convert(::Type{T}, ::Number) where T<:Number at number.jl:7
  convert(::Type{T}, ::Base.TwicePrecision) where T<:Number at twiceprecision.jl:250
  ...
Stacktrace:
 [1] thread_scalar at /home/alir/.julia/packages/Tullio/bgqFi/src/threads.jl:237 [inlined]
 [2] (::var"#ℳ𝒶𝓀ℯ#10"{var"#𝒜𝒸𝓉!#1"})(::CuArray{Float64,2}, ::CuArray{Float64,2}) at /home/alir/.julia/packages/Tullio/bgqFi/src/macro.jl:805
 [3] (::Tullio.Eval{var"#ℳ𝒶𝓀ℯ#10"{var"#𝒜𝒸𝓉!#1"},var"#38#∇ℳ𝒶𝓀ℯ#9"{var"#∇𝒜𝒸𝓉!#5"}})(::CuArray{Float64,2}, ::Vararg{CuArray{Float64,2},N} where N) at /home/alir/.julia/packages/Tullio/bgqFi/src/eval.jl:20
 [4] top-level scope at /home/alir/.julia/packages/Tullio/bgqFi/src/macro.jl:975

with

Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, cascadelake)

and

Status `~/.julia/environments/v1.5/Project.toml`
  [052768ef] CUDA v2.4.1
  [63c18a36] KernelAbstractions v0.5.3
  [bc48ee85] Tullio v0.2.12
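
A possible workaround until the scalar GPU reduction works (an editor's sketch, not from the report): produce an array with @tullio and reduce it outside the macro, or just broadcast for this particular expression.

@tullio C2[i, j] := A[i, j] / 2 + B[i, j] / 3   # array output instead of a scalar
C = sum(C2)

C ≈ sum(A ./ 2 .+ B ./ 3)   # same value via plain broadcasting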

Force avx=false when using LHS scattering, unless the index is checked to be unique

using Tullio
using LoopVectorization
J = [1,3,1,1]
A = [i^2 for i in 1:10]

Example 1 (wrong):

@tullio B[J[i],k] := A[k]
@tullio B[J[i],k] += A[k]
# 3×10 Array{Int64,2}:
#  1  4  9  16  25  36  49  64  81  100
#  0  0  0   0   0   0   0   0   0    0
#  1  4  9  16  25  36  49  64  81  100

Example 2 (right):

@tullio B[J[i],k] := A[k]
@tullio avx=false B[J[i],k] += A[k]
# 3×10 Array{Int64,2}:
#  4  16  36  64  100  144  196  256  324  400
#  0   0   0   0    0    0    0    0    0    0
#  2   8  18  32   50   72   98  128  162  200

As @avx can't warn the user about every such misuse, maybe the default setting should be avx = false.

Gradient with Zygote and Identity

Hey,

I discovered a strange behaviour:

using Zygote
using Tullio

x = ones((5,5))

function f_working(arr)
    @tullio res = sqrt(arr[i, j])
    return res
end

function f_not_working(arr)
    @tullio res = identity(arr[i, j])
    return res
end

Calling the second function with Zygote:

julia> gradient(f_not_working, ones((3,3)))
ERROR: no gradient definition here!
Stacktrace:
 [1] (::Tullio.var"#217#218"{Tullio.Eval{var"#ℳ𝒶𝓀ℯ#9"{var"#𝒜𝒸𝓉!#8"},Nothing},Tuple{Array{Float64,2}},Array{Float64,1}})(::FillArrays.Fill{Float64,1,Tuple{Base.OneTo{Int64}}}) at <path>/.julia/packages/Tullio/bnAxO/src/grad/zygote.jl:7
 [2] (::Tullio.var"#195#back#219"{Tullio.var"#217#218"{Tullio.Eval{var"#ℳ𝒶𝓀ℯ#9"{var"#𝒜𝒸𝓉!#8"},Nothing},Tuple{Array{Float64,2}},Array{Float64,1}}})(::FillArrays.Fill{Float64,1,Tuple{Base.OneTo{Int64}}}) at <path>/.julia/packages/ZygoteRules/6nssF/src/adjoint.jl:49
 [3] f_not_working at ./REPL[5]:3 [inlined]
 [4] (::typeof(∂(f_not_working)))(::Float64) at <path>/.julia/packages/Zygote/z3bQd/src/compiler/interface2.jl:0
 [5] (::Zygote.var"#37#38"{typeof(∂(f_not_working))})(::Float64) at <path>/.julia/packages/Zygote/z3bQd/src/compiler/interface.jl:45
 [6] gradient(::Function, ::Array{Float64,2}) at <path>/.julia/packages/Zygote/z3bQd/src/compiler/interface.jl:54
 [7] top-level scope at REPL[6]:1

Pure Zygote is able to take the derivative of identity, so it must somehow be due to the interplay of Tullio and Zygote.

Thanks,

Felix

Edit:
I think it also occurs in a more general case:

foo = sqrt

function f_not_working(arr)
    @tullio res = foo(arr[i, j])
    return res
end
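
One thing to try (a sketch based on the README's grad=Dual option, which differentiates the right-hand side with ForwardDiff instead of the symbolic DiffRules path; whether it covers this exact case is not confirmed here, and f_dual is a hypothetical name):

using Tullio, Zygote, ForwardDiff

function f_dual(arr)
    @tullio grad=Dual res = identity(arr[i, j])
    return res
end

gradient(f_dual, ones((3,3)))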

The initialization for a user-defined reduction function is always zero(T)

This leads to:

a = fill(100,4,4)
@tullio avx = false (&) b[i] := sin(a[i,j])+2 >= 0 

4-element Array{Bool,1}:
 0 # should be 1
 0 # should be 1
 0 # should be 1
 0 # should be 1

Also, sometimes the user wants to keep the initial value for a user-defined reduction function, as with += and *=.
Maybe a syntax like the one below would work:

@tullio (fun_for_reduction,fun_for_initialization,keep) b[i] = sin(a[i,j])+2 >= 0
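
If your version of Tullio accepts an init keyword for reductions (not certain at the time of this issue), supplying the identity element for & explicitly would be one way to get the expected answer. A sketch:

@tullio avx=false (&) b[i] := sin(a[i,j])+2 >= 0  init=true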

Adding a "Quick Start" section to the README, with a list of recommended packages to install?

Could we add a "Quick Start" section to the README, to help beginners who want to quickly get started playing around with Tullio? The section would include a list of recommended packages to use with Tullio.

E.g. maybe something like this?

Quick Start

julia> using Pkg

julia> Pkg.add(["Tullio", "LoopVectorization", "TensorOperations", "KernelAbstractions"])

julia> using Tullio, LoopVectorization, TensorOperations, KernelAbstractions

julia> matmul(A, B) = @tullio C[i,k] := A[i,j] * B[j,k]

julia> X = [1.0 2.0 3.0; 4.0 5.0 6.0]

julia> Y = [7.0 8.0; 9.0 10.0; 11.0 12.0]

julia> matmul(X, Y)
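
For reference, the expected result of matmul(X, Y) (worked out by hand, a 2×3 times a 3×2 matrix) is:

2×2 Matrix{Float64}:
  58.0   64.0
 139.0  154.0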

Are these the right packages for someone to start off with?

  • Tullio
  • LoopVectorization
  • TensorOperations
  • KernelAbstractions
