
Comments (8)

GordonBGood commented on May 30, 2024

@lukewilliamboswell,

Have you run this with roc build --optimize to ensure it is optimised using LLVM to run fast?

Yes, the run times quoted in the OP were all from optimized builds; without optimization it is about four times slower than this.

I assumed that Roc must be using LLVM for the optimized builds, although the documentation doesn't list LLVM as a prerequisite. Fedora 39 currently ships LLVM 17 as its distro standard, so that is what is installed.


GordonBGood commented on May 30, 2024

Adding another comparison: F#/Fable (an almost pure FP language, apart from optional mutability of bindings, always-mutable array contents, and uncontrolled IO side effects) can transpile F# code to Rust. The following code implements the same benchmark as the OP:

let cLOOPS = 1000

let primesBench() =
  let cmpsts = Array.zeroCreate 16384
  let rec loopn n =
    if n < cLOOPS then
      let rec loopi i =
        let s0 = (i + i) * (i + 3) + 3
        if s0 < 131072 then
          if cmpsts.[i >>> 3] &&& (1uy <<< (i &&& 7)) <> 0uy then loopi (i + 1) else
          let bp = i + i + 3
          let slmt = s0 + 8 * bp
          let rec loops s =
            if s < slmt then
              let msk = 1uy <<< (s &&& 7)
              let rec loopc c =
                if c < 16384 then
                  cmpsts.[c] <- cmpsts.[c] ||| msk
                  loopc (c + bp)
              loopc (s >>> 3); loops (s + bp)
          loops s0; loopi (i + 1)
      loopi 0; loopn (n + 1)      
  loopn 0
  {0 .. 131071} 
    |> Seq.fold
      (fun a i -> a - ((int cmpsts.[i >>> 3] >>> (i &&& 7)) &&& 1))
      131073

[<EntryPoint>]
let main args = 
    printfn "PrimesBench in F#"
    let start = System.DateTime.Now.Ticks
    let answr = primesBench()
    let elpsd = (System.DateTime.Now.Ticks - start) / 10000L
    printfn "Found %d primes to 262146 for %d loops in %d milliseconds." answr cLOOPS elpsd
    0

On the same computer as in the OP, this has the following run times:

  1. Compiled to JavaScript (optimized): 197 milliseconds.
  2. Compiled to DotNet IR (Release mode): 213 milliseconds.
  3. Compiled to Rust (release/optimized): 169 milliseconds.

The above Rust results, a little faster than running as JavaScript on Node.js or as DotNet 7.0 IR, are perhaps a better comparison to Roc: the generated Rust performs an automatic bounds check per array access, which is one of the main reasons it is slower than the C version. It also reloads the base array pointer on every inner culling loop rather than "lifting" it outside the loop as it should. Using Rust's unsafe mutable pointers is not currently an option from Fable, as it doesn't support the interop package that would allow that.
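For reference, the sieve's culling loop can be sketched in safe, hand-written Rust (a simplified port for illustration, not Fable's actual output, and without the F# version's per-byte mask optimization); every `cmpsts[...]` access here goes through bounds-checked indexing unless the optimizer can prove it away:

```rust
// Sketch of the benchmark's sieve in safe Rust: one bit per odd
// number 3..=262145, packed into 16384 bytes.
fn primes_bench() -> usize {
    let mut cmpsts = vec![0u8; 16384];
    let mut i = 0usize;
    loop {
        let s0 = (i + i) * (i + 3) + 3; // bit index of bp * bp, bp = 2i + 3
        if s0 >= 131072 {
            break;
        }
        if cmpsts[i >> 3] & (1u8 << (i & 7)) == 0 {
            let bp = i + i + 3;
            let mut s = s0;
            while s < 131072 {
                cmpsts[s >> 3] |= 1u8 << (s & 7); // bounds-checked each pass
                s += bp;
            }
        }
        i += 1;
    }
    // 131072 odd candidates plus the prime 2, minus one per composite bit
    (0..131072).fold(131073usize, |a, j| {
        a - ((cmpsts[j >> 3] >> (j & 7)) & 1) as usize
    })
}

fn main() {
    println!("Found {} primes to 262146.", primes_bench());
}
```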


lukewilliamboswell commented on May 30, 2024

Have you run this with roc build --optimize to ensure it is optimised using LLVM to run fast? On Linux I think the dev backend is currently the default, and its code generation is very much not optimised.

I would suggest you share this analysis on zulip also, it looks really interesting. I always enjoy the resulting discussion on performance topics. 😃


bhansconnect commented on May 30, 2024

I am gonna make a guess that this has nothing to do with bit twiddling. Probably an issue with closures, or some sort of recursion doing work it shouldn't.

Also, might just be a failure of opportunistic mutation.


bhansconnect commented on May 30, 2024

Can you pull the various loop functions into top level functions instead of nested functions and then test it? I am curious how that plays out.


GordonBGood commented on May 30, 2024

@bhansconnect,

I am gonna make a guess that this has nothing to do with bit twiddling.

Yes, I agree that it likely doesn't have to do with bit twiddling itself; it's just that bit twiddling takes so little time that it really exposes the overheads.

Probably an issue with closures or some sort of recursion doing work it shouldn't.

Well, all the work is done in the innermost tiny loop closure, and that is barely a closure, only capturing the update function from outside its scope. I tried passing that update function in as an argument so the cull function isn't a closure, but saw no improvement...

Also, might just be a failure of opportunistic mutation.

Well, that would be your issue then, as the benchmark is constructed in a way that should produce in-place mutation. I don't think it can be copying the whole array on every mutation, although I suppose it could be copying the array more often than it should (which should be never)...
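The opportunistic-mutation idea can be sketched in Rust terms with `Rc::make_mut`, which mutates in place when the refcount is 1 and clones first when it isn't (an analogy for illustration, not literally how Roc implements it):

```rust
use std::rc::Rc;

// Refcount-guarded "opportunistic mutation": if `list` is the unique
// owner, make_mut hands back the existing Vec to mutate in place;
// if it's shared, make_mut clones the whole Vec first.
fn update(mut list: Rc<Vec<u8>>, idx: usize, val: u8) -> Rc<Vec<u8>> {
    Rc::make_mut(&mut list)[idx] = val;
    list
}

fn main() {
    let a = Rc::new(vec![0u8; 4]);
    let a = update(a, 0, 7); // refcount 1: mutated in place, no copy
    let alias = Rc::clone(&a); // now shared
    let b = update(Rc::clone(&a), 1, 9); // shared: forced to copy
    println!("{:?} {:?}", alias, b); // alias is unchanged at index 1
}
```

If the refcount is ever higher than it should be, every update takes the copying path, which would explain a whole-array copy per mutation.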

If I remove the update function, it is fast although the answer is incorrect (of course)...

Is there an easy way to view the assembly emitted by LLVM?


GordonBGood commented on May 30, 2024

@bhansconnect,

Can you pull the various loop functions into top level functions instead of nested functions and then test it?

There isn't much chance that is the problem, as it is pretty fast when I just remove the List.update from the inner "loop". But just to be sure, I added bp as an argument to really make it not a closure and moved it outside the benchmark function, with no change; I even added a type signature, to no effect...


bhansconnect commented on May 30, 2024

Adding this to the issue. I think I have figured out what is going on.


Ok I have been digging into the IR and here are the two main problems from what I can see:

  1. List.get is a regular function, not a low-level one (and we don't have borrow inference). As such, before every call to the function we increment the refcount, and in the function we decrement it (checking whether we need to deallocate each time).
  2. The same root cause as #6213: having allocas in locations other than the entry block stops LLVM from being able to inline List.get. LLVM is smart enough to realize that inlining a function with allocas in the wrong place could lead to a stack-space explosion, so it refuses to inline the function until the allocas are fixed.

So I did some testing to verify. First, I updated and used my borrowing-hacks branch to avoid the first issue. Then I manually edited the LLVM IR to fix the second issue.
The result: Found 23000 primes to 262146 in 103 milliseconds.

Note: both changes are needed for the perf.

No changes: 1.5s
Just borrowing-hacks: 1.2s
Just alloca fix: 1.0s
both: 100ms
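Issue 1 above is roughly analogous, in Rust terms, to passing the list by ownership on every element access instead of borrowing it (an illustrative sketch, not Roc's actual codegen):

```rust
use std::rc::Rc;

// Owned-argument style: every call clones the Rc (refcount increment)
// and drops it on return (decrement plus an "is it time to
// deallocate?" check) -- refcount traffic on every single access.
fn get_owned(list: Rc<Vec<u8>>, i: usize) -> u8 {
    list[i]
}

// Borrowed style: a borrow is just a pointer, no refcount operations.
fn get_borrowed(list: &[u8], i: usize) -> u8 {
    list[i]
}

fn main() {
    let list = Rc::new(vec![1u8, 2, 3]);
    let slow: u32 = (0..3).map(|i| get_owned(Rc::clone(&list), i) as u32).sum();
    let fast: u32 = (0..3).map(|i| get_borrowed(&list, i) as u32).sum();
    println!("{slow} {fast}");
}
```

With borrow inference (or a low-level List.get), the hot culling loop would take the borrowed path and skip the refcount churn entirely.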

