
Comments (8)

GordonBGood commented on May 30, 2024

@lukewilliamboswell,

Have you run this with roc build --optimize to ensure it is optimised using LLVM to run fast?

Yes, the run times quoted in the OP were all from optimized builds; without optimization it is about four times slower than this.

I assumed that Roc must be using LLVM for the optimized builds, although the documentation doesn't list LLVM as a prerequisite. Fedora 39 currently ships LLVM 17 as its distro standard, so that is what is installed.


GordonBGood commented on May 30, 2024

Adding another comparison: F#/Fable (an almost pure FP language, apart from optional mutability of bindings, always-mutable array contents, and uncontrolled IO side effects) can transpile F# code to Rust. The following code implements the same benchmark as the OP:

let cLOOPS = 1000

let primesBench() =
  let cmpsts = Array.zeroCreate 16384
  let rec loopn n =
    if n < cLOOPS then
      let rec loopi i =
        let s0 = (i + i) * (i + 3) + 3
        if s0 < 131072 then
          if cmpsts.[i >>> 3] &&& (1uy <<< (i &&& 7)) <> 0uy then loopi (i + 1) else
          let bp = i + i + 3
          let slmt = s0 + 8 * bp
          let rec loops s =
            if s < slmt then
              let msk = 1uy <<< (s &&& 7)
              let rec loopc c =
                if c < 16384 then
                  cmpsts.[c] <- cmpsts.[c] ||| msk
                  loopc (c + bp)
              loopc (s >>> 3); loops (s + bp)
          loops s0; loopi (i + 1)
      loopi 0; loopn (n + 1)      
  loopn 0
  {0 .. 131071} 
    |> Seq.fold
      (fun a i -> a - ((int cmpsts.[i >>> 3] >>> (i &&& 7)) &&& 1))
      131073

[<EntryPoint>]
let main args = 
    printfn "PrimesBench in F#"
    let start = System.DateTime.Now.Ticks
    let answr = primesBench()
    let elpsd = (System.DateTime.Now.Ticks - start) / 10000L
    printfn "Found %d primes to 262146 for %d loops in %d milliseconds." answr cLOOPS elpsd
    0

On the same computer as in the OP, this has the following run times:

  1. Compiled to JavaScript (optimized): 197 milliseconds.
  2. Compiled to DotNet IR (Release mode): 213 milliseconds.
  3. Compiled to Rust (release/optimized): 169 milliseconds.

The above Rust results, a little faster than running as JavaScript on Node.js or as DotNet 7.0 IR, are perhaps a better comparison to Roc: the generated Rust performs an automatic bounds check per array access, which is one of the main reasons it is slower than the C version. It also reloads the base array pointer on every inner culling loop rather than "lifting" it outside the loop as it should. Using Rust's unsafe mutable pointers is not currently an option from Fable, as it doesn't support the interop package that would allow that.
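For reference, the sieve's culling loop can be sketched in safe, hand-written Rust (a simplified port for illustration, not Fable's actual output, and without the F# version's per-byte mask optimization); every `cmpsts[...]` access here goes through bounds-checked indexing unless the optimizer can prove it away:

```rust
// Sketch of the benchmark's sieve in safe Rust: one bit per odd
// number 3..=262145, packed into 16384 bytes.
fn primes_bench() -> usize {
    let mut cmpsts = vec![0u8; 16384];
    let mut i = 0usize;
    loop {
        let s0 = (i + i) * (i + 3) + 3; // bit index of bp * bp, bp = 2i + 3
        if s0 >= 131072 {
            break;
        }
        if cmpsts[i >> 3] & (1u8 << (i & 7)) == 0 {
            let bp = i + i + 3;
            let mut s = s0;
            while s < 131072 {
                cmpsts[s >> 3] |= 1u8 << (s & 7); // bounds-checked each pass
                s += bp;
            }
        }
        i += 1;
    }
    // 131072 odd candidates plus the prime 2, minus one per composite bit
    (0..131072).fold(131073usize, |a, j| {
        a - ((cmpsts[j >> 3] >> (j & 7)) & 1) as usize
    })
}

fn main() {
    println!("Found {} primes to 262146.", primes_bench());
}
```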


lukewilliamboswell commented on May 30, 2024

Have you run this with roc build --optimize to ensure it is optimised using LLVM to run fast? On Linux I think the dev backend is currently the default, and its code generation is very much not optimised.

I would suggest you share this analysis on zulip also, it looks really interesting. I always enjoy the resulting discussion on performance topics. 😃


bhansconnect commented on May 30, 2024

I am gonna make a guess that this has nothing to do with bit twiddling. Probably an issue with closures, or some sort of recursion doing work it shouldn't.

Also, might just be a failure of opportunistic mutation.


bhansconnect commented on May 30, 2024

Can you pull the various loop functions into top level functions instead of nested functions and then test it? I am curious how that plays out.


GordonBGood commented on May 30, 2024

@bhansconnect,

I am gonna make a guess that this has nothing to do with bit twiddling.

Yes, I agree that it likely doesn't have to do with bit twiddling itself; it's just that bit twiddling takes so little time that it really exposes the overheads.

Probably an issue with closures or some sort of recursion doing work it shouldn't.

Well, all the work is done in the innermost tiny loop closure, and that is barely a closure, only capturing the update function from outside its scope. I tried passing that update function in as an argument so the cull function isn't a closure, but saw no improvement...

Also, might just be a failure of opportunistic mutation.

Well, that would be your issue then, as the benchmark is constructed in a way that should produce in-place mutation. I don't think it can be copying the whole array on every mutation, although I suppose it could be copying the array more often than it should (which should be never)...
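The opportunistic-mutation idea can be sketched in Rust terms with `Rc::make_mut`, which mutates in place when the refcount is 1 and clones first when it isn't (an analogy for illustration, not literally how Roc implements it):

```rust
use std::rc::Rc;

// Refcount-guarded "opportunistic mutation": if `list` is the unique
// owner, make_mut hands back the existing Vec to mutate in place;
// if it's shared, make_mut clones the whole Vec first.
fn update(mut list: Rc<Vec<u8>>, idx: usize, val: u8) -> Rc<Vec<u8>> {
    Rc::make_mut(&mut list)[idx] = val;
    list
}

fn main() {
    let a = Rc::new(vec![0u8; 4]);
    let a = update(a, 0, 7); // refcount 1: mutated in place, no copy
    let alias = Rc::clone(&a); // now shared
    let b = update(Rc::clone(&a), 1, 9); // shared: forced to copy
    println!("{:?} {:?}", alias, b); // alias is unchanged at index 1
}
```

If the refcount is ever higher than it should be, every update takes the copying path, which would explain a whole-array copy per mutation.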

If I remove the update function, it is fast although the answer is incorrect (of course)...

Is there an easy way to view the assembly emitted by LLVM?


GordonBGood commented on May 30, 2024

@bhansconnect,

Can you pull the various loop functions into top level functions instead of nested functions and then test it?

There isn't much chance that is the problem, as it is pretty fast when I just remove the List.update from the inner "loop". But just to be sure, I added bp as an argument to really make it not a closure and moved it outside the benchmark function, with no change; I even added a type signature, to no effect...


bhansconnect commented on May 30, 2024

Adding this to the issue. I think I have figured out what is going on.


Ok I have been digging into the IR and here are the two main problems from what I can see:

  1. List.get is a regular function, not a low-level one (and we don't have borrow inference). As such, before every call to the function we increment the refcount, and in the function we decrement it (checking whether we need to deallocate each time).
  2. The same root cause as #6213: having allocas in locations other than the entry block stops LLVM from being able to inline List.get. LLVM is smart enough to realize that inlining a function with allocas in the wrong place could lead to a stack-space explosion, so it refuses to inline the function until the allocas are fixed.

So I did some testing to verify. First, I updated and used my borrowing-hacks branch to avoid the first issue. Then I manually edited the LLVM IR to fix the second issue.
The result: Found 23000 primes to 262146 in 103 milliseconds.

Note: both changes are needed for the perf.

No changes: 1.5s
Just borrowing-hacks: 1.2s
Just alloca fix: 1.0s
both: 100ms
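Issue 1 above is roughly analogous, in Rust terms, to passing the list by ownership on every element access instead of borrowing it (an illustrative sketch, not Roc's actual codegen):

```rust
use std::rc::Rc;

// Owned-argument style: every call clones the Rc (refcount increment)
// and drops it on return (decrement plus an "is it time to
// deallocate?" check) -- refcount traffic on every single access.
fn get_owned(list: Rc<Vec<u8>>, i: usize) -> u8 {
    list[i]
}

// Borrowed style: a borrow is just a pointer, no refcount operations.
fn get_borrowed(list: &[u8], i: usize) -> u8 {
    list[i]
}

fn main() {
    let list = Rc::new(vec![1u8, 2, 3]);
    let slow: u32 = (0..3).map(|i| get_owned(Rc::clone(&list), i) as u32).sum();
    let fast: u32 = (0..3).map(|i| get_borrowed(&list, i) as u32).sum();
    println!("{slow} {fast}");
}
```

With borrow inference (or a low-level List.get), the hot culling loop would take the borrowed path and skip the refcount churn entirely.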

