Coder Social home page Coder Social logo

Comments (20)

BlobCodes avatar BlobCodes commented on May 27, 2024 4

You can already do this using the BDWGC environment variable GC_DONT_GC

Example:

$ GC_DONT_GC=1 time crystal build src/app.cr
1.36user 0.95system 0:01.91elapsed 121%CPU (0avgtext+0avgdata 518428maxresident)k
0inputs+9384outputs (0major+170188minor)pagefaults 0swaps

$ GC_DO_GC=1 time crystal build src/app.cr
2.18user 0.74system 0:02.52elapsed 115%CPU (0avgtext+0avgdata 399432maxresident)k
0inputs+9288outputs (0major+153022minor)pagefaults 0swaps

As you can see, in this test I get about ~30% faster compilation (2.52s -> 1.91s) with ~30% increased max memory (399MB -> 518MB).
I don't know how this scales for bigger projects.

from crystal.

z64 avatar z64 commented on May 27, 2024 4

@beta-ziliani Cool! Yea sometime last year we found that this helps majorly in building Kagi's web server, bringing some of our dev's times down from 5mins -> 2mins (2.5-1min in my personal case).

As a hint, I did some profiling with perf and reviewing in hotspot, and it seems a good chunk of CPU time is specifically on calls stacks bottoming out in GC.add_finalizer. Specifically, this appears to be for def finalize in src/regex/pcre2.cr. This doesn't account for all of the time save from disabling GC completely, but its a significant amount, saving me almost 30s. I will get the flamegraphs again, but I encourage you / others to look deeper.

from crystal.

beta-ziliani avatar beta-ziliani commented on May 27, 2024 3

I wonder if this shouldn't be the default, and have a --gc option instead

from crystal.

oprypin avatar oprypin commented on May 27, 2024 3

@jwoertink Hmm if the Crystal process exits, there should be no way that the memory use persists. If this is Linux (or if Mac is similar), make sure you're checking "available" memory rather than "free". Because the "free" memory metric in the free command is merely "wasted" memory that isn't used for any caches that can be dropped any time if necessary.

from crystal.

beta-ziliani avatar beta-ziliani commented on May 27, 2024 1

@beta-ziliani Without...? I have both GC_DO_GC and GC_DONT_GC there. Did I miss something 😅

Sorry! misread it. You're compiling in release mode, that explains those huge numbers! 😓

from crystal.

beta-ziliani avatar beta-ziliani commented on May 27, 2024

Brilliant idea! I was using GC_INITIAL_HEAP_SIZE to give space GC, and that alone reduced the compilation times ~20% on the compiler. But disabling it entirely never crossed my mind, and it does wonders.

With the compiler GC_DONT_GC=1 makes a huge difference: roughly from 80s to 52s on a cold run and from 50s to 16s on a hot run. And memory doesn't grow beyond 4GB.

from crystal.

MistressRemilia avatar MistressRemilia commented on May 27, 2024

I don't know how this scales for bigger projects.

When compiling Benben, I get these numbers on my laptop (Ryzen 3 3200U, I removed ~/.cache/crystal each time):

GC_DO_GC=1 time rake
195.47user 2.24system 3:15.96elapsed 100%CPU (0avgtext+0avgdata 1324580maxresident)k
0inputs+77256outputs (0major+563717minor)pagefaults 0swaps

GC_DONT_GC=1 time rake
199.82user 2.62system 3:20.38elapsed 101%CPU (0avgtext+0avgdata 1635376maxresident)k
0inputs+77288outputs (0major+640754minor)pagefaults 0swaps

from crystal.

beta-ziliani avatar beta-ziliani commented on May 27, 2024

@MistressRemilia what are the times without it?

from crystal.

oprypin avatar oprypin commented on May 27, 2024

@z64 That's an amazing finding 🤩
I might dig into eliminating finalize from pcre2 usages. Actually can someome add to those benchmark results what happens if you just delete def finalize from all regex source files?
Or perhaps someone could profile for the biggest regex usages.

from crystal.

z64 avatar z64 commented on May 27, 2024

Test:

  • Compiler build in debug mode for symbols
  • Building Kagi debug build (~70k SLOC)

Stock v1.11.0: (bottom-up flamegraph)

image

Attribution of 40% cycles from PCRE finalizer: (new:_source:_options points here)

image

With the following patch applied:

--- a/src/regex/pcre2.cr
+++ b/src/regex/pcre2.cr
@@ -254,12 +254,12 @@ module Regex::PCRE2
     end
   end

-  def finalize
-    @match_data.consume_each do |match_data|
-      LibPCRE2.match_data_free(match_data)
-    end
-    LibPCRE2.code_free @re
-  end
+  # def finalize
+  #   @match_data.consume_each do |match_data|
+  #     LibPCRE2.match_data_free(match_data)
+  #   end
+  #   LibPCRE2.code_free @re
+  # end

image

add_finalizer evaporates from the flamegraph

from crystal.

MistressRemilia avatar MistressRemilia commented on May 27, 2024

@beta-ziliani Without...? I have both GC_DO_GC and GC_DONT_GC there. Did I miss something 😅

from crystal.

BlobCodes avatar BlobCodes commented on May 27, 2024

@MistressRemilia You're executing your whole rakefile without GC, which can decrease performance dramatically.

It probably:

  • Executes the rake process
  • Executes crystal to compile the rakefile
  • Executes the resulting binary
  • Runs shards install
  • Runs the compiler again
  • ...

Try to only show the difference in the final crystal build command

from crystal.

syeopite avatar syeopite commented on May 27, 2024

Here's the result from Invidious with and without the cache:

Without cache:

$ GC_DONT_GC=1 time crystal build src/invidious.cr
25.07user 6.41system 0:24.19elapsed 130%CPU (0avgtext+0avgdata 2594384maxresident)k
0inputs+201128outputs (0major+748465minor)pagefaults 0swaps

$ GC_DO_GC=1 time crystal build src/invidious.cr
23.71user 3.69system 0:21.76elapsed 125%CPU (0avgtext+0avgdata 1404500maxresident)k
0inputs+169472outputs (0major+811067minor)pagefaults 0swaps

With cache:

$ GC_DONT_GC=1 time crystal build src/invidious.cr
2.73user 1.98system 0:04.28elapsed 110%CPU (0avgtext+0avgdata 1544100maxresident)k
0inputs+36840outputs (0major+316309minor)pagefaults 0swaps

$ GC_DO_GC=1 time crystal build src/invidious.cr
4.35user 2.09system 0:05.96elapsed 108%CPU (0avgtext+0avgdata 1235280maxresident)k
0inputs+36840outputs (0major+421919minor)pagefaults 0swaps

Compiled with an AMD Ryzen 9 7950X

from crystal.

MistressRemilia avatar MistressRemilia commented on May 27, 2024

@BlobCodes Crystal isn't compiling the Rakefile, that's a Ruby script. The way I have it written, it would first check that the shards.yml and shards.lock weren't updated (which they weren't, and all the shards had been installed), and then do a shards build to make the bin/benben binary.

Here are numbers calling the compiler directly on the same machine using the same command I have in my Rakefile for a release binary, they're pretty much the same numbers:

GC_DONT_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
191.92user 2.38system 3:12.39elapsed 100%CPU (0avgtext+0avgdata 1635320maxresident)k
0inputs+61176outputs (0major+632918minor)pagefaults 0swaps

GC_DO_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
201.37user 2.12system 3:21.63elapsed 100%CPU (0avgtext+0avgdata 1324228maxresident)k
0inputs+61176outputs (0major+558645minor)pagefaults 0swaps

And without removing the cache:

GC_DONT_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
193.09user 2.45system 3:15.98elapsed 99%CPU (0avgtext+0avgdata 1635816maxresident)k
0inputs+47712outputs (0major+607148minor)pagefaults 0swaps

GC_DO_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
189.86user 2.07system 3:12.29elapsed 99%CPU (0avgtext+0avgdata 1317728maxresident)k
0inputs+47712outputs (0major+528796minor)pagefaults 0swaps

Without the cache on my desktop (Core i9-10850K):

GC_DO_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
91.75user 1.22system 1:32.49elapsed 100%CPU (0avgtext+0avgdata 1324520maxresident)k
384inputs+61176outputs (0major+560001minor)pagefaults 0swaps

GC_DONT_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
90.76user 1.48system 1:31.66elapsed 100%CPU (0avgtext+0avgdata 1635224maxresident)k
384inputs+61176outputs (0major+634256minor)pagefaults 0swaps

And with the cache:

GC_DO_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
90.32user 1.18system 1:31.65elapsed 99%CPU (0avgtext+0avgdata 1324388maxresident)k
384inputs+47712outputs (0major+534157minor)pagefaults 0swaps

GC_DONT_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
89.63user 1.41system 1:31.15elapsed 99%CPU (0avgtext+0avgdata 1635600maxresident)k
384inputs+47712outputs (0major+608612minor)pagefaults 0swaps

from crystal.

BlobCodes avatar BlobCodes commented on May 27, 2024

@MistressRemilia

Note that LLVM is of course not garbage collected.
Only the crystal compiler itself will benefit from not garbage collecting.
To evaluate this feature request, it makes no sense to compile using --release (because then 90% of the time is spent in llvm; even if the crystal code would finish instantly, you'd only see small improvements in performance).

from crystal.

MistressRemilia avatar MistressRemilia commented on May 27, 2024

Even without --release I get at best a 1 second difference, cache or not.

from crystal.

oprypin avatar oprypin commented on May 27, 2024

I made a patch that avoids using finalizers for libpcre2 and instead wires it up to normal GC.
oprypin@058d581

Unfortunately I am not observing any run time difference between these when using a release compiler to build a non-release compiler. (So disregard the commit message, it was hopeful 😅)

from crystal.

oprypin avatar oprypin commented on May 27, 2024

Actually I am also not observing any difference if I just remove the def finalize from libpcre2 like #14223 (comment). Maybe I'm missing something and someone else will have better luck with benchmarking that patch.

from crystal.

z64 avatar z64 commented on May 27, 2024

Actually I am also not observing any difference if I just remove the def finalize from libpcre2 like #14223 (comment). Maybe I'm missing something and someone else will have better luck with benchmarking that patch.

Entirely possible something in our codebase has something the compiler is quite unhappy about WRT regex. I can't tell though from hotspot, can't follow the callers to any specific cause - will probably have to skim through IR. (Perhaps just too many literals, though I don't think we have that many really. We heavily discourage the use over regular string ops that are usually faster)

from crystal.

jwoertink avatar jwoertink commented on May 27, 2024

Adding this flag has been amazing for development for me. It has sped up the development compilation time enough that my Lucky app now compiles faster than the javascript build! (which takes over a minute itself...)

However, I've noticed that as I save files and Lucky re-compiles the app, over the day I start losing memory. For example, I'll start with 30gb free, save and recompile about 10 times, then I'll have 28gb free. By the end of the day I'm sitting at 2gb free. I guess this is to be expected if garbage collecting isn't happening on each rebuild, but I figured I would at least report my findings.

from crystal.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.