Feature Request Crystal build times can be slow, and the garbage c

Add `--no-gc` to `crystal build` about crystal HOT 20 OPEN

nobodywasishere commented on May 27, 2024 3

Add `--no-gc` to `crystal build`

from crystal.

Comments (20)

BlobCodes commented on May 27, 2024 4

You can already do this using the BDWGC environment variable GC_DONT_GC

Example:

$ GC_DONT_GC=1 time crystal build src/app.cr
1.36user 0.95system 0:01.91elapsed 121%CPU (0avgtext+0avgdata 518428maxresident)k
0inputs+9384outputs (0major+170188minor)pagefaults 0swaps

$ GC_DO_GC=1 time crystal build src/app.cr
2.18user 0.74system 0:02.52elapsed 115%CPU (0avgtext+0avgdata 399432maxresident)k
0inputs+9288outputs (0major+153022minor)pagefaults 0swaps

As you can see, in this test compilation is about 30% faster (2.52s -> 1.91s) at the cost of about 30% higher peak memory (399MB -> 518MB).
I don't know how this scales for bigger projects.
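
For anyone who wants to redo the arithmetic on their own `time` output, here is a minimal sketch. `pct_change` is a hypothetical helper name, not part of Crystal or bdwgc, and the inputs are the elapsed and maxresident figures copied from the two runs above. Measured as the change in wall-clock time the run is about 24% shorter; measured as throughput (2.52 / 1.91) it is roughly 1.3x, which is presumably where "30% faster" comes from.

```shell
# Hypothetical helper: percent change from a baseline to a new measurement,
# rounded to one decimal place.
pct_change() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (new - old) / old * 100 }'
}

pct_change 2.52 1.91       # elapsed seconds, GC on -> GC off; prints -24.2
pct_change 399432 518428   # max resident KB, GC on -> GC off; prints 29.8
```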

from crystal.

z64 commented on May 27, 2024 4

@beta-ziliani Cool! Yeah, sometime last year we found that this helps a lot when building Kagi's web server, bringing some of our devs' build times down from 5 min to 2 min (2.5 min to 1 min in my personal case).

As a hint, I did some profiling with perf and reviewed the results in hotspot, and it seems a good chunk of CPU time is spent in call stacks bottoming out in GC.add_finalizer. Specifically, this appears to be for def finalize in src/regex/pcre2.cr. This doesn't account for all of the time saved by disabling the GC completely, but it's a significant amount, saving me almost 30s. I will get the flamegraphs again, but I encourage you and others to look deeper.

from crystal.

beta-ziliani commented on May 27, 2024 3

I wonder if this shouldn't be the default, and have a --gc option instead

from crystal.

oprypin commented on May 27, 2024 3

@jwoertink Hmm, if the Crystal process exits, there should be no way for its memory use to persist. If this is Linux (or if Mac is similar), make sure you're checking "available" memory rather than "free" memory: the "free" figure in the free command is merely "wasted" memory that isn't used for anything, not even for caches, and caches can be dropped at any time if necessary.
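
A quick way to see the two figures side by side on Linux is sketched below. `meminfo_kb` is a hypothetical helper name; the `MemFree` and `MemAvailable` fields are the ones the kernel exposes in /proc/meminfo, and the optional second argument exists only so the function can be pointed at a saved copy of that file. This does not apply to macOS, which has no /proc/meminfo.

```shell
# Hypothetical helper: print one field (in kB) from a meminfo-style file.
# Defaults to /proc/meminfo, which only exists on Linux.
meminfo_kb() {
  awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
  # MemFree: memory not used for anything at all, caches included.
  echo "MemFree:      $(meminfo_kb MemFree) kB"
  # MemAvailable: kernel estimate of what new programs can claim without swapping.
  echo "MemAvailable: $(meminfo_kb MemAvailable) kB"
fi
```

If MemAvailable stays roughly constant across rebuilds while MemFree shrinks, the "lost" memory is most likely page cache rather than something the compiler kept.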

from crystal.

beta-ziliani commented on May 27, 2024 1

> @beta-ziliani Without...? I have both GC_DO_GC and GC_DONT_GC there. Did I miss something 😅

Sorry! misread it. You're compiling in release mode, that explains those huge numbers! 😓

from crystal.

beta-ziliani commented on May 27, 2024

Brilliant idea! I was using GC_INITIAL_HEAP_SIZE to give the GC more space, and that alone reduced compilation times by ~20% when building the compiler. But disabling it entirely never crossed my mind, and it does wonders.

With the compiler, GC_DONT_GC=1 makes a huge difference: roughly from 80s to 52s on a cold run and from 50s to 16s on a hot run. And memory doesn't grow beyond 4GB.

from crystal.

MistressRemilia commented on May 27, 2024

> I don't know how this scales for bigger projects.

When compiling Benben, I get these numbers on my laptop (Ryzen 3 3200U, I removed ~/.cache/crystal each time):

GC_DO_GC=1 time rake
195.47user 2.24system 3:15.96elapsed 100%CPU (0avgtext+0avgdata 1324580maxresident)k
0inputs+77256outputs (0major+563717minor)pagefaults 0swaps

GC_DONT_GC=1 time rake
199.82user 2.62system 3:20.38elapsed 101%CPU (0avgtext+0avgdata 1635376maxresident)k
0inputs+77288outputs (0major+640754minor)pagefaults 0swaps

from crystal.

beta-ziliani commented on May 27, 2024

@MistressRemilia what are the times without it?

from crystal.

oprypin commented on May 27, 2024

@z64 That's an amazing finding 🤩
I might dig into eliminating finalize from pcre2 usages. Actually, can someone add to those benchmark results what happens if you just delete def finalize from all regex source files?
Or perhaps someone could profile for the biggest regex usages.

from crystal.

z64 commented on May 27, 2024

Test:

Compiler build in debug mode for symbols
Building Kagi debug build (~70k SLOC)

Stock v1.11.0: (bottom-up flamegraph)

Attribution of 40% cycles from PCRE finalizer: (new:_source:_options points here)

With the following patch applied:

--- a/src/regex/pcre2.cr
+++ b/src/regex/pcre2.cr
@@ -254,12 +254,12 @@ module Regex::PCRE2
     end
   end

-  def finalize
-    @match_data.consume_each do |match_data|
-      LibPCRE2.match_data_free(match_data)
-    end
-    LibPCRE2.code_free @re
-  end
+  # def finalize
+  #   @match_data.consume_each do |match_data|
+  #     LibPCRE2.match_data_free(match_data)
+  #   end
+  #   LibPCRE2.code_free @re
+  # end

add_finalizer evaporates from the flamegraph

from crystal.

MistressRemilia commented on May 27, 2024

@beta-ziliani Without...? I have both GC_DO_GC and GC_DONT_GC there. Did I miss something 😅

from crystal.

BlobCodes commented on May 27, 2024

@MistressRemilia You're executing your whole rakefile without GC, which can decrease performance dramatically.

It probably:

Executes the rake process
Executes crystal to compile the rakefile
Executes the resulting binary
Runs shards install
Runs the compiler again
...

Try to only show the difference in the final crystal build command
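
One way to isolate that single step is sketched below. `compare_gc` is a hypothetical helper, not an existing tool; it runs whatever command it is given twice, once with GC_DONT_GC=1 exported and once with the variable unset (rather than relying on GC_DO_GC, which as far as I know bdwgc does not read). The timing tool is passed as part of the command, as in the examples in this thread.

```shell
# Hypothetical helper: run one command twice, with and without GC_DONT_GC=1.
# Each run happens in a subshell so the variable never leaks into the caller.
compare_gc() {
  echo "== GC_DONT_GC=1 =="
  ( GC_DONT_GC=1; export GC_DONT_GC; "$@" )
  echo "== GC_DONT_GC unset =="
  ( unset GC_DONT_GC; "$@" )
}

# Usage (assumes an external `time` binary and a src/app.cr entry point):
#   compare_gc time crystal build src/app.cr
```

Because only the final compiler invocation is wrapped, rake, shards and any other child processes stay out of the measurement.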

from crystal.

syeopite commented on May 27, 2024

Here's the result from Invidious with and without the cache:

Without cache:

$ GC_DONT_GC=1 time crystal build src/invidious.cr
25.07user 6.41system 0:24.19elapsed 130%CPU (0avgtext+0avgdata 2594384maxresident)k
0inputs+201128outputs (0major+748465minor)pagefaults 0swaps

$ GC_DO_GC=1 time crystal build src/invidious.cr
23.71user 3.69system 0:21.76elapsed 125%CPU (0avgtext+0avgdata 1404500maxresident)k
0inputs+169472outputs (0major+811067minor)pagefaults 0swaps

With cache:

$ GC_DONT_GC=1 time crystal build src/invidious.cr
2.73user 1.98system 0:04.28elapsed 110%CPU (0avgtext+0avgdata 1544100maxresident)k
0inputs+36840outputs (0major+316309minor)pagefaults 0swaps

$ GC_DO_GC=1 time crystal build src/invidious.cr
4.35user 2.09system 0:05.96elapsed 108%CPU (0avgtext+0avgdata 1235280maxresident)k
0inputs+36840outputs (0major+421919minor)pagefaults 0swaps

Compiled with an AMD Ryzen 9 7950X

from crystal.

MistressRemilia commented on May 27, 2024

@BlobCodes Crystal isn't compiling the Rakefile, that's a Ruby script. The way I have it written, it would first check that shard.yml and shard.lock weren't updated (which they weren't, and all the shards had been installed), and then do a shards build to make the bin/benben binary.

Here are numbers from calling the compiler directly on the same machine, using the same command I have in my Rakefile for a release binary. They're pretty much the same numbers:

GC_DONT_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
191.92user 2.38system 3:12.39elapsed 100%CPU (0avgtext+0avgdata 1635320maxresident)k
0inputs+61176outputs (0major+632918minor)pagefaults 0swaps

GC_DO_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
201.37user 2.12system 3:21.63elapsed 100%CPU (0avgtext+0avgdata 1324228maxresident)k
0inputs+61176outputs (0major+558645minor)pagefaults 0swaps

And without removing the cache:

GC_DONT_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
193.09user 2.45system 3:15.98elapsed 99%CPU (0avgtext+0avgdata 1635816maxresident)k
0inputs+47712outputs (0major+607148minor)pagefaults 0swaps

GC_DO_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
189.86user 2.07system 3:12.29elapsed 99%CPU (0avgtext+0avgdata 1317728maxresident)k
0inputs+47712outputs (0major+528796minor)pagefaults 0swaps

Without the cache on my desktop (Core i9-10850K):

GC_DO_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
91.75user 1.22system 1:32.49elapsed 100%CPU (0avgtext+0avgdata 1324520maxresident)k
384inputs+61176outputs (0major+560001minor)pagefaults 0swaps

GC_DONT_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
90.76user 1.48system 1:31.66elapsed 100%CPU (0avgtext+0avgdata 1635224maxresident)k
384inputs+61176outputs (0major+634256minor)pagefaults 0swaps

And with the cache:

GC_DO_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
90.32user 1.18system 1:31.65elapsed 99%CPU (0avgtext+0avgdata 1324388maxresident)k
384inputs+47712outputs (0major+534157minor)pagefaults 0swaps

GC_DONT_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
89.63user 1.41system 1:31.15elapsed 99%CPU (0avgtext+0avgdata 1635600maxresident)k
384inputs+47712outputs (0major+608612minor)pagefaults 0swaps

from crystal.

BlobCodes commented on May 27, 2024

@MistressRemilia

Note that LLVM is of course not garbage collected.
Only the crystal compiler itself will benefit from not garbage collecting.
To evaluate this feature request, it makes no sense to compile with --release, because then 90% of the time is spent in LLVM; even if the Crystal code finished instantly, you'd only see a small improvement in performance.
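
To put a number on "small", here is a back-of-the-envelope sketch using Amdahl's law. The 10% share for the Crystal side is just the complement of the 90% estimate above, not a measurement, and `amdahl` is a hypothetical helper name.

```shell
# Hypothetical helper: overall speedup when a fraction p of the run time
# is made s times faster and the rest is unchanged (Amdahl's law).
amdahl() {
  awk -v p="$1" -v s="$2" 'BEGIN { printf "%.2f\n", 1 / ((1 - p) + p / s) }'
}

amdahl 0.10 2         # Crystal side twice as fast; prints 1.05
amdahl 0.10 1000000   # Crystal side effectively free; prints 1.11
```

So under that assumption a --release build cannot get more than about 11% faster from GC changes alone, which is within the noise of the numbers posted above.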

from crystal.

MistressRemilia commented on May 27, 2024

Even without --release I get at best a 1 second difference, cache or not.

from crystal.

oprypin commented on May 27, 2024

I made a patch that avoids using finalizers for libpcre2 and instead wires it up to normal GC.
oprypin@058d581

Unfortunately I am not observing any run time difference between these when using a release compiler to build a non-release compiler. (So disregard the commit message, it was hopeful 😅)

from crystal.

oprypin commented on May 27, 2024

Actually I am also not observing any difference if I just remove the def finalize from libpcre2 like #14223 (comment). Maybe I'm missing something and someone else will have better luck with benchmarking that patch.

from crystal.

z64 commented on May 27, 2024

> Actually I am also not observing any difference if I just remove the def finalize from libpcre2 like #14223 (comment). Maybe I'm missing something and someone else will have better luck with benchmarking that patch.

It's entirely possible that our codebase has something the compiler is quite unhappy about WRT regex. I can't tell from hotspot, though, and I can't follow the callers to any specific cause, so I will probably have to skim through the IR. (Perhaps it's just too many literals, though I don't think we have that many really. We heavily discourage their use in favor of regular string ops, which are usually faster.)

from crystal.

jwoertink commented on May 27, 2024

Adding this flag has been amazing for development for me. It has sped up the development compilation time enough that my Lucky app now compiles faster than the javascript build! (which takes over a minute itself...)

However, I've noticed that as I save files and Lucky recompiles the app, I lose memory over the course of the day. For example, I'll start with 30 GB free, save and recompile about 10 times, and then have 28 GB free. By the end of the day I'm sitting at 2 GB free. I guess this is to be expected if garbage collection isn't happening on each rebuild, but I figured I would at least report my findings.

from crystal.
