Comments (20)
You can already do this using the BDWGC environment variable GC_DONT_GC
Example:
$ GC_DONT_GC=1 time crystal build src/app.cr
1.36user 0.95system 0:01.91elapsed 121%CPU (0avgtext+0avgdata 518428maxresident)k
0inputs+9384outputs (0major+170188minor)pagefaults 0swaps
$ GC_DO_GC=1 time crystal build src/app.cr
2.18user 0.74system 0:02.52elapsed 115%CPU (0avgtext+0avgdata 399432maxresident)k
0inputs+9288outputs (0major+153022minor)pagefaults 0swaps
As you can see, in this test I get roughly 30% faster compilation (2.52s → 1.91s) at the cost of roughly 30% higher peak memory (399MB → 518MB).
I don't know how this scales for bigger projects.
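For quick A/B comparisons like the one above, the toggle can be wrapped in a small helper. A minimal sketch, assuming a bash-like shell; `compare_gc` is a made-up name, and you pass your real build command as arguments:

```shell
# compare_gc: run the same command twice, first with the Boehm GC enabled
# (GC_DONT_GC unset) and then with it disabled, timing each run.
compare_gc() {
  echo "== baseline (GC enabled) =="
  ( unset GC_DONT_GC; time "$@" )
  echo "== GC disabled =="
  ( export GC_DONT_GC=1; time "$@" )
}

# Example (the invocation from the comment above):
# compare_gc crystal build src/app.cr
```

`time` here is the shell keyword; substitute `/usr/bin/time -v` if you also want peak-RSS numbers like the ones quoted above.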
@beta-ziliani Cool! Yeah, sometime last year we found that this helps majorly in building Kagi's web server, bringing some of our devs' times down from 5 min to 2 min (2.5 min to 1 min in my personal case).
As a hint, I did some profiling with perf and reviewed it in Hotspot, and a good chunk of CPU time is spent in call stacks bottoming out in GC.add_finalizer. Specifically, this appears to be for def finalize in src/regex/pcre2.cr. This doesn't account for all of the time saved by disabling GC completely, but it's a significant amount, saving me almost 30s. I will get the flamegraphs again, but I encourage you / others to look deeper.
I wonder if this shouldn't be the default, with a --gc option instead.
@jwoertink Hmm, if the Crystal process exits, there should be no way for the memory use to persist. If this is Linux (or if macOS is similar), make sure you're checking "available" memory rather than "free": the "free" metric reported by the free command counts only completely unused memory, excluding caches that can be dropped at any time if needed.
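On Linux the distinction can be seen directly in /proc/meminfo, which is where `free` gets its numbers: MemFree counts only completely unused pages, while MemAvailable estimates what could actually be handed to new processes (reclaimable caches included). A small sketch (`show_mem` is a hypothetical helper name):

```shell
# Print both metrics (values are in kB). MemAvailable reflects memory actually
# available to new processes; MemFree excludes droppable caches.
show_mem() {
  if [ -r /proc/meminfo ]; then
    grep -E '^(MemFree|MemAvailable):' /proc/meminfo
  else
    echo "no /proc/meminfo (not Linux)"
  fi
}
```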
@beta-ziliani Without...? I have both GC_DO_GC and GC_DONT_GC there. Did I miss something 😅
Sorry! I misread it. You're compiling in release mode; that explains those huge numbers! 😓
Brilliant idea! I was using GC_INITIAL_HEAP_SIZE to give the GC some headroom, and that alone reduced compilation times by ~20% on the compiler. But disabling it entirely never crossed my mind, and it does wonders.
With the compiler, GC_DONT_GC=1 makes a huge difference: roughly from 80s to 52s on a cold run and from 50s to 16s on a hot run. And memory doesn't grow beyond 4GB.
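GC_INITIAL_HEAP_SIZE is read by bdwgc at process startup, so it only needs to be set in the environment of the build command itself. A sketch of a wrapper; `with_big_heap` and `HEAP_SIZE` are made-up names, 4G is an arbitrary example size, and size-suffix (k/M/G) support depends on your bdwgc version:

```shell
# Run any command with a large initial GC heap so bdwgc collects less often.
# Override the default with HEAP_SIZE=... in the environment.
with_big_heap() {
  GC_INITIAL_HEAP_SIZE="${HEAP_SIZE:-4G}" "$@"
}

# Example:
# with_big_heap crystal build src/compiler/crystal.cr
```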
> I don't know how this scales for bigger projects.
When compiling Benben, I get these numbers on my laptop (Ryzen 3 3200U, I removed ~/.cache/crystal each time):
GC_DO_GC=1 time rake
195.47user 2.24system 3:15.96elapsed 100%CPU (0avgtext+0avgdata 1324580maxresident)k
0inputs+77256outputs (0major+563717minor)pagefaults 0swaps
GC_DONT_GC=1 time rake
199.82user 2.62system 3:20.38elapsed 101%CPU (0avgtext+0avgdata 1635376maxresident)k
0inputs+77288outputs (0major+640754minor)pagefaults 0swaps
@MistressRemilia what are the times without it?
@z64 That's an amazing finding 🤩
I might dig into eliminating finalize from pcre2 usages. Actually, can someone add to those benchmark results what happens if you just delete def finalize from all regex source files? Or perhaps someone could profile for the biggest regex usages.
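For anyone who wants to try the delete-the-finalizer benchmark without hand-editing, the block can be stripped mechanically. A crude sketch using GNU sed; `strip_finalize` is a hypothetical helper, and it assumes the stdlib's two-space indentation:

```shell
# Delete a two-space-indented `def finalize ... end` block in place.
strip_finalize() {
  sed -i '/^  def finalize$/,/^  end$/d' "$1"
}

# Example (run from a crystal source checkout; path is illustrative):
# strip_finalize src/regex/pcre2.cr
```

Verify the result still compiles; sed will happily eat the wrong `end` if the indentation differs.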
Test setup:
- Compiler built in debug mode for symbols
- Building the Kagi debug build (~70k SLOC)
Stock v1.11.0 (bottom-up flamegraph): 40% of cycles attributed to the PCRE finalizer (new:_source:_options points here).
With the following patch applied:
--- a/src/regex/pcre2.cr
+++ b/src/regex/pcre2.cr
@@ -254,12 +254,12 @@ module Regex::PCRE2
end
end
- def finalize
- @match_data.consume_each do |match_data|
- LibPCRE2.match_data_free(match_data)
- end
- LibPCRE2.code_free @re
- end
+ # def finalize
+ # @match_data.consume_each do |match_data|
+ # LibPCRE2.match_data_free(match_data)
+ # end
+ # LibPCRE2.code_free @re
+ # end
add_finalizer evaporates from the flamegraph.
@MistressRemilia You're executing your whole rakefile without GC, which can decrease performance dramatically.
It probably:
- Executes the rake process
- Executes crystal to compile the rakefile
- Executes the resulting binary
- Runs shards install
- Runs the compiler again
- ...
Try to measure only the difference in the final crystal build command.
Here's the result from Invidious with and without the cache:
Without cache:
$ GC_DONT_GC=1 time crystal build src/invidious.cr
25.07user 6.41system 0:24.19elapsed 130%CPU (0avgtext+0avgdata 2594384maxresident)k
0inputs+201128outputs (0major+748465minor)pagefaults 0swaps
$ GC_DO_GC=1 time crystal build src/invidious.cr
23.71user 3.69system 0:21.76elapsed 125%CPU (0avgtext+0avgdata 1404500maxresident)k
0inputs+169472outputs (0major+811067minor)pagefaults 0swaps
With cache:
$ GC_DONT_GC=1 time crystal build src/invidious.cr
2.73user 1.98system 0:04.28elapsed 110%CPU (0avgtext+0avgdata 1544100maxresident)k
0inputs+36840outputs (0major+316309minor)pagefaults 0swaps
$ GC_DO_GC=1 time crystal build src/invidious.cr
4.35user 2.09system 0:05.96elapsed 108%CPU (0avgtext+0avgdata 1235280maxresident)k
0inputs+36840outputs (0major+421919minor)pagefaults 0swaps
Compiled with an AMD Ryzen 9 7950X
@BlobCodes Crystal isn't compiling the Rakefile; that's a Ruby script. The way I have it written, it first checks that shard.yml and shard.lock weren't updated (they weren't, and all the shards had been installed), and then does a shards build to make the bin/benben binary.
Here are numbers from calling the compiler directly on the same machine, using the same command I have in my Rakefile for a release binary; they're pretty much the same numbers:
GC_DONT_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
191.92user 2.38system 3:12.39elapsed 100%CPU (0avgtext+0avgdata 1635320maxresident)k
0inputs+61176outputs (0major+632918minor)pagefaults 0swaps
GC_DO_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
201.37user 2.12system 3:21.63elapsed 100%CPU (0avgtext+0avgdata 1324228maxresident)k
0inputs+61176outputs (0major+558645minor)pagefaults 0swaps
And without removing the cache:
GC_DONT_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
193.09user 2.45system 3:15.98elapsed 99%CPU (0avgtext+0avgdata 1635816maxresident)k
0inputs+47712outputs (0major+607148minor)pagefaults 0swaps
GC_DO_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
189.86user 2.07system 3:12.29elapsed 99%CPU (0avgtext+0avgdata 1317728maxresident)k
0inputs+47712outputs (0major+528796minor)pagefaults 0swaps
Without the cache on my desktop (Core i9-10850K):
GC_DO_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
91.75user 1.22system 1:32.49elapsed 100%CPU (0avgtext+0avgdata 1324520maxresident)k
384inputs+61176outputs (0major+560001minor)pagefaults 0swaps
GC_DONT_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
90.76user 1.48system 1:31.66elapsed 100%CPU (0avgtext+0avgdata 1635224maxresident)k
384inputs+61176outputs (0major+634256minor)pagefaults 0swaps
And with the cache:
GC_DO_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
90.32user 1.18system 1:31.65elapsed 99%CPU (0avgtext+0avgdata 1324388maxresident)k
384inputs+47712outputs (0major+534157minor)pagefaults 0swaps
GC_DONT_GC=1 time crystal build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40 -Dremiaudio_wd40 -Dremisound_wd40 src/main.cr -o bin/benben
89.63user 1.41system 1:31.15elapsed 99%CPU (0avgtext+0avgdata 1635600maxresident)k
384inputs+47712outputs (0major+608612minor)pagefaults 0swaps
Note that LLVM is, of course, not garbage collected; only the Crystal compiler itself benefits from not garbage collecting.
To evaluate this feature request, it makes no sense to compile using --release, because then ~90% of the time is spent in LLVM; even if the Crystal code finished instantly, you'd only see small improvements in performance.
Even without --release I get at best a 1 second difference, cache or not.
I made a patch that avoids using finalizers for libpcre2 and instead wires it up to normal GC.
oprypin@058d581
Unfortunately I am not observing any run time difference between these when using a release compiler to build a non-release compiler. (So disregard the commit message, it was hopeful 😅)
Actually, I am also not observing any difference if I just remove the def finalize from libpcre2, like #14223 (comment). Maybe I'm missing something and someone else will have better luck benchmarking that patch.
> Actually I am also not observing any difference if I just remove the def finalize from libpcre2 like #14223 (comment). Maybe I'm missing something and someone else will have better luck with benchmarking that patch.

It's entirely possible something in our codebase has something the compiler is quite unhappy about WRT regex. I can't tell from Hotspot, though; I can't follow the callers to any specific cause, so I'll probably have to skim through the IR. (Perhaps just too many literals, though I don't think we have that many, really. We heavily discourage their use in favor of regular string ops, which are usually faster.)
Adding this flag has been amazing for development. It has sped up compilation enough that my Lucky app now compiles faster than the JavaScript build (which takes over a minute itself...)!
However, I've noticed that as I save files and Lucky recompiles the app, I gradually lose memory over the day. For example, I'll start with 30 GB free, save and recompile about 10 times, and then have 28 GB free. By the end of the day I'm sitting at 2 GB free. I guess this is to be expected if garbage collection isn't happening on each rebuild, but I figured I'd at least report my findings.