1BRC: One Billion Row Challenge

WIP: Learning high-performance Go by doing the One Billion Row Challenge

Find more details below about my test machines, the reference implementations, and my step-by-step solution.

My system information (Dell Inspiron 5567 with Ubuntu 22.04.1 LTS):

$ lscpu | sed -n '1p;5p;8p;11p;20,23p'
Architecture:                       x86_64
CPU(s):                             4
Model name:                         Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz
Thread(s) per core:                 2
L1d cache:                          64 KiB (2 instances)
L1i cache:                          64 KiB (2 instances)
L2 cache:                           512 KiB (2 instances)
L3 cache:                           4 MiB (1 instance)

$ free -g -h -t | grep Total | cut -b 17-20
17Gi

Apple M1 Pro:

$ system_profiler SPHardwareDataType | sed -n '8,9p'
      Chip: Apple M1 Pro
      Total Number of Cores: 10 (8 performance and 2 efficiency)

$ uname -m
arm64

$ sysctl -a | grep cache | sed -n '32,35p'
hw.cachelinesize: 128
hw.l1icachesize: 131072
hw.l1dcachesize: 65536
hw.l2cachesize: 4194304

$ sysctl -n hw.memsize | numfmt --to=si
35G

Reference implementations on my machine:

Execution time for 1B rows

$ time ./calculate_average_baseline.sh > /dev/null  
378,93s user 12,82s system 99% cpu 6:32,96 total

$ time ./calculate_average_thomaswue.sh > /dev/null  
0,23s user 0,17s system 0% cpu 1:02,58 total

$ time ./calculate_average_AlexanderYastrebov.sh > /dev/null  
167,48s user 15,02s system 307% cpu 59,309 total

Execution time for 1M rows

$ time ./calculate_average_baseline.sh  
1,39s user 0,30s system 149% cpu 1,132 total

$ time ./calculate_average_thomaswue.sh > /dev/null  
1,16s user 0,22s system 200% cpu 0,689 total

$ time ./calculate_average_AlexanderYastrebov.sh  
0,24s user 0,10s system 171% cpu 0,194 total

Reference implementations on M1 Pro:

Execution time for 1B rows

$ time ./calculate_average_baseline.sh > /dev/null  
206.19s user 6.63s system 100% cpu 3:32.62 total

$ time ./calculate_average_thomaswue.sh > /dev/null  
24.19s user 2.20s system 674% cpu 3.914 total

$ time ./calculate_average_AlexanderYastrebov.sh > /dev/null  
45.35s user 1.90s system 858% cpu 5.505 total

Execution time for 1M rows

$ time ./calculate_average_baseline.sh  
0.57s user 0.08s system 99% cpu 0.653 total

$ time ./calculate_average_thomaswue.sh > /dev/null  
0.69s user 0.06s system 137% cpu 0.552 total

$ time ./calculate_average_AlexanderYastrebov.sh  
0.13s user 0.03s system 137% cpu 0.115 total

For reference:

  • 1BRC Baseline: 392.96s (M1: 212.62s)
  • Thomas Wuerthinger: 62.58s (M1: 3.91s)
  • Alexander Yastrebov (Go): 59.30s (M1: 5.50s)

My solution steps:

| Step | Description | Exec. time | Improvement | Baseline imp. | Commit |
|------|-------------|------------|-------------|---------------|--------|
| 1 | Naive approach | 286s (M1: 167s) | - | - | 6bc5f94 |
| 2 | Parallel measurement processors | 243s (M1: 164s) | 1.177x (M1: 1.018x) | 1.177x (M1: 1.018x) | b652f32 |
| 3 | Batch read file lines | 167s (M1: 62s) | 1.455x (M1: 2.645x) | 1.712x (M1: 2.693x) | 66f92ce |
| 4 | Batch process lines | 168s (M1: 68s) | 0.994x (M1: 0.911x) | 1.702x (M1: 2.455x) | a9a44aa |
| 5 | Use L3 size chunks | 93s (M1: 16s) | 1.806x (M1: 4.25x) | 3.075x (M1: 10.437x) | 2668364 |
| 6 | Refactor to parallel read and process | 100s (M1: 17s) | 0.930x (M1: 0.941x) | 2.860x (M1: 9.823x) | b0fac51 |
| 7 | Optimize map allocation for processing | 97s (M1: 16s) | 1.031x (M1: 1.062x) | 2.948x (M1: 10.437x) | 6ba10e5 |
| 8 | PGO and GC tuning | 86s (M1: 15s) | 1.128x (M1: 1.066x) | 3.325x (M1: 11.133x) | 14b9f35 |

Comments for the steps:

  1. Naive approach: Sequential file read and processing using a single CPU core (see the sketch below).
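
A minimal sketch of this naive shape (names, input path, and output format are illustrative, not the exact code from the commit):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
	"strconv"
	"strings"
)

type stats struct {
	min, max, sum float64
	count         int
}

func main() {
	f, err := os.Open("measurements.txt") // assumed input path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	acc := map[string]*stats{}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// each line is "station;temperature"
		station, tempStr, _ := strings.Cut(sc.Text(), ";")
		temp, _ := strconv.ParseFloat(tempStr, 64)
		s, ok := acc[station]
		if !ok {
			s = &stats{min: temp, max: temp}
			acc[station] = s
		}
		if temp < s.min {
			s.min = temp
		}
		if temp > s.max {
			s.max = temp
		}
		s.sum += temp
		s.count++
	}

	// print the stations alphabetically as min/mean/max
	names := make([]string, 0, len(acc))
	for name := range acc {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		s := acc[name]
		fmt.Printf("%s=%.1f/%.1f/%.1f\n", name, s.min, s.sum/float64(s.count), s.max)
	}
}
```
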
  2. Parallel measurement processors: Sequential file read with multiple parallel measurement processors. The processors are sharded, so one station's measurements are always handled by the same processor (see the sketch below). The results are merged and printed at the end. No concurrency optimizations were made at this point.
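
A sketch of the sharding idea, assuming an FNV-1a shard function and one channel per worker (the actual commit may differ):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"runtime"
	"strings"
	"sync"
)

// shard picks a worker index from the station name, so the same station
// always routes to the same worker.
func shard(station string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(station))
	return int(h.Sum32() % uint32(n))
}

func main() {
	workers := runtime.NumCPU()
	chans := make([]chan string, workers)
	counts := make([]map[string]int, workers) // aggregation simplified to counts
	var wg sync.WaitGroup

	for i := range chans {
		chans[i] = make(chan string, 1024)
		counts[i] = map[string]int{}
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			for station := range chans[i] {
				counts[i][station]++ // real code aggregates min/sum/max here
			}
		}(i)
	}

	for _, line := range []string{"Hamburg;12.0", "Bulawayo;8.9", "Hamburg;34.2"} {
		station, _, _ := strings.Cut(line, ";")
		chans[shard(station, workers)] <- station
	}
	for _, ch := range chans {
		close(ch)
	}
	wg.Wait()
	fmt.Println(counts)
}
```
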
  3. Batch read file lines: A trace analysis (using a file with 1M lines) showed that the processMeasurements function takes ~48% of the total time, and more than half of that is spent waiting in chanrecv (the sharding is also suboptimal, so less processing gets done than expected). readFileLines, on the other hand, takes ~22% of the total time but does a lot of IO reads with waits between them. The next step was to optimize only(!) the file read: read lines in batches and skip splitting them and converting them to strings (see the sketch after the benchmark below). After this, readFileLines took 1% of the total time (~20% waiting, half of that in syscalls) and processMeasurements took 77% (~20% waiting). In processMeasurements mostly the hashing changed, and a station can now land on multiple workers, so the partial results have to be merged at the end. With these changes the CPU cores are kept busier and readFileLines spends its time in fewer, bigger syscalls. Benchmark before and after for readFileLines:
❯ benchstat stats.old.txt stats.txt | tail -n 11
                │ stats.old.txt │             stats.txt              │
                │    sec/op     │   sec/op     vs base               │
ReadFileLines-4    3960.8µ ± 9%   130.9µ ± 2%  -96.70% (p=0.002 n=6)

                │ stats.old.txt  │              stats.txt               │
                │      B/op      │     B/op       vs base               │
ReadFileLines-4   15636.2Ki ± 0%   1008.1Ki ± 0%  -93.55% (p=0.002 n=6)

                │ stats.old.txt │           stats.txt           │
                │   allocs/op   │ allocs/op   vs base           │
ReadFileLines-4      5.000 ± 0%   5.000 ± 0%  ~ (p=1.000 n=6) ¹
¹ all samples are equal
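
A sketch of what the batched reader looks like; readFileLines is the name from the text, but the body and the batch size are assumptions:

```go
package main

import (
	"bytes"
	"fmt"
	"os"
)

// readFileLines reads big raw chunks and only cuts at the last newline,
// carrying the partial trailing line over to the next read; no per-line
// splitting or string conversion happens here.
func readFileLines(f *os.File, out chan<- []byte) {
	const batchSize = 1 << 20 // assumed batch size
	buf := make([]byte, batchSize)
	leftover := 0
	for {
		n, err := f.Read(buf[leftover:])
		if n == 0 && err != nil {
			break // EOF; a real implementation would also flush a final partial line
		}
		data := buf[:leftover+n]
		cut := bytes.LastIndexByte(data, '\n') + 1
		batch := make([]byte, cut)
		copy(batch, data[:cut]) // hand a stable copy to the consumer
		out <- batch
		leftover = copy(buf, data[cut:]) // keep the incomplete last line
	}
	close(out)
}

func main() {
	f, err := os.Open("measurements.txt") // assumed input path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	out := make(chan []byte, 8)
	go readFileLines(f, out)
	lines := 0
	for batch := range out {
		lines += bytes.Count(batch, []byte{'\n'})
	}
	fmt.Println("lines:", lines)
}
```
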
  4. Batch process lines: Read chunks in getMeasurements and distribute them to parallel workers that process the measurements (processMeasurements). The processMeasurements function splits the lines and aggregates the chunk; the result is sent back to getMeasurements, where the aggregated subresults are merged (see the sketch after the benchmarks below). Both processMeasurements (pm_stats) and getMeasurements (gm_stats) improved, but the channel synchronization degrades the overall performance (to be fixed in a later step):
❯ benchstat pm_stats.orig.txt pm_stats.txt | tail -n 11
                      │ pm_stats.orig.txt │            pm_stats.txt            │
                      │      sec/op       │   sec/op     vs base               │
ProcessMeasurements-4         3.905µ ± 1%   1.942µ ± 1%  -50.26% (p=0.002 n=6)

                      │ pm_stats.orig.txt │            pm_stats.txt             │
                      │       B/op        │    B/op      vs base                │
ProcessMeasurements-4          680.0 ± 0%   7064.0 ± 0%  +938.82% (p=0.002 n=6)

                      │ pm_stats.orig.txt │           pm_stats.txt            │
                      │     allocs/op     │ allocs/op   vs base               │
ProcessMeasurements-4          8.000 ± 0%   4.000 ± 0%  -50.00% (p=0.002 n=6)
❯ benchstat gm_stats.orig.txt gm_stats.txt | tail -n 11
                  │ gm_stats.orig.txt │            gm_stats.txt            │
                  │      sec/op       │   sec/op     vs base               │
GetMeasurements-4       126.02µ ± 23%   11.13µ ± 2%  -91.16% (p=0.002 n=6)

                  │ gm_stats.orig.txt │            gm_stats.txt             │
                  │       B/op        │     B/op      vs base               │
GetMeasurements-4       633.69Ki ± 0%   25.71Ki ± 0%  -95.94% (p=0.002 n=6)

                  │ gm_stats.orig.txt │           gm_stats.txt            │
                  │     allocs/op     │ allocs/op   vs base               │
GetMeasurements-4          17.00 ± 0%   21.00 ± 0%  +23.53% (p=0.002 n=6)
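
A sketch of this chunk-level pipeline; the processMeasurements and merge bodies are illustrative, not the commit's code:

```go
package main

import (
	"bytes"
	"fmt"
	"strconv"
)

type stats struct {
	min, max, sum float64
	count         int
}

// processMeasurements aggregates one newline-terminated chunk into a local map.
func processMeasurements(chunk []byte) map[string]*stats {
	acc := map[string]*stats{}
	for len(chunk) > 0 {
		nl := bytes.IndexByte(chunk, '\n') // chunks end on a newline by construction
		line := chunk[:nl]
		chunk = chunk[nl+1:]
		sep := bytes.IndexByte(line, ';')
		station := string(line[:sep])
		temp, _ := strconv.ParseFloat(string(line[sep+1:]), 64)
		s, ok := acc[station]
		if !ok {
			s = &stats{min: temp, max: temp}
			acc[station] = s
		}
		if temp < s.min {
			s.min = temp
		}
		if temp > s.max {
			s.max = temp
		}
		s.sum += temp
		s.count++
	}
	return acc
}

// merge folds a worker's partial result into the global one.
func merge(dst, src map[string]*stats) {
	for k, v := range src {
		d, ok := dst[k]
		if !ok {
			dst[k] = v
			continue
		}
		if v.min < d.min {
			d.min = v.min
		}
		if v.max > d.max {
			d.max = v.max
		}
		d.sum += v.sum
		d.count += v.count
	}
}

func main() {
	total := map[string]*stats{}
	merge(total, processMeasurements([]byte("Hamburg;12.0\nHamburg;34.2\n")))
	merge(total, processMeasurements([]byte("Bulawayo;8.9\n")))
	for k, v := range total {
		fmt.Printf("%s=%.1f/%.1f/%.1f\n", k, v.min, v.sum/float64(v.count), v.max)
	}
}
```
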
  5. Use L3 size chunks (see the sketch below).
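
The change in this step is essentially one constant; a sketch, with the value assumed from the hardware listings above rather than taken from the commit:

```go
// Sized so a chunk stays cache-resident while a worker scans it:
// 4 MiB matches the L3 cache of the Dell in the lscpu output above.
const chunkSize = 4 << 20
```
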

  6. Refactor to parallel read and process: Refactor from a concurrency-heavy approach to a parallelism-heavy one: read and process the file in equally distributed parts across all CPU cores and merge the results at the end (see the sketch below). This turned out to be a bit slower, but the goroutines need less synchronization, spend ~15% more time executing, and use far less system memory (visible in top). This code also looks easier to improve and should gain more performance on a higher core count.
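
A sketch of this layout, splitting the file into per-core byte ranges aligned to newlines (illustrative; it reads each part whole where the real code would stream it in chunks):

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"runtime"
	"sync"
)

// alignedOffsets splits the file into n parts of roughly equal size, moving
// each boundary forward to the next newline so no line is split in half.
// (1BRC lines are short, so a 256-byte probe always finds a newline.)
func alignedOffsets(f *os.File, size int64, n int) []int64 {
	offs := []int64{0}
	probe := make([]byte, 256)
	for i := 1; i < n; i++ {
		pos := size * int64(i) / int64(n)
		m, _ := f.ReadAt(probe, pos)
		nl := bytes.IndexByte(probe[:m], '\n')
		offs = append(offs, pos+int64(nl)+1)
	}
	return append(offs, size)
}

func main() {
	f, err := os.Open("measurements.txt") // assumed input path
	if err != nil {
		panic(err)
	}
	defer f.Close()
	fi, _ := f.Stat()

	n := runtime.NumCPU()
	offs := alignedOffsets(f, fi.Size(), n)

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(from, to int64) {
			defer wg.Done()
			part := make([]byte, to-from)
			f.ReadAt(part, from) // ReadAt is safe for concurrent use
			// real code would scan in cache-sized chunks and aggregate here
			fmt.Printf("part %d..%d: %d lines\n", from, to, bytes.Count(part, []byte{'\n'}))
		}(offs[i], offs[i+1])
	}
	wg.Wait()
}
```
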

  7. Optimize map allocation for processing: More profiling showed that the processMeasurements function does a lot of allocations (for the measurement object) and map lookups. Using an FNV-1a hash (int32) as the key improved the execution time by ~10%, but recovering the station names from it is not possible without extra complexity that eats up the gain. So the remaining improvement was to properly pre-allocate the map and avoid re-allocations during execution (see the sketch after the benchmark below). On a small dataset this single change gives a 55% gain at the function level:

❯ benchstat pm_stats.orig.txt pm_stats.txt | tail -n 11
                      │ pm_stats.orig.txt │            pm_stats.txt            │
                      │      sec/op       │   sec/op     vs base               │
ProcessMeasurements-4         3.565µ ± 3%   1.574µ ± 4%  -55.85% (p=0.002 n=6)

                      │ pm_stats.orig.txt │            pm_stats.txt             │
                      │       B/op        │     B/op      vs base               │
ProcessMeasurements-4        7.008Ki ± 0%   1.109Ki ± 0%  -84.17% (p=0.002 n=6)

                      │ pm_stats.orig.txt │           pm_stats.txt            │
                      │     allocs/op     │ allocs/op   vs base               │
ProcessMeasurements-4          4.000 ± 0%   3.000 ± 0%  -25.00% (p=0.002 n=6)
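
The essence of the fix, as a sketch (the capacity is an assumption based on 1BRC's limit of at most 10,000 distinct stations, not a value from the commit):

```go
// stats is the per-station aggregate, as in the earlier sketches.
type stats struct {
	min, max, sum float64
	count         int
}

// newAccumulator pre-sizes the map so it never grows or rehashes while
// processing; one allocation up front instead of re-allocations mid-run.
func newAccumulator() map[string]*stats {
	const expectedStations = 1 << 14 // comfortably above the 10,000 maximum
	return make(map[string]*stats, expectedStations)
}
```
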

I also optimized the getResults function. It is not a big overall improvement, because it only runs briefly at the end of the process, but I did it just for the fun :) (a sketch of the change follows the benchmark below).

❯ benchstat gr_stats.orig.txt gr_stats.txt | tail -n 11
            │ gr_stats.orig.txt │            gr_stats.txt            │
            │      sec/op       │   sec/op     vs base               │
GetResult-4        3629.5n ± 2%   668.3n ± 1%  -81.59% (p=0.002 n=6)

            │ gr_stats.orig.txt │           gr_stats.txt           │
            │       B/op        │    B/op     vs base              │
GetResult-4          560.0 ± 0%   576.0 ± 0%  +2.86% (p=0.002 n=6)

            │ gr_stats.orig.txt │           gr_stats.txt            │
            │     allocs/op     │ allocs/op   vs base               │
GetResult-4         25.000 ± 0%   2.000 ± 0%  -92.00% (p=0.002 n=6)
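
A sketch of the kind of change that removes those allocations: sort once, then render everything into a single pre-grown strings.Builder. getResults is the name from the text; the body, and the imports fmt, sort, and strings, are assumed:

```go
// getResults renders the sorted result with one up-front buffer allocation
// instead of concatenating per station.
func getResults(acc map[string]*stats) string {
	names := make([]string, 0, len(acc))
	for name := range acc {
		names = append(names, name)
	}
	sort.Strings(names)

	var b strings.Builder
	b.Grow(len(acc) * 32) // rough per-station size estimate
	for _, name := range names {
		s := acc[name]
		fmt.Fprintf(&b, "%s=%.1f/%.1f/%.1f\n", name, s.min, s.sum/float64(s.count), s.max)
	}
	return b.String()
}
```
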

It looks like this part did a lot of allocations:

❯ benchstat main_stats.orig.txt main_stats.txt | tail -n 3
       │ main_stats.orig.txt │          main_stats.txt           │
       │      allocs/op      │ allocs/op   vs base               │
Main-4          53023.0 ± 0%   125.0 ± 2%  -99.76% (p=0.002 n=6)
  8. PGO and GC tuning: This is not part of the official challenge because it is a non-code optimization. Profile-guided optimization resulted in a 4% gain. The GC was configured to run ~100x less often (triggering at around 6-8GB on my machine), which raised system memory usage from ~1% to ~50% and brought another 8% gain (see the commands below). The gain on M1 was minimal: less than 0.5s from PGO and nearly 1s from GC tuning.
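
The commands for this step might look like the following (a sketch with assumed profile and binary names; GOGC defaults to 100, so 10000 makes the GC run ~100x less often):

```
# collect a CPU profile from the Main benchmark, then rebuild with PGO
$ go test -run=NONE -bench=Main -cpuprofile=default.pgo .
$ go build -pgo=default.pgo -o 1brc .

# run with a much lazier GC
$ GOGC=10000 ./1brc > /dev/null
```
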

Out of curiosity I also solved the same problem with GitHub Copilot. I fed it the description from the 1brc challenge GitHub page with as few changes as possible. The generated programs worked and needed only tiny modifications to the output formatting and the mean calculation (I had the same issue myself, so I basically copied over my version). They were ready to run within a couple of minutes.

  • The first generated program (like most of the suggestions) was single-threaded. It ran for 424 seconds (148s on M1).
  • For the second version I picked a suggestion that tried to use some concurrency. It used around 2.5-3 CPU cores, but it used up all the system memory and stopped responding after nearly 15 minutes (on M1 it finished in 443s).

I also wrote the measurement file generator, but my logic for generating measurements was not as sophisticated as the original's. On my machine I reduced the generation time to 1:46 min, compared to 4:27 min for create_measurements3.sh. Interestingly, the same code behaved totally differently on M1: on my machine the threads were busy and didn't have to wait much for goroutine scheduling (~360% CPU), but on M1 the waiting was much worse (~510% CPU). Because of this, on M1 my multithreaded version was slower (2:40) than the single-threaded original (2:09). (I left a TODO to investigate and improve this.)
