Comments (2)
Hi, thanks for opening this interesting issue!
To begin with, I'd like to mention that I deliberately chose to distribute statically linked `musl` binaries in this repository to maximize their portability and usefulness. `glibc` does not properly support static linking, and newer toolchains tend to link against symbols introduced in recent `glibc` releases, so systems running older `glibc` releases will have a hard time getting binaries built on up-to-date environments to work. Building `glibc`-linked binaries with older toolchains is a workable solution to that problem, but it eliminates any benefits that newer `glibc` releases introduce. AppImages are a nice solution in theory, but they rely on FUSE to work, which is known to cause distribution problems, and if it's performance we're after, we should probably take into account the CPU time penalty they impose due to their compressed FUSE filesystem. As far as I know, no solution for distributing binaries on Linux is free of downsides: they just make different tradeoffs.
On the other hand, while developing and profiling OptiVorbis I came to the conclusion that it is a userspace, compute-intensive workload, with the hottest code being, by far, related to prefix code tree traversal. The code does few heap allocations, and the only system calls it makes are for allocating memory or file I/O. Rust also ships its own standard library that provides replacements for most standard C library functions, so I don't expect that swapping `musl` for `glibc` will have a significant performance effect on OptiVorbis. (In fact, I recently changed the memory allocator used in the WASM binaries at the demo website and noticed no performance difference.)
Of course, my performance expectations might be wrong, and I'd gladly take them back given appropriate benchmarks to the contrary. However, my overall feeling is that the performance difference likely does not exist, so making things harder for people running slightly older distros is not justified.
Ultimately, I think that the maintainers of distro-specific package repositories are better positioned to build software in the way that integrates best with their ecosystem. For example, @Chocobo1 has kindly packaged OptiVorbis for Arch Linux, and the OptiVorbis binary in that package is linked against the distro `glibc` version. (Thank you a lot, @Chocobo1, for doing this, by the way; I didn't notice until recently! ❤️)
from optivorbis.
I finally had the time and will to properly benchmark the relative performance of `musl`- vs. `glibc`-linked OptiVorbis binaries, and came up with interesting conclusions.
To start with, I generated release builds, almost like the CI workflow does, but with symbol information, for both the `musl` and `glibc` x86_64 Linux targets. I used the latest nightly Rust build at the time of writing, `rustc 1.73.0-nightly (08d00b40a 2023-08-09)`. I had to remove the `panic = "unwind"` line from the `bench` profile in `Cargo.toml`, because the unwinding panic strategy does not work well with the optimization switches used by the CI workflow:

```shell
cargo build --target x86_64-unknown-linux-{gnu,musl} --profile bench -Z build-std=core,std,alloc,proc_macro,panic_abort -Z build-std-features=panic_immediate_abort
```
Afterwards, I benchmarked both binaries with the following `hyperfine` invocation, which to my surprise showed that the `musl` binary was ~20% slower (the particular input file makes no difference):

```shell
hyperfine --warmup 3 'target/x86_64-unknown-linux-gnu/release/optivorbis -r ogg2ogg /tmp/input.ogg /dev/null' 'target/x86_64-unknown-linux-musl/release/optivorbis -r ogg2ogg /tmp/input.ogg /dev/null'
```
This was totally unexpected to me, as `readelf` quickly confirmed my previous statement that OptiVorbis makes few `libc` calls: a few for file I/O, a few for math operations, a handful for threading... and quite a few for memory management (`malloc`, `memcpy`, `memmove`, `memcmp`, `memset`, and more). Intrigued by the unexpected performance difference, I decided to find out what was causing it.
The conventional wisdom in the aforementioned posts is that `musl`'s heap allocator performance has traditionally been worse than `glibc`'s, as a well-performing, general-purpose allocator is complex, which goes against `musl`'s goal of simplicity. So I started with the obvious experiment of trying the `mimalloc` and `rpmalloc` allocators instead, and repeating the same `hyperfine` benchmark.
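For reference, swapping the allocator in a Rust program goes through the `#[global_allocator]` attribute. This is a minimal sketch of the mechanism, using std's `System` allocator as a stand-in (the `mimalloc` and `rpmalloc` crates expose types that slot into the same attribute):

```rust
use std::alloc::System;

// Every heap allocation in the program is routed through this allocator.
// Replacing `System` with e.g. mimalloc's allocator type is the whole
// extent of the experiment described above.
#[global_allocator]
static GLOBAL: System = System;

// Any allocating code now exercises GLOBAL.
fn allocate_and_sum(n: u32) -> u32 {
    let v: Vec<u32> = (0..n).collect();
    v.iter().sum()
}

fn main() {
    assert_eq!(allocate_and_sum(4), 6);
}
```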
Contrary to popular oversimplified views, and as I expected due to how little OptiVorbis uses `malloc`, replacing the default `musl` allocator with either `mimalloc` or `rpmalloc` made no difference. What's the matter with `musl` then, if heap allocation performance is not to blame?
To gain some insight into the performance characteristics of linking against each `libc` flavor, I recorded a `perf` profile for a run of each release binary, using the same inputs (by the way, a recent CPU and the `--call-graph lbr` switch are necessary to get accurate caller information in the generated profiles):

```shell
for libc in gnu musl; do
  perf record --call-graph lbr -F max -o perf_$libc.data -- target/x86_64-unknown-linux-$libc/release/optivorbis -r ogg2ogg /tmp/input.ogg /dev/null
done
```
Visual side-by-side inspection of both profiles using `perf report` immediately revealed that the `musl` binary spent ~12% of its execution time in `memcpy` calls, while the same function was very cold with `glibc`, thus explaining more than 50% of the runtime difference:
`glibc`'s `memcpy` is known for being optimized to the level of using hand-written assembly for common cases on common platforms, so it's understandable that `musl` performs worse here. But OptiVorbis does not copy large buffers all the time: in fact, the `BitpackWriter::write_unsigned_integer` function that allegedly calls `memcpy` does no explicit buffer-to-buffer copying. OptiVorbis should not even try to call `memcpy` to move at most 4 bytes around in the first place, as doing so even with `glibc`'s optimized implementation still incurs the overhead of a function call.
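To illustrate the point with a hypothetical helper (not OptiVorbis code): a copy of a small, fixed-size group of bytes can be expressed with array operations, which an optimizing compiler typically lowers to plain register moves rather than a `memcpy` call:

```rust
// Hypothetical helper: pack a u32 into a 4-byte little-endian buffer.
// With optimizations on, the fixed-size array assignment typically
// compiles to a single 4-byte store, with no call into libc's memcpy.
fn pack_u32_le(value: u32, out: &mut [u8; 4]) {
    *out = value.to_le_bytes();
}

fn main() {
    let mut out = [0u8; 4];
    pack_u32_le(0x0102_0304, &mut out);
    assert_eq!(out, [0x04, 0x03, 0x02, 0x01]);
}
```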
Thus, I went down one level of abstraction and inspected the assembly code responsible for the excessive `memcpy` usage:

```shell
cargo asm --rust --profile bench -p vorbis_bitpack 'vorbis_bitpack::BitpackWriter<W>::write_unsigned_integer'
```

It was immediately clear that I had rediscovered the cause of an old Rust PR: Rust's codegen always calls `memcpy` when more than one byte is copied from or to buffered I/O objects. However, that PR introduced an efficient code path for reading and writing single bytes, so I decided to do exactly that in my code, at the cost of less efficient handling of unbuffered I/O objects (which shouldn't be used, anyway).
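The actual patch isn't reproduced here, but the idea can be sketched as follows: instead of handing a multi-byte slice to `write_all` (which codegen may lower to a `memcpy`), emit one byte at a time so the buffered writer's single-byte path is taken. The helper name and signature below are hypothetical:

```rust
use std::io::{self, Write};

// Hypothetical sketch: write the low `byte_count` little-endian bytes
// of `value` one at a time, so a buffered writer can take its efficient
// single-byte path instead of a memcpy-backed multi-byte copy.
fn write_le_bytes<W: Write>(writer: &mut W, value: u32, byte_count: usize) -> io::Result<()> {
    for byte in &value.to_le_bytes()[..byte_count] {
        writer.write_all(std::slice::from_ref(byte))?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let mut buf = Vec::new();
    write_le_bytes(&mut buf, 0x0A0B_0C0D, 3)?;
    assert_eq!(buf, [0x0D, 0x0C, 0x0B]);
    Ok(())
}
```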
With this simple change, the runtime difference between both `libc` flavors was reduced to a much more palatable ~5%. The `glibc`-linked binary got ~1% faster, too (although this measurement is well within the error margin):
As a result of the now reduced performance difference between linking OptiVorbis against `musl` or `glibc`, and given the possibility of narrowing the gap even further, I believe that the main motivation for officially distributing `glibc` binaries is moot, and significantly outweighed by the distribution benefits of `musl`. Of course, I still recommend that package repository maintainers continue to link OptiVorbis against whatever `glibc` version their distribution uses, as that will generate smaller executables with few headaches, but for upstream distribution, I'll keep using `musl` and minimizing runtime differences.
Edit: I have continued benchmarking the performance gap between `libc` implementations with other input files, and have noticed that in some cases the `musl` binaries are now on par with `glibc`, or even slightly faster. It's probably hard for the wider user community to find significant runtime differences now, but if anyone does, please consider posting a comment on this issue.