
Comments (2)

AlexTMjugador commented on July 3, 2024

Hi, thanks for opening this interesting issue!

To begin with, I'd like to mention that I deliberately chose to distribute statically linked musl binaries in this repository to maximize their portability and usefulness. glibc does not properly support static linking, and newer toolchains tend to link against symbols introduced in recent glibc releases, so systems running older glibc versions have a hard time running binaries built on up-to-date environments.

Building glibc-linked binaries with older toolchains is a workable solution to that problem, but it forgoes any benefits newer glibc releases introduce. AppImages are a nice solution in theory, but they rely on FUSE to work, which is known to cause distribution problems, and if it's performance we're after, we should also account for the CPU time penalty their compressed FUSE filesystem imposes. As far as I know, no solution for distributing binaries on Linux is free of downsides: they just make different tradeoffs.

On the other hand, while developing and profiling OptiVorbis I came to the conclusion that it is a userspace, compute-intensive workload, with the hottest code being, by and large, related to prefix code tree traversal. The code does few heap allocations, and the only system calls it makes are for allocating memory or file I/O. Rust also ships its own standard library, which provides replacements for most standard C library functions, so I don't expect that swapping musl for glibc will have a significant performance effect on OptiVorbis. (In fact, I recently changed the memory allocator used in the WASM binaries at the demo website and noticed no performance difference.)

Of course, my performance expectations might be wrong, and I'd gladly take them back given appropriate benchmarks to the contrary. However, my overall feeling is that the performance difference likely does not exist, so making things harder for people running slightly older distros is not justified.

Ultimately, I think that the maintainers of distro-specific package repositories are best positioned to build software in the way that integrates best with their ecosystem. For example, @Chocobo1 has kindly packaged OptiVorbis for Arch Linux, and the OptiVorbis binary in that package is linked against the distro glibc version. (Thank you a lot @Chocobo1 for doing this, by the way, I didn't notice until recently! ❤️)

from optivorbis.

AlexTMjugador commented on July 3, 2024

I finally had the time and will to properly benchmark the relative performance of musl vs. glibc-linked OptiVorbis binaries, and came up with interesting conclusions.

To start with, I generated release builds for both the musl and glibc x86_64 Linux targets, almost exactly as the CI workflow does, but with symbol information. I used the latest nightly Rust build at the time of writing, rustc 1.73.0-nightly (08d00b40a 2023-08-09). I had to remove the panic = "unwind" line from the bench profile in Cargo.toml because the unwinding panic strategy does not work well with the optimization switches used by the CI workflow:

cargo build --target x86_64-unknown-linux-{gnu,musl} --profile bench -Z build-std=core,std,alloc,proc_macro,panic_abort -Z build-std-features=panic_immediate_abort

Afterwards, I benchmarked both binaries with the following Hyperfine command, which to my surprise showed that the musl binary was ~20% slower (the particular input file makes no difference):

hyperfine --warmup 3 'target/x86_64-unknown-linux-gnu/release/optivorbis -r ogg2ogg /tmp/input.ogg /dev/null' 'target/x86_64-unknown-linux-musl/release/optivorbis -r ogg2ogg /tmp/input.ogg /dev/null'

Initial Hyperfine benchmark results

This was totally unexpected to me, as readelf quickly confirmed my previous statement that OptiVorbis makes few libc calls: a few for file I/O, a few for math operations, a handful for threading... and quite a few for memory management (malloc, memcpy, memmove, memcmp, memset, and more). Intrigued by the unexpected performance difference, I decided to find out what was causing it.

The conventional wisdom in the aforementioned posts is that musl's heap allocator has traditionally performed worse than glibc's, as a well-performing, general-purpose allocator is complex, which goes against musl's goal of simplicity. So I started with the obvious experiment: swapping in the mimalloc and rpmalloc allocators and repeating the same hyperfine benchmark.

Contrary to popular oversimplified views, and as I expected given how little OptiVorbis uses malloc, replacing the default musl allocator with either mimalloc or rpmalloc made no difference. What's the matter with musl, then, if heap allocation performance is not to blame?
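For context, swapping out the global allocator in a Rust program is a one-line change, which is what makes this experiment so cheap to run. The mimalloc and rpmalloc crates each expose a type implementing the standard GlobalAlloc trait that can be plugged in exactly like this; the sketch below uses std's System allocator instead so it is self-contained:

```rust
use std::alloc::System;

// Registering a global allocator: mimalloc's `MiMalloc` or rpmalloc's
// `RpMalloc` would take the place of `System` here, with no other code
// changes. Every heap allocation in the program then goes through it.
#[global_allocator]
static GLOBAL: System = System;

fn main() {
    // This Vec is allocated through the registered global allocator.
    let v: Vec<u32> = (0..1024).collect();
    assert_eq!(v.iter().sum::<u32>(), 523_776);
    println!("allocated and summed {} elements", v.len());
}
```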

To gain some insight into the performance characteristics of linking against each libc flavor, I recorded a perf profile for a run of each release binary, using the same inputs (by the way, a recent CPU and the --call-graph lbr switch are necessary to get accurate caller information in the generated profiles):

for libc in gnu musl; do
perf record --call-graph lbr -F max -o perf_$libc.data -- target/x86_64-unknown-linux-$libc/release/optivorbis -r ogg2ogg /tmp/input.ogg /dev/null
done

Visual side-by-side inspection of both profiles using perf report immediately revealed that the musl binary spent ~12% of its execution time in memcpy calls, while the same function was very cold with glibc, thus explaining more than 50% of the runtime difference:

memcpy being a hot function in musl binary

glibc's memcpy is known for being optimized to the point of using hand-written assembly for common cases on common platforms, so it's understandable that musl performs worse here. But OptiVorbis does not copy large buffers all the time: in fact, the BitpackWriter::write_unsigned_integer function that allegedly calls memcpy does no explicit buffer-to-buffer copying. OptiVorbis should not even call memcpy to move at most 4 bytes around in the first place, as doing so even with glibc's optimized implementation still incurs the overhead of a function call.

Thus, I went down one level of abstraction and used cargo asm --rust --profile bench -p vorbis_bitpack 'vorbis_bitpack::BitpackWriter<W>::write_unsigned_integer' to inspect the assembly code responsible for the excessive memcpy usage. It was immediately clear that I had rediscovered the cause of an old Rust PR: Rust's codegen always calls memcpy when more than one byte is copied from or to buffered I/O objects. However, that PR introduced an efficient code path for reading and writing single bytes, so I decided to do exactly that in my code, at the cost of less efficient handling of unbuffered I/O objects (which shouldn't be used anyway).
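The idea behind the change can be sketched as follows (this is illustrative, not OptiVorbis's actual code, and write_le_bytes is a hypothetical helper): instead of handing the writer a multi-byte slice, which may go through memcpy, emit the packed value one byte at a time so each call passes a single-byte slice and takes the cheap path in buffered writers:

```rust
use std::io::{self, Write};

/// Write up to 4 bytes of a little-endian integer one byte at a time.
/// Each `write_all` call hands the writer a 1-byte slice, which buffered
/// writers can append without a memcpy call; the trade-off is more calls
/// per value, which is wasteful on unbuffered writers.
fn write_le_bytes<W: Write>(writer: &mut W, value: u32, byte_count: usize) -> io::Result<()> {
    assert!(byte_count <= 4);
    for i in 0..byte_count {
        writer.write_all(&[(value >> (8 * i)) as u8])?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let mut out = Vec::new();
    write_le_bytes(&mut out, 0x0403_0201, 3)?;
    assert_eq!(out, [0x01, 0x02, 0x03]);
    println!("{:02x?}", out);
    Ok(())
}
```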

With this simple change, the runtime difference between both libc flavors was reduced to a much more palatable ~5%. The glibc-linked binary got ~1% faster, too (although this measurement is well within the error margin):

Final Hyperfine benchmark results

As a result of the now reduced performance difference between linking OptiVorbis against musl or glibc, and given the possibility of narrowing the gap even further, I believe that the main motivation for officially distributing glibc binaries is moot, and significantly outweighed by the distribution benefits of musl. Of course, I still recommend that package repository maintainers continue to link OptiVorbis against whatever glibc version their distribution uses, as that will generate smaller executables with few headaches, but for upstream distribution, I'll keep using musl and minimizing runtime differences.

Edit: I have continued benchmarking the performance gap between libc implementations with other input files, and have noticed that in some cases the musl binaries are now on par with glibc, or even slightly faster. It's probably hard for the wider user community to find significant runtime differences now, but if anyone does, please consider posting a comment on this issue.

