reproducible files (optivorbis issue, CLOSED)

fekir commented on July 3, 2024
reproducible files

from optivorbis.

Comments (10)

fekir commented on July 3, 2024

Thank you for the fast reply.

I'm interested in reproducible output as part of a bigger project (https://reproducible-builds.org/); comparing the files and metadata is something I want to avoid.

Section 4 of that specification states that every Ogg stream is composed of logical bitstreams with an identifying serial number, and "this unique serial number is created randomly".

Yes, I have no idea how an Ogg file is structured, but by doing a diff I can see that the differences are minimal. With diffoscope it is possible to see that only the serial number and the checksum differ.
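For anyone who wants to check that observation without diffoscope, here is a minimal sketch. It assumes the standard Ogg page header layout (serial number at byte offset 14, CRC at byte offset 22, segment count at byte offset 26); the helper names are made up for illustration, and it only compares raw pages, not decoded audio.

```python
def iter_pages(data: bytes):
    """Yield each Ogg page found in `data` as a bytes object."""
    pos = 0
    while pos < len(data):
        if data[pos:pos + 4] != b"OggS":
            raise ValueError(f"no Ogg page at offset {pos}")
        n_segments = data[pos + 26]
        # The segment table lists the sizes of the body segments.
        body_len = sum(data[pos + 27:pos + 27 + n_segments])
        end = pos + 27 + n_segments + body_len
        yield data[pos:end]
        pos = end


def mask_serial_and_crc(page: bytes) -> bytes:
    """Zero the serial number (bytes 14-17) and the CRC (bytes 22-25)."""
    masked = bytearray(page)
    masked[14:18] = bytes(4)
    masked[22:26] = bytes(4)
    return bytes(masked)


def same_modulo_serials(a: bytes, b: bytes) -> bool:
    """True if both files are identical apart from serials and checksums."""
    pages_a = [mask_serial_and_crc(p) for p in iter_pages(a)]
    pages_b = [mask_serial_and_crc(p) for p in iter_pages(b)]
    return pages_a == pages_b
```

Calling `same_modulo_serials(open("a.ogg", "rb").read(), open("b.ogg", "rb").read())` on two outputs of the same input should then return `True` if the serial and checksum really are the only differences.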

And I can also confirm that the option works as you've described.

doing so makes it possible to easily create valid, chained Ogg streams by catting them together

Does it mean that catting a file to itself is not supported?

Either way, since the input file and the output file are platonically the same audio file, wouldn't it make sense to reuse the same ids? Would there be any disadvantages?

fekir commented on July 3, 2024

Thank you for the detailed explanation.

However, it can be argued that catting (i.e., chaining) Ogg files is a very niche use case. [...] and hardly anyone on the Internet mentions it as an advantage of the Ogg Vorbis format.

Yeah, I'm not interested in catting ogg files together, especially now, knowing that cat might create an invalid file depending on the input, and the user has no control/feedback.
I suppose there are better tools for concatenating audio files together.

The Ogg specification says that a bitstream serial number "does not have any connection to the content or encoder of the logical bitstream it represents". My interpretation of that sentence is that serial numbers should always be randomly generated, as otherwise there would be a "connection" between the content of the logical bitstream and its serial number.

Mhm, OK, I hoped there would be a good compromise: non-random ids, and different ids for files that had different ids before the optimization.
Since OptiVorbis is optimizing an existing file, I also thought that changing ids might be out of scope, as the two files should behave the same.

It seems to me you would prefer to create random ids by default, and non-random ones only on explicit request (the status quo).

Obviously I would prefer it to be the other way round XD

Would it at least be possible to support the environment variable SOURCE_DATE_EPOCH (see https://reproducible-builds.org/docs/source-date-epoch/)?

Yes, I can set the flag --remuxer_option randomize_stream_serials=false, or define a shell alias.

The main advantage of SOURCE_DATE_EPOCH is that it is a standardized environment variable; a single API for controlling multiple tools (compiler, archiving tools, ...).

Being an environment variable, it works "well" when I'm not invoking optivorbis directly; for example, if it is used internally by some other tool.

AlexTMjugador commented on July 3, 2024

Just thinking out loud; I think with ffmpeg it is possible to concatenate without transcoding/re-encoding the audio stream, but maybe only in some situations (like equal sample rate)?

It may be possible, but I couldn't get FFmpeg to output an Ogg Vorbis file with two logical bitstreams, as cat would do. It looks like the lossless concatenation features offered by FFmpeg insist on outputting a single logical bitstream, which, unlike the catting approach, can only work well if two Ogg Vorbis files share exactly the same codec parameters (number of audio channels, sampling frequency, and codec setup data that OptiVorbis adapts to each file to achieve more efficient encoding).

I do not think I've understood that part

I was just thinking out loud how that idea would prevent repeating serials under most circumstances, don't worry about the specifics! I'm happy to read that the environment variable is an acceptable solution for you, and I'm also happy to not make things harder for any folks out there who might use Ogg stream chaining by default 😉

I'm reopening this issue to better track progress on the related reproducible builds standard.

AlexTMjugador commented on July 3, 2024

Hey @fekir, I've just implemented support for the SOURCE_DATE_EPOCH environment variable! 🎉

Please feel free to look at the corresponding commit above and give feedback on the changes. There hasn't been a release with these changes yet, so you'll need to get OptiVorbis executables from CI (which is currently broken due to a regression in rustdoc that should be fixed soon) or by building from source.

I'll close this issue as soon as I receive positive feedback about the new feature or I decide to release it, whichever happens sooner.

fekir commented on July 3, 2024

Thank you for looking at it.

I did not test it (I do not have the possibility to build it right now), but the changes you made at least make sense :)

AlexTMjugador commented on July 3, 2024

Hi, thanks for reaching out! 😄

By default, OptiVorbis strives to adhere closely to the Ogg format specification. Section 4 of that specification states that every Ogg stream is composed of logical bitstreams with an identifying serial number, and "this unique serial number is created randomly". Therefore, it is expected that OptiVorbis will generate slightly different files for the same input by default, since the logical bitstreams they contain should have different serial numbers.

However, if reproducibility of the generated files is your utmost concern, you can opt out of randomizing stream serials by using the already available randomize_stream_serials remuxer option, like this: optivorbis --remuxer_option randomize_stream_serials=false file1.ogg file2.ogg (see also the related first_stream_serial_offset option). After doing this, the generated files for the same input and options should always be byte-identical (i.e., have the same hashes).
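To check the byte-identical claim on a given input, something like the following sketch could be used. It runs the command line quoted above twice and compares SHA-256 hashes; it assumes optivorbis is on the PATH, and the function names and the run1.ogg/run2.ogg file names are only illustrative.

```python
import hashlib
import subprocess
from pathlib import Path


def sha256_of(path) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def run_optivorbis(src, dst) -> None:
    """Optimize src into dst with serial randomization disabled."""
    subprocess.run(
        ["optivorbis", "--remuxer_option", "randomize_stream_serials=false",
         str(src), str(dst)],
        check=True,
    )


def is_reproducible(src, workdir, optimize=run_optivorbis) -> bool:
    """Optimize the same input twice and compare the output hashes."""
    first = Path(workdir) / "run1.ogg"
    second = Path(workdir) / "run2.ogg"
    optimize(src, first)
    optimize(src, second)
    return sha256_of(first) == sha256_of(second)
```

The `optimize` parameter exists only so the comparison logic can be exercised without the real binary.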

Please note that randomly generating stream serials is not just a pedantic requirement of the Ogg specification: doing so makes it possible to easily create valid, chained Ogg streams by catting them together, which would not be possible if serial numbers were shared across files. (Source.)

If you want to verify that OptiVorbis has indeed generated equivalent audio files for the same inputs without giving up on random serial number generation, I suggest comparing the decoded audio samples and other file metadata instead. You can do this by comparing the output of the ogginfo and oggdec command-line tools for the generated files.

I'm closing this as I believe the feature you requested is already available and this comment provided the necessary context to use it properly, but please feel free to get in touch if you have anything else to add.

AlexTMjugador commented on July 3, 2024

I'm glad to hear that the option worked well for you!

Does it mean that catting a file to itself is not supported?

Yes, that's correct. Chaining an Ogg file to itself with cat will not work, because it would create an invalid physical Ogg bitstream, with at least two logical bitstreams having the same serial number. Decoders strictly require that the serial numbers within a file are different: otherwise they can't properly parse and seek each bitstream.
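As a rough illustration of that failure mode, the sketch below scans a physical Ogg bitstream and reports serial numbers that start more than one logical bitstream. It assumes the standard page header layout (header type flags at byte offset 5, with 0x02 marking a beginning-of-stream page, and the serial number at byte offset 14); the function names are hypothetical.

```python
import struct
from collections import Counter

BOS_FLAG = 0x02  # header type bit set on the first page of a logical bitstream


def bos_serials(data: bytes) -> list:
    """Return the serial number of every beginning-of-stream page, in order."""
    serials = []
    pos = 0
    while pos < len(data):
        if data[pos:pos + 4] != b"OggS":
            raise ValueError(f"no Ogg page at offset {pos}")
        header_type = data[pos + 5]
        (serial,) = struct.unpack_from("<I", data, pos + 14)
        n_segments = data[pos + 26]
        body_len = sum(data[pos + 27:pos + 27 + n_segments])
        if header_type & BOS_FLAG:
            serials.append(serial)
        pos += 27 + n_segments + body_len
    return serials


def reused_serials(data: bytes) -> set:
    """Serials that begin more than one logical bitstream in the same file."""
    counts = Counter(bos_serials(data))
    return {serial for serial, n in counts.items() if n > 1}
```

For a file catted to itself, `reused_serials` returns the file's own serial, which is exactly the collision described above; for a properly chained file it returns an empty set.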

Either way, since the input file and the output file are platonically the same audio file, wouldn't it make sense to reuse the same ids? Would there be any disadvantages?

The Ogg specification says that a bitstream serial number "does not have any connection to the content or encoder of the logical bitstream it represents". My interpretation of that sentence is that serial numbers should always be randomly generated, as otherwise there would be a "connection" between the content of the logical bitstream and its serial number.

On the practical side, the default behavior of generating random serial numbers allows catting a file to an equivalent version of itself, and also enables catting of files that share stream serial numbers, where such numbers were only unique within their own files. Passing through serial numbers would not guarantee these properties, which I think are nice to have by default.

However, it can be argued that catting (i.e., chaining) Ogg files is a very niche use case. Frankly, I didn't know about it until I did the necessary Ogg format research to develop OptiVorbis, and hardly anyone on the Internet mentions it as an advantage of the Ogg Vorbis format. The fact that two different files share bitstream serials does not matter for decoding: after all, decoders have to work with each file independently, the possibility of serial number collisions between files is not zero, and that serial numbers are generated "randomly" is hard to test for. So, if your users will never chain Ogg files, not randomizing serial generation can be the right call. Besides the chaining use case, I'm not aware of any other drawbacks to not randomizing serials.

Given these considerations, I don't think I will introduce functionality to copy stream serial numbers in OptiVorbis, but you are welcome to disable their randomization if that fits your use case better. If you are inclined to copy stream serials anyway, you might want to fetch the original serial with ogginfo, and then use the randomize_stream_serials and first_stream_serial_offset options to tell OptiVorbis to use that serial for the bitstream it will generate. Alternatively, you can use the rogg_serial tool, which is part of the rogg suite of tools mentioned in the OptiVorbis README.

AlexTMjugador commented on July 3, 2024

[...] I suppose there are better tools for concatenating audio files together.

The unusual beauty of chaining is that, unlike joining audio files together with tools such as Audacity, GStreamer or ffmpeg, no audio data is re-encoded, which is lossless and much faster. But this feature is little known, poses problems with some popular decoders, and has the pitfalls we have identified, so it's barely used in practice. A proper tool for Ogg Vorbis file chaining would ensure that serials are unique and then cat the files together, but I'm not aware of its existence. The necessary building blocks are already publicly available, though.
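To make the "ensure that serials are unique and then cat" idea concrete, here is a minimal sketch of what such a tool might do. It is not an existing tool: the function names are made up, the new serials are simply 0, 1, 2, ... rather than random, and it assumes the standard Ogg page layout and page checksum (CRC-32 with polynomial 0x04C11DB7, zero initial value, no bit reflection, computed over the page with its CRC field zeroed).

```python
import struct


def _build_crc_table():
    table = []
    for i in range(256):
        r = i << 24
        for _ in range(8):
            r = ((r << 1) ^ 0x04C11DB7) if r & 0x80000000 else (r << 1)
            r &= 0xFFFFFFFF
        table.append(r)
    return table


_CRC_TABLE = _build_crc_table()


def ogg_crc(page: bytes) -> int:
    """Checksum of a page whose CRC field has already been zeroed."""
    crc = 0
    for byte in page:
        crc = ((crc << 8) & 0xFFFFFFFF) ^ _CRC_TABLE[(crc >> 24) ^ byte]
    return crc


def iter_pages(data: bytes):
    """Yield each Ogg page found in `data` as a bytes object."""
    pos = 0
    while pos < len(data):
        if data[pos:pos + 4] != b"OggS":
            raise ValueError(f"no Ogg page at offset {pos}")
        n_segments = data[pos + 26]
        end = pos + 27 + n_segments + sum(data[pos + 27:pos + 27 + n_segments])
        yield data[pos:end]
        pos = end


def with_serial(page: bytes, serial: int) -> bytes:
    """Return a copy of the page with a new serial and a matching CRC."""
    out = bytearray(page)
    struct.pack_into("<I", out, 14, serial)
    struct.pack_into("<I", out, 22, 0)  # the CRC covers a zeroed CRC field
    struct.pack_into("<I", out, 22, ogg_crc(bytes(out)))
    return bytes(out)


def chain(files: list) -> bytes:
    """Concatenate Ogg files, renumbering serials so none is reused."""
    out = bytearray()
    mapping = {}  # (file index, old serial) -> new serial
    for index, data in enumerate(files):
        for page in iter_pages(data):
            (old,) = struct.unpack_from("<I", page, 14)
            new = mapping.setdefault((index, old), len(mapping))
            out += with_serial(page, new)
    return bytes(out)
```

No audio packets are touched: only the serial and checksum fields of each page header change, so the result stays lossless, in line with the point about chaining made above.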

Would it at least be possible to support the environment variable SOURCE_DATE_EPOCH (see https://reproducible-builds.org/docs/source-date-epoch/)?

I think this is a great idea! The way I see it, it could work by fixing an RNG algorithm and seed, and then perturbing the resulting predictable RNG sequence with a checksum of the input data. This way, different files in different OptiVorbis runs would still have random but reproducible and different serial numbers, while the same files optimized in the same OptiVorbis run (which is currently only possible via the Rust API, not the CLI) would still get different serials. This could accommodate all of the use cases I'm thinking of quite nicely, what do you think? 😄
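A minimal sketch of that idea, just to make the discussion concrete: seed a fixed RNG from SOURCE_DATE_EPOCH and perturb its output with a checksum of the input data. The function name, the use of CRC-32 and the XOR combination are all assumptions for illustration and not necessarily what OptiVorbis ends up doing; the "same file twice in one run" case is not covered here.

```python
import os
import random
import zlib


def stream_serial(input_data: bytes, environ=os.environ) -> int:
    """Pick a 32-bit serial: random by default, reproducible if the
    SOURCE_DATE_EPOCH environment variable is set."""
    epoch = environ.get("SOURCE_DATE_EPOCH")
    if epoch is None:
        # Default behaviour: a fresh random serial on every run.
        return random.getrandbits(32)
    # Fixed algorithm and seed give a predictable sequence...
    rng = random.Random(int(epoch))
    # ...which is then perturbed by a checksum of the input, so that
    # different inputs still get different serials.
    return rng.getrandbits(32) ^ zlib.crc32(input_data)
```

With the variable set, the same input always maps to the same serial across runs, while two different inputs get different serials whenever their CRC-32 values differ.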

fekir commented on July 3, 2024

[...] tools such as Audacity, GStreamer or ffmpeg, no audio data is re-encoded

Just thinking out loud; I think with ffmpeg it is possible to concatenate without transcoding/re-encoding the audio stream, but maybe only in some situations (like equal sample rate)?

what do you think

Well, I would still prefer not to have to define the variable (it was a surprise for me that different runs on the same input generate different outputs; I was just lucky that I decided to check), but it sounds good.

the same files optimized in the same OptiVorbis run (which is currently only possible via the Rust API, not the CLI) would still get different serials

I do not think I've understood that part, unless you meant

the same files optimized in the same OptiVorbis run (which is currently only possible via the Rust API, not the CLI) would still get same serials

or

the different files optimized in the same OptiVorbis run (which is currently only possible via the Rust API, not the CLI) would still get different serials

AlexTMjugador commented on July 3, 2024

After some other QoL changes, OptiVorbis v0.2.0 has been released with this feature! 🎉

Sorry it took me so long to make a release. I hope it was at least worth the wait 🙂
