Reproducible builds (rules_r issue, closed)

grailbio commented on August 18, 2024
Reproducible builds

from rules_r.

Comments (8)

hchauvin commented on August 18, 2024

Hi, this has become a nuisance for me as well, so I dug in a bit.

Concerning the build timestamp (which also contains the R version and operating system): it adds one extra entry to the DESCRIPTION file and does not seem to appear anywhere else. I could not find any reference to the operating system elsewhere either.

Concerning the source directory/library directory references: for the packages I surveyed at least, they were found in the binary files (.so, .a, ...). They are not stamps added by R; they come with the debug symbols that R's default compiler flags produce. It does not seem to be possible to remove those path references with a compiler flag (see https://stackoverflow.com/questions/9607155/make-gcc-put-relative-filenames-in-debug-information), so the best option IMO might be to invoke strip on all the binary files after they are generated by R CMD INSTALL. If debug symbols are needed, then reproducibility can probably be put aside anyway, and we can have a '--define' Bazel option to disable stripping. strip ships with Xcode on Mac OS X, and on Ubuntu/Debian it is part of the binutils package and installed by default. It should be invoked with '-S' instead of '-d' for Mac OS X/Linux compatibility.
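
A minimal sketch of the stripping step described above, in Python. The function names, the suffix list, and the dry-run default are assumptions for illustration, not part of rules_r; only `strip -S` comes from the comment itself.

```python
import subprocess
from pathlib import Path

# Suffixes treated as native binaries. This list is an assumption;
# extend it for whatever your packages actually produce.
BINARY_SUFFIXES = {".so", ".a", ".dylib"}


def find_native_binaries(pkg_dir):
    """Return the sorted native binary files found below pkg_dir."""
    root = Path(pkg_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix in BINARY_SUFFIXES
    )


def strip_debug_symbols(pkg_dir, dry_run=True):
    """Build one `strip -S` command per binary; run them unless dry_run."""
    commands = [["strip", "-S", str(p)] for p in find_native_binaries(pkg_dir)]
    if not dry_run:
        for cmd in commands:
            # check=True raises if strip fails on any file.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `strip_debug_symbols(installed_pkg_dir, dry_run=False)` after R CMD INSTALL would run the commands; the default only returns them so they can be inspected first.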

I do not guarantee this will make the builds reproducible, but it should address the two issues you pointed out, without having to acquire a lock.

If this sounds good to you, I'll try to do a proof-of-concept with an additional reproducibility test (I hope it will pass!) sometime during the week.

siddharthab commented on August 18, 2024

Hi Hadrien,

It's not just the compiler adding the debug symbols. When you take a checksum of every file in the installed package, you will see that the checksums of some .rdb/.rdx files vary as well. I was able to load one of these files in R and see that it had references to the library directory. These checksums become identical when you keep the --library flag constant. The --built-timestamp flag is available to make the package completely reproducible, but that assumes the destination library is constant.
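
The per-file checksum comparison can be sketched roughly as follows. This is an illustration, not the script used for the check above; the function names and the choice of SHA-256 are assumptions.

```python
import hashlib
from pathlib import Path


def package_checksums(pkg_dir):
    """Map each file's path (relative to pkg_dir) to its SHA-256 digest."""
    root = Path(pkg_dir)
    sums = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            sums[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return sums


def differing_files(dir_a, dir_b):
    """List relative paths whose contents differ or exist on one side only."""
    a = package_checksums(dir_a)
    b = package_checksums(dir_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

Installing the same package into two different libraries and calling `differing_files` on the two installed directories should list the files that break reproducibility.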

If after this, we still want to strip the debug symbols, we can add a default Makevars file with the appropriate flags.

hchauvin commented on August 18, 2024

Ok got it, I was wrong. Do you have this issue resolved internally?

I just looked at whether I could find any path in the output files (like, grep -R ...), and I could only find them in the debug symbols of the .so files, so I thought "problem solved!". I don't know how this info ends up in the .rdb/.rdx files, but even if I remove the debug symbols from the .so files, a few bits still differ in the ELF header, for whatever reason.
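
For reference, the grep-style search amounts to something like the sketch below (the function name is hypothetical). A raw byte search like this only finds paths stored uncompressed, which is probably why it misses the .rdb/.rdx files: as far as I know, R stores the serialized objects in those files compressed.

```python
from pathlib import Path


def files_containing(pkg_dir, needle):
    """Return relative paths of files whose raw bytes contain needle."""
    root = Path(pkg_dir)
    pattern = needle.encode() if isinstance(needle, str) else needle
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and pattern in p.read_bytes()
    )
```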

So, to have a reproducible build for things that ultimately go into a container layer, built-timestamp, R_MAKEVARS_USER and the package path (e.g., R CMD INSTALL ) must be constant.
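
A hedged sketch of pinning those three inputs when invoking R from a wrapper. The function name and the example values are hypothetical; `--built-timestamp`, `--library` and `R_MAKEVARS_USER` are the flags and variable mentioned in this thread, and this is not how rules_r itself does it.

```python
import os


def reproducible_install_command(pkg_src, library, built_timestamp, makevars_user):
    """Return (argv, env) for an R CMD INSTALL call with pinned inputs."""
    env = dict(os.environ)
    # Point R at a fixed Makevars file so compiler flags do not vary.
    env["R_MAKEVARS_USER"] = makevars_user
    argv = [
        "R", "CMD", "INSTALL",
        f"--built-timestamp={built_timestamp}",  # fixed Built: stamp
        f"--library={library}",                  # constant destination library
        pkg_src,
    ]
    return argv, env
```

The result can be passed to `subprocess.run(argv, env=env, check=True)`. All argument values (for example a fixed library path and a fixed timestamp string) have to be chosen by the caller and kept identical across builds.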

siddharthab commented on August 18, 2024

Resolved as much as possible in 5bb812b.

See full commit message for details and caveats.

jayvdb commented on August 18, 2024

I've noticed that the openSUSE RPMs, and apparently the Fedora RPMs as well, are not reproducible, so the tricks here haven't made their way into R or other build systems. I haven't checked Debian yet. I did notice that https://salsa.debian.org/reproducible-builds/diffoscope/commit/4d31312 is adding analysis of R packages, especially the files that embed timestamps and paths.

Is there any ongoing effort to have R support reproducible builds?

siddharthab commented on August 18, 2024

It is not clear from your message whether you are building with Bazel. This project is an extension to the Bazel build system.

These rules should have reproducible builds, at least from R 3.4 onwards.

If you are building outside of Bazel, use at least R 3.6 and pass the --built-timestamp flag when building. I have not tested it, but it should get you most of the way there. For packages with native code, you will also need to set some C flags.

jayvdb commented on August 18, 2024

Hiya @siddharthab, I am referring to the general problem of reproducible builds for R, where these Bazel rules appear to be blazing the trail.

--built-timestamp helps, but I couldn't find any built-in R install mechanism to avoid the varying paths in the .rdb/.rdx files. Ideally we would find a way to get your solution here merged into R core.

openSUSE/build-compare#34 takes the opposite approach to yours: it ignores the specific items that change in every build, so that rebuilds don't replace the existing 'identical' build artifacts.

siddharthab commented on August 18, 2024

I thought staged installs in R 3.6 solved the problem of hard-coded paths. But I suppose the stage directory itself is not constant. R would simply need to accept a user setting for the stage directory prefix to get complete reproducibility. I suppose it can be brought up on the r-devel mailing list.
