
Ten Simple Rules for Writing Dockerfiles for Reproducible Data Science

Home Page: https://doi.org/10.1371/journal.pcbi.1008316

License: Creative Commons Attribution 4.0 International

Languages: TeX 99.18%, Dockerfile 0.63%, Makefile 0.19%
Topics: dockerfiles, ten-simple-rules, reproducible-research, reproducible-science, reproducible-paper, open-science, containerization, containerisation

ten-simple-rules-dockerfiles's Introduction

Ten Simple Rules for Writing Dockerfiles for Reproducible Data Science

Ten Simple Rules for Writing Dockerfiles for Reproducible Research - Summary


The manuscript is published as a preprint: https://osf.io/fsd7t

We welcome your feedback, e.g., by opening issues on this repository or with OSF annotations. We especially welcome your help by creating strong illustrating examples, see issue #4.

Ten Simple Rules Collection on PLOS

Current draft as PDF

Author contributions

DN conceived the idea and contributed to conceptualisation, methodology, writing – original draft, review & editing, and validation. VS contributed to conceptualisation, methodology, and writing – original draft, review & editing. BM contributed to writing – review & editing. SJE contributed to conceptualisation, writing – review & editing, and validation. THe contributed to conceptualisation. THi contributed to writing – review & editing. BDE contributed to conceptualisation, writing – review & editing, visualisation, and validation. This article was written collaboratively on GitHub, where contributions in the form of text or discussion comments are documented: https://github.com/nuest/ten-simple-rules-dockerfiles/.

Run container for editing the document

First, build the container. It will install the dependencies that you need for compiling the LaTeX document.

docker build -t ten-simple-rules-dockerfiles .

Then run it! You'll need to set a password to log in with the user "rstudio".

PASSWORD=simple
docker run --rm -it -p 8787:8787 -e PASSWORD=$PASSWORD -v $(pwd):/home/rstudio/ten-simple-rules-dockerfiles ten-simple-rules-dockerfiles

Open http://localhost:8787 to get to RStudio, log in, and navigate to the directory ~/ten-simple-rules-dockerfiles to open the Rmd file and start editing. Use the "Knit" button to render the PDF. The first rendering takes a bit longer, because required LaTeX packages must be installed.

See more options in the Rocker docs.

Run container for building the PDF

See the end of the Dockerfile for instructions.

Useful snippets

  • Get all authors' GitHub handles:
    cat *.Rmd | grep ' # https://github.com/' | sed 's|    # https://github.com/|@|'
  • Get all authors' emails:
    cat *.Rmd | grep 'email:' | sed 's|    email: ||'
  • [Work in progress!] Get a .docx file out of the Rmd so one can compare versions and generate marked-up copies of changes:
    # https://github.com/davidgohel/officedown
    library("officedown")
    rmarkdown::render("ten-simple-rules-dockerfiles.Rmd", output_format = officedown::rdocx_document(), output_file = "tsrd.docx")
    
    # https://noamross.github.io/redoc/articles/mixed-workflows-with-redoc.html
    library("redoc")
    rmarkdown::render("ten-simple-rules-dockerfiles.Rmd", output_format = redoc::redoc(), output_file = "tsrd.docx")
  • Compare with latexdiff
    # get a specific version of the text file
    wget -O submission.v2.tex https://raw.githubusercontent.com/nuest/ten-simple-rules-dockerfiles/submission.v2/ten-simple-rules-dockerfiles.tex
    # compare it with current version
    latexdiff --graphics-markup=2 submission.v2.tex ten-simple-rules-dockerfiles.tex > diff.tex
    # render diff.tex with RStudio

License

This manuscript is published under a Creative Commons Attribution 4.0 International (CC BY 4.0) license, see file LICENSE.md.


ten-simple-rules-dockerfiles's Issues

Paper Build not Reproducible

Heads up @nuest, your container instructions for knitting the RMarkdown aren't reproducible; I get:

tlmgr: package repository http://mirror.utexas.edu/ctan/systems/texlive/tlnet (not verified: gpg unavailable)
[1/1, ??:??/??:??] install: xstring [11k]
running mktexlsr ...
chmod(420,/opt/TinyTeX/tlpkg/texlive.tlpdb) failed: Operation not permitted at /opt/TinyTeX/tlpkg/TeXLive/TLUtils.pm line 1161.
done running mktexlsr.
tlmgr: package log updated: /opt/TinyTeX/texmf-var/web2c/tlmgr.log
tlmgr path add
! Missing number, treated as zero.
<to be read again> 
                   \protect 
l.976 ...nter}\rule{0.5\linewidth}{\linethickness}
                                                  \end{center} 

Error: Failed to compile ten-simple-rules-dockerfiles.tex. See https://yihui.name/tinytex/r/#debugging for debugging tips. See ten-simple-rules-dockerfiles.log for more info.
Execution halted

I think it's a bit risky to require installation of dependencies into the container at runtime - this is error-prone (it is already failing) and unlikely to keep working in the future. I'll see if I can put together a proper Dockerfile for this repo, likely tomorrow because I'm doing server work today.

Scope

Current statements on scope and audience:

  • The official images and images by established organisations should be your preferred source of snippets if you craft your own Dockerfile. Automated builds can be complex to set up, and details are out of scope of this article.

  • Research Software Engineers (RSEs) are not the target audience for this work, but ...

  • The goal of this article is to guide you to write a Dockerfile so that it best facilitates interactive development and computer-based research, as well as the higher goals of reproducibility and preservation of knowledge.


Original notes

  • not a single word about docker run ? > unrealistic and incomplete
  • if authors already have a package structure, e.g. R or PyPI, are they out of scope?
  • examples from which languages are helpful/relevant?
    • R - yes
    • Python - yes
    • (Daniel) IMO the above cover 80% of the addressed researchers
    • Julia?
    • Octave?
    • Java?
    • Perl?
    • Scala?
    • LibreOffice calc in headless mode - in scope??

Reuse and reproducibility in other build environments

To build Docker containers, we write text files that follow a particular format called `Dockerfile`s [@docker_inc_dockerfile_2019].

Docker / Dockerfiles provide a formalised, text-based recipe for building a Docker image. One of the features of other build/provisioning systems like Puppet or Ansible (Vagrant to a lesser extent) is the community-contributed packages/modules for performing particular tasks, cf. package maintainers in R or Python. I'm not sure to what extent those communities have guidelines for producing community packages, or how they are policed?

With Docker, there is Docker Hub, where images are shared for others to run, sometimes with a Dockerfile, sometimes not. The ability to share parts of Dockerfiles that install a particular package, or recipes for performing particular tasks, is perhaps not quite so well supported?

"you should consider switching tools"

A workflow that does not support headless execution is arguably not ideal for a container, and you should consider switching tools.

Switch tools to what? And why?

Containers can act like lightweight VMs and can be used to share desktops via a browser using things like noVNC; XPRA, RDP, etc. also provide other ways of accessing the GUI element, which may be the primary, indeed the only, way of using whatever's inside the container?

"A tag like `latest` "

A tag like `latest` is good in that security fixes and updates are likely present, however it is bad in that it's a moving target so that it is more likely that an updated base could break your workflow.

It's worth maybe differentiating between "semantic" tags and automatic / machine-generated hashes. When building images locally, if you keep (re)building an image with the same name/tag, then I think each image is retained and associated with its own unique hash, whilst the most recently built image is the one that gets the tag.
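To make the distinction concrete, here is a hedged sketch (image name, tag, and digest are placeholders, not taken from the paper): the tag is a mutable pointer that can be re-assigned, whereas the digest identifies exactly one image build.

    # list local images together with their content-addressable digests
    docker images --digests

    # in a Dockerfile, pin a base image to a versioned tag ...
    FROM rocker/verse:3.6.2
    # ... or, more strictly, to an immutable digest (value below is a placeholder)
    FROM rocker/verse@sha256:<digest>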

"`x11` support, `x11docker` is recommended"

Using a browser to expose a user interface (e.g., RStudio, Jupyter, web tools) on a particular port is a common practice, and for systems without native `x11` support, `x11docker` is recommended [REF].

Recommended by whom? I quite like RDP because there are cross-platform clients, things like audio even work sometimes, and there are examples out there of how to drop this sort of thing into a container.

I also note there are bridge containers that can bridge from X11 to an XPRA/HTML UI.

On the topic of desktop UIs, I note things like https://github.com/yuvipanda/jupyter-desktop-server, which give you a way of dropping in a desktop UI, accessed via the browser and proxied by a Jupyter server, if you need it.

"configure an automated build"

Alternatively copy relevant instructions into your own Dockerfile, acknowledge the source in a comment, and configure an automated build for your own image.

This reads a bit along the lines of "just configure an automated build", which involves a whole set of other skills that folk are not necessarily aware of.

As repo2docker has already been mentioned, it may be worth finding, or writing, something elsewhere that could be referenced showing how repo2docker can be used as part of an automated build-and-push system on GitHub (example).
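As an illustrative sketch (image name and tag are placeholders; this is not a recipe from the paper), repo2docker can be driven from the command line, which is also the core of what a CI job would run:

    # install repo2docker (requires Python and a running Docker daemon)
    python3 -m pip install jupyter-repo2docker
    # build an image from the current repository without starting a container,
    # then push it to a registry under a placeholder name
    jupyter-repo2docker --no-run --image-name myorg/myproject:2020-02 .
    docker push myorg/myproject:2020-02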

Discussion - Rule 9. Publish a Dockerfile per project in a code repository with version control

Typo: several proecceses

Use one Dockerfile per workflow or project and put one "thing" in; TO DISCUSS: argue against the above rule and recommend having a process manager and multiple processes in one container

I think the original best practice was to have one thing per container and then use docker-compose to build, e.g., workbenches from several inter-networked containers.

Recent examples show how to use things like supervisord to run and manage multiple services within the same container.
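A minimal sketch of the supervisord approach (base image, file names, and services are assumptions for illustration, not recommendations from the article):

    # Dockerfile running several services under supervisord
    FROM ubuntu:18.04
    RUN apt-get update && apt-get install -y --no-install-recommends supervisor \
        && rm -rf /var/lib/apt/lists/*
    # supervisord.conf declares each service in its own [program:...] section,
    # e.g. a notebook server and a database
    COPY supervisord.conf /etc/supervisor/conf.d/supervisord.conf
    CMD ["/usr/bin/supervisord", "-n"]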

To try to get my head round this, I did some naive exploration here. One issue that arose was: if you are enabling multiple services, how should you build the image? All the installs in a single Dockerfile? Or a set of staged Dockerfiles that build on top of each other (note: this is not a multi-stage build, where you essentially extract from prior images to build layers you then incorporate in your final image)? I'm not sure what best practice would be.

Also, this raises the issue of Dockerfile stacks, e.g. as per the Jupyter docker stacks or the legacy DIT4C containers, as a good-practice strategy?

(In a research group, you may want a base image that every other project builds on; in edu, I've been trying to explore how a base institution container might contain a minimum viable, branded notebook server, for example, that could be used as a base for course-customised environments/images.)

This rule mentions connecting to databases; one issue there is that you may also want a recipe that builds a seeded database, not just a database. Building images that provide access to computational environment + data environment is often what is required for reproducibility.

On this point, where do things like notebooks sit? They are outside the computational environment, but inside the analysis environment, along with particular datasets (are datasets inside the analysis environment, or a sibling of the computational and analysis environments?). To make something reproducible, you need a computational environment and a data environment that the analysis scripts can run against?

"maintained by the Docker library"

It's good practice to use base images that are maintained by the Docker library.

Docker Hub is a free-for-all. There are official images, identified by a _ in place of the owner name, built from Dockerfiles written according to best practices, which also benefit from things like vulnerability scanning.

(Official Dockerfiles, like conda recipes, can be great things to crib from when trying to build something similar yourself.)

There are also repositories that are maintained by organisations for their own "official" containers. Note that when it comes to using trusted sources, there are also opportunities for associated maliciousness (example: Two malicious Python libraries caught stealing SSH and GPG keys).

(By the by, I notice that GitHub is now warning me of, and automatically updating, vulnerabilities in my repos (maybe I opted in to something?!), an example of the repository environment modifying my repos in a way that may be helpful in terms of security, at least, but that may actually break something in the thing I'm trying to build.)

"small data files and software that is essential for the function of a container"

For example, small data files and software that is essential for the function of a container to distribute a reproducible analysis is essential to add.

Earlier it was mentioned how it can be useful to add instructions into the Dockerfile that document examples of how to actually build the container.

It can also make sense to include simple demos of how to use the environment that has been built. Such demos also often play the role of "human operated tests": when you've built your image, does your demo file work correctly? (Another way of working out how to use whatever is in a container is to look for any test files and poke through them.)

Continuous build has been mentioned elsewhere. For this to be most valuable, the build system should also be able to run a test to check the build has built something that works as expected, not just that the build has run to completion without error.
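A hedged sketch of such a smoke test (image name and demo path are placeholders): build the image, then run the bundled demo non-interactively and let a non-zero exit code fail the CI job.

    docker build -t myimage .
    # run the demo that was COPY'd into the image; if it errors, the build "fails"
    docker run --rm myimage Rscript /home/rstudio/demo/demo.R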

Add note about audience for paper to introduction

After the rounds of edits are finished and the paper is in some ready state, I would like to give one more read through, and also take a shot at adding a section to the beginning that clearly defines the audience for the paper - likely focused on single application containers for data scientists.

Discussion: Rule 2. Use versioned and automatically built base images

One of the things I try to remember to do is to fork the original Dockerfiles, or add a reference link in my Dockerfile to the location of the Dockerfile for the image I am pulling from.

This means I stand a chance of recreating something akin to any base image I pull from, using my own copy of its Dockerfile. (Of course, if the Dockerfile uses "latest" package versions, then a build today may result in a container that is different from a build tomorrow.)

When pulling from a tagged/versioned image in a Docker repository, that image may have been built from a Dockerfile that:

  • no longer exists;
  • has since been updated;
  • did not pin package versions of installed packages.

So whilst my image may be reproducibly built from my Dockerfile as long as the image I'm building from exists, it may no longer be reproducible if that image disappears or changes (e.g. changes because I didn't pin the exact image version).

Depending on levels of trust, or when working in a closed / private environment, you may want to build your own, versioned image from someone else's Dockerfile (or one you have forked) and then pull from that.

I don't know if there is any emerging practice around archiving software projects that are based around Dockerised environments? I would imagine a complete archiving service might run its own Docker registry and push locally built or cloned images to it, if that service were making runnable archived environments available?

"try to design the `RUN` statements"

Generally, try to design the `RUN` statements so that each performs one scoped action (e.g., download, compile, and install one tool).

This also speaks to optimisation: each statement generates a layer in the image, so to build efficient images you may want to have some quite convoluted RUN statements. Good layout, and convention in writing individual RUN lines, can help here.
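For example, a single scoped RUN instruction might download, build, install, and clean up one tool in one layer (tool name, URL, and version are placeholders; wget is assumed to be present in the base image):

    RUN wget https://example.org/tool-1.2.3.tar.gz \
        && tar -xzf tool-1.2.3.tar.gz \
        && cd tool-1.2.3 && ./configure && make && make install \
        && cd .. && rm -rf tool-1.2.3 tool-1.2.3.tar.gz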

Invite feedback

  • open a feedback issue, including the planned submission date

Ping people

Platforms docker containers are tested on

Importantly, the researcher should clearly state the platforms that the workflow (by way of one or more Docker containers) has been tested on.

Tooling is starting to appear that makes it easier to build things for different platforms. For example, Using multi-arch Docker images to support apps on any architecture describes buildx, an experimental Docker CLI feature that makes it easier to "build, push, pull, and run images seamlessly on different compute architectures".
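A hedged sketch of buildx usage (image name is a placeholder; the flags reflect the experimental tooling at the time of writing and may change):

    # create and select a buildx builder, then build and push a multi-arch image
    docker buildx create --use
    docker buildx build --platform linux/amd64,linux/arm64 \
        -t myorg/myimage:multiarch --push .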

Unaddressed issues

Some issues that maybe aren't addressed, or not addressed in detail?

  • optimising builds: good practice around building efficient layers (layer size; efficiency when repeatedly building from the top); some cribs I need to chase on this: @simonw on building smaller Python Docker images, the Docker blog's Intro Guide to Dockerfile Best Practices; there was also some discussion on this repo2docker issue on the need for speed;
  • linters / support tools; eg fromlatest.io; a more risky approach, reproducibility-wise, is to give your image to another tool and let it optimise it for you without changing the Dockerfile (eg docker-slim);
  • use of multi-stage builds (eg I had to use a multistage build here to try to get a legacy app with horrible build dependencies into a notebook container that could server-proxy it);
  • tools for analysing layers (eg dive).

Discussion - Rule 4. Pin versions

This relates to both any pulled from image, as well as packages installed by the current Dockerfile.

One issue I keep running into is how best to maintain a Dockerfile that runs with latest package versions compared to one that has explicitly pinned package versions.

The result is that my Dockerfile is generally unpinned, using latest, but that I then push particular tagged versions of specific image builds and call on those images when launching containers. The image is thus fixed, but there is probably no way I can easily and directly rebuild it from the Dockerfile. The Dockerfile build process is therefore used to update images that I may then call on explicitly.
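For reference, pinning looks slightly different in each ecosystem; the snippet below is a sketch with placeholder package names and versions, and it assumes the relevant tools (remotes, pip) are present in the base image:

    # pin the base image to a specific tag rather than `latest`
    FROM rocker/verse:3.6.2
    # pin an R package to a known version
    RUN R -e 'remotes::install_version("ggplot2", version = "3.2.1")'
    # pin a Python package to a known version
    RUN pip3 install numpy==1.18.1
    # system packages can be pinned too, e.g. apt-get install -y somepackage=1.2.3-1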

Discussion - Rule 5. Mount data and control code

Typo: to reause code

Is there also a converse side to this: when should you use COPY / ADD? (Also, is it worth commenting on the difference between them?)

Does mounting also raise issues about how to mount local directories into a container, or persistent data volumes against a container? Discussion about whether linked data volumes introduce a "hidden state" problem in a running docker deployment environment?
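As a hedged illustration of the rule (paths, image, and script names are placeholders): code is baked into the image at build time with COPY, while the potentially large or private data directory is bind-mounted from the host at run time.

    docker run --rm -it \
        -v "$(pwd)/data":/home/rstudio/data \
        myimage Rscript /home/rstudio/scripts/analysis.R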

if you have a "stable" published software library, install it from source from the source code repo or from the software repository (so that users find the project in the future)

How does the above relate to Rule 4 / installing known version packages. Should Rule 4 reference installing packages local to the Dockerfile, or from repositories that are not package manager repositories?

Reusing / extending a Dockerfile

You would first want to determine if there is an already existing container that you can use, and in this case, use it and add to your workflow documentation instructions for doing so.

This is more about direct reuse than using something to help you generate a new Dockerfile? To me, generation is more akin to something like https://github.com/RobInLabUJI/ROSLab, a tool for generating Dockerfiles from a form driven app.

"a tool like `docker-compose`"

In the case of more complex web applications that require application and web servers, databases, and workers or messaging, the entire infrastructure can easily be brought up or down with a tool like `docker-compose` [REF].

docker-compose also speaks to architectural design. When Docker first appeared, best practice seemed to be to put one service per container, with docker-compose wiring containers together to share services amongst them. More recently, it seems there are now robust patterns for developing containers that run multiple services.

When it comes to best practice, it may be worth reflecting on this.
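For orientation, a minimal docker-compose.yml sketch (service names, images, and ports are placeholders, not from the article) that wires an analysis container to a database; `docker-compose up` brings the stack up and `docker-compose down` tears it down:

    version: "3"
    services:
      analysis:
        build: .
        ports:
          - "8787:8787"
        depends_on:
          - db
      db:
        image: postgres:12.1
        environment:
          POSTGRES_PASSWORD: example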

Discussion - Rule 8. End the `Dockerfile` with build and run commands

This is an eminently sensible and practical rule, though it maybe also hints back to Rule 6?

The rationale is to make it easy for a novice to get started; as such, it makes sense to show how to:

  • deploy / run / access services (Rule 7);
  • mount data volumes (Rule 5);
  • pass in environment variables (e.g. Rule 6?).

Port mapping is something else to mention?

One note: don't do things like map: -p 8888:8888; for a novice, if I want to expose on localhost port 8899, which one do I change?

I think novices can also be confused by things in {}, things prefixed with $, etc.; you also need to take care not to introduce platform shell dependencies in the way things are passed (Windows vs. Mac/Linux variables / commands, etc.).

Generally, I try to remember to give an example of pretty much every likely command switch over several (explained/contextualised) docker run commands.

If you are making your container available for other people to access from, e.g., Docker Hub, also provide a docker push command, along with relevant comments about the tagging strategies used.
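Pulling these points together, a hedged sketch of what the trailing comment block of a Dockerfile might look like (image name, tag, and ports are placeholders); note the deliberately different host and container ports so a novice can see which side to change:

    # Build, run, and publish:
    #
    #   docker build -t myorg/myanalysis:1.0 .
    #   # host port 8899 (left) is mapped to container port 8888 (right)
    #   docker run --rm -it -p 8899:8888 -v "$(pwd)":/home/jovyan/work myorg/myanalysis:1.0
    #   docker push myorg/myanalysis:1.0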

Discussion - Rule 6. Capture environment metadata

This would perhaps benefit from a best suggested practice example?

How does this relate to Rule 8 (End the Dockerfile with build and run commands)?

Also:

  • conventions for naming and tagging images built from a Dockerfile that can be called on from other Dockerfiles?
  • in practice, there may also be issues regarding naming containers run from images, particularly when it comes to managing linked data volumes attached to containers that are perhaps started, stopped and restarted; or started, stopped, deleted, and started de novo.

The "environment" word also makes me think of ENV and ARG variables. eg my example use case for this is On Not Faffing Around With Jupyter Notebook Docker Container Auth Tokens.

"simply pushing the `Dockerfile` to the service"

If the container is linked to an automated build provided by a version control service, simply pushing the `Dockerfile` to the service can easily trigger the build.

It's also possible to set up CI tools so that things are built according to a schedule (cron job) or in response to particular actions (e.g. making a GitHub release), not just on every commit, which may get ridiculous (e.g. your half-hour build is triggered each time you fix a typo in the README). CI builds can also be set to ignore rebuilds when commit messages contain particular tags, for example.

Writing a good CI script, one that caches things sensibly, is a skill in and of itself, and one that could perhaps also benefit from a guide such as this one...

Coauthors | Coordination

@betatim @sje30 @benmarwick

I would really like to move this idea along during the next couple of months, maybe have a first version ready by the end of the year. Does that work for you (if you're in)?

IMO: we're open to more views and coauthors, so feel free to suggest potential collaborators. Reaching a broader "consensus" will be harder, but only make the rules stronger 💪 !

Discussion - Rule 7. Enable interactive usage and one-click execution

So this is about usability of the thing being containerised, right?

I think containers can do several things:

  • run headless services and expose an HTML UI over http;
  • run desktop GUI apps exposed via either an HTML UI over http (e.g. noVNC) or via something like VNC or RDP;
  • provide a preconfigured shell environment to work in, eg via ssh;
  • run as a command line application to process local files and write local files. I have an old example here; a more recent example is the way Jupyter Book can use a container to build the book pages.

Other Docker images may be built as base containers that are provided simply as a base for other images, and as such offer no UI / expose no "useful" end-user application/service.

This rule also hints at things like repo2docker and services that can be used to 1-click publish and run containerised services if they are suitably exposed.

The rule may also hint at supporting the ability to run tests against the container? (I don't think tests are mentioned anywhere else?)

Formatting Dockerfiles

It's good practice to think of the `Dockerfile` as being human _and_ machine readable.

One way of conforming to style when writing a Dockerfile is to use a linter; there are several around (eg here, here, and "officially" in VSCode here).

But who sets the rules? Is this one way a community can start to recommend and police convention, by requiring Dockerfiles are linted before they can be shared?
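One concrete option, as a sketch: hadolint can be run from its own container without installing anything locally.

    docker run --rm -i hadolint/hadolint < Dockerfile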

Material

Discussion - Rule 10. Use the container daily, rebuild the image weekly

So... this touches on comments to Rule 2 and Rule 4 - what is the stable thing that can be reused with confidence? A particular image instance, or the thing produced by building against a particular Dockerfile?

For testing, I often test against MyBinder, although that can be subject to dependencies introduced by the repo2docker process that are outside my control.

I also use Docker Hub to auto-build on commits to particular repos. I have one repo that builds and tags against each branch, so I can call on a tagged image that represents the latest build for each branch. I think other build rules let you build against particular paths in a specific branch.

you cannot expect to take a year-old Docker image from the shelf and expect that it can be extended; it will likely "run", but just as-is

But - if you want to access a computational environment that you were using a year ago, that old image is exactly the image you want to be using? I have a few images running applications that are legacy and that I would probably struggle to ever build again; when I lose the image, I've lost the application / environment.

As an archiving strategy, I should probably be running those images every so often and saving a new image from them so that the external packaging (whatever data structures docker uses to create/define the image wrapper around my environment) will presumably get updated in the image.

To maintain the image, I could probably also try updating some of the packages inside the image, or using it as a base layer in a new image that does run particular updates that don't seem to break the original application.

"You should list commands"

You should list commands _in order_ of least likely to change to most likely to change and use the `--no-cache` flag to force a re-build of all layers.

For the novice, this may need contextualising. The container image is built by executing statements in the Dockerfile in order. When one step is completed, the result is cached, and the build moves to the next step. If you change something in the Dockerfile and rebuild, each statement is inspected in turn. If it hasn't changed, the cached layer is used and the build progresses. If the line has changed, that build step is executed afresh, and every following step then has to be rebuilt in case the changed line affects a following step.
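A short sketch may help here (base image and paths are placeholders): stable layers go first so they stay cached, the frequently edited project code goes last, and --no-cache forces everything to be rebuilt.

    FROM rocker/verse:3.6.2
    # rarely changing system dependencies near the top ...
    RUN apt-get update && apt-get install -y --no-install-recommends libxml2-dev \
        && rm -rf /var/lib/apt/lists/*
    # ... frequently changing project code near the bottom
    COPY analysis/ /home/rstudio/analysis/

    # force a full rebuild, ignoring all cached layers:
    #   docker build --no-cache -t myimage .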

Implement a linter for the rules

  • check out existing Dockerfile linters - do they solve (part of) the challenges?
  • can they be extended to give warnings for specific behaviour?
  • can they be configured so that rules unfavourable for reproducibility are not used?
  • what kind of tests do they do?
  • what are common rules?

https://github.com/search?q=dockerfile+linter

Preprint venue

Let's use this issue to vote on the preprint venue, one platform per comment.
The goal is to publish the preprint by the end of February 2020.

The platform should support posting the PDF as it is generated now.

Use as many +1 and -1 votes as you like (maybe adding a comment explaining why in the case of the latter) so we can also do a negative sorting (choose the one where the fewest people have valid concerns).

Unfortunately, https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005473 does not have any rules about selecting a preprint server.

I did not include Zenodo (because it is mostly for posters/software/data), and might have overlooked others - feel free to add your favourite to the list if I missed it.

Discussion - Rule 3. Use formatting and favour clarity

Would it make sense to give explicit examples of good practice and bad practice, perhaps in a contextualised way, e.g. show a colour-highlighted git diff going from bad practice to good practice with a comment or a git commit line explaining the change in terms of the rule applied? Or maybe link to a supporting git repo where a scrappy Dockerfile has been revised into a best-practice example?

When mentioning:

put each dependency on its own line; it makes it easier to spot changes in version control
split up an instruction (especially relevant for RUN) when you have to scroll to see all of it

a naive reader may misinterpret this instruction and put lots of things on separate lines, each with its own RUN command, which would break layering?

So the instruction:

don't worry about image size, clarity is more important (i.e. no complex RUN instructions that remove files right away after using them)

is problematic when it comes to writing Dockerfiles that build "efficient" images?
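The two goals are not necessarily in conflict; a hedged sketch (package names are placeholders) keeps a single RUN instruction, and hence a single layer, while still putting each dependency on its own line and cleaning up afterwards:

    RUN apt-get update && apt-get install -y --no-install-recommends \
            libcurl4-openssl-dev \
            libssl-dev \
            libxml2-dev \
        && rm -rf /var/lib/apt/lists/*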

have commands in order of least likely to change to most likely to change, it helps readers and takes advantage of build caching

Would it make sense to have a section at the start of the paper that describes the anatomy of a Dockerfile, and perhaps also situates it in a workflow (Dockerfile -> image -> container)?

Only switch directoryies with WORKDIR {-}

[typo - directoryies]
So in terms of best practice, is there something here about identifying not just which directory you are in and how to change it, but also how to select appropriate USERs for running certain commands?
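A small sketch of what that could look like (the user name rstudio is specific to the rocker images and is an assumption here): switch directories with WORKDIR rather than RUN cd, and be explicit about which user performs which step.

    USER root
    WORKDIR /opt/tool
    # ... build/install steps that need root privileges ...
    # drop privileges for interactive work
    USER rstudio
    WORKDIR /home/rstudio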

"the harder it will become"

It can almost be guaranteed that the longer that you wait to recompile the image, the harder it will become.

Is the point here that it will actually become harder to identify and fix whatever it is that is breaking a build? That if you rebuild regularly, you maximise the likelihood that you only need to identify and fix a small number of easily detected errors, rather than waiting for months and being presented with all manner of things that are inter-dependent or that may break other things in other ways as you try to fix them?

"in the case that you will likely save as a root user"

When you do this, in the case that you will likely save as a root user from inside the container, you should be careful about file permissions.

This reads clumsily and needs revising, but perhaps more to the point it raises the issue of what role is being used at each step of a build, which speaks to permissioning. This can be important for security, but roles / permissions may also be used to enforce particular behaviour in the environment itself, to stop one part of the environment affecting other parts in unintended ways.

When it comes to best practice in Dockerfiles, there may also be arguments for layering the build file in terms of which user / role is supposed to be performing what step at each part of the build.
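One practical mitigation, shown here as a hedged sketch (paths and image name are placeholders; the image must tolerate running as an arbitrary UID): run the container as the calling host user, so that files written to a bind-mounted output directory are not owned by root on the host.

    docker run --rm -it \
        --user "$(id -u):$(id -g)" \
        -v "$(pwd)/output":/home/rstudio/output \
        myimage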

Examples

It would be great to supplement the article with real-world examples, e.g. by putting them into this repository and also inviting the scientific community to contribute more via pull requests.

@sje30 suggested "before and after" comparisons of Dockerfiles (#16 (comment)), which I really like!

The examples directory is open for these.


Note: more examples are always welcome, just leave a comment below.


Other ways to find examples:

Multiple images vs. one image with entrypoints/cmds

In the current text I made the suggestion to use the same container for both development and running the workflow.

https://figshare.com/articles/Using_Docker_to_Support_Reproducible_Research/1101910 suggests something different, namely base + dev + release images: the release image only runs the workflow, the dev image has a UI (like Jupyter), and both share a common environment (base).

[image: base + dev + release image set-up]

I still think a multiple-image set-up is quite error-prone, but it may work better for some users?
