foldingathome / containers Goto Github PK

Docker containers for easily launching the Folding@home client anywhere!

License: Creative Commons Zero v1.0 Universal

Dockerfile 100.00%

containers's Introduction

Official Folding@home Containers

The official Folding@home containers are designed to be simple and run in any container environment - desktop, laptop, Kubernetes, Helm, Docker Compose, OpenShift, Cloud...

Containers track stable versions (2 weeks with no found bugs) of the Folding@home Client. They will run on any Linux distribution, and are based on LTS version of Ubuntu, OpenCL, and CUDA libraries. They contain enough utilities to exec in and be able to debug any problems.

Containers

GPU container (NVIDIA) - fah-gpu
GPU container (AMD) - fah-gpu-amd

Deployments

This repo will also contain Helm templates and other deployment scripts/tools for a variety of environments.

Folding@home Container Design Rules

Containers will mount all read-write state, including config.xml which also has client state, into /fah/.
Containerized clients will output to stdout/sterr for container logs.
Containers are designed to be monitored via logs, and controlled with files.
No example configurations will expose ports, and that functionality will be configured off in examples. Folding@home was built to run on desktops/LANs, and currently should not be exposed to the internet.

Operating Containers

Each of these is explained in the container README, but they are included here for clarity. RFC 2119 meanings.

MUST mount read-writable persistent storage to /fah of the running container. Running containers MUST NOT share the same mounted directory, but directories SHOULD be reused to avoid lost Work Units.
MUST create and preload a tuned config.xml in each persistent storage directory before running the container for the first time.
MUST run the container as a uid:gid, specified with with --user or equivalent, so that the running container has read-write permissions to the persistent storage in /fah.
SHOULD NOT run containers as root.
SHOULD NOT expose ports to internet without firewall rules, encryption, and strong passwords.

Folding@home Websites

Folding@home: https://foldingathome.org/
Folding@home Support Forum: https://foldingforum.org/
Folding@home Containers GitHub: https://github.com/foldingathome/containers/
Folding@home Docker Hub: https://hub.docker.com/u/foldingathome

containers's People

Stargazers

Watchers

Forkers

dmc5179 johnktims pcktdmp taddedawson dpeterpaul rjszynal mharsch sanori thebozzcl rahedges naveengh6 lovelyn2210 damianlevy87 techwithjake e-zmac quatton dyeske emgeiger

containers's Issues

Problem on Jetson Nano

Also reported here: FoldingAtHome/fah-issues#1571

Hi,

This is failing for me on a Jetson Nano:

Steps:

# fah/config.xml
<config>
  <!-- Set with your user, passkey, team-->
  <user value="REDACTED"/>
  <passkey value="REDACTED"/>
  <team value="0"/>

  <power value="full"/>
  <exit-when-done v='true'/>

  <web-enable v='false'/>
  <disable-viz v='true'/>
  <gui-enabled v='false'/>

  <!-- 1 slots for GPUs -->
  <slot id='0' type='GPU'> </slot>

  <!-- 16-1 = 15 = 3*5 for decomposition -->
  <slot id='1' type='SMP'> <cpus v='15'/> </slot>

</config>

$ docker run --gpus all --name fah0 -d --user "$(id -u):$(id -g)"   --volume $HOME/fah:/fah foldingathome/fah-gpu:latest
08a13874aba9145efd6b729c2b543f90e6fd250c0a880eb6721dc577816a950b

works

Then I run:

$ docker start fah0
fah0

~$ docker logs fah0
standard_init_linux.go:211: exec user process caused "exec format error"
standard_init_linux.go:211: exec user process caused "exec format error"

Prerequisites:

Pick your Username - FAQ.
Setup your Passkey.
Join a team or create your own - FAQ.

Docker > 19

$ docker -v
Docker version 19.03.6, build 369ce74a3c

$ nvidia-container-runtime -v
runc version spec: 1.0.1-dev

nvidia-container-runtime is already the newest version (3.1.0-1).

specs:
Distributor ID: Ubuntu
Description: Ubuntu 18.04.4 LTS
Release: 18.04
Codename: bionic

Anomolous License

Most of the projects are licensed under GPL which can get an automatic approval from many companies to allow their employees to work on it because it's a known, long standing, license which is well understood.

This repo is licensed under CC0, which is a lesser used license, and so will increase the work needed to allow some folk to work on the project.

Please consider altering the license to GPL or a similarly well known license (e.g. CC-BY, Apache 2, MIT, BSD, LGPL)

fah-arm container

Storing this idea here, for eventual consideration when there is enough interest.

Please test fah-gpu-amd container

@mharsch
@sanori

I just pushed foldingathome/fah-gpu-amd:22.01.0-rc1 to Docker Hub, can one/both of you verify it's a good image on your systems, and also check the README wasn't mangled by the slight differences in GitHub and Docker markdown?

https://hub.docker.com/r/foldingathome/fah-gpu-amd

If all is well, I will push 21.01.0 and latest tags.

Thanks.

Deployment and performance on Google Cloud Platform

I have deployed fah-gpu container on Google Kubernetes Engine (Google Cloud Platform). Currently I'm using cluster with one node:

Machine type: n1-standard-1 (1 vCPU, 3.75 GB memory)
CPU: Intel(R) Xeon(R) CPU @ 2.20GHz, GenuineIntel Family 6 Model 79 Stepping 0
OS: Container-Optimized OS
GPU: NVIDIA Tesla T4
Preemptible VM

I'm running single GPU folding slot, and it works well - GPU yields ~800-850k points per day, and is able to process very massive work units in reasonable amount of time. However, I wonder if it is possible to increase my GPU performance, so I have 3 issues, which probably are worth to be discussed and noted in Readme file.

When I inspect cluster resources utilization, I get the following, typical values:

sgnsajgon@cloudshell:~$ kubectl top nodes
NAME                                                  CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
gke-folding-at-home-n1-std-1-tesla-t4-0eeec1f6-7rlh   1000m        106%   1758Mi          66%

sgnsajgon@cloudshell:~$ kubectl top pods
NAME                          CPU(cores)   MEMORY(bytes)
fahclient-gpu-statefulset-0   955m         739Mi

We can see that node CPU is fully utilized, almost entirely consumed by fahclient Pod (there are also other kube-system Pods deployed on this node, but they do not consume much of resources).

Is it possible that if I deploy fah-gpu on faster or more modern CPU, then I will get better GPU performance? Which CPU (in terms of it's parameters) is the optimal choice for my GPU, and in general, for other GPUs in context of fah-gpu container? How can we check whether we get maximum performance out of our GPUs?

Will I get any performance boost if I use newer CUDA toolkit than 9.2, for example if I use newer base Docker image, i.e. 10.1-base-ubuntu18.04, or the latest?
I'm thinking about scaling up my cluster. Which option would be better in context of performance - running 2 GPUs (and 2 vCPUs and more RAM) on single node (vertical scaling), or running two separate nodes, each having 1 GPU and 1 vCPU (horizontal scaling).

I think some information and tips on this subject would be greatly appreciated.

Thank you so much, great job, I'm looking forward to new features.

update base cuda image

Given that core 22 is moving to CUDA 11.x shall we do the same?

https://foldingforum.org/viewtopic.php?f=24&t=37391

https://docs.nvidia.com/deploy/cuda-compatibility/

Images should run as non-root user

It is best practice to run containers as a non-root user. This could be added to the design rules and is easily achieved by adding the following to the dockerfile:

# Add Folding user
RUN useradd -mG folding

# Run as non privileged user
USER folding

After doing this users will need to change the ownership of their host fah volume as it will still be owned by root. This will be uid 1000 gid 1000

Deployment on Synology NAS

Some NAS systems by Synology, Inc. offer for the deployment of custom images in its preconfigured integrated Docker installment.

I have successfully deployd fah-gpu/v5.7.1 on a Synology DS 218+ two bay NAS running DSM 6.2.3-25426 Update 2 (which is a 4.4.59+ #25426 SMP PREEMPT x86_64 GNU/Linux synology_apollolake_218+). Obstacles on the way:

integrated version is Docker version 18.09.8, build bfed4f5 (Synology version 18.09.0-0513, current of today) - this is lower than the Technical Requirements in README.md require - solved by dropping the --gpus switch from the docker run command
docker run must be run on the command line (and sudoed), since the Docker GUI does not allow to set a user
in order to use the web monitoring facilities, both the allow and web-allow config option must be correctly configured
- a port 7398 export is added to the docker run switch list: -p 7396:7396
- both syntaxes <web-allow>ADDRESSES</web-allow> and <web-allow v='ADDRESSES'> are unexpectedly equally valid methods to allow external access, the configuration <allow v='172.17.0.1/24'/><web-allow v='172.17.0.1/24'/> permits for web access using the standard configuration of Docker's Bridged network device

Explain performance penalty of running Folding@home in container

The README.md outlines so many benefits, but completely misses the part about performance. I think that for distributed computing community the performance is very important topic to cover.

https://hackernoon.com/another-reason-why-your-docker-containers-may-be-slow-d37207dec27f

Core 0x23 stalled on current (21.11.0) image

Log.txt

00:29:18:WU02:FS00:Download complete
00:29:19:WU02:FS00:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:12261 run:0 clone:236 gen:81 core:0x23 unit:0x000000ec0000005100002fe500000000
00:29:19:WU02:FS00:Starting
00:29:19:WU02:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /fah/cores/cores.foldingathome.org/openmm-core-23/centos-7.9.2009-64bit/release/0x23-8.0.3/Core_23.fah/FahCore_23 -dir 02 -suffix 01 -version 706 -lifeline 1 -checkpoint 15 -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor nvidia -gpu 0 -gpu-usage 100
00:29:19:WU02:FS00:Started FahCore on PID 50
00:29:19:WU02:FS00:Core PID:54
00:29:19:WU02:FS00:FahCore 0x23 started
00:29:19:WARNING:WU02:FS00:FahCore returned: WU_STALLED (127 = 0x7f)

Inspection

Core 0x23 seems to require OpenCL 3.0. But, OpenCL 3.0 does not work properly on CUDA 11.2.2.

$ docker exec -it fah0 clinfo
Number of platforms                               1
  Platform Name                                   NVIDIA CUDA
  Platform Vendor                                 NVIDIA Corporation
  Platform Version                                OpenCL 3.0 CUDA 12.2.148
  Platform Profile                                FULL_PROFILE
(snip)
ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.2.11
  ICD loader Profile                              OpenCL 2.1
	NOTE:	your OpenCL library only supports OpenCL 2.1,
		but some installed platforms support OpenCL 3.0.
		Programs using 3.0 features may crash
		or behave unexepectedly

Inference

According to the NVIDIA Technical Blog, NVIDIA supports OpenCL 3.0 since Linux driver version 465.19.1. The matching CUDA version would be 11.3.1 according to the CUDA release notes

Therefore, I guess that the CUDA version of base image should be updated at least 11.3.1.

F@H container benchmarks

#7 was closed without the resolution about performance, and as I constantly get back to this, I decided to raise the dedicated issue. #8 also contains question about performance on Google Cloud Platform.

The goal is to measure and compare performance of F@H with and without containers on similar hardware. To see if containerization of such payload is efficient, and provides ways to troubleshoot and improve that.

Question about container use case to drive more cpu cycle donations

Hi I am not sure if this is the right place to open a ticket but if not I am hoping you could help point me in the right direction.

I work for VMware and have worked for several other infrastructure companies where we burn many thousands of hours of cpu cycles every year running some random sample application, where the sample app itself has no point, its just needed to have some app to demonstrate our infrastructure software underneath.

I would like to see if its possible to turn at least some demo use cases into demos that could use folding@home as our sample workload. I am just starting to investigate this idea and dont yet know basic things like what is the minimal hardware requirement a container running folding@home would need or how long does a container need to run to create at least a minimal or greater benefit.

I realize this use case may not work well for all of our demos, but we have a lot of different types of demos and I am very hopeful we could find demo types that could make a meaningful contribution of cpu cycles. For example while most of our demo apps are short running, I am proposing some ideas for events where we may partner with an organization like folding@home to run larger scale demos that may run for a day or multiple days. Even during shorter demos, we often have our engineers access a temporary environment for their demo that continues to run for some period of time after they finish giving the demo, sometimes for multiple days or weeks.

I am actively trying to pursue exploring this idea within VMware and am not sure how quickly I will make progress, but if you have any advice or guidance I would be very grateful.

Thank you!

Update Docker docs

Note: Also, the docs should update the command:

from: fah-gpu:VERSION

to: foldingathome/fah-gpu:VERSION

OLD:
# Run container with GPUs, name it "fah0", map user and /fah volume
docker run --gpus all --name fah0 -d --user "$(id -u):$(id -g)" \
  --volume $HOME/fah:/fah fah-gpu:VERSION

NEW:
# Run container with GPUs, name it "fah0", map user and /fah volume
docker run --gpus all --name fah0 -d --user "$(id -u):$(id -g)" \
  --volume $HOME/fah:/fah foldingathome/fah-gpu:VERSION

Image out of date

Please could you build a new version of the image as a new fah-client version has been released.
Also, it may be worth setting up some sort of auto-triggered pipeline to build a new image whenever a new version of the application is available.

Are the dev libraries and compilers needed for fah-gpu-amd?

The size of the current fah-gpu-amd image is about 3.9GB.
It is too large compared to 92.1MB of the size of fah-gpu (Nvidia CUDA).

Currently, fah-gpu-amd installs the rocm-dev package, which depends on the ROCm runtime and development libraries and LLVM compilers that rely on the g++ development environment.

I guess that this image's goal is to provide OpenMM, precisely speaking OpenCL, runtime to run the FAH core 22 efficiently.
If my guess is correct, only the runtimes related to OpenCL acceleration would be enough.

fah-gpu behind a proxy

Hi,

I do need to set a proxy server - while this is no problem in docker, and docker normally adds these settings to the container, the fah client seems to ignore them.

So the container is stuck at
~/fah# docker logs fah0
13:18:19:Downloading GPUs.txt from assign1.foldingathome.org:80
13:18:19:Connecting to assign1.foldingathome.org:80

How can I set the proxy inside the container?

README needs info for running with ROCm on AMD GPUs

These containers require a --gpu flag for Docker that is not available in a fresh install on Ubuntu 20.04 amd64 with the amdgpu driver. Also the readme of the container explicitly mentions to use an nvidia-container-runtime. This appears a too-strict dependency, which brings vendor lock-in and a decrease of diversity in computational ecosystems.

Does it appear valuable to also support the AMD ROCm platform, or are there any plans yet to do so?

Few resources become interesting upon that sight:

Zombie processes are not cleaned up by the container

I'm using the fah-gpu-amd container to run the folding@home client on my desktop, as the OS it runs is not supported by ROCm userspace and having a consistent environment is much simpler.

I've noticed if the folding@home client kills a subprocess for some reason (pausing, or if some bug is detected), the process ends up as a zombie and the folding@home client never cleans it up. As folding@home is PID 1, there is no other reaper and the process continue to exist and the client gets wedged, unable to respawn the work unit. Note: this happens to both GPU and CPU WU on the same machine.

Could the folding@home client be updated to reap these processes? Otherwise, could the containers be updated with a different PID 1 to reap these dead children? If the new PID 1 is wanted, I could take a look at creating an appropriate PR.