I'm trying to create a modified Nvidia build that uses an old version of the Nvidia driver to support my legacy GPUs, and I need some help. I know I'll also need a matching nvidia-container-toolkit extension; I plan to work on that once I have a working installer image. I'm still figuring out what I'm doing, so some guidance would be appreciated. All the container images used in this issue are public and available in my fork of the pkgs repo.
Environment
macOS Ventura 13.4.1 with Docker Desktop v4.22.0 (Engine 24.0.5).
docker --version
Docker version 24.0.5, build ced0996600
docker buildx version
github.com/docker/buildx v0.11.2-desktop.1 986ab6afe790e25f022969a18bc0111cff170bc2
talosctl version --client
Client:
Tag: v1.4.8
SHA: 84c2961a
Built:
Go version: go1.20.7
OS/Arch: darwin/amd64
Building Images (Proprietary drivers)
First, I forked the pkgs repo and modified it to use the old driver. The changes are in commit 0772cc1. I built with the following command, which succeeded and pushed without issue:
make kernel nonfree-kmod-nvidia PUSH=true USERNAME=cubea01 PLATFORM=linux/amd64
Then I created a Dockerfile based on the documentation here and built it with the legacy builder (DOCKER_BUILDKIT=0):
DOCKER_BUILDKIT=0 docker build --squash --build-arg RM="/lib/modules" -t ghcr.io/cubea01/installer:v1.4.7-470.199.02 .
DEPRECATED: The legacy builder is deprecated and will be removed in a future release.
BuildKit is currently disabled; enable it by removing the DOCKER_BUILDKIT=0
environment-variable.
Sending build context to Docker daemon 3.584kB
Step 1/4 : FROM scratch as customization
--->
Step 2/4 : COPY --from=ghcr.io/cubea01/nonfree-kmod-nvidia:v1.4.7-470.199.02 /lib/modules /lib/modules
---> ef1c4e91f9f9
Step 3/4 : FROM ghcr.io/siderolabs/installer:v1.4.7
7d7551ba622a: Download complete
e4ae3f15a6ee: Download complete
6bbf856e9412: Download complete
2f544ee616f0: Download complete
---> 7d7551ba622a
Step 4/4 : COPY --from=ghcr.io/cubea01/kernel:v1.4.7-470.199.02 /boot/vmlinuz /usr/install/${TARGETARCH}/vmlinuz
---> 2463d0cb737c
[Warning] One or more build-args [RM] were not consumed
error squashing image: not implemented
As the last line above shows, the build failed with "error squashing image: not implemented". So I rewrote the Dockerfile, with my basic grasp of how Dockerfiles work, and built it with BuildKit enabled. Here is the result of the new build:
DOCKER_BUILDKIT=1 docker build --squash --build-arg RM="/lib/modules" -t ghcr.io/cubea01/installer:v1.4.7-470.199.02 .
WARNING: experimental flag squash is removed with BuildKit. You should squash inside build using a multi-stage Dockerfile for efficiency.
[+] Building 54.7s (21/21) FINISHED docker:desktop-linux
=> [internal] load .dockerignore 0.0s
=> => transferring context: 52B 0.0s
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 905B 0.0s
=> [internal] load metadata for ghcr.io/cubea01/kernel:v1.4.7-470.199.02 1.6s
=> [internal] load metadata for ghcr.io/cubea01/nonfree-kmod-nvidia:v1.4.7-470.199.02 1.6s
=> [internal] load metadata for ghcr.io/siderolabs/installer:v1.4.7 1.6s
=> [auth] cubea01/kernel:pull token for ghcr.io 0.0s
=> [auth] siderolabs/installer:pull token for ghcr.io 0.0s
=> [auth] cubea01/nonfree-kmod-nvidia:pull token for ghcr.io 0.0s
=> [stage-2 1/3] FROM ghcr.io/siderolabs/installer:v1.4.7@sha256:7d7551ba622aaa0cc4af9d5c8e5c9eff6307ebb160efdb783a0541a39e477439 0.0s
=> => resolve ghcr.io/siderolabs/installer:v1.4.7@sha256:7d7551ba622aaa0cc4af9d5c8e5c9eff6307ebb160efdb783a0541a39e477439 0.0s
=> FROM ghcr.io/cubea01/nonfree-kmod-nvidia:v1.4.7-470.199.02@sha256:6c283a0075d4b7b576ae0be874bd5aaa42584bf3cd600a9769f5ce32859c0bad 0.0s
=> => resolve ghcr.io/cubea01/nonfree-kmod-nvidia:v1.4.7-470.199.02@sha256:6c283a0075d4b7b576ae0be874bd5aaa42584bf3cd600a9769f5ce32859c0bad 0.0s
=> FROM ghcr.io/cubea01/kernel:v1.4.7-470.199.02@sha256:f878d7348b811c2284c1d8831f1ab48de70f39c9c9e30d80a0815dea568e47d2 0.0s
=> => resolve ghcr.io/cubea01/kernel:v1.4.7-470.199.02@sha256:f878d7348b811c2284c1d8831f1ab48de70f39c9c9e30d80a0815dea568e47d2 0.0s
=> [kernel_stage 2/2] COPY --from=ghcr.io/cubea01/kernel:v1.4.7-470.199.02 /boot/vmlinuz /boot/vmlinuz 0.1s
=> [nonfree_stage 2/2] COPY --from=ghcr.io/cubea01/nonfree-kmod-nvidia:v1.4.7-470.199.02 /lib/modules /lib/modules 0.3s
=> [stage-2 2/3] RUN apk add --no-cache --update cpio squashfs-tools xz 1.8s
=> [stage-2 3/3] WORKDIR /initramfs 0.0s
=> [stage-2 4/3] RUN xz -d /usr/install/amd64/initramfs.xz && cpio -idvm < /usr/install/amd64/initramfs && unsquashfs -f -d /rootfs rootfs.sqsh && for f in /lib/modules; do rm -rfv /rootfs$f; done && rm /usr/install/amd64/initra 3.1s
=> [stage-2 5/3] COPY --from=customization / /rootfs 0.0s
=> [stage-2 6/3] RUN find /rootfs && mksquashfs /rootfs rootfs.sqsh -all-root -noappend -comp xz -Xdict-size 100% -no-progress && set -o pipefail && find . 2>/dev/null | cpio -H newc -o | xz -v -C crc32 -0 -e -T 0 -z >/usr/install/amd6 29.9s
=> [stage-2 7/3] COPY --from=nonfree_stage /lib/modules /lib 0.1s
=> [stage-2 8/3] COPY --from=kernel_stage /boot/vmlinuz /usr/install/amd64/vmlinuz 0.1s
=> exporting to image 17.9s
=> => exporting layers 14.3s
=> => exporting manifest sha256:5ceb434d89ef82520add2fe92a1ba566853a1d8802081b1e421737638db408b0 0.0s
=> => exporting config sha256:448ba4a0623bcc1b4873c250b97a97de989013d4ef2fe13f7b0155f7fd520123 0.0s
=> => exporting attestation manifest sha256:fd6cdde6c1f9c684d1b19adb5556bdc27660b3932663b2c7a2e5d35029d03058 0.0s
=> => exporting manifest list sha256:7cdb47713446b5c6bf97c1e50bc76cd796c481aca9eb6f1b7115f493c7bc59a5 0.0s
=> => naming to ghcr.io/cubea01/installer:v1.4.7-470.199.02 0.0s
=> => unpacking to ghcr.io/cubea01/installer:v1.4.7-470.199.02 3.5s
Testing new Installer
First, I built and bootstrapped a new single-node cluster in a VM, based on my main cluster. Once it was healthy, I upgraded the VM using the new image. See the recording of the VM upgrading and rebooting (0.5x playback speed). I uploaded a copy of the talconfig used to generate the VM's configuration here.
Command used to upgrade:
talosctl upgrade -n 10.211.55.26 -e 10.211.55.26 --force --image=ghcr.io/cubea01/installer:v1.4.7-470.199.02
The node appears to upgrade and boot fine, but it repeatedly logs the following error and, as the recording shows, it no longer detects any network interfaces.
"error initializing kmod manager: stat /lib/modules/6.1.45-talos/modules.dep: no such file or directory"