siderolabs / pkgs

License: Mozilla Public License 2.0

pkgs's Introduction

pkgs

This repository produces a set of packages that can be used to build a rootfs suitable for creating custom Linux distributions. The packages are published as a container image, and can be "installed" by simply copying the contents to your rootfs. For example, using Docker, we can do the following:

FROM scratch
COPY --from=<registry>/<organization>/<pkg>:<tag> / /
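When several packages are needed, the same COPY line is simply repeated once per image. A minimal shell sketch that prints such a Dockerfile follows; the function name `gen_rootfs_dockerfile` and the image references in the usage comment are hypothetical and not something this repository ships.

```shell
# Hypothetical helper: print a Dockerfile that layers each given package
# image into one rootfs, following the COPY pattern shown above.
gen_rootfs_dockerfile() {
  echo "FROM scratch"
  for image in "$@"; do
    # Each package image holds its files under /, so copy / to /.
    echo "COPY --from=${image} / /"
  done
}

# Example (placeholder image references):
# gen_rootfs_dockerfile registry.example/org/pkg-a:tag registry.example/org/pkg-b:tag > Dockerfile
```

Later COPY lines overwrite files from earlier ones when paths collide, so the order of the arguments matters.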

pkgs's Issues

The existing boot mechanism only supports RPI 4B 4GB USB boot, not RPI 4B 8GB

I ran a number of tests to determine whether the current boot solution also supports USB boot. Here is what I've found so far:

USB2 boot halts right after the XFS mount (step 6 of installation). USB2 doesn't provide much of a performance boost anyway, so this can be ignored and treated as an observation.
USB3/3.1 (Gen1) boot works with RPI4 firmware version 2020-09-03. I updated the EEPROM's config to boot from USB first, but this isn't strictly necessary if no SD card is inserted.
The very first installation took around 20 minutes, but 2 subsequent installations took considerably less, as expected given the much higher I/O bandwidth compared to SD cards.
The attached file has rudimentary measurements of installation times and boot times, notes on what was happening, for USB3.1 and SD card, plus the configuration files used (akimiski.yaml and talosconfig).

Issues encountered:

Providing the --arch flag to talosctl gen config has no effect; it always generates amd64 config files (I manually modified init.yaml, even though that doesn't seem to be required). I wonder whether the flag is needed at all, given that the installer could perhaps detect the architecture and pull the corresponding images (there could be edge cases).
Adding my internal CA's public key to the configuration file, following the Talos guide, causes an installation crash loop. I tried multiple times, and the installation keeps failing and rebooting the node. Unfortunately, the crash happens in an early stage of the installation, so I could not query the server for its internal state or logs to attach. I was able to recreate the failure on VirtualBox and captured a video recording. Please note that on VirtualBox the node halts/sleeps instead of rebooting, but the error is visible; it happens in step 12 of installation. I've also attached the configuration file, which includes the CA public key (arctic.yaml).
The biggest issue I encountered is that USB boot only works on the RPI 4B 4GB, which means the 8GB version is not supported. This isn't entirely surprising given what's documented on the Debian wiki, in the VL805 firmware patch, and on Raspberry Pi's firmware site.
I have been able to boot Ubuntu 20.04.1 using a decompressed vmlinuz and a modified config.txt (which is attached).

As part of my research work, I run ML models on Kubernetes on RPI clusters. To get the best of both worlds, I'm willing to do a custom build using the decompressed kernel approach, if you could confirm that including the decompressed kernel wouldn't interfere with the Talos installation.
If that's possible, I'll post the results by early next week for anyone who might be interested.
Talos.zip

Need IPVLAN support

Requested experimental feature mergeop is not supported by build server

Hi. I'm trying to build the kernel with these instructions.

git clone https://github.com/talos-systems/pkgs.git
cd pkgs
make kernel PLATFORM=linux/amd64 USERNAME=maxpain PUSH=true

But getting this error:

make[1]: Entering directory '/root/talos-test/pkgs'
make[2]: Entering directory '/root/talos-test/pkgs'
[+] Building 0.6s (5/5) FINISHED
 => [internal] load build definition from Pkgfile                                                                                                                                                                                               0.0s
 => => transferring dockerfile: 29B                                                                                                                                                                                                             0.0s
 => [internal] load .dockerignore                                                                                                                                                                                                               0.0s
 => => transferring context: 2B                                                                                                                                                                                                                 0.0s
 => resolve image config for ghcr.io/siderolabs/bldr:v0.2.0-alpha.8-frontend                                                                                                                                                                    0.3s
 => CACHED docker-image://ghcr.io/siderolabs/bldr:v0.2.0-alpha.8-frontend@sha256:b1fe8a49f5d0fd1bb7e30dc9bb4e86ca168213fa7609e454398326e2d3f611ff                                                                                               0.0s
 => load Pkgfile, pkg.yamls and vars.yamls                                                                                                                                                                                                      0.0s
 => => transferring dockerfile: 7.07kB                                                                                                                                                                                                          0.0s
error: failed to solve: rpc error: code = Unknown desc = failed to solve with frontend dockerfile.v0: failed to solve with frontend gateway.v0: rpc error: code = Unknown desc = failed to solve LLB: requested experimental feature mergeop  is not supported by build server, please update docker
make[2]: *** [Makefile:80: target-kernel] Error 1
make[2]: Leaving directory '/root/talos-test/pkgs'
make[1]: *** [Makefile:86: docker-kernel] Error 2
make[1]: Leaving directory '/root/talos-test/pkgs'
make: *** [Makefile:90: kernel] Error 2

This is strange, because I could build the kernel a few days ago.
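The error text asks for a Docker update, so a first step is to compare the builder's version against a minimum. A rough sketch follows; the helper name `version_ge` is hypothetical, the minimum of BuildKit v0.10.0 for MergeOp is an assumption to verify, and `sort -V` requires GNU coreutils or a compatible `sort`.

```shell
# Hypothetical helper: succeed when version $1 is at least version $2.
# Relies on `sort -V` (version sort), available in GNU coreutils.
version_ge() {
  have="${1#v}"   # drop an optional leading "v"
  want="${2#v}"
  # The smaller of the two versions sorts first; if that is the wanted
  # minimum, the version we have is new enough.
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

# Assumption: MergeOp needs BuildKit v0.10.0 or newer; adjust if that is wrong.
# The first argument would come from your own builder or daemon.
# version_ge "v0.9.3" "v0.10.0" || echo "builder too old, update docker"
```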

Help with Deploying NVIDIA device plugin

In the Deploying NVIDIA device plugin section of the proprietary driver guide, it says to apply a manifest. This is using kubectl, correct? The issue I have is that the contexts aren't propagated into kubeconfig. Was I supposed to do that manually? If so, how?
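One possible route, sketched under assumptions: `talosctl kubeconfig` fetches the cluster's admin kubeconfig and, when no output path is given, merges it into the default kubeconfig, after which `kubectl` can apply the manifest. The wrapper name, the node address, and the manifest path below are placeholders, and the guide itself may intend a different workflow.

```shell
# Hypothetical wrapper; the node address and manifest path are placeholders.
apply_with_talos_kubeconfig() {
  node="$1"
  manifest="$2"
  # Fetch the admin kubeconfig from a control plane node and merge it
  # into the default kubeconfig (usually ~/.kube/config).
  talosctl kubeconfig --nodes "$node" || return 1
  # Apply the manifest using the freshly merged context.
  kubectl apply -f "$manifest"
}

# apply_with_talos_kubeconfig <control-plane-ip> <manifest.yaml>
```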

raspberrypi_firmware_version is old

Pkgfile says: raspberrypi_firmware_version: 1.20230405, which lacks support for rPi 5, in particular.

It seems the raspberrypi/firmware repo has stopped tagging 'releases', so I think talos needs to switch to using some other versioning scheme for this dependency. Or alternatively, find another source for these files.

I can build my own custom image, but I think rPi-5 has sufficient visibility to warrant fixing this for everyone.

kernel crash at amdgpu talos v1.5.0

k3s-node5.lab: kern:    info: [2023-08-18T13:36:47.292786465Z]: Linux agpgart interface v0.103
k3s-node5.lab: kern: warning: [2023-08-18T13:36:47.378241465Z]: tpm tpm0: AMD fTPM version 0x3005400000005 causes system stutter; hwrng disabled
 SUBSYSTEM=tpm
 DEVICE=c10:224
k3s-node5.lab: kern:    info: [2023-08-18T13:36:47.378748465Z]: ACPI: bus type drm_connector registered
k3s-node5.lab: kern:    info: [2023-08-18T13:36:47.379027465Z]: [drm] amdgpu kernel modesetting enabled.
k3s-node5.lab: kern:    info: [2023-08-18T13:36:47.405364465Z]: [drm] initializing kernel modesetting (RENOIR 0x1002:0x164C 0x1002:0x0123 0xC2).
k3s-node5.lab: kern:    info: [2023-08-18T13:36:47.405387465Z]: [drm] register mmio base: 0xFCC00000
k3s-node5.lab: kern:    info: [2023-08-18T13:36:47.405390465Z]: [drm] register mmio size: 524288
k3s-node5.lab: kern:    info: [2023-08-18T13:36:47.407906465Z]: [drm] add ip block number 0 <soc15_common>
k3s-node5.lab: kern:    info: [2023-08-18T13:36:47.407910465Z]: [drm] add ip block number 1 <gmc_v9_0>
k3s-node5.lab: kern:    info: [2023-08-18T13:36:47.407912465Z]: [drm] add ip block number 2 <vega10_ih>
k3s-node5.lab: kern:    info: [2023-08-18T13:36:47.407915465Z]: [drm] add ip block number 3 <psp>
k3s-node5.lab: kern:    info: [2023-08-18T13:36:47.407917465Z]: [drm] add ip block number 4 <smu>
k3s-node5.lab: kern:    info: [2023-08-18T13:36:47.407919465Z]: [drm] add ip block number 5 <dm>
k3s-node5.lab: kern:    info: [2023-08-18T13:36:47.407922465Z]: [drm] add ip block number 6 <gfx_v9_0>
k3s-node5.lab: kern:    info: [2023-08-18T13:36:47.407924465Z]: [drm] add ip block number 7 <sdma_v4_0>
k3s-node5.lab: kern:    info: [2023-08-18T13:36:47.407927465Z]: [drm] add ip block number 8 <vcn_v2_0>
k3s-node5.lab: kern:    info: [2023-08-18T13:36:47.407929465Z]: [drm] add ip block number 9 <jpeg_v2_0>
k3s-node5.lab: kern:    info: [2023-08-18T13:36:47.407949465Z]: amdgpu 0000:04:00.0: amdgpu: Fetched VBIOS from VFCT
 SUBSYSTEM=pci
 DEVICE=+pci:0000:04:00.0
k3s-node5.lab: kern:    info: [2023-08-18T13:36:47.407953465Z]: amdgpu: ATOM BIOS: 113-LUCIENNE-019
k3s-node5.lab: kern:    info: [2023-08-18T13:36:48.258045465Z]: tsc: Refined TSC clocksource calibration: 2096.053 MHz
k3s-node5.lab: kern:    info: [2023-08-18T13:36:48.258069465Z]: clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x1e369fd61e0, max_idle_ns: 440795252296 ns
k3s-node5.lab: kern:    info: [2023-08-18T13:36:48.258398465Z]: clocksource: Switched to clocksource tsc
k3s-node5.lab: kern:    info: [2023-08-18T13:36:48.468663465Z]: Freeing initrd memory: 61280K
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.469226465Z]: amdgpu 0000:04:00.0: Direct firmware load for amdgpu/renoir_sdma.bin failed with error -2
 SUBSYSTEM=pci
 DEVICE=+pci:0000:04:00.0
k3s-node5.lab: kern:     err: [2023-08-18T13:36:48.469359465Z]: [drm:amdgpu_sdma_init_microcode] *ERROR* SDMA: Failed to init firmware "amdgpu/renoir_sdma.bin"
k3s-node5.lab: kern:     err: [2023-08-18T13:36:48.469376465Z]: [drm:sdma_v4_0_early_init] *ERROR* Failed to load sdma firmware!
k3s-node5.lab: kern:    info: [2023-08-18T13:36:48.469384465Z]: [drm] VCN decode is enabled in VM mode
k3s-node5.lab: kern:    info: [2023-08-18T13:36:48.469387465Z]: [drm] VCN encode is enabled in VM mode
k3s-node5.lab: kern:    info: [2023-08-18T13:36:48.469390465Z]: [drm] JPEG decode is enabled in VM mode
k3s-node5.lab: kern:    info: [2023-08-18T13:36:48.469396465Z]: amdgpu 0000:04:00.0: vgaarb: deactivate vga console
 SUBSYSTEM=pci
 DEVICE=+pci:0000:04:00.0
k3s-node5.lab: kern:    info: [2023-08-18T13:36:48.469403465Z]: amdgpu 0000:04:00.0: amdgpu: Trusted Memory Zone (TMZ) feature enabled
 SUBSYSTEM=pci
 DEVICE=+pci:0000:04:00.0
k3s-node5.lab: kern:    info: [2023-08-18T13:36:48.469426465Z]: amdgpu 0000:04:00.0: amdgpu: MODE2 reset
 SUBSYSTEM=pci
 DEVICE=+pci:0000:04:00.0
k3s-node5.lab: kern:    info: [2023-08-18T13:36:48.469598465Z]: [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
k3s-node5.lab: kern:    info: [2023-08-18T13:36:48.469611465Z]: amdgpu 0000:04:00.0: amdgpu: VRAM: 3072M 0x000000F400000000 - 0x000000F4BFFFFFFF (3072M used)
 SUBSYSTEM=pci
 DEVICE=+pci:0000:04:00.0
k3s-node5.lab: kern:    info: [2023-08-18T13:36:48.469621465Z]: amdgpu 0000:04:00.0: amdgpu: GART: 1024M 0x0000000000000000 - 0x000000003FFFFFFF
 SUBSYSTEM=pci
 DEVICE=+pci:0000:04:00.0
k3s-node5.lab: kern:    info: [2023-08-18T13:36:48.469628465Z]: amdgpu 0000:04:00.0: amdgpu: AGP: 267419648M 0x000000F800000000 - 0x0000FFFFFFFFFFFF
 SUBSYSTEM=pci
 DEVICE=+pci:0000:04:00.0
k3s-node5.lab: kern:    info: [2023-08-18T13:36:48.469643465Z]: [drm] Detected VRAM RAM=3072M, BAR=3072M
k3s-node5.lab: kern:    info: [2023-08-18T13:36:48.469647465Z]: [drm] RAM width 128bits DDR4
k3s-node5.lab: kern:    info: [2023-08-18T13:36:48.470171465Z]: [drm] amdgpu: 3072M of VRAM memory ready
k3s-node5.lab: kern:    info: [2023-08-18T13:36:48.470177465Z]: [drm] amdgpu: 6423M of GTT memory ready.
k3s-node5.lab: kern:    info: [2023-08-18T13:36:48.470206465Z]: [drm] GART: num cpu pages 262144, num gpu pages 262144
k3s-node5.lab: kern:    info: [2023-08-18T13:36:48.470385465Z]: [drm] PCIE GART of 1024M enabled.
k3s-node5.lab: kern:    info: [2023-08-18T13:36:48.470389465Z]: [drm] PTB located at 0x000000F4BFC00000
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470662465Z]: amdgpu 0000:04:00.0: Direct firmware load for amdgpu/renoir_asd.bin failed with error -2
 SUBSYSTEM=pci
 DEVICE=+pci:0000:04:00.0
k3s-node5.lab: kern:     err: [2023-08-18T13:36:48.470669465Z]: amdgpu 0000:04:00.0: amdgpu: fail to initialize asd microcode
 SUBSYSTEM=pci
 DEVICE=+pci:0000:04:00.0
k3s-node5.lab: kern:     err: [2023-08-18T13:36:48.470672465Z]: [drm:psp_sw_init] *ERROR* Failed to load psp firmware!
k3s-node5.lab: kern:     err: [2023-08-18T13:36:48.470679465Z]: [drm:amdgpu_device_init.cold] *ERROR* sw_init of IP block <psp> failed -2
k3s-node5.lab: kern:     err: [2023-08-18T13:36:48.470688465Z]: amdgpu 0000:04:00.0: amdgpu: amdgpu_device_ip_init failed
 SUBSYSTEM=pci
 DEVICE=+pci:0000:04:00.0
k3s-node5.lab: kern:     err: [2023-08-18T13:36:48.470692465Z]: amdgpu 0000:04:00.0: amdgpu: Fatal error during GPU init
 SUBSYSTEM=pci
 DEVICE=+pci:0000:04:00.0
k3s-node5.lab: kern:    info: [2023-08-18T13:36:48.470696465Z]: amdgpu 0000:04:00.0: amdgpu: amdgpu: finishing device.
 SUBSYSTEM=pci
 DEVICE=+pci:0000:04:00.0
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470800465Z]: ------------[ cut here ]------------
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470802465Z]: WARNING: CPU: 0 PID: 1 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:656 amdgpu_irq_put+0x45/0x70
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470811465Z]: Modules linked in:
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470816465Z]: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.1.45-talos #1
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470822465Z]: Hardware name: AZW S5/S5, BIOS FP655U505 11/21/2022
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470825465Z]: RIP: 0010:amdgpu_irq_put+0x45/0x70
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470832465Z]: Code: 48 8b 4e 10 48 83 39 00 74 2c 89 d1 48 8d 04 88 8b 08 85 c9 74 14 f0 ff 08 b8 00 00 00 00 74 05 e9 d0 5b 16 01 e9 8b fd ff ff <0f> 0b b8 ea ff ff ff e9 bf 5b 16 01 b8 ea ff ff ff e9 b5 5b 16 01
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470837465Z]: RSP: 0018:ffffaf87c006fc28 EFLAGS: 00010246
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470843465Z]: RAX: ffff932700201588 RBX: ffff9327040e0000 RCX: 0000000000000000
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470847465Z]: RDX: 0000000000000000 RSI: ffff9327040e0be0 RDI: ffff9327040e0000
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470850465Z]: RBP: ffff9327040e0010 R08: 0000000000000000 R09: 0000000094654da6
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470854465Z]: R10: ffffffffffffffff R11: 0000000000000038 R12: ffff9327040e0010
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470858465Z]: R13: 0000000000000001 R14: ffff9327040e0010 R15: ffff9327040e0000
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470862465Z]: FS:  0000000000000000(0000) GS:ffff932960c00000(0000) knlGS:0000000000000000
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470867465Z]: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470871465Z]: CR2: ffff93296e1ff000 CR3: 0000000008610000 CR4: 0000000000350ef0
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470875465Z]: Call Trace:
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470879465Z]:  <TASK>
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470884465Z]:  ? __warn+0x7d/0xc0
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470893465Z]:  ? amdgpu_irq_put+0x45/0x70
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470900465Z]:  ? report_bug+0xe6/0x170
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470912465Z]:  ? handle_bug+0x41/0x70
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470920465Z]:  ? exc_invalid_op+0x13/0x60
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470927465Z]:  ? asm_exc_invalid_op+0x16/0x20
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470942465Z]:  ? amdgpu_irq_put+0x45/0x70
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470950465Z]:  ? __x86_return_thunk+0x5/0x6
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470957465Z]:  gmc_v9_0_hw_fini+0x60/0x80
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470964465Z]:  amdgpu_device_fini_hw+0x1cc/0x2af
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470972465Z]:  amdgpu_driver_load_kms.cold+0x54/0x6a
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470979465Z]:  amdgpu_pci_probe+0x125/0x340
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470989465Z]:  local_pci_probe+0x41/0x80
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.470998465Z]:  pci_device_probe+0xbf/0x230
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.471004465Z]:  ? __x86_return_thunk+0x5/0x6
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.471008465Z]:  ? kernfs_create_link+0x5d/0xa0
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.471015465Z]:  ? __x86_return_thunk+0x5/0x6
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.471019465Z]:  ? sysfs_do_create_link_sd+0x6e/0xe0
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.471029465Z]:  really_probe+0xc7/0x280
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.471034465Z]:  ? pm_runtime_barrier+0x50/0x90
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.471042465Z]:  __driver_probe_device+0x73/0xf0
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.471047465Z]:  driver_probe_device+0x1f/0x90
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.471053465Z]:  __driver_attach+0x84/0x130
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.471058465Z]:  ? __device_attach_driver+0xc0/0xc0
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.471062465Z]:  bus_for_each_dev+0x87/0xd0
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.471071465Z]:  bus_add_driver+0x186/0x1d0
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.471080465Z]:  driver_register+0x89/0xe0
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.471084465Z]:  ? drm_sched_fence_slab_init+0x87/0x87
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.471090465Z]:  do_one_initcall+0x59/0x230
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.471101465Z]:  kernel_init_freeable+0x2bb/0x35d
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.471107465Z]:  ? rest_init+0xd0/0xd0
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.471114465Z]:  kernel_init+0x16/0x130
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.471119465Z]:  ret_from_fork+0x22/0x30
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.471133465Z]:  </TASK>
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.471134465Z]: ---[ end trace 0000000000000000 ]---
k3s-node5.lab: kern: warning: [2023-08-18T13:36:48.471288465Z]: amdgpu: probe of 0000:04:00.0 failed with error -2
k3s-node5.lab: kern:    info: [2023-08-18T13:36:48.471548465Z]: [drm] amdgpu: ttm finalized

Client:
        Tag:         v1.5.0
        SHA:         429a2de8
        Built:       
        Go version:  go1.20.7
        OS/Arch:     linux/amd64
Server:
        NODE:        k3s-node5.lab
        Tag:         v1.5.0
        SHA:         429a2de8
        Built:       
        Go version:  go1.20.7
        OS/Arch:     linux/amd64
        Enabled:     RBAC
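
In the log above, the failures start with "Direct firmware load ... failed with error -2"; error -2 is ENOENT, meaning the firmware file was not found. A small sketch for pulling the missing file names out of a kernel log follows; the helper name `missing_firmware` is hypothetical.

```shell
# Hypothetical helper: read kernel log lines on stdin and print each
# firmware file whose direct load failed with error -2 (ENOENT).
missing_firmware() {
  sed -n 's/.*Direct firmware load for \([^ ]*\) failed with error -2.*/\1/p' | sort -u
}

# Example: talosctl dmesg | missing_firmware
```

For the log in this issue, that would list amdgpu/renoir_asd.bin and amdgpu/renoir_sdma.bin.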

kernel: Magic SysRq should not be enabled

While it isn't remotely exploitable as far as I know, this really should not be enabled.
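
To see what a running kernel currently allows, the value in /proc/sys/kernel/sysrq can be read; the file only exists when the kernel is built with CONFIG_MAGIC_SYSRQ. A minimal sketch for interpreting that value follows; the helper name `sysrq_state` is hypothetical.

```shell
# Hypothetical helper: describe a value read from /proc/sys/kernel/sysrq.
# Kernel semantics: 0 disables SysRq, 1 enables all functions,
# any larger value is a bitmask of allowed functions.
sysrq_state() {
  case "$1" in
    0) echo "disabled" ;;
    1) echo "all functions enabled" ;;
    *) echo "restricted by bitmask $1" ;;
  esac
}

# Example on a regular Linux host:
# sysrq_state "$(cat /proc/sys/kernel/sysrq)"
```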

missing realtek firmware

Trying to use a USB Ethernet dongle:

 kern:    info: [2024-02-15T17:30:36.563982928Z]: usb 4-1: new SuperSpeed USB device number 2 using xhci_hcd
  SUBSYSTEM=usb
  DEVICE=+usb:4-1
 kern:    info: [2024-02-15T17:30:36.870492928Z]: r8152-cfgselector 4-1: reset SuperSpeed USB device number 2 using xhci_hcd
  SUBSYSTEM=usb
  DEVICE=c189:385
 kern: warning: [2024-02-15T17:30:36.905576928Z]: r8152 4-1:1.0: Direct firmware load for rtl_nic/rtl8156b-2.fw failed with error -2
  SUBSYSTEM=usb
  DEVICE=+usb:4-1:1.0
 kern: warning: [2024-02-15T17:30:36.905967928Z]: r8152 4-1:1.0: unable to load firmware patch rtl_nic/rtl8156b-2.fw (-2)
  SUBSYSTEM=usb
  DEVICE=+usb:4-1:1.0
 kern:     err: [2024-02-15T17:30:36.944160928Z]: r8152 4-1:1.0 (unnamed net_device) (uninitialized): netif_napi_add_weight() called with weight 256
  SUBSYSTEM=usb
  DEVICE=+usb:4-1:1.0
 kern:    info: [2024-02-15T17:30:36.945480928Z]: r8152 4-1:1.0 eth1: v1.12.13
  SUBSYSTEM=usb
  DEVICE=+usb:4-1:1.0
 kern:    info: [2024-02-15T17:33:01.129904928Z]: IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
 kern:    info: [2024-02-15T17:33:01.130499928Z]: r8152 4-1:1.0 eth1: carrier on

binfmt_misc as module

Hello

Can we add binfmt_misc as a module in the kernel?

pkgs/kernel/build/config-amd64

Line 942 in 7d9e60e

# CONFIG_BINFMT_MISC is not set

It helps when using Talos as a build machine: https://github.com/multiarch/qemu-user-static
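
Building the option as a module would mean the config line reads CONFIG_BINFMT_MISC=m instead of the "is not set" comment. A small sketch for checking how an option is set in a kernel config file follows; the helper name `kconfig_state` is hypothetical.

```shell
# Hypothetical helper: report how an option is set in a kernel config file.
# Prints the value after "=" (for example "y" or "m"), or "not set" when
# the option has no assignment line.
kconfig_state() {
  option="$1"
  file="$2"
  value="$(sed -n "s/^${option}=//p" "$file" | head -n 1)"
  if [ -n "$value" ]; then
    echo "$value"
  else
    echo "not set"
  fi
}

# Example, run from the repository root:
# kconfig_state CONFIG_BINFMT_MISC kernel/build/config-amd64
```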

Thanks.

Add open-iscsi

AMD GPU troubleshooting

KFD device doesn't appear

Enable multi-generation LRU by default in Talos 1.5

PR #710 introduced multi-gen LRU support, but it is not currently enabled by default.

Since the release of Talos 1.4.1, I have configured most of my clusters to use multi-gen LRU using the following configuration:

        sysfs:
            kernel.mm.lru_gen.enabled: y

This change has brought noticeable benefits, such as a significant decrease in OOM kills, without any apparent drawbacks. I have been running multi-gen LRU in most clusters for more than 2 months. Considering that several other mainstream general-purpose Linux distributions have started enabling multi-gen LRU, I suggest that we enable it by default in the upcoming Talos 1.5 release.
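
Whether the setting took effect can be checked by reading /sys/kernel/mm/lru_gen/enabled, which reports a hexadecimal bitmask; bit 0x0001 is the main switch, and writing y turns on all components. A minimal sketch for interpreting that value follows; the helper name `lru_gen_on` is hypothetical.

```shell
# Hypothetical helper: succeed when the value read from
# /sys/kernel/mm/lru_gen/enabled has the main switch (bit 0x0001) set.
lru_gen_on() {
  [ $(( $1 & 1 )) -ne 0 ]
}

# Example on a host or node where the sysfs file is present:
# lru_gen_on "$(cat /sys/kernel/mm/lru_gen/enabled)" && echo "multi-gen LRU is on"
```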

Need Intel i40e network driver

restore RPi support in the kernel

It looks like PR #638 "automatically" removed many kernel config options for RPi; they need to be restored.

Kernel is always rebuilt even without any changes

Each package that depends on the kernel-build stage always rebuilds the kernel instead of using the cache. As a result, signing_key.x509 changes each time and the module can't be loaded.

Build the kernel:

> make kernel REGISTRY=127.0.0.1:5005 PUSH=true PLATFORM=linux/amd64
...
 => CACHED kernel-build:prepare-0                                                                                                                                                   0.0s
 => CACHED kernel-build:build-0                                                                                                                                                     0.0s
 => kernel-build:build-0                                                                                                                                                         1381.1s
 => CACHED kernel-build:finalize /usr -> /usr                                                                                                                                       0.0s
 => CACHED kernel-build:finalize /bin -> /bin                                                                                                                                       0.0s
 => kernel-build:finalize /src -> /src 
...

Build a module:

> make mymodulename REGISTRY=127.0.0.1:5005 PUSH=true PLATFORM=linux/amd64
...
 => CACHED kernel-build:prepare-0                                                                                                                                                   0.0s
 => CACHED kernel-build:build-0                                                                                                                                                     0.0s
 => kernel-build:build-0                                                                                                                                                         1358.3s
 => kernel-build:finalize /src -> /src                                                                                                                                             71.8s
 => kernel-build:finalize /toolchain -> /toolchain                                                                                                                                 10.8s
 => CACHED kernel-build:finalize /bin -> /bin
...
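
One way to confirm the symptom is to extract signing_key.x509 from two builds and compare the files. A minimal sketch follows; the helper name `same_signing_key` is hypothetical, and both paths are supplied by the caller because this issue does not state where the file lives inside the image.

```shell
# Hypothetical helper: compare the signing certificates taken from two builds.
# Both arguments are paths to files you extracted yourself.
same_signing_key() {
  if cmp -s "$1" "$2"; then
    echo "same key"
  else
    echo "different keys"
  fi
}

# same_signing_key build1/signing_key.x509 build2/signing_key.x509
```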

Client: Docker Engine - Community
 Version:    24.0.7
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.11.2
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.21.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 1
  Running: 1
  Paused: 0
  Stopped: 0
 Images: 1
 Server Version: 24.0.7
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false

Audit disabled Kernel options (KSPP)

Some KSPP options are disabled. Audit them to make sure it is reasonable to keep them disabled.

Custom kernel build: TARGETARCH is not being set in installer ONBUILD hooks

I'm attempting to build the kernel on v1.4.1 and am running into this issue when trying to add my kernel to the installer:

DOCKER_BUILDKIT=0 docker build --build-arg RM="/lib/modules" -t installer:kernel .
DEPRECATED: The legacy builder is deprecated and will be removed in a future release.
            BuildKit is currently disabled; enable it by removing the DOCKER_BUILDKIT=0
            environment-variable.

Sending build context to Docker daemon  2.048kB
Step 1/4 : FROM scratch AS customization
 ---> 
Step 2/4 : COPY --from=registry.local/library/kernel:v1.4.1-dirty /lib/modules /lib/modules
 ---> Using cache
 ---> 4f31e4a0b166
Step 3/4 : FROM ghcr.io/siderolabs/installer:v1.4.1
# Executing 7 build triggers
 ---> Using cache
 ---> Using cache
 ---> Using cache
 ---> Running in 3a087765916c
xz: /usr/install//initramfs.xz: No such file or directory
The command '/bin/sh -c xz -d /usr/install/${TARGETARCH}/initramfs.xz     && cpio -idvm < /usr/install/${TARGETARCH}/initramfs     && unsquashfs -f -d /rootfs rootfs.sqsh     && for f in ${RM}; do rm -rfv /rootfs$f; done     && rm /usr/install/${TARGETARCH}/initramfs     && rm rootfs.sqsh' returned a non-zero code: 1

Dockerfile:

FROM scratch AS customization
COPY --from=registry.local/library/kernel:v1.4.1-dirty /lib/modules /lib/modules

FROM ghcr.io/siderolabs/installer:v1.4.1
COPY --from=registry.local/library/kernel:v1.4.1-dirty /boot/vmlinuz /usr/install/${TARGETARCH}/vmlinuz

I followed the steps here (https://www.talos.dev/v1.4/advanced/customizing-the-kernel/), the only tweak being that I added PLATFORM=linux/amd64 to make kernel. I used this issue comment to set up buildkit: #551 (comment). I don't know enough about multi-platform images and the ONBUILD hooks to make sense of this.

OS: Ubuntu 22.04.2 LTS
Docker: 23.0.5
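
For context: BuildKit populates TARGETARCH automatically, while the legacy builder selected by DOCKER_BUILDKIT=0 does not, which matches the empty path segment in /usr/install//initramfs.xz above. A minimal sketch of one possible workaround is to pass the value explicitly. It assumes the installer's ONBUILD hooks declare ARG TARGETARCH, which is not confirmed here. docker_arch and build_cmd are hypothetical helper names, and the command is only printed, not run.

```shell
#!/bin/sh
# Map a `uname -m` machine name to the architecture name Docker uses
# for TARGETARCH (amd64, arm64).
docker_arch() {
  case "$1" in
    x86_64 | amd64) echo amd64 ;;
    aarch64 | arm64) echo arm64 ;;
    *)
      echo "unsupported machine: $1" >&2
      return 1
      ;;
  esac
}

# Print (do not run) a legacy-builder command that sets TARGETARCH by hand.
# Assumption: the base image's ONBUILD hooks declare `ARG TARGETARCH`.
build_cmd() {
  arch=$(docker_arch "$1") || return 1
  echo "DOCKER_BUILDKIT=0 docker build --build-arg TARGETARCH=$arch --build-arg RM=/lib/modules -t installer:kernel ."
}
```

Calling build_cmd "$(uname -m)" prints the command for the current machine. Removing DOCKER_BUILDKIT=0 altogether is the other option, since BuildKit then fills in TARGETARCH itself.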

Can't create a working nonfree-kmod-nvidia build.

I'm trying to create a modified Nvidia build with an old version of the Nvidia driver for support with my legacy GPUs, but I need some help. I know I'll need a matching nvidia-container-toolkit extension, too; I plan on working on that after I create a working installer image. I am still figuring out what I'm doing; some guidance would be appreciated. All the container images used in this issue are public and available in my fork of the pkgs repo.

Environment

macOS Ventura 13.4.1 with Docker Desktop v4.22.0 (Engine 24.0.5).

docker --version                                                                                                                                                                                                                           
Docker version 24.0.5, build ced0996600

docker buildx version                                                                                                                                                                                                                      
github.com/docker/buildx v0.11.2-desktop.1 986ab6afe790e25f022969a18bc0111cff170bc2

talosctl version --client
Client:
        Tag:         v1.4.8
        SHA:         84c2961a
        Built:       
        Go version:  go1.20.7
        OS/Arch:     darwin/amd64

Building Images (Proprietary drivers)

First, I forked the pkgs repo and modified it with the old driver. Changes can be found in this 0772cc1 commit. I built using make kernel nonfree-kmod-nvidia PUSH=true USERNAME=cubea01 PLATFORM=linux/amd64, which succeeded and pushed without issue.

Then I created a Dockerfile based on the documentation here.

DOCKER_BUILDKIT=0 docker build --squash --build-arg RM="/lib/modules" -t ghcr.io/cubea01/installer:v1.4.7-470.199.02 .
DEPRECATED: The legacy builder is deprecated and will be removed in a future release.
            BuildKit is currently disabled; enable it by removing the DOCKER_BUILDKIT=0
            environment-variable.

Sending build context to Docker daemon  3.584kB
Step 1/4 : FROM scratch as customization
 ---> 
Step 2/4 : COPY --from=ghcr.io/cubea01/nonfree-kmod-nvidia:v1.4.7-470.199.02 /lib/modules /lib/modules
 ---> ef1c4e91f9f9
Step 3/4 : FROM ghcr.io/siderolabs/installer:v1.4.7
7d7551ba622a: Download complete 
e4ae3f15a6ee: Download complete 
6bbf856e9412: Download complete 
2f544ee616f0: Download complete 
 ---> 7d7551ba622a
Step 4/4 : COPY --from=ghcr.io/cubea01/kernel:v1.4.7-470.199.02 /boot/vmlinuz /usr/install/${TARGETARCH}/vmlinuz
 ---> 2463d0cb737c
[Warning] One or more build-args [RM] were not consumed
error squashing image: not implemented

And as you can see above, it didn't work. So, I rewrote the Dockerfile with my basic grasp of how they worked, and here's the result of the new build.

DOCKER_BUILDKIT=1 docker build --squash --build-arg RM="/lib/modules" -t ghcr.io/cubea01/installer:v1.4.7-470.199.02 .

WARNING: experimental flag squash is removed with BuildKit. You should squash inside build using a multi-stage Dockerfile for efficiency.
[+] Building 54.7s (21/21) FINISHED                                                                                                                                                                                                    docker:desktop-linux
 => [internal] load .dockerignore                                                                                                                                                                                                                      0.0s
 => => transferring context: 52B                                                                                                                                                                                                                       0.0s
 => [internal] load build definition from Dockerfile                                                                                                                                                                                                   0.0s
 => => transferring dockerfile: 905B                                                                                                                                                                                                                   0.0s
 => [internal] load metadata for ghcr.io/cubea01/kernel:v1.4.7-470.199.02                                                                                                                                                                              1.6s
 => [internal] load metadata for ghcr.io/cubea01/nonfree-kmod-nvidia:v1.4.7-470.199.02                                                                                                                                                                 1.6s
 => [internal] load metadata for ghcr.io/siderolabs/installer:v1.4.7                                                                                                                                                                                   1.6s
 => [auth] cubea01/kernel:pull token for ghcr.io                                                                                                                                                                                                       0.0s
 => [auth] siderolabs/installer:pull token for ghcr.io                                                                                                                                                                                                 0.0s
 => [auth] cubea01/nonfree-kmod-nvidia:pull token for ghcr.io                                                                                                                                                                                          0.0s
 => [stage-2 1/3] FROM ghcr.io/siderolabs/installer:v1.4.7@sha256:7d7551ba622aaa0cc4af9d5c8e5c9eff6307ebb160efdb783a0541a39e477439                                                                                                                     0.0s
 => => resolve ghcr.io/siderolabs/installer:v1.4.7@sha256:7d7551ba622aaa0cc4af9d5c8e5c9eff6307ebb160efdb783a0541a39e477439                                                                                                                             0.0s
 => FROM ghcr.io/cubea01/nonfree-kmod-nvidia:v1.4.7-470.199.02@sha256:6c283a0075d4b7b576ae0be874bd5aaa42584bf3cd600a9769f5ce32859c0bad                                                                                                                 0.0s
 => => resolve ghcr.io/cubea01/nonfree-kmod-nvidia:v1.4.7-470.199.02@sha256:6c283a0075d4b7b576ae0be874bd5aaa42584bf3cd600a9769f5ce32859c0bad                                                                                                           0.0s
 => FROM ghcr.io/cubea01/kernel:v1.4.7-470.199.02@sha256:f878d7348b811c2284c1d8831f1ab48de70f39c9c9e30d80a0815dea568e47d2                                                                                                                              0.0s
 => => resolve ghcr.io/cubea01/kernel:v1.4.7-470.199.02@sha256:f878d7348b811c2284c1d8831f1ab48de70f39c9c9e30d80a0815dea568e47d2                                                                                                                        0.0s
 => [kernel_stage 2/2] COPY --from=ghcr.io/cubea01/kernel:v1.4.7-470.199.02 /boot/vmlinuz /boot/vmlinuz                                                                                                                                                0.1s
 => [nonfree_stage 2/2] COPY --from=ghcr.io/cubea01/nonfree-kmod-nvidia:v1.4.7-470.199.02 /lib/modules /lib/modules                                                                                                                                    0.3s
 => [stage-2 2/3] RUN apk add --no-cache --update     cpio     squashfs-tools     xz                                                                                                                                                                   1.8s
 => [stage-2 3/3] WORKDIR /initramfs                                                                                                                                                                                                                   0.0s
 => [stage-2 4/3] RUN xz -d /usr/install/amd64/initramfs.xz     && cpio -idvm < /usr/install/amd64/initramfs     && unsquashfs -f -d /rootfs rootfs.sqsh     && for f in /lib/modules; do rm -rfv /rootfs$f; done     && rm /usr/install/amd64/initra  3.1s
 => [stage-2 5/3] COPY --from=customization / /rootfs                                                                                                                                                                                                  0.0s
 => [stage-2 6/3] RUN find /rootfs     && mksquashfs /rootfs rootfs.sqsh -all-root -noappend -comp xz -Xdict-size 100% -no-progress     && set -o pipefail && find . 2>/dev/null | cpio -H newc -o | xz -v -C crc32 -0 -e -T 0 -z >/usr/install/amd6  29.9s
 => [stage-2 7/3] COPY --from=nonfree_stage /lib/modules /lib                                                                                                                                                                                          0.1s
 => [stage-2 8/3] COPY --from=kernel_stage /boot/vmlinuz /usr/install/amd64/vmlinuz                                                                                                                                                                    0.1s
 => exporting to image                                                                                                                                                                                                                                17.9s
 => => exporting layers                                                                                                                                                                                                                               14.3s
 => => exporting manifest sha256:5ceb434d89ef82520add2fe92a1ba566853a1d8802081b1e421737638db408b0                                                                                                                                                      0.0s
 => => exporting config sha256:448ba4a0623bcc1b4873c250b97a97de989013d4ef2fe13f7b0155f7fd520123                                                                                                                                                        0.0s
 => => exporting attestation manifest sha256:fd6cdde6c1f9c684d1b19adb5556bdc27660b3932663b2c7a2e5d35029d03058                                                                                                                                          0.0s
 => => exporting manifest list sha256:7cdb47713446b5c6bf97c1e50bc76cd796c481aca9eb6f1b7115f493c7bc59a5                                                                                                                                                 0.0s
 => => naming to ghcr.io/cubea01/installer:v1.4.7-470.199.02                                                                                                                                                                                           0.0s
 => => unpacking to ghcr.io/cubea01/installer:v1.4.7-470.199.02                                                                                                                                                                                        3.5s

Testing new Installer

First, I built and bootstrapped a new single-node cluster based on my main one in a VM. Once it was healthy, I upgraded the VM using the new image. 0.5 Playback speed recording of VM upgrading and rebooting. I uploaded a copy of the talconfig used to create the configuration for the VM here.

Command used to upgrade.

talosctl upgrade -n 10.211.55.26 -e 10.211.55.26 --force --image=ghcr.io/cubea01/installer:v1.4.7-470.199.02

It appears to upgrade and boot just fine, but it constantly spits out the following error, and as you can see in the recording, it no longer detects any network interfaces.

"error initializing kmod manager: stat /lib/modules/6.1.45-talos/modules.dep: no such file or directory"

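The error above means the booted rootfs has no modules.dep for the running kernel. In the build log, the modules are copied in the final stage with COPY --from=nonfree_stage /lib/modules /lib, after rootfs.sqsh was already repacked, which may be why they never reach /lib/modules inside the squashfs. A minimal sketch of a check to run against an unpacked rootfs directory before upgrading follows. has_modules_dep is a hypothetical helper, and the rootfs path and kernel version are supplied by the caller.

```shell
#!/bin/sh
# Succeeds if <rootfs>/lib/modules/<kernel-version>/modules.dep exists.
# $1: path to an unpacked rootfs, $2: kernel version such as 6.1.45-talos.
has_modules_dep() {
  rootfs=$1
  kver=$2
  [ -f "$rootfs/lib/modules/$kver/modules.dep" ]
}
```

For example, has_modules_dep /rootfs 6.1.45-talos || echo "modules.dep missing" would flag the problem before the image is pushed.
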
Enable recent Chelsio NICs

These are fairly popular NICs for hyperconverged storage.

Mellanox extra kernel config

Ref:

rpi_generic: u-boot to 2023.01 results in a failed EFI boot

When booting with the current HEAD of siderolabs/pkgs, the boot fails on Raspberry Pi 4B and CM4 with:

Scanning usb 0:1...
BootOrder not defined
EFI boot manager: Cannot load any image
Found EFI removable media binary efi/boot/bootaa64.efi
** Reading file would overwrite reserved memory **
Failed to load 'efi/boot/bootaa64.efi'
No UEFI binary known at 0x00080000
EFI LOAD FAILED: continuing...

This appears to be caused by the bootefi bootmgr step in the auto-boot sequence, which seems to reserve memory even when it fails.

Here's the output of manually booting a CM4 from NVMe storage:
Failing:

U-Boot> nvme scan
U-Boot> bootefi bootmgr
Card did not respond to voltage select! : -110
Card did not respond to voltage select! : -110
** Unable to read file ubootefi.var **
Failed to load EFI variables
BootOrder not defined
EFI boot manager: Cannot load any image
U-Boot> devtype=nvme
U-Boot> devnum=0
U-Boot> distro_bootpart=1
U-Boot> run boot_efi_binary
** Reading file would overwrite reserved memory **
Failed to load 'efi/boot/bootaa64.efi
No UEFI binary known at 0x00080000
U-Boot>

Success:

U-Boot> nvme scan
U-Boot> devtype=nvme
U-Boot> devnum=0
U-Boot> distro_bootpart=1
U-Boot> run boot_efi_binary
147456 bytes read in 1 ms (140.6 MiB/s)
Card did not respond to voltage select! : -110
Card did not respond to voltage select! : -110
** Unable to read file ubootefi.var **
Failed to load EFI variables
Booting /efi\boot\bootaa64.efi

Upgrade Linux to 5.7.6

Update containerd to 1.3.6

https://github.com/containerd/containerd/releases/tag/v1.3.6

Upgrade musl to v1.1.23

https://www.musl-libc.org/download.html

symlink `/toolchain/bin/` to `/bin/` in `base`

To make autoconf stop complaining when using fixed paths.

Enable CONFIG_NVME_MULTIPATH

Native nvme multipath is needed for democratic-csi nvmeof https://github.com/democratic-csi/democratic-csi/blob/master/README.md?plain=1#L221
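
Whether a running kernel has native NVMe multipath turned on can be read from the nvme_core module parameter. This is a minimal sketch, assuming the parameter file lives at /sys/module/nvme_core/parameters/multipath and holds Y or N. nvme_multipath_enabled is a hypothetical helper, and the path is an argument so the function can be pointed at a copy of the file.

```shell
#!/bin/sh
# Succeeds if the given parameter file is readable and contains "Y".
# The default path is an assumption about where nvme_core exposes the
# setting; the file is typically absent when CONFIG_NVME_MULTIPATH is off.
nvme_multipath_enabled() {
  param=${1:-/sys/module/nvme_core/parameters/multipath}
  [ -r "$param" ] && [ "$(cat "$param")" = "Y" ]
}
```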

Add intel idle module

As described in the issue we just need a small module enabled:
siderolabs/talos#8595
This is the corresponding kernel parameter:
CONFIG_INTEL_IDLE

Enable XDP in Kernel Config

Cilium supports XDP (eXpress Data Path) acceleration, but it must be enabled in the kernel, and Talos Linux does not currently appear to enable this configuration option. This matters most at higher network speeds (10 Gbps and above), which are becoming commonplace in both enterprise and consumer gear.

https://docs.cilium.io/en/stable/operations/performance/tuning/#xdp-acceleration
https://docs.cilium.io/en/stable/network/kubernetes/kubeproxy-free/#xdp-acceleration

Kernel panic with 1.4

Hello,

I'm testing the v1.4.beta0 version of Talos Linux on my HomeLab, while the v1.3 version works fine.

The updated machine (x86_64) reboots at every kernel startup. I suspect a kernel panic, because with the panic=0 argument it no longer reboots.

Unfortunately, I don't have console output on the monitor, so I can't inspect the kernel panic. FYI, I didn't have console output in v1.3 either.

Nvidia Fabric Manager requires Nvidia driver to be of the same version

After building the proprietary driver, the Fabric Manager service reports an error:
fabric manager NVIDIA GPU driver interface version 525.85.12 don't match with driver version 530.41.03. Please update with matching NVIDIA driver package.

It might make sense to release drivers based on datacenter documentation in a bundle with FabricManager.

However, it seems there are several FTPs to download artifacts from:

https://download.nvidia.com/XFree86/ -- currently in use. This one is missing some Tesla (datacenter) drivers, such as 525.85.12. I suspect this FTP primarily serves desktop drivers.
https://developer.download.nvidia.com/compute/nvidia-driver/redist/ -- this one has both FabricManager and x86 drivers (no arm64), but surprisingly does not serve driver version 525.85.12. It looks like this FTP might be an attempt to cover datacenter needs, but for some reason it is inconsistent. Moreover, it does not have the latest version specified in the datacenter docs, 525.105.17, for either FabricManager or the driver.

But, using Official Advanced Driver Search I can find that 525.105.17 driver, both for arm64 and x86, and I can download it from here: https://us.download.nvidia.com/tesla/525.105.17/NVIDIA-Linux-x86_64-525.105.17.run

And there's a fabric-manager-525 of the required version in the Ubuntu packages, with the original tars. I wonder where they got it from...
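
To make the version constraint concrete, here is a minimal sketch that derives a driver URL from a version and checks that Fabric Manager and the driver agree. tesla_driver_url and fm_matches_driver are hypothetical helpers. The URL layout is generalized from the single 525.105.17 link quoted above, so treating other versions the same way is an assumption.

```shell
#!/bin/sh
# Build the x86_64 driver URL by substituting the version into the one URL
# quoted above. That other versions follow the same layout is an assumption.
tesla_driver_url() {
  echo "https://us.download.nvidia.com/tesla/$1/NVIDIA-Linux-x86_64-$1.run"
}

# Succeeds only when the driver and Fabric Manager versions are identical,
# which is what the Fabric Manager error message asks for.
# $1: driver version, $2: Fabric Manager version.
fm_matches_driver() {
  [ "$1" = "$2" ]
}
```

With the versions from the error message, fm_matches_driver 530.41.03 525.85.12 fails, which is the mismatch being reported.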

That being said, I think it is rather undesirable to ship Talos with a mandatory, predefined version of even an OSS driver. Consider these examples:

Machines run "dated" GPUs which require driver 470.X.X -- this driver is still well supported by Nvidia and the only limit is it's locked with CUDA 11.4. A classic example would be a popular Tesla K80.
A newer version of an upstream 525.X.X driver contains a bug and there's a need of a rollback, but it wouldn't make sense to rollback Talos version as well and be blocked by Nvidia in such case.

I would say, it basically means that maintaining a custom talos-installer would be safer and more versatile, given everybody has different supported hardware.

I think people mostly deploy k8s on GPU nodes for ML, and if that's the case, the only thing that matters is the supported CUDA version; the driver itself just needs to be stable enough.

Reboot loop with error 'WARNING: unrecognized segment type thin-pool'

Booting or rebooting a cluster node with a disk that includes thin-pool / thin logical volumes causes the boot process to fail, triggering a reboot loop.

Error Message

...
Failed to activate logical volumes, exit code 5
...
WARNING: unrecognized segment type thin-pool
WARNING: unrecognized segment type thin

(I'm running this in proxmox so don't have a convenient way to copy/paste the full output)

Environment Setup
Cluster Version: "v1.6.0-alpha.0-59-g0bd1bdd74"

Each node has 2 disks, one boot another for PV storage
drbd and dm-thin-pool kernel modules in the configuration

machine:
  install:
    extensions:
      - image: "ghcr.io/siderolabs/drbd:9.2.5-v1.6.0-alpha.0-17-g0ba9f81"
  kernel:
    modules:
      - name: drbd
        parameters:
          - usermode_helper=disabled
      - name: drbd_transport_tcp
      - name: dm-thin-pool

First boot and cluster initialization works fine, this is prior to /dev/sdb disk being initialized
Install https://github.com/piraeusdatastore/piraeus-operator
Configure piraeus to use drbd on /dev/sdb and provision as a thin-pool

As long as the system remains online, I can provision thin-pool volumes and everything works as expected. On reboot, the system fails to load.

Possible Fix
From what I gather, this error occurs because the LVM2 build in Talos, which is used during boot to mount the local disks, is not compiled with thin-pool support.
https://listman.redhat.com/archives/linux-lvm/2013-July/022321.html

Setting https://github.com/siderolabs/pkgs/blob/main/lvm2/pkg.yaml#L26 to --with-thin=internal may fix the issue but I'm not exactly sure how this works if a user doesn't enable the dm-thin-pool module.
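
One way to confirm whether a given LVM2 build knows about thin pools is to look at the segment types it reports. This is a minimal sketch that assumes the output of lvm segtypes is a plain list with one segment type per line. has_thin_pool_segtype is a hypothetical helper that reads that list on stdin.

```shell
#!/bin/sh
# Reads a list of segment types on stdin, one per line (leading and
# trailing whitespace allowed), and succeeds if "thin-pool" is among them.
has_thin_pool_segtype() {
  sed 's/^[[:space:]]*//; s/[[:space:]]*$//' | grep -x 'thin-pool' >/dev/null
}
```

On a host with the lvm binary available, lvm segtypes | has_thin_pool_segtype would then report whether the rebuilt package picked up thin support.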

CVE-2024-21626 "Leaky Vessels" container escape vulnerability in runc ≤ 1.1.11

https://snyk.io/blog/cve-2024-21626-runc-process-cwd-container-breakout/

Snyk has discovered a vulnerability in all versions of runc <=1.1.11, as used by the Docker engine, along with other containerization technologies such as Kubernetes. Exploitation of this issue can result in container escape to the underlying host OS, either through executing a malicious image or building an image using a malicious Dockerfile or upstream image (i.e., when using FROM). This issue has been assigned the CVE-2024-21626.
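
A minimal sketch of checking a runc version string against the affected range (1.1.11 and older) follows. runc_is_affected is a hypothetical helper. It compares only the numeric major.minor.patch fields and ignores pre-release suffixes, which is a simplification.

```shell
#!/bin/sh
# Succeeds if the given runc version (with or without a leading "v")
# is 1.1.11 or older, the range named in the advisory above.
runc_is_affected() {
  v=${1#v}
  old_ifs=$IFS
  IFS=.
  set -- $v                 # split "1.1.11" into positional parameters
  IFS=$old_ifs
  major=${1:-0}
  minor=${2:-0}
  patch=${3:-0}
  patch=${patch%%[!0-9]*}   # drop suffixes such as "-rc1"
  patch=${patch:-0}
  [ "$major" -lt 1 ] && return 0
  [ "$major" -gt 1 ] && return 1
  [ "$minor" -lt 1 ] && return 0
  [ "$minor" -gt 1 ] && return 1
  [ "$patch" -le 11 ]
}
```

For example, runc_is_affected v1.1.11 succeeds while runc_is_affected v1.1.12 does not.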

Cannot run Neuvector on Talos due to missing "grep"

Hi team.

What are the chances of having "grep", "pgrep" and "sed" added to the Talos Linux rootfs? Neuvector (a security product bought by SUSE) seems to need these to exist on the Kubernetes host for its Enforcer process to run.

Related issue I have raised with Neuvector: neuvector/neuvector#541

Cheers.
Shaun.

Enable CONFIG_CRYPTO_USER_API_HASH

This is required by cilium: https://docs.cilium.io/en/v1.7/install/system_requirements/#linux-kernel

Drop `-pkg` suffixed images for in-tree kernel modules

Drop -pkg suffixed images for in-tree kernel modules

Extensions repo can base on the kernel image and just copy out files.

Enable Bluetooth inside kernel?

Currently Bluetooth drivers are not enabled in the kernel.

There are Kubernetes use cases, such as Home Assistant, that need to communicate with Bluetooth devices.

https://www.kernelconfig.io/config_bt

siderolabs/talos#5609 (reply in thread)

support podman with pkgs

The Makefile currently only supports builds based on docker. Is there a way to use podman instead?

Add multipath-tools

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

Open

These updates have all been created already. Click a checkbox below to force a retry/rebase of any.

chore: update releases (containerd/containerd, containernetworking/plugins, git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git, git://sourceware.org/git/lvm2.git, https://github.com/ipxe/ipxe.git, nvidia/open-gpu-kernel-modules)
chore: update dependency nvidia/open-gpu-kernel-modules to v550

Detected dependencies

github-actions

.github/workflows/ci.yaml

actions/checkout v4

docker/setup-buildx-action v3

docker/login-action v3

actions/github-script v7

crazy-max/ghaction-github-release v2

actions/checkout v4

docker/setup-buildx-action v3

moby/buildkit v0.13.2

moby/buildkit v0.13.2

.github/workflows/slack-notify.yaml

slackapi/slack-github-action v1

.github/workflows/weekly.yaml

actions/checkout v4

docker/setup-buildx-action v3

moby/buildkit v0.13.2

regex

Pkgfile

containernetworking/plugins v1.4.1

containerd/containerd v1.7.16@83031836b2cf55637d7abf847b17134c51b38e53

git://git.kernel.org/pub/scm/utils/cryptsetup/cryptsetup.git 2.7.2

dosfstools/dosfstools 4.2

LINBIT/drbd 9.2.9

eudev-project/eudev v3.2.14

flannel-io/cni-plugin v1.2.0@6464faacf5c00e25321573225d74638455ef03a0

https://github.com/google/gasket-driver.git 5815ee3908a46a415aac616ac7b9aedcb98a504c

git://git.savannah.gnu.org/grub.git 2.12

ipmitool/ipmitool 1_8_19

git://git.netfilter.org/iptables 1.8.10

https://github.com/ipxe/ipxe.git d7e58c5a812988c341ec4ad19f79faf067388d58

git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git 6.6.30

git://git.kernel.org/pub/scm/utils/kernel/kmod/kmod.git 32

https://pagure.io/libaio.git 0.3.113

benhoyt/inih 58

json-c/json-c 0.17

rpm-software-management/popt 1.19

seccomp/libseccomp 2.5.5

git://git.liburcu.org/userspace-rcu.git 0.14.0

git://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git 20240513

git://sourceware.org/git/lvm2.git 2_03_22

git://git.musl-libc.org/musl 1.2.5

nvidia/open-gpu-kernel-modules 535.129.03

git://git.openssl.org/openssl.git 3.3.0

opencontainers/runc v1.1.12@51d5e94601ceffbbd85688df1c928ecccbfa4685

git://repo.or.cz/socat.git 1.8.0.0

git://git.kernel.org/pub/scm/boot/syslinux/syslinux.git 6.03

git://git.kernel.org/pub/scm/utils/util-linux/util-linux.git 2.40.1

git://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git 5.18.0

openzfs/zfs 2.2.4

Pkgfile

siderolabs/bldr v0.3.0

Check this box to trigger a request for Renovate to run again on this repository

Add dxgkrnl as a kernel module

https://github.com/MBRjun/dxgkrnl-dkms-lts already has a nice DKMS setup that we can take inspiration from for how to build it. It should basically work the same way we build the nvidia modules.

compilation warnings/errors running `make kernel PLATFORM=linux/arm64`

Howdie, I hope you are well

Thanks so much for sharing this project, it has been super useful and interesting to me so far <3

I've checked out dcc0311, and run docker buildx create --platform linux/amd64,linux/arm64 --use && make kernel PLATFORM=linux/arm64

This is on an up-to-date Archlinux machine: uname -a

Linux HOSTNAME 5.18.14-arch1-1 #1 SMP PREEMPT_DYNAMIC Sat, 23 Jul 2022 11:46:17 +0000 x86_64 GNU/Linux

I believe this is the part of the logs that shows the compilation error:

...
#0 334.5 arch/arm64/kernel/signal.c: In function 'setup_rt_frame':
#0 334.5 arch/arm64/kernel/signal.c:785:35: error: invalid use of undefined type 'struct rt_sigframe'
#0 334.5   785 |         __put_user_error(0, &frame->uc.uc_flags, err);
#0 334.5       |                                   ^~
#0 334.5 ./arch/arm64/include/asm/uaccess.h:406:22: note: in definition of macro '__put_user_error'
#0 334.5   406 |         __typeof__(*(ptr)) __user *__p = (ptr);                         \
#0 334.5       |                      ^~~
#0 334.5 arch/arm64/kernel/signal.c:785:35: error: invalid use of undefined type 'struct rt_sigframe'
#0 334.5   785 |         __put_user_error(0, &frame->uc.uc_flags, err);
#0 334.5       |                                   ^~
#0 334.5 ./arch/arm64/include/asm/uaccess.h:406:43: note: in definition of macro '__put_user_error'
#0 334.5   406 |         __typeof__(*(ptr)) __user *__p = (ptr);                         \
#0 334.5       |                                           ^~~
#0 334.5 arch/arm64/kernel/signal.c:786:38: error: invalid use of undefined type 'struct rt_sigframe'
#0 334.5   786 |         __put_user_error(NULL, &frame->uc.uc_link, err);
#0 334.5       |                                      ^~
#0 334.5 ./arch/arm64/include/asm/uaccess.h:406:22: note: in definition of macro '__put_user_error'
#0 334.5   406 |         __typeof__(*(ptr)) __user *__p = (ptr);                         \
#0 334.5       |                      ^~~
#0 334.5 arch/arm64/kernel/signal.c:786:38: error: invalid use of undefined type 'struct rt_sigframe'
#0 334.5   786 |         __put_user_error(NULL, &frame->uc.uc_link, err);
#0 334.5       |                                      ^~
#0 334.5 ./arch/arm64/include/asm/uaccess.h:406:43: note: in definition of macro '__put_user_error'
#0 334.5   406 |         __typeof__(*(ptr)) __user *__p = (ptr);                         \
#0 334.5       |                                           ^~~
#0 334.5 ./arch/arm64/include/asm/uaccess.h:396:40: warning: initialization of 'int' from 'void *' makes integer from pointer without a cast [-Wint-conversion]
#0 334.5   396 |         __typeof__(*(ptr)) __rpu_val = (x);                             \
#0 334.5       |                                        ^
#0 334.5 ./arch/arm64/include/asm/uaccess.h:410:17: note: in expansion of macro '__raw_put_user'
#0 334.5   410 |                 __raw_put_user((x), __p, (err));                        \
#0 334.5       |                 ^~~~~~~~~~~~~~
#0 334.5 arch/arm64/kernel/signal.c:786:9: note: in expansion of macro '__put_user_error'
#0 334.5   786 |         __put_user_error(NULL, &frame->uc.uc_link, err);
#0 334.5       |         ^~~~~~~~~~~~~~~~
#0 334.5 arch/arm64/kernel/signal.c:788:38: error: invalid use of undefined type 'struct rt_sigframe'
#0 334.5   788 |         err |= __save_altstack(&frame->uc.uc_stack, regs->sp);
#0 334.5       |                                      ^~
#0 334.5 arch/arm64/kernel/signal.c:793:59: error: invalid use of undefined type 'struct rt_sigframe'
#0 334.5   793 |                         err |= copy_siginfo_to_user(&frame->info, &ksig->info);
#0 334.5       |                                                           ^~
#0 334.5 arch/arm64/kernel/signal.c:794:62: error: invalid use of undefined type 'struct rt_sigframe'
#0 334.5   794 |                         regs->regs[1] = (unsigned long)&frame->info;
#0 334.5       |                                                              ^~
#0 334.5 arch/arm64/kernel/signal.c:795:62: error: invalid use of undefined type 'struct rt_sigframe'
#0 334.5   795 |                         regs->regs[2] = (unsigned long)&frame->uc;
#0 334.5       |                                                              ^~
#0 334.7 arch/arm64/kernel/signal.c: In function 'sigframe_size':
#0 334.7 arch/arm64/kernel/signal.c:86:1: error: control reaches end of non-void function [-Werror=return-type]
#0 334.7    86 | }
#0 334.7       | ^
#0 335.0 cc1: some warnings being treated as errors
#0 335.0 make[2]: *** [scripts/Makefile.build:289: arch/arm64/kernel/signal.o] Error 1
...
#0 358.2 make[1]: *** [scripts/Makefile.build:552: arch/arm64/kernel] Error 2
...
#0 399.8 make: *** [Makefile:1893: arch/arm64] Error 2
...
------
error: failed to solve: process "/dev/.buildkit_qemu_emulator /toolchain/bin/bash -c set -eou pipefail\ncd /src\n\nmake -j $(nproc)\nmake -j $(nproc) modules\n\nif [[ \"${ARCH}\" == \"arm64\" ]]; then\n  echo \"Compiling device-tree blobs\"\n  make -j $(nproc) dtbs\nfi\n" did not complete successfully: exit code: 2
make[2]: *** [Makefile:79: target-kernel] Error 1

The same commit works just fine for make kernel PLATFORM=linux/amd64

Cilium now requires CONFIG_NETFILTER_XT_MATCH_SOCKET

As per their instructions, Cilium now requires the XT socket match capability for fast Layer7 service.

kube-proxy fails when in `proxy-mode=nftables` mode

The nftables kube-proxy backend is planned to one day be the default backend for kube-proxy. Support was added in Kubernetes 1.29.

Upon adding --feature-gates=NFTablesProxyMode=true --proxy-mode=nftables to kube-proxy, it fails to add any rules:

│     /dev/stdin:892:87-100: Error: Could not process rule: No such file or directory
│     add rule ip kube-proxy external-3BXM2ZZ4-haproxy/external-kubernetes-ingress/tcp/http fib saddr type local jump mark-for-masquerade comment "masquerade local traffic" 
│                                                                                           ^^^^^^^^^^^^^^ 
│     /dev/stdin:893:87-100: Error: Could not process rule: No such file or directory

I believe this is due to these unset kernel variables:

CONFIG_NFT_FIB_IPV4=m
CONFIG_NFT_FIB_IPV6=m
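
The two options above can be checked against a kernel config file before and after a change. This is a minimal sketch in which config_enabled and missing_options are hypothetical helpers and the config file path is supplied by the caller.

```shell
#!/bin/sh
# Succeeds if OPTION ($2) is set to y or m in the kernel config file ($1).
config_enabled() {
  grep -Eq "^$2=(y|m)\$" "$1"
}

# Print every option from the argument list that is not enabled.
# $1: kernel config file, remaining arguments: option names.
missing_options() {
  cfg=$1
  shift
  for opt in "$@"; do
    config_enabled "$cfg" "$opt" || echo "$opt"
  done
}
```

For example, missing_options path/to/config CONFIG_NFT_FIB_IPV4 CONFIG_NFT_FIB_IPV6 prints whichever of the two is still unset, and prints nothing once both are y or m.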

N.B. nftables masquerade was recently added to flannel.

You can read more about why and what in the KEP: https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/3866-nftables-proxy/README.md

enable CONFIG_EFI_DISABLE_PCI_DMA to improve boot security from malicious PCI hardware

Enable CONFIG_EFI_DISABLE_PCI_DMA to improve boot security from malicious PCI hardware

From the author Matthew Garrett:

Add an option to disable the busmaster bit in the control register on
all PCI bridges before calling ExitBootServices() and passing control
to the runtime kernel. System firmware may configure the IOMMU to prevent
malicious PCI devices from being able to attack the OS via DMA. However,
since firmware can't guarantee that the OS is IOMMU-aware, it will tear
down IOMMU configuration when ExitBootServices() is called. This leaves
a window between where a hostile device could still cause damage before
Linux configures the IOMMU again.

A caution from the author:

This option may cause failures with some poorly behaved hardware and
should not be enabled without testing. The kernel commandline options
"efi=disable_early_pci_dma" or "efi=no_disable_early_pci_dma" may be
used to override the default. Note that PCI devices downstream from PCI
bridges are disconnected from their drivers first, using the UEFI
driver model API, so that DMA can be disabled safely at the bridge
level.

More info about DMA attacks
Source of quotes: lore
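
The interaction between the build-time default and the two command-line overrides quoted above can be sketched as follows. This is only an illustration of the override logic, not the kernel's implementation: the function name `early_pci_dma_disabled` is hypothetical, and the rule that the last matching option wins is an assumption made for this sketch.

```python
def early_pci_dma_disabled(cmdline, config_default):
    """Resolve whether early PCI DMA is disabled for a given kernel command line.

    config_default stands for the build-time CONFIG_EFI_DISABLE_PCI_DMA setting.
    """
    disabled = config_default
    for param in cmdline.split():
        if not param.startswith("efi="):
            continue
        # The efi= parameter carries a comma-separated list of options.
        for option in param[len("efi="):].split(","):
            if option == "disable_early_pci_dma":
                disabled = True
            elif option == "no_disable_early_pci_dma":
                # Assumption for this sketch: the last matching option wins.
                disabled = False
    return disabled

# The option is enabled at build time but overridden at boot.
print(early_pci_dma_disabled("console=ttyS0 efi=no_disable_early_pci_dma", True))
```

This shows why enabling the option by default still leaves an escape hatch for hardware that misbehaves: the default can be reverted per machine from the command line without rebuilding the kernel.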

Enable Intel Management Engine Interface (MEI)

It looks like the Intel Management Engine Interface (MEI) is required to use Intel Arc. Without it, the i915 firmware does not work because the HuC firmware fails to load.

See:

jellyfin/jellyfin#9588

uhthomas@6a83361

https://gitlab.freedesktop.org/drm/intel/-/issues/7732

build Mellanox driver as a module

Build fails with unknown flag: --target

Trying to build with the instructions for the NVIDIA GPU module:

make kernel nonfree-kmod-nvidia PLATFORM=linux/amd64 PUSH=true

fails almost immediately with the error:

[I] ➜  make kernel nonfree-kmod-nvidia PLATFORM=linux/amd64 PUSH=true

make[1]: Entering directory '/home/user/code/siderolabs/pkgs'
make[2]: Entering directory '/home/user/code/siderolabs/pkgs'
unknown flag: --target
See 'docker --help'.

Enable USB Attached SCSI (UAS) driver

The UAS driver allows for much better read and write performance and overcomes shortcomings in the bulk only transfer mode around command queueing and streaming.

I believe it is as simple as enabling https://github.com/siderolabs/pkgs/blob/43451e68a0ddf634b90c7c12cca9437faa52d183/kernel/build/config-amd64#L4152C9-L4152C9 and I will be trying to build a custom Talos image with this enabled.

You can confirm the driver is loaded with `dmesg | grep usbcore`, which should show a line like:

usbcore: registered new interface driver uas
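
The same check can be scripted against captured kernel log text. This is a minimal sketch: the function name `uas_registered` is hypothetical, and the timestamps in the sample log are made up for illustration.

```python
def uas_registered(dmesg_text):
    """Return True if the kernel log reports that the uas interface driver registered."""
    needle = "usbcore: registered new interface driver uas"
    # Match on the end of each line so a longer driver name starting with "uas" is not counted.
    return any(line.rstrip().endswith(needle) for line in dmesg_text.splitlines())

# Hypothetical log excerpt; real input would be the output of `dmesg`.
sample_log = """\
[    1.234567] usbcore: registered new interface driver usb-storage
[    1.345678] usbcore: registered new interface driver uas
"""

print(uas_registered(sample_log))
```

A log that only shows usb-storage registering would make this return False, which suggests the device is falling back to bulk-only transport.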

enable CONFIG_NET_VRF in kernel

We have use cases that require VRF for advanced networking setups.

https://www.kernel.org/doc/html/latest/networking/vrf.html

VRFs allow the kernel to have multiple routing tables. With VRFs, we can logically isolate interfaces and restrict traffic flowing between them.

Without it, the following, for example, does not work:

# ip link add vrf-blue type vrf table 10
Error: Unknown device type.