Since 8390247 (OS-8181 lx getsockopt of SO_PROTOCOL fails) and its follow-up 462ccf4 (OS-8181 exposed fault in lx netlink bind/sendto) were integrated, modern Linux systems with systemd intermittently fail to boot (in some cases they fail more often than not)
When it fails, systemd hangs early:
NOTICE: Zone booting up]
Failed to read /proc/sys/fs/nr_open, ignoring: No such file or directory
<30>systemd[1]: systemd 245.4-4ubuntu3.1 running in system mode. (+PAM +AUDIT +S
ELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL
+XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hy
brid)
<30>systemd[1]: Detected virtualization container-other.
<30>systemd[1]: Detected architecture x86-64.
Welcome to Ubuntu 20.04 LTS!
<31>systemd[1]: sd-netlink: Failed to enable NETLINK_EXT_ACK option, ignoring: Invalid argument
<31>systemd[1]: Failed to bring loopback interface up: Connection timed out
strace on the systemd
process shows it in a loop calling poll()
and recvmsg(PEEK)
recvmsg(4, {msg_name=0x7fffffeff750, msg_namelen=128->0, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_PEEK|MSG_TRUNC) = 0
ppoll([{fd=4, events=POLLIN}], 1, {tv_sec=5, tv_nsec=0}, NULL, 8) = 0 (Timeout)
recvmsg(4, {msg_name=0x7fffffeff750, msg_namelen=128->0, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_PEEK|MSG_TRUNC) = 0
ppoll([{fd=4, events=POLLIN}], 1, {tv_sec=5, tv_nsec=0}, NULL, 8) = 0 (Timeout)
recvmsg(4, {msg_name=0x7fffffeff750, msg_namelen=128->0, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_PEEK|MSG_TRUNC) = 0
ppoll([{fd=4, events=POLLIN}], 1, {tv_sec=5, tv_nsec=0}, NULL, 8) = 0 (Timeout)
recvmsg(4, {msg_name=0x7fffffeff750, msg_namelen=128->0, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_PEEK|MSG_TRUNC) = 0
Extracting the stack trace (via mdb and /nap) shows:
0x7fffffeff4c8: libc.so.6`ppoll+0x4f
0x7fffffeff818: libsystemd-shared-245.so`socket_read_message+0x68
0x7fffffeff958: libsystemd-shared-245.so`close_nointr+0xb
I used dtrace to do some more delving here and fd 4 is a netlink socket. Modifying the lx init process to give me time to attach strace early shows that systemd sends three netlink messages in quick succession, and there are three replies.
sendto(4, " \0\0\0\24\0\5\0\1\0\0\0\0\0\0\0\2\10\200\376\1\0\0\0\10\0\2\0\177\0\0\1", 32, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 16)
sendto(4, ",\0\0\0\24\0\5\0\2\0\0\0\0\0\0\0\n\200\200\376\1\0\0\0\24\0\2\0\0\0\0\0"..., 44, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 16)
sendto(4, " \0\0\0\23\0\5\0\3\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0", 32, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 16)
ppoll([{fd=4, events=POLLIN}], 1, {tv_sec=4, tv_nsec=999790000}, NULL, 8)
recvmsg(4, {msg_namelen=12}, MSG_PEEK|MSG_TRUNC)
recvmsg(4, {msg_namelen=12}, MSG_TRUNC)
ppoll([{fd=4, events=POLLIN}], 1, {tv_sec=4, tv_nsec=999209000}, NULL, 8)
recvmsg(4, {msg_namelen=12}, MSG_PEEK|MSG_TRUNC)
recvmsg(4, {msg_namelen=12}, MSG_TRUNC)
ppoll([{fd=4, events=POLLIN}], 1, {tv_sec=4, tv_nsec=998680000}, NULL, 8)
recvmsg(4, {msg_namelen=12}, MSG_PEEK|MSG_TRUNC)
recvmsg(4, {msg_namelen=12}, MSG_TRUNC)
Yet when it fails to boot it does not seem happy with the replies, and continues to wait.
I used this dtrace script to investigate the messages going in each direction.
#!/usr/sbin/dtrace -FCs
fbt::lx_netlink_send:entry {
lxsock = (lx_netlink_sock_t *)arg0;
mp = (mblk_t *)arg1;
msg = (struct msghdr *)arg2;
hdr = (lx_netlink_hdr_t *)mp->b_rptr;
printf("---------------------------------------------------------\n");
printf("proto: %x, type: %x, seq: %x",
lxsock->lxns_proto, hdr->lxnh_type, hdr->lxnh_seq);
print(*hdr);
printf("\n");
}
fbt::lx_netlink_reply_done:entry {
r = (lx_netlink_reply_t *)arg0;
mp = (mblk_t *)r->lxnr_err;
self->rr = (lx_netlink_err_t *)mp->b_rptr;
}
fbt::lx_netlink_reply_sendup:entry {
reply = (lx_netlink_reply_t *)arg0;
mp = (mblk_t *)arg1;
mpl = (mblk_t *)arg2;
rr = (lx_netlink_err_t *)self->rr;
printf("\n");
print(*reply);
printf("\n");
print(*rr);
printf("\n");
}
Here's the first send and receive:
# CPU FUNCTION
0 -> lx_netlink_send ---------------------------------------------------------
proto: 0, type: 14, seq: 1
0 -> lx_netlink_reply_sendup
lx_netlink_reply_t {
lx_netlink_hdr_t lxnr_hdr = {
uint32_t lxnh_len = 0x20
uint16_t lxnh_type = 0x14
uint16_t lxnh_flags = 0x5
uint32_t lxnh_seq = 0x1
uint32_t lxnh_pid = 0
}
lx_netlink_sock_t *lxnr_sock = 0xfffffe16f39045d8
uint32_t lxnr_seq = 0
uint16_t lxnr_type = 0
mblk_t *lxnr_mp = 0
mblk_t *lxnr_err = 0
mblk_t *lxnr_mp1 = 0xfffffe17162cce20
int lxnr_errno = 0x7a
}
lx_netlink_err_t {
lx_netlink_hdr_t lxne_hdr = {
uint32_t lxnh_len = 0x24
uint16_t lxnh_type = 0x2
uint16_t lxnh_flags = 0x6b6c
uint32_t lxnh_seq = 0x1
uint32_t lxnh_pid = 0x3d21
}
int32_t lxne_errno = 0x7a
lx_netlink_hdr_t lxne_failed = {
uint32_t lxnh_len = 0x20
uint16_t lxnh_type = 0x14
uint16_t lxnh_flags = 0x5
uint32_t lxnh_seq = 0x1
uint32_t lxnh_pid = 0
}
}
That lxnh_flags
value of 0x6b6c
in the error header is a bit suspicious.
Looking at the netlink code, the error path does not set the flags
field so it is uninitialised memory.
Bit 1 (0x2) of that flags field indicates a multipart message. A hypothesis is that if the reply is flagged as multipart, systemd will wait for a NLMSG_DONE packet.
I wrote a smaller dtrace script to just show the flags in the reply messages and this definitely seems to be the case.
Reproducibly, if the reply messages have flags with bit 1 set, then the zone does not boot.
For example, this is a failed boot:
CPU ID FUNCTION:NAME
9 72240 lx_netlink_send:entry proto: 0, type: 14, seq: 1
9 72186 lx_netlink_reply_sendup:entry Flags: 6f72
9 72240 lx_netlink_send:entry proto: 0, type: 14, seq: 2
9 72186 lx_netlink_reply_sendup:entry Flags: 4143
9 72240 lx_netlink_send:entry proto: 0, type: 13, seq: 3
9 72186 lx_netlink_reply_sendup:entry Flags: 6176
with bit 1 set in each of the flags
and this one succeeded:
CPU ID FUNCTION:NAME
8 72240 lx_netlink_send:entry proto: 0, type: 14, seq: 1
8 72186 lx_netlink_reply_sendup:entry Flags: 7570
8 72240 lx_netlink_send:entry proto: 0, type: 14, seq: 2
8 72186 lx_netlink_reply_sendup:entry Flags: 6c61
8 72240 lx_netlink_send:entry proto: 0, type: 13, seq: 3
8 72186 lx_netlink_reply_sendup:entry Flags: 6261
with bit 1 clear
When there is a mix, the zone also does not boot - here's one where the first reply does not have the flag set.
3 72240 lx_netlink_send:entry proto: 0, type: 14, seq: 1
3 72186 lx_netlink_reply_sendup:entry Flags: 6261
3 72240 lx_netlink_send:entry proto: 0, type: 14, seq: 2
3 72186 lx_netlink_reply_sendup:entry Flags: 6f72
3 72240 lx_netlink_send:entry proto: 0, type: 13, seq: 3
3 72186 lx_netlink_reply_sendup:entry Flags: 7473