Comments (15)

CharleeSF commented on August 17, 2024

Also see the discussion here: https://forum.snapcraft.io/t/possible-to-run-scripts-as-root-on-startup-on-ubuntu-core/39580/7

And the comment by mborzecki1:

I briefly looked at the snap. The docker-proxy service should really be made a socket-activated service; then systemd starts listening on the socket and that nonsense of unlink/reuseaddr is gone. Unfortunately, you’d need to add code to hand over the file descriptor provided by systemd to socat. Other services which may use what docker-proxy provides should use after: [docker-proxy] in their declarations. This needs to be fixed in snap packaging, so there’s not much you can do, except for opening some PRs or filing bugs if you’re simply consuming the snap.

from iotedge.

damonbarry commented on August 17, 2024

Hi @CharleeSF I tried to repro locally and couldn't, and we haven't seen this in our internal testing. But I see from the forum thread that the problem is fairly well understood. @alexclewontin can you comment on the suggested approach in the forum thread?

CharleeSF commented on August 17, 2024

Hey @damonbarry, thanks for looking into this!

I am also not always able to reproduce it. The strange thing is that once I can reproduce it on a setup, it happens on every reboot, but not every setup has it.

Since the setup takes quite a lot of time I haven't tried to get a 100% reproduction scenario.

I have, however, also seen docker-proxy fail because /var/run/docker.sock is not available yet.

May I ask why edged doesn't talk to /var/run/docker.sock directly?

CharleeSF commented on August 17, 2024

Hey @damonbarry,

Is there any progress on this? I just wanted to mention that I also regularly see that docker seems slower/later in booting than azure-iot-edge, resulting in behavior like this:

2024-04-17T17:51:26Z systemd[1]: Started Service for snap application azure-iot-edge.aziot-edged.
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1825]: Making /var/run/iotedge if it does not exist
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1825]: Successfully made /var/run/iotedge if it did not exist
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: 2024-04-17T17:51:26Z [INFO] - Starting Azure IoT Edge Daemon
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: 2024-04-17T17:51:26Z [INFO] - Version - 1.4.33
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: 2024-04-17T17:51:26Z [INFO] - Obtaining Edge device provisioning data...
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: 2024-04-17T17:51:26Z [INFO] - Device is SnapeQemuAuto2 on iq-shared-0-8ccd7.azure-devices.net
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: 2024-04-17T17:51:26Z [INFO] - Initializing module runtime...
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: 2024-04-17T17:51:26Z [INFO] - Using runtime network id azure-iot-edge
2024-04-17T17:51:26Z azure-iot-edge.docker-proxy[1833]: 2024/04/17 17:51:26 socat[1833] E connect(5, AF=1 "/var/run/docker.sock", 22): No such file or directory
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: 2024-04-17T17:51:26Z [WARN] - container runtime error
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: Caused by:
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]:     channel closed

This further supports revising the docker-proxy behavior. (The snapcraft forum thread suggests a solution for this: waiting for /var/run/docker.sock to become available.)

Restarting the docker-proxy daemon fixes the issue, but as mentioned before, I don't want to have to do anything on my device for it to boot properly. I think a longer interval between docker-proxy restarts would also help. It seems to do 8 retries, but they all happen before docker has made the socket available.

I am testing azure-iot-edge with quite heavy workloads, maybe that's why docker boots slower?

alexclewontin commented on August 17, 2024

I can weigh in more eventually, but quickly hopping in to provide some context on why the proxy exists and iotedge doesn't talk to docker directly:

The issue is that in all-snap environments (i.e. Ubuntu Core) docker is provided as a snap, and there is no docker group, so you essentially cannot talk to docker.sock if you are running as UID != 0. aziot-edged runs as user snap_aziotedge and so the docker proxy runs as root in the context of the iotedge snap, but provides the proxy socket with snap_aziotedge ownership to let aziot-edged "escalate" its privileges here, without opening a massive hole that would allow any user to talk to the docker socket.

alexclewontin commented on August 17, 2024

My naive suggestion would be that edgelet/contrib/snap/socat.sh could try rm -f $SNAP_COMMON/docker-proxy.sock before listening, which might clean up the issue where it fails to listen because the file already exists. Maybe making aziot-edged after: docker-proxy would also help with the timing issue? Because it's a simple daemon, this may or may not be comprehensive. The right way to handle that would probably be to make docker-proxy a notify-type daemon and use systemd-notify to indicate readiness after successfully listening, but there are some issues with snap confinement/PID numbers when calling it from a shell script, so I'd have to play with that to see if I could make it work.
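
A minimal sketch of the first idea, assuming socat.sh can call a small helper before it starts listening. The function name remove_stale_socket is hypothetical and not taken from the actual script; only the rm -f on $SNAP_COMMON/docker-proxy.sock comes from the suggestion above.

```shell
#!/bin/sh
# Hypothetical helper, not part of the real socat.sh.
# Removes a leftover socket file so that a later listen on the same
# path does not fail because the file already exists.
remove_stale_socket() {
    stale_path="$1"
    # rm -f succeeds even when the file is already gone.
    rm -f -- "$stale_path"
}

# Assumed usage inside the proxy script, before socat is started:
#   remove_stale_socket "$SNAP_COMMON/docker-proxy.sock"
```

This only covers the stale-file failure; it does not help when /var/run/docker.sock itself is not there yet.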

CharleeSF commented on August 17, 2024

@alexclewontin

Ah, thanks for the explanation about why the proxy exists! :) That makes more sense now.

Also, for the last problem I've had, would adding something like this to socat.sh work?

docker_socket="/var/run/docker.sock"

while [ ! -e "$docker_socket" ]; do
    echo "Docker socket ($docker_socket) does not exist yet. Sleeping"
    sleep 1
done

I think that together with the after: docker-proxy might do the trick?

alexclewontin commented on August 17, 2024

My reservation there is that because docker-proxy is a simple daemon, systemd doesn't know the difference between the wait loop and actively listening on the socket, so even when it enters the wait loop systemd will consider the proxy ready and then try to start aziot-edged. I think I'd rather keep it so socat errors out, because then there's potential for systemd to catch the problem and wait on starting aziot-edged. However that's still a bit racy, depending on how quickly socat errors out vs how quickly systemd starts aziot-edged.

The systemd-notify approach would address that race by waiting for the script to actively affirm that it is indeed ready, after successfully listening on the socket.
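
As a rough sketch of that ordering, assuming docker-proxy were declared as a notify-type daemon: wait for the Docker socket with a bounded timeout, start listening, and only then report readiness. The helper name wait_for_path and the 30-second limit are hypothetical choices for illustration. The confinement/PID issues mentioned earlier may still prevent systemd-notify from working when called from a shell script, so the notify step appears only as a comment.

```shell
#!/bin/sh
# Hypothetical helper: wait until a path exists, giving up after a
# number of seconds so the service fails visibly instead of hanging.
wait_for_path() {
    wait_path="$1"
    wait_remaining="$2"
    while [ ! -e "$wait_path" ]; do
        if [ "$wait_remaining" -le 0 ]; then
            return 1    # timed out; the path never appeared
        fi
        sleep 1
        wait_remaining=$((wait_remaining - 1))
    done
    return 0
}

# Assumed usage in the proxy script (the timeout value is illustrative):
#   wait_for_path /var/run/docker.sock 30 || exit 1
#   ... start socat listening on the proxy socket ...
#   systemd-notify --ready   # only after the listener is up
```

With a simple daemon the same wait would still look "ready" to systemd, which is why the sketch only makes sense together with the notify-type declaration.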

CharleeSF commented on August 17, 2024

I see I see.

Maybe we can make the daemon retry interval a little longer? I think the retries currently happen very quickly, so systemd stops after a few attempts and the service doesn't recover.

(Currently this retry is also triggered because aziot-edged fails, but it doesn't recover because the socket only becomes available after systemd has given up on restarting it.)

CharleeSF commented on August 17, 2024

To give you an idea of the timeframe... I made a little daemon script that helps me recover from this.

The script:

#!/bin/bash
# This script removes a problematic file preventing snap from starting after reboot.
echo "Running remove-socket.sh"
ls -l /var/snap/azure-iot-edge/common/docker-proxy.sock
echo "Removing /var/snap/azure-iot-edge/common/docker-proxy.sock"
rm -f /var/snap/azure-iot-edge/common/docker-proxy.sock

docker_socket="/var/run/docker.sock"

while [ ! -e "$docker_socket" ]; do
    echo "Docker socket ($docker_socket) does not exist yet. Sleeping"
    sleep 1
done

echo "Sleeping to see if docker-proxy revives itself"
sleep 3

echo "Checking if docker-proxy is in failed state + restart everything if yes"
# Test grep's exit status directly in the if condition; -F matches the text literally.
if sudo snap logs azure-iot-edge.docker-proxy | tail -n1 | grep -F '"/var/run/docker.sock", 22): No such file or directory'; then
    echo "Restarting azure-iot-edge because it had failed due to no docker.sock"
    sudo snap restart azure-iot-edge
fi

The output after a reboot:

Apr 17 18:50:49 ubuntu remove-socket.sh[735]: Running remove-socket.sh
Apr 17 18:50:49 ubuntu remove-socket.sh[741]: ls: cannot access '/var/snap/azure-iot-edge/common/docker-proxy.sock': No such file or directory
Apr 17 18:50:49 ubuntu remove-socket.sh[735]: Removing /var/snap/azure-iot-edge/common/docker-proxy.sock
Apr 17 18:50:49 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:50 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:51 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:52 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:53 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:54 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:55 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:56 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:57 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:58 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:59 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:00 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:01 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:02 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:03 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:04 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:05 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:06 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:07 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:08 ubuntu remove-socket.sh[735]: Sleeping to see if docker-proxy revives itself
Apr 17 18:51:11 snape remove-socket.sh[735]: Checking if docker-proxy is in failed state + restart everything if yes
Apr 17 18:51:11 snape remove-socket.sh[2059]: 2024-04-17T18:51:08Z azure-iot-edge.docker-proxy[1668]: 2024/04/17 18:51:08 socat[1668] E connect(5, AF=1 "/var/run/docker.sock", 22): No such file or directory
Apr 17 18:51:11 snape remove-socket.sh[735]: Restarting azure-iot-edge because it had failed due to no docker.sock
Apr 17 18:51:11 snape remove-socket.sh[2074]: Restarted.

alexclewontin commented on August 17, 2024

Yeah, certainly. I think setting the retry interval on at least the proxy, if not both daemons, to 1 or more seconds would be a helpful first step.

CharleeSF commented on August 17, 2024

Should I make a PR for that, or do you guys prefer to do it?

KhazAkar commented on August 17, 2024

Bumping this topic with a comment, because the issue makes the azure-iot-edge snap unreliable and requires hacks to work around it.

damonbarry commented on August 17, 2024

@CharleeSF Can I get a little more info about the PR you're proposing?

I see a few different ideas proposed here:

  • Increase the retry interval on docker-proxy and aziot-edged
  • Set after: docker-proxy in aziot-edged
  • Inside docker-proxy, delete the proxy socket file before calling socat
  • Convert docker-proxy from daemon: simple to daemon: notify and use systemd-notify to indicate readiness

Which of these ideas (or others) would your PR contain?
