
Race condition when scripting kubeadm deb installs, kubelet initializes normally before it uses the kubeadm dropin (about release, CLOSED, 29 comments)

errordeveloper avatar errordeveloper commented on July 24, 2024
Race condition when scripting kubeadm deb installs, kubelet initializes normally before it uses the kubeadm dropin

from release.

Comments (29)

dgoodwin avatar dgoodwin commented on July 24, 2024 1

Yeah it's the rpm for sure, it even claims ownership of it; not sure how this was missed. Regardless, given the complaints about cloud providers requiring /etc/kubernetes/cloud-config.json, I'm going to make both the pre-flight checks and reset stop assuming ownership of all of /etc/kubernetes and instead just clean out manifests, pki, and delete the admin/kubelet.conf files.
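The targeted cleanup described above could be sketched roughly as follows (illustrative only; the real logic lives in kubeadm's Go code, and the directory argument exists only so the sketch can be exercised against a scratch directory):

```shell
#!/bin/sh
# Sketch of a reset that removes only kubeadm-managed files instead of
# assuming ownership of all of /etc/kubernetes. The directory argument
# is for illustration/testing; the real path is /etc/kubernetes.
kubeadm_reset_etc() {
    kube_dir="${1:-/etc/kubernetes}"
    # Directories kubeadm itself populates.
    rm -rf "$kube_dir/manifests" "$kube_dir/pki"
    # Kubeconfig files kubeadm writes.
    rm -f "$kube_dir/admin.conf" "$kube_dir/kubelet.conf"
    # Anything else in the directory (e.g. cloud-config.json) is left alone.
}
```

Run against a scratch directory, this removes manifests, pki, and the two kubeconfig files while leaving unrelated files such as cloud-config.json in place.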


dgoodwin avatar dgoodwin commented on July 24, 2024 1

This was part of how kubeadm was designed to work: we need to configure kubelet before it can fully launch, services tend to start on install in Debian, and we didn't really want kubeadm execution to do things specific to systemd mid-way through. There was a lot of discussion, but we landed at crash looping while waiting for our config to appear. That is the behavior out of the box with the packages; if someone is rolling their own installation but trying to use kubeadm, they could indeed start hitting problems like this. But IMO we do kind of need to configure the system and systemd in specific ways for a kubeadm opinionated deployment, and the distro packages were a big part of that.

Perhaps we could do an additional pre-flight check ensuring kubelet is crash looping and not running however?
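Such a check could be sketched like this (hypothetical; kubeadm's actual preflight checks are written in Go, and the state strings are the ones systemd reports, e.g. via systemctl show kubelet -p ActiveState -p SubState):

```shell
#!/bin/sh
# Hypothetical preflight helper: given the ActiveState and SubState that
# systemd reports for kubelet.service, decide whether the unit looks like
# it is crash looping (restart holdoff) rather than fully running.
kubelet_is_crashlooping() {
    active_state="$1"   # e.g. "activating", "active", "failed"
    sub_state="$2"      # e.g. "auto-restart", "running"
    case "$active_state/$sub_state" in
        activating/auto-restart) return 0 ;;  # waiting to restart: crash loop
        failed/*)                return 0 ;;  # exited and gave up: not running
        *)                       return 1 ;;  # anything else, incl. active/running
    esac
}
```

A preflight check along these lines could then refuse to proceed when the unit is in active/running state, since that suggests kubelet started without the kubeadm dropin.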


pesho avatar pesho commented on July 24, 2024

The current check (directory is not empty) is too broad; I've had to skip preflight checks because of it. A better approach would be to check only for the presence of files which kubeadm generates, and ignore other files within these directories.


errordeveloper avatar errordeveloper commented on July 24, 2024

@dgoodwin I'm seeing this on CentOS also, I think it's kubelet and not the packages... We should probably allow empty directory for the time being.

@pesho I see what you are saying, but I don't see a good reason to do this right now, as files we manage tend to evolve.


dgoodwin avatar dgoodwin commented on July 24, 2024

@marun hit this. We thought it was newer versions of rpm causing the difference between Fedora, where the problem was surfacing, and CentOS, where it was not.

Agreed, we probably should be more tolerant in pre-flight checks. Given a point raised on Slack today that cloud-config.json by default must live in /etc/kubernetes, this too causes a problem, so I'm tempted to agree with @pesho that we should be checking more specifically for the files and directories we will write.

@errordeveloper can you paste output of rpm --version and cat /etc/redhat-release for me?


dgoodwin avatar dgoodwin commented on July 24, 2024

Will try to look into this first thing tomorrow.


dgoodwin avatar dgoodwin commented on July 24, 2024

@errordeveloper I can reproduce, but it does appear to be the rpm and not kubelet. I.e. I can yum install and /etc/kubernetes/manifests appears; if I remove it and start kubelet, it does not reappear. Re-install the rpms and it's back. All tests done with 1.5.0-1.alpha.1.409.714f816a349e79 from unstable.


errordeveloper avatar errordeveloper commented on July 24, 2024

@dgoodwin as you asked:

[vagrant@k8s1 ~]$ rpm --version
RPM version 4.11.3
[vagrant@k8s1 ~]$ cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core) 
[vagrant@k8s1 ~]$ 


errordeveloper avatar errordeveloper commented on July 24, 2024

@dgoodwin I also looked at the kubelet code, and it doesn't appear to call mkdir for the manifest path; it does for some other things, but not this.


errordeveloper avatar errordeveloper commented on July 24, 2024

SGTM



luxas avatar luxas commented on July 24, 2024

@dgoodwin Great if you fix the preflight checks, be sure to ping me on it when ready!


luxas avatar luxas commented on July 24, 2024

This will be fixed with kubernetes/kubernetes#35632


dgoodwin avatar dgoodwin commented on July 24, 2024

PR merged, this can be closed now.


luxas avatar luxas commented on July 24, 2024

Thanks


codablock avatar codablock commented on July 24, 2024

I'm currently working with a custom built master version of kubeadm and the pre-flight checks still fail for /var/lib/kubelet.

When doing a "kubeadm reset", the contents of /var/lib/kubelet disappear as expected.
"kubeadm init" then complains that kubelet is not running (it was running before the reset), which requires me to start the kubelet service manually. After starting the kubelet service, /var/lib/kubelet contains some empty directories, and thus "kubeadm init" fails again.

Should I open a new issue?


luxas avatar luxas commented on July 24, 2024

Just wait some hours; we will probably release the new version today, which has these fixes in it.


codablock avatar codablock commented on July 24, 2024

@luxas I'm using a custom built kubernetes master from today. Is the kubeadm tool developed in a different repository? As kubernetes/kubernetes#35632 got merged a few days ago, I'd expect the fix to be in my local build already?


codablock avatar codablock commented on July 24, 2024

Some additional info: I'm on CentOS 7.2. Kubelet is also from latest master.

Contents of /var/lib/kubelet after service start:
[root@ma-kub8ms0 devops]# find /var/lib/kubelet/
/var/lib/kubelet/
/var/lib/kubelet/pods
/var/lib/kubelet/plugins


dgoodwin avatar dgoodwin commented on July 24, 2024

/var/lib/kubelet was still expected to be empty; if a crash looping kubelet is now creating those directories, that will indeed still trigger a pre-flight check error.

However I cannot reproduce, I compiled kubelet off master this morning:

(root@centos1 ~) $ kubeadm reset                          
Running pre-flight checks
Stopping the kubelet service...
Unmounting directories in /var/lib/kubelet...
Deleting the stateful directories: [/var/lib/kubelet /var/lib/etcd /etc/kubernetes]
Stopping all running docker containers...
failed to stop the running containers
(root@centos1 ~) $ ll /var/lib/kubelet
ls: cannot access /var/lib/kubelet: No such file or directory
(root@centos1 ~) $ systemctl start kubelet
(root@centos1 ~) $ journalctl -fu kubelet
-- Logs begin at Tue 2016-11-01 08:14:10 EDT. --
Nov 01 08:16:04 centos1.aos.example.com systemd[1]: kubelet.service: main process exited, code=exited, status=1/FAILURE
Nov 01 08:16:04 centos1.aos.example.com systemd[1]: Unit kubelet.service entered failed state.
Nov 01 08:16:04 centos1.aos.example.com systemd[1]: kubelet.service failed.
Nov 01 08:16:14 centos1.aos.example.com systemd[1]: kubelet.service holdoff time over, scheduling restart.
Nov 01 08:16:14 centos1.aos.example.com systemd[1]: Started Kubernetes Kubelet Server.
Nov 01 08:16:14 centos1.aos.example.com systemd[1]: Starting Kubernetes Kubelet Server...
Nov 01 08:16:14 centos1.aos.example.com kubelet[3742]: error: failed to run Kubelet: invalid kubeconfig: stat /etc/kubernetes/kubelet.conf: no such file or directory
Nov 01 08:16:14 centos1.aos.example.com systemd[1]: kubelet.service: main process exited, code=exited, status=1/FAILURE
Nov 01 08:16:14 centos1.aos.example.com systemd[1]: Unit kubelet.service entered failed state.
Nov 01 08:16:14 centos1.aos.example.com systemd[1]: kubelet.service failed.
^C
(root@centos1 ~) $ ll /var/lib/kubelet      
ls: cannot access /var/lib/kubelet: No such file or directory

Init then works fine.

Are you using the standard systemd config where we crash loop until kubeadm writes a config, or is your kubelet actually running?


codablock avatar codablock commented on July 24, 2024

kubelet was indeed running while I had these problems. At this time I did not have the kubeadm RPM package installed. Now I'm installing it before overwriting the kubeadm binary with my custom built binary.

After checking the systemd config as hinted by @dgoodwin, I assume the missing /etc/systemd/system/kubelet.service.d/10-kubeadm.conf is the cause, due to the RPM package not being installed. 10-kubeadm.conf makes the kubelet service crash loop, which makes the pre-flight checks work again.

Is the dependency on a crash looping kubelet service really a good thing? I can imagine other people also having problems figuring out what's wrong.
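For reference, the dropin in question works roughly like this (a simplified sketch; the exact flags in the packaged 10-kubeadm.conf have varied between releases):

```ini
# /etc/systemd/system/kubelet.service.d/10-kubeadm.conf (simplified sketch)
[Service]
# Point kubelet at the kubeconfig that kubeadm will write. Until that
# file exists, kubelet exits immediately and systemd keeps restarting
# it, producing the intended crash loop.
Environment="KUBELET_KUBECONFIG_ARGS=--kubeconfig=/etc/kubernetes/kubelet.conf"
# Clear the packaged ExecStart, then replace it.
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS
```

Without this dropin, kubelet starts with no arguments, stays running, and creates the directories under /var/lib/kubelet that then trip the preflight checks.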


codablock avatar codablock commented on July 24, 2024

Additional pre-flight checks would have helped me. So, thumbs up :)


jbeda avatar jbeda commented on July 24, 2024

We just hit this (@captainshar) when trying to script ubuntu. Using latest packages (v1.5.1) kubeadm failed due to empty directories under /var/kubelet. It is unclear what created those directories. No mention in the kubelet logs (it is crash looping as expected).

We will work around this by having the scripted stuff do a kubeadm reset before running kubeadm init. But it shouldn't be necessary.

We'll update if we get more details about what is going on and if we start seeing this more.


luxas avatar luxas commented on July 24, 2024

@jbeda It's a race condition with the debs when scripting the install, it seems.
Somehow the kubelet service has just enough time to start kubelet normally (/usr/sbin/kubelet with no params); that makes kubelet create its directories before the kubeadm-specific service file kicks in and makes kubelet crash loop.

@dgoodwin Did you ever find a solution to this?
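One workaround when scripting the deb install is to wait for the dropin to exist and then force kubelet through a daemon-reload and restart, so it cannot keep running with the no-argument command line. A sketch (the wait_for_file helper is illustrative, not part of any kubeadm tooling):

```shell
#!/bin/sh
# Illustrative workaround for the install race: wait until the kubeadm
# dropin exists, then reload systemd and restart kubelet so it picks up
# the dropin instead of continuing to run with no arguments.
wait_for_file() {
    path="$1"
    timeout="${2:-30}"   # seconds to wait before giving up
    i=0
    while [ ! -e "$path" ]; do
        i=$((i + 1))
        [ "$i" -ge "$timeout" ] && return 1
        sleep 1
    done
    return 0
}

# Usage in an install script (not executed here):
#   apt-get install -y kubelet kubeadm
#   wait_for_file /etc/systemd/system/kubelet.service.d/10-kubeadm.conf \
#       && systemctl daemon-reload \
#       && systemctl restart kubelet
```

After the restart, kubelet crash loops on the missing /etc/kubernetes/kubelet.conf as intended, and kubeadm init can proceed.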


luxas avatar luxas commented on July 24, 2024

(What we're talking about now is not the same issue as above, but since people have chosen this thread to talk about it, let's continue)


GheRivero avatar GheRivero commented on July 24, 2024

There is a "race condition" when kubelet starts: most of the time, it dies before doing anything because of the missing /etc/kubernetes/kubelet.conf (expected behaviour):

Jan 30 15:46:00 node3 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Jan 30 15:46:00 node3 kubelet[21771]: I0130 15:46:00.455213 21771 feature_gate.go:181] feature gates: map[]
Jan 30 15:46:00 node3 kubelet[21771]: error: failed to run Kubelet: invalid kubeconfig: stat /etc/kubernetes/kubelet.conf: no such file or directory
Jan 30 15:46:00 node3 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Jan 30 15:46:00 node3 systemd[1]: kubelet.service: Unit entered failed state.
Jan 30 15:46:00 node3 systemd[1]: kubelet.service: Failed with result 'exit-code'.

But other times, it gets further into initialization before dying:
Jan 30 11:24:02 node3 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Jan 30 11:24:02 node3 kubelet[3851]: I0130 11:24:02.420597 3851 feature_gate.go:181] feature gates: map[]
Jan 30 11:24:02 node3 kubelet[3851]: W0130 11:24:02.420652 3851 server.go:400] No API client: no api servers specified
Jan 30 11:24:02 node3 kubelet[3851]: I0130 11:24:02.494685 3851 docker.go:356] Connecting to docker on unix:///var/run/docker.sock
Jan 30 11:24:02 node3 kubelet[3851]: I0130 11:24:02.494703 3851 docker.go:376] Start docker client with request timeout=2m0s
Jan 30 11:24:02 node3 kubelet[3851]: E0130 11:24:02.499088 3851 cni.go:163] error updating cni config: No networks found in /etc/cni/net.d
Jan 30 11:24:02 node3 kubelet[3851]: I0130 11:24:02.504346 3851 manager.go:143] cAdvisor running in container: "/system.slice/kubelet.service"
Jan 30 11:24:02 node3 kubelet[3851]: W0130 11:24:02.534432 3851 manager.go:151] unable to connect to Rkt api service: rkt: cannot tcp Dial rkt api service: dial tcp [::1]:15
Jan 30 11:24:02 node3 kubelet[3851]: I0130 11:24:02.535897 3851 fs.go:117] Filesystem partitions: map[/dev/vda1:{mountpoint:/var/lib/docker/aufs major:253 minor:1 fsType:ext
Jan 30 11:24:02 node3 kubelet[3851]: I0130 11:24:02.536966 3851 manager.go:198] Machine: {NumCores:4 CpuFrequency:3491914 MemoryCapacity:8371175424 MachineID:861e2114926fbf9
Jan 30 11:24:02 node3 kubelet[3851]: I0130 11:24:02.537199 3851 manager.go:204] Version: {KernelVersion:4.4.0-59-generic ContainerOsVersion:Ubuntu 16.04.1 LTS DockerVersion:
Jan 30 11:24:02 node3 kubelet[3851]: W0130 11:24:02.541948 3851 container_manager_linux.go:205] Running with swap on is not supported, please disable swap! This will be a fa
Jan 30 11:24:02 node3 kubelet[3851]: W0130 11:24:02.542056 3851 server.go:669] No api server defined - no events will be sent to API server.
Jan 30 11:24:02 node3 kubelet[3851]: W0130 11:24:02.543399 3851 kubelet_network.go:69] Hairpin mode set to "promiscuous-bridge" but kubenet is not enabled, falling back to "
Jan 30 11:24:02 node3 kubelet[3851]: I0130 11:24:02.543417 3851 kubelet.go:477] Hairpin mode set to "hairpin-veth"
Jan 30 11:24:02 node3 kubelet[3851]: I0130 11:24:02.545391 3851 docker_manager.go:257] Setting dockerRoot to /var/lib/docker
Jan 30 11:24:02 node3 kubelet[3851]: I0130 11:24:02.545400 3851 docker_manager.go:260] Setting cgroupDriver to cgroupfs
Jan 30 11:24:02 node3 kubelet[3851]: I0130 11:24:02.546041 3851 server.go:770] Started kubelet v1.5.2
Jan 30 11:24:02 node3 kubelet[3851]: I0130 11:24:02.546103 3851 server.go:123] Starting to listen on 0.0.0.0:10250
Jan 30 11:24:02 node3 kubelet[3851]: E0130 11:24:02.546261 3851 kubelet.go:1145] Image garbage collection failed: unable to find data for container /
Jan 30 11:24:02 node3 kubelet[3851]: W0130 11:24:02.546547 3851 kubelet.go:1224] No api server defined - no node status update will be sent.
Jan 30 11:24:02 node3 kubelet[3851]: I0130 11:24:02.546918 3851 kubelet_node_status.go:204] Setting node annotation to enable volume controller attach/detach
Jan 30 11:24:02 node3 kubelet[3851]: E0130 11:24:02.547936 3851 kubelet.go:1634] Failed to check if disk space is available for the runtime: failed to get fs info for "runti
Jan 30 11:24:02 node3 kubelet[3851]: E0130 11:24:02.548080 3851 kubelet.go:1642] Failed to check if disk space is available on the root partition: failed to get fs info for
Jan 30 11:24:02 node3 kubelet[3851]: I0130 11:24:02.548794 3851 fs_resource_analyzer.go:66] Starting FS ResourceAnalyzer
Jan 30 11:24:02 node3 kubelet[3851]: I0130 11:24:02.548942 3851 status_manager.go:125] Kubernetes client is nil, not starting status manager.
Jan 30 11:24:02 node3 kubelet[3851]: I0130 11:24:02.549079 3851 kubelet.go:1714] Starting kubelet main sync loop.
Jan 30 11:24:02 node3 kubelet[3851]: I0130 11:24:02.549505 3851 kubelet.go:1725] skipping pod synchronization - [container runtime is down]
Jan 30 11:24:02 node3 kubelet[3851]: I0130 11:24:02.548983 3851 volume_manager.go:242] Starting Kubelet Volume Manager
Jan 30 11:24:02 node3 kubelet[3851]: I0130 11:24:02.551250 3851 factory.go:295] Registering Docker factory
Jan 30 11:24:02 node3 kubelet[3851]: W0130 11:24:02.551270 3851 manager.go:247] Registration of the rkt container factory failed: unable to communicate with Rkt api service:
Jan 30 11:24:02 node3 kubelet[3851]: I0130 11:24:02.551277 3851 factory.go:54] Registering systemd factory
Jan 30 11:24:02 node3 kubelet[3851]: I0130 11:24:02.551414 3851 factory.go:86] Registering Raw factory
Jan 30 11:24:02 node3 kubelet[3851]: I0130 11:24:02.553917 3851 manager.go:1106] Started watching for new ooms in manager
Jan 30 11:24:02 node3 kubelet[3851]: I0130 11:24:02.554957 3851 oomparser.go:185] oomparser using systemd
Jan 30 11:24:02 node3 kubelet[3851]: I0130 11:24:02.555505 3851 manager.go:288] Starting recovery of all containers
Jan 30 11:24:02 node3 systemd[1]: Stopping kubelet: The Kubernetes Node Agent...
Jan 30 11:24:02 node3 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Jan 30 11:24:02 node3 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Jan 30 11:24:02 node3 kubelet[3897]: I0130 11:24:02.657147 3897 feature_gate.go:181] feature gates: map[]
Jan 30 11:24:02 node3 kubelet[3897]: error: failed to run Kubelet: invalid kubeconfig: stat /etc/kubernetes/kubelet.conf: no such file or directory

It's in this case that it creates the /var/lib/kubelet/{plugins,pods} dirs, causing the kubeadm preflight checks to fail.


fejta-bot avatar fejta-bot commented on July 24, 2024

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale


fejta-bot avatar fejta-bot commented on July 24, 2024

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale


errordeveloper avatar errordeveloper commented on July 24, 2024

@luxas should we keep this open? if so, does it still belong to this repo?


fejta-bot avatar fejta-bot commented on July 24, 2024

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close


