I believe there is an error condition currently in etcd-manager when performing a roll

DNS errors after upgrading rolling update of masters in cluster

mmerrill3 commented on July 27, 2024

DNS errors after upgrading rolling update of masters in cluster

from etcd-manager.

Comments (14)

mmerrill3 commented on July 27, 2024

I'm able to reproduce this pretty easily.

I have a cluster with three members, etcd-d, etcd-e, and etcd-f, running etcd 3.4.13.

Etcd-d is the leader.

If I terminate the instance, I keep quorum; that all works. The issue is when etcd-d comes back. The other two members of the cluster receive LeadershipNotifications from the new etcd-d, so the gRPC peer service is OK. But within etcd itself, on the peer connection used for health checks and heartbeats, I see the new connections from etcd-d being rejected. It looks like the rejection comes from the embedded gRPC service for etcd (not to be confused with the peer service for etcd-manager).

The messages I see from etcd-e and etcd-f indicate so:

2020-12-10 19:27:56.219405 I | embed: rejected connection from "10.203.20.185:41530" (error "tls: "10.203.20.185" does not match any of DNSNames ["etcd-d" "etcd-d.internal.k8s.ctnrva0.dev.mmerrill.net"] (lookup etcd-d on 10.203.16.2:53: no such host)", ServerName "etcd-e.internal.k8s.ctnrva0.dev.mmerrill.net", IPAddresses ["127.0.0.1"], DNSNames ["etcd-d" "etcd-d.internal.k8s.ctnrva0.dev.mmerrill.net"])

I have to restart the docker containers on etcd-e and etcd-f. When I do that, etcd-d is then able to connect to the etcd heartbeat service on etcd-e and etcd-f.

To me, this sounds like an issue with etcd itself, where the running etcd somehow remembers the source IP of an old connection from the old etcd master, and when a new one comes online, it's rejected b/c of some data still in memory. I only say that b/c restarting the containers for etcd-e and etcd-f solves it.

mmerrill3 commented on July 27, 2024

Created a gist of me reproducing this again today:

https://gist.github.com/mmerrill3/a354e9289bc44e9c9f0711f6de932fdd

mmerrill3 commented on July 27, 2024

Do we need to put the new IP in the SAN of the client cert to bootstrap the newly created member (new from an IP perspective; the peer ID is the same) into the existing cluster? Right now we just have 127.0.0.1.

mmerrill3 commented on July 27, 2024

I just tried to reproduce the issue one more time. This time, I got a clue. etcd-main actually was ok when I terminated the existing master instance of etcd-main, and it came back online. etcd-events was still not working, for the same reason above. But, this message printed, which shows that the controller was able to then continue, and publish the new peer map to all the peers (needed so /etc/hosts gets updated). The cluster state got obtained... and all peer connectivity worked ok for etcd-main. Makes sense b/c all the peers had their /etc/hosts entries updated.

{"level":"warn","ts":"2020-12-18T22:12:44.623Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-e75929a0-c5ed-4839-8bd7-77e5ac9b998a/etcd-a.internal.k8s.ctnror0.dev.mmerrill.net:4001","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
W1218 22:12:44.623366 4345 controller.go:730] health-check unable to reach member 603088857070764962 on [https://etcd-a.internal.k8s.ctnror0.dev.mmerrill.net:4001]: context deadline exceeded
I1218 22:12:44.640264 4345 controller.go:292] etcd cluster state: etcdClusterState

I'm not sure why this doesn't happen all the time? It's obviously trying to communicate with the old, terminated, etcd-a which is now gone. A timeout kicks in, and then the controller does its thing and notifies all the peers of the new address for etcd-a. Why doesn't this timeout kick in all the time?

mmerrill3 commented on July 27, 2024

I will try putting the IP in the client/server cert that etcd uses for its health checks. It's a chicken-and-egg issue, in that cluster members don't yet know the new IP of the newest member from /etc/hosts. B/c of that, health checks fail b/c the client cert only has the DNS name (still not resolving to the newer host yet) and 127.0.0.1. I'm going to put the new IP in there, which will allow etcd's health checks to work, and then, hopefully, there's a full quorum so the etcdController can tell the members about the new IP being used by the new etcd member.
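
To make the chicken-and-egg problem concrete, here is a minimal Go sketch of the kind of SAN check that produces the "does not match any of DNSNames" rejection. It is a simplified illustration, not etcd's actual code: `checkPeerSAN`, `lookupFunc`, `staleLookup`, and `demo` are hypothetical names, and the old address 10.203.20.10 is made up. The check accepts a peer if its source IP is listed in the cert's IP SANs, or if one of the cert's DNS names resolves to that IP.

```go
package main

import (
	"crypto/x509"
	"errors"
	"fmt"
	"net"
)

// lookupFunc resolves a host name to addresses. It is injected so the sketch
// can be exercised without real DNS or /etc/hosts.
type lookupFunc func(host string) ([]string, error)

// checkPeerSAN is a simplified, hypothetical version of a peer SAN check:
// accept the connection if remoteIP is one of the cert's IP SANs, or if any
// of the cert's DNS names resolves to remoteIP.
func checkPeerSAN(cert *x509.Certificate, remoteIP string, lookup lookupFunc) error {
	ip := net.ParseIP(remoteIP)
	if ip == nil {
		return fmt.Errorf("invalid remote IP %q", remoteIP)
	}
	for _, san := range cert.IPAddresses {
		if san.Equal(ip) {
			return nil
		}
	}
	for _, name := range cert.DNSNames {
		addrs, err := lookup(name)
		if err != nil {
			continue // unresolvable name: try the next one
		}
		for _, a := range addrs {
			if net.ParseIP(a).Equal(ip) {
				return nil
			}
		}
	}
	return fmt.Errorf("tls: %q does not match any of DNSNames %q", remoteIP, cert.DNSNames)
}

// staleLookup mimics peers whose /etc/hosts has not been updated yet.
// The old address 10.203.20.10 is made up for illustration.
func staleLookup(host string) ([]string, error) {
	if host == "etcd-d.internal.k8s.ctnrva0.dev.mmerrill.net" {
		return []string{"10.203.20.10"}, nil
	}
	return nil, errors.New("no such host")
}

// demo returns the check result before and after adding the new IP as a SAN.
func demo() (before, after error) {
	cert := &x509.Certificate{
		DNSNames:    []string{"etcd-d", "etcd-d.internal.k8s.ctnrva0.dev.mmerrill.net"},
		IPAddresses: []net.IP{net.ParseIP("127.0.0.1")},
	}
	const newIP = "10.203.20.185"
	before = checkPeerSAN(cert, newIP, staleLookup)
	cert.IPAddresses = append(cert.IPAddresses, net.ParseIP(newIP))
	after = checkPeerSAN(cert, newIP, staleLookup)
	return before, after
}

func main() {
	before, after := demo()
	fmt.Println("before:", before)
	fmt.Println("after: ", after)
}
```

With only 127.0.0.1 and the DNS names in the cert, the check depends on the peers' name resolution being current, which is exactly what is missing right after the instance is replaced. With the new IP in the SAN list, the check no longer depends on /etc/hosts.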

justinsb commented on July 27, 2024

Thanks for tracking this down @mmerrill3 ... I suspect it's related to some of the newer TLS validation introduced in etcd: https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.2.md#security-authentication-9

I'm going to try to validate, and I think your proposed solution of introducing the client IP sounds like a probable fix.

Sorry for the delay here!

mmerrill3 commented on July 27, 2024

Hi @justinsb, in my PR, I wrapped the etcd client call in an explicit context with a timeout. When that timeout occurs, the new map of peers gets published, and then the DNS issues go away.
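
For readers who have not seen the PR, the general pattern of wrapping a call in an outer timeout can be sketched in Go as follows. This is not the code from the PR: `callWithDeadline`, `stuckHealthCheck`, and `runDemo` are hypothetical names, and the durations are arbitrary. The point of running the call in a goroutine is that the caller gets control back even if the wrapped call ignores its context.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// callWithDeadline runs fn in a goroutine and returns as soon as fn finishes
// or the timeout elapses, whichever comes first. On timeout it returns the
// context error even if fn itself never checks its context.
func callWithDeadline(parent context.Context, timeout time.Duration, fn func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	done := make(chan error, 1) // buffered so the goroutine never blocks on send
	go func() { done <- fn(ctx) }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}

// stuckHealthCheck is a hypothetical stand-in for a client call that hangs
// and does not honor its context.
func stuckHealthCheck(ctx context.Context) error {
	time.Sleep(2 * time.Second)
	return nil
}

// runDemo wraps the stuck call with a 100ms outer timeout.
func runDemo() error {
	return callWithDeadline(context.Background(), 100*time.Millisecond, stuckHealthCheck)
}

func main() {
	// The caller regains control after about 100ms and can move on to the
	// next step of its loop, such as publishing the new peer map.
	fmt.Println("health check result:", runDemo())
}
```

One trade-off of this approach is that the goroutine running the stuck call is not stopped; it lingers until the call returns on its own.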

olemarkus commented on July 27, 2024

I have experienced this a number of times during rolling updates on the kOps master branch. Not sure it is related to changes in etcd 3.2; I had not experienced this before the last few weeks.

justinsb commented on July 27, 2024

I'm trying to reproduce failures using the kubetest2 work that's going on in kops. Any hints as to how to reproduce it?

I'm trying HA clusters (3 nodes), then upgrading kubernetes. I think that most people have reproduced it on AWS, with a "real" DNS name (i.e. not k8s.local) - is that right?

I'm guessing that we need to cross the 1.14/1.15/1.16 -> 1.17 boundary (i.e. go from etcd 3.3.10 to 3.4.3). Or maybe it is 1.17/1.18 to 1.19 (i.e. 3.4.3 -> 3.4.13)? If anyone has any recollection of which versions in particular triggered it, that could be helpful also!

We can then also get these tests into periodic testing ... but one step at a time!

olemarkus commented on July 27, 2024

For me it is:

create a cluster using the master branch with three control plane nodes
rotate the cluster

It felt like it happened every other roll, but it is probably a bit less.

mmerrill3 commented on July 27, 2024

For my case, it was a 3 member control plane, running in AWS, k8s version 1.19.4, etcd 3.4.13. I use kops rolling-update to force a restart of the leader in the etcd cluster. The etcd cluster leader is stopped and terminated by my command, and a new EC2 instance is spun up with a new IP. This new etcd member thinks it's the "peer" leader from the lock. Not the etcd leader, but the "peer" service leader. The new "peer" leader needs to publish its new IP to the other peers, but there's a point where it gets stuck. That sticking point is in the PR. I'm not sure why it gets stuck, but it does. There's supposed to be a timeout in the etcd client for the health check, but it never kicks in. Wrapping the etcd client call with another, higher-level context timeout brute-forces a fix, but I didn't see why the internal controls of the etcd client didn't time out by themselves.

Once the timeout happens, the next step in the controller loop is to publish the new IP for the new member. Once that is published, the other peers update their local /etc/hosts with the new entry, and the new etcd member/ec2 instance joins the etcd cluster.
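
As an illustration of that last step, here is a small Go sketch of rewriting a managed block in a hosts file. The begin/end marker lines follow the hosts files quoted elsewhere in this thread; everything else (`updateManagedBlock`, `hasEntry`, `staleHosts`, and the way the block is rebuilt) is a hypothetical sketch and not taken from etcd-manager's source.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// updateManagedBlock returns hosts with the block for the given key replaced
// by the supplied entries (host name -> IP). Lines outside the block are kept.
// Hypothetical helper: not etcd-manager's actual implementation.
func updateManagedBlock(hosts, key string, entries map[string]string) string {
	begin := fmt.Sprintf("# Begin host entries managed by etcd-manager[%s] - do not edit", key)
	end := fmt.Sprintf("# End host entries managed by etcd-manager[%s]", key)

	var out []string
	inBlock := false
	for _, line := range strings.Split(strings.TrimRight(hosts, "\n"), "\n") {
		switch {
		case line == begin:
			inBlock = true
		case line == end:
			inBlock = false
		case !inBlock:
			out = append(out, line) // keep everything outside the old block
		}
	}

	// Sort names so the output is deterministic.
	names := make([]string, 0, len(entries))
	for name := range entries {
		names = append(names, name)
	}
	sort.Strings(names)

	out = append(out, begin)
	for _, name := range names {
		out = append(out, entries[name]+"\t"+name)
	}
	out = append(out, end)
	return strings.Join(out, "\n") + "\n"
}

// hasEntry reports whether hosts contains an "ip<TAB>name" line.
func hasEntry(hosts, ip, name string) bool {
	return strings.Contains(hosts, ip+"\t"+name+"\n")
}

// staleHosts is a hosts file that still points etcd-b at a terminated instance.
const staleHosts = "127.0.0.1 localhost\n" +
	"# Begin host entries managed by etcd-manager[etcd] - do not edit\n" +
	"172.20.72.135\tetcd-b.internal.fteu1.awseu.ftrl.io\n" +
	"# End host entries managed by etcd-manager[etcd]\n"

func main() {
	updated := updateManagedBlock(staleHosts, "etcd", map[string]string{
		"etcd-b.internal.fteu1.awseu.ftrl.io": "172.20.83.101",
	})
	fmt.Print(updated)
}
```

If the controller loop never reaches the publish step, a block like this is never rewritten, which would match the stale entries reported later in the thread.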

zetaab commented on July 27, 2024

@mmerrill3 / @olemarkus how did you solve this issue?

I am currently seeing an error like

2021-02-10 15:48:27.137411 I | embed: rejected connection from "172.20.83.101:46836" (error "tls: \"172.20.83.101\" does not match any of DNSNames [\"etcd-b.internal.fteu1.awseu.ftrl.io\"]", ServerName "etcd-c.internal.fteu1.awseu.ftrl.io", IPAddresses ["127.0.0.1"], DNSNames ["etcd-b.internal.fteu1.awseu.ftrl.io"])

The situation is that I have 2/3 masters online, and the masters that are online have an incorrect hosts file.

etcd that works but has an incorrect hosts file:

# Begin host entries managed by etcd-manager[etcd] - do not edit
172.20.120.253	etcd-c.internal.fteu1.awseu.ftrl.io
172.20.62.209	etcd-a.internal.fteu1.awseu.ftrl.io
172.20.72.135	etcd-b.internal.fteu1.awseu.ftrl.io
# End host entries managed by etcd-manager[etcd]

But the IP address for etcd-b.internal.fteu1.awseu.ftrl.io is incorrect in that hosts file. It should be 172.20.83.101; there is no instance up and running with the 172.20.72.135 IP address?!

That incorrect IP address is in the "etcd-main" cluster, but for some reason the IP address is correct in the etcd-events cluster:

root@ip-172-20-62-209:/home/ubuntu# docker exec -it 275ac3908f7a cat /etc/hosts
# Kubernetes-managed hosts file (host network).
127.0.0.1 localhost

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

# Begin host entries managed by etcd-manager[etcd-events] - do not edit
172.20.120.253	etcd-events-c.internal.fteu1.awseu.ftrl.io
172.20.62.209	etcd-events-a.internal.fteu1.awseu.ftrl.io
172.20.83.101	etcd-events-b.internal.fteu1.awseu.ftrl.io
# End host entries managed by etcd-manager[etcd-events]

zetaab commented on July 27, 2024

I scaled down the entire master ASG; now the situation is:

root@ip-172-20-62-209:/home/ubuntu# docker exec -it 275ac3908f7a cat /etc/hosts
# Kubernetes-managed hosts file (host network).
127.0.0.1 localhost

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

# Begin host entries managed by etcd-manager[etcd-events] - do not edit
172.20.120.253	etcd-events-c.internal.fteu1.awseu.ftrl.io
172.20.62.209	etcd-events-a.internal.fteu1.awseu.ftrl.io
# End host entries managed by etcd-manager[etcd-events]
root@ip-172-20-62-209:/home/ubuntu# docker exec -it 135ab0143a0d cat /etc/hosts
# Kubernetes-managed hosts file (host network).
127.0.0.1 localhost

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

# Begin host entries managed by etcd-manager[etcd] - do not edit
172.20.120.253	etcd-c.internal.fteu1.awseu.ftrl.io
172.20.62.209	etcd-a.internal.fteu1.awseu.ftrl.io
172.20.72.135	etcd-b.internal.fteu1.awseu.ftrl.io
# End host entries managed by etcd-manager[etcd]

It seems that etcd-main does not update the hosts file?!

Which component updates these hosts files? It seems that it's stuck. I would not like to restore etcd state from backups. I have tried restarting kubelet and protokube, but that does not help.

zetaab commented on July 27, 2024

I ran docker restart <containerid> on both etcd-main containers that had the incorrect hosts file (do not run it on all instances at the same time). After that I started the missing master group, and now I can see the correct IP address in the hosts file and I have a working etcd cluster again with 3/3 members in it.
