Comments (14)
I'm able to reproduce this pretty easily.
I have a cluster with three members, etcd-d, etcd-e, and etcd-f, running etcd 3.4.13.
Etcd-d is the leader.
If I terminate the instance, I keep quorum; that all works. The issue is when etcd-d comes back. The other two members of the cluster receive LeadershipNotifications from the new etcd-d, so the gRPC peer service is OK. But within etcd itself, on the peer connection used for health checks and heartbeats, I see rejects for the new connections from etcd-d. It looks like the embedded gRPC service for etcd (not to be confused with the peer service for etcd-manager) is doing the rejecting.
The messages I see from etcd-e and etcd-f indicate as much:
2020-12-10 19:27:56.219405 I | embed: rejected connection from "10.203.20.185:41530" (error "tls: "10.203.20.185" does not match any of DNSNames ["etcd-d" "etcd-d.internal.k8s.ctnrva0.dev.mmerrill.net"] (lookup etcd-d on 10.203.16.2:53: no such host)", ServerName "etcd-e.internal.k8s.ctnrva0.dev.mmerrill.net", IPAddresses ["127.0.0.1"], DNSNames ["etcd-d" "etcd-d.internal.k8s.ctnrva0.dev.mmerrill.net"])
I have to restart the docker containers on etcd-e and etcd-f. When I do that, etcd-d is then able to connect to the etcd heartbeat service on etcd-e and etcd-f.
To me, this sounds like an issue with etcd itself, where the running etcd somehow remembers the source IP of an old connection from the old etcd master, and when a new one comes online, it's rejected because of some data still in memory? I only say that because restarting the containers for etcd-e and etcd-f solves it.
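The rejection itself is just standard SAN validation: etcd checks the peer's source IP against the certificate's SANs, and the new instance IP isn't in them. A minimal stdlib-only Go sketch (the member names and IPs mirror the log above but are otherwise illustrative, not etcd's actual code path):

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net"
	"time"
)

// newPeerCert builds a self-signed cert shaped like the one in the log:
// DNS SANs for the member name, but 127.0.0.1 as the only IP SAN.
func newPeerCert() *x509.Certificate {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "etcd-d"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(time.Hour),
		DNSNames:     []string{"etcd-d", "etcd-d.internal.example.com"},
		IPAddresses:  []net.IP{net.ParseIP("127.0.0.1")},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	return cert
}

func main() {
	cert := newPeerCert()
	// etcd validates the connecting peer's IP against the cert SANs; the
	// new instance IP is not among them, so the connection is rejected.
	fmt.Println(cert.VerifyHostname("10.203.20.185") != nil) // true: not in SANs
	fmt.Println(cert.VerifyHostname("127.0.0.1") == nil)     // true: in SANs
}
```

So the reject doesn't require etcd to remember anything about the old connection; the cert simply doesn't cover the new source IP until DNS (or /etc/hosts) resolution of the SAN names catches up.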
from etcd-manager.
Created a gist of me reproducing this again today:
https://gist.github.com/mmerrill3/a354e9289bc44e9c9f0711f6de932fdd
Do we need to put the new IP in the SAN of the client cert to bootstrap the newly created member (from an IP perspective; the peer ID is the same) into the existing cluster? Right now we just have 127.0.0.1.
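What that proposal amounts to, as a stdlib-only Go sketch (member names and IPs are illustrative, not the actual cert-issuing code in etcd-manager): include the instance IP in the template's IPAddresses, and source-IP validation passes even before the DNS names resolve.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net"
	"time"
)

// certWithIPSANs issues a self-signed peer cert whose IP SANs include
// whatever IPs are passed in, alongside the usual DNS SANs.
func certWithIPSANs(ips ...string) *x509.Certificate {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "etcd-d"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(time.Hour),
		DNSNames:     []string{"etcd-d", "etcd-d.internal.example.com"},
	}
	for _, ip := range ips {
		tmpl.IPAddresses = append(tmpl.IPAddresses, net.ParseIP(ip))
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	return cert
}

func main() {
	// Add the new instance IP next to 127.0.0.1.
	cert := certWithIPSANs("127.0.0.1", "10.203.20.185")
	// With the instance IP in the SANs, validating the source IP succeeds
	// even though the DNS names don't resolve to the new host yet.
	fmt.Println(cert.VerifyHostname("10.203.20.185") == nil) // true
}
```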
I just tried to reproduce the issue one more time. This time, I got a clue. etcd-main was actually OK when I terminated the existing master instance of etcd-main and it came back online; etcd-events was still not working, for the same reason above. But this message printed, which shows that the controller was able to continue and publish the new peer map to all the peers (needed so /etc/hosts gets updated). The cluster state was obtained, and all peer connectivity worked for etcd-main. That makes sense, because all the peers had their /etc/hosts entries updated.
{"level":"warn","ts":"2020-12-18T22:12:44.623Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-e75929a0-c5ed-4839-8bd7-77e5ac9b998a/etcd-a.internal.k8s.ctnror0.dev.mmerrill.net:4001","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
W1218 22:12:44.623366 4345 controller.go:730] health-check unable to reach member 603088857070764962 on [https://etcd-a.internal.k8s.ctnror0.dev.mmerrill.net:4001]: context deadline exceeded
I1218 22:12:44.640264 4345 controller.go:292] etcd cluster state: etcdClusterState
I'm not sure why this doesn't happen all the time. It's obviously trying to communicate with the old, terminated etcd-a, which is now gone. A timeout kicks in, and then the controller does its thing and notifies all the peers of the new address for etcd-a. Why doesn't this timeout kick in every time?
I will try putting the IP in the client/server cert for etcd that is used for the health checks by etcd. It's a chicken-and-egg issue, in that cluster members don't yet know the newest member's new IP from /etc/hosts. Because of that, health checks fail, since the cert has only the DNS name (which doesn't resolve to the newer host yet) and 127.0.0.1. I'm going to put the new IP in there, which will allow health checks to work for etcd, and then, hopefully, there's a full quorum so the etcd controller can tell the members about the new IP being used by the new etcd member.
Thanks for tracking this down @mmerrill3 ... I suspect it's related to some of the newer TLS validation introduced in etcd: https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.2.md#security-authentication-9
I'm going to try to validate, and I think your proposed solution of introducing the client IP sounds like a probable fix.
Sorry for the delay here!
Hi @justinsb, in my PR I wrapped the etcd client call in an explicit context with a timeout. That timeout occurring allows the new map of peers to get published, and then the DNS issues go away.
I have experienced this a number of times while doing a rolling update on the kOps master branch. Not sure it is related to changes in etcd 3.2; I had not experienced this before the last few weeks.
I'm trying to reproduce failures using the kubetest2 work that's going on in kops. Any hints as to how to reproduce it?
I'm trying HA clusters (3 nodes), then upgrading kubernetes. I think that most people have reproduced it on AWS, with a "real" DNS name (i.e. not k8s.local) - is that right?
I'm guessing that we need to cross the 1.14/1.15/1.16 -> 1.17 boundary (i.e. go from etcd 3.3.10 to 3.4.3). Or maybe it's 1.17/1.18 to 1.19 (i.e. 3.4.3 -> 3.4.13)? If anyone has any recollection of which versions in particular triggered it, that could be helpful too!
We can then also get these tests into periodic testing ... but one step at a time!
For me it is:
- create a cluster using the master branch with three control plane nodes
- rotate the cluster
It felt like it happened every other roll, but it is probably a bit less.
For my case, it was a 3-member control plane, running in AWS, k8s version 1.19.4, etcd 3.4.13. I use kops rolling-update to force a restart of the leader in the etcd cluster. The etcd cluster leader is stopped and terminated by my command, and a new EC2 instance is spun up with a new IP. This new etcd member thinks it's the "peer" leader from the lock (not the etcd leader, but the "peer" service leader). The new "peer" leader needs to publish its new IP to the other peers, but there's a point where it gets stuck. That sticking point is in the PR. I'm not sure why it gets stuck, but it does. There's supposed to be a timeout in the etcd client for the health check, but it never kicks in. Wrapping the etcd client call in another, higher-level context timeout brute-forces a fix, but I didn't see why the internal controls of the etcd client didn't time out by themselves.
Once the timeout happens, the next step in the controller loop is to publish the new IP for the new member. Once that is published, the other peers update their local /etc/hosts with the new entry, and the new etcd member/ec2 instance joins the etcd cluster.
@mmerrill3 / @olemarkus how did you solve this issue?
I am currently seeing errors like:
2021-02-10 15:48:27.137411 I | embed: rejected connection from "172.20.83.101:46836" (error "tls: \"172.20.83.101\" does not match any of DNSNames [\"etcd-b.internal.fteu1.awseu.ftrl.io\"]", ServerName "etcd-c.internal.fteu1.awseu.ftrl.io", IPAddresses ["127.0.0.1"], DNSNames ["etcd-b.internal.fteu1.awseu.ftrl.io"])
The situation is that I have 2/3 masters online, and the masters that are online have an incorrect hosts file.
An etcd that works but has an incorrect hosts file:
# Begin host entries managed by etcd-manager[etcd] - do not edit
172.20.120.253 etcd-c.internal.fteu1.awseu.ftrl.io
172.20.62.209 etcd-a.internal.fteu1.awseu.ftrl.io
172.20.72.135 etcd-b.internal.fteu1.awseu.ftrl.io
# End host entries managed by etcd-manager[etcd]
But the etcd-b.internal.fteu1.awseu.ftrl.io IP address is incorrect in that hosts file. It should be 172.20.83.101; there is no instance up and running with that 172.20.72.135 IP address?!
That incorrect IP address is in the "etcd-main" cluster, but for some reason the IP address is correct in the etcd-events cluster:
root@ip-172-20-62-209:/home/ubuntu# docker exec -it 275ac3908f7a cat /etc/hosts
# Kubernetes-managed hosts file (host network).
127.0.0.1 localhost
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
# Begin host entries managed by etcd-manager[etcd-events] - do not edit
172.20.120.253 etcd-events-c.internal.fteu1.awseu.ftrl.io
172.20.62.209 etcd-events-a.internal.fteu1.awseu.ftrl.io
172.20.83.101 etcd-events-b.internal.fteu1.awseu.ftrl.io
# End host entries managed by etcd-manager[etcd-events]
I scaled down the entire master ASG; now the situation is:
root@ip-172-20-62-209:/home/ubuntu# docker exec -it 275ac3908f7a cat /etc/hosts
# Kubernetes-managed hosts file (host network).
127.0.0.1 localhost
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
# Begin host entries managed by etcd-manager[etcd-events] - do not edit
172.20.120.253 etcd-events-c.internal.fteu1.awseu.ftrl.io
172.20.62.209 etcd-events-a.internal.fteu1.awseu.ftrl.io
# End host entries managed by etcd-manager[etcd-events]
root@ip-172-20-62-209:/home/ubuntu# docker exec -it 135ab0143a0d cat /etc/hosts
# Kubernetes-managed hosts file (host network).
127.0.0.1 localhost
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
# Begin host entries managed by etcd-manager[etcd] - do not edit
172.20.120.253 etcd-c.internal.fteu1.awseu.ftrl.io
172.20.62.209 etcd-a.internal.fteu1.awseu.ftrl.io
172.20.72.135 etcd-b.internal.fteu1.awseu.ftrl.io
# End host entries managed by etcd-manager[etcd]
It seems that etcd-main does not update the hosts file?!
Which component is updating these hosts files? It seems to be stuck. I would not like to restore etcd state from backups. I have tried restarting kubelet and protokube, but it does not help.
I ran docker restart <containerid>
on both etcd-main containers that had the incorrect hosts file (do not run it on all instances at the same time). After that I started the missing master group, and now I can see the correct IP address in the hosts file and I have a working etcd cluster again with 3/3 members in it.