kopeio / etcd-manager
operator for etcd: moved to https://github.com/kubernetes-sigs/etcdadm
License: Apache License 2.0
Enable a separate port for metrics and allow access from the node SG, to simplify pulling etcd metrics in a secure way. Requires an upgrade to etcd 3.3.0+ as well.
Tag for K8s 1.14, where etcd has been defaulted to 3.3.10.
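For illustration, a minimal Go sketch (not etcd-manager's actual code; the port and wiring are assumptions) of serving metrics on a dedicated listener, so a security-group rule can allow nodes to reach only that port:

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.Handler())

	// Bind only the metrics endpoint here; etcd client/peer traffic
	// stays on its own, more tightly restricted ports.
	log.Fatal(http.ListenAndServe(":8081", mux)) // port is an assumption
}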
The state file used to initialize the etcd server should reflect when etcd members come and go. Currently, we update /etc/hosts when this information changes, but we do not update the state file under the base data directory for etcd-main or etcd-events. I also propose updating the state.Nodes information for the running etcd service to reflect the changes.
This manifests as bad node data being returned by the GetInfo peers service, since the state of the cluster comes from that state file, and it doesn't get updated once it's read. If you decide to expand your cluster and then shrink it (say, for moving to a new subnet), references to the old etcd members will still be seen in the logs when the "master" etcd server calls GetInfo on the other peers.
I built my cluster with etcd version 3.3.10 before running etcd-manager. Consequently, I now can't run etcd-manager, because it doesn't ship with a binary for etcd 3.3.10. What is the best way to build the etcd-manager image myself? And will etcd-manager run with etcd 3.3.x?
slice of the cluster.spec:
  etcdClusters:
  - backups:
      backupStore: s3://$KOPS_STATE_STORE/$KLUSTER_NAME/backups/etcd/main
    etcdMembers:
    - instanceGroup: master-eu-west-1c
      name: -2c
    - instanceGroup: master-eu-west-1b
      name: -2b
    - instanceGroup: master-eu-west-1a
      name: -2a
    manager:
      image: kopeio/etcd-manager:latest
    name: main
    version: 3.1.12
  - backups:
      backupStore: s3://$KOPS_STATE_STORE/$KLUSTER_NAME/backups/etcd/events
    etcdMembers:
    - instanceGroup: master-eu-west-1c
      name: -2c
    - instanceGroup: master-eu-west-1b
      name: -2b
    - instanceGroup: master-eu-west-1a
      name: -2a
    manager:
      image: kopeio/etcd-manager:latest
    name: events
    version: 3.1.12
:latest points to v3.0.20190125 (verified by SHASUM in docker images).
Docker logs show:
I0226 01:05:54.017737 1 main.go:243] discovered IP address: 192.168.132.131
I0226 01:05:54.017774 1 main.go:248] Setting data dir to /rootfs/mnt/master-vol-078729a1332f034f2
open /etc/kubernetes/pki/etcd-manager/etcd-manager-ca.key: no such file or directory
and etcd-manager exits without bringing up etcd.
Hi,
I'm using Datadog as our monitoring solution, but I feel like this is a more generic question. Previously (before etcd-manager was introduced) ports 4001 and 4002 were accessible; if I recall correctly, these are no longer exposed on the nodes.
Previously, our datadog check was configured like this:
---
init_config:
instances:
  - url: http://etcd-a.internal.CLUSTER_NAME:4001
What's the correct url to use now?
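For reference, a rough Go sketch of the kind of mTLS probe a monitoring check now needs against the etcd client endpoint; the certificate paths and the https://127.0.0.1:4001 URL are assumptions that depend on the cluster's configuration:

package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
)

func main() {
	caPEM, err := ioutil.ReadFile("/path/to/etcd-ca.crt") // path is an assumption
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	cert, err := tls.LoadX509KeyPair(
		"/path/to/etcd-client.crt", // assumption
		"/path/to/etcd-client.key", // assumption
	)
	if err != nil {
		log.Fatal(err)
	}

	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{RootCAs: pool, Certificates: []tls.Certificate{cert}},
	}}

	// etcd exposes Prometheus metrics on its client port under /metrics.
	resp, err := client.Get("https://127.0.0.1:4001/metrics")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}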
Linux ip-10-0-13-180 4.9.0-9-amd64 #1 SMP Debian 4.9.168-1 (2019-04-12) x86_64 GNU/Linux
I installed Bazel as per the instructions below - downloaded bazel-0.27.0-installer-linux-x86_64.sh and installed it (was fine apart from needing me to install unzip)
I cloned the repo down and cd'd into it.
When attempting to compile, etcd and etcdctl seem to build fine:
root@ip-10-0-4-171:/tmp/etcd-manager# bazel build //:etcd-v2.2.1-linux-amd64_etcd //:etcd-v2.2.1-linux-amd64_etcdctl
Starting local Bazel server and connecting to it...
INFO: Analyzed 2 targets (4 packages loaded, 21 targets configured).
INFO: Found 2 targets...
INFO: Elapsed time: 6.905s, Critical Path: 0.41s
INFO: 2 processes: 2 linux-sandbox.
INFO: Build completed successfully, 3 total actions
root@ip-10-0-4-171:/tmp/etcd-manager# bazel build //:etcd-v3.2.24-linux-amd64_etcd //:etcd-v3.2.24-linux-amd64_etcdctl
INFO: Analyzed 2 targets (1 packages loaded, 4 targets configured).
INFO: Found 2 targets...
INFO: Elapsed time: 4.088s, Critical Path: 0.51s
INFO: 2 processes: 2 linux-sandbox.
INFO: Build completed successfully, 3 total actions
root@ip-10-0-4-171:/tmp/etcd-manager# cp -r bazel-genfiles/etcd-v* /opt/
root@ip-10-0-4-171:/tmp/etcd-manager# chown -R ${USER} /opt/etcd-v*
root@ip-10-0-4-171:/tmp/etcd-manager# ls -lrt
total 116
-rw-r--r-- 1 root root 3679 Jun 24 15:18 WORKSPACE
-rw-r--r-- 1 root root 14520 Jun 24 15:18 README.md
-rw-r--r-- 1 root root 1210 Jun 24 15:18 Makefile
-rw-r--r-- 1 root root 11358 Jun 24 15:18 LICENSE
drwxr-xr-x 2 root root 4096 Jun 24 15:18 images
-rw-r--r-- 1 root root 1175 Jun 24 15:18 Gopkg.toml
-rw-r--r-- 1 root root 11964 Jun 24 15:18 Gopkg.lock
drwxr-xr-x 2 root root 4096 Jun 24 15:18 docs
drwxr-xr-x 2 root root 4096 Jun 24 15:18 dev
drwxr-xr-x 7 root root 4096 Jun 24 15:18 cmd
-rw-r--r-- 1 root root 643 Jun 24 15:18 cloudbuild.yaml
-rw-r--r-- 1 root root 1276 Jun 24 15:18 cloudbuild-master.yaml
-rw-r--r-- 1 root root 2097 Jun 24 15:18 BUILD
drwxr-xr-x 2 root root 4096 Jun 24 15:18 tools
drwxr-xr-x 3 root root 4096 Jun 24 15:18 test
drwxr-xr-x 20 root root 4096 Jun 24 15:18 pkg
drwxr-xr-x 8 root root 4096 Jun 24 15:18 vendor
lrwxrwxrwx 1 root root 113 Jun 24 15:24 bazel-testlogs -> /root/.cache/bazel/_bazel_root/7a0c727cd96de5bd2637ae27a8f98b84/execroot/__main__/bazel-out/k8-fastbuild/testlogs
lrwxrwxrwx 1 root root 91 Jun 24 15:24 bazel-out -> /root/.cache/bazel/_bazel_root/7a0c727cd96de5bd2637ae27a8f98b84/execroot/__main__/bazel-out
lrwxrwxrwx 1 root root 108 Jun 24 15:24 bazel-genfiles -> /root/.cache/bazel/_bazel_root/7a0c727cd96de5bd2637ae27a8f98b84/execroot/__main__/bazel-out/k8-fastbuild/bin
lrwxrwxrwx 1 root root 81 Jun 24 15:24 bazel-etcd-manager -> /root/.cache/bazel/_bazel_root/7a0c727cd96de5bd2637ae27a8f98b84/execroot/__main__
lrwxrwxrwx 1 root root 108 Jun 24 15:24 bazel-bin -> /root/.cache/bazel/_bazel_root/7a0c727cd96de5bd2637ae27a8f98b84/execroot/__main__/bazel-out/k8-fastbuild/bin
Attempted to compile etcd-manager-ctl - this fails
root@ip-10-0-13-221:/tmp/etcd-manager# bazel build //cmd/etcd-manager-ctl
ERROR: /root/.cache/bazel/_bazel_root/7a0c727cd96de5bd2637ae27a8f98b84/external/io_bazel_rules_go/BUILD.bazel:62:1: in go_context_data rule @io_bazel_rules_go//:go_context_data:
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/7a0c727cd96de5bd2637ae27a8f98b84/external/io_bazel_rules_go/BUILD.bazel", line 62
go_context_data(name = 'go_context_data')
File "/root/.cache/bazel/_bazel_root/7a0c727cd96de5bd2637ae27a8f98b84/external/io_bazel_rules_go/go/private/context.bzl", line 396, in _go_context_data_impl
cc_common.configure_features(cc_toolchain = cc_toolchain, reque..., ...)
Incompatible flag --incompatible_require_ctx_in_configure_features has been flipped, and the mandatory parameter 'ctx' of cc_common.configure_features is missing. Please add 'ctx' as a named parameter. See https://github.com/bazelbuild/bazel/issues/7793 for details.
ERROR: Analysis of target '//cmd/etcd-manager-ctl:etcd-manager-ctl' failed; build aborted: Analysis of target '@io_bazel_rules_go//:go_context_data' failed; build aborted
INFO: Elapsed time: 2.641s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (143 packages loaded, 1485 targets configured)
Fetching @org_golang_x_tools; Restarting.
root@ip-10-0-13-221:/tmp/etcd-manager# vi /root/.cache/bazel/_bazel_root/7a0c727cd96de5bd2637ae27a8f98b84/external/io_bazel_rules_go/go/private/context.bzl
I applied the suggestion
"Incompatible flag --incompatible_require_ctx_in_configure_features has been flipped, and the mandatory parameter 'ctx' of cc_common.configure_features is missing. Please add 'ctx' as a named parameter.m/bazelbuild/bazel/issues/7793 for details."
This results in a new error.
root@ip-10-0-13-221:/tmp/etcd-manager# bazel build //cmd/etcd-manager-ctl
ERROR: /root/.cache/bazel/_bazel_root/7a0c727cd96de5bd2637ae27a8f98b84/external/io_bazel_rules_go/BUILD.bazel:62:1: in go_context_data rule @io_bazel_rules_go//:go_context_data:
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/7a0c727cd96de5bd2637ae27a8f98b84/external/io_bazel_rules_go/BUILD.bazel", line 62
go_context_data(name = 'go_context_data')
File "/root/.cache/bazel/_bazel_root/7a0c727cd96de5bd2637ae27a8f98b84/external/io_bazel_rules_go/go/private/context.bzl", line 396, in _go_context_data_impl
cc_common.configure_features(ctx = ctx, cc_toolchain = cc_toolc..., <2 more arguments>)
go_context_data has to declare 'CppConfiguration' as a required fragment in target configuration in order to access it.
ERROR: Analysis of target '//cmd/etcd-manager-ctl:etcd-manager-ctl' failed; build aborted: Analysis of target '@io_bazel_rules_go//:go_context_data' failed; build aborted
INFO: Elapsed time: 0.721s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (128 packages loaded, 1224 targets configured)
We like the fact that etcd-manager automatically takes backups into S3 - and we can see this feature operating. However, we have found no way to carry out restores apart from building etcd-manager-ctl, which we're unable to do.
In the meantime, can we restore the "old" way using the backup in the S3 bucket? In other words, aws s3 cp it to the nodes and perform an etcdctl snapshot restore onto each node?
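As a sketch of that manual path only, assuming the object copied from S3 is a plain v3 snapshot file (which may not hold for etcd-manager's backup format), the per-node restore step would look roughly like this, here driven from Go; paths, member names, and URLs are placeholders:

package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	cmd := exec.Command("etcdctl", "snapshot", "restore", "/tmp/etcd.backup",
		"--name", "etcd-a",
		"--initial-cluster", "etcd-a=https://etcd-a.internal.example.com:2380",
		"--initial-advertise-peer-urls", "https://etcd-a.internal.example.com:2380",
		"--data-dir", "/mnt/restored-data-dir",
	)
	cmd.Env = append(os.Environ(), "ETCDCTL_API=3") // snapshot restore is a v3 command
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}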
Are there any options to disable etcd-manager and revert to legacy etcd-server?
After upgrading the k8s cluster to 1.12 with kops defaults (etcd-manager enabled), I tried to apply cluster.spec.etcdClusters[*].provider=Legacy. etcd-server started, but with a clean database (no deployments, services, etc). etcd-manager saves the db on EBS in a different directory, and it seems the dbs are incompatible.
Is there any solution/documentation on how to downgrade to pure etcd-server?
Thanks for ideas in advance.
When protokube managed etcd in kops versions prior to 1.10, it updated the internal IP addresses of the etcd members in AWS Route53.
I'm not sure whether this is now expected from etcd-manager, because as of now neither protokube nor etcd-manager seems to be taking care of updating the etcd endpoints.
Even though the cluster is healthy and etcd is discoverable by the API server, I am not sure this is the desired behavior.
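For context, the Route53 upsert protokube used to perform looks roughly like this with aws-sdk-go; the hosted zone ID, record name, and IP below are placeholders:

package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/route53"
)

func main() {
	svc := route53.New(session.Must(session.NewSession()))

	// UPSERT the member's A record to its current internal IP.
	_, err := svc.ChangeResourceRecordSets(&route53.ChangeResourceRecordSetsInput{
		HostedZoneId: aws.String("ZEXAMPLE123"), // placeholder zone ID
		ChangeBatch: &route53.ChangeBatch{
			Changes: []*route53.Change{{
				Action: aws.String(route53.ChangeActionUpsert),
				ResourceRecordSet: &route53.ResourceRecordSet{
					Name: aws.String("etcd-a.internal.cluster.example.com."),
					Type: aws.String(route53.RRTypeA),
					TTL:  aws.Int64(60),
					ResourceRecords: []*route53.ResourceRecord{
						{Value: aws.String("10.0.1.23")}, // member's internal IP
					},
				},
			}},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}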
We're testing out a procedure for a full master refresh using kops/etcd-manager (described here: https://hindenes.com/2019-08-09-Kops-Restore/).
In short, we wipe the masters, let kops set up new masters, and use etcd-manager-ctl to restore the last known backup. This seems to work very well.
However, we're noticing that in-cluster apps that need access to the Kubernetes API sometimes fail. This seems to be caused by the fact that old (deleted) masters are still present in the kubernetes endpoint (kubectl -n default get endpoints kubernetes -o=yaml).
This is probably not an etcd-manager problem at all, but I'm at a loss regarding how to get rid of references to old (non-existing) masters, so any pointers would be deeply appreciated.
Hi,
I've noticed after upgrading to kops/Kubernetes 1.12 that the internal record sets for etcd are set to the default placeholder 203.0.113.123. However, etcd seems to be functioning normally. Is this expected?
Good day.
We use etcd-manager with kops to manage etcd. By default, etcd-manager sets up backups to the bucket every 15 minutes. But I could not find out what the default retention is (https://github.com/kopeio/etcd-manager/blob/master/pkg/backupcontroller/cleanup.go) or how it can be configured.
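Not knowing cleanup.go's actual policy, here is a hedged Go sketch of the general shape such a retention rule takes - keep every recent backup, thin older ones to one per day; the durations and names are hypothetical, not etcd-manager's:

package main

import (
	"fmt"
	"sort"
	"time"
)

type backup struct {
	Name string
	Time time.Time
}

// backupsToDelete keeps every backup from the last 24h and one per day
// beyond that; everything else is returned for deletion.
func backupsToDelete(backups []backup, now time.Time) []string {
	sort.Slice(backups, func(i, j int) bool { return backups[i].Time.Before(backups[j].Time) })

	keepAllAfter := now.Add(-24 * time.Hour)
	seenDay := map[string]bool{}
	var doomed []string

	for _, b := range backups {
		if b.Time.After(keepAllAfter) {
			continue // always keep recent backups
		}
		day := b.Time.UTC().Format("2006-01-02")
		if seenDay[day] {
			doomed = append(doomed, b.Name) // already kept one for this day
			continue
		}
		seenDay[day] = true
	}
	return doomed
}

func main() {
	now := time.Now()
	fmt.Println(backupsToDelete([]backup{
		{"2019-07-04T12:45:48Z-000001", now.Add(-72 * time.Hour)},
		{"2019-07-04T13:01:39Z-000001", now.Add(-71 * time.Hour)},
		{"2019-07-04T14:01:48Z-000001", now.Add(-1 * time.Hour)},
	}, now))
}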
What steps did you take and what happened:
Running etcd-manager via kops in AWS on kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17. We've observed the issue below in the following configurations (which is not intended as an exhaustive list of affected configurations, just the configurations we've tried):
Kops with etcd-manager enabled appears, by default, to start two instances of etcd-manager on each master, one for "main" and one for "events".
The master images have manage_etc_hosts set, which means at boot time a handful of lines are placed into /etc/hosts, i.e.:
# Your system has configured 'manage_etc_hosts' as True.
# As a result, if you wish for changes to this file to persist
# then you will need to either
# a.) make changes to the master file in /etc/cloud/templates/hosts.tmpl
# b.) change or remove the value of 'manage_etc_hosts' in
# /etc/cloud/cloud.cfg or cloud-config from user-data
#
127.0.1.1 your-ec2-fqdn your-ec2-shortname
127.0.0.1 localhost
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
Each instance of etcd-manager (main/events) writes records about the etcd-manager cluster into /etc/hosts, apparently every 10 seconds:
# Begin host entries managed by etcd-manager[etcd-events] - do not edit
your-master1-ip your-master1-name
your-master2-ip your-master2-name
your-master3-ip your-master3-name
# End host entries managed by etcd-manager[etcd-events]
# Begin host entries managed by etcd-manager[etcd] - do not edit
your-master1-ip your-master1-name
your-master2-ip your-master2-name
your-master3-ip your-master3-name
# End host entries managed by etcd-manager[etcd]
At some indeterminate time after boot (hours or days), we are seeing the manage_etc_hosts entries disappear from /etc/hosts, including the localhost entries, leaving only the etcd-manager entries. Per auditd logging, no other processes are writing to this file, so etcd-manager appears to be the cause of the disappearing entries.
What did you expect to happen:
Existing entries in /etc/hosts to remain undisturbed.
Anything else you would like to add:
A reboot of the node will (temporarily) restore the records, and the entries can of course be (temporarily) re-added by hand.
Versions used:
kops 1.11.1
k8s cluster: 1.11.9
infrastructure provider: aws
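For comparison, a simplified Go sketch of rewriting only the managed block shown above while leaving the rest of /etc/hosts intact; note that without inter-process locking, two managers (main and events) doing this concurrently can still clobber each other, which matches the symptom:

package main

import (
	"fmt"
	"io/ioutil"
	"os"
	"strings"
)

func updateManagedBlock(path, tag string, entries []string) error {
	begin := fmt.Sprintf("# Begin host entries managed by etcd-manager[%s] - do not edit", tag)
	end := fmt.Sprintf("# End host entries managed by etcd-manager[%s]", tag)

	data, err := ioutil.ReadFile(path)
	if err != nil {
		return err
	}

	// Copy every line except the old managed block.
	var out []string
	skipping := false
	for _, line := range strings.Split(string(data), "\n") {
		switch {
		case line == begin:
			skipping = true
		case line == end:
			skipping = false
		case !skipping:
			out = append(out, line)
		}
	}

	// Append the refreshed block.
	out = append(out, begin)
	out = append(out, entries...)
	out = append(out, end, "")

	// Write to a temp file, then rename: atomic on the same filesystem,
	// so readers never see a half-written /etc/hosts.
	tmp := path + ".tmp"
	if err := ioutil.WriteFile(tmp, []byte(strings.Join(out, "\n")), 0644); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

func main() {
	err := updateManagedBlock("/etc/hosts", "etcd",
		[]string{"10.0.0.1 etcd-a.internal.example.com"})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}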
Our team was upgrading the etcd cluster (from 2.2.1 to 3.1.12) using kops, with the following scenario:
- kops edit cluster - add etcd-manager and backups
- kops update cluster --out terraform, terraform apply
- kops rolling-update cluster --yes
- kops edit cluster - add etcd version 3.1.12
- kops update cluster --out terraform, terraform apply
- kops rolling-update cluster --yes
After some minutes I executed kubectl get nodes and got a big surprise - only one node is there, with status "NotReady"; all other cluster nodes are gone. A quick check showed that etcd-manager performed an upgrade of etcd2 to etcd3, but it lost the data and created a new, empty cluster.
As an unexpected side effect, it also affected kube-dns and flannel, which rendered k8s services (and therefore all ingresses and all services exposed via them) unavailable - so I consider this a major outage, as not only the masters were affected, but services running inside the k8s cluster were also unable to reach each other and were not reachable from the Internet.
etcd-manager logged a massive amount of data over the whole migration process; hopefully that's good enough to analyse the problem: https://gist.github.com/marek-obuchowicz/adda812f89644accc508b8d4db5db03c
"Luckily v1": "we have backups". At this moment we realised that there is no documentation provided how to restore those backups using etcd-manager. We considered going back to pure etcd (without etcd-manager
) first in order to restore the contents, but this idea was rejected.
"Luckily v2": etcd2 data was still available on the volumes, as etcd3 cluster was created with another name (another directory name was used for data). I was able to workaround the issue and bring up my etcd2 cluster with original data by:
state
file on one node and forcing it back to old directory name / version 2.2.1 + changing etcd-cluster-spec
back to version 2.2.1. It wasn't easy as the state file is a binary file (encoded with protobuf), so we had to write a little bit of go code to unmarshal the file first, change contents and then marshal it again: https://gist.github.com/marek-obuchowicz/c553effc19a97e40f01bc8e924b516eeetcd-cluster-spec
file on s3 - change version back to 2.2.1
state
file was adjustedBy doing that, I was able to get again etcd2 cluster with old data. Manager correctly recognised on the node that "cluster wanted" and "local state" versions are 2.2.1, so it automatically created etcd2
cluster, using existing data. This solution however is pretty hacky and took long time to discover.
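A minimal sketch of that unmarshal/edit/marshal step in Go, assuming the state file decodes as etcd-manager's EtcdState protobuf message (the type and field names here are taken on faith and should be checked against pkg/apis/etcd):

package main

import (
	"io/ioutil"
	"log"

	"github.com/golang/protobuf/proto"

	protoetcd "kope.io/etcd-manager/pkg/apis/etcd"
)

func main() {
	raw, err := ioutil.ReadFile("state") // the binary state file from the volume
	if err != nil {
		log.Fatal(err)
	}

	state := &protoetcd.EtcdState{}
	if err := proto.Unmarshal(raw, state); err != nil {
		log.Fatal(err)
	}

	state.EtcdVersion = "2.2.1" // force the version back (field name is an assumption)

	out, err := proto.Marshal(state)
	if err != nil {
		log.Fatal(err)
	}
	if err := ioutil.WriteFile("state", out, 0600); err != nil {
		log.Fatal(err)
	}
}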
Please let me know if there is any more information I could provide to help analysing the problem.
We executed the same operation, with the same steps, around two weeks earlier on a testing cluster - it was successful. There are two minor differences between the testing cluster (uses CNI networking and is hosted in the us-east-1 region) and the live cluster that crashed (uses flannel networking and is hosted in the eu-central-1 region). So I suspect the different behaviour might have been caused by recent etcd-manager updates.
I'm not sure if this is the correct place to report this issue or if I should open it in the kops project, but it looks to me like it's related to etcd-manager directly.
Currently etcd-manager uses aws-sdk-go v1.21.6, which doesn't support the me-south-1 region.
I have built etcd-manager with aws-sdk-go v1.21.7 and it works fine.
I recently got etcd-manager-ctl working (Ref issue 224, now closed)
I have followed the documentation for carrying out a restore, and it appears to have worked - however, looking at the cluster afterwards, what I'd expect to be restored isn't there.
Are there any logs which state whether the backups/restores are functioning?
Use case below
Create deployments/secrets in cluster
[centos@ee78cb168c41 tmp]$ kubectl apply -f nginx_with_pv.yaml
namespace/nginx-example created
persistentvolume/nginx-logs-volume created
persistentvolumeclaim/nginx-logs created
deployment.apps/nginx-deployment created
service/my-nginx created
[centos@ee78cb168c41 tmp]$ kubectl get deployments --all-namespaces
NAMESPACE NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
kube-system calico-kube-controllers 0 0 0 0 57m
kube-system dns-controller 1 1 1 1 57m
kube-system kube-dns 2 2 2 2 57m
kube-system kube-dns-autoscaler 1 1 1 1 57m
nginx-example nginx-deployment 1 1 1 1 17s
[centos@ee78cb168c41 tmp]$ kubectl create secret generic db-user-pass-bloop --from-file=./username.txt --from-file=./password.txt --namespace nginx-example
secret/db-user-pass-bloop created
Wait for etcd-manager backup
root@ip-10-0-25-247:/tmp/etcd-manager# ./etcd-manager-ctl -backup-store=s3://cms-controller-kops-sandbox-pr998/kops-cluster-sb.platformdxc-sb.internal/backups/etcd/main list-backups
Backup Store: s3://cms-controller-kops-sandbox-pr998/kops-cluster-sb.platformdxc-sb.internal/backups/etcd/main
I0704 14:02:17.605787 21750 vfs.go:94] listed backups in s3://cms-controller-kops-sandbox-pr998/kops-cluster-sb.platformdxc-sb.internal/backups/etcd/main: [2019-07-04T12:45:48Z-000001 2019-07-04T13:01:39Z-000001 2019-07-04T13:16:41Z-000002 2019-07-04T13:31:43Z-000003 2019-07-04T13:46:46Z-000004 2019-07-04T14:01:48Z-000001]
2019-07-04T12:45:48Z-000001
2019-07-04T13:01:39Z-000001
2019-07-04T13:16:41Z-000002
2019-07-04T13:31:43Z-000003
2019-07-04T13:46:46Z-000004
2019-07-04T14:01:48Z-000001
Create havoc - delete deployment and secret
root@ip-10-0-25-247:/tmp/etcd-manager# kubectl get deployments --all-namespaces
NAMESPACE NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
kube-system calico-kube-controllers 0 0 0 0 75m
kube-system dns-controller 1 1 1 1 75m
kube-system kube-dns 2 2 2 2 75m
kube-system kube-dns-autoscaler 1 1 1 1 75m
nginx-example nginx-deployment 1 1 1 1 18m
root@ip-10-0-25-247:/tmp/etcd-manager# kubectl delete deployment nginx-deployment -n nginx-example
deployment.extensions "nginx-deployment" deleted
root@ip-10-0-25-247:/tmp/etcd-manager# kubectl delete secret db-user-pass-bloop -n nginx-example
secret "db-user-pass-bloop" deleted
Restore the backup
root@ip-10-0-25-247:/tmp/etcd-manager# ./etcd-manager-ctl -backup-store=s3://cms-controller-kops-sandbox-pr998/kops-cluster-sb.platformdxc-sb.internal/backups/etcd/main restore-backup 2019-07-04T14:01:48Z-000001
Backup Store: s3://cms-controller-kops-sandbox-pr998/kops-cluster-sb.platformdxc-sb.internal/backups/etcd/main
I0704 14:04:04.484006 22622 vfs.go:60] Adding command at s3://cms-controller-kops-sandbox-pr998/kops-cluster-sb.platformdxc-sb.internal/backups/etcd/main/control/2019-07-04T14:04:04Z-000000/_command.json: timestamp:1562249044483908780 restore_backup:<cluster_spec:<member_count:3 etcd_version:"3.2.24" > backup:"2019-07-04T14:01:48Z-000001" >
added restore-backup command: timestamp:1562249044483908780 restore_backup:<cluster_spec:<member_count:3 etcd_version:"3.2.24" > backup:"2019-07-04T14:01:48Z-000001" >
Wait a while - check for the deleted items
[centos@ee78cb168c41 kops-cluster-sb]$ kubectl get deployment --all-namespaces
NAMESPACE NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
kube-system calico-kube-controllers 0 0 0 0 94m
kube-system dns-controller 1 1 1 1 94m
kube-system kube-dns 2 2 2 2 94m
kube-system kube-dns-autoscaler 1 1 1 1 94m
[centos@ee78cb168c41 kops-cluster-sb]$ kubectl get secrets -n nginx-example
NAME TYPE DATA AGE
default-token-8mj5q kubernetes.io/service-account-token 3 37m
Does anyone know if I am doing this incorrectly?
Or is my expectation of what etcd-manager backs up incorrect?
Thanks
We want this to be a neutral (de-facto) standard
As a user, I would like to be able to manually force an etcd backup before cluster maintenance. I would like etcd-manager-ctl to have a "create-backup" command.
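As a hedged sketch only: if create-backup followed the same pattern as restore-backup (etcd-manager-ctl writes a command file under control/ in the backup store, and the leader acts on it), the client side might look like the Go below; the create_backup payload and layout are assumptions, not an existing API:

package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"os"
	"path/filepath"
	"time"
)

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: create-backup <backup-store-dir>")
	}
	store := os.Args[1] // e.g. a locally mounted mirror of the backup store

	// Mirror the restore-backup layout: control/<timestamp>/_command.json.
	ts := time.Now().UTC().Format("2006-01-02T15:04:05Z") + "-000000"
	dir := filepath.Join(store, "control", ts)
	if err := os.MkdirAll(dir, 0755); err != nil {
		log.Fatal(err)
	}

	// A hypothetical create_backup command body; the real command proto
	// would need to be defined in etcd-manager for the leader to act on it.
	body := fmt.Sprintf(`{"timestamp":%d,"create_backup":{}}`, time.Now().UnixNano())
	if err := ioutil.WriteFile(filepath.Join(dir, "_command.json"), []byte(body), 0644); err != nil {
		log.Fatal(err)
	}
}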
We are running internal e2e tests all the time, and we sometimes see issues like this when creating a new cluster using kops and etcd-manager:
root@master-zone-1-3-1-clusterpr-3d22d3-k8s-local:/home/debian# docker logs 5dda81d9c876
etcd-manager
I0830 10:21:21.645797 6788 volumes.go:200] Found project="c2cd83b134244985b80038bf5c9e5e42"
I0830 10:21:21.645918 6788 volumes.go:209] Found instanceName="master-zone-1-3-1-clusterpr-3d22d3-k8s-local"
I0830 10:21:23.111471 6788 volumes.go:229] Found internalIP="10.1.32.9" and zone="zone-1"
I0830 10:21:23.111514 6788 main.go:254] Mounting available etcd volumes matching tags [KubernetesCluster=clusterpr-3d22d3.k8s.local k8s.io/etcd/main k8s.io/role/master=1]; nameTag=k8s.io/etcd/main
I0830 10:21:23.111542 6788 volumes.go:299] Listing Openstack disks in c2cd83b134244985b80038bf5c9e5e42/zone-1
I0830 10:21:23.605418 6788 mounter.go:288] Trying to mount master volume: "00e3f964-da00-4ea0-91f5-5a7a2a68de88"
I0830 10:21:26.217984 6788 mounter.go:302] Currently attached volumes: [0xc000246f80]
I0830 10:21:26.218061 6788 mounter.go:64] Master volume "00e3f964-da00-4ea0-91f5-5a7a2a68de88" is attached at "/dev/vdd"
I0830 10:21:26.218137 6788 mounter.go:78] Doing safe-format-and-mount of /dev/vdd to /mnt/master-3.etcd-main.clusterpr-3d22d3.k8s.local
I0830 10:21:26.218174 6788 mounter.go:113] Waiting for volume "00e3f964-da00-4ea0-91f5-5a7a2a68de88" to be mounted
I0830 10:21:27.218470 6788 mounter.go:113] Waiting for volume "00e3f964-da00-4ea0-91f5-5a7a2a68de88" to be mounted
I0830 10:21:28.218911 6788 mounter.go:113] Waiting for volume "00e3f964-da00-4ea0-91f5-5a7a2a68de88" to be mounted
I0830 10:21:29.219109 6788 mounter.go:113] Waiting for volume "00e3f964-da00-4ea0-91f5-5a7a2a68de88" to be mounted
I0830 10:21:30.219328 6788 mounter.go:113] Waiting for volume "00e3f964-da00-4ea0-91f5-5a7a2a68de88" to be mounted
I0830 10:21:31.219656 6788 mounter.go:113] Waiting for volume "00e3f964-da00-4ea0-91f5-5a7a2a68de88" to be mounted
I0830 10:21:32.219854 6788 mounter.go:113] Waiting for volume "00e3f964-da00-4ea0-91f5-5a7a2a68de88" to be mounted
I0830 10:21:33.220191 6788 mounter.go:116] Found volume "00e3f964-da00-4ea0-91f5-5a7a2a68de88" mounted at device "/dev/vdd"
I0830 10:21:33.221050 6788 mounter.go:161] Creating mount directory "/rootfs/mnt/master-3.etcd-main.clusterpr-3d22d3.k8s.local"
I0830 10:21:33.221180 6788 mounter.go:166] Mounting device "/dev/vdd" on "/mnt/master-3.etcd-main.clusterpr-3d22d3.k8s.local"
I0830 10:21:33.221221 6788 mount_linux.go:440] Checking for issues with fsck on disk: /dev/vdd
I0830 10:21:33.221227 6788 nsenter_exec.go:50] Running command : nsenter [--mount=/rootfs/proc/1/ns/mnt -- fsck -a /dev/vdd]
W0830 10:21:33.257556 6788 mounter.go:82] unable to mount master volume: "error formatting and mounting disk \"/dev/vdd\" on \"/mnt/master-3.etcd-main.clusterpr-3d22d3.k8s.local\": 'fsck' found errors on device /dev/vdd but could not correct them: fsck from util-linux 2.33.1\n/dev/vdd: Superblock has an invalid journal (inode 8).\nCLEARED.\n*** journal has been deleted ***\n\n/dev/vdd: Resize inode not valid. \n\n/dev/vdd: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.\n\t(i.e., without -a or -p options)\n."
I0830 10:21:33.257581 6788 boot.go:49] waiting for volumes
I0830 10:22:33.257754 6788 volumes.go:299] Listing Openstack disks in c2cd83b134244985b80038bf5c9e5e42/zone-1
I0830 10:22:33.721984 6788 mounter.go:64] Master volume "00e3f964-da00-4ea0-91f5-5a7a2a68de88" is attached at "/dev/vdd"
I0830 10:22:33.722177 6788 mounter.go:78] Doing safe-format-and-mount of /dev/vdd to /mnt/master-3.etcd-main.clusterpr-3d22d3.k8s.local
I0830 10:22:33.722270 6788 mounter.go:116] Found volume "00e3f964-da00-4ea0-91f5-5a7a2a68de88" mounted at device "/dev/vdd"
I0830 10:22:33.722843 6788 mounter.go:161] Creating mount directory "/rootfs/mnt/master-3.etcd-main.clusterpr-3d22d3.k8s.local"
I0830 10:22:33.722985 6788 mounter.go:166] Mounting device "/dev/vdd" on "/mnt/master-3.etcd-main.clusterpr-3d22d3.k8s.local"
I0830 10:22:33.723069 6788 mount_linux.go:440] Checking for issues with fsck on disk: /dev/vdd
I0830 10:22:33.723154 6788 nsenter_exec.go:50] Running command : nsenter [--mount=/rootfs/proc/1/ns/mnt -- fsck -a /dev/vdd]
W0830 10:22:33.753342 6788 mounter.go:82] unable to mount master volume: "error formatting and mounting disk \"/dev/vdd\" on \"/mnt/master-3.etcd-main.clusterpr-3d22d3.k8s.local\": 'fsck' found errors on device /dev/vdd but could not correct them: fsck from util-linux 2.33.1\n/dev/vdd contains a file system with errors, check forced.\n/dev/vdd: Resize inode not valid. \n\n/dev/vdd: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.\n\t(i.e., without -a or -p options)\n."
I0830 10:22:33.753361 6788 boot.go:49] waiting for volumes
What is causing this issue? In this case we got 2 of 3 masters up and running. The events etcd-manager is running fine, but one of the main etcd volumes failed, which led to one master failure.
kopeio/etcd-manager:1.0.20180729 fails to mount etcd volumes with "Failed to create bus connection: No data available".
kops Version 1.10.0 (git-8b52ea6d1)
I0822 11:02:52.053982 1 mounter.go:150] Creating mount directory "/rootfs/mnt/master-vol-04c550c9347a13de8"
I0822 11:02:52.053995 1 mounter.go:155] Mounting device "/dev/xvdu" on "/mnt/master-vol-04c550c9347a13de8"
I0822 11:02:52.054005 1 mount_linux.go:472] Checking for issues with fsck on disk: /dev/xvdu
I0822 11:02:52.054011 1 nsenter_exec.go:50] Running command : nsenter [--mount=/rootfs/proc/1/ns/mnt -- fsck -a /dev/xvdu]
I0822 11:02:52.083063 1 mount_linux.go:491] Attempting to mount disk: /dev/xvdu /mnt/master-vol-04c550c9347a13de8
I0822 11:02:52.083097 1 nsenter_mount.go:81] nsenter mount /dev/xvdu /mnt/master-vol-04c550c9347a13de8 [defaults]
I0822 11:02:52.083117 1 nsenter.go:106] Running nsenter command: nsenter [--mount=/rootfs/proc/1/ns/mnt -- /bin/systemd-run --description=Kubernetes transient mount for /mnt/master-vol-04c550c9347a13de8 --scope -- /bin/mount -o defaults /dev/xvdu /mnt/master-vol-04c550c9347a13de8]
I0822 11:02:52.092447 1 nsenter_mount.go:85] Output of mounting /dev/xvdu to /mnt/master-vol-04c550c9347a13de8: Failed to create bus connection: No data available
I0822 11:02:52.092465 1 mount_linux.go:542] Attempting to determine if disk "/dev/xvdu" is formatted using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/xvdu])
I0822 11:02:52.092478 1 nsenter_exec.go:50] Running command : nsenter [--mount=/rootfs/proc/1/ns/mnt -- blkid -p -s TYPE -s PTTYPE -o export /dev/xvdu]
I0822 11:02:52.106696 1 mount_linux.go:545] Output: "DEVNAME=/dev/xvdu\nTYPE=ext4\n", err: <nil>
W0822 11:02:52.106733 1 mounter.go:79] unable to mount master volume: "error formatting and mounting disk \"/dev/xvdu\" on \"/mnt/master-vol-04c550c9347a13de8\": exit status 1"
cluster.yml
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2017-06-20T09:30:00Z
  name: xxx
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    alwaysAllow: {}
  channel: alpha
  cloudProvider: aws
  configBase: s3://xxx
  dnsZone: xxx
  docker:
    bridgeIP: 192.168.5.1/24
    storage: overlay2
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-eu-west-1a
      name: a
      encryptedVolume: true
    - instanceGroup: master-eu-west-1b
      name: b
      encryptedVolume: true
    - instanceGroup: master-eu-west-1c
      name: c
      encryptedVolume: true
    name: main
    manager:
      image: kopeio/etcd-manager:1.0.20180729
  - etcdMembers:
    - instanceGroup: master-eu-west-1a
      name: a
      encryptedVolume: true
    - instanceGroup: master-eu-west-1b
      name: b
      encryptedVolume: true
    - instanceGroup: master-eu-west-1c
      name: c
      encryptedVolume: true
    name: events
    manager:
      image: kopeio/etcd-manager:1.0.20180729
  kubeAPIServer:
    runtimeConfig:
      batch/v2alpha1: "true"
    logLevel: 1
  kubelet:
    logLevel: 1
    podInfraContainerImage: gcr.io/google_containers/pause-amd64:3.1
  kubeProxy:
    logLevel: 1
  kubeControllerManager:
    logLevel: 1
  kubeScheduler:
    logLevel: 1
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.10.7
  masterPublicName: api.xxx
  networkCIDR: xxx
  networking:
    kubenet: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  hooks:
  - name: disable-locksmithd.service
    before:
    - locksmithd.service
    manifest: |
      Type=oneshot
      ExecStart=/usr/bin/systemctl mask locksmithd.service
      ExecStart=-/usr/bin/systemctl stop locksmithd.service
  sshAccess:
  - xxx
  subnets:
  - cidr: xxx
    name: eu-west-1a
    type: Public
    zone: eu-west-1a
  - cidr: xxx
    name: eu-west-1b
    type: Public
    zone: eu-west-1b
  - cidr: xxx
    name: eu-west-1c
    type: Public
    zone: eu-west-1c
  topology:
    dns:
      type: Public
    masters: public
    nodes: public
one of the three masters:
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2017-06-22T07:08:46Z
  labels:
    kops.k8s.io/cluster: xxx
  name: master-eu-west-1a
spec:
  detailedInstanceMonitoring: true
  image: coreos.com/CoreOS-stable-*-hvm
  machineType: t2.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    beta.kubernetes.io/fluentd-ds-ready: "true"
  role: Master
  subnets:
  - eu-west-1a
We should be able to create keys and distribute them securely.
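As a standard-library illustration of the kind of key material involved (not etcd-manager's code), generating a self-signed CA keypair in Go, which a secure distribution mechanism would then need to ship to each member:

package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"log"
	"math/big"
	"os"
	"time"
)

func main() {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		log.Fatal(err)
	}

	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "etcd-manager-ca"}, // name is an assumption
		NotBefore:             time.Now(),
		NotAfter:              time.Now().AddDate(10, 0, 0),
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageDigitalSignature,
		BasicConstraintsValid: true,
		IsCA:                  true,
	}

	// Self-signed: the template is both subject and issuer.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		log.Fatal(err)
	}

	pem.Encode(os.Stdout, &pem.Block{Type: "CERTIFICATE", Bytes: der})
	pem.Encode(os.Stdout, &pem.Block{Type: "RSA PRIVATE KEY", Bytes: x509.MarshalPKCS1PrivateKey(key)})
}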
We are using the kopeio/etcd-manager:3.0.20190801 version in our k8s cluster for events and main, and they corrupted the /etc/hosts file after some hours.
For the consistent master it looks like this:
# Your system has configured 'manage_etc_hosts' as True.
# As a result, if you wish for changes to this file to persist
# then you will need to either
# a.) make changes to the master file in /etc/cloud/templates/hosts.tmpl
# b.) change or remove the value of 'manage_etc_hosts' in
# /etc/cloud/cloud.cfg or cloud-config from user-data
#
127.0.1.1 ip-1-2-3-4.ourdomain.pri ip-1-2-3-4
127.0.0.1 localhost
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
# Begin host entries managed by etcd-manager[etcd-events] - do not edit
1.2.3.4 etcd-events-a.internal.example.com
1.2.3.5 etcd-events-b.internal.example.com
1.2.3.6 etcd-events-c.internal.example.com
# End host entries managed by etcd-manager[etcd-events]
# Begin host entries managed by etcd-manager[etcd] - do not edit
1.2.3.4 etcd-a.internal.example.com
1.2.3.5 etcd-b.internal.example.com
1.2.3.6 etcd-c.internal.example.com
# End host entries managed by etcd-manager[etcd]
while on one of the other masters, where it is damaged:
r-data
#
127.0.1.1 ip-1-2-3-6.ourdomain.pri ip-1-2-3-6
127.0.0.1 localhost
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
# Begin host entries managed by etcd-manager[etcd] - do not edit
1.2.3.4 etcd-a.internal.example.com
1.2.3.5 etcd-b.internal.example.com
1.2.3.6 etcd-c.internal.example.com
# End host entries managed by etcd-manager[etcd]
# Begin host entries managed by etcd-manager[etcd-events] - do not edit
1.2.3.4 etcd-events-a.internal.example.com
1.2.3.5 etcd-events-b.internal.example.com
1.2.3.6 etcd-events-c.internal.example.com
# End host entries managed by etcd-manager[etcd-events]
As you can see, after some concurrent writes the events and the main etcd-manager damaged the beginning of the file (partially removing part of the cloud.cfg comment). After some time they remove the host entries as well, and we end up with a file that doesn't contain any entries for localhost or for the hostname ip-x-x-x-x, which causes all the calico nodes in the cluster to become unready.
Attaching the two hosts files, and part of the kibana logs we see:
We sometimes see situations like this when using OpenStack:
I1027 06:32:33.238927 5130 volumes.go:300] Listing Openstack disks in 44a6f8538efe47cd9b55182e0a94e478/zone-1
I1027 06:32:33.660679 5130 mounter.go:288] Trying to mount master volume: "1bc7494f-be09-443b-8713-c478f8f2c5ed"
W1027 06:32:33.952050 5130 mounter.go:293] Error attaching volume "1bc7494f-be09-443b-8713-c478f8f2c5ed": error attaching volume 1bc7494f-be09-443b-8713-c478f8f2c5ed to server 061822c7-0fcd-4e49-96f3-ee0a204a448c: Bad request with: [POST https://foobar.com/v2.1/servers/061822c7-0fcd-4e49-96f3-ee0a204a448c/os-volume_attachments], error message: {"badRequest": {"message": "Invalid volume: volume 1bc7494f-be09-443b-8713-c478f8f2c5ed already attached", "code": 400}}
I1027 06:32:33.952206 5130 mounter.go:302] Currently attached volumes: []
I1027 06:32:33.952256 5130 boot.go:49] waiting for volumes
% openstack volume list --project kaas-clusterpr-6aef63-k8s-local
+--------------------------------------+------------------------------------------+-----------+------+---------------------------------------------------------------+
| ID | Name | Status | Size | Attached to |
+--------------------------------------+------------------------------------------+-----------+------+---------------------------------------------------------------+
| 1dc01e68-a4db-4a67-b00f-da9e26fbd7af | 1.etcd-events.clusterpr-6aef63.k8s.local | in-use | 8 | Attached to 18e52d29-2771-404c-bf6a-37a94631e506 on /dev/vdc |
| ec8f7fa7-2594-4619-88f5-83b5e93b2886 | 1.etcd-main.clusterpr-6aef63.k8s.local | in-use | 8 | Attached to 18e52d29-2771-404c-bf6a-37a94631e506 on /dev/vdd |
| 854de0e0-39ba-43f3-982b-b1affc774e55 | 3.etcd-events.clusterpr-6aef63.k8s.local | in-use | 8 | Attached to 061822c7-0fcd-4e49-96f3-ee0a204a448c on /dev/vdd |
| 7e549e7b-7105-46c7-b976-2f3bb4bf6c8f | 2.etcd-events.clusterpr-6aef63.k8s.local | in-use | 8 | Attached to 9d55ba26-ee07-4422-afa1-b37ffec92d73 on /dev/vdd |
| 52d58cb5-71da-415f-8695-e9bea97380a6 | 3.etcd-main.clusterpr-6aef63.k8s.local | in-use | 8 | Attached to 9d55ba26-ee07-4422-afa1-b37ffec92d73 on /dev/vdc |
| 1bc7494f-be09-443b-8713-c478f8f2c5ed | 2.etcd-main.clusterpr-6aef63.k8s.local | available | 8 | |
+--------------------------------------+------------------------------------------+-----------+------+---------------------------------------------------------------+
So for some reason the manager decides to take the incorrect volume. Maybe better tags for volumes are needed? I am running this in a single zone, so volumes can be mounted to any master.
edit: Hmm, now that I check the error message and the volume list, the ids actually match the non-mounted volume - but the volume is somehow not attached, even though the error says it's already attached?
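One way to narrow that window would be to re-check the volume's state immediately before attaching; a gophercloud-based Go sketch, with authenticated clients and IDs passed in for brevity:

package main

import (
	"fmt"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/blockstorage/v2/volumes"
	"github.com/gophercloud/gophercloud/openstack/compute/v2/extensions/volumeattach"
)

func attachIfAvailable(block, compute *gophercloud.ServiceClient, volumeID, serverID string) error {
	vol, err := volumes.Get(block, volumeID).Extract()
	if err != nil {
		return err
	}
	// Cinder and Nova can briefly disagree about attachment state;
	// treating anything but "available" as busy avoids the 400 above.
	if vol.Status != "available" {
		return fmt.Errorf("volume %s is %q, not attaching", volumeID, vol.Status)
	}
	_, err = volumeattach.Create(compute, serverID, volumeattach.CreateOpts{
		VolumeID: volumeID,
	}).Extract()
	return err
}

func main() {
	// Wiring up authenticated ServiceClients (openstack.AuthenticatedClient,
	// openstack.NewComputeV2, openstack.NewBlockStorageV2) is omitted here.
}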
We are currently seeing our cluster get into a state where its cluster state knows about all three members, but marks one as unhealthy because it's not responding to etcd checks. However, the reason it's not responding is that the gRPC command to join the cluster hasn't been initiated, because the cluster already knows the member exists.
Of note is that this host runs two instances of etcd-manager, one for events and one for main Kubernetes objects. Only one of the instances is "broken".
Log excerpt from etcd-manager leader:
I0103 18:44:51.268012 18771 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0103 18:44:51.818575 18771 volumes.go:85] AWS API Request: ec2/DescribeInstances
I0103 18:44:51.879649 18771 hosts.go:84] hosts update: primary=map[172.28.192.102:[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local] 172.28.194.230:[etcd-events-etcd-us-west-2b.internal.redacted.k8s.local]], fallbacks=map[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:[172.28.192.102 172.28.192.127 172.28.192.102] etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:[172.28.194.230 172.28.194.230 172.28.194.57] etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:[172.28.196.130 172.28.196.130]], final=map[172.28.192.102:[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local] 172.28.194.230:[etcd-events-etcd-us-west-2b.internal.redacted.k8s.local] 172.28.196.130:[etcd-events-etcd-us-west-2c.internal.redacted.k8s.local etcd-events-etcd-us-west-2c.internal.redacted.k8s.local]]
I0103 18:44:51.879750 18771 hosts.go:181] skipping update of unchanged /etc/hosts
2020-01-03 18:44:55.679341 W | rafthttp: health check for peer 240483fbaa2c3037 could not connect: dial tcp 172.28.196.130:2381: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2020-01-03 18:44:55.679371 W | rafthttp: health check for peer 240483fbaa2c3037 could not connect: dial tcp 172.28.196.130:2381: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
I0103 18:44:58.933053 18771 controller.go:173] starting controller iteration
I0103 18:44:58.933090 18771 controller.go:269] I am leader with token "[REDACTED]"
2020-01-03 18:45:00.679490 W | rafthttp: health check for peer 240483fbaa2c3037 could not connect: dial tcp 172.28.196.130:2381: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
2020-01-03 18:45:00.679521 W | rafthttp: health check for peer 240483fbaa2c3037 could not connect: dial tcp 172.28.196.130:2381: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
W0103 18:45:03.957167 18771 controller.go:703] health-check unable to reach member 2595344402187300919: error building etcd client for https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:4002: dial tcp 172.28.196.130:4002: connect: connection refused
I0103 18:45:03.957196 18771 controller.go:276] etcd cluster state: etcdClusterState
members:
{"name":"etcd-events-etcd-us-west-2c","peerURLs":["https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:2381"],"endpoints":["https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:4002"],"ID":"2595344402187300919"}
NOT HEALTHY
{"name":"etcd-events-etcd-us-west-2b","peerURLs":["https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:2381"],"endpoints":["https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:4002"],"ID":"14454711989398209995"}
{"name":"etcd-events-etcd-us-west-2a","peerURLs":["https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:2381"],"endpoints":["https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:4002"],"ID":"16707933308235350511"}
peers:
etcdClusterPeerInfo{peer=peer{id:"etcd-events-etcd-us-west-2a" endpoints:"172.28.192.102:3997" }, info=cluster_name:"etcd-events" node_configuration:<name:"etcd-events-etcd-us-west-2a" peer_urls:"https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:2381" client_urls:"https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:4002" quarantined_client_urls:"https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:3995" > etcd_state:<cluster:<cluster_token:"Ty6K7M5AzR1HeBeARXgqAA" nodes:<name:"etcd-events-etcd-us-west-2c" peer_urls:"https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:2381" client_urls:"https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:4002" quarantined_client_urls:"https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:3995" tls_enabled:true > nodes:<name:"etcd-events-etcd-us-west-2a" peer_urls:"https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:2381" client_urls:"https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:4002" quarantined_client_urls:"https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:3995" tls_enabled:true > nodes:<name:"etcd-events-etcd-us-west-2b" peer_urls:"https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:2381" client_urls:"https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:4002" quarantined_client_urls:"https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:3995" tls_enabled:true > > etcd_version:"3.3.13" > }
etcdClusterPeerInfo{peer=peer{id:"etcd-events-etcd-us-west-2b" endpoints:"172.28.194.230:3997" }, info=cluster_name:"etcd-events" node_configuration:<name:"etcd-events-etcd-us-west-2b" peer_urls:"https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:2381" client_urls:"https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:4002" quarantined_client_urls:"https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:3995" > etcd_state:<cluster:<cluster_token:"Ty6K7M5AzR1HeBeARXgqAA" nodes:<name:"etcd-events-etcd-us-west-2c" peer_urls:"https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:2381" client_urls:"https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:4002" quarantined_client_urls:"https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:3995" tls_enabled:true > nodes:<name:"etcd-events-etcd-us-west-2a" peer_urls:"https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:2381" client_urls:"https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:4002" quarantined_client_urls:"https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:3995" tls_enabled:true > nodes:<name:"etcd-events-etcd-us-west-2b" peer_urls:"https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:2381" client_urls:"https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:4002" quarantined_client_urls:"https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:3995" tls_enabled:true > > etcd_version:"3.3.13" > }
etcdClusterPeerInfo{peer=peer{id:"etcd-events-etcd-us-west-2c" endpoints:"172.28.196.130:3997" }, info=cluster_name:"etcd-events" node_configuration:<name:"etcd-events-etcd-us-west-2c" peer_urls:"https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:2381" client_urls:"https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:4002" quarantined_client_urls:"https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:3995" > }
I0103 18:45:03.957341 18771 controller.go:277] etcd cluster members: map[14454711989398209995:{"name":"etcd-events-etcd-us-west-2b","peerURLs":["https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:2381"],"endpoints":["https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:4002"],"ID":"14454711989398209995"} 16707933308235350511:{"name":"etcd-events-etcd-us-west-2a","peerURLs":["https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:2381"],"endpoints":["https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:4002"],"ID":"16707933308235350511"} 2595344402187300919:{"name":"etcd-events-etcd-us-west-2c","peerURLs":["https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:2381"],"endpoints":["https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:4002"],"ID":"2595344402187300919"}]
I0103 18:45:03.957362 18771 controller.go:615] sending member map to all peers: members:<name:"etcd-events-etcd-us-west-2a" dns:"etcd-events-etcd-us-west-2a.internal.redacted.k8s.local" addresses:"172.28.192.102" > members:<name:"etcd-events-etcd-us-west-2b" dns:"etcd-events-etcd-us-west-2b.internal.redacted.k8s.local" addresses:"172.28.194.230" >
I0103 18:45:03.957569 18771 etcdserver.go:226] updating hosts: map[172.28.192.102:[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local] 172.28.194.230:[etcd-events-etcd-us-west-2b.internal.redacted.k8s.local]]
I0103 18:45:03.957808 18771 hosts.go:84] hosts update: primary=map[172.28.192.102:[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local] 172.28.194.230:[etcd-events-etcd-us-west-2b.internal.redacted.k8s.local]], fallbacks=map[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:[172.28.192.102 172.28.192.127 172.28.192.102] etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:[172.28.194.230 172.28.194.230 172.28.194.57] etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:[172.28.196.130 172.28.196.130]], final=map[172.28.192.102:[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local] 172.28.194.230:[etcd-events-etcd-us-west-2b.internal.redacted.k8s.local] 172.28.196.130:[etcd-events-etcd-us-west-2c.internal.redacted.k8s.local etcd-events-etcd-us-west-2c.internal.redacted.k8s.local]]
I0103 18:45:04.011465 18771 commands.go:22] not refreshing commands - TTL not hit
I0103 18:45:04.011495 18771 s3fs.go:220] Reading file "s3://zendesk-compute-kops-state-staging/redacted.k8s.local/backups/etcd/events/control/etcd-cluster-created"
I0103 18:45:04.042214 18771 controller.go:369] spec member_count:3 etcd_version:"3.3.13"
I0103 18:45:04.042271 18771 controller.go:494] etcd has unhealthy members, but we already have a slot where we could add another member
I0103 18:45:04.042294 18771 controller.go:531] controller loop complete
2020-01-03 18:45:05.679628 W | rafthttp: health check for peer 240483fbaa2c3037 could not connect: dial tcp 172.28.196.130:2381: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
2020-01-03 18:45:05.679658 W | rafthttp: health check for peer 240483fbaa2c3037 could not connect: dial tcp 172.28.196.130:2381: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2020-01-03 18:45:10.679762 W | rafthttp: health check for peer 240483fbaa2c3037 could not connect: dial tcp 172.28.196.130:2381: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
2020-01-03 18:45:10.679790 W | rafthttp: health check for peer 240483fbaa2c3037 could not connect: dial tcp 172.28.196.130:2381: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
This project is linked to from the Kops roadmap. What's the status of it? Is it production ready or still a work-in-progress?
Hi Justin,
I have been studying this project. I found that you have implemented a gossip and leader election algorithm here. Have you considered using raft itself to do so, instead of reinventing this?
Thanks.
I am trying to build the etcd-manager docker image using make push:
root@bazeltest:/home/debian/etcd-manager# bazel version
Build label: 0.28.1
Build target: bazel-out/k8-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Fri Jul 19 15:19:51 2019 (1563549591)
Build timestamp: 1563549591
Build timestamp as int: 1563549591
root@bazeltest:/home/debian/etcd-manager# make push
bazel run --features=pure --platforms=@io_bazel_rules_go//go/toolchain:linux_amd64 //images:push-etcd-manager
ERROR: /root/.cache/bazel/_bazel_root/96468b218fe40a0551bedd20e6b7fe69/external/subpar/subpar.bzl:111:17: Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/96468b218fe40a0551bedd20e6b7fe69/external/subpar/subpar.bzl", line 108
rule(attrs = {"src": attr.label(manda...")}, <2 more arguments>)
File "/root/.cache/bazel/_bazel_root/96468b218fe40a0551bedd20e6b7fe69/external/subpar/subpar.bzl", line 111, in rule
attr.label(mandatory = True, allow_files = Tr..., ...)
'single_file' is no longer supported. use allow_single_file instead. You can use --incompatible_disable_deprecated_attr_params=false to temporarily disable this check.
ERROR: /home/debian/etcd-manager/images/BUILD:102:1: every rule of type container_push implicitly depends upon the target '@containerregistry//:pusher', but this target could not be found because of: error loading package '@containerregistry//': Extension file 'subpar.bzl' has errors
ERROR: /home/debian/etcd-manager/images/BUILD:102:1: every rule of type container_push implicitly depends upon the target '@containerregistry//:digester', but this target could not be found because of: error loading package '@containerregistry//': Extension file 'subpar.bzl' has errors
ERROR: Analysis of target '//images:push-etcd-manager' failed; build aborted: error loading package '@containerregistry//': Extension file 'subpar.bzl' has errors
INFO: Elapsed time: 0.780s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (16 packages loaded, 143 targets configured)
FAILED: Build did NOT complete successfully (16 packages loaded, 143 targets configured)
currently loading: @containerregistry//
Makefile:23: recipe for target 'push-etcd-manager' failed
make: *** [push-etcd-manager] Error 1
Let's add --incompatible_disable_deprecated_attr_params=false to the parameters:
root@bazeltest:/home/debian/etcd-manager# make push
bazel run --features=pure --platforms=@io_bazel_rules_go//go/toolchain:linux_amd64 --incompatible_disable_deprecated_attr_params=false //images:push-etcd-manager
ERROR: /home/debian/etcd-manager/images/BUILD:29:1: in container_layer_ rule //images:etcd-3-1-12-layer:
Traceback (most recent call last):
File "/home/debian/etcd-manager/images/BUILD", line 29
container_layer_(name = 'etcd-3-1-12-layer')
File "/root/.cache/bazel/_bazel_root/96468b218fe40a0551bedd20e6b7fe69/external/io_bazel_rules_docker/container/layer.bzl", line 184, in _impl
zip_layer(ctx, unzipped_layer)
File "/root/.cache/bazel/_bazel_root/96468b218fe40a0551bedd20e6b7fe69/external/io_bazel_rules_docker/container/layer.bzl", line 121, in zip_layer
_gzip(ctx, layer)
File "/root/.cache/bazel/_bazel_root/96468b218fe40a0551bedd20e6b7fe69/external/io_bazel_rules_docker/skylib/zip.bzl", line 19, in _gzip
ctx.actions.run_shell(command = ("%s -n < %s > %s" % (...)), <4 more arguments>)
Found tool(s) 'bazel-out/host/bin/external/gzip/gzip' in inputs. A tool is an input with executable=True set. All tools should be passed using the 'tools' argument instead of 'inputs' in order to make their runfiles available to the action. This safety check will not be performed once the action is modified to take a 'tools' argument. To temporarily disable this check, set --incompatible_no_support_tools_in_action_inputs=false.
ERROR: Analysis of target '//images:push-etcd-manager' failed; build aborted: Analysis of target '//images:etcd-3-1-12-layer' failed; build aborted
INFO: Elapsed time: 1.299s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (159 packages loaded, 5762 targets configured)
FAILED: Build did NOT complete successfully (159 packages loaded, 5762 targets configured)
Makefile:23: recipe for target 'push-etcd-manager' failed
make: *** [push-etcd-manager] Error 1
It still fails; let's add --incompatible_no_support_tools_in_action_inputs=false to the parameters.
root@bazeltest:/home/debian/etcd-manager# make push
bazel run --features=pure --platforms=@io_bazel_rules_go//go/toolchain:linux_amd64 --incompatible_disable_deprecated_attr_params=false --incompatible_no_support_tools_in_action_inputs=false //images:push-etcd-manager
INFO: Analyzed target //images:push-etcd-manager (322 packages loaded, 8428 targets configured).
INFO: Found 1 target...
ERROR: /home/debian/etcd-manager/images/BUILD:102:1: ContainerPushDigest images/push-etcd-manager.digest failed (Exit 1) digester failed: error executing command bazel-out/host/bin/external/containerregistry/digester --config bazel-out/k8-fastbuild/bin/images/etcd-manager.0.config --manifest bazel-out/k8-fastbuild/bin/images/etcd-manager.0.manifest --digest ... (remaining 61 argument(s) skipped)
Use --sandbox_debug to see verbose messages from the sandbox
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/96468b218fe40a0551bedd20e6b7fe69/sandbox/linux-sandbox/199/execroot/__main__/bazel-out/host/bin/external/containerregistry/digester.runfiles/containerregistry/tools/image_digester_.py", line 28, in <module>
from containerregistry.client.v2_2 import docker_image as v2_2_image
File "/root/.cache/bazel/_bazel_root/96468b218fe40a0551bedd20e6b7fe69/sandbox/linux-sandbox/199/execroot/__main__/bazel-out/host/bin/external/containerregistry/digester.runfiles/containerregistry/client/__init__.py", line 23, in <module>
from containerregistry.client import docker_creds_
File "/root/.cache/bazel/_bazel_root/96468b218fe40a0551bedd20e6b7fe69/sandbox/linux-sandbox/199/execroot/__main__/bazel-out/host/bin/external/containerregistry/digester.runfiles/containerregistry/client/docker_creds_.py", line 31, in <module>
import httplib2
File "/root/.cache/bazel/_bazel_root/96468b218fe40a0551bedd20e6b7fe69/sandbox/linux-sandbox/199/execroot/__main__/bazel-out/host/bin/external/containerregistry/digester.runfiles/httplib2/__init__.py", line 988
raise socket.error, msg
^
SyntaxError: invalid syntax
----------------
Note: The failure of target @containerregistry//:digester (with exit code 1) may have been caused by the fact that it is running under Python 3 instead of Python 2. Examine the error to determine if that appears to be the problem. Since this target is built in the host configuration, the only way to change its version is to set --host_force_python=PY2, which affects the entire build.
If this error started occurring in Bazel 0.27 and later, it may be because the Python toolchain now enforces that targets analyzed as PY2 and PY3 run under a Python 2 and Python 3 interpreter, respectively. See https://github.com/bazelbuild/bazel/issues/7899 for more information.
----------------
Target //images:push-etcd-manager failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 2.594s, Critical Path: 0.35s
INFO: 0 processes.
FAILED: Build did NOT complete successfully
FAILED: Build did NOT complete successfully
Makefile:23: recipe for target 'push-etcd-manager' failed
make: *** [push-etcd-manager] Error 1
I am out of ideas on how to build the etcd-manager docker image.
Hey @justinsb,
We briefly touched on that topic at KubeCon in Barcelona.
First of all, I much appreciate all your (authors' and collaborators') hard work on these exceptional projects.
My current k8s@aws v1.12 was built by kops 1.11 using the normal official etcd image, v3.2.26.
Now kops 1.12 is released and has all the etcd versions hardcoded.
Curious if there is a way we can have our own official images and versions chosen during kops/etcd-manager setup?
IMHO such hard-coding adds excess maintenance.
The whole community will depend on contributors' will and free time.
It's Go code, not end-user-friendly YAML.
Also this commit:
justinsb committed 12 days ago: Support etcd 3.3.10 (May 16, 2019)
Any specific reason why not 3.3.13, or anything else after 3.3.10?
Related issue from kops repo kubernetes/kops#6756
I'm getting a timeout trying to go get the project:
$ go get kope.io/etcd-manager
package kope.io/etcd-manager: unrecognized import path "kope.io/etcd-manager" (https fetch: Get https://kope.io/etcd-manager?go-get=1: dial tcp 104.197.25.62:443: i/o timeout)
It looks like the vanity URL is unable to respond.
Hopefully this beautiful piece of software will be updated to the latest etcd to enable the latest security fixes:
https://groups.google.com/forum/#!msg/golang-announce/65QixT3tcmg/DrFiG6vvCwAJ
https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.3.md#v3315-2019-08-19
:)
We've been using kops for a few years, and prior to the introduction of etcd-manager we relied on our own EBS backup strategy. This led to a number of etcd volumes being present in our AWS account that matched the tags used by etcd-manager to select and mount storage.
The first host that came up after the rolling-update that installed etcd-manager had the following in its logs:
I1209 16:26:18.958984 18726 volumes.go:85] AWS API Request: ec2metadata/GetMetadata
I1209 16:26:18.959912 18726 volumes.go:85] AWS API Request: ec2metadata/GetMetadata
I1209 16:26:18.960468 18726 volumes.go:85] AWS API Request: ec2metadata/GetMetadata
I1209 16:26:18.961016 18726 volumes.go:85] AWS API Request: ec2metadata/GetMetadata
I1209 16:26:18.961520 18726 main.go:254] Mounting available etcd volumes matching tags [k8s.io/etcd/main k8s.io/role/master=1 kubernetes.io/cluster/kube.us-east-1.dev.deploys.brightcove.com=owned]; nameTag=k8s.io/etcd/main
I1209 16:26:18.962655 18726 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I1209 16:26:19.152540 18726 mounter.go:302] Currently attached volumes: [0xc00025af00]
I1209 16:26:19.152574 18726 mounter.go:64] Master volume "vol-0a5a75bec90179bd8" is attached at "/dev/xvdu"
I1209 16:26:19.152590 18726 mounter.go:78] Doing safe-format-and-mount of /dev/xvdu to /mnt/master-vol-0a5a75bec90179bd8
I1209 16:26:19.152604 18726 volumes.go:233] volume vol-0a5a75bec90179bd8 not mounted at /rootfs/dev/xvdu
I1209 16:26:19.152639 18726 volumes.go:247] found nvme volume "nvme-Amazon_Elastic_Block_Store_vol0a5a75bec90179bd8" at "/dev/nvme1n1"
I1209 16:26:19.152652 18726 mounter.go:116] Found volume "vol-0a5a75bec90179bd8" mounted at device "/dev/nvme1n1"
I1209 16:26:19.153151 18726 mounter.go:173] Device already mounted on "/mnt/master-vol-0a5a75bec90179bd8", verifying it is our device
I1209 16:26:19.153167 18726 mounter.go:185] Found existing mount of "/dev/nvme1n1" at "/mnt/master-vol-0a5a75bec90179bd8"
I1209 16:26:19.153241 18726 mount_linux.go:164] Detected OS without systemd
I1209 16:26:19.153789 18726 mounter.go:226] matched device "/dev/nvme1n1" and "/dev/nvme1n1" via '\x00'
I1209 16:26:19.153803 18726 mounter.go:86] mounted master volume "vol-0a5a75bec90179bd8" on /mnt/master-vol-0a5a75bec90179bd8
I1209 16:26:19.153816 18726 main.go:269] discovered IP address: 10.250.16.215
I1209 16:26:19.153823 18726 main.go:274] Setting data dir to /rootfs/mnt/master-vol-0a5a75bec90179bd8
I1209 16:26:19.154260 18726 server.go:71] starting GRPC server using TLS, ServerName="etcd-manager-server-etcd-b"
I1209 16:26:19.154403 18726 s3context.go:331] product_uuid is "ec2004e4-d619-9524-bf5b-e56ce28c2bd6", assuming running on EC2
I1209 16:26:19.155152 18726 s3context.go:164] got region from metadata: "us-east-1"
I1209 16:26:19.212772 18726 s3context.go:210] found bucket in region "us-east-1"
I1209 16:26:19.212798 18726 s3fs.go:128] Writing file "s3://com.brightcove.deploys.dev.kube.dev-us-east-1/kube.us-east-1.dev.deploys.brightcove.com/backups/etcd/main/control/etcd-cluster-created"
I1209 16:26:19.212816 18726 s3context.go:238] Checking default bucket encryption for "com.brightcove.deploys.dev.kube.dev-us-east-1"
W1209 16:26:19.272282 18726 controller.go:135] not enabling TLS for etcd, this is insecure
I1209 16:26:19.272306 18726 server.go:89] GRPC server listening on "10.250.16.215:3996"
I1209 16:26:19.272403 18726 etcdserver.go:534] starting etcd with state cluster:<cluster_token:"ckDjqRPhIBJGj0dtx6qVlw" nodes:<name:"etcd-a" peer_urls:"http://etcd-a.internal.kube.us-east-1.dev.deploys.brightcove.com:2380" client_urls:"http://etcd-a.internal.kube.us-east-1.dev.deploys.brightcove.com:4001" quarantined_client_urls:"http://etcd-a.internal.kube.us-east-1.dev.deploys.brightcove.com:3994" > nodes:<name:"etcd-b" peer_urls:"http://etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com:2380" client_urls:"http://etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com:4001" quarantined_client_urls:"http://etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com:3994" > nodes:<name:"etcd-c" peer_urls:"http://etcd-c.internal.kube.us-east-1.dev.deploys.brightcove.com:2380" client_urls:"http://etcd-c.internal.kube.us-east-1.dev.deploys.brightcove.com:4001" quarantined_client_urls:"http://etcd-c.internal.kube.us-east-1.dev.deploys.brightcove.com:3994" > > etcd_version:"2.2.1"
I1209 16:26:19.272549 18726 etcdserver.go:543] starting etcd with datadir /rootfs/mnt/master-vol-0a5a75bec90179bd8/data/ckDjqRPhIBJGj0dtx6qVlw
I1209 16:26:19.272548 18726 volumes.go:85] AWS API Request: ec2/DescribeVolumes
W1209 16:26:19.272599 18726 pki.go:46] not generating peer keypair as peers-ca not set
W1209 16:26:19.272626 18726 pki.go:84] not generating client keypair as clients-ca not set
I1209 16:26:19.272703 18726 etcdprocess.go:180] executing command /opt/etcd-v2.2.1-linux-amd64/etcd [/opt/etcd-v2.2.1-linux-amd64/etcd]
W1209 16:26:19.272749 18726 etcdprocess.go:234] using insecure configuration for etcd peers
W1209 16:26:19.272774 18726 etcdprocess.go:243] using insecure configuration for etcd clients
2019-12-09 16:26:19.277754 I | flags: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com:4001
2019-12-09 16:26:19.277784 I | flags: recognized and used environment variable ETCD_DATA_DIR=/rootfs/mnt/master-vol-0a5a75bec90179bd8/data/ckDjqRPhIBJGj0dtx6qVlw
2019-12-09 16:26:19.277799 I | flags: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=http://etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com:2380
2019-12-09 16:26:19.277814 I | flags: recognized and used environment variable ETCD_INITIAL_CLUSTER=etcd-a=http://etcd-a.internal.kube.us-east-1.dev.deploys.brightcove.com:2380,etcd-b=http://etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com:2380,etcd-c=http://etcd-c.internal.kube.us-east-1.dev.deploys.brightcove.com:2380
2019-12-09 16:26:19.277820 I | flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=existing
2019-12-09 16:26:19.277830 I | flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_TOKEN=ckDjqRPhIBJGj0dtx6qVlw
2019-12-09 16:26:19.277838 I | flags: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:4001
2019-12-09 16:26:19.277848 I | flags: recognized and used environment variable ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2380
2019-12-09 16:26:19.277859 I | flags: recognized and used environment variable ETCD_NAME=etcd-b
2019-12-09 16:26:19.277889 W | flags: unrecognized environment variable ETCD_LISTEN_METRICS_URLS=
2019-12-09 16:26:19.277934 I | etcdmain: etcd Version: 2.2.1
2019-12-09 16:26:19.277938 I | etcdmain: Git SHA: Not provided (use ./build instead of go build)
2019-12-09 16:26:19.277941 I | etcdmain: Go Version: go1.12.5
2019-12-09 16:26:19.277945 I | etcdmain: Go OS/Arch: linux/amd64
2019-12-09 16:26:19.277949 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2019-12-09 16:26:19.277992 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2019-12-09 16:26:19.278095 I | etcdmain: listening for peers on http://0.0.0.0:2380
2019-12-09 16:26:19.278118 I | etcdmain: listening for client requests on http://0.0.0.0:4001
2019-12-09 16:26:19.371091 I | etcdserver: recovered store from snapshot at index 380038
2019-12-09 16:26:19.371126 I | etcdserver: name = etcd-b
2019-12-09 16:26:19.371130 I | etcdserver: data dir = /rootfs/mnt/master-vol-0a5a75bec90179bd8/data/ckDjqRPhIBJGj0dtx6qVlw
2019-12-09 16:26:19.371134 I | etcdserver: member dir = /rootfs/mnt/master-vol-0a5a75bec90179bd8/data/ckDjqRPhIBJGj0dtx6qVlw/member
2019-12-09 16:26:19.371138 I | etcdserver: heartbeat = 100ms
2019-12-09 16:26:19.371140 I | etcdserver: election = 1000ms
2019-12-09 16:26:19.371144 I | etcdserver: snapshot count = 10000
2019-12-09 16:26:19.371155 I | etcdserver: advertise client URLs = http://etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com:4001
2019-12-09 16:26:19.371185 I | etcdserver: loaded cluster information from store: <nil>
I1209 16:26:19.373963 18726 volumes.go:85] AWS API Request: ec2/DescribeInstances
2019-12-09 16:26:19.412180 I | etcdserver: restarting member a8bc606d954cb360 in cluster 362b3eb57d5b3247 at commit index 386849
2019-12-09 16:26:19.412557 I | raft: a8bc606d954cb360 became follower at term 826
2019-12-09 16:26:19.412578 I | raft: newRaft a8bc606d954cb360 [peers: [21c1cba54be22c9a,85558b08fd6377a2,a8bc606d954cb360], term: 826, commit: 386849, applied: 380038, lastindex: 386849, lastterm: 20]
2019-12-09 16:26:19.419037 E | rafthttp: request sent was ignored (cluster ID mismatch: remote[21c1cba54be22c9a]=138e61746ab70219, local=362b3eb57d5b3247)
2019-12-09 16:26:19.419054 E | rafthttp: request sent was ignored (cluster ID mismatch: remote[21c1cba54be22c9a]=138e61746ab70219, local=362b3eb57d5b3247)
2019-12-09 16:26:19.419064 E | rafthttp: failed to dial 21c1cba54be22c9a on stream Message (cluster ID mismatch)
2019-12-09 16:26:19.419073 E | rafthttp: failed to dial 21c1cba54be22c9a on stream MsgApp v2 (cluster ID mismatch)
2019-12-09 16:26:19.419903 I | etcdserver: starting server... [version: 2.2.1, cluster version: 2.2]
2019-12-09 16:26:19.422255 E | rafthttp: request sent was ignored (cluster ID mismatch: remote[85558b08fd6377a2]=138e61746ab70219, local=362b3eb57d5b3247)
2019-12-09 16:26:19.422279 E | rafthttp: failed to dial 85558b08fd6377a2 on stream Message (cluster ID mismatch)
2019-12-09 16:26:19.422552 E | rafthttp: request sent was ignored (cluster ID mismatch: remote[85558b08fd6377a2]=138e61746ab70219, local=362b3eb57d5b3247)
2019-12-09 16:26:19.422565 E | rafthttp: failed to dial 85558b08fd6377a2 on stream MsgApp v2 (cluster ID mismatch)
I1209 16:26:19.439578 18726 peers.go:101] found new candidate peer from discovery: etcd-a [{10.250.17.141 0} {10.250.17.141 0}]
I1209 16:26:19.439616 18726 peers.go:101] found new candidate peer from discovery: etcd-b [{10.250.16.215 0} {10.250.16.215 0}]
I1209 16:26:19.439629 18726 peers.go:101] found new candidate peer from discovery: etcd-c [{10.250.18.173 0} {10.250.18.173 0}]
I1209 16:26:19.439703 18726 peers.go:281] connecting to peer "etcd-a" with TLS policy, servername="etcd-manager-server-etcd-a"
I1209 16:26:19.439733 18726 hosts.go:84] hosts update: primary=map[], fallbacks=map[etcd-a.internal.kube.us-east-1.dev.deploys.brightcove.com:[10.250.17.141 10.250.17.141] etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com:[10.250.16.215 10.250.16.215] etcd-c.internal.kube.us-east-1.dev.deploys.brightcove.com:[10.250.18.173 10.250.18.173]], final=map[10.250.16.215:[etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com] 10.250.17.141:[etcd-a.internal.kube.us-east-1.dev.deploys.brightcove.com etcd-a.internal.kube.us-east-1.dev.deploys.brightcove.com] 10.250.18.173:[etcd-c.internal.kube.us-east-1.dev.deploys.brightcove.com etcd-c.internal.kube.us-east-1.dev.deploys.brightcove.com]]
I1209 16:26:19.439903 18726 peers.go:281] connecting to peer "etcd-c" with TLS policy, servername="etcd-manager-server-etcd-c"
I1209 16:26:19.439982 18726 peers.go:281] connecting to peer "etcd-b" with TLS policy, servername="etcd-manager-server-etcd-b"
W1209 16:26:19.440686 18726 peers.go:325] unable to grpc-ping discovered peer 10.250.18.173:3996: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.250.18.173:3996: connect: connection refused"
I1209 16:26:19.440719 18726 peers.go:347] was not able to connect to peer etcd-c: map[10.250.18.173:3996:true]
W1209 16:26:19.440745 18726 peers.go:215] unexpected error from peer intercommunications: unable to connect to peer etcd-c
W1209 16:26:19.441043 18726 peers.go:325] unable to grpc-ping discovered peer 10.250.17.141:3996: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.250.17.141:3996: connect: connection refused"
I1209 16:26:19.441077 18726 peers.go:347] was not able to connect to peer etcd-a: map[10.250.17.141:3996:true]
W1209 16:26:19.441096 18726 peers.go:215] unexpected error from peer intercommunications: unable to connect to peer etcd-a
2019-12-09 16:26:19.520566 E | rafthttp: request sent was ignored (cluster ID mismatch: remote[21c1cba54be22c9a]=138e61746ab70219, local=362b3eb57d5b3247)
As this was the first of 3 hosts that would eventually have etcd-manager installed, the gossip-specific warnings are to be expected. The cluster ID mismatch errors are more significant: they are the consequence of etcd-manager mounting a volume that was several months old.
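One way to confirm a stale volume like this, offered as a sketch rather than a prescribed procedure: compare the membership each node reports via etcdctl's v2 member list (flag spelling per etcd 2.x etcdctl; the binary path and endpoint are taken from the logs above):
# Ask each node for its member list; a data dir restored from an old volume
# reports member IDs that do not match the other peers (compare against the
# IDs in the raft log above, e.g. 21c1cba54be22c9a and 85558b08fd6377a2).
/opt/etcd-v2.2.1-linux-amd64/etcdctl \
  --peers http://etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com:4001 \
  member list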
Some of the approaches that occur to me to address this are:
I think the first would suffice in most cases, but the other options result in less unexpected behavior in the future.
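Whichever approach is taken, the candidate volumes can be enumerated using the same tag filters etcd-manager logs when selecting storage; a minimal sketch (the region and output shape are assumptions, the tag values come from the log above):
# List volumes carrying the tags etcd-manager matches on, so volumes left
# over from the old EBS backup strategy can be identified before removal.
aws ec2 describe-volumes --region us-east-1 \
  --filters "Name=tag-key,Values=k8s.io/etcd/main" \
            "Name=tag:kubernetes.io/cluster/kube.us-east-1.dev.deploys.brightcove.com,Values=owned" \
  --query 'Volumes[].[VolumeId,CreateTime]' --output text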
Hello,
This is not an issue; it is a question. Do you have any plans to implement an in-place upgrade from etcd 3.0.17 to 3.2.24?
We have a k8s environment running etcd 3.0.17 and would like to protect all etcd communication with TLS, but etcd-manager doesn't support version 3.0.17.
etcd-manager/pkg/controller/controller.go
Line 421 in 7af893b
me-south-1 is a new AWS region.
I use kops to deploy k8s in me-south-1,
but etcd-manager returns the error: me-south-1 is an invalid region.
I watched the pull request and found that the code to support me-south-1 was uploaded 8 days ago.
How do I build it and use it to support my deployment?
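Not an official procedure, but building from a checkout that already contains the me-south-1 change should follow the build steps quoted elsewhere in this thread; a minimal sketch (assumption: the Makefile push target builds and pushes the image via the //images:push-etcd-manager Bazel target, and the destination registry is whatever that target is configured with in your checkout):
# Build and push the etcd-manager image from source at a commit that
# includes me-south-1 support, using the push target seen in the Makefile.
git clone https://github.com/kopeio/etcd-manager.git
cd etcd-manager
make push-etcd-manager
You would then point your cluster at the pushed image instead of a released kopeio/etcd-manager tag.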
While testing out Kops 1.11-beta.1 with K8s 1.12.3 I noticed some data corruption after migrating to etcd-manager.
Replication process: create a new k8s cluster with Kops.
Kops version: 1.10
Kubernetes version: 1.10
etcd version: 3.2.12
Update the etcd and k8s versions:
Kubernetes version: 1.13
etcd version: 3.2.18 / 3.2.24 (Tested with both and saw the same issue)
Below are the logs I'm seeing from the etcd-manager container when the corruption seems to happen. When this happens, etcd does not start, and unfortunately I have not been able to find any relevant logs as to why.
Flag --insecure-bind-address has been deprecated, This flag will be removed in a future version.
Flag --insecure-port has been deprecated, This flag will be removed in a future version.
I1212 00:12:42.352817 7 flags.go:33] FLAG: --address="127.0.0.1"
I1212 00:12:42.352874 7 flags.go:33] FLAG: --admission-control="[]"
I1212 00:12:42.352885 7 flags.go:33] FLAG: --admission-control-config-file=""
I1212 00:12:42.352892 7 flags.go:33] FLAG: --advertise-address="<nil>"
I1212 00:12:42.352896 7 flags.go:33] FLAG: --allow-privileged="true"
I1212 00:12:42.352900 7 flags.go:33] FLAG: --alsologtostderr="false"
I1212 00:12:42.352904 7 flags.go:33] FLAG: --anonymous-auth="false"
I1212 00:12:42.352907 7 flags.go:33] FLAG: --apiserver-count="5"
I1212 00:12:42.352911 7 flags.go:33] FLAG: --audit-log-batch-buffer-size="10000"
I1212 00:12:42.352915 7 flags.go:33] FLAG: --audit-log-batch-max-size="1"
I1212 00:12:42.352917 7 flags.go:33] FLAG: --audit-log-batch-max-wait="0s"
I1212 00:12:42.352921 7 flags.go:33] FLAG: --audit-log-batch-throttle-burst="0"
I1212 00:12:42.352924 7 flags.go:33] FLAG: --audit-log-batch-throttle-enable="false"
I1212 00:12:42.352927 7 flags.go:33] FLAG: --audit-log-batch-throttle-qps="0"
I1212 00:12:42.352934 7 flags.go:33] FLAG: --audit-log-format="json"
I1212 00:12:42.352937 7 flags.go:33] FLAG: --audit-log-maxage="10"
I1212 00:12:42.352940 7 flags.go:33] FLAG: --audit-log-maxbackup="5"
I1212 00:12:42.352943 7 flags.go:33] FLAG: --audit-log-maxsize="100"
I1212 00:12:42.352946 7 flags.go:33] FLAG: --audit-log-mode="blocking"
I1212 00:12:42.352949 7 flags.go:33] FLAG: --audit-log-path="/var/log/kube-audit.log"
I1212 00:12:42.352952 7 flags.go:33] FLAG: --audit-log-truncate-enabled="false"
I1212 00:12:42.352955 7 flags.go:33] FLAG: --audit-log-truncate-max-batch-size="10485760"
I1212 00:12:42.352960 7 flags.go:33] FLAG: --audit-log-truncate-max-event-size="102400"
I1212 00:12:42.352963 7 flags.go:33] FLAG: --audit-log-version="audit.k8s.io/v1beta1"
I1212 00:12:42.352966 7 flags.go:33] FLAG: --audit-policy-file="/srv/kubernetes/audit_policy.yaml"
I1212 00:12:42.352969 7 flags.go:33] FLAG: --audit-webhook-batch-buffer-size="10000"
I1212 00:12:42.352972 7 flags.go:33] FLAG: --audit-webhook-batch-initial-backoff="10s"
I1212 00:12:42.352975 7 flags.go:33] FLAG: --audit-webhook-batch-max-size="400"
I1212 00:12:42.352978 7 flags.go:33] FLAG: --audit-webhook-batch-max-wait="30s"
I1212 00:12:42.352981 7 flags.go:33] FLAG: --audit-webhook-batch-throttle-burst="15"
I1212 00:12:42.352984 7 flags.go:33] FLAG: --audit-webhook-batch-throttle-enable="true"
I1212 00:12:42.352987 7 flags.go:33] FLAG: --audit-webhook-batch-throttle-qps="10"
I1212 00:12:42.352990 7 flags.go:33] FLAG: --audit-webhook-config-file=""
I1212 00:12:42.352993 7 flags.go:33] FLAG: --audit-webhook-initial-backoff="10s"
I1212 00:12:42.352996 7 flags.go:33] FLAG: --audit-webhook-mode="batch"
I1212 00:12:42.352999 7 flags.go:33] FLAG: --audit-webhook-truncate-enabled="false"
I1212 00:12:42.353002 7 flags.go:33] FLAG: --audit-webhook-truncate-max-batch-size="10485760"
I1212 00:12:42.353005 7 flags.go:33] FLAG: --audit-webhook-truncate-max-event-size="102400"
I1212 00:12:42.353008 7 flags.go:33] FLAG: --audit-webhook-version="audit.k8s.io/v1beta1"
I1212 00:12:42.353011 7 flags.go:33] FLAG: --authentication-token-webhook-cache-ttl="2m0s"
I1212 00:12:42.353014 7 flags.go:33] FLAG: --authentication-token-webhook-config-file="/etc/kubernetes/authn.config"
I1212 00:12:42.353017 7 flags.go:33] FLAG: --authorization-mode="[RBAC]"
I1212 00:12:42.353021 7 flags.go:33] FLAG: --authorization-policy-file=""
I1212 00:12:42.353024 7 flags.go:33] FLAG: --authorization-webhook-cache-authorized-ttl="5m0s"
I1212 00:12:42.353027 7 flags.go:33] FLAG: --authorization-webhook-cache-unauthorized-ttl="30s"
I1212 00:12:42.353030 7 flags.go:33] FLAG: --authorization-webhook-config-file=""
I1212 00:12:42.353032 7 flags.go:33] FLAG: --basic-auth-file="/srv/kubernetes/basic_auth.csv"
I1212 00:12:42.353036 7 flags.go:33] FLAG: --bind-address="0.0.0.0"
I1212 00:12:42.353039 7 flags.go:33] FLAG: --cert-dir="/var/run/kubernetes"
I1212 00:12:42.353042 7 flags.go:33] FLAG: --client-ca-file="/srv/kubernetes/ca.crt"
I1212 00:12:42.353045 7 flags.go:33] FLAG: --cloud-config="/etc/kubernetes/cloud.config"
I1212 00:12:42.353048 7 flags.go:33] FLAG: --cloud-provider="aws"
I1212 00:12:42.353051 7 flags.go:33] FLAG: --cloud-provider-gce-lb-src-cidrs="130.211.0.0/22,209.85.152.0/22,209.85.204.0/22,35.191.0.0/16"
I1212 00:12:42.353056 7 flags.go:33] FLAG: --contention-profiling="false"
I1212 00:12:42.353059 7 flags.go:33] FLAG: --cors-allowed-origins="[]"
I1212 00:12:42.353065 7 flags.go:33] FLAG: --default-not-ready-toleration-seconds="300"
I1212 00:12:42.353068 7 flags.go:33] FLAG: --default-unreachable-toleration-seconds="300"
I1212 00:12:42.353071 7 flags.go:33] FLAG: --default-watch-cache-size="100"
I1212 00:12:42.353074 7 flags.go:33] FLAG: --delete-collection-workers="1"
I1212 00:12:42.353077 7 flags.go:33] FLAG: --deserialization-cache-size="0"
I1212 00:12:42.353080 7 flags.go:33] FLAG: --disable-admission-plugins="[]"
I1212 00:12:42.353083 7 flags.go:33] FLAG: --enable-admission-plugins="[Initializers,NamespaceLifecycle,LimitRanger,ServiceAccount,PersistentVolumeLabel,DefaultStorageClass,DefaultTolerationSeconds,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,NodeRestriction,ResourceQuota]"
I1212 00:12:42.353100 7 flags.go:33] FLAG: --enable-aggregator-routing="false"
I1212 00:12:42.353107 7 flags.go:33] FLAG: --enable-bootstrap-token-auth="false"
I1212 00:12:42.353109 7 flags.go:33] FLAG: --enable-garbage-collector="true"
I1212 00:12:42.353112 7 flags.go:33] FLAG: --enable-logs-handler="true"
I1212 00:12:42.353115 7 flags.go:33] FLAG: --enable-swagger-ui="false"
I1212 00:12:42.353118 7 flags.go:33] FLAG: --endpoint-reconciler-type="lease"
I1212 00:12:42.353121 7 flags.go:33] FLAG: --etcd-cafile=""
I1212 00:12:42.353123 7 flags.go:33] FLAG: --etcd-certfile=""
I1212 00:12:42.353126 7 flags.go:33] FLAG: --etcd-compaction-interval="5m0s"
I1212 00:12:42.353129 7 flags.go:33] FLAG: --etcd-count-metric-poll-period="1m0s"
I1212 00:12:42.353132 7 flags.go:33] FLAG: --etcd-keyfile=""
I1212 00:12:42.353135 7 flags.go:33] FLAG: --etcd-prefix="/registry"
I1212 00:12:42.353138 7 flags.go:33] FLAG: --etcd-quorum-read="true"
I1212 00:12:42.353141 7 flags.go:33] FLAG: --etcd-servers="[http://127.0.0.1:4001]"
I1212 00:12:42.353145 7 flags.go:33] FLAG: --etcd-servers-overrides="[/events#http://127.0.0.1:4002]"
I1212 00:12:42.353150 7 flags.go:33] FLAG: --event-ttl="1h0m0s"
I1212 00:12:42.353156 7 flags.go:33] FLAG: --experimental-encryption-provider-config=""
I1212 00:12:42.353159 7 flags.go:33] FLAG: --external-hostname=""
I1212 00:12:42.353162 7 flags.go:33] FLAG: --feature-gates=""
I1212 00:12:42.353167 7 flags.go:33] FLAG: --help="false"
I1212 00:12:42.353170 7 flags.go:33] FLAG: --http2-max-streams-per-connection="0"
I1212 00:12:42.353172 7 flags.go:33] FLAG: --insecure-bind-address="127.0.0.1"
I1212 00:12:42.353176 7 flags.go:33] FLAG: --insecure-port="8080"
I1212 00:12:42.353179 7 flags.go:33] FLAG: --kubelet-certificate-authority=""
I1212 00:12:42.353182 7 flags.go:33] FLAG: --kubelet-client-certificate="/srv/kubernetes/kubelet-api.pem"
I1212 00:12:42.353185 7 flags.go:33] FLAG: --kubelet-client-key="/srv/kubernetes/kubelet-api-key.pem"
I1212 00:12:42.353188 7 flags.go:33] FLAG: --kubelet-https="true"
I1212 00:12:42.353191 7 flags.go:33] FLAG: --kubelet-port="10250"
I1212 00:12:42.353199 7 flags.go:33] FLAG: --kubelet-preferred-address-types="[InternalIP,Hostname,ExternalIP]"
I1212 00:12:42.353203 7 flags.go:33] FLAG: --kubelet-read-only-port="10255"
I1212 00:12:42.353206 7 flags.go:33] FLAG: --kubelet-timeout="5s"
I1212 00:12:42.353209 7 flags.go:33] FLAG: --kubernetes-service-node-port="0"
I1212 00:12:42.353212 7 flags.go:33] FLAG: --log-backtrace-at=":0"
I1212 00:12:42.353219 7 flags.go:33] FLAG: --log-dir=""
I1212 00:12:42.353222 7 flags.go:33] FLAG: --log-flush-frequency="5s"
I1212 00:12:42.353225 7 flags.go:33] FLAG: --logtostderr="true"
I1212 00:12:42.353228 7 flags.go:33] FLAG: --master-service-namespace="default"
I1212 00:12:42.353231 7 flags.go:33] FLAG: --max-connection-bytes-per-sec="0"
I1212 00:12:42.353234 7 flags.go:33] FLAG: --max-mutating-requests-inflight="200"
I1212 00:12:42.353237 7 flags.go:33] FLAG: --max-requests-inflight="400"
I1212 00:12:42.353240 7 flags.go:33] FLAG: --min-request-timeout="1800"
I1212 00:12:42.353243 7 flags.go:33] FLAG: --oidc-ca-file=""
I1212 00:12:42.353246 7 flags.go:33] FLAG: --oidc-client-id=""
I1212 00:12:42.353249 7 flags.go:33] FLAG: --oidc-groups-claim=""
I1212 00:12:42.353251 7 flags.go:33] FLAG: --oidc-groups-prefix=""
I1212 00:12:42.353254 7 flags.go:33] FLAG: --oidc-issuer-url=""
I1212 00:12:42.353257 7 flags.go:33] FLAG: --oidc-required-claim=""
I1212 00:12:42.353261 7 flags.go:33] FLAG: --oidc-signing-algs="[RS256]"
I1212 00:12:42.353266 7 flags.go:33] FLAG: --oidc-username-claim="sub"
I1212 00:12:42.353269 7 flags.go:33] FLAG: --oidc-username-prefix=""
I1212 00:12:42.353271 7 flags.go:33] FLAG: --port="8080"
I1212 00:12:42.353274 7 flags.go:33] FLAG: --profiling="true"
I1212 00:12:42.353277 7 flags.go:33] FLAG: --proxy-client-cert-file="/srv/kubernetes/apiserver-aggregator.cert"
I1212 00:12:42.353281 7 flags.go:33] FLAG: --proxy-client-key-file="/srv/kubernetes/apiserver-aggregator.key"
I1212 00:12:42.353284 7 flags.go:33] FLAG: --repair-malformed-updates="false"
I1212 00:12:42.353287 7 flags.go:33] FLAG: --request-timeout="1m0s"
I1212 00:12:42.353290 7 flags.go:33] FLAG: --requestheader-allowed-names="[aggregator]"
I1212 00:12:42.353294 7 flags.go:33] FLAG: --requestheader-client-ca-file="/srv/kubernetes/apiserver-aggregator-ca.cert"
I1212 00:12:42.353299 7 flags.go:33] FLAG: --requestheader-extra-headers-prefix="[X-Remote-Extra-]"
I1212 00:12:42.353304 7 flags.go:33] FLAG: --requestheader-group-headers="[X-Remote-Group]"
I1212 00:12:42.353307 7 flags.go:33] FLAG: --requestheader-username-headers="[X-Remote-User]"
I1212 00:12:42.353313 7 flags.go:33] FLAG: --runtime-config="admissionregistration.k8s.io/v1alpha1=true"
I1212 00:12:42.353320 7 flags.go:33] FLAG: --secure-port="443"
I1212 00:12:42.353323 7 flags.go:33] FLAG: --service-account-api-audiences="[]"
I1212 00:12:42.353326 7 flags.go:33] FLAG: --service-account-issuer=""
I1212 00:12:42.353329 7 flags.go:33] FLAG: --service-account-key-file="[]"
I1212 00:12:42.353338 7 flags.go:33] FLAG: --service-account-lookup="true"
I1212 00:12:42.353341 7 flags.go:33] FLAG: --service-account-max-token-expiration="0s"
I1212 00:12:42.353344 7 flags.go:33] FLAG: --service-account-signing-key-file=""
I1212 00:12:42.353347 7 flags.go:33] FLAG: --service-cluster-ip-range="100.64.0.0/13"
I1212 00:12:42.353352 7 flags.go:33] FLAG: --service-node-port-range="30000-32767"
I1212 00:12:42.353359 7 flags.go:33] FLAG: --ssh-keyfile=""
I1212 00:12:42.353362 7 flags.go:33] FLAG: --ssh-user=""
I1212 00:12:42.353364 7 flags.go:33] FLAG: --stderrthreshold="2"
I1212 00:12:42.353367 7 flags.go:33] FLAG: --storage-backend="etcd3"
I1212 00:12:42.353370 7 flags.go:33] FLAG: --storage-media-type="application/vnd.kubernetes.protobuf"
I1212 00:12:42.353374 7 flags.go:33] FLAG: --storage-versions="admission.k8s.io/v1beta1,admissionregistration.k8s.io/v1beta1,apps/v1,authentication.k8s.io/v1,authorization.k8s.io/v1,autoscaling/v1,batch/v1,certificates.k8s.io/v1beta1,coordination.k8s.io/v1beta1,events.k8s.io/v1beta1,extensions/v1beta1,imagepolicy.k8s.io/v1alpha1,networking.k8s.io/v1,policy/v1beta1,rbac.authorization.k8s.io/v1,scheduling.k8s.io/v1beta1,settings.k8s.io/v1alpha1,storage.k8s.io/v1,v1"
I1212 00:12:42.353390 7 flags.go:33] FLAG: --target-ram-mb="0"
I1212 00:12:42.353393 7 flags.go:33] FLAG: --tls-cert-file="/srv/kubernetes/server.cert"
I1212 00:12:42.353396 7 flags.go:33] FLAG: --tls-cipher-suites="[]"
I1212 00:12:42.353400 7 flags.go:33] FLAG: --tls-min-version=""
I1212 00:12:42.353403 7 flags.go:33] FLAG: --tls-private-key-file="/srv/kubernetes/server.key"
I1212 00:12:42.353406 7 flags.go:33] FLAG: --tls-sni-cert-key="[]"
I1212 00:12:42.353410 7 flags.go:33] FLAG: --token-auth-file="/srv/kubernetes/known_tokens.csv"
I1212 00:12:42.353413 7 flags.go:33] FLAG: --v="2"
I1212 00:12:42.353416 7 flags.go:33] FLAG: --version="false"
I1212 00:12:42.353421 7 flags.go:33] FLAG: --vmodule=""
I1212 00:12:42.353424 7 flags.go:33] FLAG: --watch-cache="true"
I1212 00:12:42.353427 7 flags.go:33] FLAG: --watch-cache-sizes="[]"
I1212 00:12:42.353695 7 server.go:681] external host was not specified, using 10.5.0.30
I1212 00:12:42.354026 7 server.go:705] Initializing deserialization cache size based on 0MB limit
I1212 00:12:42.354036 7 server.go:724] Initializing cache sizes based on 0MB limit
I1212 00:12:42.354101 7 server.go:152] Version: v1.12.3
W1212 00:12:42.832684 7 admission.go:76] PersistentVolumeLabel admission controller is deprecated. Please remove this controller from your configuration files and scripts.
I1212 00:12:42.832846 7 feature_gate.go:206] feature gates: &{map[Initializers:true]}
I1212 00:12:42.832863 7 initialization.go:90] enabled Initializers feature as part of admission plugin setup
I1212 00:12:42.833085 7 plugins.go:158] Loaded 10 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,Priority,DefaultTolerationSeconds,PersistentVolumeLabel,DefaultStorageClass,MutatingAdmissionWebhook,Initializers.
I1212 00:12:42.833094 7 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
W1212 00:12:42.833382 7 admission.go:76] PersistentVolumeLabel admission controller is deprecated. Please remove this controller from your configuration files and scripts.
I1212 00:12:42.833654 7 plugins.go:158] Loaded 10 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,Priority,DefaultTolerationSeconds,PersistentVolumeLabel,DefaultStorageClass,MutatingAdmissionWebhook,Initializers.
I1212 00:12:42.833664 7 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
I1212 00:12:42.835749 7 store.go:1414] Monitoring customresourcedefinitions.apiextensions.k8s.io count at <storage-prefix>//apiextensions.k8s.io/customresourcedefinitions
I1212 00:12:42.859202 7 master.go:240] Using reconciler: lease
I1212 00:12:42.862882 7 store.go:1414] Monitoring podtemplates count at <storage-prefix>//podtemplates
I1212 00:12:42.863313 7 store.go:1414] Monitoring events count at <storage-prefix>//events
I1212 00:12:42.863693 7 store.go:1414] Monitoring limitranges count at <storage-prefix>//limitranges
I1212 00:12:42.864078 7 store.go:1414] Monitoring resourcequotas count at <storage-prefix>//resourcequotas
I1212 00:12:42.864499 7 store.go:1414] Monitoring secrets count at <storage-prefix>//secrets
I1212 00:12:42.864886 7 store.go:1414] Monitoring persistentvolumes count at <storage-prefix>//persistentvolumes
I1212 00:12:42.865271 7 store.go:1414] Monitoring persistentvolumeclaims count at <storage-prefix>//persistentvolumeclaims
I1212 00:12:42.865659 7 store.go:1414] Monitoring configmaps count at <storage-prefix>//configmaps
I1212 00:12:42.866063 7 store.go:1414] Monitoring namespaces count at <storage-prefix>//namespaces
I1212 00:12:42.866465 7 store.go:1414] Monitoring endpoints count at <storage-prefix>//services/endpoints
I1212 00:12:42.866890 7 store.go:1414] Monitoring nodes count at <storage-prefix>//minions
I1212 00:12:42.867659 7 store.go:1414] Monitoring pods count at <storage-prefix>//pods
I1212 00:12:42.868099 7 store.go:1414] Monitoring serviceaccounts count at <storage-prefix>//serviceaccounts
I1212 00:12:42.868523 7 store.go:1414] Monitoring services count at <storage-prefix>//services/specs
I1212 00:12:42.869296 7 store.go:1414] Monitoring replicationcontrollers count at <storage-prefix>//controllers
I1212 00:12:43.236425 7 master.go:432] Enabling API group "authentication.k8s.io".
I1212 00:12:43.236452 7 master.go:432] Enabling API group "authorization.k8s.io".
I1212 00:12:43.237028 7 store.go:1414] Monitoring horizontalpodautoscalers.autoscaling count at <storage-prefix>//horizontalpodautoscalers
I1212 00:12:43.237503 7 store.go:1414] Monitoring horizontalpodautoscalers.autoscaling count at <storage-prefix>//horizontalpodautoscalers
I1212 00:12:43.237908 7 store.go:1414] Monitoring horizontalpodautoscalers.autoscaling count at <storage-prefix>//horizontalpodautoscalers
I1212 00:12:43.237922 7 master.go:432] Enabling API group "autoscaling".
I1212 00:12:43.238316 7 store.go:1414] Monitoring jobs.batch count at <storage-prefix>//jobs
I1212 00:12:43.238723 7 store.go:1414] Monitoring cronjobs.batch count at <storage-prefix>//cronjobs
I1212 00:12:43.238739 7 master.go:432] Enabling API group "batch".
I1212 00:12:43.239112 7 store.go:1414] Monitoring certificatesigningrequests.certificates.k8s.io count at <storage-prefix>//certificatesigningrequests
I1212 00:12:43.239127 7 master.go:432] Enabling API group "certificates.k8s.io".
I1212 00:12:43.239556 7 store.go:1414] Monitoring leases.coordination.k8s.io count at <storage-prefix>//leases
I1212 00:12:43.239572 7 master.go:432] Enabling API group "coordination.k8s.io".
I1212 00:12:43.239956 7 store.go:1414] Monitoring replicationcontrollers count at <storage-prefix>//controllers
I1212 00:12:43.240365 7 store.go:1414] Monitoring daemonsets.extensions count at <storage-prefix>//daemonsets
I1212 00:12:43.240731 7 store.go:1414] Monitoring deployments.extensions count at <storage-prefix>//deployments
I1212 00:12:43.241123 7 store.go:1414] Monitoring ingresses.extensions count at <storage-prefix>//ingress
I1212 00:12:43.241545 7 store.go:1414] Monitoring podsecuritypolicies.policy count at <storage-prefix>//podsecuritypolicy
I1212 00:12:43.241975 7 store.go:1414] Monitoring replicasets.extensions count at <storage-prefix>//replicasets
I1212 00:12:43.242372 7 store.go:1414] Monitoring networkpolicies.networking.k8s.io count at <storage-prefix>//networkpolicies
I1212 00:12:43.242385 7 master.go:432] Enabling API group "extensions".
I1212 00:12:43.242779 7 store.go:1414] Monitoring networkpolicies.networking.k8s.io count at <storage-prefix>//networkpolicies
I1212 00:12:43.242791 7 master.go:432] Enabling API group "networking.k8s.io".
I1212 00:12:43.243237 7 store.go:1414] Monitoring poddisruptionbudgets.policy count at <storage-prefix>//poddisruptionbudgets
I1212 00:12:43.243653 7 store.go:1414] Monitoring podsecuritypolicies.policy count at <storage-prefix>//podsecuritypolicy
I1212 00:12:43.243666 7 master.go:432] Enabling API group "policy".
I1212 00:12:43.243998 7 store.go:1414] Monitoring roles.rbac.authorization.k8s.io count at <storage-prefix>//roles
I1212 00:12:43.244431 7 store.go:1414] Monitoring rolebindings.rbac.authorization.k8s.io count at <storage-prefix>//rolebindings
I1212 00:12:43.244808 7 store.go:1414] Monitoring clusterroles.rbac.authorization.k8s.io count at <storage-prefix>//clusterroles
I1212 00:12:43.245201 7 store.go:1414] Monitoring clusterrolebindings.rbac.authorization.k8s.io count at <storage-prefix>//clusterrolebindings
I1212 00:12:43.245546 7 store.go:1414] Monitoring roles.rbac.authorization.k8s.io count at <storage-prefix>//roles
I1212 00:12:43.245916 7 store.go:1414] Monitoring rolebindings.rbac.authorization.k8s.io count at <storage-prefix>//rolebindings
I1212 00:12:43.246314 7 store.go:1414] Monitoring clusterroles.rbac.authorization.k8s.io count at <storage-prefix>//clusterroles
I1212 00:12:43.246702 7 store.go:1414] Monitoring clusterrolebindings.rbac.authorization.k8s.io count at <storage-prefix>//clusterrolebindings
I1212 00:12:43.246718 7 master.go:432] Enabling API group "rbac.authorization.k8s.io".
I1212 00:12:43.247889 7 store.go:1414] Monitoring priorityclasses.scheduling.k8s.io count at <storage-prefix>//priorityclasses
I1212 00:12:43.247908 7 master.go:432] Enabling API group "scheduling.k8s.io".
I1212 00:12:43.247920 7 master.go:424] Skipping disabled API group "settings.k8s.io".
I1212 00:12:43.248329 7 store.go:1414] Monitoring storageclasses.storage.k8s.io count at <storage-prefix>//storageclasses
I1212 00:12:43.248726 7 store.go:1414] Monitoring volumeattachments.storage.k8s.io count at <storage-prefix>//volumeattachments
I1212 00:12:43.249164 7 store.go:1414] Monitoring storageclasses.storage.k8s.io count at <storage-prefix>//storageclasses
I1212 00:12:43.249176 7 master.go:432] Enabling API group "storage.k8s.io".
I1212 00:12:43.249588 7 store.go:1414] Monitoring deployments.extensions count at <storage-prefix>//deployments
I1212 00:12:43.249996 7 store.go:1414] Monitoring statefulsets.apps count at <storage-prefix>//statefulsets
I1212 00:12:43.250453 7 store.go:1414] Monitoring controllerrevisions.apps count at <storage-prefix>//controllerrevisions
I1212 00:12:43.250895 7 store.go:1414] Monitoring deployments.extensions count at <storage-prefix>//deployments
I1212 00:12:43.251298 7 store.go:1414] Monitoring statefulsets.apps count at <storage-prefix>//statefulsets
I1212 00:12:43.251706 7 store.go:1414] Monitoring daemonsets.extensions count at <storage-prefix>//daemonsets
I1212 00:12:43.252085 7 store.go:1414] Monitoring replicasets.extensions count at <storage-prefix>//replicasets
I1212 00:12:43.274505 7 store.go:1414] Monitoring controllerrevisions.apps count at <storage-prefix>//controllerrevisions
I1212 00:12:43.275002 7 store.go:1414] Monitoring deployments.extensions count at <storage-prefix>//deployments
I1212 00:12:43.276141 7 store.go:1414] Monitoring statefulsets.apps count at <storage-prefix>//statefulsets
I1212 00:12:43.277721 7 store.go:1414] Monitoring daemonsets.extensions count at <storage-prefix>//daemonsets
I1212 00:12:43.279482 7 store.go:1414] Monitoring replicasets.extensions count at <storage-prefix>//replicasets
I1212 00:12:43.279883 7 store.go:1414] Monitoring controllerrevisions.apps count at <storage-prefix>//controllerrevisions
I1212 00:12:43.279895 7 master.go:432] Enabling API group "apps".
I1212 00:12:43.280238 7 store.go:1414] Monitoring initializerconfigurations.admissionregistration.k8s.io count at <storage-prefix>//initializerconfigurations
I1212 00:12:43.280641 7 store.go:1414] Monitoring validatingwebhookconfigurations.admissionregistration.k8s.io count at <storage-prefix>//validatingwebhookconfigurations
I1212 00:12:43.280968 7 store.go:1414] Monitoring mutatingwebhookconfigurations.admissionregistration.k8s.io count at <storage-prefix>//mutatingwebhookconfigurations
I1212 00:12:43.280979 7 master.go:432] Enabling API group "admissionregistration.k8s.io".
I1212 00:12:43.281301 7 store.go:1414] Monitoring events count at <storage-prefix>//events
I1212 00:12:43.281312 7 master.go:432] Enabling API group "events.k8s.io".
W1212 00:12:43.516919 7 genericapiserver.go:325] Skipping API batch/v2alpha1 because it has no resources.
W1212 00:12:43.835670 7 genericapiserver.go:325] Skipping API rbac.authorization.k8s.io/v1alpha1 because it has no resources.
W1212 00:12:43.848163 7 genericapiserver.go:325] Skipping API scheduling.k8s.io/v1alpha1 because it has no resources.
W1212 00:12:43.869772 7 genericapiserver.go:325] Skipping API storage.k8s.io/v1alpha1 because it has no resources.
[restful] 2018/12/12 00:12:44 log.go:33: [restful/swagger] listing is available at https://10.5.0.30:443/swaggerapi
[restful] 2018/12/12 00:12:44 log.go:33: [restful/swagger] https://10.5.0.30:443/swaggerui/ is mapped to folder /swagger-ui/
[restful] 2018/12/12 00:12:45 log.go:33: [restful/swagger] listing is available at https://10.5.0.30:443/swaggerapi
[restful] 2018/12/12 00:12:45 log.go:33: [restful/swagger] https://10.5.0.30:443/swaggerui/ is mapped to folder /swagger-ui/
W1212 00:12:45.683798 7 admission.go:76] PersistentVolumeLabel admission controller is deprecated. Please remove this controller from your configuration files and scripts.
I1212 00:12:45.684127 7 plugins.go:158] Loaded 10 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,Priority,DefaultTolerationSeconds,PersistentVolumeLabel,DefaultStorageClass,MutatingAdmissionWebhook,Initializers.
I1212 00:12:45.684138 7 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
I1212 00:12:45.686223 7 store.go:1414] Monitoring apiservices.apiregistration.k8s.io count at <storage-prefix>//apiregistration.k8s.io/apiservices
I1212 00:12:45.686707 7 store.go:1414] Monitoring apiservices.apiregistration.k8s.io count at <storage-prefix>//apiregistration.k8s.io/apiservices
I1212 00:12:48.453407 7 deprecated_insecure_serving.go:50] Serving insecurely on 127.0.0.1:8080
I1212 00:12:48.454725 7 secure_serving.go:116] Serving securely on [::]:443
I1212 00:12:48.454763 7 autoregister_controller.go:136] Starting autoregister controller
I1212 00:12:48.454770 7 cache.go:32] Waiting for caches to sync for autoregister controller
I1212 00:12:48.454874 7 apiservice_controller.go:90] Starting APIServiceRegistrationController
I1212 00:12:48.454892 7 controller.go:84] Starting OpenAPI AggregationController
I1212 00:12:48.454902 7 cache.go:32] Waiting for caches to sync for APIServiceRegistrationController controller
I1212 00:12:48.454935 7 crdregistration_controller.go:112] Starting crd-autoregister controller
I1212 00:12:48.454932 7 crd_finalizer.go:242] Starting CRDFinalizer
I1212 00:12:48.454962 7 available_controller.go:278] Starting AvailableConditionController
I1212 00:12:48.454967 7 cache.go:32] Waiting for caches to sync for AvailableConditionController controller
I1212 00:12:48.454969 7 naming_controller.go:284] Starting NamingConditionController
I1212 00:12:48.454994 7 establishing_controller.go:73] Starting EstablishingController
I1212 00:12:48.454950 7 controller_utils.go:1027] Waiting for caches to sync for crd-autoregister controller
I1212 00:12:48.455033 7 customresource_discovery_controller.go:199] Starting DiscoveryController
I1212 00:12:58.923688 7 trace.go:76] Trace[1029194318]: "Create /api/v1/namespaces/kube-system/serviceaccounts" (started: 2018-12-12 00:12:48.921492128 +0000 UTC m=+6.629693070) (total time: 10.002174722s):
Trace[1029194318]: [10.002174722s] [10.00039192s] END
I1212 00:13:08.925557 7 trace.go:76] Trace[645995136]: "Create /apis/rbac.authorization.k8s.io/v1beta1/clusterrolebindings" (started: 2018-12-12 00:12:58.924586395 +0000 UTC m=+16.632787332) (total time: 10.000946296s):
Trace[645995136]: [10.000946296s] [10.00036682s] END
I1212 00:13:24.847128 7 shared_informer.go:119] stop requested
I1212 00:13:24.847145 7 shared_informer.go:119] stop requested
I1212 00:13:24.847146 7 shared_informer.go:119] stop requested
I1212 00:13:24.847144 7 secure_serving.go:156] Stopped listening on 127.0.0.1:8080
I1212 00:13:24.847158 7 shared_informer.go:119] stop requested
I1212 00:13:24.847160 7 shared_informer.go:119] stop requested
I1212 00:13:24.847158 7 shared_informer.go:119] stop requested
E1212 00:13:24.847165 7 customresource_discovery_controller.go:202] timed out waiting for caches to sync
I1212 00:13:24.847168 7 crd_finalizer.go:246] Shutting down CRDFinalizer
E1212 00:13:24.847171 7 controller_utils.go:1030] Unable to sync caches for crd-autoregister controller
I1212 00:13:24.847172 7 shared_informer.go:119] stop requested
I1212 00:13:24.847171 7 customresource_discovery_controller.go:203] Shutting down DiscoveryController
E1212 00:13:24.847180 7 cache.go:35] Unable to sync caches for autoregister controller
E1212 00:13:24.847148 7 cache.go:35] Unable to sync caches for APIServiceRegistrationController controller
I1212 00:13:24.847157 7 establishing_controller.go:77] Shutting down EstablishingController
I1212 00:13:24.847135 7 shared_informer.go:119] stop requested
I1212 00:13:24.847215 7 secure_serving.go:156] Stopped listening on [::]:443
I1212 00:13:24.847215 7 controller.go:171] Shutting down kubernetes service endpoint reconciler
E1212 00:13:24.847225 7 cache.go:35] Unable to sync caches for AvailableConditionController controller
I1212 00:13:24.847152 7 naming_controller.go:288] Shutting down NamingConditionController
I1212 00:13:24.847186 7 controller.go:90] Shutting down OpenAPI AggregationController
I1212 00:13:24.848248 7 crdregistration_controller.go:117] Shutting down crd-autoregister controller
I1212 00:13:24.849329 7 autoregister_controller.go:141] Shutting down autoregister controller
I1212 00:13:24.850406 7 apiservice_controller.go:94] Shutting down APIServiceRegistrationController
I1212 00:13:24.851479 7 available_controller.go:282] Shutting down AvailableConditionController
E1212 00:13:34.847575 7 controller.go:173] rpc error: code = Unavailable desc = transport is closing
E1212 00:13:48.464293 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.464359 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.465402 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
I1212 00:13:48.466507 7 trace.go:76] Trace[1330970842]: "List /apis/admissionregistration.k8s.io/v1alpha1/initializerconfigurations" (started: 2018-12-12 00:12:48.464197188 +0000 UTC m=+6.172398126) (total time: 1m0.00229233s):
Trace[1330970842]: [1m0.00229233s] [1m0.002288147s] END
E1212 00:13:48.466971 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.467596 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.468649 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
I1212 00:13:48.469741 7 trace.go:76] Trace[1868745693]: "List /apis/admissionregistration.k8s.io/v1alpha1/initializerconfigurations" (started: 2018-12-12 00:12:48.466884694 +0000 UTC m=+6.175085674) (total time: 1m0.002842133s):
Trace[1868745693]: [1m0.002842133s] [1m0.002837372s] END
E1212 00:13:48.470629 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.470821 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.470927 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.471076 7 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Secret: the server was unable to return a response in the time allotted, but may still be processing the request (get secrets)
E1212 00:13:48.471550 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.471675 7 reflector.go:134] k8s.io/apiextensions-apiserver/pkg/client/informers/internalversion/factory.go:117: Failed to list *apiextensions.CustomResourceDefinition: the server was unable to return a response in the time allotted, but may still be processing the request (get customresourcedefinitions.apiextensions.k8s.io)
E1212 00:13:48.471884 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
I1212 00:13:48.472979 7 trace.go:76] Trace[1800478074]: "List /apis/admissionregistration.k8s.io/v1alpha1/initializerconfigurations" (started: 2018-12-12 00:12:48.470532433 +0000 UTC m=+6.178733370) (total time: 1m0.002432073s):
Trace[1800478074]: [1m0.002432073s] [1m0.002427554s] END
E1212 00:13:48.474023 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.475085 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.477257 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
I1212 00:13:48.479430 7 trace.go:76] Trace[1280622339]: "List /api/v1/secrets" (started: 2018-12-12 00:12:48.470911773 +0000 UTC m=+6.179112712) (total time: 1m0.008501979s):
Trace[1280622339]: [1m0.008501979s] [1m0.008459411s] END
E1212 00:13:48.480486 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
I1212 00:13:48.481577 7 trace.go:76] Trace[1804652784]: "List /apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions" (started: 2018-12-12 00:12:48.471466491 +0000 UTC m=+6.179667429) (total time: 1m0.010100941s):
Trace[1804652784]: [1m0.010100941s] [1m0.010060801s] END
E1212 00:13:48.500678 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.500713 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.500806 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.500845 7 reflector.go:134] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:130: Failed to list *scheduling.PriorityClass: the server was unable to return a response in the time allotted, but may still be processing the request (get priorityclasses.scheduling.k8s.io)
E1212 00:13:48.500882 7 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.ClusterRole: the server was unable to return a response in the time allotted, but may still be processing the request (get clusterroles.rbac.authorization.k8s.io)
E1212 00:13:48.500903 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.500953 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.500957 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.500957 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.500979 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.500987 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.501007 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.501090 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.501091 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.501147 7 reflector.go:134] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:130: Failed to list *storage.StorageClass: the server was unable to return a response in the time allotted, but may still be processing the request (get storageclasses.storage.k8s.io)
E1212 00:13:48.501156 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.501160 7 reflector.go:134] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:130: Failed to list *core.LimitRange: the server was unable to return a response in the time allotted, but may still be processing the request (get limitranges)
E1212 00:13:48.501238 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.501241 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.501275 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.501284 7 reflector.go:134] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:130: Failed to list *core.Secret: the server was unable to return a response in the time allotted, but may still be processing the request (get secrets)
E1212 00:13:48.501395 7 reflector.go:134] k8s.io/kube-aggregator/pkg/client/informers/internalversion/factory.go:117: Failed to list *apiregistration.APIService: the server was unable to return a response in the time allotted, but may still be processing the request (get apiservices.apiregistration.k8s.io)
E1212 00:13:48.501398 7 reflector.go:134] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:130: Failed to list *core.PersistentVolume: the server was unable to return a response in the time allotted, but may still be processing the request (get persistentvolumes)
E1212 00:13:48.501439 7 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.ServiceAccount: the server was unable to return a response in the time allotted, but may still be processing the request (get serviceaccounts)
E1212 00:13:48.501457 7 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1beta1.ValidatingWebhookConfiguration: the server was unable to return a response in the time allotted, but may still be processing the request (get validatingwebhookconfigurations.admissionregistration.k8s.io)
E1212 00:13:48.501501 7 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Service: the server was unable to return a response in the time allotted, but may still be processing the request (get services)
E1212 00:13:48.501524 7 reflector.go:134] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:130: Failed to list *core.ResourceQuota: the server was unable to return a response in the time allotted, but may still be processing the request (get resourcequotas)
E1212 00:13:48.501628 7 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Pod: the server was unable to return a response in the time allotted, but may still be processing the request (get pods)
E1212 00:13:48.501687 7 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1beta1.MutatingWebhookConfiguration: the server was unable to return a response in the time allotted, but may still be processing the request (get mutatingwebhookconfigurations.admissionregistration.k8s.io)
E1212 00:13:48.501731 7 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Namespace: the server was unable to return a response in the time allotted, but may still be processing the request (get namespaces)
E1212 00:13:48.501747 7 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.RoleBinding: the server was unable to return a response in the time allotted, but may still be processing the request (get rolebindings.rbac.authorization.k8s.io)
E1212 00:13:48.501776 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.501974 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.502090 7 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Role: the server was unable to return a response in the time allotted, but may still be processing the request (get roles.rbac.authorization.k8s.io)
I1212 00:13:48.502863 7 trace.go:76] Trace[2003208653]: "List /apis/scheduling.k8s.io/v1beta1/priorityclasses" (started: 2018-12-12 00:12:48.50058919 +0000 UTC m=+6.208790128) (total time: 1m0.002260482s):
Trace[2003208653]: [1m0.002260482s] [1m0.002225647s] END
E1212 00:13:48.503663 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.503680 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.503783 7 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.StorageClass: the server was unable to return a response in the time allotted, but may still be processing the request (get storageclasses.storage.k8s.io)
E1212 00:13:48.503809 7 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.ClusterRoleBinding: the server was unable to return a response in the time allotted, but may still be processing the request (get clusterrolebindings.rbac.authorization.k8s.io)
E1212 00:13:48.503945 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.503984 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.504107 7 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Endpoints: the server was unable to return a response in the time allotted, but may still be processing the request (get endpoints)
E1212 00:13:48.504981 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.508235 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.509303 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.510393 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.511474 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.512543 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.513624 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.514705 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.515781 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.516852 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.517948 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.521205 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.522313 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.523383 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.536332 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.538482 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.539550 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.542790 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
I1212 00:13:48.544949 7 trace.go:76] Trace[1776629030]: "List /apis/rbac.authorization.k8s.io/v1/clusterroles" (started: 2018-12-12 00:12:48.500703254 +0000 UTC m=+6.208904191) (total time: 1m0.044230748s):
Trace[1776629030]: [1m0.044230748s] [1m0.044191519s] END
E1212 00:13:48.546005 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.547081 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.548160 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.549233 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.550326 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.551402 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.552483 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.553559 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.554642 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.555717 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.556795 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.557885 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.558957 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.560038 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.561114 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.562191 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.563270 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
I1212 00:13:48.564370 7 trace.go:76] Trace[1439133424]: "List /apis/storage.k8s.io/v1/storageclasses" (started: 2018-12-12 00:12:48.50081678 +0000 UTC m=+6.209017718) (total time: 1m0.063540529s):
Trace[1439133424]: [1m0.063540529s] [1m0.063506486s] END
I1212 00:13:48.565439 7 trace.go:76] Trace[1683817720]: "List /api/v1/serviceaccounts" (started: 2018-12-12 00:12:48.500940682 +0000 UTC m=+6.209141619) (total time: 1m0.064488328s):
Trace[1683817720]: [1m0.064488328s] [1m0.064456016s] END
I1212 00:13:48.566514 7 trace.go:76] Trace[491490319]: "List /api/v1/persistentvolumes" (started: 2018-12-12 00:12:48.500986525 +0000 UTC m=+6.209187462) (total time: 1m0.065518757s):
Trace[491490319]: [1m0.065518757s] [1m0.065482619s] END
I1212 00:13:48.567591 7 trace.go:76] Trace[1474503645]: "List /apis/apiregistration.k8s.io/v1/apiservices" (started: 2018-12-12 00:12:48.500875642 +0000 UTC m=+6.209076622) (total time: 1m0.066706928s):
Trace[1474503645]: [1m0.066706928s] [1m0.066656322s] END
I1212 00:13:48.568677 7 trace.go:76] Trace[635852309]: "List /api/v1/secrets" (started: 2018-12-12 00:12:48.500940686 +0000 UTC m=+6.209141623) (total time: 1m0.06772371s):
Trace[635852309]: [1m0.06772371s] [1m0.067687683s] END
I1212 00:13:48.569755 7 trace.go:76] Trace[175882069]: "List /api/v1/limitranges" (started: 2018-12-12 00:12:48.500925548 +0000 UTC m=+6.209126486) (total time: 1m0.06881921s):
Trace[175882069]: [1m0.06881921s] [1m0.068781117s] END
I1212 00:13:48.570836 7 trace.go:76] Trace[122202535]: "List /api/v1/pods" (started: 2018-12-12 00:12:48.500951889 +0000 UTC m=+6.209152828) (total time: 1m0.069871581s):
Trace[122202535]: [1m0.069871581s] [1m0.069830326s] END
I1212 00:13:48.571908 7 trace.go:76] Trace[865708000]: "List /api/v1/resourcequotas" (started: 2018-12-12 00:12:48.501056066 +0000 UTC m=+6.209257003) (total time: 1m0.070840152s):
Trace[865708000]: [1m0.070840152s] [1m0.070808759s] END
I1212 00:13:48.572979 7 trace.go:76] Trace[955305514]: "List /apis/rbac.authorization.k8s.io/v1/rolebindings" (started: 2018-12-12 00:12:48.501055621 +0000 UTC m=+6.209256562) (total time: 1m0.071915466s):
Trace[955305514]: [1m0.071915466s] [1m0.071884923s] END
I1212 00:13:48.574060 7 trace.go:76] Trace[1423473229]: "List /api/v1/namespaces" (started: 2018-12-12 00:12:48.501149822 +0000 UTC m=+6.209350759) (total time: 1m0.072900808s):
Trace[1423473229]: [1m0.072900808s] [1m0.072867725s] END
I1212 00:13:48.575139 7 trace.go:76] Trace[802608035]: "List /apis/admissionregistration.k8s.io/v1beta1/validatingwebhookconfigurations" (started: 2018-12-12 00:12:48.501149182 +0000 UTC m=+6.209350109) (total time: 1m0.073979725s):
Trace[802608035]: [1m0.073979725s] [1m0.073948799s] END
I1212 00:13:48.576217 7 trace.go:76] Trace[1021760621]: "List /apis/admissionregistration.k8s.io/v1beta1/mutatingwebhookconfigurations"
(started: 2018-12-12 00:12:48.501154269 +0000 UTC m=+6.209355207) (total time: 1m0.075052452s): Trace[1021760621]: [1m0.075052452s] [1m0.075012296s] END I1212 00:13:48.577292 7 trace.go:76] Trace[1969470568]: "List /api/v1/services" (started: 2018-12-12 00:12:48.501258385 +0000 UTC m=+6.209459322) (total time: 1m0.076025789s): Trace[1969470568]: [1m0.076025789s] [1m0.076004504s] END I1212 00:13:48.578373 7 trace.go:76] Trace[1871147953]: "List /apis/rbac.authorization.k8s.io/v1/roles" (started: 2018-12-12 00:12:48.501860956 +0000 UTC m=+6.210061881) (total time: 1m0.076503388s): Trace[1871147953]: [1m0.076503388s] [1m0.076480245s] END I1212 00:13:48.579453 7 trace.go:76] Trace[640462565]: "List /apis/storage.k8s.io/v1/storageclasses" (started: 2018-12-12 00:12:48.503571787 +0000 UTC m=+6.211772724) (total time: 1m0.075871435s): Trace[640462565]: [1m0.075871435s] [1m0.075846101s] END I1212 00:13:48.580530 7 trace.go:76] Trace[759626822]: "List /apis/rbac.authorization.k8s.io/v1/clusterrolebindings" (started: 2018-12-12 00:12:48.50357283 +0000 UTC m=+6.211773767) (total time: 1m0.076948558s): Trace[759626822]: [1m0.076948558s] [1m0.076917912s] END I1212 00:13:48.581612 7 trace.go:76] Trace[647924664]: "List /api/v1/endpoints" (started: 2018-12-12 00:12:48.503957566 +0000 UTC m=+6.212158503) (total time: 1m0.077645739s): Trace[647924664]: [1m0.077645739s] [1m0.077595063s] END E1212 00:13:49.455364 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"} E1212 00:13:49.455402 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout E1212 00:13:49.455432 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"} E1212 00:13:49.455591 7 storage_rbac.go:154] unable to initialize clusterroles: the server was unable to return a response in the time allotted, but may still be processing the request (get clusterroles.rbac.authorization.k8s.io) E1212 00:13:49.455602 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"} W1212 00:13:49.455631 7 storage_scheduling.go:95] unable to get PriorityClass system-node-critical: the server was unable to return a response in the time allotted, but may still be processing the request (get priorityclasses.scheduling.k8s.io system-node-critical). Retrying... F1212 00:13:49.455641 7 hooks.go:188] PostStartHook "scheduling/bootstrap-system-priority-classes" failed: unable to add default system priority classes: the server was unable to return a response in the time allotted, but may still be processing the request (get priorityclasses.scheduling.k8s.io system-node-critical) E1212 00:13:49.489143 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"} E1212 00:13:49.500198 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout E1212 00:13:49.511157 7 client_ca_hook.go:72] Post https://[::1]:443/api/v1/namespaces: dial tcp [::1]:443: connect: connection refused
I was able to replicate this consistently. The one time I did get a full upgrade to succeed, I then rotated the cluster one more time with no updates, and at that point I once again saw the corruption.
To ensure that this was not an issue with the k8s and etcd versions I picked, I once again created a new kops cluster and then updated k8s and etcd to the versions mentioned above. This time, however, I set the etcd provisioner in kops to legacy; the cluster upgrade succeeded with no issues, and subsequent cluster rotations have not caused any visible issues.
We're testing a new Kubernetes cluster on AWS built with kops 1.12, running etcd 3 with the etcd-manager. Each master node runs two instances of etcd-manager (main and events):
NAMESPACE NAME READY STATUS
kube-system etcd-manager-events-ip-172-22-129-234.ec2.internal 1/1 Running
kube-system etcd-manager-main-ip-172-22-129-234.ec2.internal 1/1 Running
While testing the rollout of master nodes, we've observed that, due to how assignDevice works, it returns the same device on both instances at the first try, introducing a one-minute delay in the master rollout. To explain, see the following (redacted) logs.
etcd-manager-events logs:
I0611 09:35:18.441963 9418 main.go:228] Mounting available etcd volumes matching tags [k8s.io/etcd/events k8s.io/role/master=1 kubernetes.io/cluster/REDACTED=owned]; nameTag=k8s.io/etcd/events
I0611 09:35:18.444481 9418 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0611 09:35:18.628872 9418 mounter.go:288] Trying to mount master volume: "vol-0033a5507d5546fe4"
I0611 09:35:18.628998 9418 volumes.go:85] AWS API Request: ec2/AttachVolume
I0611 09:35:18.890734 9418 volumes.go:339] AttachVolume request returned {
AttachTime: 2019-06-11 09:35:18.866 +0000 UTC,
Device: "/dev/xvdu",
InstanceId: "i-0abc13d47bb0d19cd",
State: "attaching",
VolumeId: "vol-0033a5507d5546fe4"
}
I0611 09:35:18.890895 9418 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0611 09:35:18.983082 9418 mounter.go:302] Currently attached volumes: [0xc000238100]
I0611 09:35:18.983115 9418 mounter.go:64] Master volume "vol-0033a5507d5546fe4" is attached at "/dev/xvdu"
etcd-manager-main logs:
I0611 09:35:18.601952 9498 main.go:228] Mounting available etcd volumes matching tags [k8s.io/etcd/main k8s.io/role/master=1 kubernetes.io/cluster/REDACTED=owned]; nameTag=k8s.io/etcd/main
I0611 09:35:18.607188 9498 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0611 09:35:18.828121 9498 mounter.go:288] Trying to mount master volume: "vol-045d841d4ec069864"
I0611 09:35:18.828251 9498 volumes.go:85] AWS API Request: ec2/AttachVolume
W0611 09:35:19.114926 9498 mounter.go:293] Error attaching volume "vol-045d841d4ec069864": Error attaching EBS volume "vol-045d841d4ec069864": InvalidParameterValue: Invalid value '/dev/xvdu' for unixDevice. Attachment point /dev/xvdu is already in use
status code: 400, request id: b668745c-64b9-46a0-af1a-61c6352daaed
I0611 09:35:19.114951 9498 mounter.go:302] Currently attached volumes: []
I0611 09:35:19.114966 9498 boot.go:49] waiting for volumes
I0611 09:36:19.115312 9498 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0611 09:36:19.204717 9498 mounter.go:288] Trying to mount master volume: "vol-045d841d4ec069864"
I0611 09:36:19.204841 9498 volumes.go:85] AWS API Request: ec2/AttachVolume
I0611 09:36:19.483306 9498 volumes.go:339] AttachVolume request returned {
AttachTime: 2019-06-11 09:36:19.439 +0000 UTC,
Device: "/dev/xvdv",
InstanceId: "i-0abc13d47bb0d19cd",
State: "attaching",
VolumeId: "vol-045d841d4ec069864"
}
I0611 09:36:19.483477 9498 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0611 09:36:19.614795 9498 mounter.go:302] Currently attached volumes: [0xc000096080]
I0611 09:36:19.614822 9498 mounter.go:64] Master volume "vol-045d841d4ec069864" is attached at "/dev/xvdv"
As you can see, since the first EBS volume attachment failed with "Attachment point /dev/xvdu is already in use", it reconciles only after 60 seconds, introducing a 60-second delay in the bootstrapping of master nodes.
A few options / ideas to start the conversation:
- Add a Volume.PreferredLocalDevice, populated by an optional volume tag read from the cloud provider, so that kops can set a different preferred local device for each volume. Volume.PreferredLocalDevice is then passed to assignDevice(), which returns the preferred device if set and available, and otherwise falls back to the current logic (see the sketch below).
- Start assignDevice() from a different (e.g. randomized) device and iterate over the next ones from there (this just reduces the likelihood of a collision, at the cost of having assignDevice() behave in a non-deterministic way).
We don't want this to be a backdoor.
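To make the first option concrete, here is a minimal Go sketch, not etcd-manager's actual code: the Volume type and its PreferredLocalDevice field are the hypothetical addition, the /dev/xvd[u-z] range is illustrative, and the in-process inUse map stands in for the EC2 attachment state that the two real etcd-manager processes race on.

package main

import "fmt"

// Volume is a simplified stand-in for etcd-manager's volume type; the
// PreferredLocalDevice field is the hypothetical addition proposed above,
// populated from an optional volume tag.
type Volume struct {
	ID                   string
	PreferredLocalDevice string
}

// assignDevice honors the preferred device when it is set and free, and
// otherwise falls back to scanning a fixed device range in order.
func assignDevice(v Volume, inUse map[string]bool) (string, error) {
	if v.PreferredLocalDevice != "" && !inUse[v.PreferredLocalDevice] {
		inUse[v.PreferredLocalDevice] = true
		return v.PreferredLocalDevice, nil
	}
	for c := 'u'; c <= 'z'; c++ {
		d := "/dev/xvd" + string(c)
		if !inUse[d] {
			inUse[d] = true
			return d, nil
		}
	}
	return "", fmt.Errorf("no free device for volume %s", v.ID)
}

func main() {
	// With distinct preferred devices set via tags, the events and main
	// volumes no longer both ask for /dev/xvdu on the first try.
	inUse := map[string]bool{}
	for _, v := range []Volume{
		{ID: "vol-0033a5507d5546fe4", PreferredLocalDevice: "/dev/xvdu"},
		{ID: "vol-045d841d4ec069864", PreferredLocalDevice: "/dev/xvdv"},
	} {
		d, _ := assignDevice(v, inUse)
		fmt.Println(v.ID, "->", d)
	}
}

The appeal of the tag-based preference is that it avoids the collision without any coordination between the two processes, since the assignment is decided ahead of time by kops.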
The OpenStack API shows the path for the volume:
/dev/vdd on master-zone-1-2-1-ownold-master-k8s-local
Logs:
I0913 18:04:49.564932 1578 mounter.go:64] Master volume "44a9c2a0-4648-4ce1-8f68-34a081521ba1" is attached at "/dev/vdd"
I0913 18:04:49.564993 1578 mounter.go:78] Doing safe-format-and-mount of /dev/vdd to /mnt/master-2.etcd-main.ownold-master.k8s.local
I0913 18:04:49.565025 1578 mounter.go:113] Waiting for volume "44a9c2a0-4648-4ce1-8f68-34a081521ba1" to be mounted
I0913 18:04:50.565341 1578 mounter.go:113] Waiting for volume "44a9c2a0-4648-4ce1-8f68-34a081521ba1" to be mounted
I0913 18:04:51.565517 1578 mounter.go:113] Waiting for volume "44a9c2a0-4648-4ce1-8f68-34a081521ba1" to be mounted
I0913 18:04:52.565714 1578 mounter.go:113] Waiting for volume "44a9c2a0-4648-4ce1-8f68-34a081521ba1" to be mounted
I0913 18:04:53.565907 1578 mounter.go:113] Waiting for volume "44a9c2a0-4648-4ce1-8f68-34a081521ba1" to be mounted
I0913 18:04:54.566067 1578 mounter.go:113] Waiting for volume "44a9c2a0-4648-4ce1-8f68-34a081521ba1" to be mounted
I0913 18:04:55.566275 1578 mounter.go:113] Waiting for volume "44a9c2a0-4648-4ce1-8f68-34a081521ba1" to be mounted
I0913 18:04:56.566473 1578 mounter.go:113] Waiting for volume "44a9c2a0-4648-4ce1-8f68-34a081521ba1" to be mounted
I0913 18:04:57.566669 1578 mounter.go:113] Waiting for volume "44a9c2a0-4648-4ce1-8f68-34a081521ba1" to be mounted
I0913 18:04:58.566883 1578 mounter.go:113] Waiting for volume "44a9c2a0-4648-4ce1-8f68-34a081521ba1" to be mounted
I0913 18:04:59.567076 1578 mounter.go:113] Waiting for volume "44a9c2a0-4648-4ce1-8f68-34a081521ba1" to be mounted
However, when I check devices:
$ ls /dev/vd
vda vda1 vdb vdc
So the OpenStack API is reporting incorrect device paths; this is a known problem on the k8s side. It is solved using this function: https://github.com/kubernetes/kubernetes/blob/release-1.8/pkg/cloudprovider/providers/openstack/openstack_volumes.go#L331
When listing volumes by ID:
$ ls /dev/disk/by-id
virtio-44a9c2a0-4648-4ce1-8
the disk is there, but the path is incorrect: etcd-manager thinks it is at /dev/vdd instead of /dev/vdc.
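For reference, a minimal Go sketch of the by-id approach the linked Kubernetes helper takes (the function name here is hypothetical): prefer the udev-created symlink under /dev/disk/by-id, and only fall back to the API-reported path. udev truncates the volume ID to 20 characters, which is why only virtio-44a9c2a0-4648-4ce1-8 shows up above.

package main

import (
	"fmt"
	"path/filepath"
)

// devicePathByID resolves the real device for an OpenStack volume via its
// /dev/disk/by-id symlink, falling back to the (possibly wrong) path the
// API reported if no by-id entry exists.
func devicePathByID(volumeID, apiPath string) string {
	candidates := []string{"virtio-" + volumeID}
	if len(volumeID) > 20 {
		// udev truncates the serial to 20 characters.
		candidates = append(candidates, "virtio-"+volumeID[:20])
	}
	for _, name := range candidates {
		link := filepath.Join("/dev/disk/by-id", name)
		if resolved, err := filepath.EvalSymlinks(link); err == nil {
			return resolved
		}
	}
	return apiPath
}

func main() {
	// On the node above this would print /dev/vdc rather than /dev/vdd.
	fmt.Println(devicePathByID("44a9c2a0-4648-4ce1-8f68-34a081521ba1", "/dev/vdd"))
}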
Maybe we can just ship the primary versions, or use symlinks where we know them to be compatible.
RE: this todo - is there any work in flight for this, or are you looking for a contributor?
Any considerations for initial support? Do you see the MVP as providing paths to certificates for the init, or providing certs inline?
Hi all.
I have been reviewing the disaster recovery documentation, and it's not clear to me where I should execute the etcd-manager-ctl commands (list backups, restore backups, ...), so I have some questions:
- Does etcd-manager-ctl need API keys with the right permissions to access the S3 bucket where the etcd backups live?
My current setup is a K8s cluster with 3 masters, with etcd-manager (main and events) running on the master nodes using the manifests present in /etc/kubernetes/manifests/.
Happy to create a PR to improve the disaster recovery docs with this information.
Thanks
We have a situation where we have a healthy etcd-manager cluster with 3 masters, but we would like to move this running cluster to a different storage backend. I know we can do that using backup+restore, but that also means downtime for the Kubernetes APIs.
I have also seen situations where, for some reason, one etcd member is broken. I have not found a way to add a (new) member back to a cluster that still has 2/3 members healthy. The only way I have found is to create a new etcd-manager cluster from a backup. However, this is not a perfect way to do things, because it always leads to downtime and possible (small) data loss.
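For context, re-adding a replacement member to a cluster that still has quorum is something etcd's own client API supports; here is a minimal Go sketch using clientv3 (endpoints and peer URLs are placeholders, and this is manual work outside anything etcd-manager automates today):

package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	// Connect to the members that are still healthy.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://etcd-a.internal.example:4001", "http://etcd-b.internal.example:4001"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Register the replacement member; the new node must then be started
	// with --initial-cluster-state=existing before it can join the quorum.
	resp, err := cli.MemberAdd(ctx, []string{"http://etcd-c.internal.example:2380"})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("added member with ID %x", resp.Member.ID)
}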
During an attempted migration to etcd-manager on a kops cluster, tailing the etcd.log on the first node to be updated shows the following:
I0102 16:18:55.621144 5320 controller.go:137] peers: [peer{id:"etcd-eu-west-1a" endpoints:"172.21.56.25:3996" } peer{id:"etcd-eu-west-1b" endpoints:"172.21.66.14:3996" } peer{id:"etcd-eu-west-1c" endpoints:"172.21.108.164:3996" }]
I0102 16:18:55.622677 5320 controller.go:232] etcd cluster state: etcdClusterState
members:
peers:
etcdClusterPeerInfo{peer=peer{id:"etcd-eu-west-1a" endpoints:"172.21.56.25:3996" }, info=cluster_name:"etcd" node_configuration:<name:"etcd-eu-west-1a" peer_urls:"http://etcd-eu-west-1a.internal.stg.mycluster:2380" client_urls:"http://etcd-eu-west-1a.internal.stg.mycluster:4001" quarantined_client_urls:"http://etcd-eu-west-1a.internal.stg.mycluster:3994" > }
etcdClusterPeerInfo{peer=peer{id:"etcd-eu-west-1b" endpoints:"172.21.66.14:3996" }, info=cluster_name:"etcd" node_configuration:<name:"etcd-eu-west-1b" peer_urls:"http://etcd-eu-west-1b.internal.stg.mycluster:2380" client_urls:"http://etcd-eu-west-1b.internal.stg.mycluster:4001" quarantined_client_urls:"http://etcd-eu-west-1b.internal.stg.mycluster:3994" > }
etcdClusterPeerInfo{peer=peer{id:"etcd-eu-west-1c" endpoints:"172.21.108.164:3996" }, info=cluster_name:"etcd" node_configuration:<name:"etcd-eu-west-1c" peer_urls:"http://etcd-eu-west-1c.internal.stg.mycluster:2380" client_urls:"http://etcd-eu-west-1c.internal.stg.mycluster:4001" quarantined_client_urls:"http://etcd-eu-west-1c.internal.stg.mycluster:3994" > }
I0102 16:18:55.622744 5320 controller.go:233] etcd cluster members: map[]
I0102 16:18:55.622753 5320 controller.go:516] sending member map to all peers:
I0102 16:18:55.623935 5320 commands.go:22] not refreshing commands - TTL not hit
I0102 16:18:55.623955 5320 s3fs.go:210] Reading file "s3://kops-clusters.mycluster/stg.mycluster/backups/etcd/main/control/etcd-cluster-created"
I0102 16:18:55.647667 5320 controller.go:318] spec member_count:3 etcd_version:"2.2.1"
I0102 16:18:55.647693 5320 controller.go:375] etcd has 0 members registered, we want 3; will try to expand cluster
W0102 16:18:55.647700 5320 controller.go:663] unable to do backup before adding peer - no members
I0102 16:18:55.647706 5320 controller.go:667] will try to start etcd on new peer: etcdClusterPeerInfo{peer=peer{id:"etcd-eu-west-1a" endpoints:"172.21.56.25:3996" }, info=cluster_name:"etcd" node_configuration:<name:"etcd-eu-west-1a" peer_urls:"http://etcd-eu-west-1a.internal.stg.mycluster:2380" client_urls:"http://etcd-eu-west-1a.internal.stg.mycluster:4001" quarantined_client_urls:"http://etcd-eu-west-1a.internal.stg.mycluster:3994" > }
Gossip appears to work and the nodes see each other, but the advertisements appear to be missing a field, maybe?
1. Describe IN DETAIL the feature/behavior/change you would like to see.
A flag to set the allowed cipher suites, similar to the "--tls-cipher-suites" parameter used on the kubelet.
This necessity showed up after a vulnerability scan on a Kubernetes environment configured by kops. The Nessus scan revealed that etcd-manager doesn't restrict the use of insecure cipher suites (ECDHE-RSA-DES-CBC3-SHA and DES-CBC3-SHA).
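For illustration, a minimal Go sketch of what such a flag could map to internally, assuming the usual crypto/tls approach of an explicit CipherSuites allow-list (the specific suites chosen here are illustrative; in practice they would come from the flag's value):

package main

import (
	"crypto/tls"
	"fmt"
)

func main() {
	// An explicit allow-list excludes the 3DES suites flagged by the scan,
	// since anything not listed is never negotiated.
	cfg := &tls.Config{
		MinVersion: tls.VersionTLS12,
		CipherSuites: []uint16{
			tls.TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,
			tls.TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,
		},
	}
	fmt.Printf("allowed cipher suites: %v\n", cfg.CipherSuites)
}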
Hi Justin,
I am trying to build etcd-manager and I am getting the following error:
$ bazel build //cmd/etcd-manager //cmd/etcd-manager-ctl
ERROR: /home/tamal/go/src/kope.io/etcd-manager/cmd/etcd-manager-ctl/BUILD.bazel:3:1: no such package 'vendor/github.com/golang/glog': BUILD file not found on package path and referenced by '//cmd/etcd-manager-ctl:go_default_library'
ERROR: Analysis of target '//cmd/etcd-manager-ctl:etcd-manager-ctl' failed; build aborted: no such package 'vendor/github.com/golang/glog': BUILD file not found on package path
INFO: Elapsed time: 0.311s
FAILED: Build did NOT complete successfully (0 packages loaded)
currently loading: @io_bazel_rules_go//go/private
This is my first time using bazel. I installed bazel on an Ubuntu 16.04 machine following the instructions here: https://docs.bazel.build/versions/master/install-ubuntu.html
Any idea how to fix this error?
Currently etcd-manager depends on a very old kops version.
I'm currently working on making etcd-manager support Alicloud, and for that I need to update the vfs module in kops.
But I don't know the best way to upgrade the kops dependency in etcd-manager.
Can you please take a look at this? @justinsb
This is blocking #269.
Thanks!
Otherwise we could OOM on big backups / restores
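Presumably this means streaming the data instead of buffering it; a minimal Go sketch of that pattern (paths are placeholders, not etcd-manager's actual backup code):

package main

import (
	"io"
	"log"
	"os"
)

func main() {
	// Source and destination paths are illustrative.
	src, err := os.Open("/tmp/etcd-backup.db")
	if err != nil {
		log.Fatal(err)
	}
	defer src.Close()

	dst, err := os.Create("/mnt/backups/etcd-backup.db")
	if err != nil {
		log.Fatal(err)
	}
	defer dst.Close()

	// io.Copy streams through a small fixed-size internal buffer (32 KiB),
	// so memory use stays constant no matter how big the backup is,
	// unlike reading the whole file into memory first.
	if _, err := io.Copy(dst, src); err != nil {
		log.Fatal(err)
	}
}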
The documentation at https://github.com/kopeio/etcd-manager mentions etcd-manager-ctl as the tool for managing etcd backups/restores. Would it be possible to include that tool in https://hub.docker.com/r/kopeio/etcd-manager?
Moreover, it would be useful to also describe a more realistic scenario of using etcd-manager-ctl for backups/restores on an AWS (or other) system managed by kops.
Tried to create cluster on eu-north with kops 1.12.0 using:
kops create cluster --state s3://292662267961-k8s.local-eu-north-1-kops-storage --zones eu-north-1a --master-size t3.small --node-size t3.small --name test2.k8s.local
All resources on AWS are created, but the cluster isn't starting correctly.
Looking at the logs on the master node, I can see that the docker image for etcd-manager fails with:
I0516 13:20:39.596608 3806 s3context.go:164] got region from metadata: "eu-north-1"
W0516 13:20:39.596655 3806 controller.go:149] unexpected error running etcd cluster reconciliation loop: error refreshing control store after leadership change: error reading s3://292662267961-k8s.local-eu-north-1-kops-storage/test2.k8s.local/backups/etcd/events/control: eu-north-1 is not a valid region
Please check that your region is formatted correctly (e.g. us-east-1)
So it would seem that even though kops supports the eu-north region, etcd-manager doesn't?
I guess a simple update of the aws-sdk component in etcd-manager should solve this?
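For illustration, a minimal Go sketch of why updating the vendored SDK helps, assuming the aws-sdk-go endpoints package: the region table ships inside the SDK itself, so an older vendored copy simply doesn't know eu-north-1 and rejects it as invalid.

package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws/endpoints"
)

func main() {
	// Region validation consults the table compiled into the SDK; once the
	// vendored aws-sdk-go is new enough, eu-north-1 appears here.
	if _, ok := endpoints.AwsPartition().Regions()["eu-north-1"]; ok {
		fmt.Println("eu-north-1 is known to this aws-sdk-go version")
	} else {
		fmt.Println("eu-north-1 is missing: the vendored aws-sdk-go is too old")
	}
}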
Also created an issue on kops
Hi! etcd-manager doesn't support configuring heartbeat and leader election timeouts. Do you have it somewhere on the roadmap?
Currently etcd-manager supports GCE and AWS; we need similar support for DigitalOcean. I'm currently trying to run kops with etcd-manager on DigitalOcean, and that needs an update to etcd-manager.
Hi! Do you have a feature on the roadmap to allow putting the wal-dir on a separate disk/PV?