kopeio / etcd-manager
operator for etcd: moved to https://github.com/kubernetes-sigs/etcdadm
License: Apache License 2.0
Enable a separate port for metrics and allow access from the node SG, to simplify pulling etcd metrics in a secure way. Requires an upgrade to etcd 3.3.0+ as well.
Tag for K8s 1.14, where etcd has been defaulted to 3.3.10.
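For illustration, a minimal Go sketch (not etcd-manager's actual code; the port and wiring are assumptions) of serving metrics on a dedicated listener, so a security-group rule can allow nodes to reach only that port:

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.Handler())

	// Bind only the metrics endpoint here; etcd client/peer traffic
	// stays on its own, more tightly restricted ports.
	log.Fatal(http.ListenAndServe(":8081", mux)) // port is an assumption
}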
The state file used to initialize the etcd server should reflect when etcd members come and go. Currently, we update /etc/hosts when this information changes, but we do not update the state file under the base data directory for etcd-main or etcd-events. I also propose updating the state.Nodes information for the running etcd service to reflect the changes.
This manifests as bad node data being returned by the GetInfo peers service, since the state of the cluster comes from that state file, and it doesn't get updated once it's read. If you decide to expand your cluster and then shrink it (say, for moving to a new subnet), references to the old etcd members will still be seen in the logs when the "master" etcd server calls GetInfo on the other peers.
I built my cluster with etcd version 3.3.10 before running etcd-manager. Consequently, I now can't run etcd-manager, because it doesn't ship with a binary for etcd 3.3.10. What is the best way to build the etcd-manager image myself? And will etcd-manager run with etcd 3.3.x?
slice of the cluster.spec:
  etcdClusters:
  - backups:
      backupStore: s3://$KOPS_STATE_STORE/$KLUSTER_NAME/backups/etcd/main
    etcdMembers:
    - instanceGroup: master-eu-west-1c
      name: -2c
    - instanceGroup: master-eu-west-1b
      name: -2b
    - instanceGroup: master-eu-west-1a
      name: -2a
    manager:
      image: kopeio/etcd-manager:latest
    name: main
    version: 3.1.12
  - backups:
      backupStore: s3://$KOPS_STATE_STORE/$KLUSTER_NAME/backups/etcd/events
    etcdMembers:
    - instanceGroup: master-eu-west-1c
      name: -2c
    - instanceGroup: master-eu-west-1b
      name: -2b
    - instanceGroup: master-eu-west-1a
      name: -2a
    manager:
      image: kopeio/etcd-manager:latest
    name: events
    version: 3.1.12
:latest points to v3.0.20190125 (verified by SHASUM in docker images).
Docker logs show:
I0226 01:05:54.017737 1 main.go:243] discovered IP address: 192.168.132.131
I0226 01:05:54.017774 1 main.go:248] Setting data dir to /rootfs/mnt/master-vol-078729a1332f034f2
open /etc/kubernetes/pki/etcd-manager/etcd-manager-ca.key: no such file or directory
and etcd-manager exits without bringing up etcd.
Hi,
I'm using Datadog as our monitoring solution, but I feel like this is a more generic question. Previously (before etcd-manager was introduced) ports 4001 and 4002 were accessible; if I recall correctly, these are no longer exposed on the nodes.
Previously, our datadog check was configured like this:
---
init_config:
instances:
  - url: http://etcd-a.internal.CLUSTER_NAME:4001
What's the correct url to use now?
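For reference, a rough Go sketch of the kind of mTLS probe a monitoring check now needs against the etcd client endpoint; the certificate paths and the https://127.0.0.1:4001 URL are assumptions that depend on the cluster's configuration:

package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
)

func main() {
	caPEM, err := ioutil.ReadFile("/path/to/etcd-ca.crt") // path is an assumption
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	cert, err := tls.LoadX509KeyPair(
		"/path/to/etcd-client.crt", // assumption
		"/path/to/etcd-client.key", // assumption
	)
	if err != nil {
		log.Fatal(err)
	}

	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{RootCAs: pool, Certificates: []tls.Certificate{cert}},
	}}

	// etcd exposes Prometheus metrics on its client port under /metrics.
	resp, err := client.Get("https://127.0.0.1:4001/metrics")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}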
Linux ip-10-0-13-180 4.9.0-9-amd64 #1 SMP Debian 4.9.168-1 (2019-04-12) x86_64 GNU/Linux
I installed Bazel as per the instructions below - downloaded bazel-0.27.0-installer-linux-x86_64.sh and installed it (was fine apart from needing me to install unzip)
I cloned the repo down and cd'd into it.
When attempting to compile, etcd and etcdctl seem to build fine:
root@ip-10-0-4-171:/tmp/etcd-manager# bazel build //:etcd-v2.2.1-linux-amd64_etcd //:etcd-v2.2.1-linux-amd64_etcdctl
Starting local Bazel server and connecting to it...
INFO: Analyzed 2 targets (4 packages loaded, 21 targets configured).
INFO: Found 2 targets...
INFO: Elapsed time: 6.905s, Critical Path: 0.41s
INFO: 2 processes: 2 linux-sandbox.
INFO: Build completed successfully, 3 total actions
root@ip-10-0-4-171:/tmp/etcd-manager# bazel build //:etcd-v3.2.24-linux-amd64_etcd //:etcd-v3.2.24-linux-amd64_etcdctl
INFO: Analyzed 2 targets (1 packages loaded, 4 targets configured).
INFO: Found 2 targets...
INFO: Elapsed time: 4.088s, Critical Path: 0.51s
INFO: 2 processes: 2 linux-sandbox.
INFO: Build completed successfully, 3 total actions
root@ip-10-0-4-171:/tmp/etcd-manager# cp -r bazel-genfiles/etcd-v* /opt/
root@ip-10-0-4-171:/tmp/etcd-manager# chown -R ${USER} /opt/etcd-v*
root@ip-10-0-4-171:/tmp/etcd-manager# ls -lrt
total 116
-rw-r--r-- 1 root root 3679 Jun 24 15:18 WORKSPACE
-rw-r--r-- 1 root root 14520 Jun 24 15:18 README.md
-rw-r--r-- 1 root root 1210 Jun 24 15:18 Makefile
-rw-r--r-- 1 root root 11358 Jun 24 15:18 LICENSE
drwxr-xr-x 2 root root 4096 Jun 24 15:18 images
-rw-r--r-- 1 root root 1175 Jun 24 15:18 Gopkg.toml
-rw-r--r-- 1 root root 11964 Jun 24 15:18 Gopkg.lock
drwxr-xr-x 2 root root 4096 Jun 24 15:18 docs
drwxr-xr-x 2 root root 4096 Jun 24 15:18 dev
drwxr-xr-x 7 root root 4096 Jun 24 15:18 cmd
-rw-r--r-- 1 root root 643 Jun 24 15:18 cloudbuild.yaml
-rw-r--r-- 1 root root 1276 Jun 24 15:18 cloudbuild-master.yaml
-rw-r--r-- 1 root root 2097 Jun 24 15:18 BUILD
drwxr-xr-x 2 root root 4096 Jun 24 15:18 tools
drwxr-xr-x 3 root root 4096 Jun 24 15:18 test
drwxr-xr-x 20 root root 4096 Jun 24 15:18 pkg
drwxr-xr-x 8 root root 4096 Jun 24 15:18 vendor
lrwxrwxrwx 1 root root 113 Jun 24 15:24 bazel-testlogs -> /root/.cache/bazel/_bazel_root/7a0c727cd96de5bd2637ae27a8f98b84/execroot/__main__/bazel-out/k8-fastbuild/testlogs
lrwxrwxrwx 1 root root 91 Jun 24 15:24 bazel-out -> /root/.cache/bazel/_bazel_root/7a0c727cd96de5bd2637ae27a8f98b84/execroot/__main__/bazel-out
lrwxrwxrwx 1 root root 108 Jun 24 15:24 bazel-genfiles -> /root/.cache/bazel/_bazel_root/7a0c727cd96de5bd2637ae27a8f98b84/execroot/__main__/bazel-out/k8-fastbuild/bin
lrwxrwxrwx 1 root root 81 Jun 24 15:24 bazel-etcd-manager -> /root/.cache/bazel/_bazel_root/7a0c727cd96de5bd2637ae27a8f98b84/execroot/__main__
lrwxrwxrwx 1 root root 108 Jun 24 15:24 bazel-bin -> /root/.cache/bazel/_bazel_root/7a0c727cd96de5bd2637ae27a8f98b84/execroot/__main__/bazel-out/k8-fastbuild/bin
Attempted to compile etcd-manager-ctl - this fails
root@ip-10-0-13-221:/tmp/etcd-manager# bazel build //cmd/etcd-manager-ctl
ERROR: /root/.cache/bazel/_bazel_root/7a0c727cd96de5bd2637ae27a8f98b84/external/io_bazel_rules_go/BUILD.bazel:62:1: in go_context_data rule @io_bazel_rules_go//:go_context_data:
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/7a0c727cd96de5bd2637ae27a8f98b84/external/io_bazel_rules_go/BUILD.bazel", line 62
go_context_data(name = 'go_context_data')
File "/root/.cache/bazel/_bazel_root/7a0c727cd96de5bd2637ae27a8f98b84/external/io_bazel_rules_go/go/private/context.bzl", line 396, in _go_context_data_impl
cc_common.configure_features(cc_toolchain = cc_toolchain, reque..., ...)
Incompatible flag --incompatible_require_ctx_in_configure_features has been flipped, and the mandatory parameter 'ctx' of cc_common.configure_features is missing. Please add 'ctx' as a named parameter. See https://github.com/bazelbuild/bazel/issues/7793 for details.
ERROR: Analysis of target '//cmd/etcd-manager-ctl:etcd-manager-ctl' failed; build aborted: Analysis of target '@io_bazel_rules_go//:go_context_data' failed; build aborted
INFO: Elapsed time: 2.641s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (143 packages loaded, 1485 targets configured)
Fetching @org_golang_x_tools; Restarting.
root@ip-10-0-13-221:/tmp/etcd-manager# vi /root/.cache/bazel/_bazel_root/7a0c727cd96de5bd2637ae27a8f98b84/external/io_bazel_rules_go/go/private/context.bzl
I applied the suggestion
"Incompatible flag --incompatible_require_ctx_in_configure_features has been flipped, and the mandatory parameter 'ctx' of cc_common.configure_features is missing. Please add 'ctx' as a named parameter.m/bazelbuild/bazel/issues/7793 for details."
This results in a new error.
root@ip-10-0-13-221:/tmp/etcd-manager# bazel build //cmd/etcd-manager-ctl
ERROR: /root/.cache/bazel/_bazel_root/7a0c727cd96de5bd2637ae27a8f98b84/external/io_bazel_rules_go/BUILD.bazel:62:1: in go_context_data rule @io_bazel_rules_go//:go_context_data:
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/7a0c727cd96de5bd2637ae27a8f98b84/external/io_bazel_rules_go/BUILD.bazel", line 62
go_context_data(name = 'go_context_data')
File "/root/.cache/bazel/_bazel_root/7a0c727cd96de5bd2637ae27a8f98b84/external/io_bazel_rules_go/go/private/context.bzl", line 396, in _go_context_data_impl
cc_common.configure_features(ctx = ctx, cc_toolchain = cc_toolc..., <2 more arguments>)
go_context_data has to declare 'CppConfiguration' as a required fragment in target configuration in order to access it.
ERROR: Analysis of target '//cmd/etcd-manager-ctl:etcd-manager-ctl' failed; build aborted: Analysis of target '@io_bazel_rules_go//:go_context_data' failed; build aborted
INFO: Elapsed time: 0.721s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (128 packages loaded, 1224 targets configured)
We like the fact that etcd-manager automatically takes backups into S3 - and we can see this feature operating. However, we have found no way to carry out restores apart from building etcd-manager-ctl, which we're unable to do.
In the meantime, can we restore the "old" way using the backup in the S3 bucket? In other words, aws s3 cp it to the nodes and perform an etcdctl snapshot restore onto each node?
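As a sketch of that manual path only, assuming the object copied from S3 is a plain v3 snapshot file (which may not hold for etcd-manager's backup format), the per-node restore step would look roughly like this, here driven from Go; paths, member names, and URLs are placeholders:

package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	cmd := exec.Command("etcdctl", "snapshot", "restore", "/tmp/etcd.backup",
		"--name", "etcd-a",
		"--initial-cluster", "etcd-a=https://etcd-a.internal.example.com:2380",
		"--initial-advertise-peer-urls", "https://etcd-a.internal.example.com:2380",
		"--data-dir", "/mnt/restored-data-dir",
	)
	cmd.Env = append(os.Environ(), "ETCDCTL_API=3") // snapshot restore is a v3 command
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}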
Are there any options to disable etcd-manager and revert to legacy etcd-server?
After upgrading the k8s cluster to 1.12 with kops defaults (etcd-manager enabled), I tried to apply cluster.spec.etcdClusters[*].provider=Legacy. etcd-server started, but with a clean database (no deployments, services, etc). etcd-manager saves the db on EBS in a different directory, and it seems the dbs are incompatible.
Is there any solution/documentation on how to downgrade to pure etcd-server?
Thanks for ideas in advance.
When protokube managed etcd in kops versions prior to 1.10, it updated the internal IP addresses of the etcd members in AWS Route53.
I'm not sure whether this is now expected from etcd-manager, because as of now neither protokube nor etcd-manager seems to be taking care of updating the etcd endpoints.
Even though the cluster is healthy and etcd is discoverable by the API server, I am not sure this is the desired behavior.
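For context, the Route53 upsert protokube used to perform looks roughly like this with aws-sdk-go; the hosted zone ID, record name, and IP below are placeholders:

package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/route53"
)

func main() {
	svc := route53.New(session.Must(session.NewSession()))

	// UPSERT the member's A record to its current internal IP.
	_, err := svc.ChangeResourceRecordSets(&route53.ChangeResourceRecordSetsInput{
		HostedZoneId: aws.String("ZEXAMPLE123"), // placeholder zone ID
		ChangeBatch: &route53.ChangeBatch{
			Changes: []*route53.Change{{
				Action: aws.String(route53.ChangeActionUpsert),
				ResourceRecordSet: &route53.ResourceRecordSet{
					Name: aws.String("etcd-a.internal.cluster.example.com."),
					Type: aws.String(route53.RRTypeA),
					TTL:  aws.Int64(60),
					ResourceRecords: []*route53.ResourceRecord{
						{Value: aws.String("10.0.1.23")}, // member's internal IP
					},
				},
			}},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}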
We're testing out a procedure for a full master refresh using kops/etcd-manager (described here: https://hindenes.com/2019-08-09-Kops-Restore/).
In short, we wipe the masters, let kops set up new masters, and use etcd-manager-ctl to restore the last known backup. This seems to work very well.
However, we're noticing that in-cluster apps that need access to the Kubernetes API sometimes fail. This seems to be caused by the fact that old (deleted) masters are still present in the kubernetes endpoint (kubectl -n default get endpoints kubernetes -o=yaml).
This is probably not an etcd-manager problem at all, but I'm at a loss regarding how to get rid of references to old (non-existing) masters, so any pointers would be deeply appreciated.
Hi,
I've noticed after upgrading to kops/Kubernetes 1.12 that the internal record sets for etcd are set to the default placeholder 203.0.113.123. However, etcd seems to be functioning normally. Is this expected?
Good day.
We use etcd-manager with kops to manage etcd. By default, etcd-manager sets up backups to the bucket every 15 minutes. But I could not find out what the default retention is (https://github.com/kopeio/etcd-manager/blob/master/pkg/backupcontroller/cleanup.go) or how it can be configured.
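Not knowing cleanup.go's actual policy, here is a hedged Go sketch of the general shape such a retention rule takes - keep every recent backup, thin older ones to one per day; the durations and names are hypothetical, not etcd-manager's:

package main

import (
	"fmt"
	"sort"
	"time"
)

type backup struct {
	Name string
	Time time.Time
}

// backupsToDelete keeps every backup from the last 24h and one per day
// beyond that; everything else is returned for deletion.
func backupsToDelete(backups []backup, now time.Time) []string {
	sort.Slice(backups, func(i, j int) bool { return backups[i].Time.Before(backups[j].Time) })

	keepAllAfter := now.Add(-24 * time.Hour)
	seenDay := map[string]bool{}
	var doomed []string

	for _, b := range backups {
		if b.Time.After(keepAllAfter) {
			continue // always keep recent backups
		}
		day := b.Time.UTC().Format("2006-01-02")
		if seenDay[day] {
			doomed = append(doomed, b.Name) // already kept one for this day
			continue
		}
		seenDay[day] = true
	}
	return doomed
}

func main() {
	now := time.Now()
	fmt.Println(backupsToDelete([]backup{
		{"2019-07-04T12:45:48Z-000001", now.Add(-72 * time.Hour)},
		{"2019-07-04T13:01:39Z-000001", now.Add(-71 * time.Hour)},
		{"2019-07-04T14:01:48Z-000001", now.Add(-1 * time.Hour)},
	}, now))
}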
What steps did you take and what happened:
Running etcd-manager via kops in AWS on kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17. We've observed the issue below in the following configurations (which is not intended as an exhaustive list of affected configurations, just the configurations we've tried):
Kops with etcd-manager enabled appears, by default, to start two instances of etcd-manager on each master, one for "main" and one for "events".
The master images have manage_etc_hosts set, which means at boot time a handful of lines are placed into /etc/hosts, i.e.:
# Your system has configured 'manage_etc_hosts' as True.
# As a result, if you wish for changes to this file to persist
# then you will need to either
# a.) make changes to the master file in /etc/cloud/templates/hosts.tmpl
# b.) change or remove the value of 'manage_etc_hosts' in
# /etc/cloud/cloud.cfg or cloud-config from user-data
#
127.0.1.1 your-ec2-fqdn your-ec2-shortname
127.0.0.1 localhost
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
Each instance of etcd-manager (main/events) writes records about the etcd-manager cluster into /etc/hosts, apparently every 10 seconds:
# Begin host entries managed by etcd-manager[etcd-events] - do not edit
your-master1-ip your-master1-name
your-master2-ip your-master2-name
your-master3-ip your-master3-name
# End host entries managed by etcd-manager[etcd-events]
# Begin host entries managed by etcd-manager[etcd] - do not edit
your-master1-ip your-master1-name
your-master2-ip your-master2-name
your-master3-ip your-master3-name
# End host entries managed by etcd-manager[etcd]
At some indeterminate time after boot (hours or days), we are seeing the manage_etc_hosts entries disappear from /etc/hosts, including the localhost entries, leaving only the etcd-manager entries. Per auditd logging, no other processes are writing to this file, so etcd-manager appears to be the cause of the disappearing entries.
What did you expect to happen:
Existing entries in /etc/hosts to remain undisturbed.
Anything else you would like to add:
A reboot of the node will (temporarily) restore the records, and the entries can of course be (temporarily) re-added by hand.
Versions used:
kops 1.11.1
k8s cluster: 1.11.9
infrastructure provider: aws
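For comparison, a simplified Go sketch of rewriting only the managed block shown above while leaving the rest of /etc/hosts intact; note that without inter-process locking, two managers (main and events) doing this concurrently can still clobber each other, which matches the symptom:

package main

import (
	"fmt"
	"io/ioutil"
	"os"
	"strings"
)

func updateManagedBlock(path, tag string, entries []string) error {
	begin := fmt.Sprintf("# Begin host entries managed by etcd-manager[%s] - do not edit", tag)
	end := fmt.Sprintf("# End host entries managed by etcd-manager[%s]", tag)

	data, err := ioutil.ReadFile(path)
	if err != nil {
		return err
	}

	// Copy every line except the old managed block.
	var out []string
	skipping := false
	for _, line := range strings.Split(string(data), "\n") {
		switch {
		case line == begin:
			skipping = true
		case line == end:
			skipping = false
		case !skipping:
			out = append(out, line)
		}
	}

	// Append the refreshed block.
	out = append(out, begin)
	out = append(out, entries...)
	out = append(out, end, "")

	// Write to a temp file, then rename: atomic on the same filesystem,
	// so readers never see a half-written /etc/hosts.
	tmp := path + ".tmp"
	if err := ioutil.WriteFile(tmp, []byte(strings.Join(out, "\n")), 0644); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

func main() {
	err := updateManagedBlock("/etc/hosts", "etcd",
		[]string{"10.0.0.1 etcd-a.internal.example.com"})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}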
Our team was upgrading the etcd cluster (from 2.2.1 to 3.1.12) using kops, with the following scenario:
- kops edit cluster - add etcd-manager and backups
- kops update cluster --out terraform, terraform apply
- kops rolling-update cluster --yes
- kops edit cluster - add etcd version 3.1.12
- kops update cluster --out terraform, terraform apply
- kops rolling-update cluster --yes
After some minutes I executed kubectl get nodes and got a big surprise - only one node is there, with status "NotReady"; all other cluster nodes are gone. A quick check showed that etcd-manager performed an upgrade of etcd2 to etcd3, but it lost the data and created a new, empty cluster.
As an unexpected side effect, it also affected kube-dns and flannel, which rendered k8s services (and therefore all ingresses and all services exposed via them) unavailable - so I consider this a major outage, as not only the masters were affected, but services running inside the k8s cluster were also unable to reach each other and were not reachable from the Internet.
etcd-manager logged a massive amount of data over the whole migration process; hopefully that's good enough to analyse the problem: https://gist.github.com/marek-obuchowicz/adda812f89644accc508b8d4db5db03c
"Luckily v1": "we have backups". At this moment we realised that there is no documentation provided how to restore those backups using etcd-manager. We considered going back to pure etcd (without etcd-manager
) first in order to restore the contents, but this idea was rejected.
"Luckily v2": etcd2 data was still available on the volumes, as etcd3 cluster was created with another name (another directory name was used for data). I was able to workaround the issue and bring up my etcd2 cluster with original data by:
state
file on one node and forcing it back to old directory name / version 2.2.1 + changing etcd-cluster-spec
back to version 2.2.1. It wasn't easy as the state file is a binary file (encoded with protobuf), so we had to write a little bit of go code to unmarshal the file first, change contents and then marshal it again: https://gist.github.com/marek-obuchowicz/c553effc19a97e40f01bc8e924b516eeetcd-cluster-spec
file on s3 - change version back to 2.2.1
state
file was adjustedBy doing that, I was able to get again etcd2 cluster with old data. Manager correctly recognised on the node that "cluster wanted" and "local state" versions are 2.2.1, so it automatically created etcd2
cluster, using existing data. This solution however is pretty hacky and took long time to discover.
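A minimal sketch of that unmarshal/edit/marshal step in Go, assuming the state file decodes as etcd-manager's EtcdState protobuf message (the type and field names here are taken on faith and should be checked against pkg/apis/etcd):

package main

import (
	"io/ioutil"
	"log"

	"github.com/golang/protobuf/proto"

	protoetcd "kope.io/etcd-manager/pkg/apis/etcd"
)

func main() {
	raw, err := ioutil.ReadFile("state") // the binary state file from the volume
	if err != nil {
		log.Fatal(err)
	}

	state := &protoetcd.EtcdState{}
	if err := proto.Unmarshal(raw, state); err != nil {
		log.Fatal(err)
	}

	state.EtcdVersion = "2.2.1" // force the version back (field name is an assumption)

	out, err := proto.Marshal(state)
	if err != nil {
		log.Fatal(err)
	}
	if err := ioutil.WriteFile("state", out, 0600); err != nil {
		log.Fatal(err)
	}
}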
Please let me know if there is any more information I could provide to help analysing the problem.
We executed the same operation, with the same steps, around two weeks earlier on a testing cluster - it was successful. There are two minor differences between the testing cluster (uses CNI networking and is hosted in the us-east-1 region) and the live cluster that crashed (uses flannel networking and is hosted in the eu-central-1 region). So I suspect the different behaviour might have been caused by recent etcd-manager updates.
I'm not sure if this is the correct place to report this issue or if I should open it in the kops project, but it looks to me like it's related to etcd-manager directly.
Currently etcd-manager uses aws-sdk-go v1.21.6, which doesn't support the me-south-1 region.
I have built etcd-manager with aws-sdk-go v1.21.7 and it works fine.
I recently got etcd-manager-ctl working (Ref issue 224, now closed)
I have followed the documentation for carrying out a restore, and it appears to have worked - however, looking at the cluster afterwards, what I'd expect to be restored isn't there.
Are there any logs which state whether the backups/restores are functioning?
Use case below
Create deployments/secrets in cluster
[centos@ee78cb168c41 tmp]$ kubectl apply -f nginx_with_pv.yaml
namespace/nginx-example created
persistentvolume/nginx-logs-volume created
persistentvolumeclaim/nginx-logs created
deployment.apps/nginx-deployment created
service/my-nginx created
[centos@ee78cb168c41 tmp]$ kubectl get deployments --all-namespaces
NAMESPACE NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
kube-system calico-kube-controllers 0 0 0 0 57m
kube-system dns-controller 1 1 1 1 57m
kube-system kube-dns 2 2 2 2 57m
kube-system kube-dns-autoscaler 1 1 1 1 57m
nginx-example nginx-deployment 1 1 1 1 17s
[centos@ee78cb168c41 tmp]$ kubectl create secret generic db-user-pass-bloop --from-file=./username.txt --from-file=./password.txt --namespace nginx-example
secret/db-user-pass-bloop created
Wait for etcd-manager backup
root@ip-10-0-25-247:/tmp/etcd-manager# ./etcd-manager-ctl -backup-store=s3://cms-controller-kops-sandbox-pr998/kops-cluster-sb.platformdxc-sb.internal/backups/etcd/main list-backups
Backup Store: s3://cms-controller-kops-sandbox-pr998/kops-cluster-sb.platformdxc-sb.internal/backups/etcd/main
I0704 14:02:17.605787 21750 vfs.go:94] listed backups in s3://cms-controller-kops-sandbox-pr998/kops-cluster-sb.platformdxc-sb.internal/backups/etcd/main: [2019-07-04T12:45:48Z-000001 2019-07-04T13:01:39Z-000001 2019-07-04T13:16:41Z-000002 2019-07-04T13:31:43Z-000003 2019-07-04T13:46:46Z-000004 2019-07-04T14:01:48Z-000001]
2019-07-04T12:45:48Z-000001
2019-07-04T13:01:39Z-000001
2019-07-04T13:16:41Z-000002
2019-07-04T13:31:43Z-000003
2019-07-04T13:46:46Z-000004
2019-07-04T14:01:48Z-000001
Create havoc - delete deployment and secret
root@ip-10-0-25-247:/tmp/etcd-manager# kubectl get deployments --all-namespaces
NAMESPACE NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
kube-system calico-kube-controllers 0 0 0 0 75m
kube-system dns-controller 1 1 1 1 75m
kube-system kube-dns 2 2 2 2 75m
kube-system kube-dns-autoscaler 1 1 1 1 75m
nginx-example nginx-deployment 1 1 1 1 18m
root@ip-10-0-25-247:/tmp/etcd-manager# kubectl delete deployment nginx-deployment -n nginx-example
deployment.extensions "nginx-deployment" deleted
root@ip-10-0-25-247:/tmp/etcd-manager# kubectl delete secret db-user-pass-bloop -n nginx-example
secret "db-user-pass-bloop" deleted
Restore the backup
root@ip-10-0-25-247:/tmp/etcd-manager# ./etcd-manager-ctl -backup-store=s3://cms-controller-kops-sandbox-pr998/kops-cluster-sb.platformdxc-sb.internal/backups/etcd/main restore-backup 2019-07-04T14:01:48Z-000001
Backup Store: s3://cms-controller-kops-sandbox-pr998/kops-cluster-sb.platformdxc-sb.internal/backups/etcd/main
I0704 14:04:04.484006 22622 vfs.go:60] Adding command at s3://cms-controller-kops-sandbox-pr998/kops-cluster-sb.platformdxc-sb.internal/backups/etcd/main/control/2019-07-04T14:04:04Z-000000/_command.json: timestamp:1562249044483908780 restore_backup:<cluster_spec:<member_count:3 etcd_version:"3.2.24" > backup:"2019-07-04T14:01:48Z-000001" >
added restore-backup command: timestamp:1562249044483908780 restore_backup:<cluster_spec:<member_count:3 etcd_version:"3.2.24" > backup:"2019-07-04T14:01:48Z-000001" >
Wait a while - check for the deleted items
[centos@ee78cb168c41 kops-cluster-sb]$ kubectl get deployment --all-namespaces
NAMESPACE NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
kube-system calico-kube-controllers 0 0 0 0 94m
kube-system dns-controller 1 1 1 1 94m
kube-system kube-dns 2 2 2 2 94m
kube-system kube-dns-autoscaler 1 1 1 1 94m
[centos@ee78cb168c41 kops-cluster-sb]$ kubectl get secrets -n nginx-example
NAME TYPE DATA AGE
default-token-8mj5q kubernetes.io/service-account-token 3 37m
Does anyone know if I am doing this incorrectly?
Or is my expectation of what etcd-manager backs up incorrect?
Thanks
We want this to be a neutral (de-facto) standard
As a user, I would like to be able to manually force an etcd backup before cluster maintenance. I would like etcd-manager-ctl to have a "create-backup" command.
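As a hedged sketch only: if create-backup followed the same pattern as restore-backup (etcd-manager-ctl writes a command file under control/ in the backup store, and the leader acts on it), the client side might look like the Go below; the create_backup payload and layout are assumptions, not an existing API:

package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"os"
	"path/filepath"
	"time"
)

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: create-backup <backup-store-dir>")
	}
	store := os.Args[1] // e.g. a locally mounted mirror of the backup store

	// Mirror the restore-backup layout: control/<timestamp>/_command.json.
	ts := time.Now().UTC().Format("2006-01-02T15:04:05Z") + "-000000"
	dir := filepath.Join(store, "control", ts)
	if err := os.MkdirAll(dir, 0755); err != nil {
		log.Fatal(err)
	}

	// A hypothetical create_backup command body; the real command proto
	// would need to be defined in etcd-manager for the leader to act on it.
	body := fmt.Sprintf(`{"timestamp":%d,"create_backup":{}}`, time.Now().UnixNano())
	if err := ioutil.WriteFile(filepath.Join(dir, "_command.json"), []byte(body), 0644); err != nil {
		log.Fatal(err)
	}
}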
We are running internal e2e tests all the time, and we sometimes see issues like this when creating a new cluster using kops and etcd-manager:
root@master-zone-1-3-1-clusterpr-3d22d3-k8s-local:/home/debian# docker logs 5dda81d9c876
etcd-manager
I0830 10:21:21.645797 6788 volumes.go:200] Found project="c2cd83b134244985b80038bf5c9e5e42"
I0830 10:21:21.645918 6788 volumes.go:209] Found instanceName="master-zone-1-3-1-clusterpr-3d22d3-k8s-local"
I0830 10:21:23.111471 6788 volumes.go:229] Found internalIP="10.1.32.9" and zone="zone-1"
I0830 10:21:23.111514 6788 main.go:254] Mounting available etcd volumes matching tags [KubernetesCluster=clusterpr-3d22d3.k8s.local k8s.io/etcd/main k8s.io/role/master=1]; nameTag=k8s.io/etcd/main
I0830 10:21:23.111542 6788 volumes.go:299] Listing Openstack disks in c2cd83b134244985b80038bf5c9e5e42/zone-1
I0830 10:21:23.605418 6788 mounter.go:288] Trying to mount master volume: "00e3f964-da00-4ea0-91f5-5a7a2a68de88"
I0830 10:21:26.217984 6788 mounter.go:302] Currently attached volumes: [0xc000246f80]
I0830 10:21:26.218061 6788 mounter.go:64] Master volume "00e3f964-da00-4ea0-91f5-5a7a2a68de88" is attached at "/dev/vdd"
I0830 10:21:26.218137 6788 mounter.go:78] Doing safe-format-and-mount of /dev/vdd to /mnt/master-3.etcd-main.clusterpr-3d22d3.k8s.local
I0830 10:21:26.218174 6788 mounter.go:113] Waiting for volume "00e3f964-da00-4ea0-91f5-5a7a2a68de88" to be mounted
I0830 10:21:27.218470 6788 mounter.go:113] Waiting for volume "00e3f964-da00-4ea0-91f5-5a7a2a68de88" to be mounted
I0830 10:21:28.218911 6788 mounter.go:113] Waiting for volume "00e3f964-da00-4ea0-91f5-5a7a2a68de88" to be mounted
I0830 10:21:29.219109 6788 mounter.go:113] Waiting for volume "00e3f964-da00-4ea0-91f5-5a7a2a68de88" to be mounted
I0830 10:21:30.219328 6788 mounter.go:113] Waiting for volume "00e3f964-da00-4ea0-91f5-5a7a2a68de88" to be mounted
I0830 10:21:31.219656 6788 mounter.go:113] Waiting for volume "00e3f964-da00-4ea0-91f5-5a7a2a68de88" to be mounted
I0830 10:21:32.219854 6788 mounter.go:113] Waiting for volume "00e3f964-da00-4ea0-91f5-5a7a2a68de88" to be mounted
I0830 10:21:33.220191 6788 mounter.go:116] Found volume "00e3f964-da00-4ea0-91f5-5a7a2a68de88" mounted at device "/dev/vdd"
I0830 10:21:33.221050 6788 mounter.go:161] Creating mount directory "/rootfs/mnt/master-3.etcd-main.clusterpr-3d22d3.k8s.local"
I0830 10:21:33.221180 6788 mounter.go:166] Mounting device "/dev/vdd" on "/mnt/master-3.etcd-main.clusterpr-3d22d3.k8s.local"
I0830 10:21:33.221221 6788 mount_linux.go:440] Checking for issues with fsck on disk: /dev/vdd
I0830 10:21:33.221227 6788 nsenter_exec.go:50] Running command : nsenter [--mount=/rootfs/proc/1/ns/mnt -- fsck -a /dev/vdd]
W0830 10:21:33.257556 6788 mounter.go:82] unable to mount master volume: "error formatting and mounting disk \"/dev/vdd\" on \"/mnt/master-3.etcd-main.clusterpr-3d22d3.k8s.local\": 'fsck' found errors on device /dev/vdd but could not correct them: fsck from util-linux 2.33.1\n/dev/vdd: Superblock has an invalid journal (inode 8).\nCLEARED.\n*** journal has been deleted ***\n\n/dev/vdd: Resize inode not valid. \n\n/dev/vdd: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.\n\t(i.e., without -a or -p options)\n."
I0830 10:21:33.257581 6788 boot.go:49] waiting for volumes
I0830 10:22:33.257754 6788 volumes.go:299] Listing Openstack disks in c2cd83b134244985b80038bf5c9e5e42/zone-1
I0830 10:22:33.721984 6788 mounter.go:64] Master volume "00e3f964-da00-4ea0-91f5-5a7a2a68de88" is attached at "/dev/vdd"
I0830 10:22:33.722177 6788 mounter.go:78] Doing safe-format-and-mount of /dev/vdd to /mnt/master-3.etcd-main.clusterpr-3d22d3.k8s.local
I0830 10:22:33.722270 6788 mounter.go:116] Found volume "00e3f964-da00-4ea0-91f5-5a7a2a68de88" mounted at device "/dev/vdd"
I0830 10:22:33.722843 6788 mounter.go:161] Creating mount directory "/rootfs/mnt/master-3.etcd-main.clusterpr-3d22d3.k8s.local"
I0830 10:22:33.722985 6788 mounter.go:166] Mounting device "/dev/vdd" on "/mnt/master-3.etcd-main.clusterpr-3d22d3.k8s.local"
I0830 10:22:33.723069 6788 mount_linux.go:440] Checking for issues with fsck on disk: /dev/vdd
I0830 10:22:33.723154 6788 nsenter_exec.go:50] Running command : nsenter [--mount=/rootfs/proc/1/ns/mnt -- fsck -a /dev/vdd]
W0830 10:22:33.753342 6788 mounter.go:82] unable to mount master volume: "error formatting and mounting disk \"/dev/vdd\" on \"/mnt/master-3.etcd-main.clusterpr-3d22d3.k8s.local\": 'fsck' found errors on device /dev/vdd but could not correct them: fsck from util-linux 2.33.1\n/dev/vdd contains a file system with errors, check forced.\n/dev/vdd: Resize inode not valid. \n\n/dev/vdd: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.\n\t(i.e., without -a or -p options)\n."
I0830 10:22:33.753361 6788 boot.go:49] waiting for volumes
What is causing this issue? In this case we got 2 of 3 masters up and running. The events etcd-manager is running fine, but one of the main etcd volumes failed, which led to one master failure.
kopeio/etcd-manager:1.0.20180729 fails to mount etcd volumes with "Failed to create bus connection: No data available".
kops Version 1.10.0 (git-8b52ea6d1)
I0822 11:02:52.053982 1 mounter.go:150] Creating mount directory "/rootfs/mnt/master-vol-04c550c9347a13de8"
I0822 11:02:52.053995 1 mounter.go:155] Mounting device "/dev/xvdu" on "/mnt/master-vol-04c550c9347a13de8"
I0822 11:02:52.054005 1 mount_linux.go:472] Checking for issues with fsck on disk: /dev/xvdu
I0822 11:02:52.054011 1 nsenter_exec.go:50] Running command : nsenter [--mount=/rootfs/proc/1/ns/mnt -- fsck -a /dev/xvdu]
I0822 11:02:52.083063 1 mount_linux.go:491] Attempting to mount disk: /dev/xvdu /mnt/master-vol-04c550c9347a13de8
I0822 11:02:52.083097 1 nsenter_mount.go:81] nsenter mount /dev/xvdu /mnt/master-vol-04c550c9347a13de8 [defaults]
I0822 11:02:52.083117 1 nsenter.go:106] Running nsenter command: nsenter [--mount=/rootfs/proc/1/ns/mnt -- /bin/systemd-run --description=Kubernetes transient mount for /mnt/master-vol-04c550c9347a13de8 --scope -- /bin/mount -o defaults /dev/xvdu /mnt/master-vol-04c550c9347a13de8]
I0822 11:02:52.092447 1 nsenter_mount.go:85] Output of mounting /dev/xvdu to /mnt/master-vol-04c550c9347a13de8: Failed to create bus connection: No data available
I0822 11:02:52.092465 1 mount_linux.go:542] Attempting to determine if disk "/dev/xvdu" is formatted using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/xvdu])
I0822 11:02:52.092478 1 nsenter_exec.go:50] Running command : nsenter [--mount=/rootfs/proc/1/ns/mnt -- blkid -p -s TYPE -s PTTYPE -o export /dev/xvdu]
I0822 11:02:52.106696 1 mount_linux.go:545] Output: "DEVNAME=/dev/xvdu\nTYPE=ext4\n", err: <nil>
W0822 11:02:52.106733 1 mounter.go:79] unable to mount master volume: "error formatting and mounting disk \"/dev/xvdu\" on \"/mnt/master-vol-04c550c9347a13de8\": exit status 1"
cluster.yml
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2017-06-20T09:30:00Z
  name: xxx
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    alwaysAllow: {}
  channel: alpha
  cloudProvider: aws
  configBase: s3://xxx
  dnsZone: xxx
  docker:
    bridgeIP: 192.168.5.1/24
    storage: overlay2
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-eu-west-1a
      name: a
      encryptedVolume: true
    - instanceGroup: master-eu-west-1b
      name: b
      encryptedVolume: true
    - instanceGroup: master-eu-west-1c
      name: c
      encryptedVolume: true
    name: main
    manager:
      image: kopeio/etcd-manager:1.0.20180729
  - etcdMembers:
    - instanceGroup: master-eu-west-1a
      name: a
      encryptedVolume: true
    - instanceGroup: master-eu-west-1b
      name: b
      encryptedVolume: true
    - instanceGroup: master-eu-west-1c
      name: c
      encryptedVolume: true
    name: events
    manager:
      image: kopeio/etcd-manager:1.0.20180729
  kubeAPIServer:
    runtimeConfig:
      batch/v2alpha1: "true"
    logLevel: 1
  kubelet:
    logLevel: 1
    podInfraContainerImage: gcr.io/google_containers/pause-amd64:3.1
  kubeProxy:
    logLevel: 1
  kubeControllerManager:
    logLevel: 1
  kubeScheduler:
    logLevel: 1
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.10.7
  masterPublicName: api.xxx
  networkCIDR: xxx
  networking:
    kubenet: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  hooks:
  - name: disable-locksmithd.service
    before:
    - locksmithd.service
    manifest: |
      Type=oneshot
      ExecStart=/usr/bin/systemctl mask locksmithd.service
      ExecStart=-/usr/bin/systemctl stop locksmithd.service
  sshAccess:
  - xxx
  subnets:
  - cidr: xxx
    name: eu-west-1a
    type: Public
    zone: eu-west-1a
  - cidr: xxx
    name: eu-west-1b
    type: Public
    zone: eu-west-1b
  - cidr: xxx
    name: eu-west-1c
    type: Public
    zone: eu-west-1c
  topology:
    dns:
      type: Public
    masters: public
    nodes: public
one of the three masters:
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2017-06-22T07:08:46Z
  labels:
    kops.k8s.io/cluster: xxx
  name: master-eu-west-1a
spec:
  detailedInstanceMonitoring: true
  image: coreos.com/CoreOS-stable-*-hvm
  machineType: t2.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    beta.kubernetes.io/fluentd-ds-ready: "true"
  role: Master
  subnets:
  - eu-west-1a
We should be able to create keys and distribute them securely.
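As a standard-library illustration of the kind of key material involved (not etcd-manager's code), generating a self-signed CA keypair in Go, which a secure distribution mechanism would then need to ship to each member:

package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"log"
	"math/big"
	"os"
	"time"
)

func main() {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		log.Fatal(err)
	}

	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "etcd-manager-ca"}, // name is an assumption
		NotBefore:             time.Now(),
		NotAfter:              time.Now().AddDate(10, 0, 0),
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageDigitalSignature,
		BasicConstraintsValid: true,
		IsCA:                  true,
	}

	// Self-signed: the template is both subject and issuer.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		log.Fatal(err)
	}

	pem.Encode(os.Stdout, &pem.Block{Type: "CERTIFICATE", Bytes: der})
	pem.Encode(os.Stdout, &pem.Block{Type: "RSA PRIVATE KEY", Bytes: x509.MarshalPKCS1PrivateKey(key)})
}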
We are using the kopeio/etcd-manager:3.0.20190801 version in our k8s cluster for events and main, and they corrupted the /etc/hosts file after some hours.
For the consistent master it looks like this:
# Your system has configured 'manage_etc_hosts' as True.
# As a result, if you wish for changes to this file to persist
# then you will need to either
# a.) make changes to the master file in /etc/cloud/templates/hosts.tmpl
# b.) change or remove the value of 'manage_etc_hosts' in
# /etc/cloud/cloud.cfg or cloud-config from user-data
#
127.0.1.1 ip-1-2-3-4.ourdomain.pri ip-1-2-3-4
127.0.0.1 localhost
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
# Begin host entries managed by etcd-manager[etcd-events] - do not edit
1.2.3.4 etcd-events-a.internal.example.com
1.2.3.5 etcd-events-b.internal.example.com
1.2.3.6 etcd-events-c.internal.example.com
# End host entries managed by etcd-manager[etcd-events]
# Begin host entries managed by etcd-manager[etcd] - do not edit
1.2.3.4 etcd-a.internal.example.com
1.2.3.5 etcd-b.internal.example.com
1.2.3.6 etcd-c.internal.example.com
# End host entries managed by etcd-manager[etcd]
while on one of the other masters, where it is damaged:
r-data
#
127.0.1.1 ip-1-2-3-6.ourdomain.pri ip-1-2-3-6
127.0.0.1 localhost
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
# Begin host entries managed by etcd-manager[etcd] - do not edit
1.2.3.4 etcd-a.internal.example.com
1.2.3.5 etcd-b.internal.example.com
1.2.3.6 etcd-c.internal.example.com
# End host entries managed by etcd-manager[etcd]
# Begin host entries managed by etcd-manager[etcd-events] - do not edit
1.2.3.4 etcd-events-a.internal.example.com
1.2.3.5 etcd-events-b.internal.example.com
1.2.3.6 etcd-events-c.internal.example.com
# End host entries managed by etcd-manager[etcd-events]
As you can see, after some concurrent writes the events and the main etcd-manager damaged the beginning of the file (partially removing part of the cloud.cfg comment). After some time they remove the host entries as well, and we end up with a file that doesn't contain any entries for localhost or for the hostname ip-x-x-x-x, which causes all the calico nodes in the cluster to become unready.
Attaching the two hosts files, and part of the kibana logs we see:
We sometimes see situations like this when using OpenStack:
I1027 06:32:33.238927 5130 volumes.go:300] Listing Openstack disks in 44a6f8538efe47cd9b55182e0a94e478/zone-1
I1027 06:32:33.660679 5130 mounter.go:288] Trying to mount master volume: "1bc7494f-be09-443b-8713-c478f8f2c5ed"
W1027 06:32:33.952050 5130 mounter.go:293] Error attaching volume "1bc7494f-be09-443b-8713-c478f8f2c5ed": error attaching volume 1bc7494f-be09-443b-8713-c478f8f2c5ed to server 061822c7-0fcd-4e49-96f3-ee0a204a448c: Bad request with: [POST https://foobar.com/v2.1/servers/061822c7-0fcd-4e49-96f3-ee0a204a448c/os-volume_attachments], error message: {"badRequest": {"message": "Invalid volume: volume 1bc7494f-be09-443b-8713-c478f8f2c5ed already attached", "code": 400}}
I1027 06:32:33.952206 5130 mounter.go:302] Currently attached volumes: []
I1027 06:32:33.952256 5130 boot.go:49] waiting for volumes
% openstack volume list --project kaas-clusterpr-6aef63-k8s-local
+--------------------------------------+------------------------------------------+-----------+------+---------------------------------------------------------------+
| ID | Name | Status | Size | Attached to |
+--------------------------------------+------------------------------------------+-----------+------+---------------------------------------------------------------+
| 1dc01e68-a4db-4a67-b00f-da9e26fbd7af | 1.etcd-events.clusterpr-6aef63.k8s.local | in-use | 8 | Attached to 18e52d29-2771-404c-bf6a-37a94631e506 on /dev/vdc |
| ec8f7fa7-2594-4619-88f5-83b5e93b2886 | 1.etcd-main.clusterpr-6aef63.k8s.local | in-use | 8 | Attached to 18e52d29-2771-404c-bf6a-37a94631e506 on /dev/vdd |
| 854de0e0-39ba-43f3-982b-b1affc774e55 | 3.etcd-events.clusterpr-6aef63.k8s.local | in-use | 8 | Attached to 061822c7-0fcd-4e49-96f3-ee0a204a448c on /dev/vdd |
| 7e549e7b-7105-46c7-b976-2f3bb4bf6c8f | 2.etcd-events.clusterpr-6aef63.k8s.local | in-use | 8 | Attached to 9d55ba26-ee07-4422-afa1-b37ffec92d73 on /dev/vdd |
| 52d58cb5-71da-415f-8695-e9bea97380a6 | 3.etcd-main.clusterpr-6aef63.k8s.local | in-use | 8 | Attached to 9d55ba26-ee07-4422-afa1-b37ffec92d73 on /dev/vdc |
| 1bc7494f-be09-443b-8713-c478f8f2c5ed | 2.etcd-main.clusterpr-6aef63.k8s.local | available | 8 | |
+--------------------------------------+------------------------------------------+-----------+------+---------------------------------------------------------------+
So for some reason the manager decides to take the incorrect volume. Maybe better tags for volumes are needed? I am running this in a single zone, so volumes can be mounted to any master.
edit: Hmm, now that I check the error message and the volume list, the ids actually match the non-mounted volume - but the volume is somehow not attached, even though the error says it's already attached?
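One way to narrow that window would be to re-check the volume's state immediately before attaching; a gophercloud-based Go sketch, with authenticated clients and IDs passed in for brevity:

package main

import (
	"fmt"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/blockstorage/v2/volumes"
	"github.com/gophercloud/gophercloud/openstack/compute/v2/extensions/volumeattach"
)

func attachIfAvailable(block, compute *gophercloud.ServiceClient, volumeID, serverID string) error {
	vol, err := volumes.Get(block, volumeID).Extract()
	if err != nil {
		return err
	}
	// Cinder and Nova can briefly disagree about attachment state;
	// treating anything but "available" as busy avoids the 400 above.
	if vol.Status != "available" {
		return fmt.Errorf("volume %s is %q, not attaching", volumeID, vol.Status)
	}
	_, err = volumeattach.Create(compute, serverID, volumeattach.CreateOpts{
		VolumeID: volumeID,
	}).Extract()
	return err
}

func main() {
	// Wiring up authenticated ServiceClients (openstack.AuthenticatedClient,
	// openstack.NewComputeV2, openstack.NewBlockStorageV2) is omitted here.
}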
We are currently seeing our cluster get into a state where its cluster state knows about all three members, but marks one as unhealthy because it's not responding to etcd checks. However, the reason it's not responding is that the gRPC command to join the cluster hasn't been initiated, because the cluster already knows the member exists.
Of note is that this host runs two instances of etcd-manager, one for events and one for main Kubernetes objects. Only one of the instances is "broken".
Log excerpt from etcd-manager leader:
I0103 18:44:51.268012 18771 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0103 18:44:51.818575 18771 volumes.go:85] AWS API Request: ec2/DescribeInstances
I0103 18:44:51.879649 18771 hosts.go:84] hosts update: primary=map[172.28.192.102:[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local] 172.28.194.230:[etcd-events-etcd-us-west-2b.internal.redacted.k8s.local]], fallbacks=map[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:[172.28.192.102 172.28.192.127 172.28.192.102] etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:[172.28.194.230 172.28.194.230 172.28.194.57] etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:[172.28.196.130 172.28.196.130]], final=map[172.28.192.102:[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local] 172.28.194.230:[etcd-events-etcd-us-west-2b.internal.redacted.k8s.local] 172.28.196.130:[etcd-events-etcd-us-west-2c.internal.redacted.k8s.local etcd-events-etcd-us-west-2c.internal.redacted.k8s.local]]
I0103 18:44:51.879750 18771 hosts.go:181] skipping update of unchanged /etc/hosts
2020-01-03 18:44:55.679341 W | rafthttp: health check for peer 240483fbaa2c3037 could not connect: dial tcp 172.28.196.130:2381: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2020-01-03 18:44:55.679371 W | rafthttp: health check for peer 240483fbaa2c3037 could not connect: dial tcp 172.28.196.130:2381: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
I0103 18:44:58.933053 18771 controller.go:173] starting controller iteration
I0103 18:44:58.933090 18771 controller.go:269] I am leader with token "[REDACTED]"
2020-01-03 18:45:00.679490 W | rafthttp: health check for peer 240483fbaa2c3037 could not connect: dial tcp 172.28.196.130:2381: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
2020-01-03 18:45:00.679521 W | rafthttp: health check for peer 240483fbaa2c3037 could not connect: dial tcp 172.28.196.130:2381: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
W0103 18:45:03.957167 18771 controller.go:703] health-check unable to reach member 2595344402187300919: error building etcd client for https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:4002: dial tcp 172.28.196.130:4002: connect: connection refused
I0103 18:45:03.957196 18771 controller.go:276] etcd cluster state: etcdClusterState
members:
{"name":"etcd-events-etcd-us-west-2c","peerURLs":["https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:2381"],"endpoints":["https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:4002"],"ID":"2595344402187300919"}
NOT HEALTHY
{"name":"etcd-events-etcd-us-west-2b","peerURLs":["https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:2381"],"endpoints":["https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:4002"],"ID":"14454711989398209995"}
{"name":"etcd-events-etcd-us-west-2a","peerURLs":["https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:2381"],"endpoints":["https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:4002"],"ID":"16707933308235350511"}
peers:
etcdClusterPeerInfo{peer=peer{id:"etcd-events-etcd-us-west-2a" endpoints:"172.28.192.102:3997" }, info=cluster_name:"etcd-events" node_configuration:<name:"etcd-events-etcd-us-west-2a" peer_urls:"https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:2381" client_urls:"https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:4002" quarantined_client_urls:"https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:3995" > etcd_state:<cluster:<cluster_token:"Ty6K7M5AzR1HeBeARXgqAA" nodes:<name:"etcd-events-etcd-us-west-2c" peer_urls:"https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:2381" client_urls:"https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:4002" quarantined_client_urls:"https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:3995" tls_enabled:true > nodes:<name:"etcd-events-etcd-us-west-2a" peer_urls:"https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:2381" client_urls:"https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:4002" quarantined_client_urls:"https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:3995" tls_enabled:true > nodes:<name:"etcd-events-etcd-us-west-2b" peer_urls:"https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:2381" client_urls:"https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:4002" quarantined_client_urls:"https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:3995" tls_enabled:true > > etcd_version:"3.3.13" > }
etcdClusterPeerInfo{peer=peer{id:"etcd-events-etcd-us-west-2b" endpoints:"172.28.194.230:3997" }, info=cluster_name:"etcd-events" node_configuration:<name:"etcd-events-etcd-us-west-2b" peer_urls:"https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:2381" client_urls:"https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:4002" quarantined_client_urls:"https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:3995" > etcd_state:<cluster:<cluster_token:"Ty6K7M5AzR1HeBeARXgqAA" nodes:<name:"etcd-events-etcd-us-west-2c" peer_urls:"https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:2381" client_urls:"https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:4002" quarantined_client_urls:"https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:3995" tls_enabled:true > nodes:<name:"etcd-events-etcd-us-west-2a" peer_urls:"https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:2381" client_urls:"https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:4002" quarantined_client_urls:"https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:3995" tls_enabled:true > nodes:<name:"etcd-events-etcd-us-west-2b" peer_urls:"https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:2381" client_urls:"https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:4002" quarantined_client_urls:"https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:3995" tls_enabled:true > > etcd_version:"3.3.13" > }
etcdClusterPeerInfo{peer=peer{id:"etcd-events-etcd-us-west-2c" endpoints:"172.28.196.130:3997" }, info=cluster_name:"etcd-events" node_configuration:<name:"etcd-events-etcd-us-west-2c" peer_urls:"https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:2381" client_urls:"https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:4002" quarantined_client_urls:"https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:3995" > }
I0103 18:45:03.957341 18771 controller.go:277] etcd cluster members: map[14454711989398209995:{"name":"etcd-events-etcd-us-west-2b","peerURLs":["https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:2381"],"endpoints":["https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:4002"],"ID":"14454711989398209995"} 16707933308235350511:{"name":"etcd-events-etcd-us-west-2a","peerURLs":["https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:2381"],"endpoints":["https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:4002"],"ID":"16707933308235350511"} 2595344402187300919:{"name":"etcd-events-etcd-us-west-2c","peerURLs":["https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:2381"],"endpoints":["https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:4002"],"ID":"2595344402187300919"}]
I0103 18:45:03.957362 18771 controller.go:615] sending member map to all peers: members:<name:"etcd-events-etcd-us-west-2a" dns:"etcd-events-etcd-us-west-2a.internal.redacted.k8s.local" addresses:"172.28.192.102" > members:<name:"etcd-events-etcd-us-west-2b" dns:"etcd-events-etcd-us-west-2b.internal.redacted.k8s.local" addresses:"172.28.194.230" >
I0103 18:45:03.957569 18771 etcdserver.go:226] updating hosts: map[172.28.192.102:[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local] 172.28.194.230:[etcd-events-etcd-us-west-2b.internal.redacted.k8s.local]]
I0103 18:45:03.957808 18771 hosts.go:84] hosts update: primary=map[172.28.192.102:[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local] 172.28.194.230:[etcd-events-etcd-us-west-2b.internal.redacted.k8s.local]], fallbacks=map[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:[172.28.192.102 172.28.192.127 172.28.192.102] etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:[172.28.194.230 172.28.194.230 172.28.194.57] etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:[172.28.196.130 172.28.196.130]], final=map[172.28.192.102:[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local] 172.28.194.230:[etcd-events-etcd-us-west-2b.internal.redacted.k8s.local] 172.28.196.130:[etcd-events-etcd-us-west-2c.internal.redacted.k8s.local etcd-events-etcd-us-west-2c.internal.redacted.k8s.local]]
I0103 18:45:04.011465 18771 commands.go:22] not refreshing commands - TTL not hit
I0103 18:45:04.011495 18771 s3fs.go:220] Reading file "s3://zendesk-compute-kops-state-staging/redacted.k8s.local/backups/etcd/events/control/etcd-cluster-created"
I0103 18:45:04.042214 18771 controller.go:369] spec member_count:3 etcd_version:"3.3.13"
I0103 18:45:04.042271 18771 controller.go:494] etcd has unhealthy members, but we already have a slot where we could add another member
I0103 18:45:04.042294 18771 controller.go:531] controller loop complete
2020-01-03 18:45:05.679628 W | rafthttp: health check for peer 240483fbaa2c3037 could not connect: dial tcp 172.28.196.130:2381: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
2020-01-03 18:45:05.679658 W | rafthttp: health check for peer 240483fbaa2c3037 could not connect: dial tcp 172.28.196.130:2381: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2020-01-03 18:45:10.679762 W | rafthttp: health check for peer 240483fbaa2c3037 could not connect: dial tcp 172.28.196.130:2381: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
2020-01-03 18:45:10.679790 W | rafthttp: health check for peer 240483fbaa2c3037 could not connect: dial tcp 172.28.196.130:2381: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
This project is linked to from the Kops roadmap. What's the status of it? Is it production ready or still a work-in-progress?
Hi Justin,
I have been studying this project. I found that you have implemented a gossip and leader election algorithm here. Have you considered using raft itself to do so, instead of reinventing this?
Thanks.
I am trying to build the etcd-manager docker image using make push:
root@bazeltest:/home/debian/etcd-manager# bazel version
Build label: 0.28.1
Build target: bazel-out/k8-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Fri Jul 19 15:19:51 2019 (1563549591)
Build timestamp: 1563549591
Build timestamp as int: 1563549591
root@bazeltest:/home/debian/etcd-manager# make push
bazel run --features=pure --platforms=@io_bazel_rules_go//go/toolchain:linux_amd64 //images:push-etcd-manager
ERROR: /root/.cache/bazel/_bazel_root/96468b218fe40a0551bedd20e6b7fe69/external/subpar/subpar.bzl:111:17: Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/96468b218fe40a0551bedd20e6b7fe69/external/subpar/subpar.bzl", line 108
rule(attrs = {"src": attr.label(manda...")}, <2 more arguments>)
File "/root/.cache/bazel/_bazel_root/96468b218fe40a0551bedd20e6b7fe69/external/subpar/subpar.bzl", line 111, in rule
attr.label(mandatory = True, allow_files = Tr..., ...)
'single_file' is no longer supported. use allow_single_file instead. You can use --incompatible_disable_deprecated_attr_params=false to temporarily disable this check.
ERROR: /home/debian/etcd-manager/images/BUILD:102:1: every rule of type container_push implicitly depends upon the target '@containerregistry//:pusher', but this target could not be found because of: error loading package '@containerregistry//': Extension file 'subpar.bzl' has errors
ERROR: /home/debian/etcd-manager/images/BUILD:102:1: every rule of type container_push implicitly depends upon the target '@containerregistry//:digester', but this target could not be found because of: error loading package '@containerregistry//': Extension file 'subpar.bzl' has errors
ERROR: Analysis of target '//images:push-etcd-manager' failed; build aborted: error loading package '@containerregistry//': Extension file 'subpar.bzl' has errors
INFO: Elapsed time: 0.780s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (16 packages loaded, 143 targets configured)
FAILED: Build did NOT complete successfully (16 packages loaded, 143 targets configured)
currently loading: @containerregistry//
Makefile:23: recipe for target 'push-etcd-manager' failed
make: *** [push-etcd-manager] Error 1
Let's add --incompatible_disable_deprecated_attr_params=false to the parameters:
root@bazeltest:/home/debian/etcd-manager# make push
bazel run --features=pure --platforms=@io_bazel_rules_go//go/toolchain:linux_amd64 --incompatible_disable_deprecated_attr_params=false //images:push-etcd-manager
ERROR: /home/debian/etcd-manager/images/BUILD:29:1: in container_layer_ rule //images:etcd-3-1-12-layer:
Traceback (most recent call last):
File "/home/debian/etcd-manager/images/BUILD", line 29
container_layer_(name = 'etcd-3-1-12-layer')
File "/root/.cache/bazel/_bazel_root/96468b218fe40a0551bedd20e6b7fe69/external/io_bazel_rules_docker/container/layer.bzl", line 184, in _impl
zip_layer(ctx, unzipped_layer)
File "/root/.cache/bazel/_bazel_root/96468b218fe40a0551bedd20e6b7fe69/external/io_bazel_rules_docker/container/layer.bzl", line 121, in zip_layer
_gzip(ctx, layer)
File "/root/.cache/bazel/_bazel_root/96468b218fe40a0551bedd20e6b7fe69/external/io_bazel_rules_docker/skylib/zip.bzl", line 19, in _gzip
ctx.actions.run_shell(command = ("%s -n < %s > %s" % (...)), <4 more arguments>)
Found tool(s) 'bazel-out/host/bin/external/gzip/gzip' in inputs. A tool is an input with executable=True set. All tools should be passed using the 'tools' argument instead of 'inputs' in order to make their runfiles available to the action. This safety check will not be performed once the action is modified to take a 'tools' argument. To temporarily disable this check, set --incompatible_no_support_tools_in_action_inputs=false.
ERROR: Analysis of target '//images:push-etcd-manager' failed; build aborted: Analysis of target '//images:etcd-3-1-12-layer' failed; build aborted
INFO: Elapsed time: 1.299s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (159 packages loaded, 5762 targets configured)
FAILED: Build did NOT complete successfully (159 packages loaded, 5762 targets configured)
Makefile:23: recipe for target 'push-etcd-manager' failed
make: *** [push-etcd-manager] Error 1
It still fails; let's add --incompatible_no_support_tools_in_action_inputs=false to the parameters.
root@bazeltest:/home/debian/etcd-manager# make push
bazel run --features=pure --platforms=@io_bazel_rules_go//go/toolchain:linux_amd64 --incompatible_disable_deprecated_attr_params=false --incompatible_no_support_tools_in_action_inputs=false //images:push-etcd-manager
INFO: Analyzed target //images:push-etcd-manager (322 packages loaded, 8428 targets configured).
INFO: Found 1 target...
ERROR: /home/debian/etcd-manager/images/BUILD:102:1: ContainerPushDigest images/push-etcd-manager.digest failed (Exit 1) digester failed: error executing command bazel-out/host/bin/external/containerregistry/digester --config bazel-out/k8-fastbuild/bin/images/etcd-manager.0.config --manifest bazel-out/k8-fastbuild/bin/images/etcd-manager.0.manifest --digest ... (remaining 61 argument(s) skipped)
Use --sandbox_debug to see verbose messages from the sandbox
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/96468b218fe40a0551bedd20e6b7fe69/sandbox/linux-sandbox/199/execroot/__main__/bazel-out/host/bin/external/containerregistry/digester.runfiles/containerregistry/tools/image_digester_.py", line 28, in <module>
from containerregistry.client.v2_2 import docker_image as v2_2_image
File "/root/.cache/bazel/_bazel_root/96468b218fe40a0551bedd20e6b7fe69/sandbox/linux-sandbox/199/execroot/__main__/bazel-out/host/bin/external/containerregistry/digester.runfiles/containerregistry/client/__init__.py", line 23, in <module>
from containerregistry.client import docker_creds_
File "/root/.cache/bazel/_bazel_root/96468b218fe40a0551bedd20e6b7fe69/sandbox/linux-sandbox/199/execroot/__main__/bazel-out/host/bin/external/containerregistry/digester.runfiles/containerregistry/client/docker_creds_.py", line 31, in <module>
import httplib2
File "/root/.cache/bazel/_bazel_root/96468b218fe40a0551bedd20e6b7fe69/sandbox/linux-sandbox/199/execroot/__main__/bazel-out/host/bin/external/containerregistry/digester.runfiles/httplib2/__init__.py", line 988
raise socket.error, msg
^
SyntaxError: invalid syntax
----------------
Note: The failure of target @containerregistry//:digester (with exit code 1) may have been caused by the fact that it is running under Python 3 instead of Python 2. Examine the error to determine if that appears to be the problem. Since this target is built in the host configuration, the only way to change its version is to set --host_force_python=PY2, which affects the entire build.
If this error started occurring in Bazel 0.27 and later, it may be because the Python toolchain now enforces that targets analyzed as PY2 and PY3 run under a Python 2 and Python 3 interpreter, respectively. See https://github.com/bazelbuild/bazel/issues/7899 for more information.
----------------
Target //images:push-etcd-manager failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 2.594s, Critical Path: 0.35s
INFO: 0 processes.
FAILED: Build did NOT complete successfully
FAILED: Build did NOT complete successfully
Makefile:23: recipe for target 'push-etcd-manager' failed
make: *** [push-etcd-manager] Error 1
I am out of ideas on how to build the etcd-manager docker image.
Hey @justinsb,
We briefly touched on that topic at KubeCon in Barcelona.
First of all, I much appreciate all your (authors' and collaborators') hard work on these exceptional projects.
My current k8s@aws v1.12 was built by kops 1.11 using the normal official etcd image, v3.2.26.
Now kops 1.12 is released and has all the etcd versions hardcoded.
Curious if there is a way we can have our own official images and versions chosen during kops/etcd-manager setup?
IMHO such hard-coding adds excess maintenance.
The whole community will depend on contributors' will and free time.
It's Go code, not end-user-friendly YAML.
Also this commit:
justinsb committed 12 days ago: Support etcd 3.3.10 (May 16, 2019)
Any specific reason why not 3.3.13, or anything else after 3.3.10?
Related issue from kops repo kubernetes/kops#6756
I'm getting a timeout trying to go get the project:
$ go get kope.io/etcd-manager
package kope.io/etcd-manager: unrecognized import path "kope.io/etcd-manager" (https fetch: Get https://kope.io/etcd-manager?go-get=1: dial tcp 104.197.25.62:443: i/o timeout)
It looks like the vanity URL is unable to respond.
Hopefully this beautiful piece of software will be updated to the latest etcd to enable the latest security fixes:
https://groups.google.com/forum/#!msg/golang-announce/65QixT3tcmg/DrFiG6vvCwAJ
https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.3.md#v3315-2019-08-19
:)
We've been using kops for a few years, and prior to the introduction of etcd-manager we relied on our own EBS backup strategy. This led to a number of etcd volumes being present in our AWS account that matched the tags used by etcd-manager to select and mount storage.
The first host that came up after the rolling-update that installed etcd-manager had the following in its logs:
I1209 16:26:18.958984 18726 volumes.go:85] AWS API Request: ec2metadata/GetMetadata
I1209 16:26:18.959912 18726 volumes.go:85] AWS API Request: ec2metadata/GetMetadata
I1209 16:26:18.960468 18726 volumes.go:85] AWS API Request: ec2metadata/GetMetadata
I1209 16:26:18.961016 18726 volumes.go:85] AWS API Request: ec2metadata/GetMetadata
I1209 16:26:18.961520 18726 main.go:254] Mounting available etcd volumes matching tags [k8s.io/etcd/main k8s.io/role/master=1 kubernetes.io/cluster/kube.us-east-1.dev.deploys.brightcove.com=owned]; nameTag=k8s.io/etcd/main
I1209 16:26:18.962655 18726 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I1209 16:26:19.152540 18726 mounter.go:302] Currently attached volumes: [0xc00025af00]
I1209 16:26:19.152574 18726 mounter.go:64] Master volume "vol-0a5a75bec90179bd8" is attached at "/dev/xvdu"
I1209 16:26:19.152590 18726 mounter.go:78] Doing safe-format-and-mount of /dev/xvdu to /mnt/master-vol-0a5a75bec90179bd8
I1209 16:26:19.152604 18726 volumes.go:233] volume vol-0a5a75bec90179bd8 not mounted at /rootfs/dev/xvdu
I1209 16:26:19.152639 18726 volumes.go:247] found nvme volume "nvme-Amazon_Elastic_Block_Store_vol0a5a75bec90179bd8" at "/dev/nvme1n1"
I1209 16:26:19.152652 18726 mounter.go:116] Found volume "vol-0a5a75bec90179bd8" mounted at device "/dev/nvme1n1"
I1209 16:26:19.153151 18726 mounter.go:173] Device already mounted on "/mnt/master-vol-0a5a75bec90179bd8", verifying it is our device
I1209 16:26:19.153167 18726 mounter.go:185] Found existing mount of "/dev/nvme1n1" at "/mnt/master-vol-0a5a75bec90179bd8"
I1209 16:26:19.153241 18726 mount_linux.go:164] Detected OS without systemd
I1209 16:26:19.153789 18726 mounter.go:226] matched device "/dev/nvme1n1" and "/dev/nvme1n1" via '\x00'
I1209 16:26:19.153803 18726 mounter.go:86] mounted master volume "vol-0a5a75bec90179bd8" on /mnt/master-vol-0a5a75bec90179bd8
I1209 16:26:19.153816 18726 main.go:269] discovered IP address: 10.250.16.215
I1209 16:26:19.153823 18726 main.go:274] Setting data dir to /rootfs/mnt/master-vol-0a5a75bec90179bd8
I1209 16:26:19.154260 18726 server.go:71] starting GRPC server using TLS, ServerName="etcd-manager-server-etcd-b"
I1209 16:26:19.154403 18726 s3context.go:331] product_uuid is "ec2004e4-d619-9524-bf5b-e56ce28c2bd6", assuming running on EC2
I1209 16:26:19.155152 18726 s3context.go:164] got region from metadata: "us-east-1"
I1209 16:26:19.212772 18726 s3context.go:210] found bucket in region "us-east-1"
I1209 16:26:19.212798 18726 s3fs.go:128] Writing file "s3://com.brightcove.deploys.dev.kube.dev-us-east-1/kube.us-east-1.dev.deploys.brightcove.com/backups/etcd/main/control/etcd-cluster-created"
I1209 16:26:19.212816 18726 s3context.go:238] Checking default bucket encryption for "com.brightcove.deploys.dev.kube.dev-us-east-1"
W1209 16:26:19.272282 18726 controller.go:135] not enabling TLS for etcd, this is insecure
I1209 16:26:19.272306 18726 server.go:89] GRPC server listening on "10.250.16.215:3996"
I1209 16:26:19.272403 18726 etcdserver.go:534] starting etcd with state cluster:<cluster_token:"ckDjqRPhIBJGj0dtx6qVlw" nodes:<name:"etcd-a" peer_urls:"http://etcd-a.internal.kube.us-east-1.dev.deploys.brightcove.com:2380" client_urls:"http://etcd-a.internal.kube.us-east-1.dev.deploys.brightcove.com:4001" quarantined_client_urls:"http://etcd-a.internal.kube.us-east-1.dev.deploys.brightcove.com:3994" > nodes:<name:"etcd-b" peer_urls:"http://etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com:2380" client_urls:"http://etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com:4001" quarantined_client_urls:"http://etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com:3994" > nodes:<name:"etcd-c" peer_urls:"http://etcd-c.internal.kube.us-east-1.dev.deploys.brightcove.com:2380" client_urls:"http://etcd-c.internal.kube.us-east-1.dev.deploys.brightcove.com:4001" quarantined_client_urls:"http://etcd-c.internal.kube.us-east-1.dev.deploys.brightcove.com:3994" > > etcd_version:"2.2.1"
I1209 16:26:19.272549 18726 etcdserver.go:543] starting etcd with datadir /rootfs/mnt/master-vol-0a5a75bec90179bd8/data/ckDjqRPhIBJGj0dtx6qVlw
I1209 16:26:19.272548 18726 volumes.go:85] AWS API Request: ec2/DescribeVolumes
W1209 16:26:19.272599 18726 pki.go:46] not generating peer keypair as peers-ca not set
W1209 16:26:19.272626 18726 pki.go:84] not generating client keypair as clients-ca not set
I1209 16:26:19.272703 18726 etcdprocess.go:180] executing command /opt/etcd-v2.2.1-linux-amd64/etcd [/opt/etcd-v2.2.1-linux-amd64/etcd]
W1209 16:26:19.272749 18726 etcdprocess.go:234] using insecure configuration for etcd peers
W1209 16:26:19.272774 18726 etcdprocess.go:243] using insecure configuration for etcd clients
2019-12-09 16:26:19.277754 I | flags: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com:4001
2019-12-09 16:26:19.277784 I | flags: recognized and used environment variable ETCD_DATA_DIR=/rootfs/mnt/master-vol-0a5a75bec90179bd8/data/ckDjqRPhIBJGj0dtx6qVlw
2019-12-09 16:26:19.277799 I | flags: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=http://etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com:2380
2019-12-09 16:26:19.277814 I | flags: recognized and used environment variable ETCD_INITIAL_CLUSTER=etcd-a=http://etcd-a.internal.kube.us-east-1.dev.deploys.brightcove.com:2380,etcd-b=http://etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com:2380,etcd-c=http://etcd-c.internal.kube.us-east-1.dev.deploys.brightcove.com:2380
2019-12-09 16:26:19.277820 I | flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=existing
2019-12-09 16:26:19.277830 I | flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_TOKEN=ckDjqRPhIBJGj0dtx6qVlw
2019-12-09 16:26:19.277838 I | flags: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:4001
2019-12-09 16:26:19.277848 I | flags: recognized and used environment variable ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2380
2019-12-09 16:26:19.277859 I | flags: recognized and used environment variable ETCD_NAME=etcd-b
2019-12-09 16:26:19.277889 W | flags: unrecognized environment variable ETCD_LISTEN_METRICS_URLS=
2019-12-09 16:26:19.277934 I | etcdmain: etcd Version: 2.2.1
2019-12-09 16:26:19.277938 I | etcdmain: Git SHA: Not provided (use ./build instead of go build)
2019-12-09 16:26:19.277941 I | etcdmain: Go Version: go1.12.5
2019-12-09 16:26:19.277945 I | etcdmain: Go OS/Arch: linux/amd64
2019-12-09 16:26:19.277949 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2019-12-09 16:26:19.277992 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2019-12-09 16:26:19.278095 I | etcdmain: listening for peers on http://0.0.0.0:2380
2019-12-09 16:26:19.278118 I | etcdmain: listening for client requests on http://0.0.0.0:4001
2019-12-09 16:26:19.371091 I | etcdserver: recovered store from snapshot at index 380038
2019-12-09 16:26:19.371126 I | etcdserver: name = etcd-b
2019-12-09 16:26:19.371130 I | etcdserver: data dir = /rootfs/mnt/master-vol-0a5a75bec90179bd8/data/ckDjqRPhIBJGj0dtx6qVlw
2019-12-09 16:26:19.371134 I | etcdserver: member dir = /rootfs/mnt/master-vol-0a5a75bec90179bd8/data/ckDjqRPhIBJGj0dtx6qVlw/member
2019-12-09 16:26:19.371138 I | etcdserver: heartbeat = 100ms
2019-12-09 16:26:19.371140 I | etcdserver: election = 1000ms
2019-12-09 16:26:19.371144 I | etcdserver: snapshot count = 10000
2019-12-09 16:26:19.371155 I | etcdserver: advertise client URLs = http://etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com:4001
2019-12-09 16:26:19.371185 I | etcdserver: loaded cluster information from store: <nil>
I1209 16:26:19.373963 18726 volumes.go:85] AWS API Request: ec2/DescribeInstances
2019-12-09 16:26:19.412180 I | etcdserver: restarting member a8bc606d954cb360 in cluster 362b3eb57d5b3247 at commit index 386849
2019-12-09 16:26:19.412557 I | raft: a8bc606d954cb360 became follower at term 826
2019-12-09 16:26:19.412578 I | raft: newRaft a8bc606d954cb360 [peers: [21c1cba54be22c9a,85558b08fd6377a2,a8bc606d954cb360], term: 826, commit: 386849, applied: 380038, lastindex: 386849, lastterm: 20]
2019-12-09 16:26:19.419037 E | rafthttp: request sent was ignored (cluster ID mismatch: remote[21c1cba54be22c9a]=138e61746ab70219, local=362b3eb57d5b3247)
2019-12-09 16:26:19.419054 E | rafthttp: request sent was ignored (cluster ID mismatch: remote[21c1cba54be22c9a]=138e61746ab70219, local=362b3eb57d5b3247)
2019-12-09 16:26:19.419064 E | rafthttp: failed to dial 21c1cba54be22c9a on stream Message (cluster ID mismatch)
2019-12-09 16:26:19.419073 E | rafthttp: failed to dial 21c1cba54be22c9a on stream MsgApp v2 (cluster ID mismatch)
2019-12-09 16:26:19.419903 I | etcdserver: starting server... [version: 2.2.1, cluster version: 2.2]
2019-12-09 16:26:19.422255 E | rafthttp: request sent was ignored (cluster ID mismatch: remote[85558b08fd6377a2]=138e61746ab70219, local=362b3eb57d5b3247)
2019-12-09 16:26:19.422279 E | rafthttp: failed to dial 85558b08fd6377a2 on stream Message (cluster ID mismatch)
2019-12-09 16:26:19.422552 E | rafthttp: request sent was ignored (cluster ID mismatch: remote[85558b08fd6377a2]=138e61746ab70219, local=362b3eb57d5b3247)
2019-12-09 16:26:19.422565 E | rafthttp: failed to dial 85558b08fd6377a2 on stream MsgApp v2 (cluster ID mismatch)
I1209 16:26:19.439578 18726 peers.go:101] found new candidate peer from discovery: etcd-a [{10.250.17.141 0} {10.250.17.141 0}]
I1209 16:26:19.439616 18726 peers.go:101] found new candidate peer from discovery: etcd-b [{10.250.16.215 0} {10.250.16.215 0}]
I1209 16:26:19.439629 18726 peers.go:101] found new candidate peer from discovery: etcd-c [{10.250.18.173 0} {10.250.18.173 0}]
I1209 16:26:19.439703 18726 peers.go:281] connecting to peer "etcd-a" with TLS policy, servername="etcd-manager-server-etcd-a"
I1209 16:26:19.439733 18726 hosts.go:84] hosts update: primary=map[], fallbacks=map[etcd-a.internal.kube.us-east-1.dev.deploys.brightcove.com:[10.250.17.141 10.250.17.141] etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com:[10.250.16.215 10.250.16.215] etcd-c.internal.kube.us-east-1.dev.deploys.brightcove.com:[10.250.18.173 10.250.18.173]], final=map[10.250.16.215:[etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com] 10.250.17.141:[etcd-a.internal.kube.us-east-1.dev.deploys.brightcove.com etcd-a.internal.kube.us-east-1.dev.deploys.brightcove.com] 10.250.18.173:[etcd-c.internal.kube.us-east-1.dev.deploys.brightcove.com etcd-c.internal.kube.us-east-1.dev.deploys.brightcove.com]]
I1209 16:26:19.439903 18726 peers.go:281] connecting to peer "etcd-c" with TLS policy, servername="etcd-manager-server-etcd-c"
I1209 16:26:19.439982 18726 peers.go:281] connecting to peer "etcd-b" with TLS policy, servername="etcd-manager-server-etcd-b"
W1209 16:26:19.440686 18726 peers.go:325] unable to grpc-ping discovered peer 10.250.18.173:3996: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.250.18.173:3996: connect: connection refused"
I1209 16:26:19.440719 18726 peers.go:347] was not able to connect to peer etcd-c: map[10.250.18.173:3996:true]
W1209 16:26:19.440745 18726 peers.go:215] unexpected error from peer intercommunications: unable to connect to peer etcd-c
W1209 16:26:19.441043 18726 peers.go:325] unable to grpc-ping discovered peer 10.250.17.141:3996: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.250.17.141:3996: connect: connection refused"
I1209 16:26:19.441077 18726 peers.go:347] was not able to connect to peer etcd-a: map[10.250.17.141:3996:true]
W1209 16:26:19.441096 18726 peers.go:215] unexpected error from peer intercommunications: unable to connect to peer etcd-a
2019-12-09 16:26:19.520566 E | rafthttp: request sent was ignored (cluster ID mismatch: remote[21c1cba54be22c9a]=138e61746ab70219, local=362b3eb57d5b3247)
As this was the first of 3 hosts that would eventually have etcd-manager installed, the gossip-specific warnings are to be expected. The cluster ID mismatch errors are more significant: they are the consequence of etcd-manager mounting a volume that was several months old.
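One way to confirm a stale volume like this, offered as a sketch rather than a prescribed procedure: compare the membership each node reports via etcdctl's v2 member list (flag spelling per etcd 2.x etcdctl; the binary path and endpoint are taken from the logs above):
# Ask each node for its member list; a data dir restored from an old volume
# reports member IDs that do not match the other peers (compare against the
# IDs in the raft log above, e.g. 21c1cba54be22c9a and 85558b08fd6377a2).
/opt/etcd-v2.2.1-linux-amd64/etcdctl \
  --peers http://etcd-b.internal.kube.us-east-1.dev.deploys.brightcove.com:4001 \
  member list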
Some of the approaches that occur to me to address this are:
I think the first would suffice in most cases, but the other options result in less unexpected behavior in the future.
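Whichever approach is taken, the candidate volumes can be enumerated using the same tag filters etcd-manager logs when selecting storage; a minimal sketch (the region and output shape are assumptions, the tag values come from the log above):
# List volumes carrying the tags etcd-manager matches on, so volumes left
# over from the old EBS backup strategy can be identified before removal.
aws ec2 describe-volumes --region us-east-1 \
  --filters "Name=tag-key,Values=k8s.io/etcd/main" \
            "Name=tag:kubernetes.io/cluster/kube.us-east-1.dev.deploys.brightcove.com,Values=owned" \
  --query 'Volumes[].[VolumeId,CreateTime]' --output text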
Hello,
This is not an issue; it is a question. Do you have any plans to implement an in-place upgrade from etcd 3.0.17 to 3.2.24?
We have a k8s environment running etcd 3.0.17 and would like to protect all etcd communication with TLS, but etcd-manager doesn't support version 3.0.17.
etcd-manager/pkg/controller/controller.go
Line 421 in 7af893b
me-south-1 is a new AWS region.
I use kops to deploy k8s in me-south-1,
but etcd-manager returns the error: me-south-1 is an invalid region.
I watched the pull request and found that the code to support me-south-1 was uploaded 8 days ago.
How do I build it and use it to support my deployment?
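Not an official procedure, but building from a checkout that already contains the me-south-1 change should follow the build steps quoted elsewhere in this thread; a minimal sketch (assumption: the Makefile push target builds and pushes the image via the //images:push-etcd-manager Bazel target, and the destination registry is whatever that target is configured with in your checkout):
# Build and push the etcd-manager image from source at a commit that
# includes me-south-1 support, using the push target seen in the Makefile.
git clone https://github.com/kopeio/etcd-manager.git
cd etcd-manager
make push-etcd-manager
You would then point your cluster at the pushed image instead of a released kopeio/etcd-manager tag.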
While testing out Kops 1.11-beta.1 with K8s 1.12.3 I noticed some data corruption after migrating to etcd-manager.
Replication process: create a new k8s cluster with Kops.
Kops version: 1.10
Kubernetes version: 1.10
etcd version: 3.2.12
Update the etcd and k8s versions:
Kubernetes version: 1.13
etcd version: 3.2.18 / 3.2.24 (Tested with both and saw the same issue)
Below are the logs I'm seeing from the etcd-manager container when the corruption seems to happen. When this happens, etcd does not start, and unfortunately I have not been able to find any relevant logs as to why.
Flag --insecure-bind-address has been deprecated, This flag will be removed in a future version.
Flag --insecure-port has been deprecated, This flag will be removed in a future version.
I1212 00:12:42.352817 7 flags.go:33] FLAG: --address="127.0.0.1"
I1212 00:12:42.352874 7 flags.go:33] FLAG: --admission-control="[]"
I1212 00:12:42.352885 7 flags.go:33] FLAG: --admission-control-config-file=""
I1212 00:12:42.352892 7 flags.go:33] FLAG: --advertise-address="<nil>"
I1212 00:12:42.352896 7 flags.go:33] FLAG: --allow-privileged="true"
I1212 00:12:42.352900 7 flags.go:33] FLAG: --alsologtostderr="false"
I1212 00:12:42.352904 7 flags.go:33] FLAG: --anonymous-auth="false"
I1212 00:12:42.352907 7 flags.go:33] FLAG: --apiserver-count="5"
I1212 00:12:42.352911 7 flags.go:33] FLAG: --audit-log-batch-buffer-size="10000"
I1212 00:12:42.352915 7 flags.go:33] FLAG: --audit-log-batch-max-size="1"
I1212 00:12:42.352917 7 flags.go:33] FLAG: --audit-log-batch-max-wait="0s"
I1212 00:12:42.352921 7 flags.go:33] FLAG: --audit-log-batch-throttle-burst="0"
I1212 00:12:42.352924 7 flags.go:33] FLAG: --audit-log-batch-throttle-enable="false"
I1212 00:12:42.352927 7 flags.go:33] FLAG: --audit-log-batch-throttle-qps="0"
I1212 00:12:42.352934 7 flags.go:33] FLAG: --audit-log-format="json"
I1212 00:12:42.352937 7 flags.go:33] FLAG: --audit-log-maxage="10"
I1212 00:12:42.352940 7 flags.go:33] FLAG: --audit-log-maxbackup="5"
I1212 00:12:42.352943 7 flags.go:33] FLAG: --audit-log-maxsize="100"
I1212 00:12:42.352946 7 flags.go:33] FLAG: --audit-log-mode="blocking"
I1212 00:12:42.352949 7 flags.go:33] FLAG: --audit-log-path="/var/log/kube-audit.log"
I1212 00:12:42.352952 7 flags.go:33] FLAG: --audit-log-truncate-enabled="false"
I1212 00:12:42.352955 7 flags.go:33] FLAG: --audit-log-truncate-max-batch-size="10485760"
I1212 00:12:42.352960 7 flags.go:33] FLAG: --audit-log-truncate-max-event-size="102400"
I1212 00:12:42.352963 7 flags.go:33] FLAG: --audit-log-version="audit.k8s.io/v1beta1"
I1212 00:12:42.352966 7 flags.go:33] FLAG: --audit-policy-file="/srv/kubernetes/audit_policy.yaml"
I1212 00:12:42.352969 7 flags.go:33] FLAG: --audit-webhook-batch-buffer-size="10000"
I1212 00:12:42.352972 7 flags.go:33] FLAG: --audit-webhook-batch-initial-backoff="10s"
I1212 00:12:42.352975 7 flags.go:33] FLAG: --audit-webhook-batch-max-size="400"
I1212 00:12:42.352978 7 flags.go:33] FLAG: --audit-webhook-batch-max-wait="30s"
I1212 00:12:42.352981 7 flags.go:33] FLAG: --audit-webhook-batch-throttle-burst="15"
I1212 00:12:42.352984 7 flags.go:33] FLAG: --audit-webhook-batch-throttle-enable="true"
I1212 00:12:42.352987 7 flags.go:33] FLAG: --audit-webhook-batch-throttle-qps="10"
I1212 00:12:42.352990 7 flags.go:33] FLAG: --audit-webhook-config-file=""
I1212 00:12:42.352993 7 flags.go:33] FLAG: --audit-webhook-initial-backoff="10s"
I1212 00:12:42.352996 7 flags.go:33] FLAG: --audit-webhook-mode="batch"
I1212 00:12:42.352999 7 flags.go:33] FLAG: --audit-webhook-truncate-enabled="false"
I1212 00:12:42.353002 7 flags.go:33] FLAG: --audit-webhook-truncate-max-batch-size="10485760"
I1212 00:12:42.353005 7 flags.go:33] FLAG: --audit-webhook-truncate-max-event-size="102400"
I1212 00:12:42.353008 7 flags.go:33] FLAG: --audit-webhook-version="audit.k8s.io/v1beta1"
I1212 00:12:42.353011 7 flags.go:33] FLAG: --authentication-token-webhook-cache-ttl="2m0s"
I1212 00:12:42.353014 7 flags.go:33] FLAG: --authentication-token-webhook-config-file="/etc/kubernetes/authn.config"
I1212 00:12:42.353017 7 flags.go:33] FLAG: --authorization-mode="[RBAC]"
I1212 00:12:42.353021 7 flags.go:33] FLAG: --authorization-policy-file=""
I1212 00:12:42.353024 7 flags.go:33] FLAG: --authorization-webhook-cache-authorized-ttl="5m0s"
I1212 00:12:42.353027 7 flags.go:33] FLAG: --authorization-webhook-cache-unauthorized-ttl="30s"
I1212 00:12:42.353030 7 flags.go:33] FLAG: --authorization-webhook-config-file=""
I1212 00:12:42.353032 7 flags.go:33] FLAG: --basic-auth-file="/srv/kubernetes/basic_auth.csv"
I1212 00:12:42.353036 7 flags.go:33] FLAG: --bind-address="0.0.0.0"
I1212 00:12:42.353039 7 flags.go:33] FLAG: --cert-dir="/var/run/kubernetes"
I1212 00:12:42.353042 7 flags.go:33] FLAG: --client-ca-file="/srv/kubernetes/ca.crt"
I1212 00:12:42.353045 7 flags.go:33] FLAG: --cloud-config="/etc/kubernetes/cloud.config"
I1212 00:12:42.353048 7 flags.go:33] FLAG: --cloud-provider="aws"
I1212 00:12:42.353051 7 flags.go:33] FLAG: --cloud-provider-gce-lb-src-cidrs="130.211.0.0/22,209.85.152.0/22,209.85.204.0/22,35.191.0.0/16"
I1212 00:12:42.353056 7 flags.go:33] FLAG: --contention-profiling="false"
I1212 00:12:42.353059 7 flags.go:33] FLAG: --cors-allowed-origins="[]"
I1212 00:12:42.353065 7 flags.go:33] FLAG: --default-not-ready-toleration-seconds="300"
I1212 00:12:42.353068 7 flags.go:33] FLAG: --default-unreachable-toleration-seconds="300"
I1212 00:12:42.353071 7 flags.go:33] FLAG: --default-watch-cache-size="100"
I1212 00:12:42.353074 7 flags.go:33] FLAG: --delete-collection-workers="1"
I1212 00:12:42.353077 7 flags.go:33] FLAG: --deserialization-cache-size="0"
I1212 00:12:42.353080 7 flags.go:33] FLAG: --disable-admission-plugins="[]"
I1212 00:12:42.353083 7 flags.go:33] FLAG: --enable-admission-plugins="[Initializers,NamespaceLifecycle,LimitRanger,ServiceAccount,PersistentVolumeLabel,DefaultStorageClass,DefaultTolerationSeconds,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,NodeRestriction,ResourceQuota]"
I1212 00:12:42.353100 7 flags.go:33] FLAG: --enable-aggregator-routing="false"
I1212 00:12:42.353107 7 flags.go:33] FLAG: --enable-bootstrap-token-auth="false"
I1212 00:12:42.353109 7 flags.go:33] FLAG: --enable-garbage-collector="true"
I1212 00:12:42.353112 7 flags.go:33] FLAG: --enable-logs-handler="true"
I1212 00:12:42.353115 7 flags.go:33] FLAG: --enable-swagger-ui="false"
I1212 00:12:42.353118 7 flags.go:33] FLAG: --endpoint-reconciler-type="lease"
I1212 00:12:42.353121 7 flags.go:33] FLAG: --etcd-cafile=""
I1212 00:12:42.353123 7 flags.go:33] FLAG: --etcd-certfile=""
I1212 00:12:42.353126 7 flags.go:33] FLAG: --etcd-compaction-interval="5m0s"
I1212 00:12:42.353129 7 flags.go:33] FLAG: --etcd-count-metric-poll-period="1m0s"
I1212 00:12:42.353132 7 flags.go:33] FLAG: --etcd-keyfile=""
I1212 00:12:42.353135 7 flags.go:33] FLAG: --etcd-prefix="/registry"
I1212 00:12:42.353138 7 flags.go:33] FLAG: --etcd-quorum-read="true"
I1212 00:12:42.353141 7 flags.go:33] FLAG: --etcd-servers="[http://127.0.0.1:4001]"
I1212 00:12:42.353145 7 flags.go:33] FLAG: --etcd-servers-overrides="[/events#http://127.0.0.1:4002]"
I1212 00:12:42.353150 7 flags.go:33] FLAG: --event-ttl="1h0m0s"
I1212 00:12:42.353156 7 flags.go:33] FLAG: --experimental-encryption-provider-config=""
I1212 00:12:42.353159 7 flags.go:33] FLAG: --external-hostname=""
I1212 00:12:42.353162 7 flags.go:33] FLAG: --feature-gates=""
I1212 00:12:42.353167 7 flags.go:33] FLAG: --help="false"
I1212 00:12:42.353170 7 flags.go:33] FLAG: --http2-max-streams-per-connection="0"
I1212 00:12:42.353172 7 flags.go:33] FLAG: --insecure-bind-address="127.0.0.1"
I1212 00:12:42.353176 7 flags.go:33] FLAG: --insecure-port="8080"
I1212 00:12:42.353179 7 flags.go:33] FLAG: --kubelet-certificate-authority=""
I1212 00:12:42.353182 7 flags.go:33] FLAG: --kubelet-client-certificate="/srv/kubernetes/kubelet-api.pem"
I1212 00:12:42.353185 7 flags.go:33] FLAG: --kubelet-client-key="/srv/kubernetes/kubelet-api-key.pem"
I1212 00:12:42.353188 7 flags.go:33] FLAG: --kubelet-https="true"
I1212 00:12:42.353191 7 flags.go:33] FLAG: --kubelet-port="10250"
I1212 00:12:42.353199 7 flags.go:33] FLAG: --kubelet-preferred-address-types="[InternalIP,Hostname,ExternalIP]"
I1212 00:12:42.353203 7 flags.go:33] FLAG: --kubelet-read-only-port="10255"
I1212 00:12:42.353206 7 flags.go:33] FLAG: --kubelet-timeout="5s"
I1212 00:12:42.353209 7 flags.go:33] FLAG: --kubernetes-service-node-port="0"
I1212 00:12:42.353212 7 flags.go:33] FLAG: --log-backtrace-at=":0"
I1212 00:12:42.353219 7 flags.go:33] FLAG: --log-dir=""
I1212 00:12:42.353222 7 flags.go:33] FLAG: --log-flush-frequency="5s"
I1212 00:12:42.353225 7 flags.go:33] FLAG: --logtostderr="true"
I1212 00:12:42.353228 7 flags.go:33] FLAG: --master-service-namespace="default"
I1212 00:12:42.353231 7 flags.go:33] FLAG: --max-connection-bytes-per-sec="0"
I1212 00:12:42.353234 7 flags.go:33] FLAG: --max-mutating-requests-inflight="200"
I1212 00:12:42.353237 7 flags.go:33] FLAG: --max-requests-inflight="400"
I1212 00:12:42.353240 7 flags.go:33] FLAG: --min-request-timeout="1800"
I1212 00:12:42.353243 7 flags.go:33] FLAG: --oidc-ca-file=""
I1212 00:12:42.353246 7 flags.go:33] FLAG: --oidc-client-id=""
I1212 00:12:42.353249 7 flags.go:33] FLAG: --oidc-groups-claim=""
I1212 00:12:42.353251 7 flags.go:33] FLAG: --oidc-groups-prefix=""
I1212 00:12:42.353254 7 flags.go:33] FLAG: --oidc-issuer-url=""
I1212 00:12:42.353257 7 flags.go:33] FLAG: --oidc-required-claim=""
I1212 00:12:42.353261 7 flags.go:33] FLAG: --oidc-signing-algs="[RS256]"
I1212 00:12:42.353266 7 flags.go:33] FLAG: --oidc-username-claim="sub"
I1212 00:12:42.353269 7 flags.go:33] FLAG: --oidc-username-prefix=""
I1212 00:12:42.353271 7 flags.go:33] FLAG: --port="8080"
I1212 00:12:42.353274 7 flags.go:33] FLAG: --profiling="true"
I1212 00:12:42.353277 7 flags.go:33] FLAG: --proxy-client-cert-file="/srv/kubernetes/apiserver-aggregator.cert"
I1212 00:12:42.353281 7 flags.go:33] FLAG: --proxy-client-key-file="/srv/kubernetes/apiserver-aggregator.key"
I1212 00:12:42.353284 7 flags.go:33] FLAG: --repair-malformed-updates="false"
I1212 00:12:42.353287 7 flags.go:33] FLAG: --request-timeout="1m0s"
I1212 00:12:42.353290 7 flags.go:33] FLAG: --requestheader-allowed-names="[aggregator]"
I1212 00:12:42.353294 7 flags.go:33] FLAG: --requestheader-client-ca-file="/srv/kubernetes/apiserver-aggregator-ca.cert"
I1212 00:12:42.353299 7 flags.go:33] FLAG: --requestheader-extra-headers-prefix="[X-Remote-Extra-]"
I1212 00:12:42.353304 7 flags.go:33] FLAG: --requestheader-group-headers="[X-Remote-Group]"
I1212 00:12:42.353307 7 flags.go:33] FLAG: --requestheader-username-headers="[X-Remote-User]"
I1212 00:12:42.353313 7 flags.go:33] FLAG: --runtime-config="admissionregistration.k8s.io/v1alpha1=true"
I1212 00:12:42.353320 7 flags.go:33] FLAG: --secure-port="443"
I1212 00:12:42.353323 7 flags.go:33] FLAG: --service-account-api-audiences="[]"
I1212 00:12:42.353326 7 flags.go:33] FLAG: --service-account-issuer=""
I1212 00:12:42.353329 7 flags.go:33] FLAG: --service-account-key-file="[]"
I1212 00:12:42.353338 7 flags.go:33] FLAG: --service-account-lookup="true"
I1212 00:12:42.353341 7 flags.go:33] FLAG: --service-account-max-token-expiration="0s"
I1212 00:12:42.353344 7 flags.go:33] FLAG: --service-account-signing-key-file=""
I1212 00:12:42.353347 7 flags.go:33] FLAG: --service-cluster-ip-range="100.64.0.0/13"
I1212 00:12:42.353352 7 flags.go:33] FLAG: --service-node-port-range="30000-32767"
I1212 00:12:42.353359 7 flags.go:33] FLAG: --ssh-keyfile=""
I1212 00:12:42.353362 7 flags.go:33] FLAG: --ssh-user=""
I1212 00:12:42.353364 7 flags.go:33] FLAG: --stderrthreshold="2"
I1212 00:12:42.353367 7 flags.go:33] FLAG: --storage-backend="etcd3"
I1212 00:12:42.353370 7 flags.go:33] FLAG: --storage-media-type="application/vnd.kubernetes.protobuf"
I1212 00:12:42.353374 7 flags.go:33] FLAG: --storage-versions="admission.k8s.io/v1beta1,admissionregistration.k8s.io/v1beta1,apps/v1,authentication.k8s.io/v1,authorization.k8s.io/v1,autoscaling/v1,batch/v1,certificates.k8s.io/v1beta1,coordination.k8s.io/v1beta1,events.k8s.io/v1beta1,extensions/v1beta1,imagepolicy.k8s.io/v1alpha1,networking.k8s.io/v1,policy/v1beta1,rbac.authorization.k8s.io/v1,scheduling.k8s.io/v1beta1,settings.k8s.io/v1alpha1,storage.k8s.io/v1,v1"
I1212 00:12:42.353390 7 flags.go:33] FLAG: --target-ram-mb="0"
I1212 00:12:42.353393 7 flags.go:33] FLAG: --tls-cert-file="/srv/kubernetes/server.cert"
I1212 00:12:42.353396 7 flags.go:33] FLAG: --tls-cipher-suites="[]"
I1212 00:12:42.353400 7 flags.go:33] FLAG: --tls-min-version=""
I1212 00:12:42.353403 7 flags.go:33] FLAG: --tls-private-key-file="/srv/kubernetes/server.key"
I1212 00:12:42.353406 7 flags.go:33] FLAG: --tls-sni-cert-key="[]"
I1212 00:12:42.353410 7 flags.go:33] FLAG: --token-auth-file="/srv/kubernetes/known_tokens.csv"
I1212 00:12:42.353413 7 flags.go:33] FLAG: --v="2"
I1212 00:12:42.353416 7 flags.go:33] FLAG: --version="false"
I1212 00:12:42.353421 7 flags.go:33] FLAG: --vmodule=""
I1212 00:12:42.353424 7 flags.go:33] FLAG: --watch-cache="true"
I1212 00:12:42.353427 7 flags.go:33] FLAG: --watch-cache-sizes="[]"
I1212 00:12:42.353695 7 server.go:681] external host was not specified, using 10.5.0.30
I1212 00:12:42.354026 7 server.go:705] Initializing deserialization cache size based on 0MB limit
I1212 00:12:42.354036 7 server.go:724] Initializing cache sizes based on 0MB limit
I1212 00:12:42.354101 7 server.go:152] Version: v1.12.3
W1212 00:12:42.832684 7 admission.go:76] PersistentVolumeLabel admission controller is deprecated. Please remove this controller from your configuration files and scripts.
I1212 00:12:42.832846 7 feature_gate.go:206] feature gates: &{map[Initializers:true]}
I1212 00:12:42.832863 7 initialization.go:90] enabled Initializers feature as part of admission plugin setup
I1212 00:12:42.833085 7 plugins.go:158] Loaded 10 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,Priority,DefaultTolerationSeconds,PersistentVolumeLabel,DefaultStorageClass,MutatingAdmissionWebhook,Initializers.
I1212 00:12:42.833094 7 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
W1212 00:12:42.833382 7 admission.go:76] PersistentVolumeLabel admission controller is deprecated. Please remove this controller from your configuration files and scripts.
I1212 00:12:42.833654 7 plugins.go:158] Loaded 10 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,Priority,DefaultTolerationSeconds,PersistentVolumeLabel,DefaultStorageClass,MutatingAdmissionWebhook,Initializers.
I1212 00:12:42.833664 7 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
I1212 00:12:42.835749 7 store.go:1414] Monitoring customresourcedefinitions.apiextensions.k8s.io count at <storage-prefix>//apiextensions.k8s.io/customresourcedefinitions
I1212 00:12:42.859202 7 master.go:240] Using reconciler: lease
I1212 00:12:42.862882 7 store.go:1414] Monitoring podtemplates count at <storage-prefix>//podtemplates
I1212 00:12:42.863313 7 store.go:1414] Monitoring events count at <storage-prefix>//events
I1212 00:12:42.863693 7 store.go:1414] Monitoring limitranges count at <storage-prefix>//limitranges
I1212 00:12:42.864078 7 store.go:1414] Monitoring resourcequotas count at <storage-prefix>//resourcequotas
I1212 00:12:42.864499 7 store.go:1414] Monitoring secrets count at <storage-prefix>//secrets
I1212 00:12:42.864886 7 store.go:1414] Monitoring persistentvolumes count at <storage-prefix>//persistentvolumes
I1212 00:12:42.865271 7 store.go:1414] Monitoring persistentvolumeclaims count at <storage-prefix>//persistentvolumeclaims
I1212 00:12:42.865659 7 store.go:1414] Monitoring configmaps count at <storage-prefix>//configmaps
I1212 00:12:42.866063 7 store.go:1414] Monitoring namespaces count at <storage-prefix>//namespaces
I1212 00:12:42.866465 7 store.go:1414] Monitoring endpoints count at <storage-prefix>//services/endpoints
I1212 00:12:42.866890 7 store.go:1414] Monitoring nodes count at <storage-prefix>//minions
I1212 00:12:42.867659 7 store.go:1414] Monitoring pods count at <storage-prefix>//pods
I1212 00:12:42.868099 7 store.go:1414] Monitoring serviceaccounts count at <storage-prefix>//serviceaccounts
I1212 00:12:42.868523 7 store.go:1414] Monitoring services count at <storage-prefix>//services/specs
I1212 00:12:42.869296 7 store.go:1414] Monitoring replicationcontrollers count at <storage-prefix>//controllers
I1212 00:12:43.236425 7 master.go:432] Enabling API group "authentication.k8s.io".
I1212 00:12:43.236452 7 master.go:432] Enabling API group "authorization.k8s.io".
I1212 00:12:43.237028 7 store.go:1414] Monitoring horizontalpodautoscalers.autoscaling count at <storage-prefix>//horizontalpodautoscalers
I1212 00:12:43.237503 7 store.go:1414] Monitoring horizontalpodautoscalers.autoscaling count at <storage-prefix>//horizontalpodautoscalers
I1212 00:12:43.237908 7 store.go:1414] Monitoring horizontalpodautoscalers.autoscaling count at <storage-prefix>//horizontalpodautoscalers
I1212 00:12:43.237922 7 master.go:432] Enabling API group "autoscaling".
I1212 00:12:43.238316 7 store.go:1414] Monitoring jobs.batch count at <storage-prefix>//jobs
I1212 00:12:43.238723 7 store.go:1414] Monitoring cronjobs.batch count at <storage-prefix>//cronjobs
I1212 00:12:43.238739 7 master.go:432] Enabling API group "batch".
I1212 00:12:43.239112 7 store.go:1414] Monitoring certificatesigningrequests.certificates.k8s.io count at <storage-prefix>//certificatesigningrequests
I1212 00:12:43.239127 7 master.go:432] Enabling API group "certificates.k8s.io".
I1212 00:12:43.239556 7 store.go:1414] Monitoring leases.coordination.k8s.io count at <storage-prefix>//leases
I1212 00:12:43.239572 7 master.go:432] Enabling API group "coordination.k8s.io".
I1212 00:12:43.239956 7 store.go:1414] Monitoring replicationcontrollers count at <storage-prefix>//controllers
I1212 00:12:43.240365 7 store.go:1414] Monitoring daemonsets.extensions count at <storage-prefix>//daemonsets
I1212 00:12:43.240731 7 store.go:1414] Monitoring deployments.extensions count at <storage-prefix>//deployments
I1212 00:12:43.241123 7 store.go:1414] Monitoring ingresses.extensions count at <storage-prefix>//ingress
I1212 00:12:43.241545 7 store.go:1414] Monitoring podsecuritypolicies.policy count at <storage-prefix>//podsecuritypolicy
I1212 00:12:43.241975 7 store.go:1414] Monitoring replicasets.extensions count at <storage-prefix>//replicasets
I1212 00:12:43.242372 7 store.go:1414] Monitoring networkpolicies.networking.k8s.io count at <storage-prefix>//networkpolicies
I1212 00:12:43.242385 7 master.go:432] Enabling API group "extensions".
I1212 00:12:43.242779 7 store.go:1414] Monitoring networkpolicies.networking.k8s.io count at <storage-prefix>//networkpolicies
I1212 00:12:43.242791 7 master.go:432] Enabling API group "networking.k8s.io".
I1212 00:12:43.243237 7 store.go:1414] Monitoring poddisruptionbudgets.policy count at <storage-prefix>//poddisruptionbudgets
I1212 00:12:43.243653 7 store.go:1414] Monitoring podsecuritypolicies.policy count at <storage-prefix>//podsecuritypolicy
I1212 00:12:43.243666 7 master.go:432] Enabling API group "policy".
I1212 00:12:43.243998 7 store.go:1414] Monitoring roles.rbac.authorization.k8s.io count at <storage-prefix>//roles
I1212 00:12:43.244431 7 store.go:1414] Monitoring rolebindings.rbac.authorization.k8s.io count at <storage-prefix>//rolebindings
I1212 00:12:43.244808 7 store.go:1414] Monitoring clusterroles.rbac.authorization.k8s.io count at <storage-prefix>//clusterroles
I1212 00:12:43.245201 7 store.go:1414] Monitoring clusterrolebindings.rbac.authorization.k8s.io count at <storage-prefix>//clusterrolebindings
I1212 00:12:43.245546 7 store.go:1414] Monitoring roles.rbac.authorization.k8s.io count at <storage-prefix>//roles
I1212 00:12:43.245916 7 store.go:1414] Monitoring rolebindings.rbac.authorization.k8s.io count at <storage-prefix>//rolebindings
I1212 00:12:43.246314 7 store.go:1414] Monitoring clusterroles.rbac.authorization.k8s.io count at <storage-prefix>//clusterroles
I1212 00:12:43.246702 7 store.go:1414] Monitoring clusterrolebindings.rbac.authorization.k8s.io count at <storage-prefix>//clusterrolebindings
I1212 00:12:43.246718 7 master.go:432] Enabling API group "rbac.authorization.k8s.io".
I1212 00:12:43.247889 7 store.go:1414] Monitoring priorityclasses.scheduling.k8s.io count at <storage-prefix>//priorityclasses
I1212 00:12:43.247908 7 master.go:432] Enabling API group "scheduling.k8s.io".
I1212 00:12:43.247920 7 master.go:424] Skipping disabled API group "settings.k8s.io".
I1212 00:12:43.248329 7 store.go:1414] Monitoring storageclasses.storage.k8s.io count at <storage-prefix>//storageclasses
I1212 00:12:43.248726 7 store.go:1414] Monitoring volumeattachments.storage.k8s.io count at <storage-prefix>//volumeattachments
I1212 00:12:43.249164 7 store.go:1414] Monitoring storageclasses.storage.k8s.io count at <storage-prefix>//storageclasses
I1212 00:12:43.249176 7 master.go:432] Enabling API group "storage.k8s.io".
I1212 00:12:43.249588 7 store.go:1414] Monitoring deployments.extensions count at <storage-prefix>//deployments
I1212 00:12:43.249996 7 store.go:1414] Monitoring statefulsets.apps count at <storage-prefix>//statefulsets
I1212 00:12:43.250453 7 store.go:1414] Monitoring controllerrevisions.apps count at <storage-prefix>//controllerrevisions
I1212 00:12:43.250895 7 store.go:1414] Monitoring deployments.extensions count at <storage-prefix>//deployments
I1212 00:12:43.251298 7 store.go:1414] Monitoring statefulsets.apps count at <storage-prefix>//statefulsets
I1212 00:12:43.251706 7 store.go:1414] Monitoring daemonsets.extensions count at <storage-prefix>//daemonsets
I1212 00:12:43.252085 7 store.go:1414] Monitoring replicasets.extensions count at <storage-prefix>//replicasets
I1212 00:12:43.274505 7 store.go:1414] Monitoring controllerrevisions.apps count at <storage-prefix>//controllerrevisions
I1212 00:12:43.275002 7 store.go:1414] Monitoring deployments.extensions count at <storage-prefix>//deployments
I1212 00:12:43.276141 7 store.go:1414] Monitoring statefulsets.apps count at <storage-prefix>//statefulsets
I1212 00:12:43.277721 7 store.go:1414] Monitoring daemonsets.extensions count at <storage-prefix>//daemonsets
I1212 00:12:43.279482 7 store.go:1414] Monitoring replicasets.extensions count at <storage-prefix>//replicasets
I1212 00:12:43.279883 7 store.go:1414] Monitoring controllerrevisions.apps count at <storage-prefix>//controllerrevisions
I1212 00:12:43.279895 7 master.go:432] Enabling API group "apps".
I1212 00:12:43.280238 7 store.go:1414] Monitoring initializerconfigurations.admissionregistration.k8s.io count at <storage-prefix>//initializerconfigurations
I1212 00:12:43.280641 7 store.go:1414] Monitoring validatingwebhookconfigurations.admissionregistration.k8s.io count at <storage-prefix>//validatingwebhookconfigurations
I1212 00:12:43.280968 7 store.go:1414] Monitoring mutatingwebhookconfigurations.admissionregistration.k8s.io count at <storage-prefix>//mutatingwebhookconfigurations
I1212 00:12:43.280979 7 master.go:432] Enabling API group "admissionregistration.k8s.io".
I1212 00:12:43.281301 7 store.go:1414] Monitoring events count at <storage-prefix>//events
I1212 00:12:43.281312 7 master.go:432] Enabling API group "events.k8s.io".
W1212 00:12:43.516919 7 genericapiserver.go:325] Skipping API batch/v2alpha1 because it has no resources.
W1212 00:12:43.835670 7 genericapiserver.go:325] Skipping API rbac.authorization.k8s.io/v1alpha1 because it has no resources.
W1212 00:12:43.848163 7 genericapiserver.go:325] Skipping API scheduling.k8s.io/v1alpha1 because it has no resources.
W1212 00:12:43.869772 7 genericapiserver.go:325] Skipping API storage.k8s.io/v1alpha1 because it has no resources.
[restful] 2018/12/12 00:12:44 log.go:33: [restful/swagger] listing is available at https://10.5.0.30:443/swaggerapi
[restful] 2018/12/12 00:12:44 log.go:33: [restful/swagger] https://10.5.0.30:443/swaggerui/ is mapped to folder /swagger-ui/
[restful] 2018/12/12 00:12:45 log.go:33: [restful/swagger] listing is available at https://10.5.0.30:443/swaggerapi
[restful] 2018/12/12 00:12:45 log.go:33: [restful/swagger] https://10.5.0.30:443/swaggerui/ is mapped to folder /swagger-ui/
W1212 00:12:45.683798 7 admission.go:76] PersistentVolumeLabel admission controller is deprecated. Please remove this controller from your configuration files and scripts.
I1212 00:12:45.684127 7 plugins.go:158] Loaded 10 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,Priority,DefaultTolerationSeconds,PersistentVolumeLabel,DefaultStorageClass,MutatingAdmissionWebhook,Initializers.
I1212 00:12:45.684138 7 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
I1212 00:12:45.686223 7 store.go:1414] Monitoring apiservices.apiregistration.k8s.io count at <storage-prefix>//apiregistration.k8s.io/apiservices
I1212 00:12:45.686707 7 store.go:1414] Monitoring apiservices.apiregistration.k8s.io count at <storage-prefix>//apiregistration.k8s.io/apiservices
I1212 00:12:48.453407 7 deprecated_insecure_serving.go:50] Serving insecurely on 127.0.0.1:8080
I1212 00:12:48.454725 7 secure_serving.go:116] Serving securely on [::]:443
I1212 00:12:48.454763 7 autoregister_controller.go:136] Starting autoregister controller
I1212 00:12:48.454770 7 cache.go:32] Waiting for caches to sync for autoregister controller
I1212 00:12:48.454874 7 apiservice_controller.go:90] Starting APIServiceRegistrationController
I1212 00:12:48.454892 7 controller.go:84] Starting OpenAPI AggregationController
I1212 00:12:48.454902 7 cache.go:32] Waiting for caches to sync for APIServiceRegistrationController controller
I1212 00:12:48.454935 7 crdregistration_controller.go:112] Starting crd-autoregister controller
I1212 00:12:48.454932 7 crd_finalizer.go:242] Starting CRDFinalizer
I1212 00:12:48.454962 7 available_controller.go:278] Starting AvailableConditionController
I1212 00:12:48.454967 7 cache.go:32] Waiting for caches to sync for AvailableConditionController controller
I1212 00:12:48.454969 7 naming_controller.go:284] Starting NamingConditionController
I1212 00:12:48.454994 7 establishing_controller.go:73] Starting EstablishingController
I1212 00:12:48.454950 7 controller_utils.go:1027] Waiting for caches to sync for crd-autoregister controller
I1212 00:12:48.455033 7 customresource_discovery_controller.go:199] Starting DiscoveryController
I1212 00:12:58.923688 7 trace.go:76] Trace[1029194318]: "Create /api/v1/namespaces/kube-system/serviceaccounts" (started: 2018-12-12 00:12:48.921492128 +0000 UTC m=+6.629693070) (total time: 10.002174722s):
Trace[1029194318]: [10.002174722s] [10.00039192s] END
I1212 00:13:08.925557 7 trace.go:76] Trace[645995136]: "Create /apis/rbac.authorization.k8s.io/v1beta1/clusterrolebindings" (started: 2018-12-12 00:12:58.924586395 +0000 UTC m=+16.632787332) (total time: 10.000946296s):
Trace[645995136]: [10.000946296s] [10.00036682s] END
I1212 00:13:24.847128 7 shared_informer.go:119] stop requested
I1212 00:13:24.847145 7 shared_informer.go:119] stop requested
I1212 00:13:24.847146 7 shared_informer.go:119] stop requested
I1212 00:13:24.847144 7 secure_serving.go:156] Stopped listening on 127.0.0.1:8080
I1212 00:13:24.847158 7 shared_informer.go:119] stop requested
I1212 00:13:24.847160 7 shared_informer.go:119] stop requested
I1212 00:13:24.847158 7 shared_informer.go:119] stop requested
E1212 00:13:24.847165 7 customresource_discovery_controller.go:202] timed out waiting for caches to sync
I1212 00:13:24.847168 7 crd_finalizer.go:246] Shutting down CRDFinalizer
E1212 00:13:24.847171 7 controller_utils.go:1030] Unable to sync caches for crd-autoregister controller
I1212 00:13:24.847172 7 shared_informer.go:119] stop requested
I1212 00:13:24.847171 7 customresource_discovery_controller.go:203] Shutting down DiscoveryController
E1212 00:13:24.847180 7 cache.go:35] Unable to sync caches for autoregister controller
E1212 00:13:24.847148 7 cache.go:35] Unable to sync caches for APIServiceRegistrationController controller
I1212 00:13:24.847157 7 establishing_controller.go:77] Shutting down EstablishingController
I1212 00:13:24.847135 7 shared_informer.go:119] stop requested
I1212 00:13:24.847215 7 secure_serving.go:156] Stopped listening on [::]:443
I1212 00:13:24.847215 7 controller.go:171] Shutting down kubernetes service endpoint reconciler
E1212 00:13:24.847225 7 cache.go:35] Unable to sync caches for AvailableConditionController controller
I1212 00:13:24.847152 7 naming_controller.go:288] Shutting down NamingConditionController
I1212 00:13:24.847186 7 controller.go:90] Shutting down OpenAPI AggregationController
I1212 00:13:24.848248 7 crdregistration_controller.go:117] Shutting down crd-autoregister controller
I1212 00:13:24.849329 7 autoregister_controller.go:141] Shutting down autoregister controller
I1212 00:13:24.850406 7 apiservice_controller.go:94] Shutting down APIServiceRegistrationController
I1212 00:13:24.851479 7 available_controller.go:282] Shutting down AvailableConditionController
E1212 00:13:34.847575 7 controller.go:173] rpc error: code = Unavailable desc = transport is closing
E1212 00:13:48.464293 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.464359 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.465402 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
I1212 00:13:48.466507 7 trace.go:76] Trace[1330970842]: "List /apis/admissionregistration.k8s.io/v1alpha1/initializerconfigurations" (started: 2018-12-12 00:12:48.464197188 +0000 UTC m=+6.172398126) (total time: 1m0.00229233s):
Trace[1330970842]: [1m0.00229233s] [1m0.002288147s] END
E1212 00:13:48.466971 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.467596 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.468649 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
I1212 00:13:48.469741 7 trace.go:76] Trace[1868745693]: "List /apis/admissionregistration.k8s.io/v1alpha1/initializerconfigurations" (started: 2018-12-12 00:12:48.466884694 +0000 UTC m=+6.175085674) (total time: 1m0.002842133s):
Trace[1868745693]: [1m0.002842133s] [1m0.002837372s] END
E1212 00:13:48.470629 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.470821 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.470927 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.471076 7 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Secret: the server was unable to return a response in the time allotted, but may still be processing the request (get secrets)
E1212 00:13:48.471550 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.471675 7 reflector.go:134] k8s.io/apiextensions-apiserver/pkg/client/informers/internalversion/factory.go:117: Failed to list *apiextensions.CustomResourceDefinition: the server was unable to return a response in the time allotted, but may still be processing the request (get customresourcedefinitions.apiextensions.k8s.io)
E1212 00:13:48.471884 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
I1212 00:13:48.472979 7 trace.go:76] Trace[1800478074]: "List /apis/admissionregistration.k8s.io/v1alpha1/initializerconfigurations" (started: 2018-12-12 00:12:48.470532433 +0000 UTC m=+6.178733370) (total time: 1m0.002432073s):
Trace[1800478074]: [1m0.002432073s] [1m0.002427554s] END
E1212 00:13:48.474023 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.475085 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.477257 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
I1212 00:13:48.479430 7 trace.go:76] Trace[1280622339]: "List /api/v1/secrets" (started: 2018-12-12 00:12:48.470911773 +0000 UTC m=+6.179112712) (total time: 1m0.008501979s):
Trace[1280622339]: [1m0.008501979s] [1m0.008459411s] END
E1212 00:13:48.480486 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
I1212 00:13:48.481577 7 trace.go:76] Trace[1804652784]: "List /apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions" (started: 2018-12-12 00:12:48.471466491 +0000 UTC m=+6.179667429) (total time: 1m0.010100941s):
Trace[1804652784]: [1m0.010100941s] [1m0.010060801s] END
E1212 00:13:48.500678 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.500713 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.500806 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.500845 7 reflector.go:134] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:130: Failed to list *scheduling.PriorityClass: the server was unable to return a response in the time allotted, but may still be processing the request (get priorityclasses.scheduling.k8s.io)
E1212 00:13:48.500882 7 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.ClusterRole: the server was unable to return a response in the time allotted, but may still be processing the request (get clusterroles.rbac.authorization.k8s.io)
E1212 00:13:48.500903 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.500953 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.500957 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.500957 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.500979 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.500987 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.501007 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.501090 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.501091 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.501147 7 reflector.go:134] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:130: Failed to list *storage.StorageClass: the server was unable to return a response in the time allotted, but may still be processing the request (get storageclasses.storage.k8s.io)
E1212 00:13:48.501156 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.501160 7 reflector.go:134] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:130: Failed to list *core.LimitRange: the server was unable to return a response in the time allotted, but may still be processing the request (get limitranges)
E1212 00:13:48.501238 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.501241 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.501275 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.501284 7 reflector.go:134] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:130: Failed to list *core.Secret: the server was unable to return a response in the time allotted, but may still be processing the request (get secrets)
E1212 00:13:48.501395 7 reflector.go:134] k8s.io/kube-aggregator/pkg/client/informers/internalversion/factory.go:117: Failed to list *apiregistration.APIService: the server was unable to return a response in the time allotted, but may still be processing the request (get apiservices.apiregistration.k8s.io)
E1212 00:13:48.501398 7 reflector.go:134] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:130: Failed to list *core.PersistentVolume: the server was unable to return a response in the time allotted, but may still be processing the request (get persistentvolumes)
E1212 00:13:48.501439 7 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.ServiceAccount: the server was unable to return a response in the time allotted, but may still be processing the request (get serviceaccounts)
E1212 00:13:48.501457 7 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1beta1.ValidatingWebhookConfiguration: the server was unable to return a response in the time allotted, but may still be processing the request (get validatingwebhookconfigurations.admissionregistration.k8s.io)
E1212 00:13:48.501501 7 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Service: the server was unable to return a response in the time allotted, but may still be processing the request (get services)
E1212 00:13:48.501524 7 reflector.go:134] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:130: Failed to list *core.ResourceQuota: the server was unable to return a response in the time allotted, but may still be processing the request (get resourcequotas)
E1212 00:13:48.501628 7 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Pod: the server was unable to return a response in the time allotted, but may still be processing the request (get pods)
E1212 00:13:48.501687 7 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1beta1.MutatingWebhookConfiguration: the server was unable to return a response in the time allotted, but may still be processing the request (get mutatingwebhookconfigurations.admissionregistration.k8s.io)
E1212 00:13:48.501731 7 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Namespace: the server was unable to return a response in the time allotted, but may still be processing the request (get namespaces)
E1212 00:13:48.501747 7 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.RoleBinding: the server was unable to return a response in the time allotted, but may still be processing the request (get rolebindings.rbac.authorization.k8s.io)
E1212 00:13:48.501776 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.501974 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.502090 7 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Role: the server was unable to return a response in the time allotted, but may still be processing the request (get roles.rbac.authorization.k8s.io)
I1212 00:13:48.502863 7 trace.go:76] Trace[2003208653]: "List /apis/scheduling.k8s.io/v1beta1/priorityclasses" (started: 2018-12-12 00:12:48.50058919 +0000 UTC m=+6.208790128) (total time: 1m0.002260482s):
Trace[2003208653]: [1m0.002260482s] [1m0.002225647s] END
E1212 00:13:48.503663 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.503680 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.503783 7 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.StorageClass: the server was unable to return a response in the time allotted, but may still be processing the request (get storageclasses.storage.k8s.io)
E1212 00:13:48.503809 7 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.ClusterRoleBinding: the server was unable to return a response in the time allotted, but may still be processing the request (get clusterrolebindings.rbac.authorization.k8s.io)
E1212 00:13:48.503945 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.503984 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1212 00:13:48.504107 7 reflector.go:134] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Endpoints: the server was unable to return a response in the time allotted, but may still be processing the request (get endpoints)
E1212 00:13:48.504981 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.508235 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.509303 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.510393 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.511474 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.512543 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.513624 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.514705 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.515781 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.516852 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.517948 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.521205 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.522313 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.523383 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.536332 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.538482 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.539550 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
E1212 00:13:48.542790 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout
I1212 00:13:48.544949 7 trace.go:76] Trace[1776629030]: "List /apis/rbac.authorization.k8s.io/v1/clusterroles" (started: 2018-12-12 00:12:48.500703254 +0000 UTC m=+6.208904191) (total time: 1m0.044230748s):
Trace[1776629030]: [1m0.044230748s] [1m0.044191519s] END
E1212 00:13:48.546005 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.547081 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.548160 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.549233 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.550326 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.551402 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.552483 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.553559 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.554642 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.555717 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.556795 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.557885 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.558957 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.560038 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.561114 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.562191 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
E1212 00:13:48.563270 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}
I1212 00:13:48.564370 7 trace.go:76] Trace[1439133424]: "List /apis/storage.k8s.io/v1/storageclasses" (started: 2018-12-12 00:12:48.50081678 +0000 UTC m=+6.209017718) (total time: 1m0.063540529s):
Trace[1439133424]: [1m0.063540529s] [1m0.063506486s] END
I1212 00:13:48.565439 7 trace.go:76] Trace[1683817720]: "List /api/v1/serviceaccounts" (started: 2018-12-12 00:12:48.500940682 +0000 UTC m=+6.209141619) (total time: 1m0.064488328s):
Trace[1683817720]: [1m0.064488328s] [1m0.064456016s] END
I1212 00:13:48.566514 7 trace.go:76] Trace[491490319]: "List /api/v1/persistentvolumes" (started: 2018-12-12 00:12:48.500986525 +0000 UTC m=+6.209187462) (total time: 1m0.065518757s):
Trace[491490319]: [1m0.065518757s] [1m0.065482619s] END
I1212 00:13:48.567591 7 trace.go:76] Trace[1474503645]: "List /apis/apiregistration.k8s.io/v1/apiservices" (started: 2018-12-12 00:12:48.500875642 +0000 UTC m=+6.209076622) (total time: 1m0.066706928s):
Trace[1474503645]: [1m0.066706928s] [1m0.066656322s] END
I1212 00:13:48.568677 7 trace.go:76] Trace[635852309]: "List /api/v1/secrets" (started: 2018-12-12 00:12:48.500940686 +0000 UTC m=+6.209141623) (total time: 1m0.06772371s):
Trace[635852309]: [1m0.06772371s] [1m0.067687683s] END
I1212 00:13:48.569755 7 trace.go:76] Trace[175882069]: "List /api/v1/limitranges" (started: 2018-12-12 00:12:48.500925548 +0000 UTC m=+6.209126486) (total time: 1m0.06881921s):
Trace[175882069]: [1m0.06881921s] [1m0.068781117s] END
I1212 00:13:48.570836 7 trace.go:76] Trace[122202535]: "List /api/v1/pods" (started: 2018-12-12 00:12:48.500951889 +0000 UTC m=+6.209152828) (total time: 1m0.069871581s):
Trace[122202535]: [1m0.069871581s] [1m0.069830326s] END
I1212 00:13:48.571908 7 trace.go:76] Trace[865708000]: "List /api/v1/resourcequotas" (started: 2018-12-12 00:12:48.501056066 +0000 UTC m=+6.209257003) (total time: 1m0.070840152s):
Trace[865708000]: [1m0.070840152s] [1m0.070808759s] END
I1212 00:13:48.572979 7 trace.go:76] Trace[955305514]: "List /apis/rbac.authorization.k8s.io/v1/rolebindings" (started: 2018-12-12 00:12:48.501055621 +0000 UTC m=+6.209256562) (total time: 1m0.071915466s):
Trace[955305514]: [1m0.071915466s] [1m0.071884923s] END
I1212 00:13:48.574060 7 trace.go:76] Trace[1423473229]: "List /api/v1/namespaces" (started: 2018-12-12 00:12:48.501149822 +0000 UTC m=+6.209350759) (total time: 1m0.072900808s):
Trace[1423473229]: [1m0.072900808s] [1m0.072867725s] END
I1212 00:13:48.575139 7 trace.go:76] Trace[802608035]: "List /apis/admissionregistration.k8s.io/v1beta1/validatingwebhookconfigurations" (started: 2018-12-12 00:12:48.501149182 +0000 UTC m=+6.209350109) (total time: 1m0.073979725s):
Trace[802608035]: [1m0.073979725s] [1m0.073948799s] END
I1212 00:13:48.576217 7 trace.go:76] Trace[1021760621]: "List /apis/admissionregistration.k8s.io/v1beta1/mutatingwebhookconfigurations"
(started: 2018-12-12 00:12:48.501154269 +0000 UTC m=+6.209355207) (total time: 1m0.075052452s): Trace[1021760621]: [1m0.075052452s] [1m0.075012296s] END I1212 00:13:48.577292 7 trace.go:76] Trace[1969470568]: "List /api/v1/services" (started: 2018-12-12 00:12:48.501258385 +0000 UTC m=+6.209459322) (total time: 1m0.076025789s): Trace[1969470568]: [1m0.076025789s] [1m0.076004504s] END I1212 00:13:48.578373 7 trace.go:76] Trace[1871147953]: "List /apis/rbac.authorization.k8s.io/v1/roles" (started: 2018-12-12 00:12:48.501860956 +0000 UTC m=+6.210061881) (total time: 1m0.076503388s): Trace[1871147953]: [1m0.076503388s] [1m0.076480245s] END I1212 00:13:48.579453 7 trace.go:76] Trace[640462565]: "List /apis/storage.k8s.io/v1/storageclasses" (started: 2018-12-12 00:12:48.503571787 +0000 UTC m=+6.211772724) (total time: 1m0.075871435s): Trace[640462565]: [1m0.075871435s] [1m0.075846101s] END I1212 00:13:48.580530 7 trace.go:76] Trace[759626822]: "List /apis/rbac.authorization.k8s.io/v1/clusterrolebindings" (started: 2018-12-12 00:12:48.50357283 +0000 UTC m=+6.211773767) (total time: 1m0.076948558s): Trace[759626822]: [1m0.076948558s] [1m0.076917912s] END I1212 00:13:48.581612 7 trace.go:76] Trace[647924664]: "List /api/v1/endpoints" (started: 2018-12-12 00:12:48.503957566 +0000 UTC m=+6.212158503) (total time: 1m0.077645739s): Trace[647924664]: [1m0.077645739s] [1m0.077595063s] END E1212 00:13:49.455364 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"} E1212 00:13:49.455402 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout E1212 00:13:49.455432 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"} E1212 00:13:49.455591 7 storage_rbac.go:154] unable to initialize clusterroles: the server was unable to return a response in the time allotted, but may still be processing the request (get clusterroles.rbac.authorization.k8s.io) E1212 00:13:49.455602 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"} W1212 00:13:49.455631 7 storage_scheduling.go:95] unable to get PriorityClass system-node-critical: the server was unable to return a response in the time allotted, but may still be processing the request (get priorityclasses.scheduling.k8s.io system-node-critical). Retrying... F1212 00:13:49.455641 7 hooks.go:188] PostStartHook "scheduling/bootstrap-system-priority-classes" failed: unable to add default system priority classes: the server was unable to return a response in the time allotted, but may still be processing the request (get priorityclasses.scheduling.k8s.io system-node-critical) E1212 00:13:49.489143 7 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"} E1212 00:13:49.500198 7 writers.go:168] apiserver was unable to write a JSON response: http: Handler timeout E1212 00:13:49.511157 7 client_ca_hook.go:72] Post https://[::1]:443/api/v1/namespaces: dial tcp [::1]:443: connect: connection refused
I was able to replicate this consistently. The one time I did get a full upgrade to succeed, I then rotated the cluster one more time with no updates, and at that point I once again saw the corruption.
To ensure that this was not an issue with the k8s and etcd versions I picked, I once again created a new kops cluster and then updated k8s and etcd to the versions mentioned above. This time, however, I set the etcd provisioner in kops to legacy; the cluster upgrade succeeded with no issues, and subsequent cluster rotations have not caused any visible issues.
We're testing a new Kubernetes cluster on AWS built with kops 1.12, running etcd 3 with the etcd-manager. Each master node runs two instances of etcd-manager (main and events):
NAMESPACE NAME READY STATUS
kube-system etcd-manager-events-ip-172-22-129-234.ec2.internal 1/1 Running
kube-system etcd-manager-main-ip-172-22-129-234.ec2.internal 1/1 Running
While testing the rollout of master nodes, we've observed that, due to how assignDevice works, it returns the same device on both instances at the first try, introducing a one-minute delay in the master rollout. To explain, see the following (redacted) logs.
etcd-manager-events logs:
I0611 09:35:18.441963 9418 main.go:228] Mounting available etcd volumes matching tags [k8s.io/etcd/events k8s.io/role/master=1 kubernetes.io/cluster/REDACTED=owned]; nameTag=k8s.io/etcd/events
I0611 09:35:18.444481 9418 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0611 09:35:18.628872 9418 mounter.go:288] Trying to mount master volume: "vol-0033a5507d5546fe4"
I0611 09:35:18.628998 9418 volumes.go:85] AWS API Request: ec2/AttachVolume
I0611 09:35:18.890734 9418 volumes.go:339] AttachVolume request returned {
AttachTime: 2019-06-11 09:35:18.866 +0000 UTC,
Device: "/dev/xvdu",
InstanceId: "i-0abc13d47bb0d19cd",
State: "attaching",
VolumeId: "vol-0033a5507d5546fe4"
}
I0611 09:35:18.890895 9418 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0611 09:35:18.983082 9418 mounter.go:302] Currently attached volumes: [0xc000238100]
I0611 09:35:18.983115 9418 mounter.go:64] Master volume "vol-0033a5507d5546fe4" is attached at "/dev/xvdu"
etcd-manager-main logs:
I0611 09:35:18.601952 9498 main.go:228] Mounting available etcd volumes matching tags [k8s.io/etcd/main k8s.io/role/master=1 kubernetes.io/cluster/REDACTED=owned]; nameTag=k8s.io/etcd/main
I0611 09:35:18.607188 9498 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0611 09:35:18.828121 9498 mounter.go:288] Trying to mount master volume: "vol-045d841d4ec069864"
I0611 09:35:18.828251 9498 volumes.go:85] AWS API Request: ec2/AttachVolume
W0611 09:35:19.114926 9498 mounter.go:293] Error attaching volume "vol-045d841d4ec069864": Error attaching EBS volume "vol-045d841d4ec069864": InvalidParameterValue: Invalid value '/dev/xvdu' for unixDevice. Attachment point /dev/xvdu is already in use
status code: 400, request id: b668745c-64b9-46a0-af1a-61c6352daaed
I0611 09:35:19.114951 9498 mounter.go:302] Currently attached volumes: []
I0611 09:35:19.114966 9498 boot.go:49] waiting for volumes
I0611 09:36:19.115312 9498 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0611 09:36:19.204717 9498 mounter.go:288] Trying to mount master volume: "vol-045d841d4ec069864"
I0611 09:36:19.204841 9498 volumes.go:85] AWS API Request: ec2/AttachVolume
I0611 09:36:19.483306 9498 volumes.go:339] AttachVolume request returned {
AttachTime: 2019-06-11 09:36:19.439 +0000 UTC,
Device: "/dev/xvdv",
InstanceId: "i-0abc13d47bb0d19cd",
State: "attaching",
VolumeId: "vol-045d841d4ec069864"
}
I0611 09:36:19.483477 9498 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0611 09:36:19.614795 9498 mounter.go:302] Currently attached volumes: [0xc000096080]
I0611 09:36:19.614822 9498 mounter.go:64] Master volume "vol-045d841d4ec069864" is attached at "/dev/xvdv"
As you can see, since the first EBS volume attachment failed with "Attachment point /dev/xvdu is already in use", it reconciles only after 60 seconds, introducing a 60-second delay in the bootstrapping of master nodes.
A few options / ideas to start the conversation:
- Add a Volume.PreferredLocalDevice, populated by an optional volume tag read from the cloud provider, so that kops can set a different preferred local device for each volume. Volume.PreferredLocalDevice is then passed to assignDevice(), which returns the preferred device if set and available, and otherwise falls back to the current logic (see the sketch below).
- Start assignDevice() from a different (e.g. randomized) device and iterate over the next ones from there (this just reduces the likelihood of a collision, at the cost of having assignDevice() behave in a non-deterministic way).
We don't want this to be a backdoor.
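To make the first option concrete, here is a minimal Go sketch, not etcd-manager's actual code: the Volume type and its PreferredLocalDevice field are the hypothetical addition, the /dev/xvd[u-z] range is illustrative, and the in-process inUse map stands in for the EC2 attachment state that the two real etcd-manager processes race on.

package main

import "fmt"

// Volume is a simplified stand-in for etcd-manager's volume type; the
// PreferredLocalDevice field is the hypothetical addition proposed above,
// populated from an optional volume tag.
type Volume struct {
	ID                   string
	PreferredLocalDevice string
}

// assignDevice honors the preferred device when it is set and free, and
// otherwise falls back to scanning a fixed device range in order.
func assignDevice(v Volume, inUse map[string]bool) (string, error) {
	if v.PreferredLocalDevice != "" && !inUse[v.PreferredLocalDevice] {
		inUse[v.PreferredLocalDevice] = true
		return v.PreferredLocalDevice, nil
	}
	for c := 'u'; c <= 'z'; c++ {
		d := "/dev/xvd" + string(c)
		if !inUse[d] {
			inUse[d] = true
			return d, nil
		}
	}
	return "", fmt.Errorf("no free device for volume %s", v.ID)
}

func main() {
	// With distinct preferred devices set via tags, the events and main
	// volumes no longer both ask for /dev/xvdu on the first try.
	inUse := map[string]bool{}
	for _, v := range []Volume{
		{ID: "vol-0033a5507d5546fe4", PreferredLocalDevice: "/dev/xvdu"},
		{ID: "vol-045d841d4ec069864", PreferredLocalDevice: "/dev/xvdv"},
	} {
		d, _ := assignDevice(v, inUse)
		fmt.Println(v.ID, "->", d)
	}
}

The appeal of the tag-based preference is that it avoids the collision without any coordination between the two processes, since the assignment is decided ahead of time by kops.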
The OpenStack API shows the path for the volume:
/dev/vdd on master-zone-1-2-1-ownold-master-k8s-local
Logs:
I0913 18:04:49.564932 1578 mounter.go:64] Master volume "44a9c2a0-4648-4ce1-8f68-34a081521ba1" is attached at "/dev/vdd"
I0913 18:04:49.564993 1578 mounter.go:78] Doing safe-format-and-mount of /dev/vdd to /mnt/master-2.etcd-main.ownold-master.k8s.local
I0913 18:04:49.565025 1578 mounter.go:113] Waiting for volume "44a9c2a0-4648-4ce1-8f68-34a081521ba1" to be mounted
I0913 18:04:50.565341 1578 mounter.go:113] Waiting for volume "44a9c2a0-4648-4ce1-8f68-34a081521ba1" to be mounted
I0913 18:04:51.565517 1578 mounter.go:113] Waiting for volume "44a9c2a0-4648-4ce1-8f68-34a081521ba1" to be mounted
I0913 18:04:52.565714 1578 mounter.go:113] Waiting for volume "44a9c2a0-4648-4ce1-8f68-34a081521ba1" to be mounted
I0913 18:04:53.565907 1578 mounter.go:113] Waiting for volume "44a9c2a0-4648-4ce1-8f68-34a081521ba1" to be mounted
I0913 18:04:54.566067 1578 mounter.go:113] Waiting for volume "44a9c2a0-4648-4ce1-8f68-34a081521ba1" to be mounted
I0913 18:04:55.566275 1578 mounter.go:113] Waiting for volume "44a9c2a0-4648-4ce1-8f68-34a081521ba1" to be mounted
I0913 18:04:56.566473 1578 mounter.go:113] Waiting for volume "44a9c2a0-4648-4ce1-8f68-34a081521ba1" to be mounted
I0913 18:04:57.566669 1578 mounter.go:113] Waiting for volume "44a9c2a0-4648-4ce1-8f68-34a081521ba1" to be mounted
I0913 18:04:58.566883 1578 mounter.go:113] Waiting for volume "44a9c2a0-4648-4ce1-8f68-34a081521ba1" to be mounted
I0913 18:04:59.567076 1578 mounter.go:113] Waiting for volume "44a9c2a0-4648-4ce1-8f68-34a081521ba1" to be mounted
However, when I check devices:
$ ls /dev/vd
vda vda1 vdb vdc
So the OpenStack API is reporting incorrect device paths; this is a known problem on the k8s side. It is solved using this function: https://github.com/kubernetes/kubernetes/blob/release-1.8/pkg/cloudprovider/providers/openstack/openstack_volumes.go#L331
When listing volumes by ID:
$ ls /dev/disk/by-id
virtio-44a9c2a0-4648-4ce1-8
the disk is there, but the path is incorrect: etcd-manager thinks it is at /dev/vdd instead of /dev/vdc.
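For reference, a minimal Go sketch of the by-id approach the linked Kubernetes helper takes (the function name here is hypothetical): prefer the udev-created symlink under /dev/disk/by-id, and only fall back to the API-reported path. udev truncates the volume ID to 20 characters, which is why only virtio-44a9c2a0-4648-4ce1-8 shows up above.

package main

import (
	"fmt"
	"path/filepath"
)

// devicePathByID resolves the real device for an OpenStack volume via its
// /dev/disk/by-id symlink, falling back to the (possibly wrong) path the
// API reported if no by-id entry exists.
func devicePathByID(volumeID, apiPath string) string {
	candidates := []string{"virtio-" + volumeID}
	if len(volumeID) > 20 {
		// udev truncates the serial to 20 characters.
		candidates = append(candidates, "virtio-"+volumeID[:20])
	}
	for _, name := range candidates {
		link := filepath.Join("/dev/disk/by-id", name)
		if resolved, err := filepath.EvalSymlinks(link); err == nil {
			return resolved
		}
	}
	return apiPath
}

func main() {
	// On the node above this would print /dev/vdc rather than /dev/vdd.
	fmt.Println(devicePathByID("44a9c2a0-4648-4ce1-8f68-34a081521ba1", "/dev/vdd"))
}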
Maybe we can just ship the primary versions, or use symlinks where we know them to be compatible.
RE: this todo - is there any work in flight for this, or are you looking for a contributor?
Any considerations for initial support? Do you see the MVP as providing paths to certificates for the init, or providing certs inline?
Hi all.
I have been reviewing the disaster recovery documentation, and it's not clear to me where I should execute the etcd-manager-ctl commands (list backups, restore backups, ...), so I have some questions:
- Does etcd-manager-ctl need API keys with the right permissions to access the S3 bucket where the etcd backups live?
My current setup is a K8s cluster with 3 masters, with etcd-manager (main and events) running on the master nodes using the manifests present in /etc/kubernetes/manifests/.
Happy to create a PR to improve the disaster recovery docs with this information.
Thanks
We have a situation where we have a healthy etcd-manager cluster with 3 masters, but we would like to move this running cluster to a different storage backend. I know we can do that using backup+restore, but that also means downtime for the Kubernetes APIs.
I have also seen situations where, for some reason, one etcd member is broken. I have not found a way to add a (new) member back to a cluster that still has 2/3 members healthy. The only way I have found is to create a new etcd-manager cluster from a backup. However, this is not a perfect way to do things, because it always leads to downtime and possible (small) data loss.
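For context, re-adding a replacement member to a cluster that still has quorum is something etcd's own client API supports; here is a minimal Go sketch using clientv3 (endpoints and peer URLs are placeholders, and this is manual work outside anything etcd-manager automates today):

package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	// Connect to the members that are still healthy.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://etcd-a.internal.example:4001", "http://etcd-b.internal.example:4001"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Register the replacement member; the new node must then be started
	// with --initial-cluster-state=existing before it can join the quorum.
	resp, err := cli.MemberAdd(ctx, []string{"http://etcd-c.internal.example:2380"})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("added member with ID %x", resp.Member.ID)
}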
During an attempted migration to etcd-manager on a kops cluster, tailing the etcd.log on the first node to be updated shows the following:
I0102 16:18:55.621144 5320 controller.go:137] peers: [peer{id:"etcd-eu-west-1a" endpoints:"172.21.56.25:3996" } peer{id:"etcd-eu-west-1b" endpoints:"172.21.66.14:3996" } peer{id:"etcd-eu-west-1c" endpoints:"172.21.108.164:3996" }]
I0102 16:18:55.622677 5320 controller.go:232] etcd cluster state: etcdClusterState
members:
peers:
etcdClusterPeerInfo{peer=peer{id:"etcd-eu-west-1a" endpoints:"172.21.56.25:3996" }, info=cluster_name:"etcd" node_configuration:<name:"etcd-eu-west-1a" peer_urls:"http://etcd-eu-west-1a.internal.stg.mycluster:2380" client_urls:"http://etcd-eu-west-1a.internal.stg.mycluster:4001" quarantined_client_urls:"http://etcd-eu-west-1a.internal.stg.mycluster:3994" > }
etcdClusterPeerInfo{peer=peer{id:"etcd-eu-west-1b" endpoints:"172.21.66.14:3996" }, info=cluster_name:"etcd" node_configuration:<name:"etcd-eu-west-1b" peer_urls:"http://etcd-eu-west-1b.internal.stg.mycluster:2380" client_urls:"http://etcd-eu-west-1b.internal.stg.mycluster:4001" quarantined_client_urls:"http://etcd-eu-west-1b.internal.stg.mycluster:3994" > }
etcdClusterPeerInfo{peer=peer{id:"etcd-eu-west-1c" endpoints:"172.21.108.164:3996" }, info=cluster_name:"etcd" node_configuration:<name:"etcd-eu-west-1c" peer_urls:"http://etcd-eu-west-1c.internal.stg.mycluster:2380" client_urls:"http://etcd-eu-west-1c.internal.stg.mycluster:4001" quarantined_client_urls:"http://etcd-eu-west-1c.internal.stg.mycluster:3994" > }
I0102 16:18:55.622744 5320 controller.go:233] etcd cluster members: map[]
I0102 16:18:55.622753 5320 controller.go:516] sending member map to all peers:
I0102 16:18:55.623935 5320 commands.go:22] not refreshing commands - TTL not hit
I0102 16:18:55.623955 5320 s3fs.go:210] Reading file "s3://kops-clusters.mycluster/stg.mycluster/backups/etcd/main/control/etcd-cluster-created"
I0102 16:18:55.647667 5320 controller.go:318] spec member_count:3 etcd_version:"2.2.1"
I0102 16:18:55.647693 5320 controller.go:375] etcd has 0 members registered, we want 3; will try to expand cluster
W0102 16:18:55.647700 5320 controller.go:663] unable to do backup before adding peer - no members
I0102 16:18:55.647706 5320 controller.go:667] will try to start etcd on new peer: etcdClusterPeerInfo{peer=peer{id:"etcd-eu-west-1a" endpoints:"172.21.56.25:3996" }, info=cluster_name:"etcd" node_configuration:<name:"etcd-eu-west-1a" peer_urls:"http://etcd-eu-west-1a.internal.stg.mycluster:2380" client_urls:"http://etcd-eu-west-1a.internal.stg.mycluster:4001" quarantined_client_urls:"http://etcd-eu-west-1a.internal.stg.mycluster:3994" > }
Gossip appears to work and the nodes see each other, but the advertisements appear to be missing a field, maybe?
1. Describe IN DETAIL the feature/behavior/change you would like to see.
A flag to set the allowed cipher suites, similar to the "--tls-cipher-suites" parameter used on the kubelet.
This necessity showed up after a vulnerability scan on a Kubernetes environment configured by kops. The Nessus scan revealed that etcd-manager doesn't restrict the use of insecure cipher suites (ECDHE-RSA-DES-CBC3-SHA and DES-CBC3-SHA).
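For illustration, a minimal Go sketch of what such a flag could map to internally, assuming the usual crypto/tls approach of an explicit CipherSuites allow-list (the specific suites chosen here are illustrative; in practice they would come from the flag's value):

package main

import (
	"crypto/tls"
	"fmt"
)

func main() {
	// An explicit allow-list excludes the 3DES suites flagged by the scan,
	// since anything not listed is never negotiated.
	cfg := &tls.Config{
		MinVersion: tls.VersionTLS12,
		CipherSuites: []uint16{
			tls.TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,
			tls.TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,
		},
	}
	fmt.Printf("allowed cipher suites: %v\n", cfg.CipherSuites)
}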
Hi Justin,
I am trying to build etcd-manager and I am getting the following error:
$ bazel build //cmd/etcd-manager //cmd/etcd-manager-ctl
ERROR: /home/tamal/go/src/kope.io/etcd-manager/cmd/etcd-manager-ctl/BUILD.bazel:3:1: no such package 'vendor/github.com/golang/glog': BUILD file not found on package path and referenced by '//cmd/etcd-manager-ctl:go_default_library'
ERROR: Analysis of target '//cmd/etcd-manager-ctl:etcd-manager-ctl' failed; build aborted: no such package 'vendor/github.com/golang/glog': BUILD file not found on package path
INFO: Elapsed time: 0.311s
FAILED: Build did NOT complete successfully (0 packages loaded)
currently loading: @io_bazel_rules_go//go/private
This is my first time using bazel. I installed bazel on an Ubuntu 16.04 machine following the instructions here: https://docs.bazel.build/versions/master/install-ubuntu.html
Any idea how to fix this error?
Currently etcd-manager depends on a very old kops version.
I'm currently working on making etcd-manager support Alicloud, and for that I need to update the vfs module in kops.
But I don't know the best way to upgrade the kops dependency in etcd-manager.
Can you please take a look at this? @justinsb
This is blocking #269.
Thanks!
Otherwise we could OOM on big backups / restores
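Presumably this means streaming the data instead of buffering it; a minimal Go sketch of that pattern (paths are placeholders, not etcd-manager's actual backup code):

package main

import (
	"io"
	"log"
	"os"
)

func main() {
	// Source and destination paths are illustrative.
	src, err := os.Open("/tmp/etcd-backup.db")
	if err != nil {
		log.Fatal(err)
	}
	defer src.Close()

	dst, err := os.Create("/mnt/backups/etcd-backup.db")
	if err != nil {
		log.Fatal(err)
	}
	defer dst.Close()

	// io.Copy streams through a small fixed-size internal buffer (32 KiB),
	// so memory use stays constant no matter how big the backup is,
	// unlike reading the whole file into memory first.
	if _, err := io.Copy(dst, src); err != nil {
		log.Fatal(err)
	}
}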
The documentation at https://github.com/kopeio/etcd-manager mentions etcd-manager-ctl as the tool for managing etcd backups/restores. Would it be possible to include that tool in https://hub.docker.com/r/kopeio/etcd-manager?
Moreover, it would be useful to also describe a more realistic scenario of using etcd-manager-ctl for backups/restores on an AWS (or other) system managed by kops.
Tried to create cluster on eu-north with kops 1.12.0 using:
kops create cluster --state s3://292662267961-k8s.local-eu-north-1-kops-storage --zones eu-north-1a --master-size t3.small --node-size t3.small --name test2.k8s.local
All resources on AWS are created, but the cluster isn't starting correctly.
Looking at the logs on the master node, I can see that the docker image for etcd-manager fails with:
I0516 13:20:39.596608 3806 s3context.go:164] got region from metadata: "eu-north-1"
W0516 13:20:39.596655 3806 controller.go:149] unexpected error running etcd cluster reconciliation loop: error refreshing control store after leadership change: error reading s3://292662267961-k8s.local-eu-north-1-kops-storage/test2.k8s.local/backups/etcd/events/control: eu-north-1 is not a valid region
Please check that your region is formatted correctly (e.g. us-east-1)
So it would seem that even though kops supports the eu-north region, etcd-manager doesn't?
I guess a simple update of the aws-sdk component in etcd-manager should solve this?
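For illustration, a minimal Go sketch of why updating the vendored SDK helps, assuming the aws-sdk-go endpoints package: the region table ships inside the SDK itself, so an older vendored copy simply doesn't know eu-north-1 and rejects it as invalid.

package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws/endpoints"
)

func main() {
	// Region validation consults the table compiled into the SDK; once the
	// vendored aws-sdk-go is new enough, eu-north-1 appears here.
	if _, ok := endpoints.AwsPartition().Regions()["eu-north-1"]; ok {
		fmt.Println("eu-north-1 is known to this aws-sdk-go version")
	} else {
		fmt.Println("eu-north-1 is missing: the vendored aws-sdk-go is too old")
	}
}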
Also created an issue on kops
Hi! etcd-manager doesn't support configuring heartbeat and leader election timeouts. Do you have it somewhere on the roadmap?
Currently etcd-manager supports GCE and AWS; we need similar support for DigitalOcean. I'm currently trying to run kops with etcd-manager on DigitalOcean, and that needs an update to etcd-manager.
Hi! Do you have a feature on the roadmap to allow putting the wal-dir on a separate disk/PV?