elotl / kip
Virtual-kubelet provider running pods in cloud instances
License: Apache License 2.0
This one is strange: it seems we can tell GCE to terminate an instance and get an OK back from the API, yet the instance isn't terminated. The node inside kip is stuck in the "Terminating" state:
apiVersion: v1
kind: Node
metadata:
  creationTimestamp: "2020-05-06T22:17:27.781711586Z"
  labels: {}
  name: d6385a15-5730-4d39-a8a2-f890344cca41
  namespace: default
  uid: b1e223a1-4742-4bf2-bbf5-89fa8cd1ba2d
spec:
  bootImage: https://www.googleapis.com/compute/v1/projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20200414
  instanceType: e2-highmem-4
  placement: {}
  resources:
    cpu: "4.00"
    memory: 20.00Gi
    sustainedCPU: false
    volumeSize: 10G
  spot: false
status:
  addresses: null
  boundPodName: ""
  instanceID: ""
  phase: Terminating
I'm unsure if the most recent work to check the status of operations will help with ensuring the nodes are terminated. Might need to handle more cases in the Garbage Controller.
I disconnected and reconnected to wifi while running itzo from my laptop. After the reconnect, all GCE operations timed out. I'd like to see if we can detect these types of problems and get the client to reconnect in a timely manner.
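One way to surface these hangs faster, sketched below, is to put explicit timeouts on the HTTP client used for cloud API calls so a dead connection fails quickly and can be retried. The newCloudHTTPClient helper and the timeout values are illustrative, not Kip's actual code.

package main

import (
    "log"
    "net"
    "net/http"
    "time"
)

// newCloudHTTPClient returns an HTTP client with aggressive timeouts so a
// dropped network connection surfaces as an error quickly instead of
// hanging indefinitely.
func newCloudHTTPClient() *http.Client {
    return &http.Client{
        Timeout: 60 * time.Second, // overall cap per request
        Transport: &http.Transport{
            DialContext: (&net.Dialer{
                Timeout:   10 * time.Second, // TCP connect
                KeepAlive: 15 * time.Second, // detect dead peers
            }).DialContext,
            TLSHandshakeTimeout:   10 * time.Second,
            ResponseHeaderTimeout: 30 * time.Second,
            IdleConnTimeout:       30 * time.Second,
        },
    }
}

func main() {
    client := newCloudHTTPClient()
    // The GCE API client libraries accept a custom *http.Client like this one.
    resp, err := client.Get("https://compute.googleapis.com/")
    if err != nil {
        log.Fatal(err)
    }
    resp.Body.Close()
}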
Workflow:
1. Scale the nginx deployment down to replicas=1.
2. kubectl get pods shows one nginx pod stuck in the Terminating state. There is no cell corresponding to this pod.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-deployment-66f967f649-nlzr4 1/1 Terminating 0 42m
nginx-deployment-66f967f649-twqz9 1/1 Running 0 6m17s
$ kubectl get cells
NAME POD NAME POD NAMESPACE NODE LAUNCH TYPE INSTANCE TYPE INSTANCE ID IP
1581bec3-4551-47d7-8229-b3f31c780e69 nginx-deployment-66f967f649-twqz9 default virtual-kubelet On-Demand t3.nano i-09ff4cb4d55c7391b 10.0.28.184
4009e78e-bf75-4148-9e3b-409e80afeca3 registry-creds-gqkm9 kube-system virtual-kubelet On-Demand t3.nano i-09fc50a0280bc3be9 10.0.26.188
d0f95a9f-88a7-4b3f-82b2-6b9ba6324de4 kube-proxy-tw4k5 kube-system virtual-kubelet On-Demand t3.nano i-08d26143b91ad4c94 10.0.22.36
VK + KIP log:
I0310 20:13:45.449670 1 opencensus.go:138] Deleting pod in provider name=nginx-deployment-66f967f649-nlzr4 phase=Running reason= uid=5fb0f1d5-8b0f-47ed-b952-11dfb91eae45 namespace=default
I0310 20:13:45.449885 1 server.go:604] DeletePod "nginx-deployment-66f967f649-nlzr4"
E0310 20:13:45.450001 1 server.go:613] DeletePod "nginx-deployment-66f967f649-nlzr4": Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
W0310 20:13:45.450187 1 opencensus.go:175] requeuing "default/nginx-deployment-66f967f649-nlzr4" due to failed sync error=failed to delete pod "default/nginx-deployment-66f967f649-nlzr4" in the provider: Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
I0310 20:13:45.460426 1 opencensus.go:138] sync handled key=default/nginx-deployment-66f967f649-nlzr4
I0310 20:13:45.460513 1 opencensus.go:138] Deleting pod in provider phase=Running reason= uid=5fb0f1d5-8b0f-47ed-b952-11dfb91eae45 namespace=default name=nginx-deployment-66f967f649-nlzr4
I0310 20:13:45.460604 1 server.go:604] DeletePod "nginx-deployment-66f967f649-nlzr4"
E0310 20:13:45.460669 1 server.go:613] DeletePod "nginx-deployment-66f967f649-nlzr4": Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
W0310 20:13:45.460732 1 opencensus.go:175] requeuing "default/nginx-deployment-66f967f649-nlzr4" due to failed sync error=failed to delete pod "default/nginx-deployment-66f967f649-nlzr4" in the provider: Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
I0310 20:13:45.480927 1 opencensus.go:138] sync handled key=default/nginx-deployment-66f967f649-nlzr4
I0310 20:13:45.481014 1 opencensus.go:138] Deleting pod in provider reason= uid=5fb0f1d5-8b0f-47ed-b952-11dfb91eae45 namespace=default name=nginx-deployment-66f967f649-nlzr4 phase=Running
I0310 20:13:45.481106 1 server.go:604] DeletePod "nginx-deployment-66f967f649-nlzr4"
E0310 20:13:45.481197 1 server.go:613] DeletePod "nginx-deployment-66f967f649-nlzr4": Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
W0310 20:13:45.481270 1 opencensus.go:175] requeuing "default/nginx-deployment-66f967f649-nlzr4" due to failed sync error=failed to delete pod "default/nginx-deployment-66f967f649-nlzr4" in the provider: Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
I0310 20:13:45.521513 1 opencensus.go:138] sync handled key=default/nginx-deployment-66f967f649-nlzr4
I0310 20:13:45.521632 1 opencensus.go:138] Deleting pod in provider reason= uid=5fb0f1d5-8b0f-47ed-b952-11dfb91eae45 namespace=default name=nginx-deployment-66f967f649-nlzr4 phase=Running
I0310 20:13:45.521779 1 server.go:604] DeletePod "nginx-deployment-66f967f649-nlzr4"
E0310 20:13:45.521858 1 server.go:613] DeletePod "nginx-deployment-66f967f649-nlzr4": Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
W0310 20:13:45.521940 1 opencensus.go:175] requeuing "default/nginx-deployment-66f967f649-nlzr4" due to failed sync error=failed to delete pod "default/nginx-deployment-66f967f649-nlzr4" in the provider: Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
I0310 20:13:45.602156 1 opencensus.go:138] sync handled key=default/nginx-deployment-66f967f649-nlzr4
I0310 20:13:45.602261 1 opencensus.go:138] Deleting pod in provider phase=Running reason= uid=5fb0f1d5-8b0f-47ed-b952-11dfb91eae45 namespace=default name=nginx-deployment-66f967f649-nlzr4
I0310 20:13:45.602376 1 server.go:604] DeletePod "nginx-deployment-66f967f649-nlzr4"
E0310 20:13:45.602466 1 server.go:613] DeletePod "nginx-deployment-66f967f649-nlzr4": Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
W0310 20:13:45.602563 1 opencensus.go:175] requeuing "default/nginx-deployment-66f967f649-nlzr4" due to failed sync error=failed to delete pod "default/nginx-deployment-66f967f649-nlzr4" in the provider: Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
I0310 20:13:45.762807 1 opencensus.go:138] sync handled key=default/nginx-deployment-66f967f649-nlzr4
I0310 20:13:45.762889 1 opencensus.go:138] Deleting pod in provider reason= uid=5fb0f1d5-8b0f-47ed-b952-11dfb91eae45 namespace=default name=nginx-deployment-66f967f649-nlzr4 phase=Running
I0310 20:13:45.763047 1 server.go:604] DeletePod "nginx-deployment-66f967f649-nlzr4"
E0310 20:13:45.763137 1 server.go:613] DeletePod "nginx-deployment-66f967f649-nlzr4": Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
W0310 20:13:45.763232 1 opencensus.go:175] requeuing "default/nginx-deployment-66f967f649-nlzr4" due to failed sync error=failed to delete pod "default/nginx-deployment-66f967f649-nlzr4" in the provider: Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
I0310 20:13:46.083583 1 opencensus.go:138] sync handled key=default/nginx-deployment-66f967f649-nlzr4
I0310 20:13:46.083680 1 opencensus.go:138] Deleting pod in provider namespace=default name=nginx-deployment-66f967f649-nlzr4 phase=Running reason= uid=5fb0f1d5-8b0f-47ed-b952-11dfb91eae45
I0310 20:13:46.083770 1 server.go:604] DeletePod "nginx-deployment-66f967f649-nlzr4"
E0310 20:13:46.083840 1 server.go:613] DeletePod "nginx-deployment-66f967f649-nlzr4": Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
W0310 20:13:46.083919 1 opencensus.go:175] requeuing "default/nginx-deployment-66f967f649-nlzr4" due to failed sync error=failed to delete pod "default/nginx-deployment-66f967f649-nlzr4" in the provider: Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
I0310 20:13:46.724259 1 opencensus.go:138] sync handled key=default/nginx-deployment-66f967f649-nlzr4
I0310 20:13:46.724378 1 opencensus.go:138] Deleting pod in provider name=nginx-deployment-66f967f649-nlzr4 phase=Running reason= uid=5fb0f1d5-8b0f-47ed-b952-11dfb91eae45 namespace=default
I0310 20:13:46.724525 1 server.go:604] DeletePod "nginx-deployment-66f967f649-nlzr4"
E0310 20:13:46.724591 1 server.go:613] DeletePod "nginx-deployment-66f967f649-nlzr4": Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
W0310 20:13:46.724653 1 opencensus.go:175] requeuing "default/nginx-deployment-66f967f649-nlzr4" due to failed sync error=failed to delete pod "default/nginx-deployment-66f967f649-nlzr4" in the provider: Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
I0310 20:13:48.004860 1 opencensus.go:138] sync handled key=default/nginx-deployment-66f967f649-nlzr4
I0310 20:13:48.004905 1 opencensus.go:138] Deleting pod in provider uid=5fb0f1d5-8b0f-47ed-b952-11dfb91eae45 namespace=default name=nginx-deployment-66f967f649-nlzr4 phase=Running reason=
I0310 20:13:48.004984 1 server.go:604] DeletePod "nginx-deployment-66f967f649-nlzr4"
E0310 20:13:48.005048 1 server.go:613] DeletePod "nginx-deployment-66f967f649-nlzr4": Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
W0310 20:13:48.005112 1 opencensus.go:175] requeuing "default/nginx-deployment-66f967f649-nlzr4" due to failed sync error=failed to delete pod "default/nginx-deployment-66f967f649-nlzr4" in the provider: Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
I0310 20:13:50.565386 1 opencensus.go:138] sync handled key=default/nginx-deployment-66f967f649-nlzr4
I0310 20:13:50.565515 1 opencensus.go:138] Deleting pod in provider namespace=default name=nginx-deployment-66f967f649-nlzr4 phase=Running reason= uid=5fb0f1d5-8b0f-47ed-b952-11dfb91eae45
I0310 20:13:50.565632 1 server.go:604] DeletePod "nginx-deployment-66f967f649-nlzr4"
E0310 20:13:50.565702 1 server.go:613] DeletePod "nginx-deployment-66f967f649-nlzr4": Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
W0310 20:13:50.565873 1 opencensus.go:175] requeuing "default/nginx-deployment-66f967f649-nlzr4" due to failed sync error=failed to delete pod "default/nginx-deployment-66f967f649-nlzr4" in the provider: Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
I0310 20:13:55.686143 1 opencensus.go:138] sync handled key=default/nginx-deployment-66f967f649-nlzr4
I0310 20:13:55.686330 1 opencensus.go:138] Deleting pod in provider namespace=default name=nginx-deployment-66f967f649-nlzr4 phase=Running reason= uid=5fb0f1d5-8b0f-47ed-b952-11dfb91eae45
I0310 20:13:55.686431 1 server.go:604] DeletePod "nginx-deployment-66f967f649-nlzr4"
E0310 20:13:55.686500 1 server.go:613] DeletePod "nginx-deployment-66f967f649-nlzr4": Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
W0310 20:13:55.686562 1 opencensus.go:175] requeuing "default/nginx-deployment-66f967f649-nlzr4" due to failed sync error=failed to delete pod "default/nginx-deployment-66f967f649-nlzr4" in the provider: Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
I0310 20:14:05.926792 1 opencensus.go:138] sync handled key=default/nginx-deployment-66f967f649-nlzr4
I0310 20:14:05.926885 1 opencensus.go:138] Deleting pod in provider namespace=default name=nginx-deployment-66f967f649-nlzr4 phase=Running reason= uid=5fb0f1d5-8b0f-47ed-b952-11dfb91eae45
I0310 20:14:05.927025 1 server.go:604] DeletePod "nginx-deployment-66f967f649-nlzr4"
E0310 20:14:05.927111 1 server.go:613] DeletePod "nginx-deployment-66f967f649-nlzr4": Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
W0310 20:14:05.927193 1 opencensus.go:175] requeuing "default/nginx-deployment-66f967f649-nlzr4" due to failed sync error=failed to delete pod "default/nginx-deployment-66f967f649-nlzr4" in the provider: Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
I0310 20:14:26.407457 1 opencensus.go:138] sync handled key=default/nginx-deployment-66f967f649-nlzr4
I0310 20:14:26.407524 1 opencensus.go:138] Deleting pod in provider uid=5fb0f1d5-8b0f-47ed-b952-11dfb91eae45 namespace=default name=nginx-deployment-66f967f649-nlzr4 phase=Running reason=
I0310 20:14:26.407605 1 server.go:604] DeletePod "nginx-deployment-66f967f649-nlzr4"
E0310 20:14:26.407668 1 server.go:613] DeletePod "nginx-deployment-66f967f649-nlzr4": Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
W0310 20:14:26.407738 1 opencensus.go:175] requeuing "default/nginx-deployment-66f967f649-nlzr4" due to failed sync error=failed to delete pod "default/nginx-deployment-66f967f649-nlzr4" in the provider: Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
I0310 20:15:07.368161 1 opencensus.go:138] sync handled key=default/nginx-deployment-66f967f649-nlzr4
I0310 20:15:07.368299 1 opencensus.go:138] Deleting pod in provider namespace=default name=nginx-deployment-66f967f649-nlzr4 phase=Running reason= uid=5fb0f1d5-8b0f-47ed-b952-11dfb91eae45
I0310 20:15:07.368418 1 server.go:604] DeletePod "nginx-deployment-66f967f649-nlzr4"
E0310 20:15:07.368508 1 server.go:613] DeletePod "nginx-deployment-66f967f649-nlzr4": Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
W0310 20:15:07.368576 1 opencensus.go:175] requeuing "default/nginx-deployment-66f967f649-nlzr4" due to failed sync error=failed to delete pod "default/nginx-deployment-66f967f649-nlzr4" in the provider: Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
I0310 20:16:29.288918 1 opencensus.go:138] sync handled key=default/nginx-deployment-66f967f649-nlzr4
I0310 20:16:29.289049 1 opencensus.go:138] Deleting pod in provider name=nginx-deployment-66f967f649-nlzr4 phase=Running reason= uid=5fb0f1d5-8b0f-47ed-b952-11dfb91eae45 namespace=default
I0310 20:16:29.289144 1 server.go:604] DeletePod "nginx-deployment-66f967f649-nlzr4"
E0310 20:16:29.289214 1 server.go:613] DeletePod "nginx-deployment-66f967f649-nlzr4": Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
W0310 20:16:29.289280 1 opencensus.go:175] requeuing "default/nginx-deployment-66f967f649-nlzr4" due to failed sync error=failed to delete pod "default/nginx-deployment-66f967f649-nlzr4" in the provider: Could not delete pod default_nginx-deployment-66f967f649-nlzr4: Key not found in store
Metrics-server is usually part of the monitoring stack for Kubernetes clusters. It queries kubelets via their metrics API, which is usually exposed on port 10255.
We've been running Kip in host network mode, which moves the metrics endpoint to a different port. However, metrics-server can only set the metrics port once per cluster, so it can monitor either only virtual-kubelet instances or only regular kubelets, not both.
Thus, we need to move the metrics port back to 10255.
A possible solution is to run Kip without host network mode. There are two caveats:
The goal here is to run kube-proxy as a sidecar in the Kip pod.
Example pod spec with kube-proxy added as a sidecar:
containers:
- command:
  - /bin/sh
  - -c
  - exec kube-proxy --master=https://34.XX.XX.XXX --kubeconfig=/var/lib/kube-proxy/kubeconfig
    --cluster-cidr=10.25.0.0/16 --resource-container="" --oom-score-adj=-998
    --v=2
  image: gke.gcr.io/kube-proxy:v1.14.10-gke.36
  imagePullPolicy: IfNotPresent
  name: kube-proxy
  resources:
    requests:
      cpu: 100m
  securityContext:
    privileged: true
  volumeMounts:
  - mountPath: /etc/ssl/certs
    name: etc-ssl-certs
    readOnly: true
  - mountPath: /usr/share/ca-certificates
    name: usr-ca-certs
    readOnly: true
  - mountPath: /var/lib/kube-proxy/kubeconfig
    name: kube-proxy-kubeconfig
  - mountPath: /run/xtables.lock
    name: xtables-lock
  - mountPath: /lib/modules
    name: lib-modules
    readOnly: true
- command:
  - /virtual-kubelet
  - --provider
  - kip
  - --provider-config
  - /etc/virtual-kubelet/provider.yaml
  - --network-agent-secret
  - kube-system/vk-network-agent
  - --disable-taint
  - --klog.logtostderr
  - --klog.v=5
  - --metrics-addr=:10255
  - --debug-server
  env:
  - name: KUBELET_PORT
    value: "10666"
  - name: APISERVER_CERT_LOCATION
    value: /opt/kip/data/kubelet-pki/virtual-kubelet.crt
  - name: APISERVER_KEY_LOCATION
    value: /opt/kip/data/kubelet-pki/virtual-kubelet.key
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: spec.nodeName
  image: elotl/virtual-kubelet:v0.0.4-4-gc0c246b
  imagePullPolicy: Always
  name: virtual-kubelet
  resources:
    limits:
      cpu: "2"
      memory: 1Gi
    requests:
      cpu: 100m
      memory: 100Mi
  securityContext:
    privileged: true
  volumeMounts:
  - mountPath: /opt/kip/data
    name: data
  - mountPath: /etc/virtual-kubelet
    name: provider-yaml
  - mountPath: /run/xtables.lock
    name: xtables-lock
  - mountPath: /lib/modules
    name: lib-modules
    readOnly: true
dnsPolicy: ClusterFirst
initContainers:
- command:
  - bash
  - -c
  - mkdir -p $CERT_DIR && /opt/csr/get-cert.sh
  env:
  - name: NODE_NAME
    value: virtual-kubelet
  - name: CERT_DIR
    value: /data/kubelet-pki
  image: elotl/init-cert:latest
  imagePullPolicy: Always
  name: init-cert
  volumeMounts:
  - mountPath: /data
    name: data
restartPolicy: Always
serviceAccount: virtual-kubelet
serviceAccountName: virtual-kubelet
tolerations:
- effect: NoSchedule
  key: node-role.kubernetes.io/master
volumes:
- name: data
  persistentVolumeClaim:
    claimName: provider-data
- configMap:
    defaultMode: 420
    items:
    - key: cloudinit.yaml
      mode: 384
      path: cloudinit.yaml
    - key: provider.yaml
      mode: 384
      path: provider.yaml
    name: virtual-kubelet-config-799kbh2d6d
  name: provider-yaml
- hostPath:
    path: /run/xtables.lock
    type: FileOrCreate
  name: xtables-lock
- hostPath:
    path: /lib/modules
    type: ""
  name: lib-modules
- hostPath:
    path: /usr/share/ca-certificates
    type: ""
  name: usr-ca-certs
- hostPath:
    path: /etc/ssl/certs
    type: ""
  name: etc-ssl-certs
- hostPath:
    path: /var/lib/kube-proxy/kubeconfig
    type: FileOrCreate
  name: kube-proxy-kubeconfig
I have a private AWS setup and I want to change the AWS endpoint URL. How do I do that?
Right now I guess it's going to the public AWS environment. I see options for the access key ID and region; can I explicitly set the URL as well?
Error: error initializing provider kip: error configuring cloud client: Error setting up cloud client: Could not configure AWS cloud client authorization: Error validationg connection to AWS: RequestError: send request failed
caused by: Post https://ec2.us-east-1.amazonaws.com/: x509: certificate signed by unknown authority
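For context, the AWS SDK for Go does support pointing clients at a custom endpoint via aws.Config.Endpoint; if Kip plumbed this through its provider config, it would look roughly like the sketch below. Whether Kip currently exposes this is exactly the open question here, and the endpoint URL shown is hypothetical.

package main

import (
    "fmt"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
    // Point the SDK at a private EC2-compatible endpoint instead of the
    // public https://ec2.us-east-1.amazonaws.com.
    sess := session.Must(session.NewSession(&aws.Config{
        Region:   aws.String("us-east-1"),
        Endpoint: aws.String("https://ec2.internal.example.com"), // hypothetical
    }))
    svc := ec2.New(sess)
    fmt.Println(svc.Endpoint)
}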
Incarnation 1: 2qs4mvn2pza6teedd6hfz5hiky
Incarnation 2: 72oxoqlbgrgjrleablteitvrxu
kubectl get pods shows zero pods. kubectl get cells shows the nginx and kube-proxy cells from incarnation 1.
$ kubectl get pods
No resources found in default namespace.
$ kubectl get cells
NAME POD NAME POD NAMESPACE NODE LAUNCH TYPE INSTANCE TYPE INSTANCE ID IP
0dfd31cf-8d5a-494d-98e7-d8de2f4b4c9d kube-proxy-rjspt kube-system virtual-kubelet On-Demand t3.nano i-0650833d57a649efd 172.31.74.58
61c44374-33aa-4636-8d09-23479025c3b4 kube-proxy-rjspt kube-system virtual-kubelet On-Demand t3.nano i-005c2fb81f90a1ede 172.31.77.142
66e5f0da-7f74-4a17-ac9b-fe6234ed8369 nginx-deployment-66f967f649-9b8zj default virtual-kubelet On-Demand t3.nano i-0ec6eabbae4ca4c6b 172.31.72.109
$ terraform --version
Terraform v0.12.11
+ provider.google v3.28.0
+ provider.null v2.1.2
+ provider.random v2.2.1
$ pwd
/Users/myechuri/src/github.com/elotl/kip/deploy/terraform-aws
$ terraform apply -var-file myenv.tfvars
...
Warning: "blacklisted_zone_ids": [DEPRECATED] use `exclude_zone_ids` instead
on main.tf line 12, in data "aws_availability_zones" "available-azs":
12: data "aws_availability_zones" "available-azs" {
blacklisted_zone_ids needs to be replaced with exclude_zone_ids. Will submit a PR shortly.
We have a lot of names in the codebase from when kip was called Milpa and the software was a pretty cool stand-alone management controller. That time has passed, and those references to Milpa need to be changed. I'm trying to keep track of those things here.
Easy to change
Harder to change but not as noticeable to users
Right now Kip only supports configmap and secret sources:
// Projection that may be projected along with other supported volume types
type VolumeProjection struct {
    // all types below are the supported types for projection into the same volume

    // information about the secret data to project
    // +optional
    Secret *SecretProjection `json:"secret,omitempty"`

    // // information about the downwardAPI data to project
    // // +optional
    // DownwardAPI *DownwardAPIProjection `json:"downwardAPI,omitempty"`

    // information about the configMap data to project
    // +optional
    ConfigMap *ConfigMapProjection `json:"configMap,omitempty"`

    // information about the serviceAccountToken data to project
    // +optional
    //ServiceAccountToken *ServiceAccountTokenProjection `json:"serviceAccountToken,omitempty"`
}
Once service account token rotation is enabled, service accounts will be added to pods via a ServiceAccountToken projected volume source. Example:
- name: kube-api-access-tz9tt
  projected:
    defaultMode: 420
    sources:
    - serviceAccountToken:
        expirationSeconds: 3600
        path: token
    - configMap:
        items:
        - key: ca.crt
          path: ca.crt
        name: kube-root-ca.crt
    - downwardAPI:
        items:
        - fieldRef:
            apiVersion: v1
            fieldPath: metadata.namespace
          path: namespace
So in-cluster API server access configuration will break for pods.
Currently, ServiceAccountTokenVolumeProjection is beta in 1.12 and is enabled by passing the service account issuer, signing-key, and audience flags to the API server.
We need to implement DownwardAPIProjection and ServiceAccountTokenVolumeProjection.
Right now we only expose metrics in the old stats summary API. However, to integrate with Prometheus, kubelets use /metrics and the Prometheus metrics format.
We could expose metrics that are relevant and specific to Kip, for example data on what type of instances were scheduled and how long they ran (making it easier to expose cost information).
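A minimal sketch of what a Kip-specific metric served on /metrics could look like, using the standard Prometheus Go client; the metric name kip_cells_started_total and its labels are made up for illustration, not part of Kip today.

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// cellsStarted counts cloud instances started, broken down by instance type
// and launch type; the name and label set here are hypothetical examples.
var cellsStarted = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "kip_cells_started_total",
        Help: "Number of cells started, by instance type and launch type.",
    },
    []string{"instance_type", "launch_type"},
)

func main() {
    prometheus.MustRegister(cellsStarted)
    // Record a started cell; in Kip this would happen in the pod controller.
    cellsStarted.WithLabelValues("t3.nano", "on-demand").Inc()

    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":10255", nil))
}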
Right now we mount in the cert from the kubelet Kip runs on (from /var/lib/kubelet/pki). However, this is very inelegant and not something that should be used in production.
Instead, we could generate a certificate for Kip ourselves, either in an init container or in Kip itself when it starts up and finds no certificate. The certificate could then be stored on the persistent volume. See https://github.com/kubernetes/kubernetes/blob/master/cmd/kubelet/app/server.go#L978-L993 for how the kubelet generates its own certificate when it's missing at startup.
Note: this is not the client cert kubelets use to communicate with the API server; Kip uses its service account token for that purpose.
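A minimal sketch, assuming we generate the certificate inside Kip at startup, of creating a self-signed serving certificate and writing it to the data volume. The paths and certificate fields are illustrative, not Kip's actual code.

package main

import (
    "crypto/ecdsa"
    "crypto/elliptic"
    "crypto/rand"
    "crypto/x509"
    "crypto/x509/pkix"
    "encoding/pem"
    "log"
    "math/big"
    "os"
    "time"
)

func main() {
    key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
    if err != nil {
        log.Fatal(err)
    }
    tmpl := &x509.Certificate{
        SerialNumber: big.NewInt(time.Now().UnixNano()),
        Subject:      pkix.Name{CommonName: "virtual-kubelet"},
        DNSNames:     []string{"virtual-kubelet"},
        NotBefore:    time.Now(),
        NotAfter:     time.Now().Add(365 * 24 * time.Hour),
        KeyUsage:     x509.KeyUsageDigitalSignature,
        ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
    }
    // Self-sign: the template is both subject and issuer.
    der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
    if err != nil {
        log.Fatal(err)
    }
    keyDER, err := x509.MarshalECPrivateKey(key)
    if err != nil {
        log.Fatal(err)
    }
    // Write cert and key to the persistent data volume (paths illustrative).
    certPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
    keyPEM := pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
    if err := os.WriteFile("/opt/kip/data/kubelet-pki/virtual-kubelet.crt", certPEM, 0600); err != nil {
        log.Fatal(err)
    }
    if err := os.WriteFile("/opt/kip/data/kubelet-pki/virtual-kubelet.key", keyPEM, 0600); err != nil {
        log.Fatal(err)
    }
}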
Kip keeps an internal storage of:
If a kip deployment is restarted without preserving the PersistentVolume that kip's state is stored on, all pods will be restarted, and the previous kip cells will be orphaned in the cloud, since kip's identity will be recreated as well. We've talked about this in other discussions on Slack.
Todo:
It would be helpful to users to have instructions similar to https://github.com/elotl/kip/blob/master/deploy/terraform/README.md for a multi-cloud setup. A sample multi-cloud setup could include {minikube cluster + kip} on a MacBook shipping pods to AWS.
Hello. I have a problem running KIP on an AWS EKS cluster. First I created EKS with these instructions: https://docs.aws.amazon.com/eks/latest/userguide/getting-started-console.html. I tested the cluster and I can run deployments. Next, I cloned the KIP repo and added credentials (accessKeyID and secretAccessKey) to the kip/base/provider.yaml file. Then I executed "kustomize build base/ | kubectl apply -f -" and I have a problem with the kip container. Do you know what is going on, or do you have more detailed instructions? Maybe I forgot something. I also tried adding the minimum IAM permissions to the role created for EKS. I tried the version with minikube as well and it worked perfectly.
log info:
F0825 08:20:15.743785 1 main.go:133] error initializing provider kip: error configuring cloud client: Error setting up cloud client: Could not configure AWS cloud client authorization: Error validationg connection to AWS: AuthFailure: AWS was not able to validate the provided access credentials
status code: 401, request id: 7205bc35-2b6a-48ff-baeb-0eddfd2d4824
Description of the pod with the problem:
Name: kip-provider-0
Namespace: kube-system
Priority: 0
Node: ip-192-168-1-242.ec2.internal/192.168.1.242
Start Time: Wed, 26 Aug 2020 09:23:09 +0000
Labels: app=kip-provider
controller-revision-hash=kip-provider-6d97b44c7
statefulset.kubernetes.io/pod-name=kip-provider-0
Annotations: kubernetes.io/psp: eks.privileged
Status: Terminating (lasts 3h42m)
Termination Grace Period: 30s
IP: 192.168.1.126
IPs:
IP: 192.168.1.126
Controlled By: StatefulSet/kip-provider
Init Containers:
init-cert:
Container ID: docker://2f67aa09fb9204565ef8f8129e43e2179163f4b6e2e9df4c1a33e028c303754e
Image: elotl/init-cert:latest
Image ID: docker-pullable://elotl/init-cert@sha256:781e404f73ab2e78ba1de2aba9ed569fe9f8fe920c5aeb3cd4143b1eb39facc1
Port: <none>
Host Port: <none>
Command:
bash
-c
mkdir -p $(CERT_DIR) && /opt/csr/get-cert.sh
State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 26 Aug 2020 09:23:27 +0000
Finished: Wed, 26 Aug 2020 09:23:29 +0000
Ready: True
Restart Count: 0
Environment:
NODE_NAME: kip-provider-0 (v1:metadata.name)
CERT_DIR: /data/kubelet-pki
Mounts:
/data from data (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kip-provider-token-n64lj (ro)
Containers:
kip:
Container ID: docker://7adef4f167d4b756b048d3e6b0e20facb66a5ed90043bd30404e39e8bb6009c7
Image: elotl/kip:latest
Image ID: docker-pullable://elotl/kip@sha256:11508b91c7420e933b935f96d7235ca1d7133d4bd1e1b878935b23d9ab876143
Port: <none>
Host Port: <none>
Command:
/kip
--provider
kip
--provider-config
/etc/kip/provider.yaml
--network-agent-secret
kube-system/kip-network-agent
--disable-taint
--klog.logtostderr
--klog.v=2
--metrics-addr=:10255
--nodename=$(VKUBELET_NODE_NAME)
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 255
Started: Wed, 26 Aug 2020 09:26:27 +0000
Finished: Wed, 26 Aug 2020 09:26:29 +0000
Ready: False
Restart Count: 5
Limits:
cpu: 2
memory: 1Gi
Requests:
cpu: 10m
memory: 100Mi
Environment:
NODE_NAME: (v1:spec.nodeName)
VKUBELET_NODE_NAME: kip-provider-0 (v1:metadata.name)
APISERVER_CERT_LOCATION: /opt/kip/data/kubelet-pki/$(VKUBELET_NODE_NAME).crt
APISERVER_KEY_LOCATION: /opt/kip/data/kubelet-pki/$(VKUBELET_NODE_NAME).key
Mounts:
/etc/kip from provider-yaml (rw)
/lib/modules from lib-modules (ro)
/opt/kip/data from data (rw)
/run/xtables.lock from xtables-lock (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kip-provider-token-n64lj (ro)
kube-proxy:
Container ID: docker://b1057a53b6ba6e416f7c984082be86a8633fa075ac2c5a9a5812dd4f9acf67f7
Image: k8s.gcr.io/kube-proxy:v1.18.3
Image ID: docker-pullable://k8s.gcr.io/kube-proxy@sha256:6a093c22e305039b7bd6c3f8eab8f202ad8238066ed210857b25524443aa8aff
Port: <none>
Host Port: <none>
Command:
/bin/sh
-c
exec kube-proxy --oom-score-adj=-998 --bind-address=127.0.0.1 --v=2
State: Running
Started: Wed, 26 Aug 2020 09:23:30 +0000
Ready: True
Restart Count: 0
Requests:
cpu: 100m
Environment: <none>
Mounts:
/lib/modules from lib-modules (ro)
/run/xtables.lock from xtables-lock (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kip-provider-token-n64lj (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: data-kip-provider-0
ReadOnly: false
provider-yaml:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: kip-config-t56f8td654
Optional: false
xtables-lock:
Type: HostPath (bare host directory volume)
Path: /run/kip-xtables.lock
HostPathType: FileOrCreate
lib-modules:
Type: HostPath (bare host directory volume)
Path: /lib/modules
HostPathType:
kip-provider-token-n64lj:
Type: Secret (a volume populated by a Secret)
SecretName: kip-provider-token-n64lj
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events: <none>
When we were building Milpa, we used to keep pods around in the registry for 3 minutes after they had been terminated. This was simply to allow users to inspect old pods for debugging.
We need to get rid of this behavior and delete pods from the registry once they have been terminated.
Right now we're only exposing the bare minimum via /stats, e.g.:
"containers" : [
{
"cpu" : {
"time" : "2020-05-20T23:25:59Z",
"usageNanoCores" : 115094566
},
"memory" : {
"usageBytes" : 966656,
"workingSetBytes" : 839680,
"time" : "2020-05-20T23:25:59Z"
},
"startTime" : "2020-05-20T22:26:14Z",
"name" : "debug"
}
This should contain more metrics, e.g.:
{
  "name": "virtual-kubelet",
  "startTime": "2020-05-20T21:40:51Z",
  "cpu": {
    "time": "2020-05-20T21:42:14Z",
    "usageNanoCores": 4406729,
    "usageCoreNanoSeconds": 1272336128
  },
  "memory": {
    "time": "2020-05-20T21:42:14Z",
    "availableBytes": 838094848,
    "usageBytes": 348979200,
    "workingSetBytes": 235646976,
    "rssBytes": 234954752,
    "pageFaults": 61842,
    "majorPageFaults": 0
  },
  "rootfs": {
    "time": "2020-05-20T21:42:14Z",
    "availableBytes": 93587988480,
    "capacityBytes": 101241290752,
    "usedBytes": 57344,
    "inodesFree": 6139395,
    "inodes": 6258720,
    "inodesUsed": 15
  },
  "logs": {
    "time": "2020-05-20T21:42:14Z",
    "availableBytes": 93587988480,
    "capacityBytes": 101241290752,
    "usedBytes": 73728,
    "inodesFree": 6139395,
    "inodes": 6258720,
    "inodesUsed": 119325
  },
  "userDefinedMetrics": null
}
We already use cgroups for collecting container level metrics on cells, so it should not be too hard to fill in the missing ones when we're sending them back to kip.
See k8s.io/kubernetes/pkg/kubelet/apis/stats/v1alpha1/types.go for ContainerStats and PodStats.
Applications running on the VM use the service account of the instance to call Google Cloud APIs. We need a service account for cells that has the permissions "logs.write" and "monitoring.write", so that we can run the monitoring and logging agents.
We need to create GCE functional tests similar to the AWS tests, and then add the GCE functional tests to the build.
The whole package is called cloudinitfile, and inside the package there are two File types, one of them called CloudInitFile, which isn't the cloud-init file the package name refers to. This is all a result of moving a bunch of types out of elotl/cloud-init and into kip to get around possible issues related to CoreOS's YAML library license.
In k8s, the standard way to encode registry secrets is:
kubectl create secret docker-registry regcred --docker-server=<server> --docker-username=<username> --docker-password=<password> --docker-email=<docker-email>
This produces a secret that looks like:
apiVersion: v1
kind: Secret
data:
.dockerconfigjson: <base64 encoded json>
The decoded json looks like:
{
  "auths": {
    "docker.io": {
      "username": "<username>",
      "password": "<password>",
      "email": "<docker-email>",
      "auth": "<base64 encoded username:password>"
    }
  }
}
Kip pulls out image secrets and sends them to cells in a structure containing {server, username, password}. We should try to pull that data out of a docker-formatted secret.
Let's update kip to correctly pull the necessary data out of a dockerconfigjson secret.
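A sketch of parsing the decoded .dockerconfigjson payload into {server, username, password} tuples; the RegistryCredentials type and parseDockerConfig helper are hypothetical stand-ins for Kip's internal structures.

package main

import (
    "encoding/base64"
    "encoding/json"
    "fmt"
    "strings"
)

type dockerConfigJSON struct {
    Auths map[string]struct {
        Username string `json:"username"`
        Password string `json:"password"`
        Auth     string `json:"auth"`
    } `json:"auths"`
}

// RegistryCredentials mirrors the {server, username, password} structure
// Kip sends to cells (hypothetical name).
type RegistryCredentials struct {
    Server, Username, Password string
}

func parseDockerConfig(data []byte) ([]RegistryCredentials, error) {
    var cfg dockerConfigJSON
    if err := json.Unmarshal(data, &cfg); err != nil {
        return nil, err
    }
    var creds []RegistryCredentials
    for server, entry := range cfg.Auths {
        username, password := entry.Username, entry.Password
        // Fall back to the base64-encoded "auth" field (username:password).
        if username == "" && entry.Auth != "" {
            decoded, err := base64.StdEncoding.DecodeString(entry.Auth)
            if err != nil {
                return nil, err
            }
            if parts := strings.SplitN(string(decoded), ":", 2); len(parts) == 2 {
                username, password = parts[0], parts[1]
            }
        }
        creds = append(creds, RegistryCredentials{server, username, password})
    }
    return creds, nil
}

func main() {
    data := []byte(`{"auths":{"docker.io":{"auth":"dXNlcjpwYXNz"}}}`) // "user:pass"
    creds, _ := parseDockerConfig(data)
    fmt.Printf("%+v\n", creds)
}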
I've seen this happening a few times: I kill the VK pod, the deployment starts a new one. Usually it comes up fine and the pods running via VK are kept intact. However, once in a while (usually when it takes a bit longer for VK to start up), although VK can find the pods in the provider, when listing them in Kubernetes, they are not there. So it will think they are all dangling pods and will remove them from the provider.
Obviously this is not desirable behavior and we should investigate why this might happen, and how to fix it or work around it.
Provisioned Nodeless GKE cluster.
kubectl create -f nginx.yaml
nginx.yaml:
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      nodeSelector:
        type: virtual-kubelet
      tolerations:
      - key: virtual-kubelet.io/provider
        operator: Exists
      containers:
      - name: nginx
        image: nginx:1.7.9
        resources:
          requests:
            cpu: "1"
            memory: "2Gi"
        ports:
        - containerPort: 80
The pod is stuck in the Pending state. Kip log:
E0707 16:57:43.437010 1 node_controller.go:293] Error in node start: startup error: googleapi: Error 400: Invalid value for field 'resource.machineType': 'https://www.googleapis.com/compute/v1/projects/myechuri-project1/zones/us-west2-a/machineTypes/e2-custom-1-2048'. Custom Machine type with name 'e2-custom-1-2048' does not exist., invalid
I0707 16:57:43.437244 1 node_registry.go:218] Purging node &{{Node v1} {819f9e10-bf22-4201-a3a3-4a93421b0060 map[] 2020-07-07 16:57:42.66800324 +0000 UTC <nil> map[] f1a025a9-5ec1-4d19-b79a-19dc68a2bba0 default} {e2-custom-1-2048 elotl-kip-latest false false {1.00 2.00Gi 10G false 0xc00326089c false <nil>} {}} {Creating [] default_nginx-deployment-6589b4cb45-rg6kz}}
E0707 16:57:43.441117 1 cell_controller.go:186] Error processing cell operation: cells.kip.elotl.co "819f9e10-bf22-4201-a3a3-4a93421b0060" not found
W0707 16:57:43.441401 1 queue.go:98] Dropping cells {'\x02' %!q(*api.Node=&{{Node v1} {819f9e10-bf22-4201-a3a3-4a93421b0060 map[] {{668003240 63729737862 <nil>}} <nil> map[] f1a025a9-5ec1-4d19-b79a-19dc68a2bba0 default} {e2-custom-1-2048 elotl-kip-latest false false {1.00 2.00Gi 10G false 0xc003261a88 false <nil>} {}} {Creating [] default_nginx-deployment-6589b4cb45-rg6kz}}) %!q(*api.Pod=<nil>)} out of the queue: cells.kip.elotl.co "819f9e10-bf22-4201-a3a3-4a93421b0060" not found
It would be helpful to the user to have a pictorial representation of the components deployed by terraform deployer similar to this picture.
See issue 159 comment for details. Top-of-tree deploy/manifests/kip/overlays/minikube leads to CreateContainerConfigError.
GKE worker node's pki certs are symlinks:
madhuri@gke-myechuri-vk-gke-test-default-pool-db7c47b8-fw89 ~ $ ls -ls /var/lib/kubelet/pki
total 8
4 -rw------- 1 root root 1110 May  5 06:52 kubelet-client-2020-05-05-06-52-14.pem
0 lrwxrwxrwx 1 root root   59 May  5 06:52 kubelet-client-current.pem -> /var/lib/kubelet/pki/kubelet-client-2020-05-05-06-52-14.pem
4 -rw------- 1 root root 1252 May  5 06:52 kubelet-server-2020-05-05-06-52-16.pem
0 lrwxrwxrwx 1 root root   59 May  5 06:52 kubelet-server-current.pem -> /var/lib/kubelet/pki/kubelet-server-2020-05-05-06-52-16.pem
madhuri@gke-myechuri-vk-gke-test-default-pool-db7c47b8-fw89 ~ $
Using kubelet-client-current.pem as the cert location did not work. overlays/gke/deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: virtual-kubelet
  namespace: kube-system
spec:
  template:
    spec:
      containers:
      - command:
        - /virtual-kubelet
        - --provider
        - kip
        - --provider-config
        - /etc/virtual-kubelet/provider.yaml
        - --network-agent-secret
        - kube-system/vk-network-agent
        - --disable-taint
        - --klog.logtostderr
        - --klog.v=2
        image: elotl/virtual-kubelet:v0.0.2-37-gede5647
        name: virtual-kubelet
        env:
        - name: APISERVER_CERT_LOCATION
          value: /etc/kubelet-pki/kubelet-client-current.pem
        - name: APISERVER_KEY_LOCATION
          value: /etc/kubelet-pki/kubelet-client-current.pem
The above deployment results in vk+kip failing with the error below:
F0506 06:00:54.493578 1 main.go:110] error loading tls certs: open /etc/kubelet-pki/kubelet-client-current.pem: no such file or directory
Workaround: updating deployment.yaml as below helped me get past the error:
env:
- name: APISERVER_CERT_LOCATION
  value: /etc/kubelet-pki/kubelet-client-2020-05-05-06-52-14.pem
- name: APISERVER_KEY_LOCATION
  value: /etc/kubelet-pki/kubelet-client-2020-05-05-06-52-14.pem
There are resources created by Kip during runtime: VM instances, cell CRDs, and security groups/firewall rules.
When a Kip instance is removed, the node will be removed soon by the control plane. However, all the VM instances, cell CRDs and SGs/firewalls that were active at that point will stay behind.
We need a way to detect when the Kip controller instance is gone, and remove all the resources it left.
Hi, thanks for open-sourcing the project. We have a use case where the cluster is running in a private cloud but we want to extend the workload to the public cloud. Specifically, we want to train a model using cloud GPUs. The data is already present in S3, and we just need to train the model and save it back to S3, so I'd expect we won't need most of the features in Kubernetes. Does kip support this use case? If not, we would have to run the cluster in the cloud where we launch the GPU instances.
Provisioned a test cluster using https://github.com/elotl/kip/tree/master/deploy/terraform. The cluster came up fine, but the registry-creds pod is stuck in the PodInitializing state.
$ kubectl cluster-info
Kubernetes master is running at https://10.0.17.195:6443
KubeDNS is running at https://10.0.17.195:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-17-195.ec2.internal Ready master 24m v1.18.1
virtual-kubelet Ready agent 23m
$ kubectl -n kube-system get pods
NAME READY STATUS RESTARTS AGE
aws-node-tq9z6 1/1 Running 0 21m
coredns-66bff467f8-6ml6c 1/1 Running 0 21m
coredns-66bff467f8-xr29g 1/1 Running 0 21m
etcd-ip-10-0-17-195.ec2.internal 1/1 Running 0 21m
kube-apiserver-ip-10-0-17-195.ec2.internal 1/1 Running 0 21m
kube-controller-manager-ip-10-0-17-195.ec2.internal 1/1 Running 0 21m
kube-proxy-pclwn 1/1 Running 0 21m
kube-scheduler-ip-10-0-17-195.ec2.internal 1/1 Running 0 21m
registry-creds-l5kzs 0/1 PodInitializing 0 21m
virtual-kubelet-db6f48888-dk74s 1/1 Running 0 21m
$ kubectl -n kube-system logs registry-creds-l5kzs
ERROR: logging before flag.Parse: E0415 07:43:53.734772 8 reflector.go:199] github.com/upmc-enterprises/registry-creds/vendor/k8s.io/client-go/tools/cache/reflector.go:94: Failed to list *v1.Namespace: namespaces is forbidden: User "system:serviceaccount:kube-system:default" cannot list resource "namespaces" in API group "" at the cluster scope
ERROR: logging before flag.Parse: E0415 07:43:54.736143 8 reflector.go:199] github.com/upmc-enterprises/registry-creds/vendor/k8s.io/client-go/tools/cache/reflector.go:94: Failed to list *v1.Namespace: namespaces is forbidden: User "system:serviceaccount:kube-system:default" cannot list resource "namespaces" in API group "" at the cluster scope
ERROR: logging before flag.Parse: E0415 07:43:55.737450 8 reflector.go:199] github.com/upmc-enterprises/registry-creds/vendor/k8s.io/client-go/tools/cache/reflector.go:94: Failed to list *v1.Namespace: namespaces is forbidden: User "system:serviceaccount:kube-system:default" cannot list resource "namespaces" in API group "" at the cluster scope
ERROR: logging before flag.Parse: E0415 07:43:56.738964 8 reflector.go:199] github.com/upmc-enterprises/registry-creds/vendor/k8s.io/client-go/tools/cache/reflector.go:94: Failed to list *v1.Namespace: namespaces is forbidden: User "system:serviceaccount:kube-system:default" cannot list resource "namespaces" in API group "" at the cluster scope
ERROR: logging before flag.Parse: E0415 07:43:57.740308 8 reflector.go:199] github.com/upmc-enterprises/registry-creds/vendor/k8s.io/client-go/tools/cache/reflector.go:94: Failed to list *v1.Namespace: namespaces is forbidden: User "system:serviceaccount:kube-system:default" cannot list resource "namespaces" in API group "" at the cluster scope
ERROR: logging before flag.Parse: E0415 07:43:58.741763 8 reflector.go:199] github.com/upmc-enterprises/registry-creds/vendor/k8s.io/client-go/tools/cache/reflector.go:94: Failed to list *v1.Namespace: namespaces is forbidden: User "system:serviceaccount:kube-system:default" cannot list resource "namespaces" in API group "" at the cluster scope
ERROR: logging before flag.Parse: E0415 07:43:59.746607 8 reflector.go:199] github.com/upmc-enterprises/registry-creds/vendor/k8s.io/client-go/tools/cache/reflector.go:94: Failed to list *v1.Namespace: namespaces is forbidden: User "system:serviceaccount:kube-system:default" cannot list resource "namespaces" in API group "" at the cluster scope
ERROR: logging before flag.Parse: E0415 07:44:00.748238 8 reflector.go:199] github.com/upmc-enterprises/registry-creds/vendor/k8s.io/client-go/tools/cache/reflector.go:94: Failed to list *v1.Namespace: namespaces is forbidden: User "system:serviceaccount:kube-system:default" cannot list resource "namespaces" in API group "" at the cluster scope
ERROR: logging before flag.Parse: E0415 07:44:01.749908 8 reflector.go:199] github.com/upmc-enterprises/registry-creds/vendor/k8s.io/client-go/tools/cache/reflector.go:94: Failed to list *v1.Namespace: namespaces is forbidden: User "system:serviceaccount:kube-system:default" cannot list resource "namespaces" in API group "" at the cluster scope
ERROR: logging before flag.Parse: E0415 07:44:02.751224 8 reflector.go:199] github.com/upmc-enterprises/registry-creds/vendor/k8s.io/client-go/tools/cache/reflector.go:94: Failed to list *v1.Namespace: namespaces is forbidden: User "system:serviceaccount:kube-system:default" cannot list resource "namespaces" in API group "" at the cluster scope
The GCE cloud provider in k8s does not like the fact that there's no cloud instance called virtual-kubelet:
default 3s Normal EnsuringLoadBalancer service/nginx Ensuring load balancer
default 3s Warning CreatingLoadBalancerFailed service/nginx Error creating load balancer (will retry): failed to ensure load balancer for service default/nginx: instance not found
Whether we use host network mode or not does not matter though. It's hard-coded in the GCE cloud plugin to list all nodes and bail if any of them is not a GCE VM instance: https://github.com/kubernetes/legacy-cloud-providers/blob/master/gce/gce_loadbalancer_external.go#L59-L62
Certain cloud providers, like GCP, offer creation of custom VM sizes (e.g. 1 vCPU, 13GB) instead of picking from boilerplate VM sizes (e.g. c2-standard-4, n1-standard-8). If a kip user's application resource footprint is not a close fit for a boilerplate VM size, using one leads to wasted resources (and compute spend). It would be useful for such kip users to consume a cloud provider's custom VM size feature, when available, so that their applications automatically run on the most resource- and cost-efficient compute.
Specifically, there is an ask from a kip user for custom VM size support on GCP, because custom VM sizes are more resource- and cost-efficient for the applications they would like to deploy via kip on GCP. Let's track custom VM sizes for GCP in this issue. If and when there is a need to extend the feature to cloud providers beyond GCP, we will create a separate issue.
If we fail to dispatch a pod to a node (maybe the user specified a security group that doesn't exist), the node is returned to the NodeController but the node remains in the "Cleaning" state for 2 minutes. It appears kip is requesting the node logs but the logs call is not returning.
General cleanup:
- We use kipctl for debugging, but it shouldn't use github.com/elotl/kip/pkg/labels; switch to using the k8s implementation if possible.
- Move util.WrapError to the standard methods for wrapping errors (see the sketch below). We would need to make sure that everyone is using go >= 1.13; that shouldn't be an issue.
- Combine cloudClient.StartNode and cloudClient.StartSpotNode into a single function.
Our check for ConnectWithPublicIPs() is metadata.OnGCE(). That isn't sufficient: we also need to make sure the controller is inside the same GCE private network the client is configured to use.
To fix this, we can use gceClient.detectCurrentVPC() and ensure that it matches the configured VPC.
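For the util.WrapError item above, a sketch of what standard Go 1.13 error wrapping looks like; deletePod and errNotFound are illustrative names, not Kip's actual code.

package main

import (
    "errors"
    "fmt"
)

var errNotFound = errors.New("key not found in store")

// Before: util.WrapError(err, "could not delete pod ...")
// After: wrap with fmt.Errorf and the %w verb so callers can unwrap.
func deletePod(name string) error {
    return fmt.Errorf("could not delete pod %s: %w", name, errNotFound)
}

func main() {
    err := deletePod("nginx-deployment-66f967f649-nlzr4")
    // errors.Is sees through the wrapping added by %w.
    fmt.Println(errors.Is(err, errNotFound)) // true
}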
Kubelet's behavior is to display > 1000 log lines when getting logs with no options. Virtual Kubelet, however, asks for 10 lines by default. This ends up truncating log lines when running conformance tests and tests fail.
I believe the following tests are affected by this issue:
We should (1) investigate exactly how the kubelet limits log lines it returns and (2) make an issue, followed by a PR to Virtual Kubelet to bring the behavior of logs into alignment with kubelet.
The Compute Engine virtual network interface provides a more efficient delivery network for sending traffic to and from GCE VM instances. The Compute Engine virtual network interface is required to support higher network bandwidths such as the 50-100 Gbps speeds that can be used for distributed workloads on instances that have attached GPUs.
https://cloud.google.com/compute/docs/instances/create-vm-with-gvnic
In order to brand and sell kip on marketplaces, we should rename our images and other resources "kip" instead of virtual-kubelet.
Hi, I just came across kip and wanted to try it out with minikube. I have gone through the installation instructions several times and I still cannot get it to run correctly.
It looks like some pods and/or nodes do not get created properly when starting kip.
First I add the AWS credentials in deploy/manifests/kip/base/provider.yaml, then:
minikube start
kustomize build deploy/manifests/kip/base | kubectl apply -f -
I get the output
serviceaccount/kip-network-agent created
serviceaccount/kip-provider created
clusterrole.rbac.authorization.k8s.io/kip-provider created
clusterrole.rbac.authorization.k8s.io/kip-network-agent created
clusterrolebinding.rbac.authorization.k8s.io/kip-provider created
clusterrolebinding.rbac.authorization.k8s.io/kip-network-agent created
configmap/kip-config-8gf89h865f created
secret/kip-network-agent created
service/kip-provider created
statefulset.apps/kip-provider created
persistentvolumeclaim/kip-provider-data created
I do not see any pods:
kubectl get pods
No resources found in default namespace.
or nodes related to kip
kubectl get nodes
NAME STATUS ROLES AGE VERSION
minikube Ready master 7m3s v1.18.3
as mentioned in the readme: "After applying, you should see a new kip pod in the kube-system namespace and a new node named "kip-0" in the cluster."
and, not surprisingly:
kubectl -nkube-system logs kip-0 -c kip -f
Error from server (NotFound): pods "kip-0" not found
When trying to deploy a basic nginx service
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      nodeSelector:
        type: virtual-kubelet
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
with kubectl apply -f nginx-deployment-virt-kub.yaml, the pods get stuck as Pending:
kubectl describe pod nginx-deployment-79cbb8c99-9xptz
Name: nginx-deployment-79cbb8c99-9xptz
Namespace: default
Priority: 0
Node: <none>
Labels: app=nginx
pod-template-hash=79cbb8c99
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/nginx-deployment-79cbb8c99
Containers:
nginx:
Image: nginx:1.14.2
Port: 80/TCP
Host Port: 0/TCP
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-vdmvx (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
default-token-vdmvx:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-vdmvx
Optional: false
QoS Class: BestEffort
Node-Selectors: type=virtual-kubelet
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3s (x3 over 77s) default-scheduler 0/1 nodes are available: 1 node(s) didn't match node selector.
I tried this on the latest master (hash 891adef) as well as on v0.0.17 and v0.0.15.
kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.5", GitCommit:"e6503f8d8f769ace2f338794c914a96fc335df0f", GitTreeState:"archive", BuildDate:"2020-07-01T16:28:46Z", GoVersion:"go1.14.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.3", GitCommit:"2e7996e3e2712684bc73f0dec0200d64eec7fe40", GitTreeState:"clean", BuildDate:"2020-05-20T12:43:34Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
minikube version
minikube version: v1.11.0
commit: 57e2f55f47effe9ce396cea42a1e0eb4f611ebbd
From watching a cluster with lots of pods starting and stopping, it appears that buffered nodes are occasionally started and then stopped without ever running a pod. This should not happen unless a pod is deleted before it can run.
I noticed this happening when running the conformance tests in parallel, and it warrants a bit of investigation into why those nodes are being shut down. It might take a good bit of logging and tracing, but it would be worth it.
The control loop for the PodController has a lot of cases with multiple tickers that can fire. Some of the ticker case statements could take seconds to run under very heavy load, and there are two tickers that run frequently (controlTicker and statusTicker). If one of the case blocks is slow to run, that case could fire repeatedly, starving the other cases from running. That would be bad.
Let's switch to using Timers instead of Tickers and reset the timer at the end of each case. For a sample of using Timers instead of Tickers, check virtual-kubelet's NodeController.controlLoop.
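A minimal sketch of the timer-reset pattern described above; the intervals and the printed steps stand in for the real control and status work.

package main

import (
    "fmt"
    "time"
)

func main() {
    const (
        controlInterval = 1 * time.Second
        statusInterval  = 2 * time.Second
    )
    controlTimer := time.NewTimer(controlInterval)
    statusTimer := time.NewTimer(statusInterval)

    for i := 0; i < 5; i++ {
        select {
        case <-controlTimer.C:
            fmt.Println("run control step") // the real control work goes here
            // Reset only after the work is done, so a slow case cannot
            // fire again immediately and starve the other cases.
            controlTimer.Reset(controlInterval)
        case <-statusTimer.C:
            fmt.Println("run status step") // the real status work goes here
            statusTimer.Reset(statusInterval)
        }
    }
}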
Now that we've opened images up to allow the user to bring any image they want (assuming it has cloud-init installed), we need to start our instances with the correct minimum root volume size. The root volume can't be smaller than the root disk image.
When querying for the boot image, we probably also want to store the retrieved image size and use that as the minimum specified size when booting the image.
Right now, "xvda" is hardcoded when we call RunInstances(). However, the name of the root device depends on the AMI. We need to check and save BlockDeviceMappings[0].DeviceName from DescribeImages(), and use it when creating a new instance.
Example result from DescribeImages():
{
  "VirtualizationType": "hvm",
  "Name": "ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-20161010",
  "PlatformDetails": "Linux/UNIX",
  "Hypervisor": "xen",
  "State": "available",
  "SriovNetSupport": "simple",
  "ImageId": "ami-feb6fee9",
  "UsageOperation": "RunInstances",
  "BlockDeviceMappings": [
    {
      "DeviceName": "/dev/sda1",
      "Ebs": {
        "SnapshotId": "snap-fe547de8",
        "DeleteOnTermination": true,
        "VolumeType": "gp2",
        "VolumeSize": 8,
        "Encrypted": false
      }
    },
    {
      "DeviceName": "/dev/sdb",
      "VirtualName": "ephemeral0"
    },
    {
      "DeviceName": "/dev/sdc",
      "VirtualName": "ephemeral1"
    }
  ],
  "Architecture": "x86_64",
  "ImageLocation": "099720109477/ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-20161010",
  "RootDeviceType": "ebs",
  "OwnerId": "099720109477",
  "RootDeviceName": "/dev/sda1",
  "CreationDate": "2016-10-11T01:17:16.000Z",
  "Public": true,
  "ImageType": "machine"
}
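A sketch of fetching both the root device name and its volume size from DescribeImages with the AWS SDK for Go (v1), covering the minimum-volume-size point above as well; the rootDevice helper is illustrative, not Kip's actual code.

package main

import (
    "fmt"
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/ec2"
)

// rootDevice looks up the AMI's root device name and its volume size so we
// can stop hardcoding "xvda" and never request a volume smaller than the
// image.
func rootDevice(svc *ec2.EC2, amiID string) (string, int64, error) {
    out, err := svc.DescribeImages(&ec2.DescribeImagesInput{
        ImageIds: []*string{aws.String(amiID)},
    })
    if err != nil {
        return "", 0, err
    }
    if len(out.Images) == 0 {
        return "", 0, fmt.Errorf("image %s not found", amiID)
    }
    img := out.Images[0]
    name := aws.StringValue(img.RootDeviceName)
    var sizeGiB int64
    for _, m := range img.BlockDeviceMappings {
        if aws.StringValue(m.DeviceName) == name && m.Ebs != nil {
            sizeGiB = aws.Int64Value(m.Ebs.VolumeSize)
        }
    }
    return name, sizeGiB, nil
}

func main() {
    sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
    name, size, err := rootDevice(ec2.New(sess), "ami-feb6fee9")
    if err != nil {
        log.Fatal(err)
    }
    // Use name in the RunInstances block device mapping, and
    // max(requested, size) as the minimum root volume size.
    fmt.Println(name, size)
}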
Setup: minikube on MacBook with VK shipping pods to GCE.
kubectl get pods does not show any nginx pods, but kubectl get cells shows nginx-deployment-66f967f649-9b8zj.
Madhuris-MacBook-Pro:gke myechuri$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
m01 Ready master 58d v1.17.3
virtual-kubelet Ready agent 57d
Madhuris-MacBook-Pro:gke myechuri$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system aws-node-8glnz 0/1 ImagePullBackOff 0 57d
kube-system coredns-6955765f44-8ljwj 1/1 Running 4 58d
kube-system coredns-6955765f44-kkbtq 1/1 Running 4 58d
kube-system etcd-m01 1/1 Running 4 58d
kube-system kube-apiserver-m01 1/1 Running 4 58d
kube-system kube-controller-manager-m01 1/1 Running 11 58d
kube-system kube-proxy-qx9rr 1/1 Running 0 14d
kube-system kube-proxy-ww8d9 1/1 Running 4 58d
kube-system kube-scheduler-m01 1/1 Running 11 58d
kube-system registry-creds-xth4n 1/1 Running 4 57d
kube-system storage-provisioner 1/1 Running 6 58d
kube-system virtual-kubelet-6685c575f9-hvdwd 1/1 Running 0 10m
Madhuris-MacBook-Pro:gke myechuri$ kubectl get cells
NAME POD NAME POD NAMESPACE NODE LAUNCH TYPE INSTANCE TYPE INSTANCE ID IP
0dfd31cf-8d5a-494d-98e7-d8de2f4b4c9d kube-proxy-rjspt kube-system virtual-kubelet On-Demand t3.nano i-0650833d57a649efd 172.31.74.58
373b3a15-a5e7-4715-9ead-c74d64cd5e84 kube-proxy-rjspt kube-system virtual-kubelet On-Demand t3.nano i-0f11c8ae89280f63e 172.31.74.83
61c44374-33aa-4636-8d09-23479025c3b4 kube-proxy-rjspt kube-system virtual-kubelet On-Demand t3.nano i-005c2fb81f90a1ede 172.31.77.142
66e5f0da-7f74-4a17-ac9b-fe6234ed8369 nginx-deployment-66f967f649-9b8zj default virtual-kubelet On-Demand t3.nano i-0ec6eabbae4ca4c6b 172.31.72.109
9543d72d-3325-4356-8df0-c9067435b295 kube-proxy-qx9rr kube-system virtual-kubelet On-Demand t3.nano i-0df366d083e4d1507 172.31.68.151
ca2f2682-0a60-40e8-96f5-98a0972d3e61 kube-proxy-qx9rr kube-system virtual-kubelet On-Demand e2-small kip-lbom5vl74zhfzolegk3sbzkqiu-zixsnaqkmbaorfxvtcqjolj6me 10.168.15.202
The cell for nginx-deployment-66f967f649-9b8zj is listed with instance id i-0ec6eabbae4ca4c6b, which corresponds to a cell running on AWS from a previous incarnation of vk that shipped pods to AWS.
Verified that the current setup is indeed shipping pods to GCE and not AWS:
Madhuris-MacBook-Pro:gke myechuri$ kubectl create deployment nginx --image=nginx
deployment.apps/nginx created
Madhuris-MacBook-Pro:gke myechuri$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-86c57db685-mzbjf 0/1 Pending 0 4s
Madhuris-MacBook-Pro:gke myechuri$ kubectl get cells
NAME POD NAME POD NAMESPACE NODE LAUNCH TYPE INSTANCE TYPE INSTANCE ID IP
0dfd31cf-8d5a-494d-98e7-d8de2f4b4c9d kube-proxy-rjspt kube-system virtual-kubelet On-Demand t3.nano i-0650833d57a649efd 172.31.74.58
373b3a15-a5e7-4715-9ead-c74d64cd5e84 kube-proxy-rjspt kube-system virtual-kubelet On-Demand t3.nano i-0f11c8ae89280f63e 172.31.74.83
61c44374-33aa-4636-8d09-23479025c3b4 kube-proxy-rjspt kube-system virtual-kubelet On-Demand t3.nano i-005c2fb81f90a1ede 172.31.77.142
66e5f0da-7f74-4a17-ac9b-fe6234ed8369 nginx-deployment-66f967f649-9b8zj default virtual-kubelet On-Demand t3.nano i-0ec6eabbae4ca4c6b 172.31.72.109
9543d72d-3325-4356-8df0-c9067435b295 kube-proxy-qx9rr kube-system virtual-kubelet On-Demand t3.nano i-0df366d083e4d1507 172.31.68.151
d3abeb7c-243a-465d-8386-99fbbe05b7a7 kube-proxy-qx9rr kube-system virtual-kubelet On-Demand e2-small kip-lbom5vl74zhfzolegk3sbzkqiu-2ov6w7behjdf3a4gth534bnxu4
ea3085c8-63c7-4cd6-9500-a946e8cc0480 nginx-86c57db685-mzbjf default virtual-kubelet On-Demand e2-small kip-lbom5vl74zhfzolegk3sbzkqiu-5iyilsddy5gnnfiavfdortaeqa
System pods in GKE expect the following metadata labels to be present for instances:
Even though we don't strictly need the system pods to be running via Kip, it's a bad UX that they start up automatically and keep crashing due to the lack of this metadata on their instances. It should be easy for kip to check these labels on its own instance and add them to each instance it starts.
When starting a pod with an annotation:
annotations:
  pod.elotl.co/volume-size: "40G"
Starting the pod fails:
I0810 20:38:05.128662 1 instances.go:183] Resizing volume on xxxxxx-xxxx-xxxx-xxxxxxxxxx: currently 10GiB, requested 38GiB
<repeated 3x>
W0810 20:38:08.725469 1 pod_controller.go:1071] Previously dispatching pod kip_test-8639e5f9-tnir0 is not finished dispatching
E0810 20:38:20.539437 1 pod_controller.go:462] Error resizing volume on node xxxxxx-xxxx-xxxx-xxxxxxxxxx pod kip_test-8639e5f9-tnir0: Error getting response for resize request: Server responded with status code 500. Response body: 500 Server Error: no resizing performed; does /dev/nvme0n1p1 have new capacity?
Currently, the spot instance/launch-type annotation isn't working.
I built kip locally with Go v1.15 and I'm mounting it as a volume, as described here: https://github.com/elotl/kip-minikube/
It starts normally, but when I tried to run nginx on the virtual node, it seems that the node_controller cannot ping the cell (the first few tries result in connection refused, but I guess that's expected as the EC2 instance is booting):
W0914 14:20:52.134617 7 node_controller.go:509] Heartbeat error from node d222e754-34f0-4667-b117-2d52e1f1e2b6: Get "https://3.227.243.231:6421/rest/v1/ping": x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0
It's not an issue for me right now, as I can always build with Go v1.13, but I guess we need to tackle it sooner or later.
Found this discussion: golang/go#39568
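The underlying problem is that the cell certificate carries its address only in the Common Name; Go 1.15 requires the SAN extension instead. A minimal sketch of the relevant template fields when issuing the cert (the IP and DNS name are illustrative):

package main

import (
    "crypto/x509"
    "crypto/x509/pkix"
    "fmt"
    "math/big"
    "net"
    "time"
)

func main() {
    // Certificates verified by Go >= 1.15 must carry their addresses in
    // the SAN extension; CommonName alone is no longer honored.
    tmpl := &x509.Certificate{
        SerialNumber: big.NewInt(1),
        Subject:      pkix.Name{CommonName: "cell"},
        IPAddresses:  []net.IP{net.ParseIP("3.227.243.231")}, // illustrative
        DNSNames:     []string{"cell.internal"},              // illustrative
        NotBefore:    time.Now(),
        NotAfter:     time.Now().Add(24 * time.Hour),
    }
    fmt.Println(tmpl.IPAddresses, tmpl.DNSNames)
}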
Looks like the instanceSelector creates invalid custom instances in GCE. When running a pod in GCE with the following resource requests/limits:
resources:
  limits:
    cpu: "1"
    memory: 500Mi
  requests:
    cpu: 100m
    memory: 200Mi
We get the following error:
E0827 23:32:46.151305 1 node_controller.go:287] Error in node start: startup error: googleapi: Error 400: Invalid value for field 'resource.machineType': 'https://www.googleapis.com/compute/v1/projects/elotl-dev/zones/us-west1-b/machineTypes/n1-custom-1-921'. Memory should be a multiple of 256MiB, while 921MiB is requested, invalid
I0827 23:32:46.151387 1 node_registry.go:218] Purging node &{{Node v1} {69ed9da5-87e0-4ac2-a5fc-b14bc7208def map[] 2020-08-27 23:32:45.33021287 +0000 UTC <nil> map[] 3bee1f5f-0c4e-49cd-9410-ee3ea9102288 default} {n1-custom-1-921 elotl-kip-latest false false {1.00 0.49Gi 10G false 0xc00080b2cc false <nil>}} {Creating [] kube-system_fluentd-gke-2x2bt}}
I've confirmed this with the following test case in TestGCEResourcesToInstanceType:
{
    Resources:    api.ResourceSpec{Memory: "0.5Gi", CPU: "1.0"},
    instanceType: "n1-custom-1-921",
},
This issue is a problem since one of the daemonSets in GKE creates a pod with these resource limits.
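For reference, a sketch of the rounding the instance selector presumably needs: GCE custom machine type memory must be a multiple of 256MiB (the error above says as much), so the requested memory should be rounded up to the next multiple before the machine type name is built. The customMachineType helper is illustrative, not Kip's actual selector code.

package main

import "fmt"

// customMachineType rounds memory up to the next 256MiB multiple, as GCE
// requires for custom machine types, and builds the type name.
func customMachineType(family string, cpus, memMiB int) string {
    const step = 256
    rounded := (memMiB + step - 1) / step * step
    return fmt.Sprintf("%s-custom-%d-%d", family, cpus, rounded)
}

func main() {
    // The selector produced 921MiB for 1 vCPU, which GCE rejects;
    // rounding 921 up yields a valid 1024MiB machine type.
    fmt.Println(customMachineType("n1", 1, 921)) // n1-custom-1-1024
}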