rancher / rke2
Home Page: https://docs.rke2.io/
License: Apache License 2.0
Hi! I am using
curl https://raw.githubusercontent.com/rancher/rke2/master/install.sh | INSTALL_RKE2_VERSION=v0.0.1-alpha.5 sh -
command to start a single-node cluster on an Ubuntu 20.04 VM (1 CPU, 2 GB RAM).
When the canal Helm chart is deployed, CPU usage is at 100% and the logs are full of
Jul 03 13:32:49 rke2-1 systemd-udevd[3692]: calico_tmp_A: Failed to get link config: No such device
Jul 03 13:32:49 rke2-1 networkd-dispatcher[697]: ERROR:Unknown interface index 150 seen even after reload
Jul 03 13:32:49 rke2-1 networkd-dispatcher[697]: WARNING:Unknown index 151 seen, reloading interface list
Jul 03 13:32:49 rke2-1 systemd-udevd[3698]: calico_tmp_B: Failed to get link config: No such device
Jul 03 13:32:49 rke2-1 networkd-dispatcher[697]: ERROR:Unknown interface index 151 seen even after reload
Jul 03 13:32:49 rke2-1 networkd-dispatcher[697]: WARNING:Unknown index 151 seen, reloading interface list
Jul 03 13:32:49 rke2-1 systemd-networkd[569]: calico_tmp_B: Could not find device, waiting for device initialization: No such device
Jul 03 13:32:49 rke2-1 systemd-networkd[569]: calico_tmp_A: Could not find device, waiting for device initialization: No such device
Jul 03 13:32:49 rke2-1 networkd-dispatcher[697]: ERROR:Unknown interface index 151 seen even after reload
Jul 03 13:32:49 rke2-1 networkd-dispatcher[697]: WARNING:Unknown index 150 seen, reloading interface list
Jul 03 13:32:49 rke2-1 systemd-networkd[569]: calico_tmp_B: Could not find device, waiting for device initialization: No such device
Jul 03 13:32:49 rke2-1 systemd-networkd[569]: calico_tmp_A: Could not find device, waiting for device initialization: No such device
Jul 03 13:32:49 rke2-1 networkd-dispatcher[697]: ERROR:Unknown interface index 150 seen even after reload
Jul 03 13:32:49 rke2-1 networkd-dispatcher[697]: WARNING:Unknown index 152 seen, reloading interface list
Jul 03 13:32:49 rke2-1 systemd-networkd[569]: calico_tmp_B: Could not find device, waiting for device initialization: No such device
Jul 03 13:32:49 rke2-1 systemd-networkd[569]: calico_tmp_A: Could not find device, waiting for device initialization: No such device
Jul 03 13:32:49 rke2-1 networkd-dispatcher[697]: ERROR:Unknown interface index 152 seen even after reload
Jul 03 13:32:49 rke2-1 networkd-dispatcher[697]: WARNING:Unknown index 153 seen, reloading interface list
Jul 03 13:32:49 rke2-1 networkd-dispatcher[697]: ERROR:Unknown interface index 153 seen even after reload
Jul 03 13:32:49 rke2-1 networkd-dispatcher[697]: WARNING:Unknown index 153 seen, reloading interface list
Jul 03 13:32:49 rke2-1 systemd-udevd[3692]: calico_tmp_A: Failed to get link config: No such device
Jul 03 13:32:49 rke2-1 networkd-dispatcher[697]: ERROR:Unknown interface index 153 seen even after reload
Jul 03 13:32:49 rke2-1 networkd-dispatcher[697]: WARNING:Unknown index 152 seen, reloading interface list
Jul 03 13:32:50 rke2-1 systemd-udevd[3698]: calico_tmp_B: Failed to get link config: No such device
In the calico-node container's log I see
2020-07-03T13:43:51.219280859Z stdout F 2020-07-03 13:43:51.217 [INFO][25660] int_dataplane.go 1258: Applying XDP actions did not succeed, disabling XDP error=failed to resync: cannot find XDP object "/usr/lib/calico/bpf/filter.o"
2020-07-03T13:43:51.311974047Z stdout F 2020-07-03 13:43:51.298 [INFO][25660] int_dataplane.go 778: Linux interface addrs changed. addrs=set.mapSet{} ifaceName="calico_tmp_B"
2020-07-03T13:43:51.312013473Z stdout F 2020-07-03 13:43:51.300 [INFO][25660] int_dataplane.go 778: Linux interface addrs changed. addrs=set.mapSet{} ifaceName="calico_tmp_A"
2020-07-03T13:43:51.325926965Z stdout F 2020-07-03 13:43:51.325 [WARNING][25660] int_dataplane.go 981: failed to wipe the XDP state error=cannot find XDP object "/usr/lib/calico/bpf/filter.o" try=0
2020-07-03T13:43:51.510189347Z stdout F 2020-07-03 13:43:51.509 [WARNING][25660] int_dataplane.go 981: failed to wipe the XDP state error=cannot find XDP object "/usr/lib/calico/bpf/filter.o" try=1
2020-07-03T13:43:51.724108398Z stdout F 2020-07-03 13:43:51.718 [WARNING][25660] int_dataplane.go 981: failed to wipe the XDP state error=cannot find XDP object "/usr/lib/calico/bpf/filter.o" try=2
2020-07-03T13:43:51.927056232Z stdout F 2020-07-03 13:43:51.913 [WARNING][25660] int_dataplane.go 981: failed to wipe the XDP state error=cannot find XDP object "/usr/lib/calico/bpf/filter.o" try=3
2020-07-03T13:43:52.102314442Z stdout F 2020-07-03 13:43:52.097 [WARNING][25660] int_dataplane.go 981: failed to wipe the XDP state error=cannot find XDP object "/usr/lib/calico/bpf/filter.o" try=4
kubectl -n kube-system set env ds/canal -c calico-node FELIX_XDPENABLED=false
and rebooting fixes the problem. It looks like the ranchertest/calico:v3.13.3
Docker image is missing the /usr/lib/calico/bpf/
directory:
docker run --rm ranchertest/calico:v3.13.3 ls -l /usr/lib/calico
ls: cannot access /usr/lib/calico: No such file or directory
We need to migrate from ranchertest over to our rancher docker hub repo.
FIPS 140 does not permit some algorithms to be used. For example, MD5 may not be allowed.
We should determine a solution that either parses and scans through the Go code and alerts on an invalid algorithm, or simply halts the build process and panics when an invalid algorithm is detected.
This may just involve updating our shim or something to that effect? (We cannot touch the GoBoring library.)
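As a rough illustration of the "scan and alert" option, the sketch below greps Go sources for imports of disallowed packages. The deny-list and the function name `scan_go_imports` are assumptions made for this example; the real list would have to come from the FIPS policy, and a text search will not catch algorithms pulled in through vendored or indirect code.

```shell
#!/bin/sh
# Hypothetical deny-list of Go import paths (an assumption for illustration only).
FIPS_DENYLIST='crypto/md5 crypto/rc4 crypto/des'

# scan_go_imports DIR
# Prints file:line:text for every quoted deny-listed import path found in *.go
# files under DIR. Returns 1 if anything was found, 0 otherwise.
scan_go_imports() {
    dir=$1
    found=0
    for pkg in $FIPS_DENYLIST; do
        # /dev/null is passed as an extra file so grep always prints file names.
        matches=$(find "$dir" -type f -name '*.go' \
            -exec grep -nF -- "\"$pkg\"" /dev/null {} +) || true
        if [ -n "$matches" ]; then
            printf '%s\n' "$matches"
            found=1
        fi
    done
    return "$found"
}
```

In a build script, `scan_go_imports . || exit 1` would correspond to the "halt the build" variant.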
We will need to create a unique set of benchmarks for RKE2.
The hardening guide effort is tracked in #84
Issue from k3s-io/k3s#1504
Epic covering building drone pipeline and images and the following below.
Note: Rancher Federal team to take this and STIG these images. Then, via their own private repo/pipeline publish the STIG'ed images.
Related K3s issue: k3s-io/k3s#1503
Autodetect binaries; if they are not available in the container, bind-mount to the host w/ chroot.
Rationale:
UBI8 seems to be missing many common binaries needed to get a kubernetes cluster up and running. This is our workaround for this.
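The detection half of this could look something like the sketch below, which only reports which binaries cannot be found on PATH. The function name `missing_binaries` and any binary names passed to it are assumptions for illustration; the bind-mount and chroot fallback itself is not shown.

```shell
#!/bin/sh
# missing_binaries NAME...
# Prints each NAME that cannot be resolved on PATH, one per line.
missing_binaries() {
    for bin in "$@"; do
        if ! command -v "$bin" >/dev/null 2>&1; then
            printf '%s\n' "$bin"
        fi
    done
}

# Example (binary names are placeholders, not a list taken from RKE2):
#   missing_binaries mount iptables
```

An empty result would mean the container image already has everything, so no fallback to the host would be needed.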
Version:
Rke v0.0.1-alpha.7
Describe the bug:
Install rke2 using commit id.
INSTALL_RKE2_COMMIT= ./install.sh
# INSTALL_RKE2_COMMIT=4ccaa37d20b38e7d95a1ccd577894d4689b36a84 ./install.sh
[INFO] using commit 4ccaa37d20b38e7d95a1ccd577894d4689b36a84 as release
[INFO] downloading hash https://storage.googleapis.com/rke2-ci-builds/rke2-4ccaa37d20b38e7d95a1ccd577894d4689b36a84.sha256sum
#
Effort to get a helm chart for the nginx controller.
Nginx will not be FIPS-compiled. After much research this is a large effort and not feasible for MVP release.
Depends on #42 completion.
Add support for updating config of the nodes.
Consider CoreDNS autoscaler as well.
A new argument has been added to kubelet that needs to be set to true
to comply with CIS 1.5 requirements. The work to accomplish this was done in #87.
To test
grep protect /var/lib/rancher/rke2//logs/kubelet.log
This command should return a string with the argument and it being set to false.
Version:
rke2 version v0.0.1-alpha.6
Node OS:
Ubuntu 20.04
Issue:
Jul 07 23:07:16 ip-172-31-15-215 systemd[1]: Failed to start Rancher Kubernetes Engine v2.
Jul 07 23:07:21 ip-172-31-15-215 systemd[1]: rke2.service: Scheduled restart job, restart counter is at 6.
Jul 07 23:07:21 ip-172-31-15-215 systemd[1]: Stopped Rancher Kubernetes Engine v2.
Jul 07 23:07:21 ip-172-31-15-215 systemd[1]: Starting Rancher Kubernetes Engine v2...
Jul 07 23:07:21 ip-172-31-15-215 rke2[2299]: time="2020-07-07T23:07:21Z" level=warning msg="not running in CIS 1.5 mode"
Jul 07 23:07:21 ip-172-31-15-215 rke2[2299]: time="2020-07-07T23:07:21Z" level=info msg="Starting rke2 v0.0.1-alpha.6 (HEAD)"
Jul 07 23:07:21 ip-172-31-15-215 rke2[2299]: time="2020-07-07T23:07:21Z" level=fatal msg="starting kubernetes: preparing server: start cluster and https: user: unknown user etcd"
Add support for RKE2 snapshot, backup, and restore via Rancher via CLI.
--cluster-reset
- You restore by triggering a cluster-reset with a restore path arg specified.
We should be building/publishing RPMs to https://rpm.rancher.io for consumption by EL systems.
Support for FIPS TLS
Much like with K3s, add support for recognizing an imported RKE2 cluster.
RKE2 works only with the embedded etcd driver defined in the k3s repo (https://github.com/rancher/k3s/blob/master/pkg/etcd/etcd.go). The etcd driver is responsible for the following:
To join the cluster you just need to start a new rke2 server and it will join the cluster automatically.
To remove an etcd member all you need to do is just remove the node from the cluster using kubectl:
kubectl delete node <node-name>
In case of quorum loss, you can reset the cluster with the same data on the server by passing --cluster-reset to rke2. After it resets the cluster, remove the --cluster-reset flag and restart rke2.
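A minimal sketch of that reset sequence, written as a dry run so it only prints the commands by default. It assumes rke2 runs as the systemd unit `rke2` (the unit name that appears in the systemd logs on this page) and that `rke2 server --cluster-reset` exits once the reset is done; the helper names `run` and `cluster_reset` are made up for the example.

```shell
#!/bin/sh
# Print the commands instead of running them unless DRY_RUN=0 is set.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

cluster_reset() {
    run systemctl stop rke2          # stop the regular service first
    run rke2 server --cluster-reset  # one-off run with the reset flag (assumed to exit when done)
    run systemctl start rke2         # normal start, without --cluster-reset
}
```

Calling `cluster_reset` with the default settings prints the three steps prefixed with `+`, which makes it easy to review them before running with `DRY_RUN=0`.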
A new node should join the cluster and an etcd member should be added.
You can exec into any etcd pod running in kube-system and verify using the following command:
etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --endpoints https://127.0.0.1:2379 --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt member list
The node should be removed from the k8s cluster as well as from the etcd cluster as a member.
You can verify by exec-ing into any of the remaining etcd pods and running the following command to list the members:
etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --endpoints https://127.0.0.1:2379 --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt member list
cluster should come back up again
cluster should restore quorum and come back up again
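To make the member checks above scriptable, something like the following could count the members that report as started. It assumes the default comma-separated output of `etcdctl member list` (ID, status, name, peer URLs, client URLs, learner flag); the function name `count_started_members` is hypothetical.

```shell
#!/bin/sh
# count_started_members
# Reads `etcdctl member list` output (default comma-separated format) on stdin
# and prints how many members have the status "started".
count_started_members() {
    awk -F', ' '$2 == "started" { n++ } END { print n + 0 }'
}

# Example: pipe the output of the etcdctl command shown above into it:
#   etcdctl ... member list | count_started_members
```

Comparing the count before and after adding or deleting a node gives a quick pass/fail signal for the join and remove cases.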
Version:
Rke2 v0.0.1-alpha.4
Describe the issue:
2. After installing kubectl, the default kubeconfig path needs to be passed explicitly.
kubectl get nodes --kubeconfig=/etc/rancher/rke2/rke2.yaml
NAME STATUS ROLES AGE VERSION
ip-172-31-13-222 Ready etcd,master 9m9s v1.18.4
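Until a default path is picked up automatically, one possible workaround is kubectl's standard KUBECONFIG environment variable, set to the same path that is passed with --kubeconfig above:

```shell
#!/bin/sh
# Point kubectl at the rke2 kubeconfig for the current shell session.
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml

# After this, the flag can be dropped:
#   kubectl get nodes
```

This only affects the current shell; it would need to go into a shell profile to persist.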
As of rke2 version v0.0.1-alpha.6 we do not have node-external-ip flag with rke2.
Ingress creation fails while importing rke2 into rancher.
Just going to pull directly from the tweet that brought it up
https://twitter.com/uncontainer/status/1194987600185974786?s=20
By default, #Kubernetes disables the Docker default Seccomp profile that jessfraz worked so hard on.
Several K8s cloud providers don’t override that setting, making their containers completely insecure by default, requiring pod level config.
We should do some load/stress testing of RKE2 as there may be a performance impact due to implemented crypto. It would be a good idea to do the following:
While testing consider that we may have a performance impact compared to k3s tests we have done. We should test when we have an alpha available and try to complete this within a couple of weeks well before beta release.
This is a high level task that encompasses the work required to support RHEL8. It expands on work done as a part of #2
Autodetect binaries, if they are not available in the container bind-mount to host w/ chroot.
Separated out from original issue #1
CIS Hardening is complete. Next step here is to create a guide with details.
Version:
(all)
latest validated on version v0.0.1-alpha.6
Issue:
There is no mention of --cluster-reset
when using rke2 server --help
This tracks any work that may be required to get SELinux support into RKE2.
Version:
rke2 v0.0.1-alpha.4
Ubuntu 20.04
Issue:
The console gets flooded with the messages below. The node is successfully added.
To reproduce:
Additional info:
ERRO[1700] Failed to connect to proxy error="dial tcp 172.31.33.122:9345: connect: connection refused"
ERRO[1700] Remotedialer proxy error error="dial tcp 172.31.33.122:9345: connect: connection refused"
INFO[1705] Connecting to proxy url="wss://172.31.33.122:9345/v1-rke2/connect"
ERRO[1705] Failed to connect to proxy error="dial tcp 172.31.33.122:9345: connect: connection refused"
ERRO[1705] Remotedialer proxy error error="dial tcp 172.31.33.122:9345: connect: connection refused"
INFO[1710] Connecting to proxy url="wss://172.31.33.122:9345/v1-rke2/connect"
Version:
rke2 v0.0.1-alpha.4
Issue:
The nginx-ingress-controller service is in a pending state. Since we don't have servicelb, it should not be expected to be of type LoadBalancer.
kube-system nginx-ingress-controller LoadBalancer 10.43.5.161 <pending> 80:30782/TCP,443:30488/TCP 4h32m
Standalone upgrade controller integration and integration with Rancher.
Base off of rancher/system-upgrade-controller framework.
Config of nodes: #43 (to be done later)
Work should start on k3s first, then port this into RKE2. k3s-io/k3s#1505
Add support for flat config file which specifies flags to run binary with. Please reference the K3s issue for details. This issue is for tracking the work to port this over to RKE2
We might close this later in favor of #142. We decided to provide flags to the user that allows them to set up cloud providers.
However, it's possible we may need this later, for example for the VMware external CCM.
We'll need to have this sorted out before we make a call on if this issue needs to be closed or not.
RKE2 version:
v0.0.1-alpha.4
Node OS:
Ubuntu 18.04:
Describe the issue:
rke2 installation errors on ubuntu18.04.
Logs:
Using the rke2 binary, now seeing this error: rke2: /lib/x86_64-linux-gnu/libc.so.6: version 'GLIBC_2.28' not found (required by rke2)
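A quick way to check a host before installing is to compare its glibc version with the one named in the error. The sketch below assumes a glibc-based system where `getconf GNU_LIBC_VERSION` is available and a `sort` that supports `-V` (GNU coreutils); the function names are made up for the example.

```shell
#!/bin/sh
# version_ge A B: succeeds when dotted version A is greater than or equal to B.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# host_glibc_version: prints only the version number reported by getconf,
# which prints a line of the form "glibc <version>".
host_glibc_version() {
    getconf GNU_LIBC_VERSION 2>/dev/null | awk '{ print $2 }'
}

# Example:
#   version_ge "$(host_glibc_version)" 2.28 || echo "glibc is older than 2.28"
```

A version-aware comparison is used because a plain string or numeric comparison would order 2.9 after 2.28.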
RKE2 version:
v0.0.1-alpha.4
Describe the issue:
I0629 22:43:19.270728 1995 log.go:181] http: TLS handshake error from 172.31.7.168:48004: EOF
I0629 22:43:19.813270 1995 log.go:181] http: TLS handshake error from 172.31.7.168:46580: EOF
I0629 22:43:20.260827 1995 log.go:181] http: TLS handshake error from 172.31.7.168:60068: EOF
I0629 22:43:20.265857 1995 log.go:181] http: TLS handshake error from 172.31.7.168:41756: EOF
I0629 22:43:20.572638 1995 log.go:181] http: TLS handshake error from 172.31.7.168:63126: EOF
I0629 22:43:20.868685 1995 log.go:181] http: TLS handshake error from 172.31.7.168:48299: EOF
I0629 22:43:21.174996 1995 log.go:181] http: TLS handshake error from 172.31.7.168:28564: EOF
I0629 22:43:22.825398 1995 log.go:181] http: TLS handshake error from 172.31.7.168:24861: EOF
Hello!
I'm excited about the future of RKE, though the current version does not yet fit into our use case.
I work at a French state hospital called APHP, for "Assistance Publique - Hôpitaux de Paris".
We are a little lightweight on system administration and development resources, so the ease of use of RKE was a great fit for us. Support for air-gapped installations, which is also really important for us, is there, so that's good too. We run behind an HTTP proxy for anything that goes outside, and our base systems run CentOS, FYI.
The current project is about making environments available for remote computation such as JupyterHub with strict confidentiality requirements.
We have users with different rights to a big data warehouse, so they must not step onto each other's permissions and access data they're not allowed to.
So we determined that Kubernetes orchestration mechanisms were great for resource management and fast to spin up and down as well but the isolation between users in their own pods is insufficient.
We have seen that the Kubernetes ecosystem is currently evolving towards better security, mainly through efforts driven by Red Hat, Google and Intel/OpenStack.
And where's RKE in all this? Well RKE depends on Docker so it can't use gVisor, Kata Containers or any other custom runtime such as cri-o.
So here's me saying that I'd really love if RKE2 could support non-Docker deployments while keeping the ease of use!
Thanks a lot for the awesome work!
By the way, I wish we could fund you in some way, but the process that leads to such a thing is complicated; if I have a working prototype it'll be easier for me to justify it to the people who can do it. Again, it's not your responsibility to drive us to a working prototype, but know that with my rather limited knowledge of Golang (I'm a fast learner, though), I'm happy to help in any way I can.
And by the way x2, I'm currently experimenting with k3s with multi master HA deployments with dqlite but it's not quite there yet. I could also get Kata Containers running with k3s so that's good!
Leo
A few issues found during the initial install of rke2
provider "aws" {
region = "us-east-2"
profile = "rancher-eng"
}
Warning: Interpolation-only expressions are deprecated
on main.tf line 238, in resource "aws_lb_target_group_attachment" "rke2-nlb-attachement":
238: target_group_arn = "${aws_lb_target_group.rke2-master-nlb-tg.arn}"
Are you sure you want to continue connecting (yes/no/[fingerprint])?
null_resource.get-kubeconfig (local-exec): Host key verification failed.
Version:
Rke v0.0.1-alpha.4
Describe the bug:
Install first node
INSTALL_RKE2_VERSION=v0.0.1-alpha.4 ./install.sh
Join second master.
The node is available but not joined to the master.
INSTALL_RKE2_VERSION=v0.0.1-alpha.4 INSTALL_RKE2_EXEC='server' RKE2_URL='MasterIP:9345' RKE2_TOKEN='<TOKEN>' ./install.sh
Logs:
Jun 30 23:56:22 ip-172-31-1-120 rke2[448714]: time="2020-06-30T23:56:22Z" level=info msg="Shutting down /v1, Kind=Node workers"
Jun 30 23:56:22 ip-172-31-1-120 rke2[448714]: time="2020-06-30T23:56:22Z" level=info msg="Shutting down /v1, Kind=Secret workers"
Jun 30 23:56:22 ip-172-31-1-120 rke2[448714]: time="2020-06-30T23:56:22Z" level=fatal msg="server stopped: http: Server closed"
As per discussion with @ibuildthecloud:
rke2 will integrate Helm charts as CR manifests in the manifest directory. However, since rke2 uses a different supervisor port, the helm controller will not be able to download the charts, so the following changes will be added:
chartContent
Review Go's use of BoringCrypto. Determine what needs to be done to get a FIPS-Compliant go build going.
Node OS: Centos, RHEL
Issue:
RPMs are not available, as mentioned in issue #49.
Additional info:
rpm -i https://rpm.rancher.io/rke2-selinux-0.1.1-rc1.el7.noarch.rpm
curl: (22) The requested URL returned error: 404 Not Found
error: skipping https://rpm.rancher.io/rke2-selinux-0.1.1-rc1.el7.noarch.rpm - transfer failed
Based on 6/10/20 call with Rancher Federal team, there was some concern that image pull secrets do not work with containerd. We believe this is not the case but proposed to have QA briefly check this area to verify.
Functionality introduced in PR #58 needs to be tested and verified.
Verify the etcd user has been created:
grep etcd /etc/passwd
Verify kernel parameters have been updated, run the commands below:
sysctl vm.panic_on_oom
sysctl kernel.panic
sysctl kernel.panic_on_oops
sysctl kernel.keys.root_maxbytes
Expected values:
vm.panic_on_oom=0
kernel.panic=10
kernel.panic_on_oops=1
kernel.keys.root_maxbytes=25000000
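The manual comparison can be folded into one check. The sketch below reads `key=value` pairs on stdin and compares each with what `sysctl -n` reports; the function name `check_sysctls` and the mismatch message format are assumptions for this example.

```shell
#!/bin/sh
# check_sysctls
# Reads "key=value" lines on stdin, compares each value with `sysctl -n key`,
# prints one line per mismatch and returns 1 if any value differs.
check_sysctls() {
    rc=0
    while IFS='=' read -r key want; do
        [ -n "$key" ] || continue
        have=$(sysctl -n "$key" 2>/dev/null) || have='<unreadable>'
        if [ "$have" != "$want" ]; then
            printf '%s: expected %s, got %s\n' "$key" "$want" "$have"
            rc=1
        fi
    done
    return "$rc"
}
```

Feeding it the expected values in `key=value` form (for example `kernel.panic=10`) prints nothing and returns 0 when the node is configured as expected.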
PR #32 sets the flag "secrets-encryption" to true by default and passes it down to k3s. The same tests that verify this flag in k3s can be used for rke2.
Version:
version v0.0.1-alpha.6
Issue:
etcd user persists after running rke2-uninstall.sh, thus failing re-install of rke2 on the same node.
rke2 -v
-bash: /usr/local/bin/rke2: No such file or directory
cat /etc/passwd|grep etcd
etcd:x:997:997:ETCD Service User:/var/lib/rancher/rke2:/usr/sbin/nologin
Make sure static pods go into the default logging v2 and also the supervisor process log.
Basically we need to ensure all logs can get into Rancher log v2.
Based on recent discussions with the Rancher Federal team and Will, a full RPM installer is a must for MVP.
If a supplemental RPM is needed such as for SELinux policy this is okay. Best case one RPM does everything (is this possible?)
Is waiting on internal eio issue #36 to be completed.
The functionality introduced in PR #56 needs to be validated. This work is in conjunction with the install script for adding CIS mode.
This can be done by:
./rke2 --profile=cis-1.5 server
If it runs, it thinks it succeeded.
To get the etcd process, run the command below.
ps aux | grep etcd
Check the pod manifest for a security context section that has the etcd user ID and group ID. Those IDs can be read from the output of cat /etc/passwd | grep etcd.
To see the manifest:
cat /var/lib/rancher/rke2/agent/pod-manifests/etcd.yaml
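The manifest check could be scripted roughly as follows. It assumes the manifest expresses the IDs with the standard Kubernetes securityContext keys `runAsUser` and `runAsGroup`; the function names `passwd_ids` and `manifest_runs_as` are hypothetical.

```shell
#!/bin/sh
# passwd_ids USER [PASSWD_FILE]
# Prints "UID GID" for USER from a passwd-format file (default /etc/passwd).
passwd_ids() {
    awk -F: -v u="$1" '$1 == u { print $3, $4 }' "${2:-/etc/passwd}"
}

# manifest_runs_as MANIFEST UID GID
# Succeeds when MANIFEST contains both "runAsUser: UID" and "runAsGroup: GID".
# The trailing group keeps an ID such as 99 from matching 997.
manifest_runs_as() {
    grep -Eq "runAsUser:[[:space:]]*$2([^0-9]|\$)" "$1" &&
    grep -Eq "runAsGroup:[[:space:]]*$3([^0-9]|\$)" "$1"
}

# Example:
#   set -- $(passwd_ids etcd)
#   manifest_runs_as /var/lib/rancher/rke2/agent/pod-manifests/etcd.yaml "$1" "$2"
```

This is a text match rather than a YAML parse, so it confirms the values are present but not which container or pod-level section they belong to.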
Installation using commit id fails at download
INSTALL_RKE2_COMMIT=1dd8d99d86daac97b2cf2a060288c86e0059e7b6 ./install.sh
[INFO] using commit 1dd8d99d86daac97b2cf2a060288c86e0059e7b6 as release
[INFO] downloading hash https://storage.googleapis.com/rke2-ci-builds/rke2-1dd8d99d86daac97b2cf2a060288c86e0059e7b6.sha256sum
root@ip-172-31-4-195:~#
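One way to narrow this down might be to check whether the checksum file exists for the commit before running the installer. The sketch below only rebuilds the URL in the form shown in the installer output; the function name `ci_checksum_url` is hypothetical, and the curl check is left as a comment because it needs network access.

```shell
#!/bin/sh
# ci_checksum_url COMMIT
# Prints the checksum URL for COMMIT, following the pattern in the log above.
ci_checksum_url() {
    printf 'https://storage.googleapis.com/rke2-ci-builds/rke2-%s.sha256sum\n' "$1"
}

# Example (requires network access):
#   curl -fsSIL "$(ci_checksum_url 1dd8d99d86daac97b2cf2a060288c86e0059e7b6)" >/dev/null \
#       || echo "checksum not published for this commit"
```

If the HEAD request fails, the silent exit of install.sh would be consistent with the artifact not having been uploaded for that commit, though that is only a guess from the output shown here.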