squat / kilo
Kilo is a multi-cloud network overlay built on WireGuard and designed for Kubernetes (k8s + wg = kg)
Home Page: https://kilo.squat.ai
License: Apache License 2.0
Hello,
I was wondering if it is possible to run Kilo with wireguard-go or boringtun in userspace (instead of relying on the kernel module). Is this on the roadmap?
Thanks!
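For context, the difference is just who implements the interface: the kernel module or a userspace process such as wireguard-go. A rough sketch of the detection such a fallback could do (this is not current Kilo behavior, just an illustration):

```shell
# Prefer the kernel module; otherwise fall back to a userspace implementation.
# (Illustration only: Kilo does not currently do this.)
if modprobe -n wireguard 2>/dev/null; then
  impl="kernel module"
else
  impl="userspace (e.g. wireguard-go or boringtun)"
fi
echo "WireGuard implementation to use: $impl"
```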
I'm using k3s with flannel, and Kilo as an add-on, using in-cluster authorization without mounting a kubeconfig, and with the following args:
--cni=false
--local=false
--encapsulate=never
--subnet=10.40.0.0/16
--hostname=$(NODE_NAME)
--compatibility=flannel
I have a master server in location A and a node in location B.
The master has a public IP and the node is behind a NAT.
I'm setting the following annotations on master:
kilo.squat.ai/force-endpoint: <master_public_ip>:51820
kilo.squat.ai/leader: "true"
kilo.squat.ai/location: A
And these on node:
kilo.squat.ai/force-endpoint: <nat_public_ip>:51820
kilo.squat.ai/persistent-keepalive: 5
kilo.squat.ai/location: B
I'm booting the master, then the node, and after both servers become ready, neither host shows the other's endpoint (using the wg command).
Checking the Kilo logs on the master, I see these connection timed out errors at the end:
{"caller":"mesh.go:219","component":"kilo","level":"warn","msg":"no private key found on disk; generating one now","ts":"2020-02-26T10:06:11.347867897Z"}
{"caller":"main.go:217","msg":"Starting Kilo network mesh '12220b790da5ab7fbdcfb1db9d899bec9602261e-dirty'.","ts":"2020-02-26T10:06:12.017759649Z"}
E0226 10:06:12.145623 1 reflector.go:126] pkg/k8s/backend.go:391: Failed to list *v1alpha1.Peer: the server could not find the requested resource (get peers.kilo.squat.ai)
{"caller":"mesh.go:664","component":"kilo","level":"info","msg":"WireGuard configurations are different","ts":"2020-02-26T10:06:13.609772774Z"}
E0226 10:12:51.783433 1 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.30.2.83:48064->10.43.0.1:443: read: connection timed out
E0226 10:13:20.455365 1 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.30.2.83:48062->10.43.0.1:443: read: connection timed out
The :1107/health endpoint of Kilo on the master responds with 200 OK.
The logs of Kilo on the node seem OK:
{"caller":"mesh.go:219","component":"kilo","level":"warn","msg":"no private key found on disk; generating one now","ts":"2020-02-26T10:18:01.36172736Z"}
{"caller":"main.go:217","msg":"Starting Kilo network mesh '12220b790da5ab7fbdcfb1db9d899bec9602261e-dirty'.","ts":"2020-02-26T10:18:01.387390299Z"}
{"caller":"mesh.go:664","component":"kilo","level":"info","msg":"WireGuard configurations are different","ts":"2020-02-26T10:18:01.853855818Z"}
To resolve this I had to delete the Kilo pod on the master; the new pod then configured the wg endpoint correctly.
Hi,
I have one master node in region A with a public ip and a worker node in region B behind a NAT (two separate networks).
After deploying Kilo I annotated both nodes to force the external IP (the master with its own public IP and the worker with the NAT public IP) and to set the location on each (master: region-a, worker: region-b).
Checking the WireGuard peers on the master, with the wg command, I can see the worker's peer with the NAT public IP as the endpoint, but the port is different from the WireGuard listen port set on the worker node.
I can also see that a handshake was made successfully, but after approximately 30s Kilo recreates the peer because it detects configuration differences (log: 'WireGuard configurations are different') due to the endpoint port, interrupting existing connections.
How can I solve this?
Thanks in advance.
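For what it's worth, a differing endpoint port is typical NAT behavior: the NAT device rewrites the UDP source port, so the port the master observes is not the listen port configured on the worker. A sketch of the comparison, using made-up sample values in place of real wg show output:

```shell
# Made-up values standing in for real output: what `wg show` on the master
# reports as the peer endpoint vs. the listen port configured on the worker.
observed_endpoint="203.0.113.7:37014"  # sample endpoint seen on the master
configured_port=51820                  # listen port set on the worker
observed_port=${observed_endpoint##*:}
if [ "$observed_port" != "$configured_port" ]; then
  echo "NAT rewrote the source port: $observed_port != $configured_port"
fi
```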
Hi,
I was wondering: how does Kilo compare to service-mesh software such as Linkerd, Linkerd2 (formerly Conduit), Consul, and Istio?
Service-mesh software list source: https://kubedex.com/istio-vs-linkerd-vs-linkerd2-vs-consul/
I heard about Kilo by reading the schedule of the 2019 Open Networking Summit Europe conference:
Connecting Kubernetes Clusters Across Clouds With Kilo
Source: https://events.linuxfoundation.org/events/open-networking-summit-europe-2019/
I think I understand the difference between Kilo and Container Network Interface (CNI) plugins like Calico and Flannel: a CNI plugin's scope is a single node; it provides networking for the containers/pods running on that node.
But I don't understand the difference between Kilo and Service Mesh softwares like Istio.
My initial understanding is that Kilo is different from Istio & co.: Kilo connects K8s nodes that can span data centers and public clouds via a WireGuard VPN.
Therefore, thanks to Kilo, it is as if the apps and services running on these distributed K8s nodes across different cloud providers were running on the same "virtual (overlay) cloud".
So maybe Kilo works at a lower level than service-mesh solutions like Istio?
But I am not sure about this.
Thank you for any input you can share on this question!
Sorry if this post/question is not a good fit for GitHub issues.
--
Nicop311
Problem:
I'm using squat/kilo:amd64-c93fa1e5b194e5d0a847f0775033bed92251f4d6
After adding a new node, ten-vm1, it got a wrong kilo.squat.ai/internal-ip of 127.0.0.1, which caused a broken route on the other node:
[root@hw-vm1 ~]# route -n |grep kilo
127.0.0.1 10.4.0.3 255.255.255.255 UGH 0 0 0 kilo0
Resolution:
For now I just use kubectl annotate node ten-vm1 kilo.squat.ai/internal-ip="172.21.0.xx/32" --overwrite=true to fix the wrong value.
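A hedged sketch of that workaround with a guard against the loopback value (the IP below is a made-up example, since the real one is elided above; the command is printed rather than executed):

```shell
node="ten-vm1"
ip="172.21.0.42"  # example value; substitute the node's real internal IP
case $ip in
  127.*)
    # Refuse the loopback address that caused the broken route above.
    echo "refusing loopback internal IP: $ip" ;;
  *)
    echo "kubectl annotate node $node kilo.squat.ai/internal-ip=${ip}/32 --overwrite=true" ;;
esac
```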
Hello,
I am trying to set up a VPN only kilo installation on a Rancher RKE / Canal based cluster.
I ended up using the manifest suggested in #30, making sure it is set to go to a single node. It installs fine, and the kilo pod is running.
When I add a peer using the suggested manifest (filling in public key), the kilo pod has an error as follows:
{"caller":"main.go:217","msg":"Starting Kilo network mesh 'ba00b6c180d40bd73fc94af5be3bbf8f85789bf9'.","ts":"2020-02-22T20:54:17.234829806Z"}
{"caller":"cni.go:58","component":"kilo","err":"failed to read CNI config list file: error reading /etc/cni/net.d/10-kilo.conflist: open /etc/cni/net.d/10-kilo.conflist: no such file or directory","level":"warn","msg":"failed to get CIDR from CNI file; overwriting it","ts":"2020-02-22T20:54:17.336024857Z"}
{"caller":"cni.go:66","component":"kilo","level":"info","msg":"CIDR in CNI file is empty","ts":"2020-02-22T20:54:17.336108147Z"}
{"CIDR":"10.42.3.0/24","caller":"cni.go:71","component":"kilo","level":"info","msg":"setting CIDR in CNI file","ts":"2020-02-22T20:54:17.336140049Z"}
{"caller":"cni.go:73","component":"kilo","err":"failed to read CNI config list file: open /etc/cni/net.d/10-kilo.conflist: no such file or directory","level":"warn","msg":"failed to set CIDR in CNI file","ts":"2020-02-22T20:54:17.33616623Z"}
{"caller":"mesh.go:482","component":"kilo","event":"add","level":"info","peer":{"AllowedIPs":[{"IP":"10.5.0.1","Mask":"/////w=="}],"Endpoint":null,"PersistentKeepalive":10,"PublicKey":"<...>","Name":"squat"},"ts":"2020-02-22T20:54:17.486646157Z"}
{"caller":"mesh.go:717","component":"kilo","error":"failed to delete configuration file: remove /var/lib/kilo/conf: no such file or directory","level":"error","ts":"2020-02-22T20:54:17.486913746Z"}
{"caller":"mesh.go:727","component":"kilo","error":"failed to clean up node backend: failed to patch node: the server rejected our request due to an error in our request","level":"error","ts":"2020-02-22T20:54:17.493099325Z"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x11f6048]
goroutine 31 [running]:
github.com/squat/kilo/pkg/mesh.(*Mesh).resolveEndpoints(0xc00023c580, 0x16, 0x0)
/kilo/pkg/mesh/mesh.go:764 +0x2f8
github.com/squat/kilo/pkg/mesh.(*Mesh).applyTopology(0xc00023c580)
/kilo/pkg/mesh/mesh.go:560 +0xc7
github.com/squat/kilo/pkg/mesh.(*Mesh).syncPeers(0xc00023c580, 0xc00035c060)
/kilo/pkg/mesh/mesh.go:483 +0x51d
github.com/squat/kilo/pkg/mesh.(*Mesh).Run(0xc00023c580, 0x0, 0x0)
/kilo/pkg/mesh/mesh.go:350 +0x632
main.Main.func4(0x0, 0x0)
/kilo/cmd/kg/main.go:218 +0x14a
github.com/oklog/run.(*Group).Run.func1(0xc00049f740, 0xc0003a7ca0, 0xc0002aa860)
/kilo/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/oklog/run.(*Group).Run
/kilo/vendor/github.com/oklog/run/group.go:37 +0xbb
According to the manifest I am using, CNI is not enabled (--cni=false), but Kilo is still looking for this file. Any suggestions on how to get this working?
Thanks,
Ben
Hi squat,
In issue #9, which I posted,
I first made a WireGuard connection between VmOnAWS and VmOnGCP,
then created a K8s cluster with those nodes,
and applied Kilo as a final step.
But in the issue #8, you said that
For Kilo to pick up an existing Wireguard interface on the host is not supported.
So I wanted to re-create a K8s cluster with nodes in different cloud service providers,
and here is the point where my questions arise.
When I follow the Kilo installation guide:
I am using Ubuntu as the guest VM's OS, so I installed WireGuard with apt install wireguard.
root@VmOnAWS# which wg
/usr/bin/wg
And for a clean state, I destroyed the existing WireGuard connection between VmOnAWS and VmOnGCP.
# wg-quick down ./wg0.conf
[#] wg showconf wg0
[#] ip link delete dev wg0
I opened UDP port 51820 for my AWS SecurityGroup and GCP SecurityGroup.
The instructions ask me to kubectl annotate the k8s nodes,
but I have no k8s cluster since I started from a clean state,
nor can I create one (without using WireGuard or something similar) since one node is in AWS and the other is in GCP.
One of my guesses is:
1. Create the k8s cluster on AWS and a VM on GCP.
2. Make the VM on GCP a worker node of the k8s cluster on AWS.
Do you think this will work / make sense..?
==[vm-route]===============
[root@ali-vm1 v070]# route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 172.16.175.253 0.0.0.0 UG 0 0 0 eth0
2.3.0.0 0.0.0.0 255.255.255.0 U 0 0 0 br-6d0b493fbd9b
7.0.0.0 0.0.0.0 255.255.255.0 U 0 0 0 cni0
7.0.1.0 10.4.0.2 255.255.255.0 UG 0 0 0 kilo0
10.4.0.0 0.0.0.0 255.255.0.0 U 0 0 0 kilo0
169.254.0.0 0.0.0.0 255.255.0.0 U 1002 0 0 eth0
172.16.160.0 0.0.0.0 255.255.240.0 U 0 0 0 eth0
172.17.91.0 0.0.0.0 255.255.255.0 U 0 0 0 docker0
172.19.0.0 0.0.0.0 255.255.0.0 U 0 0 0 br-6cec6875d930
192.168.0.105 10.4.0.2 255.255.255.255 UGH 0 0 0 kilo0
[root@hw-vm1 v070]# route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 192.168.0.1 0.0.0.0 UG 0 0 0 eth0
7.0.0.0 10.4.0.1 255.255.255.0 UG 0 0 0 kilo0
10.4.0.0 0.0.0.0 255.255.0.0 U 0 0 0 kilo0
169.254.0.0 0.0.0.0 255.255.0.0 U 1002 0 0 eth0
169.254.0.0 0.0.0.0 255.255.0.0 U 1003 0 0 eth1
169.254.0.0 0.0.0.0 255.255.0.0 U 1004 0 0 eth2
169.254.169.254 192.168.0.254 255.255.255.255 UGH 0 0 0 eth0
172.16.168.255 10.4.0.1 255.255.255.255 UGH 0 0 0 kilo0
172.17.0.0 0.0.0.0 255.255.0.0 U 0 0 0 docker0
172.18.0.0 0.0.0.0 255.255.0.0 U 0 0 0 br-f5da666f520e
172.19.0.0 0.0.0.0 255.255.0.0 U 0 0 0 br-873639ae95ca
172.20.0.0 0.0.0.0 255.255.0.0 U 0 0 0 br-d9e7fbf26b47
172.21.0.0 0.0.0.0 255.255.0.0 U 0 0 0 br-2be4dd4a63ad
192.168.0.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
192.168.0.0 0.0.0.0 255.255.255.0 U 0 0 0 eth2
192.168.2.0 0.0.0.0 255.255.255.0 U 0 0 0 eth1
==[podOK]==================
kube-system rbac-manager-79bdb8757d-hdhgq 1/1 Running 1 34h 7.0.0.58 ali-vm1
[root@ali-vm1 v070]# ping 7.0.0.58
PING 7.0.0.58 (7.0.0.58) 56(84) bytes of data.
64 bytes from 7.0.0.58: icmp_seq=1 ttl=64 time=0.054 ms
^C
[root@hw-vm1 v070]# ping 7.0.0.58
PING 7.0.0.58 (7.0.0.58) 56(84) bytes of data.
64 bytes from 7.0.0.58: icmp_seq=1 ttl=63 time=26.6 ms
64 bytes from 7.0.0.58: icmp_seq=2 ttl=63 time=26.6 ms
==[svc]================
[root@(⎈ |default:kube-system) t-nat]$ kc get svc -A
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default agola-gateway NodePort 6.7.9.161 <none> 8000:30002/TCP 3d21h
[root@ali-vm1 v070]# yum install nc
[root@ali-vm1 v070]# nc -vz 6.7.9.161 8000
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 6.7.9.161:8000.
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.
[root@ali-vm1 v070]# curl 6.7.9.161:8000
<!DOCTYPE html><html lang=en><head><meta
[root@ali-vm1 v070]# curl 6.7.8.1:443
Client sent an HTTP request to an HTTPS server.
[root@hw-vm1 v070]# nc -vz 6.7.9.161 8000
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 6.7.9.161:8000.
Ncat: 0 bytes sent, 0 bytes received in 0.04 seconds.
[root@hw-vm1 v070]# curl 6.7.9.161:8000
<!DOCTYPE html><html lang=en><head><meta
[root@hw-vm1 v070]# curl 6.7.8.1:443
^C
Installing Kilo on 4 nodes across 2 different public network spaces: does Kilo encrypt comms between nodes?
How does one validate this encryption? Meaning, if it "automagically" encrypts node communications, how can I verify it from node to node?
Also, can my remote "workstation" (laptop) far away be a client of the cluster, and also utilize the VPN for internet access along with management? Sorry, the docs weren't so clear to me. Kilo is installed.
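On the verification question: one common approach is to watch the physical (underlay) interface while pods on different nodes talk; only WireGuard UDP datagrams should appear there, never the pod IPs in cleartext. A sketch, where the interface name (eth0), the port, and the 2-second window are assumptions:

```shell
# Capture a few packets on the underlay interface; with Kilo encrypting
# node-to-node traffic, only UDP to/from the WireGuard port should show up.
out=$(timeout 2 tcpdump -ni eth0 'udp port 51820' -c 5 2>/dev/null \
  || echo "no capture; run this on a node while pinging a pod on a remote node")
echo "$out"
```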
Feature: nodes check the connection to each other, for example by ping, and show the TTL in kgctl and expose it as metrics.
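The requested check could start as small as pinging each peer's WireGuard IP and recording the result, which kgctl or the metrics endpoint could then surface. A sketch (the peer IP 10.4.0.2 is an assumption):

```shell
peer=10.4.0.2  # a peer's WireGuard IP; placeholder for illustration
# Extract the round-trip time from a single ping, if the peer answers.
rtt=$(ping -c 1 -W 1 "$peer" 2>/dev/null | sed -n 's/.*time=\([0-9.]*\).*/\1/p')
if [ -n "$rtt" ]; then
  status="rtt ${rtt}ms"
else
  status="unreachable"
fi
echo "peer $peer: $status"
```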
Using Kilo as the CNI in kubeadm (the kilo/manifests/kilo-kubeadm.yaml file), I'm getting the following error on all nodes (Ubuntu 18.04 and CentOS 7):
{"caller":"mesh.go:618","component":"kilo","error":"failed to add rule: failed to add iptables chain: running [/sbin/ip6tables -t nat -N KILO-NAT --wait]: exit status 3: modprobe: can't change directory to '/lib/modules': No such file or directory\nip6tables v1.8.4 (legacy): can't initialize ip6tables table `nat': Table does not exist (do you need to insmod?)\nPerhaps ip6tables or your kernel needs to be upgraded.\n","level":"error","ts":"2020-05-07T13:59:26.463594297Z"}
But after I execute the command /sbin/ip6tables -t nat -N KILO-NAT --wait directly on the node, Kilo starts working, correctly configuring the pod network and the WireGuard conf.
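The modprobe error suggests the Kilo container cannot see the host's kernel modules, so ip6tables cannot load the nat table module. Mounting /lib/modules from the host into the DaemonSet (which some manifests may already do) avoids the manual step; a sketch of such a patch, written to a file here (the volume name is an assumption):

```shell
# Write a patch that mounts the host's /lib/modules into the kilo container.
cat > /tmp/kilo-lib-modules-patch.yaml <<'EOF'
spec:
  template:
    spec:
      containers:
      - name: kilo
        volumeMounts:
        - name: lib-modules
          mountPath: /lib/modules
          readOnly: true
      volumes:
      - name: lib-modules
        hostPath:
          path: /lib/modules
EOF
# Apply with, e.g.: kubectl -n kube-system patch ds kilo -p "$(cat /tmp/kilo-lib-modules-patch.yaml)"
echo "patch written ($(wc -l < /tmp/kilo-lib-modules-patch.yaml) lines)"
```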
Feature request:
publish kgctl under Releases for Linux/macOS/Windows.
Meanwhile:
kgctl is available as a Docker image => https://hub.docker.com/r/mrhein/kgctl
It doesn't look like there is any way for packets to hit the last 6 rules, since the catch-all MASQUERADE rule precedes them:
-A KILO-NAT -d 172.30.12.0/22 -m comment --comment "Kilo: do not NAT packets destined for the local Pod subnet" -j RETURN
-A KILO-NAT -d 172.28.128.0/24 -m comment --comment "Kilo: do not NAT packets destined for the Kilo subnet" -j RETURN
-A KILO-NAT -d 10.255.255.254/32 -m comment --comment "Kilo: do not NAT packets destined for the local private IP" -j RETURN
-A KILO-NAT -m comment --comment "Kilo: NAT remaining packets" -j MASQUERADE
-A KILO-NAT -s 172.30.12.0/22 -d 172.28.129.1/32 -m comment --comment "Kilo: do not NAT packets from local pod subnet to peers" -j RETURN
-A KILO-NAT -s 172.30.12.0/22 -d 192.168.1.0/24 -m comment --comment "Kilo: do not NAT packets from local pod subnet to peers" -j RETURN
-A KILO-NAT -s 172.30.12.0/22 -d 172.30.4.0/22 -m comment --comment "Kilo: do not NAT packets from local pod subnet to remote pod subnets" -j RETURN
-A KILO-NAT -s 172.30.12.0/22 -d 172.30.0.0/22 -m comment --comment "Kilo: do not NAT packets from local pod subnet to remote pod subnets" -j RETURN
-A KILO-NAT -s 172.30.12.0/22 -d 172.30.8.0/22 -m comment --comment "Kilo: do not NAT packets from local pod subnet to remote pod subnets" -j RETURN
-A KILO-NAT -s 172.30.12.0/22 -d 172.30.16.0/22 -m comment --comment "Kilo: do not NAT packets from local pod subnet to remote pod subnets" -j RETURN
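That reading looks right: iptables evaluates a chain top to bottom, and the MASQUERADE rule has no match criteria, so every packet reaching it is masqueraded and traversal ends; the six RETURN rules appended after it can never match. Appending the catch-all MASQUERADE last would restore them. A toy first-match walk over the targets in the order listed above (for a packet not matched by the first three RETURN rules):

```shell
# First-match walk over the KILO-NAT targets in the order listed above.
targets="RETURN RETURN RETURN MASQUERADE RETURN RETURN RETURN RETURN RETURN RETURN"
pos=0
for t in $targets; do
  pos=$((pos + 1))
  if [ "$t" = "MASQUERADE" ]; then
    break  # catch-all rule: matches everything, traversal stops here
  fi
done
total=$(echo "$targets" | wc -w)
echo "traversal ends at rule $pos; $((total - pos)) rules below are unreachable"
```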
Here's the stacktrace:
{"caller":"mesh.go:617","component":"kilo","error":"failed to delete rule: failed to clear iptables chain: running [/sbin/iptables -t filter -F KILO-IPIP --wait]: exit status 4: iptables: Resource temporarily unavailable.\n","level":"error","ts":"2020-04-19T18:33:48.316238028Z"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x104fe51]
goroutine 42 [running]:
github.com/squat/kilo/pkg/iptables.(*Controller).reconcile(0xc00011b810, 0x0, 0x0)
/kilo/pkg/iptables/iptables.go:246 +0xe1
github.com/squat/kilo/pkg/iptables.(*Controller).Run.func1(0xc00011b810, 0xc000088180)
/kilo/pkg/iptables/iptables.go:230 +0x11f
created by github.com/squat/kilo/pkg/iptables.(*Controller).Run
/kilo/pkg/iptables/iptables.go:222 +0xc3
After setting up Kilo on k3s, I noticed that although Kilo is running, the istio-ingressgateway pod can't seem to reach the istiod pod. In the logs you'll notice that the connection is refused.
I'm using coredns with kilo's CNI (no flannel).
Any advice would be greatly appreciated!
2020-06-12T23:29:50.578577Z warning envoy config [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:54] Unable to establish new stream
2020-06-12T23:29:50.989714Z warn cache resource:default request:54d7ab7d-1db1-4fb5-ae9b-1b33c85912ee CSR failed with error: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup istiod.istio-system.svc on 10.43.0.10:53: read udp 10.42.2.9:51000->10.43.0.10:53: read: connection refused", retry in 6400 millisec
2020-06-12T23:29:50.991787Z warning envoy config [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:92] StreamSecrets gRPC config stream closed: 14, connection error: desc = "transport: Error while dialing dial tcp: lookup istiod.istio-system.svc on 10.43.0.10:53: read udp 10.42.2.9:51000->10.43.0.10:53: read: connection refused"
2020-06-12T23:29:50.990179Z error citadelclient Failed to create certificate: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup istiod.istio-system.svc on 10.43.0.10:53: read udp 10.42.2.9:51000->10.43.0.10:53: read: connection refused"
2020-06-12T23:29:50.990266Z error cache resource:default request:54d7ab7d-1db1-4fb5-ae9b-1b33c85912ee CSR retrial timed out: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup istiod.istio-system.svc on 10.43.0.10:53: read udp 10.42.2.9:51000->10.43.0.10:53: read: connection refused"
2020-06-12T23:29:50.990327Z error cache resource:default failed to generate secret for proxy: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup istiod.istio-system.svc on 10.43.0.10:53: read udp 10.42.2.9:51000->10.43.0.10:53: read: connection refused"
2020-06-12T23:29:50.990366Z error sds resource:default Close connection. Failed to get secret for proxy "router~10.42.2.9~istio-ingressgateway-74d4d8d459-wlt8p.istio-system~istio-system.svc.cluster.local" from secret cache: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup istiod.istio-system.svc on 10.43.0.10:53: read udp 10.42.2.9:51000->10.43.0.10:53: read: connection refused"
Trying to set up Kilo on a k3s cluster with three local nodes and one remote node.
All four nodes are showing this in the logs over and over:
{"caller":"mesh.go:639","component":"kilo","error":"address already in use","level":"error","ts":"2020-04-08T02:37:36.742997397Z"}
{"caller":"mesh.go:639","component":"kilo","error":"address already in use","level":"error","ts":"2020-04-08T02:37:44.982264977Z"}
{"caller":"mesh.go:639","component":"kilo","error":"address already in use","level":"error","ts":"2020-04-08T02:39:57.25050876Z"}
{"caller":"mesh.go:639","component":"kilo","error":"address already in use","level":"error","ts":"2020-04-08T02:39:57.402066518Z"}
{"caller":"mesh.go:639","component":"kilo","error":"address already in use","level":"error","ts":"2020-04-08T02:39:59.314872301Z"}
{"caller":"mesh.go:639","component":"kilo","error":"address already in use","level":"error","ts":"2020-04-08T02:40:09.406198759Z"}
{"caller":"mesh.go:639","component":"kilo","error":"address already in use","level":"error","ts":"2020-04-08T02:40:16.246194963Z"}
{"caller":"mesh.go:639","component":"kilo","error":"address already in use","level":"error","ts":"2020-04-08T02:40:19.455699414Z"}
I'm not really sure what address is even in use so I can't debug on my end.
Any guidance?
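"address already in use" here usually points at the WireGuard listen port or the interface address already being taken, e.g. by a leftover wg interface from a previous run or another WireGuard instance on the node. A quick check sketch (the port is the WireGuard default, an assumption):

```shell
port=51820  # default WireGuard listen port; adjust if overridden
# Look for an existing UDP listener on the port; also worth checking
# `ip link` for leftover WireGuard interfaces.
listeners=$( (ss -lun 2>/dev/null || true) | grep ":$port" || true)
if [ -n "$listeners" ]; then
  echo "UDP $port is already bound:"
  echo "$listeners"
else
  echo "no UDP listener found on $port; check ip link for stale wg interfaces"
fi
```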
Hi @squat,
I have a node in location B with all its peers created correctly, but its own peer is not being created on the leader side in location A, with this message showing in the Kilo leader's logs:
{"caller":"mesh.go:382","component":"kilo","event":"update","level":"debug","msg":"received incomplete node","node":{"Endpoint":{"DNS":"","IP":"192.168.50.13","Port":51820},"Key":"MHFPNW0zR3oxMFBlRkV0UUNRNGcxNTNvcTFnLzZnbE15WUJ1K2Q2d1JDTT0=","InternalIP":{"IP":"192.168.50.13","Mask":"////AA=="},"LastSeen":1594277590,"Leader":false,"Location":"admin-c7z87-oce-systel-ne-1","Name":"admin-c7z87-oce-systel-ne-1","PersistentKeepalive":5,"Subnet":{"IP":"10.42.39.0","Mask":"////AA=="},"WireGuardIP":{"IP":"10.40.0.2","Mask":"//8AAA=="}},"ts":"2020-07-11T10:52:59.048513869Z"}
Do you know what may be causing this?
Thanks in advance.
Hi Lucas!
Thank you for this great project!
I am trying to set up a multi provider k3s cluster using kilo. The machines roughly look like:
- oci location - 2 machines (both only have local IP addresses assigned to the local interfaces; external IPs are managed via the internet gateway of the cloud provider)
- gcp location - 1 machine
I haven't got to doing a multi-provider setup yet. I am still trying to get the 2 machines in oci to talk to each other.
I am trying to use Kilo as the CNI directly; the network configuration is as follows:
oci-master - internal IP 10.1.20.3, external IP <ext-master-ip> (using placeholders here)
oci-worker - internal IP 10.1.20.2, external IP <ext-worker-ip>
The machines can ping each other directly using the 10.1.20.x addresses.
My issue is that, once they come up, I can't get the pods launched on each machine to talk to each other.
I can ping pods from the machine that runs them, but not from master -> worker or vice versa.
on my laptop
> kubectl get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
my-nginx-74f94c7795-j7kzv 1/1 Running 0 99m 10.42.1.5 oci-worker <none> <none>
but on oci-master
> ping 10.42.1.5
PING 10.42.1.5 (10.42.1.5): 56 data bytes
^C--- 10.42.1.5 ping statistics ---
4 packets transmitted, 0 packets received, 100% packet loss
I think I should be able to reach every pod from any node in the cluster (AFAIK).
Please let me know if there is additional info that would be helpful to include!
I provisioned the machines as follows:
oci-master
:
k3sup install \
--ip <ext-master-ip> \
--k3s-version 'v1.17.0+k3s.1' \
--k3s-extra-args '--no-flannel --no-deploy metrics-server --no-deploy servicelb --no-deploy traefik --default-local-storage-path /k3s-local-storage --node-name oci-master --node-external-ip <ext-master-ip> --node-ip 10.1.20.3'
kubectl annotate node oci-master \
kilo.squat.ai/force-external-ip="<ext-ip-master>/32" \
kilo.squat.ai/force-internal-ip="10.1.20.3/24" \
kilo.squat.ai/location="oci" \
kilo.squat.ai/leader="true"
oci-worker
:
k3sup join \
--ip <ext-worker-ip> \
--server-ip <ext-master-ip> \
--k3s-version 'v1.17.0+k3s.1' \
--k3s-extra-args '--no-flannel --node-name oci-worker --node-external-ip <ext-worker-ip> --node-ip 10.1.20.2'
kubectl annotate node oci-worker \
kilo.squat.ai/force-external-ip="<ext-worker-ip>/32" \
kilo.squat.ai/force-internal-ip="10.1.20.2/24" \
kilo.squat.ai/location="oci"
Finally, setting up Kilo:
kubectl apply -f k3s-kilo.yaml
I had to make the same changes suggested in #11 and #27 to make sure that the Kilo pods have the correct permissions, but I was able to get the pods to come up correctly.
I can see logs like these in the pod logs (with log-level=debug):
on oci-master
{"caller":"mesh.go:410","component":"kilo","event":"update","level":"debug","msg":"syncing nodes","ts":"2020-02-09T09:12:46.095414595Z"}
{"caller":"mesh.go:412","component":"kilo","event":"update","level":"debug","msg":"processing local node","node":{"ExternalIP":{"IP":"<ext-ip-master>","Mask":"/////w=="},"Key":"<key>","InternalIP":{"IP":"10.1.20.3","Mask":"////AA=="},"LastSeen":1581239566,"Leader":true,"Location":"oci","Name":"oci-master","Subnet":{"IP":"10.42.0.0","Mask":"////AA=="},"WireGuardIP":{"IP":"10.4.0.1","Mask":"//8AAA=="}},"ts":"2020-02-09T09:12:46.095454981Z"}
on oci-worker
{"caller":"mesh.go:410","component":"kilo","event":"update","level":"debug","msg":"syncing nodes","ts":"2020-02-09T10:44:48.564218597Z"}
{"caller":"mesh.go:508","component":"kilo","level":"debug","msg":"successfully checked in local node in backend","ts":"2020-02-09T10:45:18.478913052Z"}
{"caller":"mesh.go:675","component":"kilo","level":"debug","msg":"local node is not the leader","ts":"2020-02-09T10:45:18.4804814Z"}
{"caller":"mesh.go:410","component":"kilo","event":"update","level":"debug","msg":"syncing nodes","ts":"2020-02-09T10:45:18.481320232Z"}
{"caller":"mesh.go:412","component":"kilo","event":"update","level":"debug","msg":"processing local node","node":{"ExternalIP":{"IP":"<ext-ip-worker>","Mask":"/////w=="},"Key":"<key>","InternalIP":{"IP":"10.1.20.2","Mask":"////AA=="},"LastSeen":1581245118,"Leader":false,"Location":"oci","Name":"oci-worker","Subnet":{"IP":"10.42.1.0","Mask":"////AA=="},"WireGuardIP":null},"ts":"2020-02-09T10:45:18.481367592Z"}
oci-master
> ifconfig
ens3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.1.20.3 netmask 255.255.255.0 broadcast 10.1.20.255
inet6 fe80::200:17ff:fe02:2f31 prefixlen 64 scopeid 0x20<link>
ether 00:00:17:02:2f:31 txqueuelen 1000 (Ethernet)
RX packets 945623 bytes 2361330833 (2.3 GB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 851708 bytes 304538145 (304.5 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
kilo0: flags=209<UP,POINTOPOINT,RUNNING,NOARP> mtu 1420
inet 10.4.0.1 netmask 255.255.0.0 destination 10.4.0.1
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 1000 (UNSPEC)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 1354843 bytes 457783326 (457.7 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 1354843 bytes 457783326 (457.7 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
tunl0: flags=193<UP,RUNNING,NOARP> mtu 8980
inet 10.42.0.1 netmask 255.255.255.255
tunnel txqueuelen 1000 (IPIP Tunnel)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 5 bytes 420 (420.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> ip route
default via 10.1.20.1 dev ens3
default via 10.1.20.1 dev ens3 proto dhcp src 10.1.20.3 metric 100
10.1.20.0/24 dev ens3 proto kernel scope link src 10.1.20.3
10.4.0.0/16 dev kilo0 proto kernel scope link src 10.4.0.1
10.42.1.0/24 via 10.1.20.2 dev tunl0 proto static onlink
oci-worker
> ifconfig
ens3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.1.20.2 netmask 255.255.255.0 broadcast 10.1.20.255
inet6 fe80::200:17ff:fe02:1682 prefixlen 64 scopeid 0x20<link>
ether 00:00:17:02:16:82 txqueuelen 1000 (Ethernet)
RX packets 231380 bytes 781401888 (781.4 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 221393 bytes 29979034 (29.9 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
kube-bridge: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.42.1.1 netmask 255.255.255.0 broadcast 0.0.0.0
inet6 fe80::38f7:34ff:fed9:897e prefixlen 64 scopeid 0x20<link>
ether 26:d7:aa:ce:37:f8 txqueuelen 1000 (Ethernet)
RX packets 21865 bytes 10732037 (10.7 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 19269 bytes 7046706 (7.0 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 78258 bytes 29977684 (29.9 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 78258 bytes 29977684 (29.9 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
tunl0: flags=193<UP,RUNNING,NOARP> mtu 8980
inet 10.42.1.1 netmask 255.255.255.255
tunnel txqueuelen 1000 (IPIP Tunnel)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 10 bytes 840 (840.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
veth5ee1a633: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet6 fe80::24d7:aaff:fece:37f8 prefixlen 64 scopeid 0x20<link>
ether 26:d7:aa:ce:37:f8 txqueuelen 0 (Ethernet)
RX packets 12748 bytes 10219673 (10.2 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 9890 bytes 4818258 (4.8 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
veth965708c2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet6 fe80::9cfc:9dff:fef1:dc7a prefixlen 64 scopeid 0x20<link>
ether 9e:fc:9d:f1:dc:7a txqueuelen 0 (Ethernet)
RX packets 22 bytes 1636 (1.6 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 21 bytes 1754 (1.7 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
vethd34408af: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet6 fe80::5077:76ff:fe3a:1b01 prefixlen 64 scopeid 0x20<link>
ether 52:77:76:3a:1b:01 txqueuelen 0 (Ethernet)
RX packets 9091 bytes 816526 (816.5 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 9442 bytes 2233086 (2.2 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> ip route
default via 10.1.20.1 dev ens3
default via 10.1.20.1 dev ens3 proto dhcp src 10.1.20.2 metric 100
10.1.20.0/24 dev ens3 proto kernel scope link src 10.1.20.2
10.4.0.1 via 10.1.20.3 dev tunl0 proto static onlink
10.42.0.0/24 via 10.1.20.3 dev tunl0 proto static onlink
10.42.1.0/24 dev kube-bridge proto kernel scope link src 10.42.1.1
169.254.0.0/16 dev ens3 proto dhcp scope link src 10.1.20.2 metric 100
Interestingly, setting up another machine in a different region, I was able to see that the WireGuard interfaces come up with the correct allowed-ips, and I was even able to ping 10.1.20.2 (oci-worker) directly over WireGuard.
Presumably that goes gcp-worker -> oci-master (leader for the oci location) -> oci-worker.
Hi squat!
I encountered a crash on the latest develop (c93fa1e):
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x123ae3e]
goroutine 64 [running]:
github.com/squat/kilo/pkg/route.(*Table).Run.func1(0xc000420390, 0xc00007f2c0, 0xc00007e780)
/kilo/pkg/route/route.go:96 +0x24e
created by github.com/squat/kilo/pkg/route.(*Table).Run
/kilo/pkg/route/route.go:80 +0x1d9
It happens when I run:
sudo systemctl restart systemd-networkd
The event that is breaking kilo is (with my ipv6 gateway redacted):
{Ifindex: 2 Dst: <nil> Src: <nil> Gw: 0123:4567:89ab:cdef:: Flags: [] Table: 254}
Once kilo restarts, the old interface still exists so kilo tries to create a second interface and fails miserably unless I intervene:
{"caller":"mesh.go:666","component":"kilo","error":"address already in use","level":"error","ts":"2019-08-27T15:31:41.993388636Z"}
There are two issues here:
1. Kilo crashes on this route event.
2. Why does Kilo try to create a new kilo0 interface every time instead of reusing the existing one?
every time?I have two subnets in two different locations I would like to connect. Each subnet has two nodes each. I use k3s, so I tried applying the kilo-k3s.yaml
manifest to connect one node in each location. When I tried running pods that pinged each other on different nodes, the pods could not connect to other nodes. So, I tried enabling the --mesh-granularity=full
flag and applied the manifest again. It went through without any problems and my pods could talk to each other.
Any idea how I can debug the problem? I would like to only have one node in each logical group connected.
Hi squat!
Is there a way to completely disable the private IP? I have hosts that do not have a private interface. Currently I'm forcing the private IP to a random IP that doesn't exist, but it is still added to the allowed-ips list.
Is it correct that the video, https://www.youtube.com/watch?v=iPz_DAOOCKA, isn't referenced anywhere?
If so, that would be a shame, shall I make a PR?
Gravity is a platform that allows us to build K8s clusters declaratively, and is a pretty powerful tool I've started experimenting with as part of my devops toolkit.
It has its own implementation of wireguard (wormhole) that helps create a mesh, similar to kilo, but kilo provides easy peering functionality with kgctl
.
I'd love to start a conversation about how we can make a .yaml deployment for Gravity clusters. I'm able to get Kilo up and running on Gravity pretty seamlessly; the only issue right now is that although the WireGuard kilo interface shows up, it appears that kilo/kgctl is never able to pull the nodes and properly apply the WireGuard config.
Hi Squat!
It would be great if it were possible to run kg
as a peer. Basically, I want a process that watches Kubernetes for updates and auto-updates my WireGuard interface with the new routes.
I've hacked it into the codebase here:
master...SerialVelocity:kg-peer
but it isn't complete or very pretty code!
Hi
I am trying to apply Kilo to a K8s cluster, but an error appears when I run kubectl apply -f kilo.yaml.
root@VmOnAWS# wg show all
(WireGuard Server, 10.10.10.10
)
interface: wg0
public key: (masked)
private key: (hidden)
listening port: 51820
peer: (masked)
endpoint: 34.97.48.xxx:51820
allowed ips: 10.10.10.12/32
latest handshake: 1 minute, 30 seconds ago
transfer: 297.40 MiB received, 1.47 GiB sent
root@VmOnGCP# wg show all
(WireGuard Client, 10.10.10.12
)
interface: wg0
public key: (masked)
private key: (hidden)
listening port: 51820
peer: (masked)
endpoint: 15.164.170.xxx:51820
allowed ips: 10.10.10.10/32
latest handshake: 1 minute, 24 seconds ago
transfer: 1.47 GiB received, 297.44 MiB sent
persistent keepalive: every 25 seconds
# kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-172-31-17-xx NotReady master 24m v1.15.0 172.31.17.xxx <none> Ubuntu 18.04.2 LTS 4.15.0-1044-aws docker://18.9.7
nodes-wg2 NotReady <none> 21m v1.15.0 10.174.0.xxx <none> Ubuntu 18.04.2 LTS 4.15.0-1037-gcp docker://18.9.7
ip-172-31-17-xx is on AWS. nodes-wg2 is on GCP. Both nodes are NotReady.
.# kubectl annotate node ip-172-31-17-xx kilo.squat.ai/location="aws"
node/ip-172-31-17-xx annotated
# kubectl annotate node nodes-wg2 kilo.squat.ai/location="gcp"
node/nodes-wg2 annotated
# kubectl apply -f https://raw.githubusercontent.com/squat/kilo/master/manifests/kilo-kubeadm.yaml
configmap/kilo created
serviceaccount/kilo created
clusterrole.rbac.authorization.k8s.io/kilo created
clusterrolebinding.rbac.authorization.k8s.io/kilo created
error: error validating "https://raw.githubusercontent.com/squat/kilo/master/manifests/kilo-kubeadm.yaml": error validating data: ValidationError(DaemonSet.spec): missing required field "selector" in
io.k8s.api.apps.v1.DaemonSetSpec; if you choose to ignore these errors, turn validation off with --validate=false
# kubectl apply -f https://raw.githubusercontent.com/squat/kilo/master/manifests/kilo-kubeadm.yaml --validate=false
configmap/kilo unchanged
serviceaccount/kilo unchanged
clusterrole.rbac.authorization.k8s.io/kilo unchanged
clusterrolebinding.rbac.authorization.k8s.io/kilo unchanged
The DaemonSet "kilo" is invalid: spec.template.metadata.labels: Invalid value: map[string]string{"app.kubernetes.io/name":"kilo"}: `selector` does not match template `labels`
Is there a way to resolve this issue/problem?
If I add the kilo.squat.ai/persistent-keepalive
annotation, or update its value, on an existing node already running Kilo, the WireGuard config is not updated.
sudo kubectl apply -f https://raw.githubusercontent.com/squat/kilo/master/manifests/kilo-k3s-flannel.yaml
serviceaccount/kilo created
clusterrole.rbac.authorization.k8s.io/kilo created
clusterrolebinding.rbac.authorization.k8s.io/kilo created
daemonset.apps/kilo created
sudo kubectl logs -f kilo-cz64w -n kube-system
failed to create Kubernetes config: Error loading config file "/etc/kubernetes/kubeconfig": read /etc/kubernetes/kubeconfig: is a directory
I think the problem is with kilo-k3s-flannel.yaml:99
.
Hi
First of all thank you for your awesome work with this project, much appreciated.
We are currently testing Kilo with clusters that we deploy with RKE and later import into Rancher. We use it as the CNI provider in a full-mesh layout.
We used kilo-k3s.yaml as our reference and had to lower the mtu
setting in the cni-conf.json ConfigMap to 1300. The rancher-node-agent tries to open a wss://
connection to the Rancher server, which did not succeed with the original 1420 setting. The value 1300 was just our first lucky shot; it might be worth testing how high it can go, but we have had no problems with this setting so far. Do you think this is worth documenting in this project? If yes, could you suggest a good place (maybe another file in manifests
) so that I can suggest a PR?
While setting up kilo on a k3s cluster I noticed that it uses -kubeconfig
, or -master
to get the config that is used when interfacing with the cluster. This code can be seen here.
This seems like a security problem - why should kilo require access to my kubeconfig, which contains credentials that have the power to do anything to the cluster? Moreover, it seems redundant: I looked through kilo-k3s-flannel.yaml
(which is what I used to get it working) and noticed that a service account is created for kilo with all of the permissions it should need.
This example (see main.go) uses this function to get the config. Can kilo not use this function instead?
I'm new to interfacing applications with Kubernetes clusters, so my apologies if I'm missing something. If it would be welcome, I'd be happy to submit a pull request for this.
Will Kilo pick up an existing Wireguard interface on the host?
To be clear, I'd consider it a feature (though it's conceivably an anti-feature I suppose) - it would mean having it bound to a physical interface in a different network namespace was automatically supported.
Hi
I've seen this commit, which adds support for ARM architectures, but it doesn't work. It returns this:
no matching manifest for linux/arm/v7 in the manifest list entries
Am I doing anything wrong?
Is it possible to specify the internal IP that a node should use?
Or, if not that granular (or not possible/desirable for some reason?) - a CIDR block for the location?
Project looks great by the way - I'm currently running Wireguard on the hosts and using kube-router for CNI, so tempted to collapse that into just kilo.
kubectl apply -f https://raw.githubusercontent.com/squat/modulus/master/wireguard/daemonset.yaml
error: unable to recognize "https://raw.githubusercontent.com/squat/modulus/master/wireguard/daemonset.yaml": no matches for kind "DaemonSet" in version "extensions/v1beta1"
Hi @squat, some new feedback:
1. How is a node's Kilo public key saved, and when does it change?
At home I connect to the cluster over a WireGuard VPN. After I moved my cluster's master from ali-vm1 to hw-vm1 (the etcd data unchanged, just moved), both nodes were recreated (k3s in Docker).
Then I found that the Kilo public key of each node had changed, while all the other info stayed the same.
the former key:
[Peer]
AllowedIPs = 7.0.0.0/24, 172.16.168.255/32, 10.4.0.1/32
Endpoint = 47.98.xxx.xxx:51820
PersistentKeepalive = 10
PublicKey = nOWT---------N1GnXXz+0UiseSOYOrq14Nz4=
[Peer]
AllowedIPs = 7.0.1.0/24, 192.168.0.105/32, 10.4.0.2/32
Endpoint = 139.159.xxx.xxx:51820
PersistentKeepalive = 10
PublicKey = fQzIcE5--------------MZJdH9wzq9eKGogDO9fWmc=
the new key:
[Peer]
AllowedIPs = 7.0.0.0/24, 172.16.168.255/32, 10.4.0.1/32
Endpoint = 47.98.xxx.xxx:51820
PersistentKeepalive = 10
PublicKey = b/q4LgcU--------------Hpj/fJVzjfn3bNygZE2cwqwE=
[Peer]
AllowedIPs = 7.0.1.0/24, 192.168.0.105/32, 10.4.0.2/32
Endpoint = 139.159.xxx.xxx:51820
PersistentKeepalive = 10
PublicKey = 9lKn---------2zwlrCJjcXXthXnKWWRrp8LWlU=
2. Kilo's network devices aren't cleaned up automatically when a node restarts/recreates or Kilo is redeployed.
When a node reboots or is recreated, Kilo is recreated too, but the former kilo0 device is not cleaned up and its routes still exist; this causes connections between nodes to fail.
1) I clean up by hand with:
ifconfig kilo0 down
ip link delete dev kilo0
2) But wg still shows up in the process list. How do I clean up these processes?
[root@ali-vm1 ~]# ps -ef |grep wg
root 7659 2 0 14:17 ? 00:00:00 [wg-crypt-kilo0]
root 10670 2 0 19:20 ? 00:00:00 [kworker/u2:1-wg]
root 15086 2 0 16:15 ? 00:00:00 [wg-crypt-kilo2]
root 20223 2 0 19:40 ? 00:00:01 [kworker/0:2-wg-]
root 22389 2 0 19:44 ? 00:00:01 [kworker/0:4-wg-]
root 25012 2 0 19:50 ? 00:00:01 [kworker/0:5-wg-]
root 27801 26236 0 19:55 pts/0 00:00:00 grep --color=auto wg
Stacktrace:
panic: runtime error: invalid memory address or nil pointer dereference
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x10985f3]
goroutine 11 [running]:
github.com/squat/kilo/pkg/iptables.deleteFromIndex(0x0, 0xc00051dee0, 0xc00051ded8, 0xc0003660c0)
/kilo/pkg/iptables/iptables.go:226 +0x93
github.com/squat/kilo/pkg/iptables.(*Controller).CleanUp(0xc00051dec0, 0x0, 0x0)
/kilo/pkg/iptables/iptables.go:265 +0x83
github.com/squat/kilo/pkg/mesh.(*Mesh).cleanUp(0xc0002066e0)
/kilo/pkg/mesh/mesh.go:708 +0x47
panic(0x138a7a0, 0x26d5c10)
/usr/local/go/src/runtime/panic.go:522 +0x1b5
github.com/squat/kilo/pkg/iptables.(*Controller).Set(0xc00051dec0, 0xc000232f00, 0x13, 0x14, 0x0, 0x0)
/kilo/pkg/iptables/iptables.go:243 +0x364
github.com/squat/kilo/pkg/mesh.(*Mesh).applyTopology(0xc0002066e0)
/kilo/pkg/mesh/mesh.go:635 +0x12e0
github.com/squat/kilo/pkg/mesh.(*Mesh).syncNodes(0xc0002066e0, 0xc00032bc40)
/kilo/pkg/mesh/mesh.go:445 +0x751
github.com/squat/kilo/pkg/mesh.(*Mesh).Run(0xc0002066e0, 0x0, 0x0)
/kilo/pkg/mesh/mesh.go:350 +0x735
main.Main.func4(0x0, 0x0)
/kilo/cmd/kg/main.go:218 +0x168
github.com/oklog/run.(*Group).Run.func1(0xc0000a8300, 0xc0003d1220, 0xc000414c30)
/kilo/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/oklog/run.(*Group).Run
/kilo/vendor/github.com/oklog/run/group.go:37 +0xbe
Looks like this is wrong:
Lines 225 to 230 in 3facc9f
I think (*rules)[j]
should be (*rules)[i+j]
. You've got this in multiple places in the file. It might be worth switching to a standard for loop to make it a little clearer.
What if nodes have a dynamic public IP? Is Kilo intelligent enough to auto-refresh the node configuration with the new public IP?
Is there any way to auto-detect datacenters, so that nodes in the same DC can route directly without WireGuard (ip route)?
This only works on Kubernetes, I suppose.
Check out my project; maybe it can help you with ideas ;) Good work!!
https://github.com/segator/wireguard-dynamic
Hi - I came across your project and was curious: would it be possible to run the nodes in client mode without an exposed port? I know this is possible in WireGuard, and it surprises me that I can't seem to do it with this or wormhole.
I've been looking into using Kubernetes in an at-edge setting. In this type of deployment I'd be setting up nodes behind other people's NATed networks. Kubernetes' API and CRDs make a lot of things I need to do (DaemonSets, service meshes, config management, etc.) very simple. WireGuard would provide a transparent security layer. In my application I don't mind the high latency of communications with the API server. One thing I don't control in my deployment is the router at each location: I can guarantee there will be a network connection able to reach my API server, but I cannot forward ports.
I noticed in your documentation that you must provide at least one public IP in each region. Is there some way to use Kilo while avoiding this constraint? Where does this constraint come from? Is it an inherent feature of WG?
Feature request: add a controller with an HTTP endpoint that serves a web view of the topology graph.
Is it possible to use AWS VPC peering between two regions? I have two separate k8s clusters in different regions, and both VPCs are connected via VPC peering.
Can I use the node internal IP for WireGuard connectivity?
There are no releases of kgctl on GitHub. I suppose it has to be compiled from source?
I have a cluster in two locations, A and B, with one server in each location successfully connected with Kilo.
Each server has one pod, and both pods are connected to each other.
The server in location A has the kilo.squat.ai/leader=true
annotation, but if I add a new server in location A, also with this annotation, Kilo elects the new server as the leader, updating the wg configuration on the server in location B and dropping the current connection between the pod on the old leader and the pod on the server in location B.
Is microk8s compatibility on the roadmap? microk8s uses flannel as its CNI by default; I've tested Kilo on it but with no success.
The Kilo pods start well and no errors are printed in the logs (with log-level=all), but the sudo wg
command doesn't show anything (no public key nor endpoint), and on the node the kilo.squat.ai/wireguard-ip
annotation shows no IP.
Can you please take a look at microk8s? I think it is interesting now that the microk8s stable version has the clustering option. Thanks in advance.
So it appears something has changed with k8s 1.8.1 or flannel.
Running in full mesh, one node sees the two others, while the two others only see one another and not all three.
All three nodes have a public IP interface, and after applying kilo-kubeadm-flannel it appears pods lose their valid running state and become "unreachable":
default mailu-roundcube-7b49b94446-ltlcz 0/1 OOMKilled 0 175m
default mailu-rspamd-685df75db8-85thh 0/1 Running 0 175m
Will tail 3 logs...
kilo-djr4w
kilo-ds78k
kilo-fxw8w
[kilo-ds78k] {"caller":"mesh.go:631","component":"kilo","level":"info","msg":"WireGuard configurations are different","ts":"2020-04-22T03:34:40.710140705Z"}
[kilo-ds78k] {"caller":"mesh.go:631","component":"kilo","level":"info","msg":"WireGuard configurations are different","ts":"2020-04-22T03:35:10.781458565Z"}
[kilo-djr4w] {"caller":"main.go:236","msg":"caught interrupt; gracefully cleaning up; see you next time!","ts":"2020-04-22T03:37:13.310050264Z"}
[kilo-djr4w] {"caller":"mesh.go:696","component":"kilo","error":"failed to clean up node backend: failed to patch node: nodes "node2" is forbidden: User "system:serviceaccount:kube-system:kilo" cannot patch resource "nodes" in API group "" at the cluster scope","level":"error","ts":"2020-04-22T03:37:13.311905223Z"}
[kilo-ds78k] {"caller":"main.go:236","msg":"caught interrupt; gracefully cleaning up; see you next time!","ts":"2020-04-22T03:35:40.370479773Z"}
[kilo-fxw8w] {"caller":"main.go:236","msg":"caught interrupt; gracefully cleaning up; see you next time!","ts":"2020-04-22T03:37:13.280341465Z"}
[kilo-fxw8w] {"caller":"mesh.go:696","component":"kilo","error":"failed to clean up node backend: failed to patch node: nodes "node3" is forbidden: User "system:serviceaccount:kube-system:kilo" cannot patch resource "nodes" in API group "" at the cluster scope","level":"error","ts":"2020-04-22T03:37:13.284032892Z"}
[kilo-ds78k] {"caller":"mesh.go:696","component":"kilo","error":"failed to clean up node backend: failed to patch node: Unauthorized","level":"error","ts":"2020-04-22T03:35:40.383739615Z"}
Is it possible to run it alongside Calico? Has anyone tried it?
Hey,
first, thanks for this awesome project! I am just getting started with it, and I was able to get it up and running quite easily.
I only want to use it within a Rancher 2 / Canal (Flannel + Calico) setup, to access cluster-local services via a VPN from an external site.
I ran the image with the following options:
image: squat/kilo
args:
- --hostname=$(NODE_NAME)
- --cni=false
- --encapsulation=never
- --compatibility=flannel
- --local=false
(as I am running in-cluster, I don't need the kubeconfig
option).
Then, I added a new peer and extracted its config via kgctl
.
-> Observation: in my case the allowed IPs look like: 10.4.0.1/32, 10.19.0.5/32, 10.19.0.6/32, 10.19.0.7/32, 10.42.0.0/24, 10.42.1.0/24, 10.42.2.0/24, 10.43.0.0/24
However, the services are in the 10.43
IP range; so they are not included in the allowed IPs.
Do you have any hint how to debug further why the Service IP range is not included in the allowed IPs?
Thank you and all the best ❤️
Sebastian
I've followed README.md and docs/vpn.md for the following settings.
apiVersion: kilo.squat.ai/v1alpha1
kind: Peer
metadata:
name: sam
spec:
allowedIPs:
- 10.5.0.1/32 # Example IP address on the peer's interface.
publicKey: FLS------hzpNFbJ/JUiN4He8pTxLmFC5ZtQLK5Oc0A= #- replace 6 char
persistentKeepalive: 10
[root@ali-vm1 ~]# route -n |grep kilo
7.0.1.0 10.4.0.2 255.255.255.0 UG 0 0 0 kilo0
10.4.0.0 0.0.0.0 255.255.0.0 U 0 0 0 kilo0
10.5.0.1 0.0.0.0 255.255.255.255 UH 0 0 0 kilo0
192.168.0.105 10.4.0.2 255.255.255.255 UGH 0 0 0 kilo0
[root@hw-vm1 ~]# route -n |grep kilo
7.0.0.0 10.4.0.1 255.255.255.0 UG 0 0 0 kilo0
10.4.0.0 0.0.0.0 255.255.0.0 U 0 0 0 kilo0
10.5.0.1 0.0.0.0 255.255.255.255 UH 0 0 0 kilo0
172.16.168.255 10.4.0.1 255.255.255.255 UGH 0 0 0 kilo0
root@deb10:/home/sam# lsmod |grep wire
wireguard 221184 0
ip6_udp_tunnel 16384 2 wireguard,vxlan
udp_tunnel 16384 2 wireguard,vxlan
root@deb10:/home/sam# ip a |grep wg
4455: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default qlen 1000
inet 10.5.0.1/32 scope global wg0
root@deb10:/home/sam# route -n |grep wg
7.0.1.0 0.0.0.0 255.255.255.0 U 0 0 0 wg0
10.4.0.1 0.0.0.0 255.255.255.255 UH 0 0 0 wg0
10.4.0.2 0.0.0.0 255.255.255.255 UH 0 0 0 wg0
172.16.168.255 0.0.0.0 255.255.255.255 UH 0 0 0 wg0
192.168.0.105 0.0.0.0 255.255.255.255 UH 0 0 0 wg0
dev wg0 (ListenPort = 5555, no firewall; running on deb10):
root@deb10:/home/sam# cat /etc/wireguard/wg0.conf
[Interface]
Address = 10.5.0.1/32
PrivateKey = +Dsm------FVL3e83lTIVC9dI1rYwjEI7ljI9wbyFWk= #replace 6 char
ListenPort = 5555
peer.ini:
[root@(⎈ |default:default) ~]$ kgctl showconf peer sam
[Peer]
AllowedIPs = 7.0.0.0/24, 172.16.168.255/32, 10.4.0.1/32
Endpoint = 47.98.xxx.xxx:51820
PersistentKeepalive = 0
PublicKey = nOW------dKxE0NDuCxN1GnXXz+0UiseSOYOrq14Nz4=
[Peer]
AllowedIPs = 7.0.1.0/24, 192.168.0.105/32, 10.4.0.2/32
Endpoint = 139.159.xxx.xxx:51820
PersistentKeepalive = 0
PublicKey = fQz------H70oWHUWzSGiMZJdH9wzq9eKGogDO9fWmc=
IFACE=wg0
wg-quick up $IFACE
wg setconf $IFACE peer.ini
ip route add 10.4.0.2/32 dev wg0
...
Current annotation: kilo.squat.ai/force-external-ip="a.b.c.d/32"
But sometimes we only get a dynamic IP from the network provider, so we use a DDNS domain to obtain the correct WAN IP.
How ready is this project to be used in production environments?
OK, enlighten me again! The docs are sketchy. For a VPN, now that Kilo is running on 3 nodes, I'd like to
connect my remote laptop for administration and web access. Where are we deriving the keys from? I imagine this is the "client" public key, and I can get the server/cluster side with wg showconf?
I do appreciate the assistance, but yes, I will say the docs are a bit lacking. That being said, point me in the right direction and I can help with documentation once I get the connection sorted.
VPN
Kilo also enables peers outside of a Kubernetes cluster to connect to the VPN, allowing cluster applications to securely access external services and permitting developers and support to securely debug cluster resources. In order to declare a peer, start by defining a Kilo peer resource:
cat <<'EOF' | kubectl apply -f -
apiVersion: kilo.squat.ai/v1alpha1
kind: Peer
metadata:
name: squat
spec:
allowedIPs: