poseidon / typhoon

Minimal and free Kubernetes distribution with Terraform

Home Page: https://typhoon.psdn.io/

License: MIT License

HCL 99.85% HTML 0.15%
kubernetes google-cloud bare-metal digitalocean terraform aws coreos azure fedora-coreos flatcar-linux

typhoon's Introduction

Typhoon


Typhoon is a minimal and free Kubernetes distribution.

  • Minimal, stable base Kubernetes distribution
  • Declarative infrastructure and configuration
  • Free (freedom and cost) and privacy-respecting
  • Practical for labs, datacenters, and clouds

Typhoon distributes upstream Kubernetes, architectural conventions, and cluster addons, much like a GNU/Linux distribution provides the Linux kernel and userspace components.

Features

Modules

Typhoon provides a Terraform Module for defining a Kubernetes cluster on each supported operating system and platform.

Typhoon is available for Fedora CoreOS.

Platform Operating System Terraform Module Status
AWS Fedora CoreOS aws/fedora-coreos/kubernetes stable
Azure Fedora CoreOS azure/fedora-coreos/kubernetes alpha
Bare-Metal Fedora CoreOS bare-metal/fedora-coreos/kubernetes stable
DigitalOcean Fedora CoreOS digital-ocean/fedora-coreos/kubernetes beta
Google Cloud Fedora CoreOS google-cloud/fedora-coreos/kubernetes stable
Platform Operating System Terraform Module Status
AWS Fedora CoreOS (ARM64) aws/fedora-coreos/kubernetes alpha

Typhoon is available for Flatcar Linux.

Platform Operating System Terraform Module Status
AWS Flatcar Linux aws/flatcar-linux/kubernetes stable
Azure Flatcar Linux azure/flatcar-linux/kubernetes alpha
Bare-Metal Flatcar Linux bare-metal/flatcar-linux/kubernetes stable
DigitalOcean Flatcar Linux digital-ocean/flatcar-linux/kubernetes beta
Google Cloud Flatcar Linux google-cloud/flatcar-linux/kubernetes stable
Platform Operating System Terraform Module Status
AWS Flatcar Linux (ARM64) aws/flatcar-linux/kubernetes alpha
Azure Flatcar Linux (ARM64) azure/flatcar-linux/kubernetes alpha

Typhoon also provides Terraform Modules for optionally managing individual components applied onto clusters.

Name Terraform Module Status
CoreDNS addons/coredns beta
Cilium addons/cilium beta
flannel addons/flannel beta
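
A component module can be used much like a cluster module. A minimal sketch, assuming addons/coredns exposes inputs for the cluster DNS service IP and a replica count (the variable names below are illustrative assumptions, not confirmed inputs; check the module's variables.tf):

module "coredns" {
  source = "git::https://github.com/poseidon/typhoon//addons/coredns?ref=v1.30.2"

  # illustrative inputs (assumed names, not confirmed)
  cluster_dns_service_ip = "10.3.0.10"
  replicas               = 2
}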

Documentation

Usage

Define a Kubernetes cluster by using the Terraform module for your chosen platform and operating system. Here's a minimal example:

module "yavin" {
  source = "git::https://github.com/poseidon/typhoon//google-cloud/fedora-coreos/kubernetes?ref=v1.30.2"

  # Google Cloud
  cluster_name  = "yavin"
  region        = "us-central1"
  dns_zone      = "example.com"
  dns_zone_name = "example-zone"

  # configuration
  ssh_authorized_key = "ssh-ed25519 AAAAB3Nz..."

  # optional
  worker_count = 2
  worker_preemptible = true
}

# Obtain cluster kubeconfig
resource "local_file" "kubeconfig-yavin" {
  content  = module.yavin.kubeconfig-admin
  filename = "/home/user/.kube/configs/yavin-config"
}

Initialize modules, plan the changes to be made, and apply the changes.

$ terraform init
$ terraform plan
Plan: 62 to add, 0 to change, 0 to destroy.
$ terraform apply
Apply complete! Resources: 62 added, 0 changed, 0 destroyed.

In 4-8 minutes (varies by platform), the cluster will be ready. This Google Cloud example creates a yavin.example.com DNS record that resolves to a network load balancer across the controller nodes.

$ export KUBECONFIG=/home/user/.kube/configs/yavin-config
$ kubectl get nodes
NAME                                       ROLES    STATUS  AGE  VERSION
yavin-controller-0.c.example-com.internal  <none>   Ready   6m   v1.30.2
yavin-worker-jrbf.c.example-com.internal   <none>   Ready   5m   v1.30.2
yavin-worker-mzdm.c.example-com.internal   <none>   Ready   5m   v1.30.2

List the pods.

$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                      READY  STATUS    RESTARTS  AGE
kube-system   calico-node-1cs8z                         2/2    Running   0         6m
kube-system   calico-node-d1l5b                         2/2    Running   0         6m
kube-system   calico-node-sp9ps                         2/2    Running   0         6m
kube-system   coredns-1187388186-zj5dl                  1/1    Running   0         6m
kube-system   coredns-1187388186-dkh3o                  1/1    Running   0         6m
kube-system   kube-apiserver-controller-0               1/1    Running   0         6m
kube-system   kube-controller-manager-controller-0      1/1    Running   0         6m
kube-system   kube-proxy-117v6                          1/1    Running   0         6m
kube-system   kube-proxy-9886n                          1/1    Running   0         6m
kube-system   kube-proxy-njn47                          1/1    Running   0         6m
kube-system   kube-scheduler-controller-0               1/1    Running   0         6m

Non-Goals

Typhoon is strict about minimalism, maturity, and scope. These are not in scope:

  • In-place Kubernetes Upgrades
  • Adding every possible option
  • OpenStack or Mesos platforms

Help

Schedule a meeting via GitHub Sponsors to discuss your use case.

Motivation

Typhoon powers the author's cloud and colocation clusters. The project has evolved through operational experience and Kubernetes changes. Typhoon is shared under a free license to allow others to use the work freely and contribute to its upkeep.

Typhoon addresses real world needs, which you may share. It is honest about limitations or areas that aren't mature yet. It avoids buzzword bingo and hype. It does not aim to be the one-solution-fits-all distro. An ecosystem of Kubernetes distributions is healthy.

Social Contract

Typhoon is not a product, trial, or free-tier. Typhoon does not offer support, services, or charge money. And Typhoon is independent of operating system or platform vendors.

Typhoon clusters will contain only free components. Cluster components will not collect data on users without their permission.

Sponsors

Poseidon's GitHub Sponsors support the infrastructure and operational costs of providing Typhoon.





If you'd like your company here, please contact dghubble at psdn.io.

typhoon's People

Contributors

8ball030, a7pr4z, ajrpayne, barakmich, bendrucker, bkcsfi, bzub, cuppett, dependabot[bot], dghubble, dghubble-renovate[bot], irontoby, itspngu, jordanp, justaugustus, kmarquardsen, mholttech, pms1969, redeux, schu, sdemos, sendhil, shift, squat, surajssd, tuhtah, valer-cara, warmchang, woneill, yokhahn


typhoon's Issues

Ubiquiti EdgeMax Router docs

I'm looking to set this up at home using a Ubiquiti EdgeRouter X. It would be great if there was some info on router setup for the Bare Metal install.

CRI?

Just wondering if it's on the roadmap to enable and/or select between various CRI runtimes?

kube-proxy unable to list endpoints

Bug

Environment

  • Platform: digital-ocean
  • OS: container-linux
  • Terraform: 0.11.1
  • Plugins: digital ocean 0.1.3
  • Ref: Git SHA (if applicable)

Problem

Describe the problem.
Unable to connect using a Kubernetes service.
kube-proxy logs show it is unable to get endpoints; system:serviceaccount is forbidden.

Desired Behavior

Describe the goal.
Able to connect using kubernetes service name.

Steps to Reproduce

Run terraform apply using the digital-ocean/container-linux/kubernetes module at v1.9.2.

After the cluster is up, check the kube-proxy logs; you will see entries such as those below.

E0127 06:23:01.776988 1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *core.Endpoints: Get https://btc.geek.per.sg:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 159.65.12.213:443: getsockopt: connection refused
E0127 06:23:01.777090 1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *core.Service: Get https://btc.geek.per.sg:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 159.65.12.213:443: getsockopt: connection refused
E0127 06:23:02.779158 1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *core.Endpoints: Get https://btc.geek.per.sg:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 159.65.12.213:443: getsockopt: connection refused
E0127 06:23:02.781536 1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *core.Service: Get https://btc.geek.per.sg:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 159.65.12.213:443: getsockopt: connection refused
E0127 06:23:12.194181 1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *core.Service: services is forbidden: User "system:serviceaccount:kube-system:kube-proxy" cannot list services at the cluster scope
E0127 06:23:12.194467 1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *core.Endpoints: endpoints is forbidden: User "system:serviceaccount:kube-system:kube-proxy" cannot list endpoints at the cluster scope

Could be related to this?
kubernetes/kubernetes#58882

Thanks!

Officially support custom networkd units for bare-metal

Bug

Environment

  • Platform: bare-metal
  • OS: container-linux 1576.5.0
  • Terraform: Terraform v0.11.2

Problem

Creating a new cluster, pods have no network connectivity across hosts. This means that pods also fail at DNS resolution since kube-dns runs on a master and most other pods don't.

Desired Behavior

Pods can communicate with pods on other hosts.

Steps to Reproduce

Create a new bare-metal cluster. Start a couple of pods (eg https://kubernetes.io/docs/user-guide/walkthrough/#pods) and unless they run on the same host, there is no network connection between the pods.

Additional information.

No doubt this is something strange with my setup. I have 3 hosts, each with 4 network devices. Two are connected to the same network (I hope to bond them at some point in the future), which grants static leases from a DHCP server. The other two are currently disconnected.
I have tried using the undocumented controller_networkds and worker_networkds units to disable DHCP on the 2nd interface, but that doesn't fix the issue at hand.

cluster.tf


Terraform always reports firewall changes are needed

Bug

Environment

Platform: Digital Ocean
OS: Container Linux
Terraform: v0.10.1-dev
Ref: f044113

Actual Behavior

Terraform always reports that firewall changes need to be made when running terraform plan after successful cluster creation, even though the correct firewall rules exist. This is an issue with the upstream provider plugin: https://github.com/terraform-providers/terraform-provider-digitalocean/issues/30

$ terraform plan
  ~ module.mycluster.digitalocean_firewall.rules
      inbound_rule.3.port_range:  "0" => "all"
      inbound_rule.3.protocol:    "tcp" => "udp"
      inbound_rule.4.port_range:  "0" => "all"
      inbound_rule.4.protocol:    "udp" => "tcp"
      outbound_rule.0.port_range: "0" => ""
      outbound_rule.0.protocol:   "udp" => "icmp"
      outbound_rule.1.port_range: "0" => "all"
      outbound_rule.1.protocol:   "icmp" => "udp"
      outbound_rule.2.port_range: "0" => "all"

While annoying, it's better to keep the newly added firewall rules and wait for the provider to improve than to revert the cluster firewall rules. Users can terraform apply and it won't harm the rules.

Stop using auto-scaling groups or instance groups for controllers

AWS and GCE modules use auto-scaling groups and managed instance groups for controller instances, although these abstractions are generally only suitable for fungible components. etcd data resides on these nodes so it is not safe to scale or roll these groups in most cases.

Migrate away from this architecture as part of the shift away from self-hosted etcd.
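
As a rough sketch of the direction (not Typhoon's actual implementation), controllers can instead be provisioned as discrete instances with count rather than an auto-scaling group; the AMI data source and subnet references below are assumptions for illustration:

resource "aws_instance" "controller" {
  count = var.controller_count

  instance_type = var.controller_type
  ami           = data.aws_ami.coreos.image_id                  # assumed AMI data source
  subnet_id     = element(aws_subnet.public.*.id, count.index)  # assumed subnets

  tags = {
    Name = "${var.cluster_name}-controller-${count.index}"
  }
}

Discrete instances keep etcd data tied to a specific, named node, so a Terraform diff shows exactly which controller would be replaced.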

cni config: No networks found

Bug

Environment

  • Platform: bare-metal

  • OS: container-linux

  • Terraform: v0.10.7

  • Plugins: Provider plugin versions
    provider.local: version = "~> 1.0"
    provider.null: version = "~> 1.0"
    provider.template: version = "~> 1.0"
    provider.tls: version = "~> 1.0"
    terraform-provider-matchbox v0.2.2

  • Ref: Git SHA (if applicable)*
    ref=1bc25c103654a497bcc0c2486104426f09ea2456

Problem

Temporary Kubernetes control plane API fails to start

Log entries show issues relating to missing cni config

cni.go:189] Unable to update cni config: No networks found in /etc/kubernetes/cni/net.d
kubelet.go:2136] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

Desired Behavior

bootkube API server starts and cluster is provisioned

Steps to Reproduce

cluster.tf

module "bare-metal-mercury" {
  source = "git::https://github.com/poseidon/typhoon//bare-metal/container-linux/kubernetes?ref=1bc25c103654a497bcc0c2486104426f09ea2456"

  # install
  matchbox_http_endpoint  = "${var.matchbox_http_endpoint}"
  container_linux_channel = "${var.container_linux_channel}"
  container_linux_version = "${var.container_linux_version}"
  ssh_authorized_key      = "${var.ssh_authorized_key}"

  # cluster
  cluster_name    = "${var.cluster_name}"
  k8s_domain_name = "${var.k8s_domain_name}"

  # machines
  controller_names   = "${var.controller_names}"
  controller_macs    = "${var.controller_macs}"
  controller_domains = "${var.controller_domains}"
  worker_names       = "${var.worker_names}"
  worker_macs        = "${var.worker_macs}"
  worker_domains     = "${var.worker_domains}"

  # bootkube assets
  asset_dir = "${var.asset_dir}"

  # Optional
  networking                    = "${var.networking}"
  cached_install                = "${var.cached_install}"
  install_disk                  = "${var.install_disk}"
  container_linux_oem           = "${var.container_linux_oem}"
  pod_cidr                      = "${var.pod_cidr}"
  service_cidr                  = "${var.service_cidr}"
}

terraform.tfvars

matchbox_http_endpoint = "http://matchbox.example.com:8080"
matchbox_rpc_endpoint = "matchbox.example.com:8081"
ssh_authorized_key = "ssh-rsa ..."

cluster_name = "example"
container_linux_version = "1465.6.0"
container_linux_channel = "stable"

# Machines
controller_names = ["m0", "m1"]
controller_domains = ["m0.example.com", "m1.example.com"]
controller_macs = ["MAC1", "MAC2"]
worker_names = ["n0", "n1"]
worker_domains = ["n0.example.com", "n1.example.com"]
worker_macs = ["MAC1", "MAC2"]

# Bootkube
k8s_domain_name = "m0.example.com"
asset_dir = "assets_dir"

# Optional (defaults)
cached_install = "true"
install_disk = "/dev/sda"
#container_linux_oem = ""
networking = "calico"
pod_cidr = "10.2.0.0/16"
service_cidr = "10.3.0.0/16"

Run

terraform plan
terraform apply

VMware vSphere module?

Hello,

I am about to start to write a VMware vSphere module for Typhoon.

I just would like to know if you planned to support this platform, if you have already started this module, and if have any suggestions or recommendations?

Thank you.

More than one SSH key

Feature Request

Feature

Rename ssh_authorized_key to ssh_authorized_keys and allow a list of keys. Our organization discourages shared keys.

Tradeoffs

Unknown.
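
A sketch of what the requested interface might look like; ssh_authorized_keys as a list is hypothetical (today the modules accept a single ssh_authorized_key string):

module "nemo" {
  source = "git::https://github.com/poseidon/typhoon//digital-ocean/container-linux/kubernetes"

  # hypothetical list input, for illustration only
  ssh_authorized_keys = [
    "ssh-ed25519 AAAAB3Nz... alice",
    "ssh-ed25519 AAAAB3Nz... bob",
  ]

  # ...other required inputs omitted
}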

Explore using an NLB instead of an ELB

NLBs are the newer network load balancer option on AWS. Support has been maturing in newer versions of the terraform-provider-aws provider and we may be able to start exploring switching the ELBs to NLBs.

Potential benefits:

  • AWS claims NLBs have better performance and throughput.

There are a few things to look out for:

  • Just as reliable in practice?
  • No increase in time to provision an AWS cluster
  • Would require bumping the minimum terraform-provider-aws plugin version we allow

We can investigate this separately for the apiserver and ingress ELBs.
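
A rough Terraform sketch of what an NLB for the apiserver could look like (the subnet and VPC references are assumptions; the real wiring would live inside the AWS module):

resource "aws_lb" "apiserver" {
  name               = "${var.cluster_name}-apiserver"
  load_balancer_type = "network"
  internal           = false
  subnets            = aws_subnet.public.*.id  # assumed public subnets
}

resource "aws_lb_target_group" "controllers" {
  name     = "${var.cluster_name}-controllers"
  vpc_id   = aws_vpc.network.id                # assumed VPC
  protocol = "TCP"
  port     = 443
}

resource "aws_lb_listener" "apiserver-https" {
  load_balancer_arn = aws_lb.apiserver.arn
  protocol          = "TCP"
  port              = 443

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.controllers.arn
  }
}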

why bootkube?

The readme says in-place upgrades are a non-goal. Then why is bootkube or self-hosted Kubernetes needed? Why not just use systemd+containers to avoid complexity and better achieve the minimalism goal?

The Kubernetes control plane is designed to allow maintenance windows. With an HA setup, the maintenance window can even be minimized. The worker plane is trickier, but it has nothing to do with bootkube anyway. I do not think in-place upgrades will bring anything significant to Typhoon's goal, so I won't worry about the future plan to put bootkube in here now.

Re-applying breaks existing GCE clusters

Bug

Environment

Problem

Recently, I've noticed that running terraform plan will show a diff that recreates instance templates (both controllers and workers) because it shows that the naming of disks should change. Running terraform apply will recreate all controllers and workers, which effectively destroys cluster state and Kubernetes will no longer be running.

terraform plan
...
      disk.0.device_name:                         "persistent-disk-0" => ""         
...                                                                                                                               
Plan: 2 to add, 2 to change, 2 to destroy.

Mitigations

  • If running terraform plan shows a diff that will recreate controllers, do not run apply
  • Check back here for updates, still investigating.

Action Items

  • Add documentation note to make clear the importance of pinning the provider version
  • Consider demoting Google Cloud platform to beta.
  • Continue planning to change controllers to not be part of a managed instance group. Scaling and rolling edits are not well suited for the control plane. They are suited for workers.

related:

Expose kernel arguments as an input variable

Lots of useful things can be passed as kernel arguments. Work has started in #28; however, this will take a little extra thought for Container Linux installed to disk.

We will need to modify /usr/share/oem/grub.cfg with Ignition. Specifically, if kernel_args is set, they will need to be added to that file in the form of

linux_append="${join(" ", var.kernel_args)}"

My proposal is to keep a copy of each provider's grub.cfg in each module and use it as a template. The rendered template will be added to the Container Linux Config files section to replace /usr/share/oem/grub.cfg. The caveat is that a change to this config from providers might break things. I think they are rarely changed, though. It would be good to find a source of information that would notify us of changes.

To get things started, here's the default grub.cfg for a few providers.

GCE

# CoreOS GRUB settings                               

set oem_id="gce"                                     

# GCE only has a serial console.                     
set linux_console="console=ttyS0,115200n8"           
serial com0 --speed=115200 --word=8 --parity=no      
terminal_input serial_com0                           
terminal_output serial_com0

DigitalOcean

# CoreOS GRUB settings                               

set oem_id="digitalocean"

AWS

# CoreOS GRUB settings for EC2

set oem_id="ec2"

# Blacklist the Xen framebuffer module so it doesn't get loaded at boot
# Disable `ens3` style names, so eth0 is used for both ixgbevf or xen.
set linux_append="modprobe.blacklist=xen_fbfront net.ifnames=0"

if [ "$grub_platform" = pc ]; then
        set linux_console="console=ttyS0,115200n8"
        serial com0 --speed=115200 --word=8 --parity=no
        terminal_input serial_com0
        terminal_output serial_com0
fi
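
A minimal Terraform sketch of the proposal, assuming a copy of the provider's grub.cfg is kept in the module as a template (the template path and variable wiring are illustrative):

data "template_file" "grub" {
  template = file("${path.module}/grub.cfg.tmpl")  # assumed per-provider template copy

  vars = {
    linux_append = join(" ", var.kernel_args)
  }
}

The rendered output (data.template_file.grub.rendered) would then be written over /usr/share/oem/grub.cfg via the Container Linux Config files section, as described above.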

digitalocean provider token not found

Hey folks,
I'm following the tutorial from here
https://typhoon.psdn.io/digital-ocean/
but can't get the digital ocean token to be fetched from a local file

Environment

  • Platform: digital-ocean
  • OS: container-linux
  • Terraform: v 0.11.1
  • Plugins: Provider plugin versions

Problem

In terraform plan, it asks to enter a value for provider.digitalocean.token.

In terraform validate I get
Error: module.digital-ocean-nemo.provider.digitalocean: "token": required field is not set

Desired Behavior

The token is fetched from the ~/.config/digital-ocean/token file

Steps to Reproduce

Folder containing these files:

providers.tf

 provider "digitalocean" {
   token = "${chomp(file("~/.config/digital-ocean/token"))}"
 }

main.tf

 module "digital-ocean-nemo" {
   source = "git::https://github.com/poseidon/typhoon//digital-ocean/container-linux/kubernetes"

    region   = "fra1"
   dns_zone = "digital-ocean.example.com"

   cluster_name     = "nemo"
   image            = "coreos-stable"
   controller_count = 1
   controller_type  = "2gb"
   worker_count     = 2
   worker_type      = "512mb"
   ssh_fingerprints = ["d7:9d:79:ae:56:32:73:79:95:88:e3:a2:ab:5d:45:e7"]

   # output assets dir
   asset_dir = "~/.secrets/clusters/nemo"
 }

Thanks !

Deprecate control plane self-hosted etcd

Feature Request

Problem

Self-hosted etcd is still marked as experimental by bootkube, and the actual pivot between the on-host etcd and self-hosted etcd takes a significant chunk of the startup time for bootkube clusters (~20%). Anecdotally, I've also seen self-hosted clusters fail, which was part of the inspiration for the bootkube recover sub-command.

Desired Behavior

Support on-host etcd clusters, either by provisioning separate machines or by having the etcd instances live on the masters. Etcd instances on the master nodes have the attractive aspect of reduced overhead, but might not be optimal due to masters living in an auto-scaling group.

Tradeoffs

Etcd doesn't behave well in an auto-scaling group.[1] This might be alleviated by talking to the cloud provider or by not supporting expanding the number of masters, but is worth investigation on its own. Bootstrapping needs a concrete, cross-cloud solution, and re-balancing must survive reboots and adding new members after the initial cluster is formed.

A large chunk of this feature is determining how best to bootstrap and manage the etcd cluster.

[1] https://crewjam.com/etcd-aws/

DigitalOcean: invalid key identifiers for Droplet creation

Bug

Environment

  • Platform:digital-ocean
  • OS: OSX
  • Terraform: v0.11.2
  • Plugins: provider.digitalocean v0.1.3

Problem

3 error(s) occurred:

* module.digital-ocean-dev.digitalocean_droplet.workers[0]: 1 error(s) occurred:

* digitalocean_droplet.workers.0: Error creating droplet: POST https://api.digitalocean.com/v2/droplets: 422 cd:e8:2a:10:50:ae:eb:6a:91:b9:e8:9f:e0:04:70:7f are invalid key identifiers for Droplet creation.
* module.digital-ocean-dev.digitalocean_droplet.workers[1]: 1 error(s) occurred:

* digitalocean_droplet.workers.1: Error creating droplet: POST https://api.digitalocean.com/v2/droplets: 422 cd:e8:2a:10:50:ae:eb:6a:91:b9:e8:9f:e0:04:70:7f are invalid key identifiers for Droplet creation.
* module.digital-ocean-dev.digitalocean_droplet.controllers: 1 error(s) occurred:

* digitalocean_droplet.controllers: Error creating droplet: POST https://api.digitalocean.com/v2/droplets: 422 cd:e8:2a:10:50:ae:eb:6a:91:b9:e8:9f:e0:04:70:7f are invalid key identifiers for Droplet creation.

When I try to terraform apply, the fingerprints are not recognized. I've tried the SSH key ID and a few different RSA keys.

module.digital-ocean-dev.digitalocean_droplet.workers[0]: Creating...
  disk:                 "" => "<computed>"
  image:                "" => "coreos-stable"
  ipv4_address:         "" => "<computed>"
  ipv4_address_private: "" => "<computed>"
  ipv6:                 "" => "true"
  ipv6_address:         "" => "<computed>"
  ipv6_address_private: "" => "<computed>"
  locked:               "" => "<computed>"
  name:                 "" => "dev-worker-0"
  price_hourly:         "" => "<computed>"
  price_monthly:        "" => "<computed>"
  private_networking:   "" => "true"
  region:               "" => "lon1"
  resize_disk:          "" => "true"
  size:                 "" => "s-1vcpu-1gb"
  ssh_keys.#:           "" => "1"
  ssh_keys.0:           "" => "cd:e8:2a:10:50:ae:eb:6a:91:b9:e8:9f:e0:04:70:7f"
  status:               "" => "<computed>"
  tags.#:               "" => "1"
  tags.0:               "" => "dev-worker"

Steps to Reproduce

Fresh install. terraform init, terraform apply

FQDN For Cluster Nodes

Is there any way to get around the FQDN requirement for the nodes in the cluster? I'm trying to set this up at home and can't get static IPs from my ISP. I'm using a Ubiquiti EdgeRouter and plan to set up PXE boot through that. Any advice here, or am I barking up the wrong tree?

kube-flannel going into CrashLoop

Bug

kube-flannel goes into a crash loop. This does not happen on all flannel pods.
The error is "unknown host". It seems like it could not contact the API server.

Environment

  • Platform: digital-ocean
  • OS: container-linux
  • Terraform: 0.11.1
  • Plugins: Provider plugin versions
  • Ref: v1.9.3

Problem

Describe the problem.
When I bootstrap a Kubernetes cluster on Digital Ocean, kube-flannel goes into a crash loop. This does not happen on all pods. For example, I bootstrapped 4 worker nodes and either 1 or 2 pods go into a crash loop. The error reported is "unknown host" when it tries to connect to the API server.

Due to this, the nginx addon no longer works. Other pods' statuses are Running.

Desired Behavior

Flannel pods in a stable, Running state.

Steps to Reproduce

I simply follow the steps described in digital ocean distribution of Typhoon.

use etcd 3.2.12 instead of 3.2.0

Probably should always use the latest patch release, which contains mostly bug fixes from previous patch releases with very low risk.

Minor omission in docs: "export KUBECONFIG"

Bug

Environment

  • Platform:digital-ocean
  • OS: container-linux, fedora-cloud
  • Terraform: terraform version
  • Plugins: Provider plugin versions
  • Ref: Git SHA (if applicable)

Problem

Minor omission in the documentation:
https://github.com/poseidon/typhoon/blob/master/docs/digital-ocean.md

KUBECONFIG=/home/user/.secrets/clusters/nemo/auth/kubeconfig

needs to be:

export KUBECONFIG=/home/user/.secrets/clusters/nemo/auth/kubeconfig

Desired Behavior

Without the export, the KUBECONFIG assignment is not visible to kubectl and the command times out. kubectl should be correctly pointed to the cluster config defined in the environment variable and work.

Steps to Reproduce

KUBECONFIG=$PWD/clusters/nemo/auth/kubeconfig
kubectl get nodes
Unable to connect to the server: dial tcp XX.XX.XX.XX:443: i/o timeout

but:

export KUBECONFIG=$PWD/clusters/nemo/auth/kubeconfig
kubectl get nodes
NAME             STATUS    ROLES     AGE       VERSION
10.138.XX.X1     Ready     node      7m        v1.8.6
10.138.XX.X2     Ready     master    7m        v1.8.6
10.138.XX.X3     Ready     node      7m        v1.8.6

failed to copy assets, bootkube does not start

Bug

Environment

  • Platform: bare-metal
  • OS: container-linux
  • Terraform: 0.11.2
  • Plugins:

Problem

The terraform apply exits with the following:

module.bare-metal-myk8s.null_resource.bootkube-start (remote-exec): mv: cannot move '/home/core/assets' to '/opt/bootkube/assets': Directory not empty
module.bare-metal-myk8s.null_resource.bootkube-start (remote-exec): Job for bootkube.service failed because the control process exited with error code.
module.bare-metal-myk8s.null_resource.bootkube-start (remote-exec): See "systemctl  status bootkube.service" and "journalctl  -xe" for details.

Desired Behavior

The directory on the controller node should not exist. If I remove the directory '/opt/bootkube/assets' and run terraform apply again, it does the 'mv', but the 'bootkube' service does not start.

core@c1-k8s /opt $ sudo systemctl  status bootkube.service
โ— bootkube.service - Bootstrap a Kubernetes cluster
   Loaded: loaded (/etc/systemd/system/bootkube.service; static; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2018-01-17 13:44:42 UTC; 11min ago
  Process: 4871 ExecStart=/usr/bin/bash /opt/tectonic/bootkube.sh (code=exited, status=200/CHDIR)
 Main PID: 4871 (code=exited, status=200/CHDIR)
      CPU: 1ms

Jan 17 13:44:42 c1-k8s.server.com systemd[1]: Starting Bootstrap a Kubernetes cluster...
Jan 17 13:44:42 c1-k8s.server.com systemd[1]: bootkube.service: Main process exited, code=exited, status=200/CHD
Jan 17 13:44:42 c1-k8s.server.com systemd[1]: Failed to start Bootstrap a Kubernetes cluster.
Jan 17 13:44:42 c1-k8s.server.com systemd[1]: bootkube.service: Unit entered failed state.
Jan 17 13:44:42 c1-k8s.server.com systemd[1]: bootkube.service: Failed with result 'exit-code'.
core@c1-k8s /opt $

Furthermore, the directory '/opt/tectonic/' does not exist.

Steps to Reproduce

Follow the steps of a bare-metal installation.

Allow other KubeDNS domains than cluster.local

Feature Request

Feature

While cluster.local is a fine default for internal Kubernetes DNS names, it would be a useful option to allow arbitrary FQDNs controlled by the user.

The advantage being that, perhaps, k8s.example.com can be the subdomain of record for the cluster, and resolvable to all, with appropriate DNS.

As an optional variable for the various providers, allow setting cluster_dns_fqdn or similar

Tradeoffs

It makes things a bit uglier.
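
As an illustration of the requested option (cluster_dns_fqdn is the name suggested above, not an existing input):

module "yavin" {
  source = "git::https://github.com/poseidon/typhoon//google-cloud/container-linux/kubernetes"

  # hypothetical optional variable proposed in this request
  cluster_dns_fqdn = "k8s.example.com"

  # ...other required inputs omitted
}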

ASG's for Controllers

Is there some reason that the AWS controllers can't be in an auto scaling group?

I ask because I'd like to convert the Classic Load Balancers to NLBs to allow for the pass-through of websockets at the very least, and the only way that I can see to associate the instances with the target group is via an ASG.

It seems sensible that the controllers should be able to heal themselves if they get blown away; I'm just not sure whether any problems might manifest because of this.

If I get it working I'll feed back a PR.

Cheers,
Paul.

On Google Cloud, multi-controller setups only have 1 controller

Bug

Environment

  • Platform: google-cloud
  • OS: container-linux

Problem

Google Cloud network load balancers map a single regional IP to a target pool of health checked nodes. From a load balanced node, a Google NLB bug results in requests always being sent to the node itself, even if the health checks are failing.

As a result, launching a multi-controller cluster (i.e. controller_count = 3) will create 3 controllers, run bootkube start on the first, and the other 2 controllers will never be able to connect to the bootstrapped controller because the network load balancer routes their requests to themselves, even if you write a proper health check based on the apiserver availability on each node. In effect, you will only ever see the first controller in kubectl get nodes.

Workarounds

There are several workarounds, but the tradeoffs are poor.

  1. Kubernetes requires a single DNS FQDN; create DNS records for each controller under it. This is effectively the same round-robin DNS setup used on platforms that don't support load balancing. Bleh.
  2. SSH to additional controllers, temporarily add an /etc/hosts record to point them directly at the 0th controller to register and bootstrap themselves. Then remove the record. Manual.
  3. Use a Google Cloud global TCP load balancer, instance group, etc. This creates a lot more infrastructure, slows down provisioning time, introduces timeouts to kubectl log and exec commands, and isn't ideal. You can check the google-load-balancing branch, but note that I don't expect to merge it; it's below the bar.

Recommendation

For now, I recommend folks keep deploying single controller clusters on Google Cloud.

This only affects Google Cloud. Multi-controller setups on all other platforms are supported.

Multiple worker types

Feature Request

Feature

There are some situations in which running the complete cluster on the same instance type makes no sense, as different services require different specs.

It would be awesome to be able to define different "worker pools" (as GKE defines them) with different instance types.

Tradeoffs

The pros are the ability to create workers with different specifications, so better usage of the machines can be made.

The cons are probably the difficulty of matching the specs when refreshing the state of the cloud infrastructure and, probably, that it is a feature not everyone is going to use. But it would be nice to have.

(And thanks to all contributors for this awesome project ❤️)
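
A hypothetical sketch of a "worker pool" interface, modeled as a separate Terraform module attached to an existing cluster (the submodule path and input names are assumptions for illustration only):

module "yavin-highmem-pool" {
  source = "git::https://github.com/poseidon/typhoon//google-cloud/container-linux/kubernetes/workers"  # assumed submodule path

  # assumed inputs
  cluster_name = "yavin"
  name         = "highmem"
  machine_type = "n1-highmem-8"
  worker_count = 2
}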

Multiple network interfaces break Calico pod network

Bug

Environment

  • Platform: bare-metal

  • OS: container-linux v1576.5.0

  • Terraform:

    Terraform v0.11.2
    + provider.local v1.1.0
    + provider.matchbox (unversioned)
    + provider.null v1.0.0
    + provider.template v1.0.0
    + provider.tls v1.0.1
    
  • Plugins: N\A

  • Ref: v1.9.2

Problem

Network connectivity between pods fails when creating a new cluster.

I have a lab with 3 nodes, each with 4 network adaptors. Two are connected (to the same network, to be bonded in the future) and two are reserved for future use (dedicated to rook/ceph replication).

When I create a cluster it comes up cleanly, with all pods running

$ kubectl get pods --all-namespaces -o wide
NAMESPACE     NAME                                         READY     STATUS    RESTARTS   AGE       IP               NODE
kube-system   calico-node-95rt9                            2/2       Running   0          7m        192.168.10.153   node-03.home.es.tnv
kube-system   calico-node-chcnb                            2/2       Running   0          7m        192.168.10.152   node-02.home.es.tnv
kube-system   calico-node-mkdfp                            2/2       Running   0          7m        192.168.10.151   node-01.home.es.tnv
kube-system   kube-apiserver-8wf87                         1/1       Running   4          7m        192.168.10.151   node-01.home.es.tnv
kube-system   kube-controller-manager-7596474f54-ptv5v     1/1       Running   0          7m        192.168.192.3    node-01.home.es.tnv
kube-system   kube-controller-manager-7596474f54-t69jb     1/1       Running   0          7m        192.168.192.2    node-01.home.es.tnv
kube-system   kube-dns-ddff48c6d-wt8t9                     3/3       Running   0          7m        192.168.192.6    node-01.home.es.tnv
kube-system   kube-proxy-78whj                             1/1       Running   0          7m        192.168.10.152   node-02.home.es.tnv
kube-system   kube-proxy-l4w6l                             1/1       Running   0          7m        192.168.10.153   node-03.home.es.tnv
kube-system   kube-proxy-scbjx                             1/1       Running   0          7m        192.168.10.151   node-01.home.es.tnv
kube-system   kube-scheduler-7f9bb9d97f-rpqsf              1/1       Running   0          7m        192.168.192.4    node-01.home.es.tnv
kube-system   kube-scheduler-7f9bb9d97f-x2xtf              1/1       Running   0          7m        192.168.192.5    node-01.home.es.tnv
kube-system   pod-checkpointer-nqbzk                       1/1       Running   0          7m        192.168.10.151   node-01.home.es.tnv
kube-system   pod-checkpointer-nqbzk-node-01.home.es.tnv   1/1       Running   0          6m        192.168.10.151   node-01.home.es.tnv

but creating the following pod (on a master to ensure cross-node-communication test)

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  nodeSelector:
    node-role.kubernetes.io/master: ""
  tolerations:
    - key: node-role.kubernetes.io/master
      operator: Exists
      effect: NoSchedule
  containers:
  - name: nginx
    image: nginx:1.7.9
    ports:
    - containerPort: 80

and running a simple busybox pod:

$ kubectl run busybox --image=busybox --restart=Never --tty -i --generator=run-pod/v1 --env "POD_IP=$(kubectl get pod nginx -o go-template='{{.status.podIP}}')"
If you don't see a command prompt, try pressing enter.
/ # wget -qO- http://$POD_IP
wget: can't connect to remote host (192.168.193.7): Connection timed out
/ # exit

I just get a timeout.

On the nodes themselves, 2 interfaces with DHCP seem to produce multiple default routes, which may be causing issues:

ssh [email protected]
Update Strategy: No Reboots
core@node-01 ~ $ ip r
default via 192.168.0.1 dev ens4 proto dhcp src 192.168.15.61 metric 1024
default via 192.168.0.1 dev ens3 proto dhcp src 192.168.10.151 metric 1024
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.0.0/20 dev ens4 proto kernel scope link src 192.168.15.61
192.168.0.0/20 dev ens3 proto kernel scope link src 192.168.10.151
192.168.0.1 dev ens4 proto dhcp scope link src 192.168.15.61 metric 1024
192.168.0.1 dev ens3 proto dhcp scope link src 192.168.10.151 metric 1024
192.168.192.0/24 via 192.168.10.152 dev tunl0 proto bird onlink
blackhole 192.168.193.0/24 proto bird
192.168.193.2 dev calif76201d51d9 scope link
192.168.193.3 dev calidcbf1f5b010 scope link
192.168.193.4 dev cali9ac99a5c324 scope link
192.168.193.5 dev calie293158b83a scope link
192.168.193.6 dev cali7f12f7a7d30 scope link
192.168.194.0/24 via 192.168.10.153 dev tunl0 proto bird onlink

If I add the unofficial, undocumented, unsupported, temporary controller/worker networkd overrides with the following:

  # cluster.tf
  controller_networkds = [ "${file("default.network")}" ]
  worker_networkds = [
    "${file("default.network")}",
    "${file("default.network")}",
  ]
# default.network
units:
  - name: 01-admin.network
    contents: |
      [Match]
      MACAddress={{ .mac }}
      [Network]
      DHCP=yes
  - name: 02-non-admin.network
    contents: |
      [Match]
      Name=*
      [Link]
      Unmanaged=true

Then the cluster comes up without issue & pods have network connectivity even between nodes, but I would really like to be able to use the other network interfaces :)

Desired Behavior

Network connectivity between pods regardless of underlying network interfaces (as long as hardware connectivity exists).

Steps to Reproduce

I will try and create a Kubernetes Vagrant to reproduce the issue.

bootkube-start does not complete

Bug

Environment

  • Platform: aws
  • OS: container-linux
  • Terraform: v0.11.3
  • Plugins:
    • provider.aws v1.5.0
    • provider.ct (unversioned)
    • provider.local v1.1.0
    • provider.null v1.0.0
    • provider.template v1.0.0
    • provider.tls v1.0.1

Problem

bootkube-start is taking forever to complete; I'm not sure if it will even complete. The last time I got it working was yesterday, but today I have tried multiple times with no success.

module.MODULENAME.null_resource.bootkube-start: Still creating... (1h2m0s elapsed)

Desired Behavior

To provision a Kubernetes cluster on AWS with one controller and two workers within 30 minutes at most. According to the docs, I should be up and running within 10 minutes.

Steps to Reproduce

Basically I just followed these steps.

AWS credentials are working, DNS Zone is added, terraform-provider-ct is installed and the ssh key in my module is also added to my ssh-agent. Worth mentioning is that I have successfully provisioned clusters on DigitalOcean several times, but seem to have the same problem over there as well now.

Scrape etcd targets in Prometheus addon

Feature

Configure the prometheus addon to scrape Typhoon etcd targets on controller nodes. Then, metrics from etcd will be available in Prometheus. Alert rules for etcd will fire during incidents. The etcd dashboard provided with the grafana addon will be populated.

Invariants:

  • Users still only need to choose to kubectl apply the addon manifests. Nothing more.
  • Users never need to fiddle with listing etcd nodes on any platform.

Background

The prometheus addon manifests set up Prometheus 2.1 (#113) to scrape apiservers, kubelets, services, endpoints, cAdvisor, and exporters (kube-state-metrics and node_exporter). Alerting rules and Grafana graphs in addons correspond to these metrics. However, etcd rules and graphs currently aren't active/populated.

Situation

Prometheus can be configured (via the ConfigMap) to scrape the secured :2379/metrics endpoints of etcd nodes just like any other target. The etcd cluster runs on-host across controllers with systemd; it is a lower-level component on which Kubernetes relies (not atop k8s), and it already handles its own client authentication.

  • Typhoon runs etcd on-host, across controllers, on all platforms
  • Typhoon requires etcd be setup with TLS on all platforms
  • Typhoon creates etcd client certs, but only places them on controller nodes

To perform the scrapes, Prometheus needs the etcd client certificates to write the tls_config section in a new scrape job.

Options

  • Add etcd client materials in a kube-system secret. We did this back when self-hosted etcd was explored.
    • Pro: Allows prometheus pod to be scheduled on any node
    • Con: Opens up the possibility of escalation attacks (i.e. read kube-system secrets == read everything)
  • Mount the etcd client materials from a controller host. (current most viable)
    • Pro: Avoid keeping etcd client materials in a Kubernetes secret
    • Con: Restricts prometheus pod itself to run on controller nodes
  • Explore whether it's possible to create (or invent) "metrics-only" etcd certificates
    • Likely not on the roadmap
  • Metrics whitelist proxy
    • Con: I use some whitelist proxies for some internal things. They're gross though.

How to: Customise deployed services/pods/etc

Although I'm pretty sure I've got my head around the physical kit (in AWS anyway), I'm not sure I quite understand yet how to customise the Kubernetes setup, so that after a terraform apply I have a setup that's ready to go as one of my environments. I'd like to make prometheus/nginx-ingress/grafana/heapster and CLUO all part of the pre-deployed architecture, and I'm thinking about adding weave and maybe swapping out nginx-ingress for Traefik, and I'm just starting to get around to cert-manager, and god knows what else I'll think of/come across between now and the end of my POC work with Kubernetes. There's so much it's mind boggling.

Is there a simple way to do this? Am I correct in thinking that I'll need to fork and amend the poseidon/terraform-render-bootkube repo to support this? Or is there a better way of attacking the problem?

Thanks for starting this repo/product too. Easily the best I've seen to date for a terraform template. The tectonic one doesn't feel like it's getting any love at present, and it's too generic to be easily followed. Especially for non-terraform ninjas.

Digital Ocean firewalls don't allow IP-tunneling protocol for Calico

Bug

Environment

  • Platform: digital-ocean
  • OS: container-linux
  • Terraform: any
  • Rel: #10

Problem

Typhoon for Digital Ocean provides a networking variable which can be set to "flannel" or "calico". Calico must use IP-in-IP encapsulation on Digital Ocean, but Digital Ocean's cloud firewalls don't support the protocol, whereas AWS and GCP do.

For quick testing, users can remove the firewall entirely and verify pod to pod connectivity works correctly with Calico, but this isn't a safe way to run a cluster and we cannot allow a configuration that behaves that way.

In practice, on Digital Ocean, Calico networking cannot be used.

Dependency

I've filed a ticket with Digital Ocean to ask when support for the IPIP protocol might be added. I'll update here if I receive a response.

The best course of action right now is to choose networking: "flannel" and deploy the Calico Network Policy addon. This is sometimes referred to as "canal". Connectivity is provided by flannel and policy is provided by Calico.
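
Concretely, that means setting the Digital Ocean module's networking variable (the same variable shown in the bare-metal cluster.tf example earlier on this page) to "flannel":

module "nemo" {
  source = "git::https://github.com/poseidon/typhoon//digital-ocean/container-linux/kubernetes"

  # use flannel for pod connectivity; apply the Calico policy-only addon separately
  networking = "flannel"

  # ...other required inputs omitted
}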

Kubernetes v1.8.1 kube-apiserver memory leak

Bug

The Kubernetes hyperkube v1.8.1 apiserver (latest at time of writing) leaks memory (reported by kubernetes/kubernetes#53485).

Fix

This is slated to be fixed in v1.8.2 possibly with:

Notice the Google v1.8.2-beta.0 is not new enough.

Meanwhile

If your clusters are running v1.7.7, favor waiting a few days for v1.8.2.

Otherwise:

  • Use the heapster addon and kubectl top pods -n kube-system to keep an eye on usage.
  • You may set a reasonable memory limit on the apiserver (relative to your available memory) with kubectl edit daemonset kube-apiserver -n kube-system (needs about 250Mi minimum). The pod will restart every few hours, but IMO it's safer than growing to the node memory max.
  name: kube-apiserver
  resources:
    limits:
      memory: 400Mi

Visualization

Prometheus / Grafana view of the kube-apiserver on v1.8.1.

screenshot from 2017-10-20 22-34-00

With a 400Mi memory limit set via edit, the apiserver will get restarted every so often. Spikes are due to pod restarts; it's not graceful.

screenshot from 2017-10-21 07-40-42

digital-ocean: module.digital-ocean-nemo.null_resource.copy-secrets keeps running

Bug

Environment

  • Platform: digital-ocean
  • OS: Container Linux
  • Terraform: v0.10.7

Problem

Provisioning fails at the module.digital-ocean-nemo.null_resource.copy-secrets step when running $ terraform apply

module.digital-ocean-nemo.digitalocean_firewall.rules: Modifications complete after 2s (ID: e8029d33-f276-41a3-ba2c-0d62aed4110d)
module.digital-ocean-nemo.null_resource.copy-secrets.2: Still creating... (10s elapsed)
module.digital-ocean-nemo.null_resource.copy-secrets.0: Still creating... (10s elapsed)
module.digital-ocean-nemo.null_resource.copy-secrets.1: Still creating... (10s elapsed)
.......
module.digital-ocean-nemo.null_resource.copy-secrets.0: Still creating... (14m30s elapsed)
module.digital-ocean-nemo.null_resource.copy-secrets.1: Still creating... (14m30s elapsed)
module.digital-ocean-nemo.null_resource.copy-secrets.2: Still creating... (14m30s elapsed)
module.digital-ocean-nemo.null_resource.copy-secrets.0: Still creating... (14m40s elapsed)
module.digital-ocean-nemo.null_resource.copy-secrets.1: Still creating... (14m40s elapsed)
module.digital-ocean-nemo.null_resource.copy-secrets.2: Still creating... (14m40s elapsed)
module.digital-ocean-nemo.null_resource.copy-secrets.0: Still creating... (14m50s elapsed)
module.digital-ocean-nemo.null_resource.copy-secrets.1: Still creating... (14m50s elapsed)
module.digital-ocean-nemo.null_resource.copy-secrets.2: Still creating... (14m50s elapsed)

Error applying plan:

3 error(s) occurred:

  • module.digital-ocean-nemo.null_resource.copy-secrets[2]: 1 error(s) occurred:

  • ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain

  • module.digital-ocean-nemo.null_resource.copy-secrets[0]: 1 error(s) occurred:

  • ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain

  • module.digital-ocean-nemo.null_resource.copy-secrets[1]: 1 error(s) occurred:

  • ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain

Steps to Reproduce

Follow the instructions at https://typhoon.psdn.io/digital-ocean/ and run $ terraform plan

Deploy has stopped working, "copy_secrets" times out

Bug

Not entirely sure what is going on, but this morning a deploy worked, and now it doesn't. The failure is right at the end, when terraform tries to ssh to the controllers and add some setup (keys, etc.) to them.

I get:

3 error(s) occurred:

* module.tudorcity.null_resource.copy-secrets[2]: 1 error(s) occurred:

* ssh: handshake failed: ssh: unable to authenticate, attempted methods [publickey none], no supported methods remain
* module.tudorcity.null_resource.copy-secrets[0]: 1 error(s) occurred:

* ssh: handshake failed: ssh: unable to authenticate, attempted methods [publickey none], no supported methods remain
* module.tudorcity.null_resource.copy-secrets[1]: 1 error(s) occurred:

* ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain

I've tried it 4 or 5 times now, and get the same result.

I've deleted my .secrets directory, the .terraform directory, re-initialised terraform and a few other things that I've lost track of now.

Environment

  • Platform: aws
  • OS: container-linux
  • Terraform: v0.11.3
  • Plugins: Provider plugin versions

Steps to Reproduce

For reference, here is my template (sanitised):

variable "cluster_name" {
    default = "test"
}

variable "environment" {}

variable "primary_region" {
  default = "eu-west-1"
}

variable "aws_profile" { }

provider "aws" {
  region                  = "${var.primary_region}"
  shared_credentials_file = "~/.aws/credentials"
  profile                 = "${var.aws_profile}"
}

provider "aws" {
  region                  = "${var.primary_region}"
  shared_credentials_file = "~/.aws/credentials"
  profile                 = "${var.aws_profile}"
  alias                   = "default"
}

provider "local" {
  version = "~> 1.0"
  alias   = "default"
}

provider "null" {
  version = "~> 1.0"
  alias   = "default"
}

provider "template" {
  version = "~> 1.0"
  alias   = "default"
}

provider "tls" {
  version = "~> 1.0"
  alias   = "default"
}

data "aws_route53_zone" "public_zone" {
  name         = "${var.environment}.mydomain.com."
  private_zone = false
}

module "test" {
  source = "git::https://github.com/poseidon/typhoon//aws/container-linux/kubernetes?ref=v1.9.3"

  providers = {
    aws      = "aws.default"
    local    = "local.default"
    null     = "null.default"
    template = "template.default"
    tls      = "tls.default"
  }

  cluster_name = "${var.cluster_name}"

  # AWS
  dns_zone           = "${var.environment}.mydomain.com"
  dns_zone_id        = "${data.aws_route53_zone.public_zone.zone_id}"
  controller_count   = 3
  controller_type    = "c5.large"
  worker_count       = 2
  worker_type        = "c5.2xlarge"
  ssh_authorized_key = "ssh-rsa ..."

  # bootkube
  asset_dir = "/home/paul/.secrets/clusters/test"
}

Additionally, well done for something that does work. I've tried a few terraform kubernetes templates and this has been the only one that worked out of the box.

Allow TLS materials injection in Typhoon

Feature Request

Feature

Would it be possible to update the Typhoon (and on-prem) repo to accept/reflect all the bootkube variables, please? I am talking about mapping these bootkube variables to the Typhoon ones.

Tradeoffs

My understanding is Typhoon = Bootkube + Matchbox + CL templates.
It would be nice to be able to customise the Kubernetes deployment like when we use bootkube manually.

Some are already accepted, like pod_cidr, service_cidr and networking.
In my current context, I would like to force the ca_certificate, ca_key_alg and ca_private_key to keep my kubeconfig files static, even after updates/cluster re-creation.

Thank you.
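
A sketch of what the requested pass-through could look like, using the variable names mentioned above (these are not existing Typhoon inputs):

module "mercury" {
  source = "git::https://github.com/poseidon/typhoon//bare-metal/container-linux/kubernetes"

  # hypothetical CA injection inputs (requested, not currently supported)
  ca_certificate = file("~/.secrets/ca.crt")
  ca_key_alg     = "RSA"
  ca_private_key = file("~/.secrets/ca.key")

  # ...other required inputs omitted
}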

Terraform v0.11.x causes unexpected prompts and deletion errors

Bug

Environment

  • Platform: aws, bare-metal, google-cloud, digital-ocean
  • OS: container-linux
  • Terraform: v0.11.x

Problem

Terraform v0.11.x changes the provider and module relationships significantly. This causes issues with Terraform provider version constraints, which allowed modules to specify minimum versions of providers that are required. Typhoon uses these constraints to ensure end-users have appropriate plugin versions.

You can find the full saga in hashicorp/terraform#16824.

Short term

Stick with Terraform v0.10.x.

Mid term

I'm working on docs to show how v0.11.x can be used. Basically,

Explicitly add every provider in providers.tf and give it an alias, such as "default".

provider "local" {
  version = "~> 1.0"
  alias = "default"
}

provider "null" {
  version = "~> 1.0"
  alias = "default"
}

provider "template" {
  version = "~> 1.0"
  alias = "default"
}

provider "tls" {
  version = "~> 1.0"
  alias = "default"
}

Edit each instance of a module in your infrastructure to explicitly pass the providers.

module "aws-cluster" {
  source = "git::https://github.com/poseidon/typhoon//aws/container-linux/kubernetes?ref=5ea7ce0af559857591f20fe19b03aab177fd7032"
  
  providers = {
    aws = "aws.default"
    local = "local.default"
    null = "null.default"
    template = "template.default"
    tls = "tls.default"
  }

  cluster_name = "blah"
   ...

Re-run terraform plan and terraform apply on your infrastructure. Plan should claim there are 0 changes, but run apply anyway.

Now you should be able to use Terraform again normally.

  • If you run plan, you won't see random prompts for provider fields you're setting
  • If you comment or delete a module instance and terraform apply, the cluster is correctly deleted

Yes this is silly.

Long term

Upstream is aware of the impact on modules that use provider versions. They're hoping to address this with hashicorp/terraform#16835 in a future v0.11.x release. We may wait on that instead of asking dear users to jump through the hoops above. Feel free to weigh in on how you'd like to see this proceed.

Why is etcd configured with a systemd dropin?

It seems that Typhoon only relies on systemd+docker to run etcd. It could be specified as a full systemd unit instead of a drop-in:

- name: etcd-member.service
  enable: true
  dropins:
    - name: 40-etcd-cluster.conf
      contents: |
        [Service]
        Environment="ETCD_IMAGE_TAG=v3.2.0"
        Environment="ETCD_NAME=${etcd_name}"
        Environment="ETCD_ADVERTISE_CLIENT_URLS=https://${etcd_domain}:2379"
        Environment="ETCD_INITIAL_ADVERTISE_PEER_URLS=https://${etcd_domain}:2380"
        Environment="ETCD_LISTEN_CLIENT_URLS=https://0.0.0.0:2379"
        Environment="ETCD_LISTEN_PEER_URLS=https://0.0.0.0:2380"
        Environment="ETCD_INITIAL_CLUSTER=${etcd_initial_cluster}"
        Environment="ETCD_STRICT_RECONFIG_CHECK=true"
        Environment="ETCD_SSL_DIR=/etc/ssl/etcd"
        Environment="ETCD_TRUSTED_CA_FILE=/etc/ssl/certs/etcd/server-ca.crt"
        Environment="ETCD_CERT_FILE=/etc/ssl/certs/etcd/server.crt"
        Environment="ETCD_KEY_FILE=/etc/ssl/certs/etcd/server.key"
        Environment="ETCD_CLIENT_CERT_AUTH=true"
        Environment="ETCD_PEER_TRUSTED_CA_FILE=/etc/ssl/certs/etcd/peer-ca.crt"
        Environment="ETCD_PEER_CERT_FILE=/etc/ssl/certs/etcd/peer.crt"
        Environment="ETCD_PEER_KEY_FILE=/etc/ssl/certs/etcd/peer.key"
        Environment="ETCD_PEER_CLIENT_CERT_AUTH=true"

The etcd-member.service unit being configured here ships with Container Linux, which is what creates the OS dependency.

The same should apply to hyperkube/bootkube, etc. However, I have not looked into these.

I would like to remove the dependencies on a specific OS as much as possible, but I want to learn whether there is a specific reason for Typhoon to do it this way.

Ability to specify matchbox assets location

Feature Request

Feature

I'd like to use assets from my local matchbox assets directory directly for the container-linux-install profile; these are hard-coded to use release.core-os.net.
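
A hypothetical shape for this feature (the variable name and default below are illustrative only; the module's existing cached_install input, shown in the cluster.tf example earlier on this page, may already cover part of this use case):

# hypothetical module input, for illustration only
variable "container_linux_base_url" {
  description = "Base URL from which to fetch Container Linux images (e.g. the matchbox assets endpoint)"
  default     = "https://release.core-os.net"
}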

Document how to force redeploy of assets

Feature Request

Feature

Add documentation to force terraform to copy assets and start bootkube.


In a bare-metal environment I wanted to wipe the cluster and start fresh, but with the same assets (keys, certs, etc.). Simply wiping the machines and running terraform apply didn't cause the assets to be copied to the fresh machines.

I tried tainting the terraform modules with no success (this is my lack of experience with terraform)

$ terraform taint module.cluster.null_resource.copy-etcd-secrets
Failed to parse resource name: Malformed resource state key: module.cluster.null_resource.copy-etcd-secrets
$ terraform taint module.cluster.null_resource.copy-kubeconfig
Failed to parse resource name: Malformed resource state key: module.cluster.null_resource.copy-kubeconfig

I ended up having to delete the terraform.tfstate file and regenerating all assets anyway.

Set Calico as the default network provider instead of flannel

Calico will replace flannel as the default network provider on platforms that fully support it (i.e. all except Digital Ocean).

Calico has a number of advantages:

  • Calico is more actively developed and improved
  • Calico provides network policy out of the box (i.e. restricting what traffic can reach pods)
  • Calico uses BGP to peer with nodes and network infrastructure (verified with Ubiquiti gear)
  • Calico is closer to having a functional IPv6 story transparent to Kubernetes
  • Calico is easier to debug, just use ordinary ip utils (e.g. show exchanged routes with ip route)
  • Slight performance improvement as it's not a full overlay network.
    • In practice, this is hard to observe - a bare-metal cluster (with flannel) had 902 MBit/s average pod-to-pod bandwidth and 920 MBit/s with Calico. It's within the margin of error, though.

Status of Flannel

Flannel will stick around for a bit, mostly for comparison and debugging purposes. However, Typhoon will cease testing flannel and remove support for flannel in the future.

can't get terraform gce code to work

This is with Terraform 0.10.6 on a Mac.

If I try the example as given, I get:

bash-3.2$ terraform get --update
Get: git::https://github.com/poseidon/typhoon//google-cloud/container-linux/kubernetes (update)
Get: git::https://github.com/poseidon/bootkube-terraform.git?ref=dbfb11c6eafa08f839eac2834ca1aca35dafe965 (update)
Error loading modules: module controllers: invalid source string: ../controllers

If I clone the repo so it sits alongside my other terraform code and use a relative path instead, I get a little further:

bash-3.2$ terraform plan
Plugin reinitialization required. Please run "terraform init".
Reason: Could not satisfy plugin requirements.

Plugins are external binaries that Terraform uses to access and manipulate
resources. The configuration provided requires plugins which can't be located,
don't satisfy the version constraints, or are otherwise incompatible.

6 error(s) occurred:

* provider.google: no suitable version installed
  version requirements: "(any version)"
  versions installed: none
* provider.local: no suitable version installed
  version requirements: "(any version)"
  versions installed: none
* provider.tls: no suitable version installed
  version requirements: "(any version)"
  versions installed: none
* provider.template: no suitable version installed
  version requirements: "(any version)"
  versions installed: none
* provider.ct: no suitable version installed
  version requirements: "(any version)"
  versions installed: none
* provider.null: no suitable version installed
  version requirements: "(any version)"
  versions installed: none

Terraform automatically discovers provider requirements from your
configuration, including providers used in child modules. To see the
requirements and constraints from each module, run "terraform providers".

error satisfying plugin requirements
bash-3.2$ terraform init
Downloading modules...
Get: file:///Users/joshua/go/src/github.com/poseidon/typhoon/google-cloud/container-linux/kubernetes
Get: git::https://github.com/poseidon/bootkube-terraform.git?ref=dbfb11c6eafa08f839eac2834ca1aca35dafe965
Get: file:///Users/joshua/go/src/github.com/poseidon/typhoon/google-cloud/container-linux/controllers
Get: file:///Users/joshua/go/src/github.com/poseidon/typhoon/google-cloud/container-linux/workers

Initializing provider plugins...
- Checking for available provider plugins on https://releases.hashicorp.com...
- Downloading plugin for provider "local" (1.0.0)...
- Downloading plugin for provider "tls" (1.0.0)...

Provider "ct" not available for installation.

A provider named "ct" could not be found in the official repository.

This may result from mistyping the provider name, or the given provider may
be a third-party provider that cannot be installed automatically.

In the latter case, the plugin must be installed manually by locating and
downloading a suitable distribution package and placing the plugin's executable
file in the following directory:
    terraform.d/plugins/darwin_amd64

Terraform detects necessary plugins by inspecting the configuration and state.
To view the provider versions requested by each module, run
"terraform providers".

- Downloading plugin for provider "null" (0.1.0)...
- Downloading plugin for provider "google" (0.1.3)...
- Downloading plugin for provider "template" (0.1.1)...
