machine-controller's Introduction

Kubermatic machine-controller

Important Note: User data plugins have been removed from machine-controller; their successor is Operating System Manager. It is responsible for creating and managing the required configurations for worker nodes in a Kubernetes cluster, with better modularity and extensibility. Please refer to Operating System Manager for more details.

Features

What Works

  • Creation of worker nodes on AWS, DigitalOcean, OpenStack, Azure, Google Cloud Platform, Nutanix, VMware Cloud Director, VMware vSphere, Hetzner Cloud and KubeVirt
  • Using Ubuntu, Flatcar, CentOS 7 or Rocky Linux 8 distributions (not all distributions work on all providers)

Supported Kubernetes Versions

machine-controller tries to follow the Kubernetes version support policy as closely as possible.

Currently supported Kubernetes versions are:

  • 1.30
  • 1.29
  • 1.28

Community Providers

Some cloud providers implemented in machine-controller have been graciously contributed by community members. These cloud providers are not part of the automated end-to-end tests run by the machine-controller developers, and thus their status cannot be guaranteed. The machine-controller developers assume that they are functional, but can only offer limited support for new features or bug fixes in those providers.

The current list of community providers is:

  • Linode
  • Vultr
  • OpenNebula

What Doesn't Work

  • Master creation (Not planned at the moment)

Quickstart

Deploy machine-controller

  • Install cert-manager to generate the certificates used by the webhooks, which serve over HTTPS:
    kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.11.2/cert-manager.yaml
  • Run kubectl apply -f examples/operating-system-manager.yaml to deploy operating-system-manager, which is responsible for managing user data for worker machines.
  • Run kubectl apply -f examples/machine-controller.yaml to deploy the machine-controller (a quick verification sketch follows below).
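To confirm that both controllers came up, check their Deployments. The namespace and Deployment names below are assumptions based on the example manifests; adjust them if your manifests install the components elsewhere.

# namespace and names assumed from the example manifests -- verify against your setup
kubectl -n kube-system rollout status deployment/operating-system-manager
kubectl -n kube-system rollout status deployment/machine-controller
kubectl -n kube-system rollout status deployment/machine-controller-webhook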

Creating a MachineDeployment

# edit examples/$cloudprovider-machinedeployment.yaml, then create the MachineDeployment
kubectl create -f examples/$cloudprovider-machinedeployment.yaml
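For orientation, a MachineDeployment roughly has the shape sketched below (Hetzner used as an example). The overall structure follows the cluster.k8s.io/v1alpha1 API used by machine-controller, but the provider-specific field names and values here are illustrative assumptions; always start from the corresponding manifest in the examples directory.

apiVersion: cluster.k8s.io/v1alpha1
kind: MachineDeployment
metadata:
  name: my-workers                    # illustrative
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      name: my-workers
  template:
    metadata:
      labels:
        name: my-workers
    spec:
      providerSpec:
        value:
          cloudProvider: hetzner
          cloudProviderSpec:
            token: "<< HETZNER_TOKEN >>"    # assumed field name, see examples/hetzner-machinedeployment.yaml
            serverType: cx21                # assumed value
            datacenter: nbg1-dc3            # assumed value
          operatingSystem: ubuntu
          operatingSystemSpec: {}
      versions:
        kubelet: 1.30.0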

Advanced Usage

Specifying the Apiserver Endpoint

By default, the controller looks for a cluster-info ConfigMap in the kube-public Namespace. If one is found that contains a minimal kubeconfig (kubeadm clusters have one by default), this kubeconfig is used for node bootstrapping. The kubeconfig only needs to contain two things:

  • The CA data
  • The public endpoint for the apiserver

If no such ConfigMap can be found, the following fallbacks apply:

CA Data

The Certificate Authority (CA) will be loaded from the passed kubeconfig when running outside the cluster or from /var/run/secrets/kubernetes.io/serviceaccount/ca.crt when running inside the cluster.

Apiserver Endpoint

The first endpoint of the kubernetes Endpoints object will be used (kubectl get endpoints kubernetes -o yaml).

Example cluster-info ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-info
  namespace: kube-public
data:
  kubeconfig: |
    apiVersion: v1
    clusters:
    - cluster:
        certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURHRENDQWdDZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREE5TVRzd09RWURWUVFERXpKeWIyOTAKTFdOaExtaG1kblEwWkd0bllpNWxkWEp2Y0dVdGQyVnpkRE10WXk1a1pYWXVhM1ZpWlhKdFlYUnBZeTVwYnpBZQpGdzB4TnpFeU1qSXdPVFUyTkROYUZ3MHlOekV5TWpBd09UVTJORE5hTUQweE96QTVCZ05WQkFNVE1uSnZiM1F0ClkyRXVhR1oyZERSa2EyZGlMbVYxY205d1pTMTNaWE4wTXkxakxtUmxkaTVyZFdKbGNtMWhkR2xqTG1sdk1JSUIKSWpBTkJna3Foa2lHOXcwQkFRRUZBQU9DQVE4QU1JSUJDZ0tDQVFFQTNPMFZBZm1wcHM4NU5KMFJ6ckhFODBQTQo0cldvRk9iRXpFWVQ1Unc2TjJ0V3lqazRvMk5KY1R1YmQ4bUlONjRqUjFTQmNQWTB0ZVRlM2tUbEx0OWMrbTVZCmRVZVpXRXZMcHJoMFF5YjVMK0RjWDdFZG94aysvbzVIL0txQW1VT0I5TnR1L2VSM0EzZ0xxNHIvdnFpRm1yTUgKUUxHbllHNVVPN25WSmc2RmJYbGxtcmhPWlUvNXA3c0xwQUpFbCtta3RJbzkybVA5VGFySXFZWTZTblZTSmpDVgpPYk4zTEtxU0gxNnFzR2ZhclluZUl6OWJGKzVjQTlFMzQ1cFdQVVhveXFRUURSNU1MRW9NY0tzYVF1V2g3Z2xBClY3SUdYUzRvaU5HNjhDOXd5REtDd3B2NENkbGJxdVRPMVhDb2puS1o0OEpMaGhFVHRxR2hIa2xMSkEwVXpRSUQKQVFBQm95TXdJVEFPQmdOVkhROEJBZjhFQkFNQ0FxUXdEd1lEVlIwVEFRSC9CQVV3QXdFQi96QU5CZ2txaGtpRwo5dzBCQVFzRkFBT0NBUUVBamlNU0kxTS9VcUR5ZkcyTDF5dGltVlpuclBrbFVIOVQySVZDZXp2OUhCUG9NRnFDCmpENk5JWVdUQWxVZXgwUXFQSjc1bnNWcXB0S0loaTRhYkgyRnlSRWhxTG9DOWcrMU1PZy95L1FsM3pReUlaWjIKTysyZGduSDNveXU0RjRldFBXamE3ZlNCNjF4dS95blhyZG5JNmlSUjFaL2FzcmJxUXd5ZUgwRjY4TXd1WUVBeQphMUNJNXk5Q1RmdHhxY2ZpNldOTERGWURLRXZwREt6aXJ1K2xDeFJWNzNJOGljWi9Tbk83c3VWa0xUNnoxcFBRCnlOby9zNXc3Ynp4ekFPdmFiWTVsa2VkVFNLKzAxSnZHby9LY3hsaTVoZ1NiMWVyOUR0VERXRjdHZjA5ZmdpWlcKcUd1NUZOOUFoamZodTZFcFVkMTRmdXVtQ2ttRHZIaDJ2dzhvL1E9PQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
        server: https://hfvt4dkgb.europe-west3-c.dev.kubermatic.io:30002
      name: ""
    contexts: []
    current-context: ""
    kind: Config
    preferences: {}
    users: []
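kubeadm-based clusters create this ConfigMap automatically; you can inspect an existing one with:

kubectl -n kube-public get configmap cluster-info -o yaml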

Development

Testing

Unit Tests

Simply run make test-unit

End-to-End Locally

[WIP]

Troubleshooting

If you encounter issues, file an issue or talk to us in the #kubermatic channel on the Kubermatic Slack.

Contributing

Thanks for taking the time to join our community and start contributing!

Before You Start

  • Please familiarize yourself with the Code of Conduct before contributing.
  • See CONTRIBUTING.md for instructions on the developer certificate of origin that we require.

Pull Requests

  • We welcome pull requests. Feel free to dig through the issues and jump in.

Changelog

See the list of releases to find out about feature changes.

machine-controller's Issues

RBAC broken

#56 apparently broke RBAC:

GET https://10.96.0.1:443/api/v1/configmaps?resourceVersion=13569859&timeoutSeconds=346&watch=true
I0203 01:42:59.303274       1 round_trippers.go:439] Response Status: 403 Forbidden in 1 milliseconds
I0203 01:42:59.303287       1 round_trippers.go:442] Response Headers:
I0203 01:42:59.303293       1 round_trippers.go:445]     Content-Type: application/json
I0203 01:42:59.303298       1 round_trippers.go:445]     X-Content-Type-Options: nosniff
I0203 01:42:59.303302       1 round_trippers.go:445]     Content-Length: 277
I0203 01:42:59.303307       1 round_trippers.go:445]     Date: Sat, 03 Feb 2018 01:42:59 GMT
I0203 01:42:59.303628       1 request.go:873] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"configmaps is forbidden: User \"system:serviceaccount:kube-system:machine-controller\" cannot watch configmaps at the cluster scope","reason":"Forbidden","details":{"kind":"configmaps"},"code":403}

What surprises me a little is that it watches all ConfigMaps; shouldn't a watch on the cluster-info ConfigMap in the kube-public namespace be enough?
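For reference, scoping the permission down would look roughly like the sketch below: a namespaced Role in kube-public plus a RoleBinding for the controller's ServiceAccount (taken from the log above), instead of a cluster-wide watch on all ConfigMaps. This is an illustrative sketch, not the manifest shipped in the repo; note that RBAC cannot restrict list/watch to a single resource name.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: machine-controller-cluster-info   # illustrative name
  namespace: kube-public
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: machine-controller-cluster-info
  namespace: kube-public
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: machine-controller-cluster-info
subjects:
  - kind: ServiceAccount
    name: machine-controller
    namespace: kube-system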

Running into AWS rate-limits

When creating 5 machines simultaneously, we get rate limited by AWS on all machines.

It seems to happen during validation, so the errors we get from AWS are treated as terminal.

Add integration testing script

To be able to properly validate the machine-controller is working as intended, we need some kind of integration testing.

Because it is not possible to both test external PRs automatically and be sure they are not used to steal credentials, this script is not supposed to be executed automatically. Instead it will (see the sketch after this list):

  • Take credentials for a cloud provider, e.g. from the environment
  • Take an SSH public key
  • Create a single-node Kubernetes cluster via kubeadm at the cloud provider
  • Deploy machine-controller, in a version built from git HEAD, into the newly created cluster using the deployment manifest in the repo
  • Verify machine-controller is running
  • Provide a teardown functionality
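A rough outline of such a script might look like the sketch below; every variable, host and make target in it is a placeholder, not something that exists in the repo.

#!/usr/bin/env bash
set -euo pipefail

# placeholders for illustration only
: "${CLOUD_PROVIDER_TOKEN:?cloud provider credentials, taken from the environment}"
: "${SSH_PUBKEY:?path to an ssh public key}"
NODE_IP="..."   # address of a VM provisioned at the cloud provider (provider-specific step)

# 1. create a single-node cluster via kubeadm on that VM
ssh "root@${NODE_IP}" kubeadm init
scp "root@${NODE_IP}:/etc/kubernetes/admin.conf" ./kubeconfig
export KUBECONFIG="${PWD}/kubeconfig"

# 2. deploy a machine-controller image built from git HEAD (hypothetical make target)
make docker-image
kubectl apply -f examples/machine-controller.yaml

# 3. verify it is running
kubectl -n kube-system rollout status deployment/machine-controller

# 4. teardown: delete the VM and the ssh key at the cloud provider (provider-specific)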

Add end-to-end testing to CircleCI pipeline

To better detect whether PRs introduce bugs, we should add the existing end-to-end tests to the CircleCI pipeline.

This requires:

  • Configuring the SSH keypair name at the cloud provider with a random prefix
  • Deleting the SSH keypair before ending the e2e test
  • Adding the test-e2e target to CircleCI

Parse versions via a semver library

We need to parse the user-given versions (kubelet & container runtime) via a semver library to process them correctly, in particular so that we can accept both v1.9.2 and 1.9.2 as input.
Currently we require the kubelet version to have a leading v, but we don't require it for the container runtime version.

Hetzner E2E tests

Add the following test cases to the existing E2E test suite.

  • Ubuntu + Docker 1.13
  • Ubuntu + Docker 17.03
  • Ubuntu + CRI-O 1.9

Also clean up .circle/config.yaml so that it doesn't run the test using the create-and-destroy-machine.sh script.

Extend e2e testing

We should add the following test cases:

  • Hetzner
    • Ubuntu + Docker 1.13
    • Ubuntu + Docker 17.03
    • Ubuntu + CRI-O 1.9
  • DigitalOcean
    • Ubuntu + Docker 1.13
    • Ubuntu + Docker 17.03
    • Ubuntu + CRI-O 1.9
    • CoreOS + Docker 1.13
    • CoreOS + Docker 17.03
  • AWS
    • Ubuntu + Docker 1.13
    • Ubuntu + Docker 17.03
    • Ubuntu + CRI-O 1.9
    • CoreOS + Docker 1.13
    • CoreOS + Docker 17.03
  • OpenStack (we need a sponsor here)
    • Ubuntu + Docker 1.13
    • Ubuntu + Docker 17.03
    • Ubuntu + CRI-O 1.9
    • CoreOS + Docker 1.13
    • CoreOS + Docker 17.03

Schedule nightly E2E test runs

Since running the complete e2e suite takes too long, as a temporary step we could schedule a nightly test run. Running the tests frequently would increase confidence and hopefully reveal potential issues early.

Use `kubeadm join` instead of manually maintaining kubelet config

Right now we maintain the kubelet config as part of the distro-specific templates. This has some drawbacks:

  • We may miss important configuration parameters
  • Whenever we change something, we have to change it at multiple places
  • There is no way to have different configs based on Kubelet version

Instead, it would be easier to just use kubeadm join to configure the kubelet, roughly as sketched below.
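For illustration, bootstrapping a worker would then boil down to a single command in the user data, with a bootstrap token created on the control plane; the concrete values below are placeholders:

# on a control-plane node: create a bootstrap token and print the matching join command
kubeadm token create --print-join-command

# on the new worker (all values are placeholders)
kubeadm join <apiserver-endpoint>:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash>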

Move cloudprovider secrets out of machine definition and into a secret

Right now the machine definition contains all access secrets to the cloud provider it is spawned on. This has two drawbacks:

  • Anyone who should have permission to create machines has to know these credentials
  • There are some objects (e.g. security groups, SSH keys) whose lifetime is not coupled to a single machine but to the usage of the cloud provider: as long as any machine uses that cloud provider, they have to exist

Instead, we want to move the cloud provider credentials into an actual Secret which is then referenced by machines, roughly as sketched below.
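A minimal sketch of that split, reusing the secretKeyRef syntax shown in the cloud-provider credentials issue further below; Secret and key names are illustrative:

apiVersion: v1
kind: Secret
metadata:
  name: machine-controller-aws      # illustrative
  namespace: kube-system
stringData:
  accessKeyId: "foo"
  secretAccessKey: "bar"
---
# machine spec excerpt: reference the Secret instead of embedding the literal value
spec:
  providerConfig:
    cloudProvider: "aws"
    cloudProviderSpec:
      accessKeyId:
        secretKeyRef:
          namespace: kube-system
          name: machine-controller-aws
          key: accessKeyId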

deadlock when trying to delete a machine

Steps to reproduce:

  1. Create an invalid machine - you can use the following manifest that doesn't specify required credentials https://github.com/kubermatic/machine-controller/blob/master/examples/machine-digitalocean.yaml
  2. Delete the previously created machine.
  3. List machine resources

Result:
The machine was not deleted, and after the delete was triggered the controller keeps logging machine1 failed with: failed to get instance for machine machine1.
The only way out of this situation is to manually remove the finalizers from the machine object, for example with the patch shown below.
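A hedged example of that manual workaround (machine1 is the name from this report; list your machines with kubectl get machines to find the stuck one):

# clears all finalizers on the stuck machine so the deletion can complete
kubectl patch machine machine1 --type=merge -p '{"metadata":{"finalizers":null}}'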

In general, the described state exists because we add finalizers to a machine before creating the node, in order to prevent deletion of the machine resource.

Since the call that requests a node can fail for many reasons, this issue can serve to track the discussion of possible solutions.

Trigger events

With the implementation of transient and terminal errors we now correctly set machine.status.errorReason & machine.status.errorMessage when the controller runs into a terminal error.

Transient errors, though, are not reported back; the only way to see them is by investigating the logs.
Instead of just logging, we should emit an event which is attached to the machine.

Do not use cluster-info configmap anymore

Right now the machine-controller uses the cluster-info configmap to get the CACert and the endpoint for the apiserver.

Instead it should get the CACert from its kubeconfig or from /run and the apiserver endpoints from its kubeconfig or from the endpoints of the kubernetes service when running in-cluster.

This will reduce the configuration overhead and help people get started faster.

Defaulting for Openstack

Usage of the OpenStack provider would be easier if there were defaulting for:

  • availabilityZone
  • Region
  • Network
  • Subnet
  • FloatingIPPool

To achieve this, the machine-controller should request a list of the given resource, check if there is exactly one, and if so, default to it.

Make container runtime version optional

Based on the entered Kubernetes version and the selected OS, we should default to a Docker/CRI-O version.

For now the logic should be:

  • cri-o
    • Kubernetes v1.8 + Ubuntu 16.04 -> error, as there's no cri-o 1.8 package in the repos
    • Kubernetes v1.9 + Ubuntu 16.04 -> cri-o 1.9
    • Kubernetes v1.8 + Container Linux -> error, as there's no cri-o for Container Linux
    • Kubernetes v1.9 + Container Linux -> error, as there's no cri-o for Container Linux
  • docker
    • Kubernetes v1.8 + Ubuntu 16.04 -> docker 1.13
    • Kubernetes v1.9 + Ubuntu 16.04 -> docker 1.13
    • Kubernetes v1.8 + Container Linux -> docker 1.12
    • Kubernetes v1.9 + Container Linux -> docker 1.12

simple e2e test tool

Having a simple command-line tool that would verify whether a node has been created serves not only as a good warm-up exercise but also as a handy test tool.

The idea is that we would have a list of predefined machine manifests that need some customisation in terms of credentials. The credentials could be accepted as command-line arguments and passed all the way down to the manifests. After POSTing the given manifests to the kube-apiserver, the test tool would read the current cluster state in order to determine the correctness of machine-controller.

The test tool would use the standard client-go library to talk to the API server and would read the kubeconfig file to discover where the cluster is actually located.

Assumptions:

  • the cluster was created manually
  • the kubeconfig is accessible
  • there is a list of predefined machine manifests

For example, running verify -input path_to_manifest -parameters key=value,key2=value would print a machine "node-docker" has been created to stdout.

Allow to specify ssh-key via flag

Current state:
On initial start, we check whether a secret with a private SSH key exists.
If no secret is found, we generate a secret with a private key.

This SSH key is later used when creating instances at cloud providers.
This was done so the user does not have to specify an SSH public key in the machine manifest, as some cloud providers (e.g. AWS) require a public key when creating an instance.

All public keys from the machine manifest are deployed via cloud-init.

Desired state:
The controller should accept a path to a private key via a command-line flag.
If the flag is specified and a valid key is found, that key should be used.
If no flag was specified or the key was not found, the old secret-based logic should apply.

Extend circle pipeline

What's missing:
Building a Docker image

  • On push: use the commit hash as the Docker tag
  • On tag: use the git tag as the Docker tag, plus latest

process machines which were annotated

The machine-controller has been incorporated into Kubermatic and is an inherent part of every cluster. That makes local development/testing impossible, as every machine is processed by the in-cluster machine-controller.

We could annotate a machine manifest with some arbitrary data and at the same time introduce a new command-line flag. On a successful match, a controller should process the machine; otherwise it should leave it to others. An empty annotation means there is no preference.

Make running machine-controller through leader election optional

The machine-controller has been incorporated into Kubermatic and is an inherent part of every cluster.
That makes local development/testing impossible, as it is highly likely that the machine-controller running inside Kubermatic will acquire the lock right before the local instance does.

Making leader election optional seems to remedy this issue.

Create temporary ssh key during instance creation when required. Delete afterwards

Current state:
On initial start, we check whether a secret with a private SSH key exists.
If no secret is found, we generate a secret with a private key.

This SSH key is later used when creating instances at cloud providers.
This was done so the user does not have to specify an SSH public key in the machine manifest, as some cloud providers (e.g. DigitalOcean) require a public key when creating an instance.

All public keys from the machine manifest are deployed via cloud-init.

Desired state:
The whole SSH key logic should be removed.
If a cloud provider requires an SSH key during instance creation:

  • Create a temporary key before the instance gets created
  • Use the temporary key for instance creation
  • Delete the temporary key after the instance has been created

add prometheus metrics

Add the following metrics:

  • Total number of errors
  • Total number of machines
  • Total number of nodes
  • Time it takes to create/delete an instance at the cloud provider
  • Time difference between node.CreationTimestamp and machine.CreationTimestamp

Test and document CentOS on vSphere

As a user, I want to be able to spin up worker nodes on vSphere that use CentOS as their distribution.

Acceptance criteria:

  • There is documentation on how to import/create a suitable CentOS image for vSphere
  • An image in our vSphere test cluster was created by following the documented steps
  • The e2e tests are extended to also test CentOS on vSphere

e2e tests modify manifest by providing a field selector

At the moment tests replace desired fields in the manifest based on string matching. For example:

params = fmt.Sprintf("%s,<< MACHINE_NAME >>=%s,<< NODE_NAME >>=%s", params, machineName, nodeName)
params = fmt.Sprintf("%s,<< OS_NAME >>=%s,<< CONTAINER_RUNTIME >>=%s,<< CONTAINER_RUNTIME_VERSION >>=%s", params, testCase.osName, testCase.containerRuntime, testCase.containerRuntimeVersion)

We would like to change that by providing the field path instead, for example spec.providerConfig.cloudProvider. This would not only look better but would also allow consuming the manifests under the examples directory.

vSphere E2E tests

Add the following test cases to the existing E2E test suite.

  • Ubuntu + Docker 1.13
  • Ubuntu + Docker 17.03
  • Ubuntu + CRI-O 1.9

Also clean up .circle/config.yaml so that it doesn't run the test using the create-and-destroy-machine.sh script.

Add support for accepting cloud-provider credentials as EnvVars

The Machine object accepts multiple sources for cloudProviderSpec fields:

  • Direct value
...
spec:
...
  providerConfig:
    cloudProvider: "aws"
    cloudProviderSpec:
      accessKeyId: "foo"
  • Secret ref
...
spec:
...
  providerConfig:
    cloudProvider: "aws"
    cloudProviderSpec:
      accessKeyId:
        secretKeyRef:
          namespace: kube-system
          name: machine-controller-aws
          key: accessKeyId
  • ConfigMap ref
...
spec:
...
  providerConfig:
    cloudProvider: "aws"
    cloudProviderSpec:
      accessKeyId:
        configMapKeyRef:
          namespace: kube-system
          name: machine-controller-aws
          key: accessKeyId

It should also be possible to pass in the secret values implicitly as environment variables.
The secret values differ per cloud provider.

  • AWS
    • Access Key ID
    • Secret Access Key
  • Hetzner
    • Token
  • Digitalocean
    • Token
  • OpenStack
    • Username
    • Password

Each secret field needs one specific environment key, like AWS_ACCESS_KEY_ID.
During processing of the cloudProviderSpec we would need to check whether the environment variable is set, and if so, use its value (a sketch of how such a variable could be injected follows below).

Reason: in scenarios where the master components are managed by an external entity (Loodse Kubermatic / SAP Gardener), it might not be possible to expose the cloud-provider-specific secrets to the users.
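A minimal sketch of how such a variable could be injected into the machine-controller Deployment, assuming the machine-controller-aws Secret from the examples above; AWS_ACCESS_KEY_ID is the key name proposed in this issue:

# Deployment pod template excerpt (illustrative)
spec:
  containers:
    - name: machine-controller
      env:
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: machine-controller-aws
              key: accessKeyId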

Not possible to delete machines using AWS

We have 2 clusters on dev.kubermatic.io which cannot be deleted because the machine-controller is not able to delete the machines.

Logs:
kubectl -n cluster-dt56ds7tsb logs machine-controller-559788b7f9-89q9v

E0411 07:29:28.561133       1 machine.go:200] machine-kubermatic-dt56ds7tsb-gf4xr failed with: failed to delete machine at cloudprovider, due to instance not found
E0411 07:29:28.594842       1 machine.go:200] machine-kubermatic-dt56ds7tsb-d5pgz failed with: failed to delete machine at cloudprovider, due to instance not found
E0411 07:29:28.613675       1 machine.go:200] machine-kubermatic-dt56ds7tsb-64ql4 failed with: failed to delete machine at cloudprovider, due to instance not found

Make security-group creation on AWS a fallback

We need to add a config variable for the securityGroups and should only create a security group on AWS when none is defined, as a convenience/quickstart help.
We should probably also log this at log level 2.

OpenStack: Floating IPs are not reused, which may result in FIP exhaustion

Basically the title; from the machine-controller log:

E0120 13:47:35.431740       1 machine.go:162] machine-controller failed with: failed to create machine at cloudprovider: failed to allocate a floating ip: Expected HTTP response code [201 202] when accessing [POST http://192.168.0.39:9696/v2.0/floatingips], but got 409 instead
{"NeutronError": {"message": "No more IP addresses available on network 06fb6e98-4e98-4320-9f00-34e028ed53cb.", "type": "IpAddressGenerationFailure", "detail": ""}}

I'd expect the machine-controller to reuse already assigned but unused FIPs instead of requesting a new one.
