
projectcontour / gimbal

662 stars · 41 watchers · 92 forks · 30.39 MB

Gimbal is an ingress load balancing platform capable of routing traffic to multiple Kubernetes and OpenStack clusters. Built by Heptio in partnership with Actapio.

Home Page: https://github.com/projectcontour/gimbal

License: Apache License 2.0

Languages: Go 98.77%, Makefile 0.81%, Dockerfile 0.42%
Topics: kubernetes, envoy, ingress, openstack, loadbalancer

gimbal's People

Contributors

alexbrand, bsteciuk, castrojo, davecheney, etiennecoutaud, jeremyrickard, jonasrosland, krisdock, poidag-zz, quiye, rosskukulinski, stevesloka, sunjaybhatia, tinygrasshopper, vmogilev, yutaokaz


gimbal's Issues

Discovery of K8s namespaces that do not exist in the Gimbal cluster fails

Given that the Kubernetes discoverer does not create namespaces in the Gimbal cluster, and that the discoverer is watch-based, any backend services that live in namespaces that do not exist in the Gimbal cluster will not be discovered unless the discoverer is restarted.

As a Gimbal operator, I would like the ability to define these missing namespaces after the discoverer has started and have the discoverer retry the creation of these Service/Endpoints objects.

Sample RBAC permissions for team namespace

Need to define some sample RBAC permissions that can be applied to team namespaces. These would be applied to namespaces in the Gimbal cluster and should only allow users the following permissions (a sketch follows the list):

  • Ingress: "get", "list", "watch", "create", "update", "patch", "delete"
  • Services and Endpoints: "get", "list", "watch"
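
A minimal sketch of such a Role, assuming Ingress is still served from the extensions API group (as it was at the time); the Role name and team namespace are illustrative:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-member   # illustrative name
  namespace: team-a   # illustrative team namespace
rules:
# Full control over Ingress objects in the team namespace
- apiGroups: ["extensions"]
  resources: ["ingresses"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
# Read-only access to discovered Services and Endpoints
- apiGroups: [""]
  resources: ["services", "endpoints"]
  verbs: ["get", "list", "watch"]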

Solicit feedback on supporting upstreams with non-routable Pod/VM IPs

At launch, Gimbal requires that discovered application VMs (OpenStack) and Pods (Kubernetes) have routable IPs that can be reached from the Gimbal cluster. While this is sufficient for our initial user and other users with flat IP namespaces, it will likely prohibit other common scenarios.

We should extend Gimbal to support Kubernetes and OpenStack deployments that do not provide routable IPs. This could include clusters that use an overlay network (e.g. Weave or Flannel) or that simply do not provide routable IPs.

One proposed solution would be to configure a GRE tunnel per upstream cluster.

The goal of this issue is to solicit feedback from the community about their deployments and use cases so that we can design a viable solution.

Namespace ignore list

Add the ability for the discoverer to read a ConfigMap or argument list of namespaces that it should not watch for changes (a sketch follows).
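
As a sketch, the list could be passed as a container argument in the discoverer Deployment; note that --ignore-namespaces is a hypothetical flag name, while --cluster-name already exists:

containers:
- name: discoverer
  image: <discoverer-image>   # placeholder
  args:
  - --cluster-name=remote-cluster-a               # existing flag
  - --ignore-namespaces=kube-system,kube-public   # hypothetical flag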

Cleanup deployment directory

  • Update Contour to v0.4.1 (not the routecrd branch)
  • Ordered YAML creation for Prometheus & Grafana (e.g. 01-, 02-, …)

Allow Prometheus data to persist to PVC

Currently, Prometheus data is stored in the pod's temporary storage. We need to provide an example that mounts the data on a node's local storage. The deployment should be pinned to the same node, with PVs/PVCs used accordingly (a sketch follows).
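
A minimal sketch, assuming an illustrative namespace and storage size:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data        # illustrative name
  namespace: gimbal-monitoring # illustrative namespace
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 50Gi            # illustrative size

The Prometheus deployment would then swap its temporary volume for the claim:

volumes:
- name: prometheus-data
  persistentVolumeClaim:
    claimName: prometheus-data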

Multi-team functionality through IngressRoute Delegation

Teams should not be able to use Virtual Hosts or Services that do not belong to them.

Adding a new team

  1. Administrator creates a like-named Namespace and Tenant in all OpenStack, Kubernetes, and Gimbal clusters and provides RBAC credentials as necessary to the new team
  2. Administrator defines the Virtual Hosts the new team may use in the Discovery Custom Resource Definition
  3. Team deploys applications and then configures their Route CRD based on the discovered Services

Contour can be put into an enforcing mode where only whitelisted namespaces may have root IngressRoutes. A sketch of the delegation model follows.
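
A rough sketch using Contour's IngressRoute delegation; the FQDN, namespaces, and Service names are illustrative:

apiVersion: contour.heptio.com/v1beta1
kind: IngressRoute
metadata:
  name: team-a-root
  namespace: root-routes   # illustrative whitelisted root namespace
spec:
  virtualhost:
    fqdn: app.team-a.example.com
  routes:
  - match: /
    delegate:              # hand the route off to the team's namespace
      name: team-a-routes
      namespace: team-a
---
apiVersion: contour.heptio.com/v1beta1
kind: IngressRoute
metadata:
  name: team-a-routes
  namespace: team-a
spec:
  routes:
  - match: /
    services:
    - name: hello-origin-k8s   # a discovered Service
      port: 80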

Liveness Probes

We should look at the discoverer deployments and define probes that assist with health. We have the metrics components, but if something starts to error, it would be good to have the pods attempt to self-heal in addition to reporting error status. A sketch follows.
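
For example, if the discoverers exposed an HTTP health endpoint (the path and port here are hypothetical), each Deployment could add:

livenessProbe:
  httpGet:
    path: /healthz   # hypothetical health endpoint
    port: 8080       # hypothetical port
  initialDelaySeconds: 10
  periodSeconds: 10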

Performance benchmarking

Gimbal needs to support internet-scale workloads. As such, we should test Gimbal to ensure that it is capable of handling the following:

  • Ingress: 10s of Gbps
  • Egress: 10s of Gbps
  • X million concurrent connections
  • Latency: p99 maximum 20ms-30ms RTT

We should document a suggested hardware footprint to support this amount of traffic.

Enable route configuration across multiple services

The Kubernetes Ingress object allows only a single Service per route path. For Gimbal, users should be able to define multiple Services (from Kubernetes or OpenStack clusters) that will receive traffic for a given route, and to define the load balancing strategy (RoundRobin, Random, Least load, EWMA).

This enables a wide variety of use cases, including weight-shifting and A/B testing, as sketched below.
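
A sketch of what this could look like on an IngressRoute, assuming per-service weight fields; the Service names are illustrative:

apiVersion: contour.heptio.com/v1beta1
kind: IngressRoute
metadata:
  name: checkout
  namespace: team-a
spec:
  virtualhost:
    fqdn: shop.example.com
  routes:
  - match: /
    services:
    - name: checkout-primary-k8s        # illustrative discovered Service
      port: 80
      weight: 90                        # 90% of traffic
    - name: checkout-canary-openstack   # illustrative discovered Service
      port: 80
      weight: 10                        # 10% of traffic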

Rename --cluster-name flags on discoverers

The discovery naming conventions use "backend name" instead of "cluster name". We had a brief discussion on #102 about renaming the --cluster-name flag on the discoverers to --backend-name, so opening this issue to finish up that discussion.

Alertmanager docs

Should have examples of how to test alerts in Alertmanager so that once it's deployed, users can validate it is functioning correctly.

Additional Metrics

Is this a BUG REPORT, PERFORMANCE REPORT or FEATURE REQUEST?:

Feature Request

Need additional metrics exposed via discoverer:

  • upstream-services{cluster, tenant/namespace} -- number of upstream services, labeled by cluster and tenant/namespace
  • replicated-services{cluster, tenant/namespace} -- number of services replicated to the Gimbal cluster, labeled by cluster and tenant/namespace
  • invalid-services{cluster, tenant/namespace} -- number of services unable to be replicated, labeled by cluster and tenant/namespace
  • upstream-endpoints{cluster, tenant/namespace, service} -- number of endpoints (meaning IP:Port) in the upstream, labeled by cluster, namespace, and service name
  • replicated-endpoints{cluster, tenant/namespace, service} -- number of endpoints (meaning IP:Port) replicated to the Gimbal cluster, labeled by cluster, namespace, and service name
  • invalid-endpoints{cluster, tenant/namespace, service} -- number of endpoints (meaning IP:Port) unable to be replicated to Gimbal, labeled by namespace and service name

OpenStack discoverer logs incorrect updates

The OpenStack discoverer logs updates for services when no updates have happened. The following example shows the load balancer being added, then an update logged on each cycle even though it wasn't updated in OpenStack.

time="2018-05-08T13:10:17Z" level=info msg="Successfully handled: add service 'demo/lb1-2ad78ab3-aa94-43e4-a764-194a67601d16-openstack'"
time="2018-05-08T13:10:17Z" level=info msg="Successfully handled: add endpoints 'demo/lb1-2ad78ab3-aa94-43e4-a764-194a67601d16-openstack'"
time="2018-05-08T13:10:47Z" level=info msg="Successfully handled: update service 'demo/lb1-2ad78ab3-aa94-43e4-a764-194a67601d16-openstack'"
time="2018-05-08T13:11:17Z" level=info msg="Successfully handled: update service 'demo/lb1-2ad78ab3-aa94-43e4-a764-194a67601d16-openstack'"
time="2018-05-08T13:11:47Z" level=info msg="Successfully handled: update service 'demo/lb1-2ad78ab3-aa94-43e4-a764-194a67601d16-openstack'"
time="2018-05-08T13:12:19Z" level=info msg="Successfully handled: update service 'demo/lb1-2ad78ab3-aa94-43e4-a764-194a67601d16-openstack'"
time="2018-05-08T13:12:47Z" level=info msg="Successfully handled: update service 'demo/lb1-2ad78ab3-aa94-43e4-a764-194a67601d16-openstack'"
time="2018-05-08T13:13:21Z" level=info msg="Successfully handled: update service 'demo/lb1-2ad78ab3-aa94-43e4-a764-194a67601d16-openstack'"
time="2018-05-08T13:13:47Z" level=info msg="Successfully handled: update service 'demo/lb1-2ad78ab3-aa94-43e4-a764-194a67601d16-openstack'"
time="2018-05-08T13:14:17Z" level=info msg="Successfully handled: update service 'demo/lb1-2ad78ab3-aa94-43e4-a764-194a67601d16-openstack'"

Discoverer: Initialize gauge metrics to zero

Some of the metrics exposed by the discoverer components should probably be initialized to zero so that they show up in Grafana. An example that comes to mind is the number of discovery errors.

OpenStack discoverer: Improve handling of LB and Listener names

We use the LBaaS listener name as the port name. When the name is long enough, it can result in an invalid port name.

time="2018-04-26T14:54:23Z" level=error msg="error handling update endpoints 'vioadmin/1-61fb2abd-f007-4dd2-8784-9db807b27e4d-openstack': Endpoints "1-61fb2abd-f007-4dd2-8784-9db807b27e4d-openstack" is invalid: subsets[1].ports[0].name: Invalid value: "k8s-registry-listener-a2fd7571-5091-4a1f-8a65-610250ce8877-11000": must be no more than 63 characters"

Document sample RBAC rules for remote k8s clusters

For the Kubernetes discoverer, we should document RBAC rules that could be applied to remote clusters that are going to be discovered. These need to be documented in the docs, with samples provided in the deployment directory. A sketch follows.
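
A minimal sketch of the read-only access a discoverer needs on a remote cluster; the ClusterRole name is illustrative:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: gimbal-discoverer   # illustrative name
rules:
# The discoverer only needs to read Services and Endpoints
- apiGroups: [""]
  resources: ["services", "endpoints"]
  verbs: ["get", "list", "watch"]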

Secure envoy admin interface

#89 adds network policies to limit who can access the admin interface (to gather metrics); however, it would be better to fully secure the interface. There's an open issue in Envoy; adding it here to track: envoyproxy/envoy#2763

Persist grafana dashboards and make them editable

Currently, Grafana dashboards are deployed via the dashboard provisioning feature. This is great because the dashboards come pre-loaded with the Grafana deployment. The downside, however, is that Grafana does not support modifying these dashboards and persisting the changes (we are also using ConfigMaps, which are read-only).

Ideally, the Grafana dashboards could be modified and the changes persisted to something like a PV+PVC.

Endpoints discovered from Kubernetes include nodeName and targetRef

Endpoints discovered by the kubernetes-discoverer include the nodeName and the targetRef, which reference resources that do not exist in the Contour cluster. NodeName could be helpful for future features, but I agree that targetRef should be removed.

apiVersion: v1
kind: Endpoints
metadata:
  creationTimestamp: 2018-02-21T22:23:20Z
  labels:
    contour.heptio.com/cluster: origin-k8s
    contour.heptio.com/namespace: default
    contour.heptio.com/service: hello
  name: hello-origin-k8s
  namespace: default
  resourceVersion: "236374"
  selfLink: /api/v1/namespaces/default/endpoints/hello-origin-k8s
  uid: d28f23da-1755-11e8-ab00-f80f4182762e
subsets:
- addresses:
  - ip: 1.2.3.4
    nodeName: worker03
    targetRef:
      kind: Pod
      name: hello-hjn8n
      namespace: default
      resourceVersion: "905453"
      uid: 713bc282-16ce-11e8-a53a-fa163eab6398

Metrics by VHost

Is this a BUG REPORT, PERFORMANCE REPORT or FEATURE REQUEST?:

Feature Request

It would be good to view metrics from a team perspective or by VHost, so I could see requests per second against a VHost + path. Right now, metrics support allows those numbers against a backend, but that might not show the whole picture of what a team would be interested in.

Setup DNS for internal routing

Gimbal is focused on handling Ingress traffic. But another use-case is to handle internal application routing for multiple clusters.

A good example: an application is deployed to a cluster. If maintenance is required on that cluster, spin the application up on a second cluster and slowly transition work over to it. By providing a single DNS endpoint within the cluster, consumers of the application don't need to change any logic.

// cc @hhoover

Split Envoy from Contour

Each instance of Contour creates watches on Services, Endpoints, etc. Because we need to scale Envoy to handle more traffic, Contour currently scales with each instance of Envoy. We should split them apart to allow Envoy to scale separately as needed.

Contour allows us to specify the gRPC endpoint (https://github.com/heptio/contour/blob/master/cmd/contour/contour.go#L59); however, it's not secured.

Also, Envoy shouldn't run under the Contour service account, since it no longer needs the same access to the Kubernetes API.

Allowed namespace list

Add the ability for the discoverer to read a ConfigMap or argument list of namespaces that it should watch for changes (a ConfigMap-based sketch follows).
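
Complementing the args-based sketch for the ignore list, a hypothetical ConfigMap variant; the names and the watch-namespaces key are illustrative:

apiVersion: v1
kind: ConfigMap
metadata:
  name: discoverer-config      # illustrative name
  namespace: gimbal-discovery  # illustrative namespace
data:
  watch-namespaces: "team-a,team-b"   # hypothetical key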

Use kube-state-metrics for counting the number of services and endpoints discovered

The current Grafana dashboard displays the total number of services and endpoints that have been discovered. The dashboard computes these numbers based on the event_timestamp metric, which indicates the last time a service or endpoint was updated.

The problem with using this metric to compute the count is that if the discoverer is restarted, the Grafana dashboard will display zero until the services/endpoints are updated (because the "event_timestamp" metric will not be set).

Inspect OpenStack status fields when discovering services

OpenStack resources have a couple of status fields that we might have to inspect to determine whether we should route traffic to that LB or endpoint.

Load balancers have admin_state_up, provisioning_status and operating_status fields. Listeners, pools and pool members all have an admin_state_up field.

Metrics: API Latency

Need to implement an API latency metric: the milliseconds it takes for requests to return from a remote discoverer API (e.g. OpenStack).

Labels:

  • clustername
  • clustertype

Total Endpoints Discovered graph misleading

Is this a BUG REPORT, PERFORMANCE REPORT or FEATURE REQUEST?:

bug report

What happened:

Deployed master of Gimbal and looked at the Grafana dashboard for Gimbal Discovery. The Total Endpoints Discovered graph is misleading because Endpoints objects are not the same as the endpoint addresses available.

Scaling up a deployment in an upstream cluster from 1 Pod to 10 Pods results in the total endpoints graph remaining flat at 1.

What you expected to happen:

The graph should increase as remote deployments are scaled up or down.

As the term Endpoints is overloaded, we might consider renaming Total Endpoints to Remote Endpoint Addresses.

Anything else we should know:

Right now we're graphing kube_endpoint_labels, but we should be using kube_endpoint_address_available. The catch is that the kube-state-metrics code doesn't provide labels (which include the cluster name) on the kube_endpoint_address_available metric.

Documentation of service discovery naming conventions

Related to #71

We don't currently have any documentation about the naming schemes used by the discoverers. Discoverers should have a consistent naming convention, and we should document it accordingly.

e.g.

${discoverer-prefix}-${service-name}-${cluster-name}

Update dashboards

Some new metrics have been added; it would be good to show them in the included dashboards:

  • API Latency (OpenStack)
  • Cycle Duration (OpenStack)
  • Queue Size

Metrics: QueueSize

Need to implement the number of items in the process queue, with the following labels:

  • namespace
  • clustername
  • clustertype

Discoverer watch for changes to secret

The discoverer takes in a Secret containing access credentials for a remote cluster. It needs to watch for changes to this Secret and update its configuration accordingly, to allow for credential rotation. It could watch for file changes on disk or use a controller (a volume-based sketch follows).
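
One option is to mount the Secret as a volume: the kubelet eventually refreshes mounted Secret contents, so the discoverer can watch the files on disk. The Secret name and mount path are illustrative:

volumes:
- name: remote-credentials
  secret:
    secretName: remote-cluster-credentials   # illustrative name
containers:
- name: discoverer
  volumeMounts:
  - name: remote-credentials
    mountPath: /etc/gimbal/credentials       # illustrative path
    readOnly: true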

Metrics: Cycle Duration

Need to implement Cycle Duration: the milliseconds it takes for all objects to be synced from a remote discoverer API (e.g. OpenStack).

Labels:

  • clustername
  • clustertype

[Discoverers] Better handling of whitespace in --cluster-name field

It's possible to specify a cluster name that includes characters invalid for a Service (e.g. whitespace such as a newline). We should probably either strip whitespace (newlines are a common mistake in shell scripting) or refuse to start the discoverer due to an invalid cluster-name.

time="2018-04-09T01:33:26Z" level=error msg="error handling add service \"default/hello-bgp-cluster\n\": Service \"hello-bgp-cluster\\n\" is invalid: [metadata.name: Invalid value: \"hello-bgp-cluster\\n\": a DNS-1035 label must consist of lower case alphanumeric characters or '-', start with an alphabetic character, and end with an alphanumeric character (e.g. 'my-name',  or 'abc-123', regex used for validation is '[a-z]([-a-z0-9]*[a-z0-9])?'), metadata.labels: Invalid value: \"bgp-cluster\\n\": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue',  or 'my_value',  or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')]"

Add issue templates

Create GitHub issue template(s) for common issue types: bug, performance, feature

Active Health Checking of Endpoints

While Kubernetes Services support native health checking, other upstream systems, including OpenStack, do not.

Gimbal should allow active health checking of upstream endpoints, grouped by Service (a sketch follows).
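
Contour's IngressRoute already models a per-service healthCheck block that could serve as a reference; the values here are illustrative:

routes:
- match: /
  services:
  - name: hello-origin-k8s   # a discovered Service
    port: 80
    healthCheck:
      path: /healthz             # health endpoint on the upstream
      intervalSeconds: 5
      timeoutSeconds: 2
      unhealthyThresholdCount: 3
      healthyThresholdCount: 2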

