
vmware-haproxy's Introduction

VMware + HAProxy

This project enables customers to build an OSS virtual appliance with HAProxy and its Data Plane API designed to enable Kubernetes workload management with Project Pacific on vSphere 7.

Download

The latest version of the appliance OVA is always available from the releases page:

NOTE

If running on or upgrading to vSphere 7.0.1 or later, you must upgrade to version v0.1.9 or later.

Version SHA256
v0.2.0 07fa35338297c591f26b6c32fb6ebcb91275e36c677086824f3fd39d9b24fb09
v0.1.10 81f2233b3de75141110a7036db2adabe4d087c2a6272c4e03e2924bff3dccc33
v0.1.9 f3d0c88e7181af01b2b3e6a318ae03a77ffb0e1949ef16b2e39179dc827c305a
v0.1.8 eac73c1207c05aeeece6d17dd1ac1dde0e557d94812f19082751cfb6925ad082

Deploy

Refer to the system requirements and the installation documentation.

For a tutorial on deploying and using the HAProxy load balancer in vSphere with Tanzu, check out the vSphere with Tanzu Quick Start Guide.

Build

Documentation on how to build the appliance is available here.

Test

Documentation on how to test the components in the appliance with Docker containers is available here.

Configure

Documentation on how to configure the Virtual IPs managed by the appliance is available here.

Upgrade

Documentation on recommended upgrade procedures can be found here.

vmware-haproxy's People

Contributors

akutz, andrewsykim, angarg, brakthehack, corrieb, daniel-corbett, hkumar1402, lparis, mayankbh, monofuel, pnarasimhapr


vmware-haproxy's Issues

Tool for collecting diagnostics information

User Stories

As an Operator,
I would like a tool that collects diagnostics information related to the HAProxy appliance,
Because I need to be able to triage issues when they occur.

Details

Currently there is no good solution for collecting the diagnostics information that helps when triaging or root-causing issues on or related to the HAProxy appliance. We need a tool that can do this, and it should exist on the appliance. At a minimum, the tool should collect the information we already gather manually when triaging issues:

  1. SSH to the HAProxy appliance.

  2. Save the HAProxy service log to disk:

    sudo journalctl -xu haproxy | tee /var/log/haproxy.log
  3. Create a file with information about the version of HAProxy:

    { { rpm -qa haproxy || true; } &&
      { command -v haproxy >/dev/null 2>&1 && haproxy -vv || /usr/sbin/haproxy -vv; }; \
    } | \
      sudo tee /etc/haproxy/haproxy-version
  4. Create a file with information about the version of the Data Plane API:

    { command -v dataplaneapi >/dev/null 2>&1 && dataplaneapi --version || /usr/local/bin/dataplaneapi --version; } | \
      sudo tee /etc/haproxy/dataplaneapi-version
  5. Create a file with information about the network configuration:

    { echo '--- IP TABLES ---' && \
      { iptables-save || iptables -S; } && \
      echo '--- IP ADDRS ---' && \
      ip a && \
      echo '--- IP ROUTES ---' && \
      ip r && \
      echo '--- IP ROUTE TABLE LOCAL ---' && \
      ip r show table local && \
      echo '--- IP ROUTE TABLES ---' && \
      for table_name in $(grep 'rtctl_' /etc/iproute2/rt_tables | awk '{print $2}'); do echo "${table_name}" && ip route show table "${table_name}"; done && \
      echo '--- OPEN PORTS ---' && \
      sudo lsof -noP | grep LISTEN; } | \
      sudo tee /var/log/network-info.log
  6. Create a compressed tarball that includes several files and directories:

    sudo tar -C / -czf "${HOME}/haproxy-diag.tar.gz" \
      /etc/haproxy \
      /var/log/haproxy.log \
      /var/log/network-info.log

    Please note, the above command will return a non-zero exit code if any of the above directories or files do not exist, but a tarball will still be created with the content that does exist.

  7. Validate that the tarball created above includes the requested content:

    sudo tar tzf "${HOME}/haproxy-diag.tar.gz"

    The tarball should include some or all of the directories and files listed above.

  8. Rename the tarball to include the timestamp and host name of the VM on which it was created:

    sudo mv "${HOME}/haproxy-diag.tar.gz" "${HOME}/haproxy-diag-$(hostname --fqdn)-$(date +%s).tar.gz"
  9. Copy the tarball from the remote VM to a local location using the scp program.

Requirements

The following are the minimum known requirements; additional information or requirements may need to be added as well. A rough sketch of a collection script covering these steps appears after the list:

  • A tool that collects at minimum the information listed in the Details section:
    • The versions of HAProxy and Data Plane API
    • The /etc/haproxy directory
    • The logs for the HAProxy and Data Plane API services
    • Runtime network information, such as:
      • ip tables
      • ip addresses
      • ip policy rules
      • ip routes
      • open ports
  • The tool should be present on the HAProxy appliance
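
For illustration, here is a minimal sketch of such a collection script, stitched together from the manual steps above. The output locations and the assumption that it runs directly as root on the appliance are choices made for the sketch, not a committed design:

#!/usr/bin/env bash
# Hypothetical diagnostics collector for the HAProxy appliance; it simply
# automates the manual steps listed above. Run as root on the appliance.
set -u

out_dir="$(mktemp -d)"

# Service logs
journalctl -xu haproxy      > "${out_dir}/haproxy.log"      2>&1 || true
journalctl -xu dataplaneapi > "${out_dir}/dataplaneapi.log" 2>&1 || true

# Component versions
{ command -v haproxy >/dev/null 2>&1 && haproxy -vv || /usr/sbin/haproxy -vv; } \
  > "${out_dir}/haproxy-version" 2>&1 || true
{ command -v dataplaneapi >/dev/null 2>&1 && dataplaneapi --version || /usr/local/bin/dataplaneapi --version; } \
  > "${out_dir}/dataplaneapi-version" 2>&1 || true

# Runtime network information (iptables, addresses, routes, policy rules, open ports)
{
  echo '--- IP TABLES ---';  iptables-save || iptables -S
  echo '--- IP ADDRS ---';   ip a
  echo '--- IP ROUTES ---';  ip r
  echo '--- IP RULES ---';   ip rule show
  echo '--- OPEN PORTS ---'; lsof -noP | grep LISTEN
} > "${out_dir}/network-info.log" 2>&1 || true

# Bundle /etc/haproxy plus everything collected above into a timestamped tarball
bundle="/root/haproxy-diag-$(hostname --fqdn)-$(date +%s).tar.gz"
tar -czf "${bundle}" -C / etc/haproxy -C "${out_dir}" . 2>/dev/null || true
echo "Diagnostics written to ${bundle}"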

Enable Identity management in vsphere with tanzu

We are trying to implement identity management in a Tanzu Kubernetes cluster using LDAP. We are using vSphere 7.0.2, which comes with vSphere with Tanzu enabled.

How can TKG be integrated with LDAP? Does vSphere with Tanzu support LDAP integration? If not, how can we implement external identity management with vSphere with Tanzu enabled?

Add ability to override DataPlaneAPI build consumed

Today, the packer build process only consumes the DataPlaneAPI binary from a URL.

A URL is passed in via a user variable - https://github.com/haproxytech/vmware-haproxy/blob/master/packer.json#L9

This variable is then used in an Ansible get_url task
https://github.com/haproxytech/vmware-haproxy/blob/master/ansible/roles/haproxy/tasks/main.yml#L15

It would be convenient to be able to use a locally built version of DataPlaneAPI for this process. One approach might be to rename the variable to dataplaneapi_location and allow the Ansible tasks to determine whether it's a file or URL, using copy or get_url as appropriate. (I have a topic branch with this approach that seems to work, though since I'm new to Ansible, it may not be the cleanest.)

FWIW, it's already possible to override the DataPlaneAPI URL by using PACKER_FLAGS to set the correct variable:

PACKER_FLAGS="-var='dataplane_api_url=<url goes here>'"
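
One low-effort workaround until such a change lands: serve the locally built binary over a throwaway HTTP server and point dataplane_api_url at it. A sketch only; the directory, port, and host IP placeholder are assumptions:

# Serve the locally built binary over a throwaway HTTP server (directory and port are assumptions)
cd dataplaneapi/build && python3 -m http.server 8000 &

# Point the existing variable at the local server instead of the release URL,
# then run the normal appliance build
export PACKER_FLAGS="-var='dataplane_api_url=http://<build-host-ip>:8000/dataplaneapi'"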

Use case:

  • Using modified (i.e. not officially released) DataPlaneAPI binaries in an HAProxy OVA build.

Motivation:

  • In cases where a DataPlaneAPI client sets up check-ssl without passing in the check flag, you can end up with a situation where the backend is configured to use SSL for its health checks, but without 'check', it's always considered up and therefore no health checks are performed.
  • In cases where a DataPlaneAPI client always requests SSL for health checks, we may want to disable it from a DataPlaneAPI standpoint (as not all backends may be serving SSL traffic); this could be a workaround until we determine a path forward for such clients. (It would be nice if we allowed end users to hint whether their LB will be serving SSL traffic or not, for instance.)

I've been working around the above with a patched DataPlaneAPI build that:

  • Looks for the presence of 'check-ssl' to determine if health checks should be enabled (a fix @akutz put together)
  • Strips out the check-ssl flag (to avoid enforcing health checks over SSL)

Working with a patched DataPlaneAPI binary, while not a long term fix, allows using health checks in a specific way without requiring modifications to the client. In this case, the client is an operator in a running K8s cluster that won't be upgraded in the short term, even though that operator is where we probably want these kinds of policy decisions coming from in the long run. Still, doing this via DPAPI today gives us a 'break glass' approach to setting up health checks correctly. (it's a fairly simple patch to DataPlaneAPI, see the topic branch linked below, which contains the two fixes I mentioned above.)

Replacing DataplaneAPI binary in live environment

dataplaneapi/ $ go version    
go version go1.14 darwin/amd64

dataplaneapi/ $ GOOS=linux make build # you'll want GOOS if you're cross compiling (on a Mac, for instance)

dataplaneapi/ $ file build/dataplaneapi 
build/dataplaneapi: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, not stripped

  • scp the dataplane API to the running HAProxy instance
dataplaneapi/ $ scp ./build/dataplaneapi root@<HAProxy IP>:/root

SSH into the HAProxy instance for the rest of the steps-

  • Stop the existing dataplane API service
systemctl stop dataplaneapi
  • Back up the old binary, just in case
cp /usr/local/bin/dataplaneapi /root/dataplaneapi.bak
  • Copy the new binary in its place
cp dataplaneapi /usr/local/bin # This should be the new binary you just scp'd in. 
  • Restart dataplaneapi
systemctl start dataplaneapi

The new DataplaneAPI build should now be working. For reference, a backend server line written to the HAProxy configuration by the patched build described above looks like this:

  server <server name> 172.16.0.12:443 check no-check-ssl weight 100 verify none
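
A quick sanity check after the swap, reusing commands that already appear elsewhere on this page:

# Confirm the service came back up and the binary on disk reports the new version
systemctl status dataplaneapi --no-pager
/usr/local/bin/dataplaneapi --version

# Tail the service log for startup errors
journalctl -u dataplaneapi -n 50 --no-pager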

Enabling workload cluster never completes when the HAProxy password contains certain special characters

It was noted that v0.1.7 resolves a special-character issue with HAProxy; however, I was working with v0.1.8 and hit an issue where we set the HAProxy password to --200p@ssword, and when we attempted to enable workload management on the vSphere cluster, the operation never completed. We repeated these steps several times with the same result. When we changed the HAProxy password to one without special characters, enabling workload management on the vSphere cluster worked as expected.

Please verify that the password referenced above is acceptable, or provide guidance in the docs as to what constitutes an acceptable password.

Clear Text Credentials stored on VM post Install

If you look in /var/log/vmware at the file ovf-to-cloud-init.log, you will find the root user's username and password in clear text.

These credentials probably shouldn't be stored at all, and certainly not in clear text.

anyip-routes and route-tables services should be resilient to failures

If anyip-routes and route-tables crash for any reason, they currently stay dead because the unit files are not configured with any restart policy. We should configure these services with simple restart policies so they can recover from failures once a valid configuration becomes available later.

Here's one such example:

[Unit]
StartLimitInterval=300
StartLimitBurst=3
[Service]
Restart=always
RestartSec=30

The files that need to be changed are:
anyip-routes.service
route-tables.service
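
For illustration, one way to apply the restart policy above without modifying the unit files shipped in the image is a systemd drop-in per service; a minimal sketch using the values proposed above:

# Apply the same restart policy to both services via drop-in files,
# leaving the unit files shipped in the image untouched
for svc in anyip-routes route-tables; do
  sudo mkdir -p "/etc/systemd/system/${svc}.service.d"
  printf '[Unit]\nStartLimitInterval=300\nStartLimitBurst=3\n\n[Service]\nRestart=always\nRestartSec=30\n' |
    sudo tee "/etc/systemd/system/${svc}.service.d/10-restart.conf" >/dev/null
done
sudo systemctl daemon-reload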

Image does not function correctly when using IPv6 addresses

When provisioned with IPv6 addresses, the created VM boots but gets stuck while configuring the network interfaces. This eventually fails and the VM finishes booting; however, none of the settings are applied. For example, the network interfaces have no IP addresses, and logging in with the specified root password does not work.

For testing, I used the following settings:

loadbalance.dataplane_port: 5556
appliance.permit_root_login: True
network.hostname: tanzuproxy.example.com
network.frontend_ip: 2001:xxx:xxx:105::6/64
loadbalance.haproxy_user: admin
network.nameservers: 2001:xxx:xxx:102::5,2001:xxx:xxx:102::18
appliance.root_pwd: *******
loadbalance.service_ip_range: 2001:xxx:xxx:110::/64
loadbalance.haproxy_pwd: *******
network.management_ip: 2001:xxx:xxx:102::13/64
network.workload_ip: 2001:xxx:xxx:109::2/64
network.workload_gateway: 2001:xxx:xxx:109::1
network.frontend_gateway: 2001:xxx:xxx:105::1
network.management_gateway: 2001:xxx:xxx:102::1

Ensure link-scoped routes are deleted when route-tables.service is shut down

As part of d612463 we created link-scoped routes that send traffic out over the adjacent L2 network. This required a new routing rule to be added into route-tables.cfg here: d612463#diff-321d57ac6a7554b516a6365a1d4b53a72f2ffa8b3ba114aa3f6298b7ce2486fdR336

When the route-tables.service is restarted, we expect the routes to be cleared out and then re-added. Instead, it seems this route lingers inside the active route tables, preventing the service from deleting the route table and re-adding the routes later. We need to revisit this logic to either fix the bug or simply make the route deletion logic less brittle.

Another option is to move this entire configuration into systemd-networkd and have it program routes for us.

Static routes from frontend to "isolated" workload networks

version: HA Proxy Load Balancer API v0.1.10

I have a 3-NIC HAProxy setup:
NIC 1: Management (the default gateway is configured here)
NIC 2: Primary Workload
NIC 3: Frontend

I have a peculiar management network setup. My environment is set up such that the management network, where my ESXi hosts, vCenter, Supervisor management, and HAProxy management interfaces all reside, does not have a route to the workload networks. It's essentially an air-gapped management network.

My Tanzu cluster setup contains a primary workload network and two additional "isolated" workload networks. Traffic that enters the HAProxy frontend and is destined for backends on the primary workload network reaches those backends fine because HAProxy NIC 2 is directly connected.

However, the issue I run into is that when traffic enters the HAProxy frontend and is forwarded to destination backends located on the isolated workload networks, it is sent to the default gateway on the management interface, and that network cannot reach the secondary workload networks. I thought that by adding entries to route-tables.cfg for the isolated workload networks I would be able to configure static routes for the frontend network, but either this does not work the way I thought it would, or I am getting the syntax wrong.

In the end I was able to work around my issue by adding static routes into the Frontend network-scripts file (/etc/systemd/network/10-frontend.network).
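
For anyone hitting the same issue, the workaround amounts to appending a [Route] section per isolated network to that frontend network file; a sketch with made-up addresses (the destination subnet and gateway below are assumptions, not values from this environment):

# Append a static route for an isolated workload network to the frontend
# network file. Destination is the isolated workload subnet and Gateway is a
# router reachable from the frontend network; both values are examples only.
printf '\n[Route]\nDestination=192.168.50.0/24\nGateway=172.16.10.1\n' |
  sudo tee -a /etc/systemd/network/10-frontend.network >/dev/null
sudo systemctl restart systemd-networkd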

Can't SSH to get generated certificate to paste in vcenter

I selected the option to have certificates generated for me because I don't care; it's a lab.
I can get in via the VM console and see that the certificates I think I need are in the /etc/haproxy directory.
The problem is that vCenter 8 wants the certificate. I can see it, but I can't copy and paste from the VM console, so I need SSH.

I've verified that sshd.service is started, the config file looks good, and I disabled iptables.
I still get "Connection refused" to mgmt_ip:22.

Dataplane API 2.x and newer versions

The version of DP API that is built and ships with this image is 2.1, and the latest version of DP API is 2.8 (with HAProxy Enterprise using 2.7). We have a question as to whether a DP API 2.1 client is compatible with 2.7/2.8. We also tried updating the version of DP API in this repo to 2.7, but the binary fails to start due to a now-invalid config file.

Can someone from HAProxy please:

  • Answer the question about compatibility?
  • Help figure out what changes need to occur in the DP API config file to make it compatible with 2.7/2.8?

Thanks!

A route rule is added for each line in route-tables.cfg

We observed that route rules are added for each route in /etc/vmware/route-tables.cfg. See below:

root@haproxy [ ~ ]# ip rule show
0:	from all lookup local
32762:	from 172.16.10.2/24 lookup rtctl_frontend
32763:	from 172.16.10.2/24 lookup rtctl_frontend
32764:	from 192.168.1.2/16 lookup rtctl_workload
32765:	from 192.168.1.2/16 lookup rtctl_workload
32766:	from all lookup main
32767:	from all lookup default
root@haproxy [ ~ ]# cat /etc/vmware/
anyip-routes.cfg  route-tables.cfg
root@haproxy [ ~ ]# cat /etc/vmware/route-tables.cfg
...
2,workload,00:50:56:b8:10:00,192.168.1.2/16,192.168.1.1
2,workload,00:50:56:b8:10:00,192.168.1.2/16
3,frontend,00:50:56:b8:48:f1,172.16.10.2/24,172.16.10.1
3,frontend,00:50:56:b8:48:f1,172.16.10.2/24

Instead we should see only a single lookup for each network for each route table, which means it should look something like this:

root@haproxy [ ~ ]# ip rule show
0:	from all lookup local
32764:	from 172.16.10.2/24 lookup rtctl_frontend
32765:	from 192.168.1.2/16 lookup rtctl_workload
32766:	from all lookup main
32767:	from all lookup default

There's no harm from this bug per se, but it may be confusing to users.
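
If the duplicate entries are bothersome in the meantime, they can be removed by hand with ip rule del, using the rules shown above as an example; they will likely be re-created the next time the route-tables service runs:

# Delete one of the duplicated policy rules; run once per extra duplicate
sudo ip rule del from 172.16.10.2/24 table rtctl_frontend
sudo ip rule del from 192.168.1.2/16 table rtctl_workload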

Request to add traceroute packaged on the HAProxy appliance by default

Having traceroute and various other networking tools available would be helpful when troubleshooting why certain IPs are not reachable from HAProxy.

Specifically, traceroute and tcpdump would be handy, as neither is currently packaged, and HAProxy deployments may or may not have internet access to pull these packages post-deployment.
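
In the meantime, when an appliance does have outbound access to the Photon package repositories, the tools can be installed by hand; a sketch assuming the stock tdnf repos are configured:

# Install the missing troubleshooting tools from the Photon repositories
sudo tdnf install -y traceroute tcpdump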

Help with install

Where can I find better documentation? I'm having a hard time understanding how to get an OVA to install in my vSphere content library. Can you point me in the right direction?

The vSphere docs say to download an OVA file from here, but I don't see any OVA files in this repo, nor do I see one in the releases. Where and how can I get the OVA file?

Management Network IP Subnet mask ignored

After deploying a v0.1.10 HAProxy OVA, the subnet for the management network is wrong.
I deployed it using 10.42.1.31/24 as the management IP, but in the deployed VM, in /etc/systemd/network/10-management.conf, Address is set to 10.42.1.31 without a prefix length, which defaults to 10.0.0.0/8 as the network.
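
A possible manual correction until this is addressed, assuming the file path reported above: add the prefix length to the Address entry and restart systemd-networkd.

# Add the missing /24 prefix to the Address entry, then restart networking
sudo sed -i 's|^Address=10.42.1.31$|Address=10.42.1.31/24|' /etc/systemd/network/10-management.conf
sudo systemctl restart systemd-networkd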

Support for SSH keys as login method at deployment

Please add the ability to supply SSH public key(s) during deployment to allow passwordless login on day 1.
This is helpful for all automation needs and is also a nice gesture for any admin needing to log in to the appliance.
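
Until such an OVF parameter exists, one day-1 workaround from a host that can already authenticate with the root password might be the following (the key path and management IP below are placeholders):

# Copy a public key to the appliance so subsequent logins are passwordless
ssh-copy-id -i ~/.ssh/id_ed25519.pub root@<haproxy-mgmt-ip>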

some post ova deployment issues

I've just managed to deploy an OVA based on this project for a Tanzu trial, but had to fix a number of things, presumably because the code base is out of date.

In /usr/lib/python3.7/site-packages/cloudinit/distros/photon.py, line 17:
from cloudinit.net.network_state import mask_to_net_prefix

mask_to_net_prefix has been removed upstream; replacing it with ipv4_mask_to_net_prefix worked.
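
For reference, the rename described above can be applied in place with sed; a sketch assuming the path reported above:

# Back up, then rename the removed helper throughout cloud-init's Photon distro module
sudo cp /usr/lib/python3.7/site-packages/cloudinit/distros/photon.py{,.bak}
sudo sed -i 's/mask_to_net_prefix/ipv4_mask_to_net_prefix/g' \
  /usr/lib/python3.7/site-packages/cloudinit/distros/photon.py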

The OVF says that if you don't include a certificate, a self-signed one will be created, but that doesn't happen, so I had to import a certificate.

In /var/lib/vmware/routetablectl.sh, lines 152 and 200 rely on ip rule add ... lookup, which has been deprecated; replacing lookup with table seems to work.

Line 116:
while ip call "rule del from 0/0 to 0/0 table ${route_table_name} 2>/dev/null"; do true; done
should be
while call "ip rule del from 0/0 to 0/0 table ${route_table_name} 2>/dev/null"; do true; done

I also had to clear some duplicate lines from /etc/vmware/route-tables.cfg

terraform deployment

Hello! Has anyone tried deploying this with Terraform? I have been able to deploy into vSphere, but it looks like it is deploying with the "frontend" configuration option rather than the "default" one. After deploying, I run ifconfig on the command line and see a "frontend" NIC and a "workload" NIC. If I deploy manually through the vSphere GUI, everything works as expected. Here's the code I'm working with. I'm looking for help, please. =)

haproxy.tf

data "vsphere_datacenter" "dc" {
name = var.datacenter_name
}

data "vsphere_datastore" "datastore" {
name = var.datastore_name
datacenter_id = data.vsphere_datacenter.dc.id
}

data "vsphere_resource_pool" "pool" {
name = "${var.cluster_name}/Resources"
datacenter_id = data.vsphere_datacenter.dc.id
}

data "vsphere_network" "management" {
name = "Tanzu Management Network"
datacenter_id = data.vsphere_datacenter.dc.id
}

data "vsphere_network" "workload" {
name = "Tanzu Workload Network"
datacenter_id = data.vsphere_datacenter.dc.id
}

data "vsphere_host" "host" {
name = "vmhost"
datacenter_id = data.vsphere_datacenter.dc.id
}

resource "vsphere_virtual_machine" "tanzu-haproxy" {
name = var.vm_name
datacenter_id = data.vsphere_datacenter.dc.id
resource_pool_id = data.vsphere_resource_pool.pool.id
datastore_id = data.vsphere_datastore.datastore.id
host_system_id = data.vsphere_host.host.id
folder = "Tanzu"
wait_for_guest_net_timeout = 0
wait_for_guest_ip_timeout = 0
wait_for_guest_net_routable = false

ovf_deploy {
local_ovf_path = "./haproxy-v0.2.0.ova"
disk_provisioning = "thin"
ip_protocol = "IPV4"
ip_allocation_policy = "STATIC_MANUAL"
ovf_network_map = {
"management" = data.vsphere_network.management.id
"workload" = data.vsphere_network.workload.id
}
}

network_interface {
network_id = data.vsphere_network.management.id
}

network_interface {
network_id = data.vsphere_network.workload.id
}

vapp {
properties = {
"root_pwd" = "12345"
"permit_root_login" = "True"
"hostname" = "tanzu-haproxy"
"nameservers" = "10.1.1.224, 10.6.100.55"
"management_ip" = "10.6.15.40/27"
"management_gateway" = "10.6.15.33"
"workload_ip" = "172.28.201.135/25"
"workload_gateway" = "172.28.201.129"
"service_ip_range" = "172.28.201.208/28"
"dataplane_port" = "5556"
"haproxy_user" = "dataplane-api"
"haproxy_pwd" = "12345"
}
}

}

Support for custom NTP during deployment

Please add the ability to (optionally) specify custom NTP servers during OVF deployment via an additional parameter.
Currently, the appliance gets deployed with the default Photon settings, which fall back to the Google NTP servers as documented in KB 76088.

For environments where internet traffic is blocked by default, this is not feasible and requires manual effort to change the NTP settings as a day-2 operation after deployment.
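
Until a deployment parameter exists, the day-2 change on Photon amounts to pointing systemd-timesyncd at the desired servers; a sketch using a drop-in file (server names below are placeholders):

# Point systemd-timesyncd at internal NTP servers via a drop-in
sudo mkdir -p /etc/systemd/timesyncd.conf.d
printf '[Time]\nNTP=ntp1.example.com ntp2.example.com\n' |
  sudo tee /etc/systemd/timesyncd.conf.d/10-ntp.conf >/dev/null
sudo systemctl restart systemd-timesyncd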

vmware-haproxy for Tanzu Kubernetes Grid 1.4

We are trying to deploy Tanzu Kubernetes Grid 1.4 on vSphere 7.
Ques 1: Is this vmware-haproxy supported for TKG 1.4?
Ques 2: Can we deploy TKG 1.4 without vmware-haproxy?
Ques 3: Is there any way to customise the OS from Photon to RHEL for vmware-haproxy?
