swift-drive-autopilot

This service finds, formats and mounts Swift storage drives, usually from within a container on a Kubernetes host.

How it works

Swift expects its drives to be mounted at /srv/node/$id, where the $id identifier is referenced in the cluster's ring files. The usual method is to set $id equal to the device's name in /dev, e.g. /dev/sdc becomes /srv/node/sdc, but that mapping is too rigid for some situations.

swift-drive-autopilot establishes disk identity by examining a special file called swift-id in the root directory of the disk. In detail, it performs the following steps:

  1. enumerate all storage drives (using a configurable list of globs)

  2. (optional) create a LUKS encryption container on fresh devices, or unlock an existing one

  3. create an XFS filesystem on devices that do not have a filesystem yet

  4. mount each device below /run/swift-storage with a temporary name

  5. examine each device's swift-id file and, if it is present and unique, mount the device at /srv/node/$id

As a special case, disks with a swift-id of "spare" will not be mounted into /srv/node, but will be held back as spare disks.
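
For illustration, the manual equivalent of these steps for a single unencrypted disk looks roughly like the following sketch. The device name, the temporary mountpoint name and the swift-id value are placeholders; the autopilot's actual invocations may differ in detail.

mkfs.xfs /dev/sdc                          # step 3: only if there is no filesystem yet
mkdir -p /run/swift-storage/tmp1           # "tmp1" stands in for the temporary name
mount /dev/sdc /run/swift-storage/tmp1     # step 4: temporary mount
cat /run/swift-storage/tmp1/swift-id       # step 5: read the disk's identity, e.g. "swift1"
umount /run/swift-storage/tmp1
mkdir -p /srv/node/swift1
mount /dev/sdc /srv/node/swift1            # final mount at /srv/node/$id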

The autopilot then continues to run and will react to various types of events:

  1. A new device file appears. It will be decrypted and mounted (and formatted if necessary).

  2. A device file disappears. Any active mounts or mappings will be cleaned up. (This is especially helpful with hot-swappable hard drives.)

  3. The kernel log contains a line like error on /dev/sda. The offending device will be marked as unhealthy and unmounted from /srv/node. The other mappings and mounts are left intact for the administrator to inspect.

    This means that you do not need swift-drive-audit if you're using the autopilot.

  4. Mounts of managed devices disappear unexpectedly. The offending device will be marked as unhealthy (see previous point).

  5. After a failure of one of the active disks, an operator removes the failed disk, locates a spare disk and changes its swift-id to that of the failed disk. The autopilot will mount the new disk in the place of the old one.

Internally, events are collected by collector threads and handled by a single converger thread.
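
A minimal Go sketch of that collector/converger pattern follows; the event type and function names are made up for illustration and do not reflect the actual implementation.

package main

import (
    "fmt"
    "time"
)

// Event is anything the single converger reacts to. (Illustrative only.)
type Event interface{ Handle() }

type consistencyCheck struct{}

func (consistencyCheck) Handle() { fmt.Println("running consistency check") }

// scheduleConsistencyChecks is an example of a collector: it only produces
// events and never touches shared state itself.
func scheduleConsistencyChecks(events chan<- Event) {
    for range time.Tick(30 * time.Second) {
        events <- consistencyCheck{}
    }
}

func main() {
    events := make(chan Event, 16)
    go scheduleConsistencyChecks(events)
    // The single converger loop: all state changes are serialized here,
    // so collectors never race against each other.
    for ev := range events {
        ev.Handle()
    }
}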

Operational considerations

swift-drive-autopilot runs under the assumption that a few disks are better than no disks. If some operation relating to a single disk fails, the autopilot will log an error and keep going. This means that it is absolutely crucial that you have proper alerting in place for log messages with the ERROR label.

Installation

To build the binary:

make

The binary can also be installed with go get:

go get github.com/sapcc/swift-drive-autopilot

To build the Docker container: (Note that this requires a fairly recent Docker since a multi-stage build is used.)

docker build .

To run the integration tests: (Note that this actually runs the autopilot on your system and thus requires root or sudo for mounting, device-mapper etc.)

make check

Development setup

Please see HACKING.md.

Usage

Call it with a configuration file as its single argument. The configuration file is YAML; the following options are supported:

drives:
  - /dev/sd[a-z]
  - /dev/sd[a-z][a-z]

The only required field, drives, contains the paths of the Swift storage drives, as a list of shell globs.

As a special rule, the autopilot will ignore all drives that contain valid partition tables. This rule allows one to use a very general glob, like /dev/sd[a-z], without knowing the actual disk layout in advance. The system installation will usually reside on a partitioned disk (because of the need for special partitions such as the boot and swap partitions), so it will be ignored by the autopilot. Any other disks can be used for non-Swift purposes as long as they are partitioned into at least one partition.

For this reason, the two globs shown above will be appropriate for most systems of any size.

metrics-listen-address: ":9102"

If given, a Prometheus metrics endpoint is exposed on this listen address under the path /metrics. The following metrics are provided:

  • swift_drive_autopilot_events: counter for handled events (labeled by type, e.g. type=drive-added)

If Prometheus is used for alerting, it is useful to alert on rate(swift_drive_autopilot_events{type="consistency-check"}[5m]) dropping unexpectedly. Consistency-check events should occur twice per minute.
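
For example, a Prometheus alerting rule along these lines can detect a stalled autopilot; the alert name, threshold and durations are illustrative.

groups:
  - name: swift-drive-autopilot
    rules:
      - alert: SwiftDriveAutopilotStalled
        # Consistency checks normally happen twice per minute; alert if the
        # event counter stops increasing.
        expr: rate(swift_drive_autopilot_events{type="consistency-check"}[5m]) < 0.02
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "swift-drive-autopilot has stopped performing consistency checks"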

chroot: /coreos

If chroot is set, commands like cryptsetup/mkfs/mount will be executed inside the chroot. This allows the autopilot to use the host OS's utilities instead of those from the container.
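
With the setting above, the autopilot effectively wraps each command in chroot, roughly like this (device and mountpoint are placeholders):

chroot /coreos mount /dev/sdc /srv/node/swift1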

chown:
  user: "1000"
  group: "swift"

If chown is set, mountpoints below /srv/node and /var/cache/swift will be chown'ed to this user and/or group after mounting. Give the UID/GID or names of the Swift user and group here.

keys:
  - secret: "bzQoG5HN4onnEis5bhDmnYqqacoLNCSmDbFEAb3VDztmBtGobH"
  - secret: { fromEnv: ENVIRONMENT_VARIABLE }

If keys is set, automatic disk encryption handling is activated. LUKS containers on the drives will be decrypted automatically, and empty drives will be encrypted with LUKS before a filesystem is created.

When decrypting, each of the keys is tried until one works, but only the first one is used when creating new LUKS containers.

Currently, the secret will be used as encryption key directly. Other key derivation schemes may be supported in the future.

Instead of providing secret as plain text in the config file, you can use a special syntax (fromEnv) to read the respective encryption key from an exported environment variable.
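
For example, with the configuration above, the key would be supplied like this (the config file path is only an example):

export ENVIRONMENT_VARIABLE="bzQoG5HN4onnEis5bhDmnYqqacoLNCSmDbFEAb3VDztmBtGobH"
swift-drive-autopilot /etc/swift-drive-autopilot/config.yml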

swift-id-pool: [ "swift1", "swift2", "swift3", "swift4", "swift5", "swift6" ]

If swift-id-pool is set, when a new drive is formatted, it will be assigned an unused swift-id from this pool. This allows a new node to go from unformatted drives to a fully operational Swift drive setup without any human intervention.

Automatic assignment will only happen during the initial formatting (i.e. when no LUKS container or filesystem or active mount is found on the drive). Automatic assignment will not happen if there is any broken drive (since the autopilot cannot check the broken drive's swift-id, any automatic assignment could result in a duplicate swift-id).

IDs are assigned in the order in which they appear in the YAML file. With the configuration above, if there are only four drives, they will be assigned swift1 through swift4.

As a special case, the special ID spare may be given multiple times. The ordering still matters, so disks will be assigned or reserved as spare in the order that you wish. For example:

# This will always keep two spare disks.
swift-id-pool: [ "spare", "spare", "swift1", "swift2", "swift3", "swift4", "swift5", "swift6", ... ]

# Something like this will keep one spare disk per three active disks.
swift-id-pool: [ "swift1", "swift2", "swift3", "spare", "swift4", "swift5", "swift6", "spare", ... ]

Runtime interface

swift-drive-autopilot maintains the directory /run/swift-storage/state to store and advertise state information. (If a chroot is configured, then this path refers to a location inside the chroot.) Currently, the following files and directories are written:

  • /run/swift-storage/state/flag-ready is an empty file whose existence marks that the autopilot has handled each available drive at least once. This flag can be used to delay the startup of Swift services until storage is available.

  • /run/swift-storage/state/unmount-propagation is a directory containing a symlink for each drive that was unmounted by the autopilot. The intention of this mechanism is to propagate unmounting of broken drives to Swift services running in separate mount namespaces. For example, if the other service sees /run/swift-storage/state/unmount-propagation/foo, it shall unmount /srv/node/foo from its local mount namespace (a sketch of such a consumer loop follows this list).

    /run/swift-storage/state/unmount-propagation can be ignored unless you have Swift services running in multiple private mount namespaces, typically because of containers and because your orchestrator cannot set up shared or slave mount namespaces (e.g. Kubernetes). In plain Docker, pass /srv/node to the Swift service with the slave or shared option, and mounts/unmounts made by the autopilot will propagate automatically.

  • /run/swift-storage/broken is a directory containing symlinks to all drives deemed broken by the autopilot. When the autopilot finds a broken device, its log will explain why the device is considered broken, and how to reinstate the device into the cluster after resolving the issue.

  • /var/lib/swift-storage/broken has the same structure and semantics as /run/swift-storage/broken, but its contents are retained across reboots. A flag from /run/swift-storage/broken can be copied to /var/lib/swift-storage/broken to disable the device in a more durable way, once a disk hardware error has been confirmed.

    The durable broken flag can also be created manually using the command ln -s /dev/sd$LETTER /var/lib/swift-storage/broken/$SERIAL. The disk's serial number can be found using smartctl -d scsi -i /dev/sd$LETTER.

  • Since the autopilot also does the job of swift-drive-audit, it honors that tool's interface and writes /var/cache/swift/drive.recon. Drive errors detected by the autopilot will thus show up in swift-recon --driveaudit.
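
The following sketch shows how a Swift service container could consume the unmount-propagation directory; it is not part of the autopilot, and the polling interval is arbitrary.

#!/bin/sh
# Run inside the Swift service's mount namespace.
while sleep 30; do
  for link in /run/swift-storage/state/unmount-propagation/*; do
    [ -e "$link" ] || continue
    id="$(basename "$link")"
    if mountpoint -q "/srv/node/$id"; then
      umount "/srv/node/$id"
    fi
  done
done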

In Docker

When used as a container, supply the host's root filesystem as a bind-mount and set the chroot option to its mount point inside the container. Also, the container has to run in privileged mode to access the host's block devices and perform mounts in the root mount namespace:

$ cat > config.yml
drives:
  - /dev/sd[c-z]
chroot: /host
$ docker run --privileged --rm -v $PWD/config.yml:/config.yml -v /:/host sapcc/swift-drive-autopilot:latest /config.yml

Warning: The entire host filesystem must be passed in as a single bind mount. Otherwise, the autopilot will be unable to correctly detect the mount propagation mode.

In Kubernetes

You will probably want to run this as a daemonset with the nodeSelector matching your Swift storage nodes. As described for Docker above, make sure to mount the host's root filesystem into the container (with a hostPath volume) and run the container in privileged mode (by setting securityContext.privileged to true in the container spec).

Any other Swift containers should have access to the host's /run/swift-storage/state directory (using a hostPath volume) and wait for the file flag-ready to appear before starting up.
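
A trimmed-down DaemonSet along these lines can serve as a starting point; the resource names, image tag, node selector label and ConfigMap are placeholders, and the manifest assumes the image's entrypoint is the autopilot binary.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: swift-drive-autopilot
spec:
  selector:
    matchLabels: {app: swift-drive-autopilot}
  template:
    metadata:
      labels: {app: swift-drive-autopilot}
    spec:
      nodeSelector:
        node-role/swift-storage: "true"            # placeholder label
      containers:
        - name: autopilot
          image: sapcc/swift-drive-autopilot:latest
          args: ["/config.yml"]
          securityContext:
            privileged: true                        # required for mounts and device access
          volumeMounts:
            - {name: host, mountPath: /host}
            - {name: config, mountPath: /config.yml, subPath: config.yml}
      volumes:
        - name: host
          hostPath: {path: /}                       # the host's root filesystem
        - name: config
          configMap: {name: swift-drive-autopilot}  # must contain config.yml with chroot: /host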

swift-drive-autopilot's Issues

all devices being flagged as broken after restart

Upon restarting the autopilot in our staging environment, I encountered the following "interesting" behavior:

2016/11/28 12:22:13 INFO: event received: new device found: /dev/sdc
2016/11/28 12:22:13 INFO: discovered /dev/sdc to be mapped to /dev/mapper/81dde30aa3d9bf946fb0281f260b68e4 already
2016/11/28 12:22:13 INFO: discovered /dev/mapper/81dde30aa3d9bf946fb0281f260b68e4 to be mounted at /srv/node/swift-01 already
2016/11/28 12:22:13 INFO: event received: new device found: /dev/sdd
2016/11/28 12:22:13 INFO: discovered /dev/sdd to be mapped to /dev/mapper/dbc0b33565fc9c0784c8ed66ec7d123e already
2016/11/28 12:22:13 INFO: discovered /dev/mapper/dbc0b33565fc9c0784c8ed66ec7d123e to be mounted at /srv/node/swift-03 already
2016/11/28 12:22:13 INFO: event received: new device found: /dev/sde
2016/11/28 12:22:13 INFO: discovered /dev/sde to be mapped to /dev/mapper/c338434f254439c7cd0c8ba21a872405 already
2016/11/28 12:22:13 INFO: discovered /dev/mapper/c338434f254439c7cd0c8ba21a872405 to be mounted at /srv/node/swift-02 already
2016/11/28 12:22:43 INFO: event received: scheduled consistency check
2016/11/28 12:22:43 ERROR: expected /dev/mapper/81dde30aa3d9bf946fb0281f260b68e4 to be mounted at /srv/node/swift-01, but is not mounted anymore
2016/11/28 12:22:43 INFO: flagging /dev/sdc as broken because of previous error
2016/11/28 12:22:43 INFO: To reinstate this drive into the cluster, delete the symlink at /run/swift-storage/broken/81dde30aa3d9bf946fb0281f260b68e4
2016/11/28 12:22:43 ERROR: expected /dev/mapper/dbc0b33565fc9c0784c8ed66ec7d123e to be mounted at /srv/node/swift-03, but is not mounted anymore
2016/11/28 12:22:43 INFO: flagging /dev/sdd as broken because of previous error
2016/11/28 12:22:43 INFO: To reinstate this drive into the cluster, delete the symlink at /run/swift-storage/broken/dbc0b33565fc9c0784c8ed66ec7d123e
2016/11/28 12:22:43 ERROR: expected /dev/mapper/c338434f254439c7cd0c8ba21a872405 to be mounted at /srv/node/swift-02, but is not mounted anymore
2016/11/28 12:22:43 INFO: flagging /dev/sde as broken because of previous error
2016/11/28 12:22:43 INFO: To reinstate this drive into the cluster, delete the symlink at /run/swift-storage/broken/c338434f254439c7cd0c8ba21a872405
2016/11/28 12:23:13 INFO: event received: scheduled consistency check

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

Awaiting Schedule

These updates are awaiting their schedule.

  • Renovate: Update github.com/sapcc/go-bits digest to 98409d6

Open

These updates have all been created already.

Ignored or Blocked

These are blocked by an existing closed PR and will not be recreated.

Vulnerabilities

Renovate has not found any CVEs on osv.dev.

Detected dependencies

github-actions
.github/workflows/checks.yaml
  • actions/checkout v4
  • actions/setup-go v5
  • golangci/golangci-lint-action v6
  • golang/govulncheck-action v1
  • reviewdog/action-misspell v1
.github/workflows/ci.yaml
  • actions/checkout v4
  • actions/setup-go v5
  • actions/checkout v4
  • actions/setup-go v5
.github/workflows/codeql.yaml
  • actions/checkout v4
  • actions/setup-go v5
  • github/codeql-action v3
  • github/codeql-action v3
  • github/codeql-action v3
gomod
go.mod
  • go 1.23
  • github.com/prometheus/client_golang v1.20.0
  • github.com/sapcc/go-api-declarations v1.12.4
  • github.com/sapcc/go-bits v0.0.0-20240815085238-fce0691187a2@fce0691187a2
  • go.uber.org/automaxprocs v1.5.3
  • gopkg.in/yaml.v2 v2.4.0


next level of autonomy

Augment the autopilots running on each storage node with a central controller (probably implemented as a Kubernetes operator) to enable automation of additional workflows.

Scenarios

  • error handling
    • When a drive error is observed (e.g. in kernel log), unmount that drive.
    • When it is a repairable error, perform the repair, e.g. with xfs_repair.
    • The user can then cordon the drive, which will remove the swift-id assignment from the drive and (if available) assign a spare disk. Cordons shall be more long-lived than the "broken" flags, and shall particularly survive node reboots.
    • A drive may also be cordoned automatically when hardware metrics indicate drive failure.
  • mount propagation
    • When the set of active mounts changes, restart Swift pods consuming these mounts automagically.
    • When the autopilot cannot unmount or unmap a drive because of a lingering mount in a different namespace, reboot the node to get rid of the offending mount.
  • ring propagation
    • the operator keeps the current rings and is able to propagate them to the Swift server processes, including a coordinated restart if needed
  • ring building - optional
    • The controller tracks all drives, assigns weight to them, and builds the Swift rings.
      • The user can check the rebalance output and might issue a swift-ring-builder <ring> dispersion
      • The user must approve the ring before rollout
    • The user can change weight of the drives and build a new ring, which the controller then distributes to all consuming pods.

Cross concerns

  • resilience
    • There should not be too many operations (drive reassignments, rebalancing, rebooting) going on at once to avoid long replication times and/or data loss.
  • serviceability/discoverability
    • There should be a user-friendly way to enumerate storage nodes and drives, inspect their status (e.g. list broken drives) and trigger operations (cordon/uncordon, change weight, rebalance).

LUKS support

  • when a drive is recognized as a LUKS container, open it and proceed with the /dev/mapper device as normal
  • when enabled, create LUKS containers automatically on unformatted drives
  • plan ahead for key rotation

Encryption keys should be derived from the drive UUID and a master key using a cryptographic key derivation function. Alternatively, an initial solution could use the same master key for all volumes.

drive errors are not recognized when device file is removed by kernel

We just saw this in prod:

$ # journalctl -k | grep sdaa
Dec 07 18:50:45 localhost kernel: sd 0:0:74:0: [sdaa] 1953506646 4096-byte logical blocks: (8.00 TB/7.28 TiB)
Dec 07 18:50:45 localhost kernel: sd 0:0:74:0: [sdaa] Write Protect is off
Dec 07 18:50:45 localhost kernel: sd 0:0:74:0: [sdaa] Mode Sense: f7 00 10 08
Dec 07 18:50:45 localhost kernel: sd 0:0:74:0: [sdaa] Write cache: disabled, read cache: enabled, supports DPO and FUA
Dec 07 18:50:45 localhost kernel: sd 0:0:74:0: [sdaa] Attached SCSI disk
Dec 22 14:00:30 storage5 kernel: sd 0:0:74:0: [sdaa] tag#32 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Dec 22 14:00:30 storage5 kernel: sd 0:0:74:0: [sdaa] tag#32 CDB: Write(10) 2a 00 00 00 02 0c 00 00 04 00
Dec 22 14:00:30 storage5 kernel: print_req_error: I/O error, dev sdaa, sector 4192
Dec 22 14:00:30 storage5 kernel: sd 0:0:74:0: [sdaa] tag#66 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Dec 22 14:00:30 storage5 kernel: sd 0:0:74:0: [sdaa] tag#66 CDB: Write(10) 2a 00 40 01 a1 20 00 00 01 00
Dec 22 14:00:30 storage5 kernel: print_req_error: I/O error, dev sdaa, sector 8590788864

However, no drive error was reported by the autopilot, because the kernel removed /dev/sdaa and the autopilot thus treated the incident like the manual removal of a disk.

LUKS: automatic rekeying

When a LUKS container is opened with an old key (i.e., not the first one in the config), automatically add the new key (i.e., the first one in the config) and remove the old one.

Alternatively, add a deprecated flag on keys, so that rekeying only happens when the unlocking key is deprecated: true.
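
Done manually with cryptsetup, the rekeying that this issue asks to automate looks roughly like this (the key file names are placeholders; the autopilot would use the keys from its config):

# Add the new key to the LUKS header, authenticating with the old key ...
cryptsetup luksAddKey --key-file old.key /dev/sdc new.key
# ... then remove the old key.
cryptsetup luksRemoveKey /dev/sdc old.key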

unmount disk on metadata corruption error

We found this in a kernel log on prod today:

[Wed Nov 15 16:00:36 2017] XFS (sdf): Metadata corruption detected at xfs_dir3_block_read_verify+0x5e/0x110 [xfs], block 0x1000e5d88
[Wed Nov 15 16:00:36 2017] XFS (sdf): Unmount and run xfs_repair
[Wed Nov 15 16:00:36 2017] XFS (sdf): First 64 bytes of corrupted metadata buffer:

That looks like something the autopilot should detect. (It currently doesn't because it only looks for the word "error".)
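
A broader kernel-log pattern could look roughly like this Go sketch; the regular expression and names are only an illustration, not the autopilot's actual matching logic.

package main

import (
    "fmt"
    "regexp"
)

// Illustrative pattern: treat XFS corruption messages as drive errors too,
// not just lines containing the word "error".
var driveErrorRx = regexp.MustCompile(
    `(?i)(error on /dev/sd[a-z]+|XFS \((sd[a-z]+)\): (Metadata corruption|Corruption) detected)`)

func main() {
    line := "XFS (sdf): Metadata corruption detected at xfs_dir3_block_read_verify+0x5e/0x110 [xfs]"
    if m := driveErrorRx.FindStringSubmatch(line); m != nil {
        fmt.Println("drive error, device:", m[2]) // prints: drive error, device: sdf
    }
}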

send statsd metrics

  • counter for errors
  • gauge for number of broken drives

Even though we use Prometheus in our system, I will explicitly not expose a metrics endpoint from the autopilot itself, but have it send metrics to a statsd exporter instead. The autopilot is quite a privileged application, and I will not have it expose open TCP ports if I can help it.

assign swift-ids automatically from a pool

Add a new configuration option, e.g.

swift-id-pool: [ swift1, swift2, swift3 ]

When a new device is formatted and mounted for the first time, try to choose an unused swift-id from the pool and assign it automatically.

Safety measures:

  • Do not choose swift-ids when some other devices could not be mounted cleanly (we could be reassigning their ID, thus creating a duplicate assignment).
  • Only choose the swift-id when formatting a device for the first time, not for existing file systems (this could create confusion when some other process or user deletes the swift-id in error and the autopilot proceeds to assign a different ID to the drive).

LUKS: derive a separate key per device

Instead of using the same key for all devices, derive an individual device key using e.g. HMAC-SHA256 or PBKDF2, with some string identifying the device and machine as payload.
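
A per-device key could be derived roughly like the following Go sketch; which identifier to use as the HMAC payload (serial number, UUID, hostname) is exactly the open question of this issue.

package main

import (
    "crypto/hmac"
    "crypto/sha256"
    "encoding/hex"
    "fmt"
)

// deriveDeviceKey derives an individual key from the master key and a string
// identifying the device and machine. Sketch only.
func deriveDeviceKey(masterKey []byte, deviceID string) string {
    mac := hmac.New(sha256.New, masterKey)
    mac.Write([]byte(deviceID))
    return hex.EncodeToString(mac.Sum(nil))
}

func main() {
    fmt.Println(deriveDeviceKey([]byte("master-secret"), "storage5:S4D17XYS0000K705ATH0"))
}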

Support for spare drives

Use case: Keep a few drives as spare to have a quick replacement when a productive disk fails.

Implementation:

  • When a drive has the swift-id spare, do not mount it into /srv/node.
  • When an operator changes the swift-id on a spare disk, pick this up during the next healthcheck run and mount the disk to /srv/node with the new swift-id. (This should work already, but a test will be useful.)
  • Allow spare in the swift-id-pool. When there are multiple instances of spare in the pool, assign that many spare disks (while still respecting the pool entry order).

crash when device file disappears while LUKS container is open

When this happens:

$ sudo cryptsetup status S4D17XYS0000K705ATH0
/dev/mapper/S4D17XYS0000K705ATH0 is active and is in use.
  type:    n/a
  cipher:  aes-xts-plain64
  keysize: 256 bits
  device:  (null)
  offset:  4096 sectors
  size:    11721041072 sectors
  mode:    read/write

Then s-d-a gets confused by device: (null) and crashes when trying to look at the device /dev/(null). To reproduce,

echo offline > /sys/block/sdc/device/state
echo 1 > /sys/block/sdc/device/delete

report explicit error when no drives found

When the drive globs don't match anything, the autopilot just sits there doing nothing. It should at least report INFO: no drives matching globs or something like that.

chown /srv/node mountpoints to Swift user

When the drives are formatted and mounted for the first time, their root directory should be chowned to the Swift user and group. The UID and GID should be configurable because the Swift user might not exist in the container where swift-storage-boot runs.

auto-format new drives

When a drive without a valid file system is encountered, automatically format it. (Filesystems can be detected with file(1).)
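
For example, either of the following can tell whether a filesystem is already present (blkid is shown as an alternative; the issue only mentions file(1)):

# Prints e.g. "SGI XFS filesystem data ..." when a filesystem exists,
# or just "data" for an unformatted device.
file -s /dev/sdc

# blkid prints nothing and exits non-zero if no filesystem signature is found.
blkid /dev/sdc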

sfdisk -l can fail with return code 1 and error No medium found

So far we assumed that sfdisk -l on devices which are not readable will print nothing to stdout and return successful, see https://github.com/sapcc/swift-drive-autopilot/blob/master/collectors.go#L137-L141.

Now we have several regions where that behaves differently: sfdisk -l fails with exit code 1 and prints sfdisk: cannot open <device>: No medium found to stderr.
That causes swift-drive-autopilot to exit and crash-loop, and the dependent services cannot start because the ready flag is never written.

The intended behaviour of simply ignoring such devices should also apply to this failure mode, so stderr may need to be evaluated.

Example log output

2018/02/03 21:11:41 Output from sfdisk: sfdisk: cannot open /dev/sdo: No medium found
2018/02/03 21:11:41 FATAL: exec(chroot /coreos sfdisk -l /dev/sdo) failed: exit status 1
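
One way to implement that tolerance is to inspect stderr before treating a non-zero exit status as fatal. A Go sketch follows; the helper name and surrounding code are hypothetical, not the autopilot's actual functions.

package main

import (
    "fmt"
    "os/exec"
    "strings"
)

// listPartitionTable runs `sfdisk -l` on the device and returns its output.
// A "No medium found" failure is treated like an empty result instead of
// aborting the whole process. Hypothetical helper, sketch only.
func listPartitionTable(devicePath string) (string, error) {
    cmd := exec.Command("sfdisk", "-l", devicePath)
    var stderr strings.Builder
    cmd.Stderr = &stderr
    out, err := cmd.Output()
    if err != nil && strings.Contains(stderr.String(), "No medium found") {
        return "", nil // ignore this device, but keep running
    }
    return string(out), err
}

func main() {
    out, err := listPartitionTable("/dev/sdo")
    fmt.Printf("output=%q err=%v\n", out, err)
}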
