The agent from stackabletech

Agent should be more helpful when it isn't allowed to interact with systemd

Currently, when the agent is not allowed to start, stop, or otherwise manage a service, we get an error message like "Interactive authentication required." (or similar, depending on the OS).

It would be great if we could check for common causes here and suggest a fix in the log message.
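
A minimal sketch of how such hints could be attached to the error text. The function names, the matched patterns, and the hint wording are assumptions for illustration only, not existing agent code:

```rust
/// Returns a hint for a systemd/D-Bus error message if it matches a known
/// cause. Patterns and hint texts are illustrative assumptions.
fn hint_for_systemd_error(message: &str) -> Option<&'static str> {
    let lower = message.to_lowercase();
    if lower.contains("interactive authentication required") {
        Some("the agent's user may lack permission to manage units; \
              run the agent as root or grant access via a polkit rule")
    } else if lower.contains("access denied") || lower.contains("permission denied") {
        Some("check whether the agent's user is allowed to talk to the system bus")
    } else {
        None
    }
}

/// Builds the log line: the original error, plus a hint when one is known.
fn describe_systemd_error(message: &str) -> String {
    match hint_for_systemd_error(message) {
        Some(hint) => format!("{} (hint: {})", message, hint),
        None => message.to_string(),
    }
}
```

Unknown messages are passed through unchanged, so the original error is never hidden.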

Ability to update systemd unit files

Implement the ability to check whether a systemd unit file needs an update based on upstream (orchestrator) changes.
Change the unit file based on these changes.

Agent should check on startup if important directories are readable/writable

When the data directory is not writeable the Agent will exit with a weird panic from the bowels of Krustlet because the plugins dir cannot be created. To avoid this we should check if the data directory exists and is writeable so we can provide a nicer error.

The same might be nice for the log dir and maybe others? Not sure where it makes sense.
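
A rough sketch of such a startup check using only the standard library. The function name and the probe file name are hypothetical; the idea is to probe by actually creating a file rather than inspecting permission bits:

```rust
use std::fs;
use std::path::Path;

/// Checks that `dir` exists, is a directory, and accepts new files.
/// Returns a human-readable error message otherwise.
fn check_dir_writable(dir: &Path) -> Result<(), String> {
    let meta = fs::metadata(dir)
        .map_err(|e| format!("directory {} is not accessible: {}", dir.display(), e))?;
    if !meta.is_dir() {
        return Err(format!("{} exists but is not a directory", dir.display()));
    }
    // Probe by creating and removing a file; the file name is an assumption.
    let probe = dir.join(format!(".agent-write-probe-{}", std::process::id()));
    fs::File::create(&probe)
        .map_err(|e| format!("directory {} is not writable: {}", dir.display(), e))?;
    fs::remove_file(&probe)
        .map_err(|e| format!("could not remove probe file {}: {}", probe.display(), e))?;
    Ok(())
}
```

Running this for the data directory (and possibly the log directory) before handing control to Krustlet would let the agent exit with a readable message instead of a panic.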

Ability to remove systemd units

Implement functionality to cleanly remove a systemd unit

stop unit
delete unit

Ability to perform daemon-reload

Whenever any unit file was changed systemd needs to reload information from the unit files. This should be transparently triggered after any relevant changes.

Add License

Allow running services as non-root

Add ability to specify extra taints for the node

Log dir should point at /var/log by default

Currently it points to /opt/stackable which is non-standard and may be surprising

Ability to set file permissions on config files

RUSTSEC-2020-0053: dirs is unmaintained, use dirs-next instead

dirs is unmaintained, use dirs-next instead

Details
Status	unmaintained
Package	`dirs`
Version	`3.0.1`
URL	https://github.com/dirs-dev/dirs-rs
Date	2020-10-16

The dirs crate is not maintained any more;
use dirs-next instead.

See advisory page for additional details.

Process Management using systemd

One of the core components of the Agent will be to manage processes.
We decided to use systemd as the tool doing the actual supervision as it's available on all our target systems (Debian, Ubuntu, RHEL).

The Agent needs a way to manage systemd units, and this ticket is about implementing this functionality.

Every time a process starts it'll need configuration data on disk which needs to be materialized from configuration in the Operator/Orchestrator resources.
To make debugging easier we'd like to have a new uniquely named directory in a configurable location (e.g. /var/run/stackable/...) which has all the files a process needs.
An open question is whether restarts due to failure reuse an existing configuration directory or not.

This is the epic for this task and these are the dependencies:

Open Issues/questions:

Log file target (stdout/stderr)
Restart/failure behavior
Environment variable
Start on boot behavior

Investigate existing crates that deal with systemd and see if we can use one or more of these for the functionality we need

Investigate existing crates that deal with systemd and see if we can use one or more of these for the functionality we need

Document mapping of pod data structure to systemd service files

We need to define which pod properties we initially want to support and how we want to map those to our systemd unit files.

I'll just start with my initial ideas in this ticket, if we find out that it becomes too unwieldy we'll move the discussion somewhere else.

Once we are agreed, I'll pull the result out into an ADR.

The following shows fields I've looked at so far and where in the systemd file I'd extract them, but I am sure that list is far from complete.

pod:
  metadata:
    name: -> name
  spec:
    containers: <we currently only allow one container per pod>
      image: <used by virtual kubelet for downloading package>
      env: Environment=...
      command: ExecStart
      args: ExecStart
      name: <unused, taken from meta.name>
      volumeMounts: <used by virtual kubelet to set up machine>
      workingDir: WorkingDirectory
    initContainers: <we could either implement these as extra services linked by a Before= statement, or build ExecStartPre commands from them, not sure which makes more sense>
    restartPolicy: Restart
    terminationGracePeriodSeconds: TimeoutStopSec  <systemd also offers TimeoutStartSec but I could not find a matching pod field, should we reuse this one for both?>

There are a few assumptions in there that might be worth discussing. We currently restrict pods to only one container, as the idea was to just create multiple pods if you need more. Do we want to change that and allow multiple containers? What would be the benefits and downsides?

I am unsure how to treat init containers. I don't think it is relevant at this stage, but it might be worth having a quick look just to make sure we don't burn any bridges. There are two options. We can implement these as one-shot services that are required before our main service starts. That way systemd should run them once, before trying to start our main service (need to investigate the full implications of this).
Alternatively, we can create ExecStartPre commands from these fields, which systemd would run once, before starting the main service.

Both have pros and cons, I guess. Does anyone have a preference off the top of their head?

Also, we should probably at least take the user to run this as from the PodSecurityContext, but that opens up an entire can of worms that I am not sure we are ready to deal with just yet.. thoughts?
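
To make the proposed mapping a bit more concrete, here is a small sketch for two of the fields, restartPolicy and terminationGracePeriodSeconds. The type and function names are hypothetical, and the choice of systemd values ("always", "on-failure", "no") is an assumption about how the policies would line up, not an agreed result:

```rust
/// Pod restart policies as defined by Kubernetes.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RestartPolicy {
    Always,
    OnFailure,
    Never,
}

/// Assumed mapping from a pod restart policy to a value for systemd's Restart=.
fn systemd_restart_value(policy: RestartPolicy) -> &'static str {
    match policy {
        RestartPolicy::Always => "always",
        RestartPolicy::OnFailure => "on-failure",
        RestartPolicy::Never => "no",
    }
}

/// Builds the [Service] lines derived from the two pod fields.
/// TimeoutStopSec= is only written when a grace period is set on the pod.
fn service_lines(policy: RestartPolicy, grace_period_seconds: Option<u64>) -> Vec<String> {
    let mut lines = vec![format!("Restart={}", systemd_restart_value(policy))];
    if let Some(secs) = grace_period_seconds {
        lines.push(format!("TimeoutStopSec={}", secs));
    }
    lines
}
```

For example, `service_lines(RestartPolicy::OnFailure, Some(30))` yields `Restart=on-failure` and `TimeoutStopSec=30`.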

Use dbus code generation feature instead of manually implementing dbus code

The Rust dbus crate offers the ability to generate code from an introspected D-Bus connection.
This could potentially make our code more robust and we should look into using this instead of the current "manual" code.

Add configurable liveness probe interval for systemd units

Currently the agent checks whether services are still up once every 10 seconds. We could make that configurable via the podspec, for example via https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-liveness-command
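
A tiny sketch of how the interval could be derived, assuming the value comes from the probe's periodSeconds field and that the current 10 seconds stay as the fallback. The constant and function names are made up for illustration:

```rust
use std::time::Duration;

/// Fallback matching the interval the agent currently uses.
const DEFAULT_CHECK_INTERVAL: Duration = Duration::from_secs(10);

/// Returns the check interval for a service. `period_seconds` is assumed to
/// be taken from the pod's liveness probe; missing or non-positive values
/// fall back to the default.
fn check_interval(period_seconds: Option<i32>) -> Duration {
    match period_seconds {
        Some(secs) if secs > 0 => Duration::from_secs(secs as u64),
        _ => DEFAULT_CHECK_INTERVAL,
    }
}
```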

Investigate whether we can put our unit files in a separate location from the standard unit files

Add ability to restrict possible mount paths

Pods can specify absolute paths for mounting config maps into (zookeeper needs this for example).
It would be good to be able to restrict this so that only specific directories can be targeted with this.
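
One possible shape for such a check, as a sketch. The function name and the idea of an allow-list of root directories are assumptions; paths containing `..` are rejected outright because the comparison is purely lexical:

```rust
use std::path::{Component, Path};

/// Returns true if `target` is an absolute path located below one of the
/// allowed root directories. No symlinks are resolved here.
fn is_mount_path_allowed(target: &Path, allowed_roots: &[&Path]) -> bool {
    if !target.is_absolute() {
        return false;
    }
    // Reject ".." so a path cannot climb out of an allowed root.
    if target.components().any(|c| matches!(c, Component::ParentDir)) {
        return false;
    }
    // Path::starts_with compares whole components, so "/etc/zookeeper-x"
    // does not count as being below "/etc/zookeeper".
    allowed_roots.iter().any(|root| target.starts_with(root))
}
```

The allowed roots would presumably come from the agent configuration; an empty list rejects every mount path.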

Refactor usage of Krustlet config to allow using command line parameters to configure agent

Currently all command line parameters are passed to the Krustlet config code as well, which causes an exception if any parameters are present that Krustlet does not understand. Conversely, any command line parameters which the agent doesn't understand cause an exception in the agent.

The combination of these two means that it is currently not possible to pass anything on the command line (except maybe version or help), because for any parameter one of the two components will fail.

This should be fixed by manually constructing a Krustlet config object instead of using the methods that parse the command line, but this requires some boilerplate and re-implementation of default values.

Ability to replace path variables in generated config

When downloading configmaps to local config files the agent needs to replace the following values:

package path where the package has been downloaded to
config path where the config files will reside
log directory, where the application should place its logs
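
A minimal sketch of the replacement step. It assumes the placeholders are written as `{{packageroot}}`, `{{configroot}}` and `{{logroot}}`; the function name and parameters are hypothetical:

```rust
/// Replaces the three path placeholders in a config file's content.
/// The `{{...}}` placeholder syntax is an assumption of this sketch.
fn render_template(input: &str, package_root: &str, config_root: &str, log_root: &str) -> String {
    input
        .replace("{{packageroot}}", package_root)
        .replace("{{configroot}}", config_root)
        .replace("{{logroot}}", log_root)
}
```

For example, `render_template("{{configroot}}/conf", "/opt/pkg", "/etc/cfg", "/var/log/app")` returns `/etc/cfg/conf`. Text without placeholders is returned unchanged.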

Allow configuration of Pod CIDR Range

We don't use it but Krustlet defaults to use one.
We need to disable it to not cause issues:

Ability to download and unpack packages

When the agent is assigned a process, it should download and extract the specified package.

Ability to create systemd units

Implement functionality to create systemd unit files based on the description data structure defined in #3

RUSTSEC-2020-0016: `net2` crate has been deprecated; use `socket2` instead

net2 crate has been deprecated; use socket2 instead

Details
Status	unmaintained
Package	`net2`
Version	`0.2.37`
URL	deprecrated/net2-rs@`3350e38`
Date	2020-05-01

The net2 crate has been deprecated
and users are encouraged to consider socket2 instead.

See advisory page for additional details.

Make generation of systemd unit files deterministic

Currently the structure of systemd files that are generated is not determined by any specific set of rules, which means that the ordering of entries may change if for example the order that environment variables are specified in the pod is changed.

We need to come up with a set of rules to apply during generation of these files to ensure that only actual changes to the unit file trigger a rewrite.

Something along the lines of:

write known sections in this order (unit, service, install)
write rest of sections in alphabetical order
sort lines within sections alphabetically
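
The three rules above could be sketched as follows. The data structure (a map from section name to its lines) and the function name are assumptions, and the section names are written in systemd's capitalized form:

```rust
use std::collections::BTreeMap;

/// Sections that are always written first, in this order.
const KNOWN_SECTIONS: [&str; 3] = ["Unit", "Service", "Install"];

/// Renders a unit file so that equal content always yields equal text:
/// known sections first, remaining sections alphabetically, and the lines
/// within each section sorted.
fn render_unit(sections: &BTreeMap<String, Vec<String>>) -> String {
    let mut order: Vec<&str> = Vec::new();
    for name in KNOWN_SECTIONS.iter() {
        if sections.contains_key(*name) {
            order.push(*name);
        }
    }
    // BTreeMap iterates its keys in sorted order already.
    for name in sections.keys() {
        if !KNOWN_SECTIONS.iter().any(|known| *known == name.as_str()) {
            order.push(name.as_str());
        }
    }

    let mut out = String::new();
    for name in order {
        let mut lines = sections[name].clone();
        lines.sort();
        out.push_str(&format!("[{}]\n", name));
        for line in &lines {
            out.push_str(line);
            out.push('\n');
        }
        out.push('\n');
    }
    out
}
```

With this, reordering environment variables in the pod no longer changes the generated file. One caveat: systemd runs multiple ExecStartPre= lines in the order given, so directives like that would need to be exempt from the sorting rule.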

Thread seems to panic when pod moves to terminating state before the service was successfully started

Implement systemd unit monitoring

Receive and process notifications from systemd about process state changes (e.g. a process dying or restarting) - this will probably happen via D-Bus

Build and deploy nightly packages

We want to build and deploy binary packages in RPM & DEB format at least every day, potentially as part of every build.

We'll target at least CentOS/RHEL 7 for now, potentially 8.
For Debian, I'm unsure which versions make sense as I'm not too familiar with it.

Update Krustlet dependency to 0.6.0

There have been many changes in upstream Krustlet, and we should update our code to work with the new version that was just released.

Evaluate whether systemk solves all our needs and we should rather pool resources on that project instead of continuing with our own agent

@miekg and @pires have created systemk, a virtual kubelet implementation with goals very similar to those of our project.

We should have a very close look at that project and evaluate if there are any functional benefits to moving on with our own project or if we should rather use and contribute to systemk instead.

Support running of services as application users

Currently all our managed processes (e.g. ZooKeeper) are started as root.
We'd like to support also running all our services as non-root.

We'd like to let the user choose the username the services should run as.
So the CRDs need to be extended (but that's for operator specific issues).

We will follow what systemk does, which means that the Agent will need to look in each Pod for the desired username.

This is the field where the name can be found: pod.securityContext.windowsOptions.runAsUserName
Note: There is also a pod.securityContext.runAsUser field but that only takes an integer which is not enough for us.

The agent needs to be extended to read the aforementioned property
The username then needs to be propagated to the systemd unit
Optional/Bonus: If the user does not exist it can be created automatically
- If this is implemented it should be an optional feature, the Agent should have a configuration option disabling auto-creation of users
When the agent creates directories for the services (via Volumes) they need to be owned by the same user
- This might also be made configurable later but that's for another issue

Note: This is not about running the Agent itself as non-root!

RUSTSEC-2018-0015: term is looking for a new maintainer

term is looking for a new maintainer

Details
Status	unmaintained
Package	`term`
Version	`0.4.6`
URL	Stebalien/term#93
Date	2018-11-19

The author of the term crate does not have time to maintain it and is looking
for a new maintainer.

Some maintained alternatives you can potentially switch to instead, depending
on your needs:

See advisory page for additional details.

Implement start/stop/restart for systemd units

RUSTSEC-2020-0056: stdweb is unmaintained

stdweb is unmaintained

Details
Status	unmaintained
Package	`stdweb`
Version	`0.4.20`
URL	koute/stdweb#403
Date	2020-05-04

The author of the stdweb crate is unresponsive.

Maintained alternatives:

See advisory page for additional details.

Implement graceful handling of agent restart

Once we have implemented systemd for running services in #5 , we need to look at identifying agent restarts and gracefully handling that.

Basically the test upon startup needs to be

check if config directory for current pod version exists
if yes, assume service should be running and check state
if not: start service with current config
Remove current Krustlet behavior to evict (=delete) pods upon shutdown
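
The startup test described in the steps above could look roughly like this. The enum, the function name, and the assumption that each pod version has its own config directory on disk are all illustrative:

```rust
use std::path::Path;

/// What the agent should do for a pod after an agent restart.
#[derive(Debug, PartialEq)]
enum StartupAction {
    /// Config directory exists: assume the service should be running and
    /// check its state in systemd.
    CheckServiceState,
    /// No config directory yet: start the service with the current config.
    StartService,
}

/// Decides the startup action from the presence of the config directory
/// for the current pod version.
fn startup_action(pod_config_dir: &Path) -> StartupAction {
    if pod_config_dir.is_dir() {
        StartupAction::CheckServiceState
    } else {
        StartupAction::StartService
    }
}
```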

Set up Github Actions to automatically keep task references in issues in sync

This helps so we can have "epics" with a master issue that links to all the others it depends on.
There is a Github Action https://github.com/jonabc/sync-task-issues which can be used to keep things in sync.

Once this is done open a similar issue and PR at the orchestrator and common repositories.

Investigate pod deletion happening before processes are stopped

@maltesander has noticed that when he restarts his Spark processes, they quite often need to find a new port because the old process is still up and running.
We need to look at adding finalizers to pods and maybe have a "shutting down" state that allows to wait for the process to stop.

Ability to replace values beyond just the path objects in the containers command property

Crash if no tag is provided for the container image

The agent crashes if no tag is provided for the container image.

Given the following kafka.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: kafka
spec:
  containers:
  - name: kafka
    image: kafka
  tolerations:
  - key: "kubernetes.io/arch"
    operator: "Equal"
    value: "stackable-linux"

When the pod is deployed:

$ kubectl apply -f kafka.yaml

Then the agent crashes with the following log output:

thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', src/provider/repository/package.rs:42:35
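
A sketch of how the image reference could be split without unwrap, so that a missing tag becomes a regular error. This is not the code in package.rs; the function name and the error text are assumptions:

```rust
/// Splits an image reference of the form `<name>:<tag>` into its parts.
/// Returns an error instead of panicking when no tag is present.
fn split_image(image: &str) -> Result<(&str, &str), String> {
    match image.rsplit_once(':') {
        // A '/' after the last ':' means the colon belongs to a registry
        // port (e.g. "registry:5000/kafka"), not to a tag.
        Some((name, tag)) if !name.is_empty() && !tag.is_empty() && !tag.contains('/') => {
            Ok((name, tag))
        }
        _ => Err(format!(
            "image '{}' has no tag; expected the form <name>:<tag>",
            image
        )),
    }
}
```

With the pod above, `split_image("kafka")` returns an error that can be reported on the pod, while `split_image("spark:3.0.1")` yields the name `spark` and the tag `3.0.1`.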

Ability to add init containers to pods

Add functionality to tail logs via kubectl (or similar tools)

It would be brilliant to support looking at the logs of our services via kubectl logs

The code to send logs is present and could be used, I have successfully sent "test" messages to kubectl logs, however some extra work needs to be done to read the actual logs from the systemd journal.

I have found two crates that seem to support this functionality, but both have issues if I am not mistaken:

https://gitlab.com/systemd.rs/sd-journal -> AGPL
https://github.com/jmesmon/rust-systemd -> unsafe code

I believe this is a medium-low hanging fruit if we can find a crate that offers this functionality, as most of the supporting code could be reused from Krustlet.

Add a README

To explain what this project is all about and how it can be used

Build and publish RPM packages

Check if Krustlet configuration environment variables are set before overwriting them

Write action log to support technical audits

As a system administrator I'd like to know what happened on a stackable 'managed' node.

An action log containing the relevant actions performed by the agent could help maintain a comprehensible state of the nodes.

downloaded version x.y.z of software abc
extracted software abc version x.y.z into folder foo
applied default security configuration
started node as part of a cluster
...
downloaded security patch x.y.z+1 of software abc
removed node from running cluster
stopped node
started rolling update of node 3 of cluster
added node to running cluster
rolling update finished successfully
...
downloaded new major version x+1.0.0 of software abc
applied configuration migration
started stop-the-world cluster update
...

configroot
logroot
packageroot

These replacements should also happen in environment variables that are added to pods (see below for an example), but this is not currently working.

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2020-12-14T12:48:19Z"
  generateName: spark-cluster-worker-
spec:
  containers:
  - command:
    - spark-3.0.1-bin-hadoop2.7/sbin/start-slave.sh
    - spark://bawa-virtualbox:7077
    env:
    - name: SPARK_CONF_DIR
      value: '{{configroot}}/conf'
    - name: SPARK_NO_DAEMONIZE
      value: "true"
    image: spark:3.0.1
    imagePullPolicy: IfNotPresent
    name: spark-3-0-1
