
Simple web interface to run containers as jobs or serverless functions

License: MIT License


conjob's Introduction

ConJob


ConJob is a web service for running containers as jobs. It's intended to run specified images that will do some work and then exit on their own. Any output from the job is returned to the caller.

Usage

It's recommended to run ConJob using docker, but it can also be built and run from source.

Docker

docker run -it \
  -p 8080:8080 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  scottg489/conjob

Build and run from source

git clone https://github.com/ScottG489/conjob.git \
  && cd conjob \
  && cp default-config.yml local-config.yml \
  && ./gradlew run --args="server local-config.yml"

local-config.yml can be edited to configure the server.

Note: Java 11 or higher is required.

Make a request

curl 'localhost:8080/job/run?image=library/hello-world:latest'

This should display the output of the hello-world image, similar to running docker run hello-world.

You can also supply input:

curl -X POST --data 'foobar' 'localhost:8080/job/run?image=scottg489/echo-job:latest'

POST data is supplied as arguments to the application, similar to running docker run scottg489/echo-job:latest foobar.

Development

Common gradle tasks

  • unitTest - Unit tests and ArchUnit architecture tests
  • integrationTest - Integration tests
  • run --args="server local-config.yml" - Starts service using local config
  • acceptanceTest - Acceptance tests
  • performanceTest - Performance tests using Gatling
  • piTest - Mutation tests using Pitest
  • jacocoApplicationReport - Generate JaCoCo coverage report for application
  • jacocoAllTestReport - Generate accumulative JaCoCo coverage report for all tests

Test coverage

Unit and integration test coverage is captured via JaCoCo.

Acceptance test coverage is also captured via JaCoCo. However, it's instrumented on the service itself, so the service needs to be shut down before the coverage data file is written.

Once the coverage data files have been generated, the jacocoAllTestReport task can be run to generate cumulative coverage data for all tests.

Complete build testing

To fully test your changes run ./test.sh at the root of the project. However, first make sure to change the file locations of the secrets to your actual locations.

This script runs the complete build, publish, infrastructure provisioning, and deployment to the test environment. It then runs the acceptance tests against the deployed service before then tearing the environment down. This allows you to get a very high degree of certainty that your changes will deploy successfully once pushed.

Note that you'll need to comment out the git clone in infra/build/run.sh; otherwise it will fail, since you've mounted a directory at the location it attempts to clone into. You'll also probably want to comment out all the prod deploy steps unless you really intend to deploy to prod from your local workstation.

Creating alt/bootstrap server

The ConJob project uses ConJob itself as a CI server. This means that in order to build the project you need an initial ConJob server running. In order to create this alternate/bootstrap server, simply run ./alt-env.sh. It can also be run to update the current alt server to the latest version.

The reason this is needed is that as part of its deploy the server shuts itself down. Although the build container will continue to run, the server will not return the build's output, and we'll have no logs. At some point this will be replaced by a blue-green deployment.

Development troubleshooting

+ terraform import aws_s3_bucket.backend_bucket tfstate-conjob
aws_s3_bucket.backend_bucket: Importing from ID "tfstate-conjob"...
aws_s3_bucket.backend_bucket: Import prepared!
  Prepared aws_s3_bucket for import

Error: Resource already managed by Terraform

Terraform is already managing a remote object for
aws_s3_bucket.backend_bucket. To import to this address you must first remove
the existing object from the state.

To remediate this issue, remove the existing object from the state (e.g. with terraform state rm aws_s3_bucket.backend_bucket) or clean up the tfstate files in the appropriate backend-init directory.

conjob's People

Contributors: scottg489

conjob's Issues

Investigate flakiness in JobRunnerTest

I believe the flakiness has something to do with the fact that JobRunner executes jobs using a separate thread for timeouts. Not quite sure what else it could be other than that.

Rename externally facing components of project to be "jobs" rather than "CI builds"

This is the second step (after #14) to re-purposing and re-branding the service as a generic job runner rather than only a CI server.

After renaming internally facing names, components, and interfaces, we need to do the same for outwardly facing ones.

Some things we'll want to address:

  • Change /build endpoint to something like /job or /run or likely /job/run
  • The domain name should be changed to conjob.io
    • Be sure this domain change is coordinated with other builds, since a few projects currently use this service for their CI builds.

Other things will likely be discovered as the work progresses.

#14 is a dependency of this issue.

Capture all exceptions and return an error in the form of a `JobResponse`

When an exception is thrown from a resource, dropwizard returns a JSON message with code and message fields. However, this is inconsistent with the /job/run endpoint's interface, which has jobRun, result, etc. We should capture all exceptions and return them in the same format regardless of whether it's an error or not.
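As a minimal sketch of the conversion this issue asks for: the field names below (jobRun, result) are taken from the /job/run interface described above, but the exact shape and values are assumptions. In Dropwizard this logic would live in a registered JAX-RS ExceptionMapper; only the conversion itself is shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: convert any Throwable into the same response shape
// that /job/run returns. Field names are assumed from the issue text.
public class UniformErrorMapper {
    public static Map<String, Object> toJobResponse(Throwable t) {
        Map<String, Object> body = new LinkedHashMap<>();
        body.put("jobRun", null);            // no job ran
        body.put("result", "ERROR");         // assumed sentinel value
        body.put("message", t.getMessage()); // human-readable detail
        return body;
    }

    public static void main(String[] args) {
        Map<String, Object> resp =
                toJobResponse(new IllegalStateException("image not found"));
        System.out.println(resp);
    }
}
```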

/var/lib/sysbox doesn't seem to be cleaned up

Containers or images (not sure which) for sysbox aren't being cleaned up properly. An example of where this can be found:

/var/lib/sysbox/docker/506ffa0b6a59e9b36335de834d55e1cbfedcb69c3f8f91f85b549f389845e1db/overlay2/bc1de8e4acae2126dc4b20c02d93f3e5e6392d6738206e9103d9b57bb9e1cf56

Detailed information can be found here:
https://github.com/nestybox/sysbox/blob/master/docs/user-guide/troubleshoot.md#the-varlibsysbox-is-not-empty-even-though-there-are-no-containers

They recommend cleaning up all sysbox containers and there's a script to do so. However, we do create a new container every deploy so those resources should be freed up. It might be worth stopping all of them and then while none are running clean up everything and see if that cleans up the /var/lib/sysbox dir. However, this basically happens on deploy. We should also try cleaning up all volumes too to see if that makes a difference.

If this doesn't work then we might want to create a github issue for sysbox since it seems like this shouldn't happen.

Job settings endpoint where settings for a job that apply to conjob can be configured

An example of this might be to configure a job to run on a schedule (although the feature isn't implemented yet). These would be settings for a job as they apply to conjob, not to the job run.

Some ideas:

  • Set max number of concurrent runs (less than or equal to global setting)
  • Set job to run on a schedule
  • Set default timeout
  • Set job to kick off another job (this idea should be thought through carefully)

Should the config and secrets endpoints be moved under this? (i.e. settings/secrets and settings/config)

Fix unit test JRE crash

The unit test run seems to consistently fail at the same point.

TempSecretsFileUtilTest > createSecretsFile STANDARD_OUT
    timestamp = 2021-05-25T05:27:25.951897, TempSecretsFileUtilTest:createSecretsFile = 
                                  |-------------------jqwik-------------------
    tries = 1000                  | # of calls to property
    checks = 1000                 | # of not rejected calls
    generation = RANDOMIZED       | parameters are randomly generated
    after-failure = PREVIOUS_SEED | use the previous seed
    when-fixed-seed = ALLOW       | fixing the random seed is allowed
    edge-cases#mode = MIXIN       | edge cases are mixed in
    edge-cases#total = 4          | # of all combined edge cases
    edge-cases#tried = 4          | # of edge cases tried in current run
    seed = 8450192231027384562    | random seed to reproduce generated values


#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (sharedRuntime.cpp:1261), pid=474, tid=494
#  guarantee((retry_count++ < 100)) failed: Could not resolve to latest version of redefined method
#
# JRE version: OpenJDK Runtime Environment (11.0.11+9) (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
# Java VM: OpenJDK 64-Bit Server VM (11.0.11+9-Ubuntu-0ubuntu2.20.04, mixed mode, sharing, tiered, compressed oops, serial gc, linux-amd64)
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /opt/build/conjob/core.474)
#
# An error report file with more information is saved as:
# /opt/build/conjob/hs_err_pid474.log
#
# If you would like to submit a bug report, please visit:
#   https://bugs.launchpad.net/ubuntu/+source/openjdk-lts
#

ConfigTaskTest > Given a conjob configuration, and new config values, when updating the config with new values, then fields in the config should be updated with new values, and fields not updated should be the same as the originals. SKIPPED

> Task :unitTest FAILED
:unitTest (Thread[Daemon worker,5,main]) completed. Took 1 mins 43.959 secs.
Closing Git repo: /opt/build/conjob/.git

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':unitTest'.
> Process 'Gradle Test Executor 1' finished with non-zero exit value 134
  This problem might be caused by incorrect test process configuration.
  Please refer to the test execution section in the User Manual at https://docs.gradle.org/6.7/userguide/java_testing.html#sec:test_execution

* Try:
Run with --stacktrace option to get the stack trace. Run with --debug option to get more log output. Run with --scan to get full insights.

* Get more help at https://help.gradle.org

BUILD FAILED in 3m 39s
10 actionable tasks: 10 executed

Unfortunately it's only reproducible on the build server, which is bad because the whole point is to have parity. In any case, a few ideas to fix this:

  • Try upgrading the JDK version. If this works I'm not sure we'll stick with it but it would be good to know.
  • Try playing with the tests that happen before/after the failure (see failure above, i.e. TempSecretsFileUtilTest and ConfigTaskTest). Try removing them, changing the number of tries, etc.
  • It might be a good idea to post an issue on the jqwik github page.

Return failure response code when the container stops with non-zero exit code

We currently ignore the exit code of the container run by the server. This means that if the build in the container fails then it will return a 200 response. This is not desirable because then the requester can't easily determine if there was a failure or not and can't act appropriately to a failed build.

We should detect the exit code and respond accordingly, and we should consider what response code would be appropriate. This would likely be a 4xx or 5xx.

Here are some arguments for each:

4xx:

  • If the user provided bad input to the build such as bad credentials or just a generally bad envar then it would be proper to return a 4xx (likely 400). However, this may be difficult to detect. Each build is very unique.
  • The build container itself (not just the name) is an input to the system. This means that if it failed, it's still because the user supplied an image for the server to run which had flaws in it. This means the build server itself was successful, but the user provided container was misconfigured and failed. In other words, the user provided bad input (the container itself).

5xx:

  • Due to the inability to determine what went wrong in the job because each build is unique, this may be the best blanket approach.
  • We have more control over errors on the server side (such as when a container doesn't exist so we can't even start it). This would mean that if we returned a 500 it was likely due to a failure with the build itself. However, it's always possible that the server will have unforeseen errors and could still return a 500.

That said, the important thing is that we don't return a successful response code so at least builds will report as having failed. I think it's also ok for there to be some overlap between failures returned from the server and failures returned from the container. They don't need to be mutually exclusive.

Note that if the container itself fails to run for some reason, such as the container not existing, we currently return a 500. We may want to revisit this and return a different error code instead; for instance, in the container-not-existing example above we should return a 404.
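As a sketch of the mapping being discussed here (the 4xx-vs-5xx question is still open, so the specific status choices below are illustrative assumptions, not a decision):

```java
// Illustrative exit-code-to-status mapping. The specific choices here
// (404 for a missing image, 400 for a failed container) are assumptions
// based on the arguments above, not ConJob's actual behavior.
public class ExitCodeToStatus {
    public static int toHttpStatus(Long exitCode, boolean imageFound) {
        if (!imageFound) {
            return 404; // the image name was bad input: not found
        }
        if (exitCode == null) {
            return 500; // never got an exit code: server-side failure
        }
        if (exitCode == 0) {
            return 200; // job succeeded
        }
        return 400;     // container ran but failed: the image itself was bad input
    }

    public static void main(String[] args) {
        System.out.println(toHttpStatus(0L, true));
        System.out.println(toHttpStatus(1L, true));
        System.out.println(toHttpStatus(null, false));
    }
}
```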

Use docker cache across runs

This will bring builds back down to their previous duration and, just as (or maybe more) importantly, will prevent disk space from being used up so quickly.

I know there is documentation in the sysbox GitHub repo. That would be a good place to start.

Add the ability to rate limit jobs

In order to prevent the server from being overloaded we should set a rate limit. This could be done in either or both of these ways:

  • Rate limit on the entire server. Don't allow it to accept requests above a certain rate
  • Rate limit per "user". This might not make sense since we don't have the concept of a "user" in the server and don't want one.

Currently I prefer the former.

Usually rate limits are applied on a per-user basis. However, since there is no concept of a user in our server, this wouldn't quite make sense. The only way I can think of implementing it would be to track IPs, which feels a bit out of scope.

However, rate limiting the entire server seems like it would be fairly straightforward. See if dropwizard has any kind of native support for this; otherwise look for some kind of generic rate limiting library, or possibly implement it yourself if it wouldn't be too complex.
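If no native support or library fits, a hand-rolled version could be as small as the sketch below: a fixed-window counter for the whole server. All names are illustrative, not ConJob's actual code, and a real implementation might prefer an existing library (e.g. Guava's RateLimiter).

```java
// Sketch of a server-wide rate limiter using a fixed-window counter.
// Illustrative only; names and the windowing strategy are assumptions.
public class ServerRateLimiter {
    private final int maxPerWindow;
    private final long windowMillis;
    private long windowStart;
    private int count;

    public ServerRateLimiter(int maxPerWindow, long windowMillis) {
        this.maxPerWindow = maxPerWindow;
        this.windowMillis = windowMillis;
        this.windowStart = System.currentTimeMillis();
    }

    // Returns true if the request is allowed, false if it should be
    // rejected (e.g. with a 429 Too Many Requests).
    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        if (now - windowStart >= windowMillis) {
            windowStart = now; // new window: reset the counter
            count = 0;
        }
        if (count >= maxPerWindow) {
            return false;
        }
        count++;
        return true;
    }

    public static void main(String[] args) {
        ServerRateLimiter limiter = new ServerRateLimiter(2, 60_000);
        System.out.println(limiter.tryAcquire()); // true
        System.out.println(limiter.tryAcquire()); // true
        System.out.println(limiter.tryAcquire()); // false: window is full
    }
}
```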

This is a follow on ticket from #4.

Install `shiftfs` on AMI

Follow up issue from #11.

Job isolation is now working. The AMI didn't have shiftfs support but we were able to get around that with userns-remap and --userns=host. See related issue #11 and linked discussions for more info.

The --userns=host may not be ideal. Ideally we want shiftfs installed on the server. From #11:

Now everything seems to be working as expected. A slightly cleaner solution is outlined in this comment in the sysbox discussion: nestybox/sysbox/discussions/121#discussioncomment-130136.
This should install shiftfs on the AMI which would mean the --userns=host wouldn't be necessary.

We can follow the directions in that discussion to try to install shiftfs on the AMI. This seems like a cleaner solution.

However, for now everything seems to be working fine with acceptable job isolation so I'd say this is lower priority.

Add way to determine version of CI server

There is no way right now to determine what version of the code the server is running. This would be helpful for troubleshooting issues with the server itself.

Rename and repurpose project

Repurpose to a "job" service rather than a service specifically for CI jobs. This would involve renaming things such as:

  • Change /build endpoint to something like /job or /run
  • Renaming all references in code such as class names or variables to indicate it's a "job" not a "build"
  • Use new project name. The name I've decided on is con job
  • Along with the project name, the domain name should be changed to conjob.io
    • Be sure this domain change is coordinated with other builds, since a few projects currently use this service for their CI builds.

Other things will likely be discovered as the work progresses.

Fix integration test failure

There's an issue with the build:

com.spotify.docker.client.shaded.javax.ws.rs.client.ResponseProcessingException: com.spotify.docker.client.shaded.com.fasterxml.jackson.databind.exc.InvalidDefinitionException: Cannot construct instance of `com.spotify.docker.client.messages.Image`, problem: Null virtualSize

Report more informative error to user when same job is run concurrently with docker cache mounted

Currently when two concurrent jobs try to access /var/lib/docker the second job will fail with errors only reported to the dockerd logs.

After a little more research: two jobs can actually mount /var/lib/docker at the same time. However, the second job to run dockerd will fail to start the daemon, because only one instance of the daemon can run at a time.

This actually relaxes the requirement a bit: multiple instances of a job can run at the same time with /var/lib/docker mounted, as long as only one of them runs dockerd.

However, this also makes reporting this error to the user more difficult since the problem occurs inside their job rather than at the outer docker level where we can monitor.

It seems like we'd have to somehow introspect the job to see whether dockerd was running at the same time, or whether two things were trying to access the mounted /var/lib/docker. This seems rather invasive, though.

A better alternative could be to warn the user, when the job is being run concurrently, that any attempts to run dockerd could result in errors.

This may even open the possibility of only running dockerd in jobs just in time for the docker commands that need it, and stopping the daemon afterwards. This would mitigate any overlap, but would not completely rule out failures.

This limitation should also be documented.

Finally, eventually we may want to add job configuration or query parameters to enable/disable mounting of /var/lib/docker and maybe of use of sysbox altogether if the user doesn't need this functionality. This way they would have to be explicit and more aware of limitations.

This is a follow up to #17.

Secret management

Currently secrets are passed in as input to containers being run. This is fine for most cases since we want it to be the caller's responsibility to manage secrets.

However, there still may be some use cases where we want to be able to initialize secrets so a job can access them when run.

For instance, take a caller that needs to run a particular job. This job requires credentials to a third party to carry out its tasks. However, the caller may not have permission to that third party.

Unfortunately, I think this will require some sort of state. Statelessness is something we want to maintain whenever possible. However, I don't think it can be entirely avoided if we want this functionality. A good compromise may be to have this state entirely managed by the docker application if possible.

So far I have looked into

  • Docker secrets
  • Docker volumes

On first pass at least, docker secrets don't seem to be an option. They seem to only be usable with docker swarm, and swarm doesn't mesh well with our use case: it's a lot more focused on services, and since our jobs are meant to terminate, it seemed like we'd be using it for something it isn't meant for.

Now, I am currently looking into using docker volumes. The idea would be to allow a secret to be registered by storing it in a volume and for a specified image. Then when the image is run it would have access to this secret. I believe this should be safe enough because when you create the secret you specify which image(s) have access to it by name.

However, one big issue arises from the fact that all images that are run are privileged and can control the host dockerd (since docker.sock is mounted on all run images). This means that they have full access to run anything on the host dockerd and thus would have access to any volume and view anything stored in these volumes, such as a secret. Since containers are allowed to run docker freely there is little that can be done about this right now.

I think a longer term solution would be to not run all images with docker.sock mounted. This would be safer in general since doing this in the first place is a bad idea for security reasons. However, this will also break current builds that expect this to be allowed and do things like building docker images within their builds.

One solution to this would be to require builds be more "modular" and instead call specific docker images like the "docker-build-push" image to handle docker-related tasks.

For instance, in a build, an application could build its binary, put it on a named docker volume (unrelated to secrets), then quit. Then an approved/privileged image like docker-build-push could mount that volume and run the docker commands it needs to. This would probably require some kind of whitelist on the server so that only specified images would be allowed to run as privileged.

Add the ability to control the max number of/rate limit jobs

In order to prevent potential abuse or accidental overusage of the service there should be a way to configure two things:

  1. A rate limit on how quickly the endpoint can be hit to start up new jobs
  2. A limit on the maximum number of jobs

Note that both of these could additionally be configured with the same constraints on a per-container basis. In other words, you can't run more than x instances of the same job, or run them in quick succession.

This is along the same lines of attempting to prevent abuse as issue #3.

Move DockerAdapter out of core.job into new core.adapter package

DockerAdapter is the only core class that should have third party dependencies. Perhaps it shouldn't even live in core, but at least for now let's move it out of core.job.

After this is done, also create an ArchUnit test which makes sure that no packages in the entire project depend on the third party docker lib except that package.

Use of mutation testing in conjob - Help needed

Hello there!

My name is Ana. I noted that you use the mutation testing tool Pit in the project.
I am a postdoctoral researcher at the University of Seville (Spain), and my colleagues and I are studying how mutation testing tools are used in practice. With this aim in mind, we have analysed over 3,500 public GitHub repositories using mutation testing tools, including yours! This work has recently been published in a journal paper available at https://link.springer.com/content/pdf/10.1007/s10664-022-10177-8.pdf.

To complete this study, we are asking for your help to understand better how mutation testing is used in practice, please! We would be extremely grateful if you could contribute to this study by answering a brief survey of 21 simple questions (no more than 6 minutes). This is the link to the questionnaire https://forms.gle/FvXNrimWAsJYC1zB9.

We apologize if you have already received this message multiple times or have already had the opportunity to complete the survey. If you have already shared your feedback, we want to convey our appreciation; kindly disregard this message, and please accept our apologies for any inconvenience.

Drop me an e-mail if you have any questions or comments ([email protected]). Thank you very much in advance!!

Fix job duration limiting

It seems like this isn't working properly. A "service" job was left to run and kept going past 30 minutes.

Catch all exceptions between incrementing and decrementing limiters

This is particularly important for the limiter for max concurrent jobs. If an exception occurs during processing, the count is never decremented, causing the service to think there are still jobs running. The job count will continue to climb until it exceeds the max allowed jobs, at which point all other jobs are rejected.
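The fix amounts to pairing the increment and decrement in a try/finally so the count is released even when the job throws. A sketch (class and method names are illustrative, not ConJob's actual code):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

// Sketch: guarantee the concurrent-job slot is released even when
// processing throws, by putting the release in a finally block.
public class MaxConcurrentJobsGuard {
    private final Semaphore slots;

    public MaxConcurrentJobsGuard(int maxConcurrentJobs) {
        this.slots = new Semaphore(maxConcurrentJobs);
    }

    public String runJob(Callable<String> job) throws Exception {
        if (!slots.tryAcquire()) {
            throw new IllegalStateException("max concurrent jobs exceeded");
        }
        try {
            return job.call();   // may throw: the finally still runs
        } finally {
            slots.release();     // always decrement, even on exceptions
        }
    }

    public static void main(String[] args) throws Exception {
        MaxConcurrentJobsGuard guard = new MaxConcurrentJobsGuard(1);
        try {
            guard.runJob(() -> { throw new RuntimeException("job blew up"); });
        } catch (RuntimeException ignored) {
            // the single slot must have been released despite the failure
        }
        System.out.println(guard.runJob(() -> "ok"));
    }
}
```

If the slot leaked on failure, the second runJob call here would reject instead of printing "ok".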

Add support for basic auth

Anyone with access to the build server's URL can run jobs at will and with no limit. A good initial solution to this would be to add basic auth to the endpoint so it can be somewhat restricted.
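The header check behind basic auth is small; Dropwizard has its own auth support that would do the wiring, so the sketch below only illustrates the credential comparison (all names and values are made up):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch of a basic-auth credential check. Illustrative only: a real
// deployment would use Dropwizard's auth support and constant-time
// comparison rather than String.equals.
public class BasicAuthCheck {
    public static boolean isAuthorized(String authorizationHeader,
                                       String expectedUser, String expectedPass) {
        if (authorizationHeader == null || !authorizationHeader.startsWith("Basic ")) {
            return false;
        }
        String decoded = new String(
                Base64.getDecoder().decode(
                        authorizationHeader.substring("Basic ".length())),
                StandardCharsets.UTF_8);
        return decoded.equals(expectedUser + ":" + expectedPass);
    }

    public static void main(String[] args) {
        String good = "Basic " + Base64.getEncoder()
                .encodeToString("admin:s3cret".getBytes(StandardCharsets.UTF_8));
        String bad = "Basic " + Base64.getEncoder()
                .encodeToString("admin:wrong".getBytes(StandardCharsets.UTF_8));
        System.out.println(isAuthorized(good, "admin", "s3cret")); // true
        System.out.println(isAuthorized(bad, "admin", "s3cret"));  // false
    }
}
```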

Consider disabling unattended upgrades

Unattended upgrades have caused a few issues in the past.

Package installations would fail because the unattended upgrade was running apt which held a lock. One solution to this specific problem is that we could wait for the upgrade to finish.

It has introduced breaking changes to production. Specifically, the version of runc was upgraded to fix a security vulnerability. This is good, but it also caused the service to completely break. It would have been better if package upgrades were first run against a test environment and, if that worked, the production server was then upgraded.

Performance of the server shortly after startup could be affected. I haven't seen this specifically, but if unattended upgrades are running upgrades right after the server starts up this could affect performance.

Here are a few relevant links:

ansible/ansible#25414
https://stackoverflow.com/questions/45269225/ansible-playbook-fails-to-lock-apt/51919678#51919678
https://help.ubuntu.com/community/AutomaticSecurityUpdates
https://unix.stackexchange.com/questions/463498/terminate-and-disable-remove-unattended-upgrade-before-command-returns
https://unix.stackexchange.com/questions/342663/how-is-unattended-upgrades-started-and-how-can-i-modify-its-schedule/342674#342674
https://unix.stackexchange.com/questions/315502/how-to-disable-apt-daily-service-on-ubuntu-cloud-vm-image

Is it problematic that docker-build-push docker cache is shared between projects?

Are there any security implications? The docker-build-push image is building separate projects, but they're all sharing the same cache. I think this should be OK as long as there isn't a way for someone to abuse the inputs to docker-build-push to run malicious code. I haven't actually audited it, so right now this could be a problem. However, I think this is really a problem for any image that's run on conjob.

For instance, a job has run and caches a build which may have secrets on it. Then a malicious user runs the job again with inputs that run some malicious code. (Thinking about this, isn't this a problem with GitHub Actions even? For instance PR's causing builds to run automatically, but those PRs could edit the workflow file or something else in the build to expose secrets?).

It also means that only one of the jobs can be running at a single time since you can't have more than one instance of a job that's accessing the docker cache shared between them. Not even pertaining to the docker cache, this is a bit of an annoying limitation.

I'd have to think this through more thoroughly, but it got me wondering if we even need docker-build-push.

Could you somehow have an image that built itself? It would probably have to be bootstrapped, but then perhaps it would just need to be parameterized to either build and push or run the actual project build.

Perhaps there should be something like a base image per project on which everything runs on top of? This way they are entirely "namespaced" separately from other builds. The base image would just need dockerd installed and running and then it could build the CI build image (maybe wouldn't even need to push it?) and then run that image to do the actual CI build. The base image could be built by conjob too, but I think this might just get us back to the security and partially the single instance problem (though maybe less so since it should never change and would build from cache every time?) because the base image would still need to be built by docker-build-push with the required secrets. So it might be necessary for the base image to be bootstrapped (built and pushed locally once).

Use PostBodyTask for admin config endpoint

PostBodyTask is more appropriate for the admin config endpoint used to do live updates to the application's config (currently only the job limit config). The current config update endpoint only accepts GET requests; however, since we are updating the config, a POST request is what's appropriate.

Look into this Task to fully evaluate if this is the better use case and then implement it.

Also refer to email "Regarding updating the dropwizard config on runtime" and see if we can give credit to the email's sender for letting me know about this Task.

Performance tests are inconsistent

The performance test stability seems to be somewhat improved by warming up the endpoints. However, they can still be inconsistent and lead to flaky tests.

Determine a way to make the tests less flaky or perhaps even find an alternative for what we're trying to accomplish with having the tests around (which is at least to prevent a performance degradation when making changes).

Another promising option might be to spread the tests out over a longer period of time. Right now all the tests are fairly short mainly to keep the build time low. However, spreading them out over a longer period of time could even out the results.

I think right now the performance tests are still on the quick side so we have some leeway to make them longer without it being much of an impact. It might also be an option to remove them from the main build and have them run nightly if they are going to run for an extended period of time.

Configure provisioning to point simple-ci.com at service

Currently simple-ci.com is hardcoded to point to an unmanaged instance. When the instance is created we want that record to actually point at the provisioned instance. This should be done via terraform by adding a new (route-53?) resource.

Use "pull if absent" logic to get intermediary image (tianon/true) for secret endpoint

We pull the tianon/true image on startup when we initialize the /secret resource. This is a good solution, but if for some reason the image gets removed from the machine while the service is running then the /secret endpoint will stop working.

To solve this we should use the "pull if absent" strategy which is one of the pull strategies available in the /job/run endpoint.

Make sure to weigh this change with any performance hit the endpoint will take. I don't believe there should be any performance penalty because up until starting the container there is very little overhead, but make sure to test this. This would be a good time to add some performance tests for the /secret endpoint.
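The strategy itself is simple; here is a sketch against a hypothetical minimal client interface (the real code would go through the Docker client library, whose API is not reproduced here):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the "pull if absent" strategy. ImageClient is a hypothetical
// interface standing in for the real Docker client.
public class PullIfAbsent {
    interface ImageClient {
        boolean imageExists(String image);
        void pull(String image);
    }

    // Pull only when the image is missing locally, so the endpoint keeps
    // working even if the image was removed while the service is running.
    public static boolean ensureImage(ImageClient client, String image) {
        if (client.imageExists(image)) {
            return false; // already present: no pull, no overhead
        }
        client.pull(image);
        return true;      // pulled on demand
    }

    public static void main(String[] args) {
        Set<String> local = new HashSet<>();
        ImageClient fake = new ImageClient() {
            public boolean imageExists(String image) { return local.contains(image); }
            public void pull(String image) { local.add(image); }
        };
        System.out.println(ensureImage(fake, "tianon/true")); // true: pulled
        System.out.println(ensureImage(fake, "tianon/true")); // false: cached
    }
}
```

Because the happy path is a single local existence check, the performance cost on the /secret endpoint should be negligible, but that is exactly what the performance tests mentioned above would confirm.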

Rename internals of project to be "jobs" rather than "CI builds"

This is the first step (before #15) to re-purposing and re-branding the service as a generic job runner rather than only a CI server.

We first need to rename inward facing components such as class names and other naming. We'll then follow up with renaming outward facing interfaces since other projects assume certain names, URLs, etc.

Some things we'll want to address:

  • Renaming all internally facing references in code such as class names or variables to indicate it's a "job" not a "build"
  • Use new project name. The name I've decided on is con job (containerized jobs)
    • I don't think renaming the project should be outwardly facing but this should be verified

Other things will likely be discovered as the work progresses.

#15 depends on this issue.

Look for tools to detect out-of-date dependencies (and possibly update them)

Try to find a tool that will be able to keep tabs and possibly automatically update any dependencies we have.

This applies to libs as well as tools. Code dependencies all live in the build.gradle file. However, there are plenty of other dependencies we have such as docker, ansible, terraform, etc.

As time goes on, new versions of the tools and libs we use will be released. It's a good idea to keep up with them as much as possible.

Should we write dropwizard integration tests even though it doesn't seem to integrate with jqwik?

jqwik @Property tests don't seem to work with @ExtendWith(DropwizardExtensionsSupport.class). The DropwizardClientExtension and DropwizardAppExtension instances don't appear to be initialized properly, and no information is set on them (most importantly the base URL).

I think these tests would cover about the same breadth as the acceptance tests, so I don't think it's a huge deal. There's quite a bit of overlap between them. It might be a little more performant to do it with integration tests but maybe not. We'd also have easier control over the configuration, but with the config endpoint that might not matter as much.

In lieu of this, it might be a good idea to try introducing jqwik into the acceptance tests.

Require tags

Without an explicit tag on the image name, the service appears to pull down every single tag for the image, so tags should be required.
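A hypothetical validator for the ?image= parameter could look like this (the class and method names are illustrative, not ConJob's actual API; whether to reject untagged names or default them to :latest is an open design choice):

```java
// Sketch of tag validation for an image reference supplied to /job/run.
public class ImageNameValidator {
    // An image reference is considered tagged if a ':' appears after the
    // last '/', e.g. "library/hello-world:latest" is tagged, while
    // "localhost:5000/hello-world" is a registry port, not a tag.
    static boolean hasTag(String imageName) {
        int lastSlash = imageName.lastIndexOf('/');
        int lastColon = imageName.lastIndexOf(':');
        return lastColon > lastSlash;
    }

    public static void main(String[] args) {
        System.out.println(hasTag("library/hello-world:latest")); // true
        System.out.println(hasTag("library/hello-world"));        // false
        System.out.println(hasTag("localhost:5000/hello-world")); // false
    }
}
```

An untagged name would then either be rejected with a 4xx response or normalized to :latest before the pull, avoiding the every-tag pull entirely.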

Reduce error output in unit tests

We have unit tests which ensure errors are thrown, which means the test output contains a stacktrace. Normally this is fine, but since we're using property-based tests, the tests run a very large number of times. The output usually overflows console scrollback, and even GitHub's Actions logs won't display it in full.

To reduce this output we could simply hide all of it. However, that might hide problems and make troubleshooting more difficult. Ideally we could show the first stacktrace and then just a count of all the other ones. Perhaps the property-based testing framework we're using even has something built in for this?
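If the framework has nothing built in, the "first stacktrace, then a count" idea could be hand-rolled roughly like this (a sketch, not an existing jqwik feature):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: print the first failure's stack trace in full, then only count
// subsequent failures so property-based test output stays readable.
public class FirstTraceReporter {
    private final AtomicInteger failures = new AtomicInteger();

    void report(Throwable t) {
        // Only the first failure gets a full stack trace.
        if (failures.incrementAndGet() == 1) {
            t.printStackTrace();
        }
    }

    String summary() {
        int n = failures.get();
        return n <= 1 ? "" : (n - 1) + " further failures suppressed";
    }
}
```

The summary line would be printed once at the end of the run instead of thousands of repeated stacktraces.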

Start service on boot

In the deployed instance, set up conjob to start on boot.

This will help with being able to restart the server unattended knowing the service will come back up automatically.
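On a systemd-based instance, this could be a minimal unit file. The paths, flags, and the assumption that ConJob runs via its published Docker image are guesses about the deployment, not taken from the repo:

```
# /etc/systemd/system/conjob.service -- minimal sketch; adjust as needed
[Unit]
Description=ConJob job runner
After=docker.service
Requires=docker.service

[Service]
ExecStart=/usr/bin/docker run --rm --name conjob \
  -p 8080:8080 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  scottg489/conjob
ExecStop=/usr/bin/docker stop conjob
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

It would then be enabled with `systemctl enable --now conjob`, which covers the unattended-restart case.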

Once this is done, create a follow-up ticket to run the alt service as a persistent rather than one-time spot_type. Then also consider reducing the spot_price substantially, since it isn't critical that the instance is always available given that the conjob build doesn't run very often.

Consider not mounting docker.sock on each container

This was always a security issue but is now more urgent/required due to secrets being introduced with the completion of #10. Here is some relevant information:

However, one big issue arises from the fact that every image that is run is effectively privileged and can control the host dockerd (since docker.sock is mounted into all run images). This means it has full access to run anything on the host dockerd, and thus can mount any volume and view anything stored in those volumes, such as a secret. Since containers are allowed to run docker freely, there is little that can be done about this right now.

I think a longer term solution would be to not run all images with docker.sock mounted. This would be safer in general since doing this in the first place is a bad idea for security reasons. However, this will also break current builds that expect this to be allowed and do things like building docker images within their builds.

One solution to this would be to require builds be more "modular" and instead call specific docker images like the "docker-build-push" image to handle docker-related tasks.

For instance, in a build, an application could build its binary, put it on a named Docker volume (unrelated to secrets), and then exit. An approved/privileged image like docker-build-push could then mount that volume and run the docker commands it needs to. This would probably require some kind of whitelist on the server so that only specified images would be allowed to run as privileged.

The big benefit of mounting docker.sock is ease of use. It allows users to run docker commands within their containers, which is a completely valid use case. I personally use it often to build and push application images from project build containers.

Not mounting the docker.sock would mean some substantial work would have to be done to not lose the ability to do docker related tasks. The work needed for this should be discussed in a separate issue. However, to summarize some ideas here:

  • Allow for native support to do building and pushing (running? or too unsafe?). What would this look like?
  • Allow for "whitelisted" images which are allowed to do docker operations (via having the docker.sock mounted). This would be fairly easy to implement but isn't ideal, because it would not scale well.
  • Allow for special purpose images to be "whitelisted" which would handle docker related tasks specifically (e.g. the docker-build-push image). This could work by building your application as normal, putting the artifact on a shared volume, then mounting that volume onto the whitelisted image and it would do the necessary docker operations. Another, possibly less versatile, option would be to publish your built artifact and then download it during your docker build.

Whatever the implementation, great care needs to be taken so that the docker operations will at least not be able to access the secrets volumes.

This should be a configuration option rather than something hard-coded. If users completely trust all jobs running on the server then it's fine to allow it.

Allow config values to be updated dynamically through the admin interface

Add an endpoint on the admin port to be able to update config values dynamically.

The main reason for this is testing, but it's also a useful feature to have in general.

For some of the tests, it's necessary to change or reduce values like request limits or job timeouts for the sake of testing. For instance, the job timeout is 30 minutes; it can't realistically be tested directly because the test would take too long. However, if tests can shorten the timeout dynamically, they can cover it effectively.

This will require some research into how the dropwizard config can be dynamically updated.
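Since a Dropwizard configuration is normally immutable once loaded, one possible shape is a mutable override layer that the admin endpoint writes into and readers consult first. This is a sketch under that assumption; none of these names come from ConJob or Dropwizard:

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a mutable override layer on top of static config values.
// A hypothetical admin endpoint (e.g. POST /admin/config) would call
// override(); code that reads config consults the overrides first.
public class DynamicConfig {
    private final ConcurrentHashMap<String, String> overrides = new ConcurrentHashMap<>();
    private final int defaultJobTimeoutMinutes;

    public DynamicConfig(int defaultJobTimeoutMinutes) {
        this.defaultJobTimeoutMinutes = defaultJobTimeoutMinutes;
    }

    public void override(String key, String value) {
        overrides.put(key, value);
    }

    public int jobTimeoutMinutes() {
        return Optional.ofNullable(overrides.get("jobTimeoutMinutes"))
                .map(Integer::parseInt)
                .orElse(defaultJobTimeoutMinutes);
    }
}
```

A test could then shrink the timeout through the admin port before exercising the timeout path, and reset it afterwards.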

Introduce jqwik property tests into acceptance tests

It might be a good idea to introduce jqwik property-based tests into the acceptance tests. We should probably keep the number of tries small, though we can measure how performant the tests are to judge how many tries are feasible.

"Warm up" endpoints before performance tests

The performance tests can be unpredictable when certain endpoints need to do a pull.

I don't think it's important to factor pulls into load tests, since network performance heavily affects them.

I saw that Gatling has a way to "warm up" endpoints before starting the performance test. We should do this for all tests.

Consider adding a timeout for pulls

We have a timeout for job run time but not for pulls. We don't want a pull to run for too long, since that probably means the image is massive or there is some other issue with the pull. We can handle this by:

  • Including the pull time in the job run time, so there is one total time limit covering both
  • Including a separate timeout just for the pull

I think I prefer the latter. It wouldn't make sense to allow a pull to go on for 30 minutes. It would also allow for jobs to know exactly how much time they have to run rather than having to worry about the ambiguity of how long a pull took.

It also might be a good idea to have an overall timeout: rather than a job timeout plus a pull timeout, the pull time could count against a single overall limit. This may be more complicated and should probably be made into a separate issue if desired. Before closing this issue, decide whether that is wanted and, if so, file another issue.
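The separate pull timeout (the option preferred above) could be sketched with a plain executor; `pull` here is a placeholder for the real Docker pull call, not ConJob's actual code:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of a separate pull timeout: the pull runs on its own thread and
// is abandoned if it exceeds its budget, leaving the job timeout untouched
// so jobs know exactly how much run time they have.
public class TimedPull {
    static boolean pullWithTimeout(Runnable pull, long timeout, TimeUnit unit) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> future = executor.submit(pull);
        try {
            future.get(timeout, unit);
            return true;                  // pull finished within its budget
        } catch (ExecutionException | TimeoutException e) {
            future.cancel(true);          // interrupt a failed or oversized pull
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A pull that blows its budget fails fast instead of silently eating into the 30-minute job timeout.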
