
philips-labs / terraform-aws-github-runner


Terraform module for scalable GitHub action runners on AWS

Home Page: https://philips-labs.github.io/terraform-aws-github-runner/

License: MIT License

HCL 49.40% Shell 3.59% TypeScript 44.22% Dockerfile 0.09% PowerShell 2.71%
github github-actions terraform actions-runner serverless lambda aws scalable cicd self-hosted

terraform-aws-github-runner's People

Contributors

aadrijnberg, alexjurkiewicz, bdruth, bendavies, dependabot[bot], dylanmtaylor, forest-releaser[bot], gertjanmaas, github-actions[bot], guptanavdeep1983, henrynguyen5, jeroenknoops, jonico, jpalomaki, julada, kmaehashi, kring, kuvaldini, marcofranssen, marekaf, mcaulifn, npalm, patrickmennen, scottguymer, sdarwin, semantic-release-bot, taharah, toast-gear, ulich, wzyboy



terraform-aws-github-runner's Issues

scale-up: Resource not accessible by integration

The scale-up lambda function fails with the error RequestError [HttpError]: Resource not accessible by integration.

I also struggled to find the expected format for the github_app key_base64 variable. I kept getting errors like error:0909006C:PEM routines:get_name:no start line; a multi-line string (starting LS0t) that was the base64 of the whole PEM file eventually worked.

I tried the suggestion in #203 of granting Read access to the installed app in the "Actions" repository permissions without success.

The error message shows that the URL being accessed is https://api.github.com/repos/<my organisation>/<my repo>/actions/runs?status=queued with the authorization header authorization: 'token [REDACTED]', and the request is rejected by GitHub.com with 403 Forbidden.

Please advise.
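
One way to narrow this down is to authenticate as the GitHub App and dump the permissions actually granted to each installation, to confirm that actions: read is really in effect. A minimal sketch, assuming @octokit/rest and @octokit/auth-app are available; the environment variable names are illustrative:

import { Octokit } from '@octokit/rest';
import { createAppAuth } from '@octokit/auth-app';

(async () => {
  // Authenticate as the app itself (JWT), so app-level endpoints are accessible.
  const appClient = new Octokit({
    authStrategy: createAppAuth,
    auth: {
      appId: Number(process.env.GITHUB_APP_ID), // illustrative env var names
      privateKey: Buffer.from(process.env.GITHUB_APP_KEY_BASE64 ?? '', 'base64').toString('utf8'),
    },
  });

  // Each installation object carries the permissions the installer accepted;
  // 'actions: read' must appear here for .../actions/runs to be accessible.
  const installations = await appClient.apps.listInstallations();
  for (const installation of installations.data) {
    console.log(installation.id, installation.permissions);
  }
})();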

Resource not accessible by integration? What perms am I missing?

I'm sure I am just missing something documented somewhere; my integration has been made under my user account whilst I test this. Does the integration need any more permissions than:

  • read/write on admin
  • read on checks
  • read/write on actions
    at /var/task/index.js:15325:23
    at processTicksAndRejections (internal/process/task_queues.js:97:5) {
  status: 403,
  headers: {
    'access-control-allow-origin': '*',
    'access-control-expose-headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset',
    connection: 'close',
    'content-encoding': 'gzip',
    'content-security-policy': "default-src 'none'",
    'content-type': 'application/json; charset=utf-8',
    date: 'Mon, 13 Jul 2020 14:29:12 GMT',
    'referrer-policy': 'origin-when-cross-origin, strict-origin-when-cross-origin',
    server: 'GitHub.com',
    status: '403 Forbidden',
    'strict-transport-security': 'max-age=31536000; includeSubdomains; preload',
    'transfer-encoding': 'chunked',
    vary: 'Accept-Encoding, Accept, X-Requested-With',
    'x-content-type-options': 'nosniff',
    'x-frame-options': 'deny',
    'x-github-media-type': 'github.v3; format=json',
    'x-github-request-id': 'B73E:A1E4:B1A84D:D6FF77:5F0C6FB8',
    'x-ratelimit-limit': '5000',
    'x-ratelimit-remaining': '4884',
    'x-ratelimit-reset': '1594651917',
    'x-xss-protection': '1; mode=block'
  },
  request: {
    method: 'GET',
    url: 'https://api.github.com/repos/callum-tait-pbx/test_repository/actions/runs?status=queued',
    headers: {
      accept: 'application/vnd.github.v3+json',
      'user-agent': 'octokit-rest.js/17.11.2 octokit-core.js/2.5.4 Node.js/12.16.3 (Linux 4.14; x64)',
      authorization: 'token [REDACTED]'
    },
    request: { hook: [Function: bound bound register] }
  },
  documentation_url: 'https://developer.github.com/v3/actions/workflow_runs/#list-repository-workflow-runs'

Note that I haven't attached a dummy runner to my repository yet; I assumed I would run into a problem pointing to that at some point and would deal with it then.

If there are more than 100 runners registered, scale-down fails

scale-down.ts:113 and scale-down.ts:116 use actions.listSelfHostedRunnersForOrg and actions.listSelfHostedRunnersForRepo directly.
The returned object contains an array data.runners, containing up to 100 registered runners according to the API documentation.

If there are more than 100 runners registered, the additional runners are not considered for scaling down.

Question: What happens if a spot instance is terminated by AWS?

Hi, first of all, congratulations for this great project.

We have deployed github-runner successfully and it's running very well so far.

One question, please. As you know, spot instances can be terminated by AWS. If a GitHub runner EC2 instance is suddenly stopped by AWS in the middle of a pipeline, what happens to the GitHub pipeline? Does it fail? Is there any retry/re-schedule mechanism to re-execute the build?

Thank you very much in advance.

Improve runner deletion by using `busy` flag

When building this solution, the GitHub API couldn't tell whether a runner was busy or not, so we resorted to trying to delete each runner via the API. If that returned a 500 Internal Server Error, we knew it was busy.

Just played with the API again and saw the busy flag was added for runners. See https://docs.github.com/en/rest/reference/actions#list-self-hosted-runners-for-a-repository

Instead of trying to delete a runner, we should use this flag to reduce the number of API calls to GitHub.
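
A minimal sketch of the proposed flow, assuming an @octokit/rest client whose types include the new busy field:

import { Octokit } from '@octokit/rest';

// Delete only runners the API reports as idle, instead of probing each one
// with a speculative delete call.
async function deleteIdleRunners(octokit: Octokit, owner: string, repo: string): Promise<void> {
  const { data } = await octokit.actions.listSelfHostedRunnersForRepo({
    owner,
    repo,
    per_page: 100,
  });

  for (const runner of data.runners) {
    if (!runner.busy) {
      await octokit.actions.deleteSelfHostedRunnerFromRepo({ owner, repo, runner_id: runner.id });
    }
  }
}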

Security Model?

Problem to solve

As a project owner I want to limit production runner access to protected branches

Intended users

Repo owners setting up deployment rules

Further details

In GitLab you can tie certain runners to protected branches. This enables us to use runners with production credentials and access levels, separate from the pool of runners available for every other branch.

It provides a security model in which accidental or intentional changes to production are limited to merged code.

Proposal

No proposal, this is a question.

Documentation

Availability & Testing

What does success look like, and how can we measure that?

Other links/references

I asked a similar question in the GitHub Actions community forum:
https://github.community/t5/GitHub-Actions/Limit-self-managed-runners-to-protected-branches/m-p/55943#M9692

Scale up lambda failed

Hi. I get an error in the scale-up lambda after setting up your module.
Cloudwatch logs below:

ERROR	Invoke Error 	
{
    "errorType": "Error",
    "errorMessage": "Failed handling SQS event",
    "stack": [
        "Error: Failed handling SQS event",
        "    at _homogeneousError (/var/runtime/CallbackContext.js:12:12)",
        "    at postError (/var/runtime/CallbackContext.js:29:54)",
        "    at callback (/var/runtime/CallbackContext.js:41:7)",
        "    at /var/runtime/CallbackContext.js:104:16",
        "    at /var/task/index.js:16834:16",
        "    at Generator.throw (<anonymous>)",
        "    at rejected (/var/task/index.js:16816:65)",
        "    at processTicksAndRejections (internal/process/task_queues.js:97:5)"
    ]
}
    at /var/task/index.js:15124:23
    at processTicksAndRejections (internal/process/task_queues.js:97:5) {
  status: 403,
  headers: {
    'access-control-allow-origin': '*',
    'access-control-expose-headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset',
    connection: 'close',
    'content-encoding': 'gzip',
    'content-security-policy': "default-src 'none'",
    'content-type': 'application/json; charset=utf-8',
    date: 'Tue, 17 Nov 2020 17:51:47 GMT',
    'referrer-policy': 'origin-when-cross-origin, strict-origin-when-cross-origin',
    server: 'GitHub.com',
    status: '403 Forbidden',
    'strict-transport-security': 'max-age=31536000; includeSubdomains; preload',
    'transfer-encoding': 'chunked',
    vary: 'Accept-Encoding, Accept, X-Requested-With',
    'x-content-type-options': 'nosniff',
    'x-frame-options': 'deny',
    'x-github-media-type': 'github.v3; format=json',
    'x-github-request-id': '93DE:E7C5:957F272:AC944E7:5FB40DB3',
    'x-ratelimit-limit': '5600',
    'x-ratelimit-remaining': '5598',
    'x-ratelimit-reset': '1605639047',
    'x-ratelimit-used': '2',
    'x-xss-protection': '1; mode=block'
  },
  request: {
    method: 'GET',
    url: 'https://api.github.com/repos/RaketaApp/packer-base-ami/actions/runs?status=queued',
    headers: {
      accept: 'application/vnd.github.v3+json',
      'user-agent': 'octokit-rest.js/18.0.6 octokit-core.js/3.1.1 Node.js/12.18.4 (linux; x64)',
      authorization: 'token [REDACTED]'
    },
    request: { hook: [Function: bound bound register] }
  },
  documentation_url: 'https://docs.github.com/rest/reference/actions#list-workflow-runs-for-a-repository'
}

Github App permission issue

Summary

I have followed the README, created the GitHub App, and set up the Terraform modules; however, I can't get the runners created. Please see the error below. I guess it's something to do with app permissions, but I have tried them all and have been at this for a while with no luck. Not sure what I'm missing!

Steps to reproduce

Run the example module and try to create a runner

What is the current bug behavior?

Does not create a runner

What is the expected correct behavior?

Should create a runner

Relevant logs and/or screenshots

2020-09-08T12:53:03.620Z	61a6e96f-ddd4-5bd3-ac59-bebe5d0eb4b7	ERROR	RequestError [HttpError]: Not Found
    at /var/task/index.js:14863:23
    at processTicksAndRejections (internal/process/task_queues.js:97:5) {
  status: 404,
  headers: {
    'access-control-allow-origin': '*',
    'access-control-expose-headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset',
    connection: 'close',
    'content-encoding': 'gzip',
    'content-security-policy': "default-src 'none'",
    'content-type': 'application/json; charset=utf-8',
    date: 'Tue, 08 Sep 2020 12:53:03 GMT',
    'referrer-policy': 'origin-when-cross-origin, strict-origin-when-cross-origin',
    server: 'GitHub.com',
    status: '404 Not Found',
    'strict-transport-security': 'max-age=31536000; includeSubdomains; preload',
    'transfer-encoding': 'chunked',
    vary: 'Accept-Encoding, Accept, X-Requested-With',
    'x-content-type-options': 'nosniff',
    'x-frame-options': 'deny',
    'x-github-media-type': 'github.v3; format=json',
    'x-github-request-id': 'DFE0:5DBC:88B16E0:A4C558A:5F577EAF',
    'x-ratelimit-limit': '5000',
    'x-ratelimit-remaining': '4986',
    'x-ratelimit-reset': '1599572823',
    'x-ratelimit-used': '14',
    'x-xss-protection': '1; mode=block'
  },
  request: {
    method: 'POST',
    url: 'https://api.github.com/orgs/theabrar/actions/runners/registration-token',
    headers: {
      accept: 'application/vnd.github.v3+json',
      'user-agent': 'octokit-rest.js/18.0.3 octokit-core.js/3.1.1 Node.js/12.18.2 (linux; x64)',
      authorization: 'token [REDACTED]',
      'content-length': 0
    },
    request: { hook: [Function: bound bound register] }
  },
  documentation_url: 'https://docs.github.com/rest/reference/actions#create-a-registration-token-for-an-organization'
}

Possible fixes

I think this issue can be resolved by automating the GitHub app creation, possibly using probot.

Generate terraform docs

Avoid manually updating the Terraform docs (inputs/outputs) in the README. Options:

  • pre commit hook
  • via ci

Distribution Lambda occasionally fails after creation

Summary

The distribution syncer lambda sometimes does not work after a terraform apply; the cause is unclear. Removing the lambda and running an apply again solves the issue.

Steps to reproduce

Not reproducible; happens occasionally.

Jobs getting dropped

Summary

I'm not sure if this is one problem with the runners, or two problems with one of them being GitHub's. Or maybe it's all GitHub's fault and it's not communicating properly with the webhooks. IDK.

In the past week, I have regularly been seeing jobs getting cancelled or just not happening. The first thing I'm seeing: there seems to be some sort of timing issue between the shutdown order being given to the spot request and the request picking up another job. My jobs are getting "The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled" when no one has done anything.

Screen Shot 2020-07-23 at 12 46 39 PM

The other thing I'm seeing is jobs that just never run and yet the workflow fails.

Screen Shot 2020-07-23 at 1 16 37 PM

Steps to reproduce

I don't know. It happens all the time with my pipeline, with all of my workflows and jobs. All of my jobs run bash scripts which in turn run Docker containers for everything. I do have a few differences from the default settings: I have the instance type set to m5.4xlarge, and I have a post_install script that provides ECR access:

mkdir /home/ec2-user/.docker
touch /home/ec2-user/.docker/config.json
echo "{" >> /home/ec2-user/.docker/config.json
echo '	"credsStore": "ecr-login"' >> /home/ec2-user/.docker/config.json
echo "}" >> /home/ec2-user/.docker/config.json
amazon-linux-extras enable docker
yum install -y amazon-ecr-credential-helper

I just thought to try updating the lambda zips, since I'm based straight on the github repo and haven't done that since the last time I ran a terraform init. So I'll give that a shot.

What happens if an external user installs another organizations github app?

In the readme the following is stated:

Go to GitHub and create a new app. Beware you can create apps for your organization or for a user. For now we handle only the organization-level app.

But the option to create an organization-level app also forces the app to be public, so it is installable by anyone.

So if I create an organization level app for running this module, what's stopping someone else from discovering my github app installation url and using my self-hosted runners?

runners-scale-up fails with 'AuthFailure.ServiceLinkedRoleCreationNotPermitted'

Summary

When following the readme, using the example configuration, and adjusting the GitHub App permissions as per #100 (comment), the scale-up lambda fails to create the EC2 instance due to ServiceLinkedRoleCreationNotPermitted.

Steps to reproduce

  • Do step 1 of Github app setup
  • Checkout terraform-aws-github-runner repo, cd into example folder
  • Download lambda zips
  • Create terraform.tfvars file with Github App credentials
  • run terraform init && terraform apply
  • Trigger a build on Github

What is the current bug behavior?

The GitHub App sends the webhook, the webhook lambda forwards it, and the scale-up lambda throws an error:

...
ERROR	AuthFailure.ServiceLinkedRoleCreationNotPermitted: The provided credentials do not have permission to create the service-linked role for EC2 Spot Instances.
    at Request.extractError (/var/task/index.js:41424:35)
    at Request.callListeners (/var/task/index.js:47771:20)
    at Request.emit (/var/task/index.js:47743:10)
    at Request.emit (/var/task/index.js:18467:14)
    at Request.transition (/var/task/index.js:17801:10)
    at AcceptorStateMachine.runTo (/var/task/index.js:26145:12)
    at /var/task/index.js:26157:10
    at Request.<anonymous> (/var/task/index.js:17817:9)
    at Request.<anonymous> (/var/task/index.js:18469:12)
    at Request.callListeners (/var/task/index.js:47781:18) {
  code: 'AuthFailure.ServiceLinkedRoleCreationNotPermitted',
  time: 2020-07-30T15:03:24.631Z,
  requestId: 'c7bab39e-b75c-4e7d-bc29-6622b3d4ddb1',
  statusCode: 403,
  retryable: false,
  retryDelay: 68.19342592727871
}

What is the expected correct behavior?

Scale up lambda should create EC2 instance

Possible fixes

I'm sure this is an IAM permissions issue. I am rather new to both AWS and Terraform and am not sure in which of them this needs to be solved, or how to go about it.
It would be great to get some pointers.

PEM routines:get_name:no start line

Summary

Error in scale lambda invocation having to do with the private key decoding.

Steps to reproduce

The configuration:

module "runners" {
  source  = "philips-labs/github-runner/aws"
  version = "~> 0.2"

  ...snip...

  github_app = {
    key_base64     = var.github_app_key_base64
    id             = var.github_app_id
    client_id      = var.github_app_client_id
    client_secret  = var.github_app_client_secret
    webhook_secret = random_password.random.result
  }

  webhook_lambda_zip                = "lambdas-download/webhook.zip"
  runner_binaries_syncer_lambda_zip = "lambdas-download/runner-binaries-syncer.zip"
  runners_lambda_zip                = "lambdas-download/runners.zip"
  enable_organization_runners       = true
  runner_extra_labels               = "default"
}

The github_app_key_base64, which I suspect is the problem, is set as follows (PKCS#1 RSAPrivateKey):

github_app_key_base64    = <<-EOT
-----BEGIN RSA PRIVATE KEY-----
<base64 encoded>
-----END RSA PRIVATE KEY-----
EOT

What is the current bug behavior?

scale lambda fails.

What is the expected correct behavior?

scale lambda succeeds.

Relevant logs and/or screenshots

ERROR	Error: error:0909006C:PEM routines:get_name:no start line
    at Sign.sign (internal/crypto/sig.js:105:29)
    at Object.sign (/var/task/index.js:12802:45)
    at Object.jwsSign [as sign] (/var/task/index.js:9637:24)
    at Object.module.exports.6343.module.exports [as sign] (/var/task/index.js:36570:16)
    at getToken (/var/task/index.js:1861:23)
    at Object.githubAppJwt (/var/task/index.js:1882:23)
    at getAppAuthentication (/var/task/index.js:1509:57)
    at getInstallationAuthentication (/var/task/index.js:1630:35)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:97:5) {
  library: 'PEM routines',
  function: 'get_name',
  reason: 'no start line',
  code: 'ERR_OSSL_PEM_NO_START_LINE'
}
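
For reference, the variable appears to expect the base64 encoding of the entire PEM file (header and footer included), not the PEM text itself; Terraform's filebase64 function, or base64 on the command line, produces such a value. A minimal validation sketch, with an illustrative environment variable name:

// Decode what the lambda will see and check it still looks like a PEM.
// GITHUB_APP_KEY_BASE64 is an illustrative name for wherever the value lands.
const decoded = Buffer.from(process.env.GITHUB_APP_KEY_BASE64 ?? '', 'base64').toString('utf8');

if (!decoded.includes('-----BEGIN')) {
  // Passing the raw PEM text (or only its inner base64 body) decodes to garbage,
  // and signing the app JWT then fails with
  // "error:0909006C:PEM routines:get_name:no start line".
  throw new Error('key_base64 does not decode to a PEM; base64-encode the whole .pem file');
}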

Kudos for the really nice work on this and for sharing with the community! :)

Runners not executing jobs, just idle and shut down

Summary

GitHub Actions checks are not executed; instances boot up and then shut down without executing the job.

Steps to reproduce

I just did the normal v2 setup.

What is the current bug behavior?

The workers boot up but stay idle until they are shut down again.

What is the expected correct behavior?

The workers pick up the jobs and execute them in a reasonable time

Not sure if this belongs here, but do you have any idea what could be the reason? The workers are definitely online, and it just started randomly. The only thing I did was delete some workers and unregister some that no longer existed but were somehow still registered, or had been running non-stop for multiple days.

I have an offline macOS worker so that tests don't fail; my CI runs on Linux. Does this pose a problem?

Some additional info needs to be added to readme

I managed to fix the issues I encountered regarding Resource not accessible by integration and Not Found:

  • In the app permissions, you also need to set Repository permissions > Actions > Read-only (regardless of whether you're an organization or not)

You also need to add the following to the Terraform file (via #104):

resource "aws_iam_service_linked_role" "spot" {
  aws_service_name = "spot.amazonaws.com"
}

(this is for the 0.5.0 version)

Workflows immediately fail and jobs are never created.

Summary

I just followed all the instructions, successfully deployed this service to AWS, and integrated it with GitHub. When I create a new job run, GitHub contacts the webhook and the webhook successfully sends the message to the scale-up lambda, but as there were no runners at the time the job was enqueued, the job immediately fails. As a result, the scale-up lambda finds 0 queued jobs when querying the repository workflows and doesn't create any runner.

Steps to reproduce

I simply followed the instructions. Tried with 0.1.0 and 0.2.0

Possible fixes

Ideally we would be able to specify a minimum number of running runners, so that an immediately available runner is always guaranteed.

Support for Windows runners

I've had a poke at the module and I am presuming this currently only supports Linux-based runners? Any plans to add Windows runner support?

Queued workflow not picked up by AWS runner

Hi,

Whenever I trigger a workflow run while there are no running EC2 runner instances, the following happens:

  1. Lambda webhook gets the check_run event and queues it to SQS
  2. SQS triggers the scale-up Lambda
  3. Lambda scale-up starts an EC2 instance
  4. The EC2 instance properly registers as a self-hosted runner (visible in the GH repository "Actions" settings page)
  5. The workflow run isn't picked up by the runner and stays in the queue forever, until I cancel it manually

If I trigger another workflow run while the EC2 runner is started, it gets properly picked up and executed.

Any idea what the problem is here?

Thanks!

Add support for ARM64 runners using AWS Graviton/Graviton2 instance-types.

Problem to solve

The current solution is unable to launch instances compatible with GitHub's ARM64 self-hosted runner.

Intended users

Developers/Teams building for ARM64 (e.g. Raspberry Pi)

Further details

Benefit: extends support to GitHub Actions pipelines that use ARM64

Proposal

A PR is in progress with runner_architecture auto-detected from the instance type, support for downloading the arm64 actions-runner from GitHub, and a patch to account for the lack of pre-installed ICU support in .NET Core, which the arm64 actions-runner requires. A sketch of the auto-detection idea follows.
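
A hedged sketch of that auto-detection (the function name and family list are illustrative, not necessarily what the PR uses):

// Map an EC2 instance type to the runner architecture to download.
// The ARM64 family list is illustrative and not exhaustive.
function runnerArchitecture(instanceType: string): 'arm64' | 'x64' {
  const arm64Families = ['a1', 't4g', 'm6g', 'c6g', 'r6g']; // Graviton/Graviton2
  const family = instanceType.split('.')[0];
  // startsWith also catches variants such as m6gd or c6gn.
  return arm64Families.some((f) => family.startsWith(f)) ? 'arm64' : 'x64';
}

// e.g. runnerArchitecture('m6g.large') === 'arm64', runnerArchitecture('m5.large') === 'x64'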

Documentation

Will document how to enable arm64 support as well as some gotchas I ran into (some Graviton instances aren't available in all AZs)

Availability & Testing

Not much? Might need a test case for the change to the syncer lambda.

What does success look like, and how can we measure that?

Setting a Graviton/Graviton2 instance type in example/default/main.tf and (optionally) specifying subnet AZs in example/default/vpc.tf results in a successful stack that can launch functional ARM64 self-hosted runners.

Other links/references

n/a

dev-usw2-scale-up failure: "Failed handling SQS event" "PEM routines:get_name:no start line at Sign.sign"

Summary

dev-usw2-scale-up Execution result: failed

Steps to reproduce

Trigger via a commit to the configured application, with the requisite GitHub App set up per the docs/README in this repo and https://040code.github.io/2020/05/25/scaling-selfhosted-action-runners

What is the current bug behavior?

ERROR Error: error:0909006C:PEM routines:get_name:no start line at Sign.sign ERROR Invoke Error
(see full error trace/out below)

What is the expected correct behavior?

The commit to the configured repo should cause the lambda function to execute and scale up or deploy an AWS EC2 spot instance.

Relevant logs and/or screenshots

The most recent failure/error upon a commit to the configured GitHub repo (which the GitHub App is configured to watch) is in CloudWatch Logs, log group /aws/lambda/dev-usw2-scale-up, and is available in a GitHub gist here:

gist-file-aws-lambda-dev-usw2-scale-up-error

Possible fixes

At first glance, it appears this might be related to a cert/key error?

Who can address the issue

Requesting validation and suggestions on resolution

Other links/references

Thank you

Question: Runner based on label

Hello Again!

I've got a question: how do you support spinning up different images depending on the label? My dream solution is that runners are spun up as required and available at the organisation level. The runner that is spun up is based on the label provided; if it's a node-12 label, for instance, then a Node 12 instance is spun up from a Node 12 launch template. How does the setup support multiple labels?

Cheers

Ephemeral Runners?

Problem to solve

As a developer interacting with a public repository, I want to be able to have ephemeral instances so that I can safely use self-hosted runners in a public repo.

Intended users

Any public repository user where github actions are used, and the default github hosted runners do not provide sufficient resources.

Proposal

  • Have a warm pool of idling runners waiting for a job from GitHub (polling an SQS queue or something)
  • When an idling runner gets a job, execute that job, and delete the runner when finished (the lifetime of the runner is the same as that of the GHA job it executes)

What does success look like, and how can we measure that?

  • Jobs are quickly executed since runners are pre-provisioned
  • Security concerns over persistence of data across jobs are addressed, since the lifetime of a runner is tied to a single GitHub job.

EC2 instance type

How can you pass the instance type you want to use? I saw that the default instance type is m5.large, but there is no explanation of how to change it.

scale-down lambda fails with: SyntaxError: Unexpected token u in JSON at position 0

Summary

Hello, thanks for the great project. Everything is working fine except the scale-down lambda, which fails with SyntaxError: Unexpected token u in JSON at position 0.

Steps to reproduce

Here is my lambda download code:

module "lambdas" {
  source  = "philips-labs/github-runner/aws//modules/download-lambda"
  version = "0.4.0"

  lambdas = [
    {
      name = "webhook"
      tag  = "v0.4.0"
    },
    {
      name = "runners"
      tag  = "v0.4.0"
    },
    {
      name = "runner-binaries-syncer"
      tag  = "v0.4.0"
    }
  ]
}

As for the idle config, I'm using the defaults.

What is the current bug behavior?

Here are the logs from CloudWatch:


2020-08-19T15:03:24.035Z	22fd0c68-5fde-46e1-963e-422a6ae3aa00	ERROR	Unhandled Promise Rejection 
{
    "errorType": "Runtime.UnhandledPromiseRejection",
    "errorMessage": "SyntaxError: Unexpected token u in JSON at position 0",
    "reason": {
        "errorType": "SyntaxError",
        "errorMessage": "Unexpected token u in JSON at position 0",
        "stack": [
            "SyntaxError: Unexpected token u in JSON at position 0",
            "    at JSON.parse (<anonymous>)",
            "    at Object.<anonymous> (/var/task/index.js:8456:39)",
            "    at Generator.next (<anonymous>)",
            "    at /var/task/index.js:8385:71",
            "    at new Promise (<anonymous>)",
            "    at module.exports.471.__awaiter (/var/task/index.js:8381:12)",
            "    at Object.scaleDown (/var/task/index.js:8455:12)",
            "    at /var/task/index.js:16564:22",
            "    at Generator.next (<anonymous>)",
            "    at /var/task/index.js:16543:71"
        ]
    },
    "promise": {},
    "stack": [
        "Runtime.UnhandledPromiseRejection: SyntaxError: Unexpected token u in JSON at position 0",
        "    at process.<anonymous> (/var/runtime/index.js:35:15)",
        "    at process.emit (events.js:315:20)",
        "    at process.EventEmitter.emit (domain.js:482:12)",
        "    at processPromiseRejections (internal/process/promises.js:209:33)",
        "    at processTicksAndRejections (internal/process/task_queues.js:98:32)"
    ]
}
[ERROR] [1597849404074] LAMBDA_RUNTIME Failed to post handler success response. Http response code: 403.

What is the expected correct behavior?

Scale down lambda should work as expected and terminate idle instances after timeout.
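
For what it's worth, "Unexpected token u in JSON at position 0" is the signature of JSON.parse(undefined): the argument is stringified to "undefined", whose first character is the offending u. A minimal sketch of the suspected failure mode and a defensive fix; the environment variable name is illustrative, not necessarily what the lambda uses:

// Suspected failure mode: the scale-down lambda parses a config value that is
// unset in this deployment (SCALE_DOWN_CONFIG is an illustrative name).
//
//   JSON.parse(process.env.SCALE_DOWN_CONFIG as string);
//   // -> SyntaxError: Unexpected token u in JSON at position 0 when unset,
//   //    because JSON.parse receives undefined and parses the string "undefined".
//
// Defensive variant: fall back to an empty config instead of throwing.
const idleConfig = JSON.parse(process.env.SCALE_DOWN_CONFIG ?? '[]');
console.log('idle config:', idleConfig);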

Thanks for any help!

Request Limit Exceeded?

We're seeing runners fail to delete. The underlying AWS instances get purged with "orphaned runner deleted" log messages, but for some reason we are getting rate limited somewhere (I think in AWS) and then the GitHub runners never get removed.

If we wait long enough, we have seen as many as 800 offline runners...

Here are some relevant lambda logs from the scale down lambda:

{
    "errorType": "Runtime.UnhandledPromiseRejection",
    "errorMessage": "RequestLimitExceeded: Request limit exceeded.",
    "reason": {
        "errorType": "RequestLimitExceeded",
        "errorMessage": "Request limit exceeded.",
        "code": "RequestLimitExceeded",
        "message": "Request limit exceeded.",
        "time": "2020-10-19T12:00:12.277Z",
        "requestId": "eb9ecb84-5f5a-4317-974b-10371c2df8f7",
        "statusCode": 503,
        "retryable": true,
        "stack": [
            "RequestLimitExceeded: Request limit exceeded.",
            "    at Request.extractError (/var/task/index.js:40075:35)",
            "    at Request.callListeners (/var/task/index.js:46386:20)",
            "    at Request.emit (/var/task/index.js:46358:10)",
            "    at Request.emit (/var/task/index.js:17843:14)",
            "    at Request.transition (/var/task/index.js:17177:10)",
            "    at AcceptorStateMachine.runTo (/var/task/index.js:25384:12)",
            "    at /var/task/index.js:25396:10",
            "    at Request.<anonymous> (/var/task/index.js:17193:9)",
            "    at Request.<anonymous> (/var/task/index.js:17845:12)",
            "    at Request.callListeners (/var/task/index.js:46396:18)"
        ]
    },
    "promise": {},
    "stack": [
        "Runtime.UnhandledPromiseRejection: RequestLimitExceeded: Request limit exceeded.",
        "    at process.<anonymous> (/var/runtime/index.js:35:15)",
        "    at process.emit (events.js:315:20)",
        "    at process.EventEmitter.emit (domain.js:483:12)",
        "    at processPromiseRejections (internal/process/promises.js:209:33)",
        "    at processTicksAndRejections (internal/process/task_queues.js:98:32)"
    ]
}


[ERROR] [1603108812400] LAMBDA_RUNTIME Failed to post handler success response. Http response code: 403.

Any ideas what could be going on here? Thanks in advance for your help!
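
One hedged mitigation, assuming the lambdas use the AWS SDK for JavaScript v2 (the stack traces suggest so): raise the EC2 client's built-in retry budget so throttled calls back off and retry instead of rejecting the promise. A minimal sketch:

import { EC2 } from 'aws-sdk';

// RequestLimitExceeded is marked retryable: true in the log above, so giving the
// SDK more retries with a longer backoff base lets it absorb the throttling.
const ec2 = new EC2({
  maxRetries: 10,                    // SDK default is 3 for most services
  retryDelayOptions: { base: 300 },  // base delay in ms for exponential backoff
});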

Runner instances are incorrectly detected as orphaned and terminated

Summary

When there are over 30 runners in a repo/organization, the scale-down lambda thinks new runners are orphans and will terminate them, even while they are running a build.

Steps to reproduce

  1. Add 30 runners to your repository or organisation. These can be offline.
  2. Trigger a new workflow run to generate a new instance via this project
  3. Wait the configured time (minimum_running_time_in_minutes option or 5 minutes by default)
  4. Cloudwatch logs on the scale down function shows that the newly created instance is an orphan

What is the current bug behavior?

Runners get terminated while they should not be deleted.

What is the expected correct behavior?

These runners should not be terminated in this scenario.

Relevant logs and/or screenshots

2020-08-26T09:30:06.050Z 7d6440bb-27ba-4d17-ba26-3edc285b88c1 INFO Runner 'i-0d90f0e61ef64b847' is orphan, and will be removed.
2020-08-26T09:30:06.272Z 7d6440bb-27ba-4d17-ba26-3edc285b88c1 DEBUG Runner terminated.i-0d90f0e61ef64b847

Possible fixes

In modules/runners/lambdas/runners/src/scale-runners/scale-down.ts in the scaleDown function the following code is used to retrieve registered runners.

 const registered = enableOrgLevel
      ? await githubAppClient.actions.listSelfHostedRunnersForOrg({
          org: repo.repoOwner,
        })
      : await githubAppClient.actions.listSelfHostedRunnersForRepo({
          owner: repo.repoOwner,
          repo: repo.repoName,
        });

This API is paginated and by default returns only the first 30 runners. The page size can be raised to 100 runners, but to be safe we should fetch all pages, as sketched below.
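
A sketch of that fix, letting Octokit walk every page (octokit's paginate method resolves the nested runners array for these endpoints):

const registered = enableOrgLevel
  ? await githubAppClient.paginate(githubAppClient.actions.listSelfHostedRunnersForOrg, {
      org: repo.repoOwner,
      per_page: 100,
    })
  : await githubAppClient.paginate(githubAppClient.actions.listSelfHostedRunnersForRepo, {
      owner: repo.repoOwner,
      repo: repo.repoName,
      per_page: 100,
    });
// `registered` is now a flat array of all registered runners, not just the first page.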

Question: Required Envs for Various Lambda Functions

Hello!

This looks excellent whilst we wait for GitHub to provide a supported solution. However, I work in a CloudFormation shop, so I will be converting the Terraform. What isn't super clear to me is which environment variables I need to provide to the various Lambda functions, as I will be deploying via CloudFormation and the Serverless Framework. Could you clarify what is required on the individual lambda functions for them to work?
