
philips-labs / terraform-aws-github-runner


Terraform module for scalable GitHub action runners on AWS

Home Page: https://philips-labs.github.io/terraform-aws-github-runner/

License: MIT License

HCL 49.40% Shell 3.59% TypeScript 44.22% Dockerfile 0.09% PowerShell 2.71%
github github-actions terraform actions-runner serverless lambda aws scalable cicd self-hosted

terraform-aws-github-runner's People

Contributors

aadrijnberg, alexjurkiewicz, bdruth, bendavies, dependabot[bot], dylanmtaylor, forest-releaser[bot], gertjanmaas, github-actions[bot], guptanavdeep1983, henrynguyen5, jeroenknoops, jonico, jpalomaki, julada, kmaehashi, kring, kuvaldini, marcofranssen, marekaf, mcaulifn, npalm, patrickmennen, scottguymer, sdarwin, semantic-release-bot, taharah, toast-gear, ulich, wzyboy



terraform-aws-github-runner's Issues

scale-up: Resource not accessible by integration

The scale-up lambda function fails with the error RequestError [HttpError]: Resource not accessible by integration.

I also struggled to find the expected format for the github_app key_base64 variable. I kept getting errors like error:0909006C:PEM routines:get_name:no start line; a multi-line string (starting LS0t) that was the base64 of the whole PEM file eventually worked.

I tried the suggestion in #203 of granting Read access to the installed app in the "Actions" repository permissions without success.

The error message shows that the URL being accessed is https://api.github.com/repos/<my organisation>/<my repo>/actions/runs?status=queued with the authorization header authorization: 'token [REDACTED]', and the request is rejected by GitHub.com with 403 Forbidden.

Please advise.
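
One way to narrow this down is to authenticate as the GitHub App and dump the permissions actually granted to each installation, to confirm that actions: read is really in effect. A minimal sketch, assuming @octokit/rest and @octokit/auth-app are available; the environment variable names are illustrative:

import { Octokit } from '@octokit/rest';
import { createAppAuth } from '@octokit/auth-app';

(async () => {
  // Authenticate as the app itself (JWT), so app-level endpoints are accessible.
  const appClient = new Octokit({
    authStrategy: createAppAuth,
    auth: {
      appId: Number(process.env.GITHUB_APP_ID), // illustrative env var names
      privateKey: Buffer.from(process.env.GITHUB_APP_KEY_BASE64 ?? '', 'base64').toString('utf8'),
    },
  });

  // Each installation object carries the permissions the installer accepted;
  // 'actions: read' must appear here for .../actions/runs to be accessible.
  const installations = await appClient.apps.listInstallations();
  for (const installation of installations.data) {
    console.log(installation.id, installation.permissions);
  }
})();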

Resource not accessible by integration? What perms am I missing?

I'm sure I am just missing something documented somewhere; my integration has been made under my user account whilst I test this. Does the integration need any more permissions than:

  • read/write on admin
  • read on checks
  • read/write on actions
    at /var/task/index.js:15325:23
    at processTicksAndRejections (internal/process/task_queues.js:97:5) {
  status: 403,
  headers: {
    'access-control-allow-origin': '*',
    'access-control-expose-headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset',
    connection: 'close',
    'content-encoding': 'gzip',
    'content-security-policy': "default-src 'none'",
    'content-type': 'application/json; charset=utf-8',
    date: 'Mon, 13 Jul 2020 14:29:12 GMT',
    'referrer-policy': 'origin-when-cross-origin, strict-origin-when-cross-origin',
    server: 'GitHub.com',
    status: '403 Forbidden',
    'strict-transport-security': 'max-age=31536000; includeSubdomains; preload',
    'transfer-encoding': 'chunked',
    vary: 'Accept-Encoding, Accept, X-Requested-With',
    'x-content-type-options': 'nosniff',
    'x-frame-options': 'deny',
    'x-github-media-type': 'github.v3; format=json',
    'x-github-request-id': 'B73E:A1E4:B1A84D:D6FF77:5F0C6FB8',
    'x-ratelimit-limit': '5000',
    'x-ratelimit-remaining': '4884',
    'x-ratelimit-reset': '1594651917',
    'x-xss-protection': '1; mode=block'
  },
  request: {
    method: 'GET',
    url: 'https://api.github.com/repos/callum-tait-pbx/test_repository/actions/runs?status=queued',
    headers: {
      accept: 'application/vnd.github.v3+json',
      'user-agent': 'octokit-rest.js/17.11.2 octokit-core.js/2.5.4 Node.js/12.16.3 (Linux 4.14; x64)',
      authorization: 'token [REDACTED]'
    },
    request: { hook: [Function: bound bound register] }
  },
  documentation_url: 'https://developer.github.com/v3/actions/workflow_runs/#list-repository-workflow-runs'

Note that I haven't attached a dummy runner to my repository yet; I assumed I would run into a problem pointing to that at some point and would deal with it then.

If there are more than 100 runners registered, scale-down fails

scale-down.ts:113 and scale-down.ts:116 use actions.listSelfHostedRunnersForOrg and actions.listSelfHostedRunnersForRepo directly.
The returned object contains an array data.runners, containing up to 100 registered runners according to the API documentation.

If there are more than 100 runners registered, the additional runners are not considered for scaling down.

Question: What happens if a spot instance is terminated by AWS?

Hi, first of all, congratulations for this great project.

We have deployed github-runner successfully and it's running very well so far.

One question, please. As you know, spot instances can be terminated by AWS. If a GitHub runner EC2 instance is suddenly stopped by AWS in the middle of a pipeline, what happens to the GitHub pipeline? Does it fail? Is there any retry/re-schedule mechanism to re-execute the build?

Thank you very much in advance.

Improve runner deletion by using `busy` flag

When building this solution, the GitHub API couldn't tell whether a runner was busy or not, so we resorted to trying to delete each runner via the API. If that returned a 500 Internal Server Error, we knew it was busy.

Just played with the API again and saw the busy flag was added for runners. See https://docs.github.com/en/rest/reference/actions#list-self-hosted-runners-for-a-repository

Instead of trying to delete a runner, we should use this flag to reduce the number of API calls to GitHub.
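
A minimal sketch of the proposed flow, assuming an @octokit/rest client whose types include the new busy field:

import { Octokit } from '@octokit/rest';

// Delete only runners the API reports as idle, instead of probing each one
// with a speculative delete call.
async function deleteIdleRunners(octokit: Octokit, owner: string, repo: string): Promise<void> {
  const { data } = await octokit.actions.listSelfHostedRunnersForRepo({
    owner,
    repo,
    per_page: 100,
  });

  for (const runner of data.runners) {
    if (!runner.busy) {
      await octokit.actions.deleteSelfHostedRunnerFromRepo({ owner, repo, runner_id: runner.id });
    }
  }
}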

Security Model?

Problem to solve

As a project owner I want to limit production runner access to protected branches

Intended users

Repo owners setting up deployment rules

Further details

In GitLab you can tie certain runners to protected branches. This enables us to use runners with production credentials and access levels, separate from the pool of runners available for every other branch.

It provides a security model in which accidental or intentional changes to production are limited to merged code.

Proposal

No proposal, this is a question.

Documentation

Availability & Testing

What does success look like, and how can we measure that?

Other links/references

I asked a similar question in the GitHub Actions community forum:
https://github.community/t5/GitHub-Actions/Limit-self-managed-runners-to-protected-branches/m-p/55943#M9692

Scale up lambda failed

Hi. I get an error in the scale-up lambda after setting up your module.
Cloudwatch logs below:

ERROR	Invoke Error 	
{
    "errorType": "Error",
    "errorMessage": "Failed handling SQS event",
    "stack": [
        "Error: Failed handling SQS event",
        "    at _homogeneousError (/var/runtime/CallbackContext.js:12:12)",
        "    at postError (/var/runtime/CallbackContext.js:29:54)",
        "    at callback (/var/runtime/CallbackContext.js:41:7)",
        "    at /var/runtime/CallbackContext.js:104:16",
        "    at /var/task/index.js:16834:16",
        "    at Generator.throw (<anonymous>)",
        "    at rejected (/var/task/index.js:16816:65)",
        "    at processTicksAndRejections (internal/process/task_queues.js:97:5)"
    ]
}
    at /var/task/index.js:15124:23
    at processTicksAndRejections (internal/process/task_queues.js:97:5) {
  status: 403,
  headers: {
    'access-control-allow-origin': '*',
    'access-control-expose-headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset',
    connection: 'close',
    'content-encoding': 'gzip',
    'content-security-policy': "default-src 'none'",
    'content-type': 'application/json; charset=utf-8',
    date: 'Tue, 17 Nov 2020 17:51:47 GMT',
    'referrer-policy': 'origin-when-cross-origin, strict-origin-when-cross-origin',
    server: 'GitHub.com',
    status: '403 Forbidden',
    'strict-transport-security': 'max-age=31536000; includeSubdomains; preload',
    'transfer-encoding': 'chunked',
    vary: 'Accept-Encoding, Accept, X-Requested-With',
    'x-content-type-options': 'nosniff',
    'x-frame-options': 'deny',
    'x-github-media-type': 'github.v3; format=json',
    'x-github-request-id': '93DE:E7C5:957F272:AC944E7:5FB40DB3',
    'x-ratelimit-limit': '5600',
    'x-ratelimit-remaining': '5598',
    'x-ratelimit-reset': '1605639047',
    'x-ratelimit-used': '2',
    'x-xss-protection': '1; mode=block'
  },
  request: {
    method: 'GET',
    url: 'https://api.github.com/repos/RaketaApp/packer-base-ami/actions/runs?status=queued',
    headers: {
      accept: 'application/vnd.github.v3+json',
      'user-agent': 'octokit-rest.js/18.0.6 octokit-core.js/3.1.1 Node.js/12.18.4 (linux; x64)',
      authorization: 'token [REDACTED]'
    },
    request: { hook: [Function: bound bound register] }
  },
  documentation_url: 'https://docs.github.com/rest/reference/actions#list-workflow-runs-for-a-repository'
}

Github App permission issue

Summary

I have followed the README, created the GitHub App, and set up the Terraform modules; however, I can't get the runners created. Please see the error below. I guess it's something to do with app permissions, but I have tried them all and have been at this for a while with no luck. Not sure what I'm missing!

Steps to reproduce

Run the example module and try to create a runner

What is the current bug behavior?

Does not create a runner

What is the expected correct behavior?

Should create a runner

Relevant logs and/or screenshots

2020-09-08T12:53:03.620Z	61a6e96f-ddd4-5bd3-ac59-bebe5d0eb4b7	ERROR	RequestError [HttpError]: Not Found
    at /var/task/index.js:14863:23
    at processTicksAndRejections (internal/process/task_queues.js:97:5) {
  status: 404,
  headers: {
    'access-control-allow-origin': '*',
    'access-control-expose-headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset',
    connection: 'close',
    'content-encoding': 'gzip',
    'content-security-policy': "default-src 'none'",
    'content-type': 'application/json; charset=utf-8',
    date: 'Tue, 08 Sep 2020 12:53:03 GMT',
    'referrer-policy': 'origin-when-cross-origin, strict-origin-when-cross-origin',
    server: 'GitHub.com',
    status: '404 Not Found',
    'strict-transport-security': 'max-age=31536000; includeSubdomains; preload',
    'transfer-encoding': 'chunked',
    vary: 'Accept-Encoding, Accept, X-Requested-With',
    'x-content-type-options': 'nosniff',
    'x-frame-options': 'deny',
    'x-github-media-type': 'github.v3; format=json',
    'x-github-request-id': 'DFE0:5DBC:88B16E0:A4C558A:5F577EAF',
    'x-ratelimit-limit': '5000',
    'x-ratelimit-remaining': '4986',
    'x-ratelimit-reset': '1599572823',
    'x-ratelimit-used': '14',
    'x-xss-protection': '1; mode=block'
  },
  request: {
    method: 'POST',
    url: 'https://api.github.com/orgs/theabrar/actions/runners/registration-token',
    headers: {
      accept: 'application/vnd.github.v3+json',
      'user-agent': 'octokit-rest.js/18.0.3 octokit-core.js/3.1.1 Node.js/12.18.2 (linux; x64)',
      authorization: 'token [REDACTED]',
      'content-length': 0
    },
    request: { hook: [Function: bound bound register] }
  },
  documentation_url: 'https://docs.github.com/rest/reference/actions#create-a-registration-token-for-an-organization'
}

Possible fixes

I think this issue can be resolved by automating the GitHub app creation, possibly using probot.

Generate terraform docs

Avoid manually updating the Terraform docs (inputs/outputs) in the README. Options:

  • pre commit hook
  • via ci

Distribution Lambda occasionally fails after creation

Summary

The distribution syncer lambda sometimes does not work after a terraform apply; the cause is unclear. Removing the lambda and running an apply again solves the issue.

Steps to reproduce

Not reproducible; happens occasionally.

Jobs getting dropped

Summary

I'm not sure if this is one problem with the runners, or two problems with one of them being GitHub's. Or maybe it's all GitHub's fault and it's not communicating properly with the webhooks. IDK.

In the past week, I have regularly been seeing jobs getting cancelled or just not happening. The first thing I'm seeing: there seems to be some sort of timing issue between the shutdown order being given to the spot request and the request picking up another job. My jobs are getting "The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled" when no one has done anything.

Screen Shot 2020-07-23 at 12 46 39 PM

The other thing I'm seeing is jobs that just never run and yet the workflow fails.

Screen Shot 2020-07-23 at 1 16 37 PM

Steps to reproduce

I don't know. It happens all the time with my pipeline, with all of my workflows and jobs. All of my jobs run bash scripts which in turn run Docker containers for everything. I do have a few differences from the default settings: I have the instance type set to m5.4xlarge, and I have a post_install script that provides ECR access:

mkdir /home/ec2-user/.docker
touch /home/ec2-user/.docker/config.json
echo "{" >> /home/ec2-user/.docker/config.json
echo '	"credsStore": "ecr-login"' >> /home/ec2-user/.docker/config.json
echo "}" >> /home/ec2-user/.docker/config.json
amazon-linux-extras enable docker
yum install -y amazon-ecr-credential-helper

I just thought to try updating the lambda zips, since I'm based straight on the github repo and haven't done that since the last time I ran a terraform init. So I'll give that a shot.

What happens if an external user installs another organizations github app?

In the readme the following is stated:

Go to GitHub and create a new app. Beware you can create apps for your organization or for a user. For now we handle only the organization-level app.

But the option to create an organization-level app also forces the app to be public, so it is installable by anyone.

So if I create an organization level app for running this module, what's stopping someone else from discovering my github app installation url and using my self-hosted runners?

runners-scale-up fails with 'AuthFailure.ServiceLinkedRoleCreationNotPermitted'

Summary

When following the readme, using the example configuration, and adjusting the GitHub App permissions as per #100 (comment), the scale-up lambda fails to create the EC2 instance due to ServiceLinkedRoleCreationNotPermitted.

Steps to reproduce

  • Do step 1 of Github app setup
  • Checkout terraform-aws-github-runner repo, cd into example folder
  • Download lambda zips
  • Create terraform.tfvars file with Github App credentials
  • run terraform init && terraform apply
  • Trigger a build on Github

What is the current bug behavior?

The GitHub App sends the webhook, the webhook lambda forwards it, and the scale-up lambda throws an error:

...
ERROR	AuthFailure.ServiceLinkedRoleCreationNotPermitted: The provided credentials do not have permission to create the service-linked role for EC2 Spot Instances.
    at Request.extractError (/var/task/index.js:41424:35)
    at Request.callListeners (/var/task/index.js:47771:20)
    at Request.emit (/var/task/index.js:47743:10)
    at Request.emit (/var/task/index.js:18467:14)
    at Request.transition (/var/task/index.js:17801:10)
    at AcceptorStateMachine.runTo (/var/task/index.js:26145:12)
    at /var/task/index.js:26157:10
    at Request.<anonymous> (/var/task/index.js:17817:9)
    at Request.<anonymous> (/var/task/index.js:18469:12)
    at Request.callListeners (/var/task/index.js:47781:18) {
  code: 'AuthFailure.ServiceLinkedRoleCreationNotPermitted',
  time: 2020-07-30T15:03:24.631Z,
  requestId: 'c7bab39e-b75c-4e7d-bc29-6622b3d4ddb1',
  statusCode: 403,
  retryable: false,
  retryDelay: 68.19342592727871
}

What is the expected correct behavior?

Scale up lambda should create EC2 instance

Possible fixes

I'm sure this is an IAM permissions issue. I am rather new to both AWS and Terraform and am not sure in which of them this needs to be solved, or how to go about it.
It would be great to get some pointers.

PEM routines:get_name:no start line

Summary

Error in scale lambda invocation having to do with the private key decoding.

Steps to reproduce

The configuration:

module "runners" {
  source  = "philips-labs/github-runner/aws"
  version = "~> 0.2"

  ...snip...

  github_app = {
    key_base64     = var.github_app_key_base64
    id             = var.github_app_id
    client_id      = var.github_app_client_id
    client_secret  = var.github_app_client_secret
    webhook_secret = random_password.random.result
  }

  webhook_lambda_zip                = "lambdas-download/webhook.zip"
  runner_binaries_syncer_lambda_zip = "lambdas-download/runner-binaries-syncer.zip"
  runners_lambda_zip                = "lambdas-download/runners.zip"
  enable_organization_runners       = true
  runner_extra_labels               = "default"
}

The github_app_key_base64, which I suspect is the problem, is set as follows (PKCS#1 RSAPrivateKey):

github_app_key_base64    = <<-EOT
-----BEGIN RSA PRIVATE KEY-----
<base64 encoded>
-----END RSA PRIVATE KEY-----
EOT

What is the current bug behavior?

scale lambda fails.

What is the expected correct behavior?

scale lambda succeeds.

Relevant logs and/or screenshots

ERROR	Error: error:0909006C:PEM routines:get_name:no start line
    at Sign.sign (internal/crypto/sig.js:105:29)
    at Object.sign (/var/task/index.js:12802:45)
    at Object.jwsSign [as sign] (/var/task/index.js:9637:24)
    at Object.module.exports.6343.module.exports [as sign] (/var/task/index.js:36570:16)
    at getToken (/var/task/index.js:1861:23)
    at Object.githubAppJwt (/var/task/index.js:1882:23)
    at getAppAuthentication (/var/task/index.js:1509:57)
    at getInstallationAuthentication (/var/task/index.js:1630:35)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:97:5) {
  library: 'PEM routines',
  function: 'get_name',
  reason: 'no start line',
  code: 'ERR_OSSL_PEM_NO_START_LINE'
}
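
For reference, the variable appears to expect the base64 encoding of the entire PEM file (header and footer included), not the PEM text itself; Terraform's filebase64 function, or base64 on the command line, produces such a value. A minimal validation sketch, with an illustrative environment variable name:

// Decode what the lambda will see and check it still looks like a PEM.
// GITHUB_APP_KEY_BASE64 is an illustrative name for wherever the value lands.
const decoded = Buffer.from(process.env.GITHUB_APP_KEY_BASE64 ?? '', 'base64').toString('utf8');

if (!decoded.includes('-----BEGIN')) {
  // Passing the raw PEM text (or only its inner base64 body) decodes to garbage,
  // and signing the app JWT then fails with
  // "error:0909006C:PEM routines:get_name:no start line".
  throw new Error('key_base64 does not decode to a PEM; base64-encode the whole .pem file');
}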

Kudos for the really nice work on this and for sharing with the community! :)

Runners not executing jobs, just idle and shut down

Summary

GitHub Actions checks are not executed; instances boot up and then shut down without executing the job.

Steps to reproduce

I just did the normal v2 setup.

What is the current bug behavior?

The workers boot up but stay idle until they are shut down again.

What is the expected correct behavior?

The workers pick up the jobs and execute them in a reasonable time

Not sure if this belongs here, but do you have any idea what could be the reason? The workers are definitely online, and it just started randomly. The only thing I did was delete some workers and unregister some that no longer existed but were somehow still registered, or had been running non-stop for multiple days.

I have an offline macOS worker so that tests don't fail; my CI runs on Linux. Does this pose a problem?

Some additional info needs to be added to readme

I managed to fix the issues I encountered regarding Resource not accessible by integration and Not Found:

  • In the app permissions, you also need to set Repository permissions > Actions > Read-only (regardless of whether you're an organization or not)

You also need to add the following to the Terraform file (via #104):

resource "aws_iam_service_linked_role" "spot" {
  aws_service_name = "spot.amazonaws.com"
}

(this is for the 0.5.0 version)

Workflows immediately fail and jobs are never created.

Summary

I just followed all the instructions, successfully deployed this service to AWS, and integrated it with GitHub. When I create a new job run, GitHub contacts the webhook and the webhook successfully sends the message to the scale-up lambda, but as there were no runners at the time the job was enqueued, the job immediately fails. As a result, the scale-up lambda finds 0 queued jobs when querying the repository workflows and doesn't create any runner.

Steps to reproduce

I simply followed the instructions. Tried with 0.1.0 and 0.2.0

Possible fixes

Ideally we would be able to specify a minimum number of running runners, so that an immediately available runner is always guaranteed.

Support for Windows runners

I've had a poke at the module and I am presuming this currently only supports Linux-based runners? Any plans to add Windows runner support?

Queued workflow not picked up by AWS runner

Hi,

Whenever I trigger a workflow run while there are no running EC2 runner instances, the following happens:

  1. Lambda webhook gets the check_run event and queues it to SQS
  2. SQS triggers the scale-up Lambda
  3. Lambda scale-up starts an EC2 instance
  4. The EC2 instance properly registers as a self-hosted runner (visible in the GH repository "Actions" settings page)
  5. The workflow run isn't picked up by the runner and stays in the queue forever, until I cancel it manually

If I trigger another workflow run while the EC2 runner is started, it gets properly picked up and executed.

Any idea what the problem is here?

Thanks!

Add support for ARM64 runners using AWS Graviton/Graviton2 instance-types.

Problem to solve

The current solution is unable to launch instances compatible with GitHub's ARM64 self-hosted runner.

Intended users

Developers/Teams building for ARM64 (e.g. Raspberry Pi)

Further details

Benefit: extends support to GitHub Actions pipelines that use ARM64

Proposal

A PR is in progress with runner_architecture auto-detected from the instance type, support for downloading the arm64 actions-runner from GitHub, and a patch to account for the lack of pre-installed ICU support in .NET Core, which the arm64 actions-runner requires. A sketch of the auto-detection idea follows.
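
A hedged sketch of that auto-detection (the function name and family list are illustrative, not necessarily what the PR uses):

// Map an EC2 instance type to the runner architecture to download.
// The ARM64 family list is illustrative and not exhaustive.
function runnerArchitecture(instanceType: string): 'arm64' | 'x64' {
  const arm64Families = ['a1', 't4g', 'm6g', 'c6g', 'r6g']; // Graviton/Graviton2
  const family = instanceType.split('.')[0];
  // startsWith also catches variants such as m6gd or c6gn.
  return arm64Families.some((f) => family.startsWith(f)) ? 'arm64' : 'x64';
}

// e.g. runnerArchitecture('m6g.large') === 'arm64', runnerArchitecture('m5.large') === 'x64'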

Documentation

Will document how to enable arm64 support as well as some gotchas I ran into (some Graviton instances aren't available in all AZs)

Availability & Testing

Not much? Might need a test case for the change to the syncer lambda.

What does success look like, and how can we measure that?

Setting a Graviton/Graviton2 instance type in example/default/main.tf and (optionally) specifying subnet AZs in example/default/vpc.tf results in a successful stack that can launch functional ARM64 self-hosted runners.

Other links/references

n/a

dev-usw2-scale-up failure: "Failed handling SQS event" "PEM routines:get_name:no start line at Sign.sign"

Summary

dev-usw2-scale-up Execution result: failed

Steps to reproduce

Trigger via a commit to the configured application, with the requisite GitHub App set up per the docs/README in this repo and https://040code.github.io/2020/05/25/scaling-selfhosted-action-runners

What is the current bug behavior?

ERROR Error: error:0909006C:PEM routines:get_name:no start line at Sign.sign ERROR Invoke Error
(see full error trace/out below)

What is the expected correct behavior?

The commit to the configured repo should cause the lambda function to execute and scale up or deploy an AWS EC2 spot instance.

Relevant logs and/or screenshots

The most recent failure/error upon a commit to the configured GitHub repo (which the GitHub App is configured to watch) is in CloudWatch Logs, log group /aws/lambda/dev-usw2-scale-up, and is available in a GitHub gist here:

gist-file-aws-lambda-dev-usw2-scale-up-error

Possible fixes

At first glance, it appears this might be related to a cert/key error?

Who can address the issue

Requesting validation and suggestions on resolution

Other links/references

Thank you

Question: Runner based on label

Hello Again!

I've got a question: how do you support spinning up different images depending on the label? My dream solution is that runners are spun up as required and available at the organisation level. The runner that is spun up is based on the label provided; if it's a node-12 label, for instance, then a Node 12 instance is spun up from a Node 12 launch template. How does the setup support multiple labels?

Cheers

Ephemeral Runners?

Problem to solve

As a developer interacting with a public repository, I want to be able to have ephemeral instances so that I can safely use self-hosted runners in a public repo.

Intended users

Any public repository user where github actions are used, and the default github hosted runners do not provide sufficient resources.

Proposal

  • Have a warm pool of idling runners waiting for a job from GitHub (polling an SQS queue or something)
  • When an idling runner gets a job, execute that job, and delete the runner when finished (the lifetime of the runner is the same as that of the GHA job it executes)

What does success look like, and how can we measure that?

  • Jobs are quickly executed since runners are pre-provisioned
  • Security concerns over persistence of data across jobs are addressed, since the lifetime of a runner is tied to a single GitHub job.

EC2 instance type

How can you pass the instance type you want to use? I saw that the default instance type is m5.large, but there is no explanation of how to change it.

scale-down lambda fails with: SyntaxError: Unexpected token u in JSON at position 0

Summary

Hello, thanks for the great project. Everything is working fine except the scale-down lambda, which fails with SyntaxError: Unexpected token u in JSON at position 0.

Steps to reproduce

Here is my lambda download code:

module "lambdas" {
  source  = "philips-labs/github-runner/aws//modules/download-lambda"
  version = "0.4.0"

  lambdas = [
    {
      name = "webhook"
      tag  = "v0.4.0"
    },
    {
      name = "runners"
      tag  = "v0.4.0"
    },
    {
      name = "runner-binaries-syncer"
      tag  = "v0.4.0"
    }
  ]
}

As for the idle config, I'm using the defaults.

What is the current bug behavior?

Here are the logs from CloudWatch:


2020-08-19T15:03:24.035Z	22fd0c68-5fde-46e1-963e-422a6ae3aa00	ERROR	Unhandled Promise Rejection 
{
    "errorType": "Runtime.UnhandledPromiseRejection",
    "errorMessage": "SyntaxError: Unexpected token u in JSON at position 0",
    "reason": {
        "errorType": "SyntaxError",
        "errorMessage": "Unexpected token u in JSON at position 0",
        "stack": [
            "SyntaxError: Unexpected token u in JSON at position 0",
            "    at JSON.parse (<anonymous>)",
            "    at Object.<anonymous> (/var/task/index.js:8456:39)",
            "    at Generator.next (<anonymous>)",
            "    at /var/task/index.js:8385:71",
            "    at new Promise (<anonymous>)",
            "    at module.exports.471.__awaiter (/var/task/index.js:8381:12)",
            "    at Object.scaleDown (/var/task/index.js:8455:12)",
            "    at /var/task/index.js:16564:22",
            "    at Generator.next (<anonymous>)",
            "    at /var/task/index.js:16543:71"
        ]
    },
    "promise": {},
    "stack": [
        "Runtime.UnhandledPromiseRejection: SyntaxError: Unexpected token u in JSON at position 0",
        "    at process.<anonymous> (/var/runtime/index.js:35:15)",
        "    at process.emit (events.js:315:20)",
        "    at process.EventEmitter.emit (domain.js:482:12)",
        "    at processPromiseRejections (internal/process/promises.js:209:33)",
        "    at processTicksAndRejections (internal/process/task_queues.js:98:32)"
    ]
}
[ERROR] [1597849404074] LAMBDA_RUNTIME Failed to post handler success response. Http response code: 403.

What is the expected correct behavior?

Scale down lambda should work as expected and terminate idle instances after timeout.
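
For what it's worth, "Unexpected token u in JSON at position 0" is the signature of JSON.parse(undefined): the argument is stringified to "undefined", whose first character is the offending u. A minimal sketch of the suspected failure mode and a defensive fix; the environment variable name is illustrative, not necessarily what the lambda uses:

// Suspected failure mode: the scale-down lambda parses a config value that is
// unset in this deployment (SCALE_DOWN_CONFIG is an illustrative name).
//
//   JSON.parse(process.env.SCALE_DOWN_CONFIG as string);
//   // -> SyntaxError: Unexpected token u in JSON at position 0 when unset,
//   //    because JSON.parse receives undefined and parses the string "undefined".
//
// Defensive variant: fall back to an empty config instead of throwing.
const idleConfig = JSON.parse(process.env.SCALE_DOWN_CONFIG ?? '[]');
console.log('idle config:', idleConfig);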

Thanks for any help!

Request Limit Exceeded?

We're seeing runners fail to delete. The underlying AWS instances get purged with "orphaned runner deleted" log messages, but for some reason we are getting rate limited somewhere (I think in AWS) and then the GitHub runners never get removed.

If we wait long enough, we have seen as many as 800 offline runners...

Here are some relevant lambda logs from the scale down lambda:

{
    "errorType": "Runtime.UnhandledPromiseRejection",
    "errorMessage": "RequestLimitExceeded: Request limit exceeded.",
    "reason": {
        "errorType": "RequestLimitExceeded",
        "errorMessage": "Request limit exceeded.",
        "code": "RequestLimitExceeded",
        "message": "Request limit exceeded.",
        "time": "2020-10-19T12:00:12.277Z",
        "requestId": "eb9ecb84-5f5a-4317-974b-10371c2df8f7",
        "statusCode": 503,
        "retryable": true,
        "stack": [
            "RequestLimitExceeded: Request limit exceeded.",
            "    at Request.extractError (/var/task/index.js:40075:35)",
            "    at Request.callListeners (/var/task/index.js:46386:20)",
            "    at Request.emit (/var/task/index.js:46358:10)",
            "    at Request.emit (/var/task/index.js:17843:14)",
            "    at Request.transition (/var/task/index.js:17177:10)",
            "    at AcceptorStateMachine.runTo (/var/task/index.js:25384:12)",
            "    at /var/task/index.js:25396:10",
            "    at Request.<anonymous> (/var/task/index.js:17193:9)",
            "    at Request.<anonymous> (/var/task/index.js:17845:12)",
            "    at Request.callListeners (/var/task/index.js:46396:18)"
        ]
    },
    "promise": {},
    "stack": [
        "Runtime.UnhandledPromiseRejection: RequestLimitExceeded: Request limit exceeded.",
        "    at process.<anonymous> (/var/runtime/index.js:35:15)",
        "    at process.emit (events.js:315:20)",
        "    at process.EventEmitter.emit (domain.js:483:12)",
        "    at processPromiseRejections (internal/process/promises.js:209:33)",
        "    at processTicksAndRejections (internal/process/task_queues.js:98:32)"
    ]
}


[ERROR] [1603108812400] LAMBDA_RUNTIME Failed to post handler success response. Http response code: 403.

Any ideas what could be going on here? Thanks in advance for your help!
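
One hedged mitigation, assuming the lambdas use the AWS SDK for JavaScript v2 (the stack traces suggest so): raise the EC2 client's built-in retry budget so throttled calls back off and retry instead of rejecting the promise. A minimal sketch:

import { EC2 } from 'aws-sdk';

// RequestLimitExceeded is marked retryable: true in the log above, so giving the
// SDK more retries with a longer backoff base lets it absorb the throttling.
const ec2 = new EC2({
  maxRetries: 10,                    // SDK default is 3 for most services
  retryDelayOptions: { base: 300 },  // base delay in ms for exponential backoff
});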

Runner instances are incorrectly detected as orphaned and terminated

Summary

When there are over 30 runners in a repo/organization, the scale-down lambda thinks new runners are orphans and will terminate them, even while they are running a build.

Steps to reproduce

  1. Add 30 runners to your repository or organisation. These can be offline.
  2. Trigger a new workflow run to generate a new instance via this project
  3. Wait the configured time (minimum_running_time_in_minutes option or 5 minutes by default)
  4. Cloudwatch logs on the scale down function shows that the newly created instance is an orphan

What is the current bug behavior?

Runners get terminated while they should not be deleted.

What is the expected correct behavior?

These runners should not be terminated in this scenario.

Relevant logs and/or screenshots

2020-08-26T09:30:06.050Z 7d6440bb-27ba-4d17-ba26-3edc285b88c1 INFO Runner 'i-0d90f0e61ef64b847' is orphan, and will be removed.
2020-08-26T09:30:06.272Z 7d6440bb-27ba-4d17-ba26-3edc285b88c1 DEBUG Runner terminated.i-0d90f0e61ef64b847

Possible fixes

In modules/runners/lambdas/runners/src/scale-runners/scale-down.ts in the scaleDown function the following code is used to retrieve registered runners.

 const registered = enableOrgLevel
      ? await githubAppClient.actions.listSelfHostedRunnersForOrg({
          org: repo.repoOwner,
        })
      : await githubAppClient.actions.listSelfHostedRunnersForRepo({
          owner: repo.repoOwner,
          repo: repo.repoName,
        });

This API is paginated and by default returns only the first 30 runners. The page size can be raised to 100 runners, but to be safe we should fetch all pages, as sketched below.
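
A sketch of that fix, letting Octokit walk every page (octokit's paginate method resolves the nested runners array for these endpoints):

const registered = enableOrgLevel
  ? await githubAppClient.paginate(githubAppClient.actions.listSelfHostedRunnersForOrg, {
      org: repo.repoOwner,
      per_page: 100,
    })
  : await githubAppClient.paginate(githubAppClient.actions.listSelfHostedRunnersForRepo, {
      owner: repo.repoOwner,
      repo: repo.repoName,
      per_page: 100,
    });
// `registered` is now a flat array of all registered runners, not just the first page.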

Question: Required Envs for Various Lambda Functions

Hello!

This looks excellent whilst we wait for GitHub to provide a supported solution. However, I work in a CloudFormation shop, so I will be converting the Terraform. What isn't super clear to me is which environment variables I need to provide to the various Lambda functions, as I will be deploying via CloudFormation and the Serverless Framework. Could you clarify what is required on the individual lambda functions for them to work?
