
A PaaS built on top of Amazon EC2 Container Service (ECS)

License: BSD 2-Clause "Simplified" License


Empire

Empire is a control layer on top of Amazon EC2 Container Service (ECS) that provides a Heroku-like workflow. It conforms to a subset of the Heroku Platform API, which means you can use the same tools and processes that you use with Heroku, but with all the power of EC2 and Docker.

Empire is targeted at small- to medium-sized startups that run a large number of microservices and need more flexibility than Heroku provides. You can read the original blog post about why we built Empire on the Remind engineering blog.

Quickstart

Install

To use Empire, you'll need to have an ECS cluster running. See the quickstart guide for more information.

Architecture

Empire aims to make it trivially easy to deploy a container-based microservices architecture, without all of the complexity of managing systems like Mesos or Kubernetes. ECS takes care of much of that work, but Empire enhances the interface to ECS for deploying and maintaining applications, letting you deploy Docker images as easily as:

$ emp deploy remind101/acme-inc:master

Heroku API compatibility

Empire supports a subset of the Heroku Platform API, which means any tool that uses the Heroku API can probably be used with Empire, if the endpoint is supported.

As an example, you can use the hk CLI with Empire like this:

$ HEROKU_API_URL=<empire_url> hk ...

However, you'll get the best experience by using the emp command, a fork of hk with Empire-specific features.

Routing

Empire's routing layer is backed by internal ELBs. Any application that specifies a web process will get an internal ELB attached to its associated ECS Service. When a new version of the app is deployed, ECS manages spinning up the new versions of the process, waiting for old connections to drain, then killing the old release.

When a new internal ELB is created, an associated CNAME record will be created in Route53 under the internal TLD, which means you can use DNS for service discovery. If we deploy an app named feed then it will be available at http://feed within the ECS cluster.

Apps are exposed only internally by default, unless you add a custom domain to them. Adding a custom domain creates a new external ELB for the ECS service.

Deploying

Any tagged Docker image can be deployed to Empire as an app. Empire doesn't enforce how you tag your Docker images, but we recommend tagging the image with the git sha that it was built from (or any other immutable identifier), and deploying that.

When you deploy a Docker image to Empire, it will extract a Procfile from the WORKDIR. Like Heroku, you can specify different process types that compose your service (e.g. web and worker), and scale them individually. Each process type in the Procfile maps directly to an ECS Service.
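As an illustration, a Procfile for a service with both process types might look like this (the commands are hypothetical):

```
web: ./bin/web -port=$PORT
worker: ./bin/worker
```

Each entry would map to its own ECS Service, scaled independently with emp scale.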

Contributing

Pull requests are more than welcome! For help with setting up a development environment, see CONTRIBUTING.md.

Community

We have a google group, empire-dev, where you can ask questions and engage with the Empire community.

You can also join our Slack team for discussions and support.

Auth Flow

The current authentication model used by emp login relies on a deprecated GitHub endpoint that is scheduled to be deactivated in November 2020. Therefore, both the client and the server need to be updated to support the web authentication flow.

The web flow works like this:

  1. The user runs a command like emp web-login
  2. The client starts up an HTTP listener on a free local port
  3. The client opens a browser window on the local machine to $EMPIRE_API_URL/oauth/start?port=?????
    • The port parameter specifies where the client is listening
  4. The browser executes a GET against the URL
  5. The Empire server sees the request and constructs an OAuth request URL that will hit the GitHub OAuth endpoint and returns it as a redirect
  6. The browser makes the request to the GitHub auth endpoint, which prompts the user to authorize the application
    • If they've previously authorized it, the request is granted immediately
  7. GitHub redirects the browser back to the redirect URL specified in the configuration, meaning back to the Empire server
  8. The Empire server receives the browser request and can now perform the code exchange to turn the provided code into an actual authentication token
    • This token is just like the one it would have received from the old endpoint. However, it's not usable yet, because it's in the possession of the browser, not the client
  9. The Empire server now redirects the browser back to localhost on the original port provided by the client
  10. The client receives the token, but can't use it directly. The Empire server expects it to be wrapped in a JSON Web Token that only the server can create.
  11. The client can now make a request directly to the Empire server (its first in this sequence) providing the token and requesting a JSON Web Token in response
  12. The client stores the received token just as it would have with the response to an emp login command
  13. The client is authenticated

In theory the Empire server could construct the JWT directly after the code exchange and push that directly to the client, but the abstraction doesn't seem to easily support that flow.


empire's Issues

Add better integration test suite

It would be great if we:

  1. Had some happy-path cases using the heroku-go client. Maybe we can do something with JSON schema to verify that responses are Heroku API compatible.
  2. Maybe even some integration tests that shell out to the hk command to verify things work (probably only a handful of these).
  3. Possibly some tests that boot up a Vagrant cluster, deploy acme-inc, and test that it's running. That might be difficult to run in CI though, so I'm not sure it's worth it.

Logging

This is not empire logging, but centralized logging for apps running in the minion cluster. We'll want a way to pipe these logs to other systems like sumologic and librato.

bablefish starts in failed state

-- Logs begin at Fri 2015-03-06 02:02:25 UTC. --
Mar 06 02:30:04 c1 systemd[1]: r101-bablefish.1.web.1.service: control process exited, code=exited status=1
Mar 06 02:30:04 c1 systemd[1]: Unit r101-bablefish.1.web.1.service entered failed state.
Mar 06 02:30:04 c1 systemd[1]: r101-bablefish.1.web.1.service failed.
Mar 06 02:30:04 c1 systemd[1]: r101-bablefish.1.web.1.service holdoff time over, scheduling restart.
Mar 06 02:30:04 c1 systemd[1]: Stopping r101-bablefish.1.web.1...
Mar 06 02:30:04 c1 systemd[1]: Starting r101-bablefish.1.web.1...
Mar 06 02:30:04 c1 systemd[1]: start request repeated too quickly for r101-bablefish.1.web.1.service
Mar 06 02:30:04 c1 systemd[1]: Failed to start r101-bablefish.1.web.1.
Mar 06 02:30:04 c1 systemd[1]: Unit r101-bablefish.1.web.1.service entered failed state.
Mar 06 02:30:04 c1 systemd[1]: r101-bablefish.1.web.1.service failed.

Haven't seen this one before:

start request repeated too quickly for r101-bablefish.1.web.1.service

Move API into its own package.

The Heroku compatible API should just be a consumer of the empire package, with its own App, Release, Dyno, etc. representations of things.

Consider a GitHub Deployments integration

Right now, we handle this with Shipr, but I think there would be a lot of value to having built in handling for github deployment events.

An integration might look something like this:

  1. Creating an app also adds a webhook to the github repo for deployment events, pointed at https://empire.remind.com/deploys/github.
  2. The /deploys/github endpoint would basically look the same as this where we:
    1. Resolve the git sha to an image id using the docker registry api.
    2. Trigger a Deploy using the DeploysService.

The primary advantage that Shipr provides right now is an abstraction around deployment, log storage from the build, and slack integration for deployment_status events. The deployment_status handling could be split out of Shipr (and probably should be) into its own project, and log storage is not an issue since there's no build output when deploying to empire.

Re-organize into sub directories

We should probably re-organize the root directory into a structure like this:

├── cluster
├── empire
│   ├── cmd
│   │   └── empire
│   ├── Dockerfile
│   └── README.md
├── etcd_peers
│   ├── Dockerfile
│   └── README.md
├── tests
├── README.md
└── Vagrantfile

At the very least, it'll make coming into the project a little less daunting.

hk dynos timezone bug

Clearly a UTC to PST issue:

ben@Bens-MacBook-Air:empire (master)
$ hk scale web=3 -a acme-inc
Scaled acme-inc to web=3:1X.
ben@Bens-MacBook-Air:empire (master)
$ hk dynos -a acme-inc
acme-inc.2.web.1    active    8h  "./bin/web"
acme-inc.2.web.2    unknown   8h  "./bin/web"
acme-inc.2.web.3    unknown   8h  "./bin/web"
ben@Bens-MacBook-Air:empire (master)
$ hk dynos -a acme-inc
acme-inc.2.web.1    active   8h  "./bin/web"
acme-inc.2.web.2    active   8h  "./bin/web"
acme-inc.2.web.3    active   8h  "./bin/web"

Old release not unscheduled?

Not sure how this happened, but I still have an old release running after 1 minute:

empire $ empire dynos -a acme-inc
acme-inc.2.web.1    active   5m  "./bin/acme-inc -port=$PORT"
acme-inc.3.web.1    active   4m  "./bin/acme-inc -port=$PORT"

Docker container GC

Just something to think about. We'll probably eventually need something to GC old unused containers.

empire_controller AWS image needs to build off base

Because we do some stuff in base (install consul, docker, etc) we need to make sure that the empire_controller AWS image builds off of it. This is a little more difficult because there's no artifact to push up into S3 for the build to work with.

Another option is to have the empire_controller boxes run the base setup script as well, ensuring that things are installed.

Better metadata support

See GH-120

Right now we hardcode 'role=empire_minion' on all jobs in Empire, but it'd be good if:

a. we could change that (with the flag, like @ejholmes mentions)
b. we could pass along more metadata, for better control over where things get scheduled.

Convert router docker container to use remind101/base as its base.

Ubuntu has some 'known issues' with their docker images (maybe these have been resolved, I haven't dug into it) so most folks use phusion's baseimage. That's the base of our remind101/base image. It'd be good if the router docker image was based off of this as well.

See https://github.com/phusion/baseimage-docker for info on how to launch a daemon, etc.

You can look at any of the other of our images in github.com/remind101/docker_images for examples of using it.

Extract `package scheduler` into a `package container`

package container would be something that could potentially be used by other systems that want to schedule containers onto a cluster. The basic abstraction might look like:

type Image struct {
    Repo string
    ID   string
}

type Limits struct {
    // If provided, represents the maximum amount of bytes to allow the
    // container to use.
    Memory *int
}

type Container struct {
    // The name of the container
    Name string

    // Environment variables to set in the container
    Environment map[string]string

    // The command to run.
    Command string

    // The image to create the container from.
    Image Image

    // Any limits that this container should have.
    Limits Limits

    // Constraints represents constraints about what machine this container
    // is scheduled onto. The semantics of the keys and values depends on
    // the scheduler implementation.
    Constraints map[string]string
}

// ContainerState represents the state of a container in a cluster.
type ContainerState struct {
    *Container

    // The state of the container. "running", "failed", etc.
    State string

    // The machine that the container is scheduled on.
    Machine string
}

type Scheduler interface {
    // Schedule schedules containers onto the cluster.
    Schedule(...*Container) error

    // Unschedule unschedules containers from the cluster.
    Unschedule(...string) error

    // SetState sets the desired state of a container.
    SetState(string) error

    // ContainerStates returns the state of the containers in the cluster.
    ContainerStates() ([]*ContainerState, error)

    // ContainerState returns a ContainerState for the given container.
    ContainerState(string) (*ContainerState, error)

    // Restart restarts a container.
    Restart(string) error
}

And the goal would be to support fleet and swarm, and hopefully be generic enough to support both docker and rocket.

Proposal: Use app id instead of repo for deployment

I'm thinking it will be more convenient to reference the app ID (assuming the app ID will be a name and not a uuid; rename apps.ID to apps.Name?) when we deploy.

// Current
POST /apps { "repo":"remind101/r101-api" }
POST /deploys { "image":{ "id":"0123456789abcdef0123456789abcdef", "repo":"remind101/r101-api" } }

// Proposed
POST /apps { "id":"api", "repo":"remind101/r101-api" }
POST /deploys/api { "image":{ "id":"0123456789abcdef0123456789abcdef" } }
