paz-sh / paz

An open-source, in-house service platform with a PaaS-like workflow, built on Docker, CoreOS, Etcd and Fleet. This repository houses the documentation and installation scripts.

Home Page: http://paz.sh

License: Other

Shell 100.00%

paz's Introduction


Paz

Continuous deployment production environments, built on Docker, CoreOS, etcd and fleet.

THIS PROJECT IS INACTIVE

Paz is an in-house service platform with a PaaS-like workflow.

Paz's documentation can be found here.

Screenshot

What is Paz?

Paz is...

  • Like your own private PaaS that you can host anywhere
  • Free
  • Open-source
  • Simple
  • A web front-end to CoreOS' Fleet with a PaaS-like workflow
  • Like a clustered/multi-host Dokku
  • Alpha software
  • Written in Node.js

Paz is not...

  • A hosted service
  • A complete, enterprise-ready orchestration solution

Features

  • Beautiful web UI
  • Run anywhere (Vagrant, public cloud or bare metal)
  • No special code required in your services
    • i.e. it will run any containerised application unmodified
  • Built for Continuous Deployment
  • Zero-downtime deployments
  • Service discovery
  • Same workflow from dev to production
  • Easy environments

Components

  • Web front-end - A beautiful UI for configuring and monitoring your services.
  • Service directory - A catalog of your services and their configuration.
  • Scheduler - Deploys services onto the platform.
  • Orchestrator - REST API used by the web front-end; presents a unified subset of functionality from Scheduler, Service Directory, Fleet and Etcd.
  • Centralised monitoring and logging.

Service Directory

This is a database of all your services and their configuration (e.g. environment variables, data volumes, port mappings and the number of instances to launch). Ultimately this information will be reduced to a set of systemd unit files (by the scheduler) to be submitted to Fleet for running on the cluster. The service directory is a Node.js API backed by a LevelDB database.
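As a rough illustration of the kind of record the directory holds, here is a hedged sketch of registering a service over HTTP. The /services path and hostname are assumptions (not a documented Paz endpoint); the JSON shape mirrors the demo-api example that appears in the integration test output later on this page.

$ curl -X POST http://paz-service-directory.paz/services \
    -H 'Content-Type: application/json' \
    -d '{"doc":{"name":"demo-api","description":"Very simple HTTP Hello World server","dockerRepository":"lukebond/demo-api","config":{"publicFacing":false,"numInstances":3,"ports":[],"env":{}}}}'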

Scheduler

This service receives HTTP POST commands to deploy services that are defined in the service directory. Using the service data from the directory, it renders unit files and runs them on the CoreOS cluster using Fleet. A history of deployments and their associated configuration is also available from the scheduler.

For each service the scheduler will deploy a container for the service and an announce sidekick container.

The scheduler is a Node.js API backed by a LevelDB database and uses Fleet to launch services.
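The units Paz actually renders aren't reproduced here, but the following is a minimal sketch of the main-unit-plus-announce-sidekick pattern being described, in the usual CoreOS style. All unit names, etcd keys and published values are illustrative assumptions, not Paz's real output.

# demo-api.service (main unit)
[Unit]
Description=demo-api
Requires=docker.service
After=docker.service

[Service]
ExecStartPre=-/usr/bin/docker kill demo-api
ExecStartPre=-/usr/bin/docker rm demo-api
ExecStart=/usr/bin/docker run --name demo-api -P lukebond/demo-api
ExecStop=/usr/bin/docker stop demo-api

# demo-api-announce.service (sidekick: publishes the host into etcd under a TTL'd key;
# a real announce unit would also record the mapped port)
[Unit]
BindsTo=demo-api.service
After=demo-api.service

[Service]
EnvironmentFile=/etc/environment
ExecStart=/bin/sh -c "while true; do etcdctl set /paz/services/demo-api/%H ${COREOS_PRIVATE_IPV4} --ttl 60; sleep 45; done"
ExecStop=/usr/bin/etcdctl rm /paz/services/demo-api/%H

[X-Fleet]
MachineOf=demo-api.service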

Orchestrator

This is a service that ties all of the other services together, providing a single access point for the front-end to interface with. It also offers a WebSocket endpoint for real-time updates to the web front-end.

The orchestrator is a Node.js API server that communicates with Etcd, Fleet, the scheduler and service directory.

Web Front-End

A beautiful and easy-to-use web UI for managing your services and observing the health of your cluster. Built in Ember.js.

HAProxy

Paz uses Confd to dynamically configure HAProxy based on service availability information declared in Etcd. HAProxy is configured to route external and internal requests to the correct host for the desired service.
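As a rough sketch of the pattern (not Paz's actual configuration), a confd resource and template for one backend might look like this; the etcd key prefix, backend name and reload command are assumptions.

# /etc/confd/conf.d/haproxy.toml
[template]
src        = "haproxy.cfg.tmpl"
dest       = "/etc/haproxy/haproxy.cfg"
keys       = ["/paz/services"]
reload_cmd = "haproxy -f /etc/haproxy/haproxy.cfg -sf $(cat /var/run/haproxy.pid)"

# /etc/confd/templates/haproxy.cfg.tmpl (fragment)
backend demo-api
{{range gets "/paz/services/demo-api/*"}}    server {{base .Key}} {{.Value}} check
{{end}}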

Monitoring and Logging

Currently cAdvisor is used for monitoring, and there is not yet any centralised logging. Monitoring and logging are high-priority features on the roadmap.

Installation

Paz's Docker repositories are hosted at Quay.io, but they are public so you don't need any credentials.

You will need to install fleetctl and etcdctl. On OS X you can install both with brew:

$ brew install etcdctl fleetctl

Vagrant

Clone this repository and run the following from its root directory:

$ ./scripts/install-vagrant.sh

This will bring up a three-node CoreOS Vagrant cluster and install Paz on it. Note that it may take 10 minutes or more to complete.

For extra debug output, run with DEBUG=1 environment variable set.

If you already have a Vagrant cluster running and want to reinstall the units, use:

$ ./scripts/reinstall-units-vagrant.sh

To interact with the units in the cluster via Fleet, just pass the URL of etcd on one of your hosts as a parameter to fleetctl, e.g.:

$ fleetctl -strict-host-key-checking=false -endpoint=http://172.17.9.101:4001 list-units

You can also SSH into one of the VMs and run fleetctl from there:

$ cd coreos-vagrant
$ vagrant ssh core-01

...however, bear in mind that Fleet needs to SSH into the other VMs in order to perform operations that involve calling down to systemd (e.g. journal), and for this you need to have SSHed into the VM running the unit in question. For this reason you may find it simpler (albeit more verbose) to run fleetctl from outside the CoreOS VMs.

DigitalOcean

Paz has been tested on DigitalOcean, but there isn't currently an install script for it.

In short, you need to create your own cluster and then install the Paz units on there.

The first step is to spin up a CoreOS cluster on DigitalOcean with Paz's cloud-config userdata, and then we'll install Paz on it.

  1. Click the "Create Droplet" button in the DigitalOcean console.
  2. Give your droplet a name and choose your droplet size and region.
  3. Tick "Private Networking" and "Enable User Data"
  4. Paste the contents of the digitalocean/userdata file in the yldio/paz repository into the userdata text area.
  5. Go to http://discovery.etcd.io/new, copy the discovery URL it returns, and paste it into the userdata text area in place of the one that is already there.
  6. In the write_files section, in the entry that writes the /etc/environment file, edit the PAZ_DOMAIN, PAZ_DNSIMPLE_APIKEY and PAZ_DNSIMPLE_EMAIL fields, putting in your DNSimple-managed domain name, DNSimple API key and DNSimple account email address, respectively (see the cloud-config sketch at the end of this section).
  7. Before submitting, copy this userdata to a text file or editor, because we'll need to use it again unchanged.
  8. Select the CoreOS version you want to install (e.g. latest stable or beta should be fine).
  9. Add the SSH keys that will be added to the box (under core user).
  10. Click "Create Droplet".
  11. Repeat for the number of nodes you want in the cluster (e.g. 3), using the exact same userdata file (i.e. don't generate a new discovery token etc.).
  12. Once all droplets have booted (test by trying to SSH into each one, run docker ps and observe that paz-dnsmasq, cadvisor and paz-haproxy are all running on each box), you may proceed.
  13. Install Paz:
$ ssh-add ~/.ssh/id_rsa
$ FLEETCTL_TUNNEL=<MACHINE_IP> fleetctl -strict-host-key-checking=false start unitfiles/1/*

...where <MACHINE_IP> is an IP address of any node in your cluster. You can wait for all units to be active/running like so:

$ FLEETCTL_TUNNEL=<MACHINE_IP> watch -n 5 fleetctl -strict-host-key-checking=false list-units

Once they're up you can install the final services:

$ FLEETCTL_TUNNEL=<MACHINE_IP> fleetctl -strict-host-key-checking=false start unitfiles/2/*
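For reference, the /etc/environment entry edited in step 6 above is a standard cloud-config write_files block. The sketch below shows the general shape with placeholder values; the exact entry in the shipped userdata may differ.

#cloud-config
write_files:
  - path: /etc/environment
    permissions: 0644
    content: |
      PAZ_DOMAIN=paz.example.com
      PAZ_DNSIMPLE_APIKEY=your-dnsimple-api-key
      PAZ_DNSIMPLE_EMAIL=you@example.com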

Bare Metal

Paz works fine on a bare metal install, but there is no install script available for it yet.

You need to create your cluster, then add the contents of bare_metal/user-data to your cloud config, and finally submit the unit files.

  1. Create your cluster.
  2. Paste the contents of bare_metal/user-data into your cloud config file. Be sure to alter the networking information to match your setup.
  3. Go to http://discovery.etcd.io/new, copy the discovery URL it returns, and paste it into your cloud-config file in place of the one that is already there.
  4. Install Paz:
$ ssh-add ~/.ssh/id_rsa
$ FLEETCTL_TUNNEL=<MACHINE_IP> fleetctl -strict-host-key-checking=false start unitfiles/1/*

...where <MACHINE_IP> is an IP address of any node in your cluster. You can wait for all units to be active/running like so:

$ FLEETCTL_TUNNEL=<MACHINE_IP> watch -n 5 fleetctl -strict-host-key-checking=false list-units

Once they're up you can install the final services:

$ FLEETCTL_TUNNEL=<MACHINE_IP> fleetctl -strict-host-key-checking=false start unitfiles/2/*

Tests

There is an integration test that brings up a CoreOS Vagrant cluster, installs Paz and then runs a contrived service on it and verifies that it works:

$ cd test
$ ./integration.sh

Each Paz repository (service directory, orchestrator, scheduler) has tests that run on http://paz-ci.yld.io:8080 (in StriderCD), triggered by a GitHub webhook.

Paz Repositories

The various components of Paz are spread across several repositories:

paz's People

Contributors

bfirsh, dscape, enzor, foliveira, hyperbolic2346, jacyzon, jemgold, lukebond, sublimino, tomgco


paz's Issues

cAdvisor unit reporting as "failed"

...yet it is actually running fine.

Looking at the logs, the error appears to be due to a port conflict. Given that 8080 is only used by cAdvisor, I suspect that the systemd unit file is misconfigured: it daemonises the process yet continually tries to start it again.

":( something went wrong"

Hi guys, I attempted to bring up a Vagrant cluster last night and the paz-web.paz dashboard failed with the above message. I know very little about JavaScript/Ember, so I'm not sure how much more debugging I can do; I'm not even sure where to start at this point.

fleetctl --version
fleetctl version 0.9.1
etcdctl --version
etcdctl version 2.0.4
vagrant --version
Vagrant 1.7.2

I started it up with:

./scripts/install-vagrant.sh 
Installing Paz on Vagrant

Checking for existing Vagrant cluster

Creating a new Vagrant cluster
Cloning into 'coreos-vagrant'...
remote: Counting objects: 351, done.
remote: Total 351 (delta 0), reused 0 (delta 0), pack-reused 351
Receiving objects: 100% (351/351), 79.37 KiB | 0 bytes/s, done.
Resolving deltas: 100% (152/152), done.
Checking connectivity... done.
==> core-01: Box 'coreos-beta' not installed, can't check for updates.
==> core-02: Box 'coreos-beta' not installed, can't check for updates.
==> core-03: Box 'coreos-beta' not installed, can't check for updates.
Bringing machine 'core-01' up with 'virtualbox' provider...
Bringing machine 'core-02' up with 'virtualbox' provider...
Bringing machine 'core-03' up with 'virtualbox' provider...
==> core-01: Box 'coreos-beta' could not be found. Attempting to find and install...
    core-01: Box Provider: virtualbox
    core-01: Box Version: >= 308.0.1
==> core-01: Loading metadata for box 'http://beta.release.core-os.net/amd64-usr/current/coreos_production_vagrant.json'
    core-01: URL: http://beta.release.core-os.net/amd64-usr/current/coreos_production_vagrant.json
==> core-01: Adding box 'coreos-beta' (v607.0.0) for provider: virtualbox
    core-01: Downloading: http://beta.release.core-os.net/amd64-usr/607.0.0/coreos_production_vagrant.box
    core-01: Calculating and comparing box checksum...
==> core-01: Successfully added box 'coreos-beta' (v607.0.0) for 'virtualbox'!
==> core-01: Importing base box 'coreos-beta'...
==> core-01: Matching MAC address for NAT networking...
==> core-01: Checking if box 'coreos-beta' is up to date...
==> core-01: Setting the name of the VM: coreos-vagrant_core-01_1425861758586_22045
==> core-01: Clearing any previously set network interfaces...
==> core-01: Preparing network interfaces based on configuration...
    core-01: Adapter 1: nat
    core-01: Adapter 2: hostonly
==> core-01: Forwarding ports...
    core-01: 22 => 2222 (adapter 1)
==> core-01: Running 'pre-boot' VM customizations...
==> core-01: Booting VM...
==> core-01: Waiting for machine to boot. This may take a few minutes...
    core-01: SSH address: 127.0.0.1:2222
    core-01: SSH username: core
    core-01: SSH auth method: private key
    core-01: Warning: Connection timeout. Retrying...
==> core-01: Machine booted and ready!
==> core-01: Setting hostname...
==> core-01: Configuring and enabling network interfaces...
==> core-01: Running provisioner: file...
==> core-01: Running provisioner: shell...
    core-01: Running: inline script
==> core-02: Box 'coreos-beta' could not be found. Attempting to find and install...
    core-02: Box Provider: virtualbox
    core-02: Box Version: >= 308.0.1
==> core-02: Loading metadata for box 'http://beta.release.core-os.net/amd64-usr/current/coreos_production_vagrant.json'
    core-02: URL: http://beta.release.core-os.net/amd64-usr/current/coreos_production_vagrant.json
==> core-02: Adding box 'coreos-beta' (v607.0.0) for provider: virtualbox
==> core-02: Importing base box 'coreos-beta'...
==> core-02: Matching MAC address for NAT networking...
==> core-02: Checking if box 'coreos-beta' is up to date...
==> core-02: Setting the name of the VM: coreos-vagrant_core-02_1425861790309_80904
==> core-02: Fixed port collision for 22 => 2222. Now on port 2200.
==> core-02: Clearing any previously set network interfaces...
==> core-02: Preparing network interfaces based on configuration...
    core-02: Adapter 1: nat
    core-02: Adapter 2: hostonly
==> core-02: Forwarding ports...
    core-02: 22 => 2200 (adapter 1)
==> core-02: Running 'pre-boot' VM customizations...
==> core-02: Booting VM...
==> core-02: Waiting for machine to boot. This may take a few minutes...
    core-02: SSH address: 127.0.0.1:2200
    core-02: SSH username: core
    core-02: SSH auth method: private key
    core-02: Warning: Connection timeout. Retrying...
==> core-02: Machine booted and ready!
==> core-02: Setting hostname...
==> core-02: Configuring and enabling network interfaces...
==> core-02: Running provisioner: file...
==> core-02: Running provisioner: shell...
    core-02: Running: inline script
==> core-03: Box 'coreos-beta' could not be found. Attempting to find and install...
    core-03: Box Provider: virtualbox
    core-03: Box Version: >= 308.0.1
==> core-03: Loading metadata for box 'http://beta.release.core-os.net/amd64-usr/current/coreos_production_vagrant.json'
    core-03: URL: http://beta.release.core-os.net/amd64-usr/current/coreos_production_vagrant.json
==> core-03: Adding box 'coreos-beta' (v607.0.0) for provider: virtualbox
==> core-03: Importing base box 'coreos-beta'...
==> core-03: Matching MAC address for NAT networking...
==> core-03: Checking if box 'coreos-beta' is up to date...
==> core-03: Setting the name of the VM: coreos-vagrant_core-03_1425861823848_20722
==> core-03: Fixed port collision for 22 => 2222. Now on port 2201.
==> core-03: Clearing any previously set network interfaces...
==> core-03: Preparing network interfaces based on configuration...
    core-03: Adapter 1: nat
    core-03: Adapter 2: hostonly
==> core-03: Forwarding ports...
    core-03: 22 => 2201 (adapter 1)
==> core-03: Running 'pre-boot' VM customizations...
==> core-03: Booting VM...
==> core-03: Waiting for machine to boot. This may take a few minutes...
    core-03: SSH address: 127.0.0.1:2201
    core-03: SSH username: core
    core-03: SSH auth method: private key
    core-03: Warning: Connection timeout. Retrying...
==> core-03: Machine booted and ready!
==> core-03: Setting hostname...
==> core-03: Configuring and enabling network interfaces...
==> core-03: Running provisioner: file...
==> core-03: Running provisioner: shell...
    core-03: Running: inline script
Waiting for Vagrant cluster to be ready...
CoreOS Vagrant cluster is up

Configuring SSH
Identity added: /home/thecatwasnot/.vagrant.d/insecure_private_key (/home/thecatwasnot/.vagrant.d/insecure_private_key)

Starting paz runlevel 1 units
Unit paz-scheduler.service launched on 7641f8b0.../172.17.8.101
Unit paz-orchestrator.service launched on 53f5997f.../172.17.8.102
Unit paz-service-directory-announce.service launched on b9bc6257.../172.17.8.103
Unit paz-service-directory.service launched on b9bc6257.../172.17.8.103
Unit paz-scheduler-announce.service launched on 7641f8b0.../172.17.8.101
Unit paz-orchestrator-announce.service launched on 53f5997f.../172.17.8.102
Successfully started all runlevel 1 paz units on the cluster with Fleet
Waiting for runlevel 1 services to be activated...
Activating: 0 | Active: 6 | Failed: 0.  
All runlevel 1 units successfully activated!

Waiting for orchestrator, scheduler and service directory to be announced

Starting paz runlevel 2 units
Unit paz-web.service launched on 53f5997f.../172.17.8.102
Unit paz-web-announce.service launched on 53f5997f.../172.17.8.102
Successfully started all runlevel 2 paz units on the cluster with Fleet
Waiting for runlevel 2 services to be activated...
Activating: 0 | Active: 8 | Failed: 0...
All runlevel 2 units successfully activated!

You will need to add the following entries to your /etc/hosts:
172.17.8.101 paz-web.paz
172.17.8.101 paz-scheduler.paz
172.17.8.101 paz-orchestrator.paz
172.17.8.101 paz-orchestrator-socket.paz

Paz installation successful

I did edit /etc/hosts.
Fleet reports everything OK:

vagrant ssh core-01
CoreOS beta (607.0.0)
Update Strategy: No Reboots
core@core-01 ~ $ fleetctl list-units
UNIT                    MACHINE             ACTIVE  SUB
paz-orchestrator-announce.service   53f5997f.../172.17.8.102    active  running
paz-orchestrator.service        53f5997f.../172.17.8.102    active  running
paz-scheduler-announce.service      7641f8b0.../172.17.8.101    active  running
paz-scheduler.service           7641f8b0.../172.17.8.101    active  running
paz-service-directory-announce.service  b9bc6257.../172.17.8.103    active  running
paz-service-directory.service       b9bc6257.../172.17.8.103    active  running
paz-web-announce.service        53f5997f.../172.17.8.102    active  running
paz-web.service             53f5997f.../172.17.8.102    active  running

This morning I tried running the integration test:

./integration.sh 
Starting Paz integration test script
./integration.sh: line 18: checkRequiredEnvVars: command not found

Checking for existing Vagrant cluster

Creating a new Vagrant cluster
Cloning into 'coreos-vagrant'...
remote: Counting objects: 351, done.
remote: Total 351 (delta 0), reused 0 (delta 0), pack-reused 351
Receiving objects: 100% (351/351), 79.37 KiB | 0 bytes/s, done.
Resolving deltas: 100% (152/152), done.
Checking connectivity... done.
==> core-01: Checking for updates to 'coreos-beta'
    core-01: Latest installed version: 607.0.0
    core-01: Version constraints: >= 308.0.1
    core-01: Provider: virtualbox
==> core-01: Box 'coreos-beta' (v607.0.0) is running the latest version.
==> core-02: Checking for updates to 'coreos-beta'
    core-02: Latest installed version: 607.0.0
    core-02: Version constraints: >= 308.0.1
    core-02: Provider: virtualbox
==> core-02: Box 'coreos-beta' (v607.0.0) is running the latest version.
==> core-03: Checking for updates to 'coreos-beta'
    core-03: Latest installed version: 607.0.0
    core-03: Version constraints: >= 308.0.1
    core-03: Provider: virtualbox
==> core-03: Box 'coreos-beta' (v607.0.0) is running the latest version.
Bringing machine 'core-01' up with 'virtualbox' provider...
Bringing machine 'core-02' up with 'virtualbox' provider...
Bringing machine 'core-03' up with 'virtualbox' provider...
==> core-01: Importing base box 'coreos-beta'...
==> core-01: Matching MAC address for NAT networking...
==> core-01: Checking if box 'coreos-beta' is up to date...
==> core-01: Setting the name of the VM: coreos-vagrant_core-01_1425905935661_73514
==> core-01: Clearing any previously set network interfaces...
==> core-01: Preparing network interfaces based on configuration...
    core-01: Adapter 1: nat
    core-01: Adapter 2: hostonly
==> core-01: Forwarding ports...
    core-01: 22 => 2222 (adapter 1)
==> core-01: Running 'pre-boot' VM customizations...
==> core-01: Booting VM...
==> core-01: Waiting for machine to boot. This may take a few minutes...
    core-01: SSH address: 127.0.0.1:2222
    core-01: SSH username: core
    core-01: SSH auth method: private key
    core-01: Warning: Connection timeout. Retrying...
==> core-01: Machine booted and ready!
==> core-01: Setting hostname...
==> core-01: Configuring and enabling network interfaces...
==> core-01: Running provisioner: file...
==> core-01: Running provisioner: shell...
    core-01: Running: inline script
==> core-02: Importing base box 'coreos-beta'...
==> core-02: Matching MAC address for NAT networking...
==> core-02: Checking if box 'coreos-beta' is up to date...
==> core-02: Setting the name of the VM: coreos-vagrant_core-02_1425905966683_94058
==> core-02: Fixed port collision for 22 => 2222. Now on port 2200.
==> core-02: Clearing any previously set network interfaces...
==> core-02: Preparing network interfaces based on configuration...
    core-02: Adapter 1: nat
    core-02: Adapter 2: hostonly
==> core-02: Forwarding ports...
    core-02: 22 => 2200 (adapter 1)
==> core-02: Running 'pre-boot' VM customizations...
==> core-02: Booting VM...
==> core-02: Waiting for machine to boot. This may take a few minutes...
    core-02: SSH address: 127.0.0.1:2200
    core-02: SSH username: core
    core-02: SSH auth method: private key
    core-02: Warning: Connection timeout. Retrying...
==> core-02: Machine booted and ready!
==> core-02: Setting hostname...
==> core-02: Configuring and enabling network interfaces...
==> core-02: Running provisioner: file...
==> core-02: Running provisioner: shell...
    core-02: Running: inline script
==> core-03: Importing base box 'coreos-beta'...
==> core-03: Matching MAC address for NAT networking...
==> core-03: Checking if box 'coreos-beta' is up to date...
==> core-03: Setting the name of the VM: coreos-vagrant_core-03_1425905998600_89301
==> core-03: Fixed port collision for 22 => 2222. Now on port 2201.
==> core-03: Clearing any previously set network interfaces...
==> core-03: Preparing network interfaces based on configuration...
    core-03: Adapter 1: nat
    core-03: Adapter 2: hostonly
==> core-03: Forwarding ports...
    core-03: 22 => 2201 (adapter 1)
==> core-03: Running 'pre-boot' VM customizations...
==> core-03: Booting VM...
==> core-03: Waiting for machine to boot. This may take a few minutes...
    core-03: SSH address: 127.0.0.1:2201
    core-03: SSH username: core
    core-03: SSH auth method: private key
    core-03: Warning: Connection timeout. Retrying...
==> core-03: Machine booted and ready!
==> core-03: Setting hostname...
==> core-03: Configuring and enabling network interfaces...
==> core-03: Running provisioner: file...
==> core-03: Running provisioner: shell...
    core-03: Running: inline script
Waiting for Vagrant cluster to be ready...
CoreOS Vagrant cluster is up

Configuring SSH
Identity added: /home/thecatwasnot/.vagrant.d/insecure_private_key (/home/thecatwasnot/.vagrant.d/insecure_private_key)

Starting paz runlevel 1 units
Unit paz-scheduler.service launched on 14dbc022.../172.17.8.101
Unit paz-scheduler-announce.service launched on 14dbc022.../172.17.8.101
Unit paz-orchestrator.service launched on 4f6c57a6.../172.17.8.103
Unit paz-orchestrator-announce.service launched on 4f6c57a6.../172.17.8.103
Unit paz-service-directory.service launched on 2c75bccd.../172.17.8.102
Unit paz-service-directory-announce.service launched on 2c75bccd.../172.17.8.102
Successfully started all runlevel 1 paz units on the cluster with Fleet
Waiting for runlevel 1 services to be activated...
Activating: 0 | Active: 6 | Failed: 0.. 
All runlevel 1 units successfully activated!

Waiting for orchestrator, scheduler and service directory to be announced

Starting paz runlevel 2 units
Unit paz-web.service launched
Unit paz-web-announce.service launched on 14dbc022.../172.17.8.101
Successfully started all runlevel 2 paz units on the cluster with Fleet
Waiting for runlevel 2 services to be activated...
Activating: 1 | Active: 8 | Failed: 0...
All runlevel 2 units successfully activated!

You will need to add the following entries to your /etc/hosts:
172.17.8.101 paz-web.paz
172.17.8.101 paz-scheduler.paz
172.17.8.101 paz-orchestrator.paz
172.17.8.101 paz-orchestrator-socket.paz

Adding service to directory
{"doc":{"name":"demo-api","description":"Very simple HTTP Hello World server","dockerRepository":"lukebond/demo-api","config":{"publicFacing":false,"numInstances":3,"ports":[],"env":{}}}}
Deploying new service with the /hooks/deploy endpoint
{"statusCode":200}
Waiting for service to announce itself

This hung for hours (it was still waiting when I returned 8 hours later).
I've now also tried changing my version of etcdctl to match the one on CoreOS, and no joy.

paz-dnsmasq container not removed on failure

If paz-dnsmasq fails to start for whatever reason, or has previously stopped (e.g. reboot), it can't be restarted later because there is no docker rm in the unit file.

Fix incoming...

Split unit files up into chained steps

I picked up this tip from the Giantswarm guys.

Currently our unit files do docker pull, docker kill, docker run, docker stop etc., all in messy multi-line bash statements in one unit file. If we separate these into separate unit files (e.g. one for pulling, one for starting, one for stopping) and chain them together with systemd Requires/After directives, we get neater unit files and it becomes easier to insert steps in between, such as mounting an EBS volume or starting/connecting to a Weave network. It will also unroll on stop/kill, disconnecting these things in reverse order.
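A minimal sketch of the chained-unit idea (unit names and image are hypothetical): a oneshot "pull" unit that the "run" unit requires and orders itself after.

# demo-api-pull.service
[Unit]
Description=Pull the demo-api image
Requires=docker.service
After=docker.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/docker pull lukebond/demo-api

# demo-api-run.service
[Unit]
Description=Run demo-api
Requires=demo-api-pull.service
After=demo-api-pull.service

[Service]
ExecStart=/usr/bin/docker run --rm --name demo-api -P lukebond/demo-api
ExecStop=/usr/bin/docker stop demo-api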

Implement service watcher that can start/remove units

We need a little service that watches what's running in the cluster and:

...will start a new service if:

  • there are fewer instances of a service running compared to what is declared in the service directory (e.g. one died and didn't restart)

...will stop services if:

  • all instances of a newer version of the service have been deployed and are healthy, so the old ones can be killed off

This is something like the Kubernetes replication controller.
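A very rough shell sketch of the reconciliation loop being proposed. The directory endpoints, response fields and templated unit naming (<name>@<n>.service) are all assumptions rather than Paz's actual API, and error handling is omitted.

while true; do
  for svc in $(curl -s http://paz-service-directory.paz/services | jq -r '.[].name'); do
    want=$(curl -s "http://paz-service-directory.paz/services/$svc" | jq -r '.config.numInstances')
    have=$(fleetctl list-units -no-legend | grep -c "^${svc}@.*running")
    if [ "$have" -lt "$want" ]; then
      fleetctl start "${svc}@$((have + 1)).service"   # start one missing instance
    fi
  done
  sleep 30
done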

Use Fleet machine metadata for "environments"

Let's say you want dev, QA, staging and production clusters. Rather than running multiple Paz clusters, they could be one cluster that uses Fleet machine metadata to schedule units only onto hosts tagged with the matching environment.

e.g. with 4 environments of 3 nodes each, you might have the following metadata:

Host    Name      Metadata
host1   dev1      environment=dev
host2   dev2      environment=dev
host3   dev3      environment=dev
host4   qa1       environment=qa
host5   qa2       environment=qa
host6   qa3       environment=qa
host7   staging1  environment=staging
host8   staging2  environment=staging
host9   staging3  environment=staging
host10  prod1     environment=prod
host11  prod2     environment=prod
host12  prod3     environment=prod

More can be read about Fleet scheduling with metadata here: https://coreos.com/docs/launching-containers/launching/launching-containers-fleet/#schedule-based-on-machine-metadata
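A minimal sketch of how that looks with standard Fleet configuration: tag each host via its cloud-config, then pin units to an environment in their [X-Fleet] section.

# In each host's cloud-config:
#cloud-config
coreos:
  fleet:
    metadata: environment=qa

# In a unit file, restrict scheduling to that environment:
[X-Fleet]
MachineMetadata=environment=qa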

Credit to @rimusz for the idea.

Getting "Deploy failed" when creating any service...

Hi,

Trying to test out paz, but when I try adding any service, prior to adding an app, I just get "Deploy Failed".

I am trying the Vagrant cluster. It looks like things installed well: I added the entries to hosts and accessed the web panel without issue. But when I try to add the demo-api service from the docs, or the registry container (to make a private registry in the cluster), I get a "deploy failed" error immediately as I click the deploy button...

fleetctl list-units looks like everything is up everywhere?

$ fleetctl -strict-host-key-checking=false -endpoint=http://172.17.8.101:4001 list-units
UNIT                    MACHINE             ACTIVE  SUB
paz-orchestrator-announce.service   6bf8fd0d.../172.17.8.101    active  running
paz-orchestrator.service        6bf8fd0d.../172.17.8.101    active  running
paz-scheduler-announce.service      f069df3f.../172.17.8.102    active  running
paz-scheduler.service           f069df3f.../172.17.8.102    active  running
paz-service-directory-announce.service  6bf8fd0d.../172.17.8.101    active  running
paz-service-directory.service       6bf8fd0d.../172.17.8.101    active  running
paz-web-announce.service        51e62345.../172.17.8.103    active  running
paz-web.service             51e62345.../172.17.8.103    active  running

PS: what is the "public facing" setting that appears when we choose a service? (I tried both true and false with the same results.) It would probably make sense to add a line to http://paz.readme.io/v1.0/docs/deploying-your-first-application-using-paz explaining what "public facing" does.

Improve documentation

The following needs improvement:

  • What is Paz?
  • Getting started / installation
  • How to deploy stuff via CI, Docker hub etc.
  • Technical detail on how it works under the hood

How to install PAZ in Azure Cloud

I created a CoreOS cluster with 3 nodes. I am able to manually run services and use Docker builds. Can anyone help me to install Paz? One more question: is Paz production-ready now?

@lukebond: I came to know about Paz from your London presentation. Can you please suggest something?

Etcd Unavailable when bootstrapping with vagrant

I ran into an issue when bootstrapping a vagrant coreos cluster with paz. Specifically, when running the script install-vagrant.sh, etcd was not available in time.

As a hack, I added a 5 second sleep right before launchAndWaitForUnits which "fixed" the issue on my machine.
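A slightly less brittle alternative to a fixed sleep would be to poll etcd until it responds before submitting units, e.g. (the IP and port match the Vagrant examples used elsewhere on this page):

until curl -fs http://172.17.8.101:4001/version >/dev/null; do
  echo "Waiting for etcd..."
  sleep 1
done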

Investigate running Paz's internal services under rkt

Suggestion from @rimusz

It would make images smaller and startup faster, make Paz more reliable (knocking out the Docker daemon currently kills everything, even the management plane) and be a good way to get started with rkt, with a view to supporting it in the future for user services.

Installation doesn't fail if there is no Internet connection

$ ./integration.sh
Starting Paz integration test script

Checking for existing Vagrant cluster

Creating a new Vagrant cluster
Cloning into 'coreos-vagrant'...
fatal: unable to access 'https://github.com/coreos/coreos-vagrant/': Could not resolve host: github.com
../scripts/helpers.sh: line 35: cd: coreos-vagrant: Not a directory
Can't open config.rb.sample: No such file or directory.
A Vagrant environment or target machine is required to run this
command. Run `vagrant init` to create a new Vagrant environment. Or,
get an ID of a target machine from `vagrant global-status` to run
this command on. A final option is to change to a directory with a
Vagrantfile and to try again.
A Vagrant environment or target machine is required to run this
command. Run `vagrant init` to create a new Vagrant environment. Or,
get an ID of a target machine from `vagrant global-status` to run
this command on. A final option is to change to a directory with a
Vagrantfile and to try again.
Waiting for Vagrant cluster to be ready...
CoreOS Vagrant cluster is up
mkdir: unitfiles: File exists
cp: ../unitfiles/*: No such file or directory
mkdir: scripts: File exists
cp: ../scripts/start-runlevel.sh: No such file or directory

...and so on. It should just be a matter of adding -e to the shebang line in integration.sh.
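Concretely, that change (or the equivalent set -e near the top of the script) looks like this:

#!/bin/bash -e
# ...or, equivalently, near the top of integration.sh:
set -e   # abort the script as soon as any command fails (e.g. the failed git clone above)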

Roadmap

This is a place to discuss the short-to-medium term roadmap. Let's aim to distil it to a list of a handful of items that are doable in a month or two.

To get us started:

  1. CLI
  2. Monitoring with Heapster, InfluxDB and Grafana
  3. Centralised logging
  4. Something to observe what's running like Kubernetes' Replication Controller
  5. Deployment history in the UI somewhere
  6. Use Weave & WeaveDNS to simplify the complex HAProxy plumbing and service discovery
  7. Proper test runner/framework (maybe Gulp)

Implement CLI

Let's discuss the scope and features of the Paz command-line interface functionality.

Naturally, it needs to be called paz.sh :)

Proposed functionality:

  • Installation:
    • Provision a Vagrant/VirtualBox cluster running Paz
    • Provision a DigitalOcean cluster running Paz
  • Register SSH keys
  • List status of Paz internal units
  • List declared services
  • Show status of running services
  • Add/edit/delete/scale services
  • Show status of hosts
  • Administer Paz configuration (e.g. domain/DNS)
  • Scale the cluster

Is this project active?

Hi, I would like to ask if this project is still active, since I am evaluating Docker-related PaaS options. Thanks.

Implement an out-of-the-box monitoring solution

Let's discuss what will become the out-of-the-box monitoring solution for Paz.

Currently we're using cAdvisor from the Kubernetes project. This is a good solution, but used in isolation it is limited because it doesn't store or let you search historical data.

Heapster is the evolution of this project and builds upon cAdvisor to provide a cluster-aware, searchable cAdvisor (effectively). At first glance it appears to be a good solution.

I'm open to anything else; this is not my area of expertise.

Discuss?

Exporting/Importing cluster configuration (templating)

I started playing around with Paz, and I am finding it very useful for some side projects I am working on.
One thing I am missing is the ability to export the current cluster configuration to an external file, just to be able to reimport it later instead of recreating everything anew via the UI.
The feature should be available both from the UI and via the API.
I will try to implement it this weekend, so wish me luck!
Any tips on where I should start?

Investigate simplifying, changing or replacing the Etcd/HAProxy service discovery layer

It's currently a bit brittle, and it's difficult to understand and remember how it works.

Some options:

  • Weave
  • Kubernetes
  • Keep what we have

For me Weave would be really helpful, but there are multiple ways we could use it and we should have a discussion with the Weave developers about this. Some options:

  • Use Weave to give each container a unique IP address and dispense with random Docker ports, simplifying the existing HAProxy "magic" we have
  • Use WeaveDNS as the service discovery mechanism and ditch HAProxy/Confd altogether. This would greatly simplify our stack, leaving the hard networking issues to the experts, at the cost of losing the ability to leverage HAProxy's zero-downtime-deployment features (which we're not really utilising yet, until #32 is done)

Access web app

I have installed Paz on DigitalOcean, but how do I access the web app? Entering the IP directly in the URL (port 80) doesn't work. Tested on stable, beta and alpha versions of CoreOS.

paz.sh

Currently the paz.sh website is still an email-capture page with a few descriptions of Paz. Will this be open-sourced so we can update the site with more up-to-date details and documentation? =D

Adding custom entries to haproxy is problematic

Currently the HAProxy confd template is generated by run.sh in the haproxy container. This means that the only way to change the template is to eclipse that script with your own that generates a different config file. This works fine, but has the unfortunate side-effect that the template only updates when the container is restarted.

This is just a quality of life type of thing.

Split up install and integration test scripts into usable pieces

Currently, once you've installed Paz you have no way of reinstalling it or fixing it if it fails without tearing it down and starting again (which takes ages), apart from SSHing in and fixing it manually, of course.

Split the installation script up into separate pieces: set up the cluster with Vagrant & cloud-config, tear down the cluster, install/reinstall units, and wait for units to start.

Re-running Paz cluster

Right now there's no way (that I'm aware of) of running Paz other than executing the install-vagrant.sh script, which destroys the current cluster and creates a new one (which takes its time).

Running vagrant up in the coreos-vagrant folder gives me random Connection timeout messages, and even when I can start the cluster without any apparent problems, SSHing into each machine shows:

Failed Units: 2
  cadvisor.service
  paz-dnsmasq.service

Implement centralised logging

We want all logs for all services to be tail-able, together and individually, from the command line, and also displayable within the UI. Searching would also be good.

From @sublimino in #19:

one-command cluster monitoring (i.e. journal -f on all units, got some fleet jiggerypokery to do this as there are silly TTY complications)

Is https://github.com/gliderlabs/logspout an option?

create-swap.service failed and how I solved

I have used the minimum DigitalOcean droplet (512 MB RAM), and the default userdata always gives me this error:
[screenshot: create-swap.service failed]

I solved it by changing both occurrences of Environment="SWAPFILE=/2GiB.swap" to 1GiB, and ExecStart=/usr/bin/fallocate -l 2048m ${SWAPFILE} to 1024m, in the userdata.
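The relevant userdata lines after the change would look like this (the rest of the create-swap.service unit is unchanged):

Environment="SWAPFILE=/1GiB.swap"
ExecStart=/usr/bin/fallocate -l 1024m ${SWAPFILE}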

installation

I'm going to do an installation and try to go by the book instead of guessing, so that onboarding new developers gets easier.

Services don't (always?) automatically restart when they stop

If a host is taken out, it's Fleet's responsibility to reschedule. If a container dies, it's systemd's responsibility. Therefore, investigate systemd unit files for internal Paz services and ensure they are configured to automatically restart when they exit.
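The standard systemd mechanism for this is a restart policy in each unit's [Service] section; a minimal sketch (values are suggestions, not what Paz currently ships):

[Service]
Restart=always
RestartSec=10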

nginx proxy in front of paz?

I have a historical nginx setup which proxies all my servers. What I do is publish into etcd, and I have confd watching that and writing out my nginx config file. I do this to keep requests to certain services locked to internal access only, while other services are public. I think this roughly matches what HAProxy does here, but just as a stop-gap until I convert over I was planning on using nginx in front of Paz (HAProxy).

This seems to work, but I do see some issues. The first is that occasionally the page refresh fails, and the second is that the services tab just errors. Looking through the network requests I was able to find that I needed to expose paz-web, paz-orchestrator and paz-orchestrator-socket. I also found that I needed to pass WebSocket connections with:

proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_http_version 1.1;

But I'm not sure where to begin to find out why things are still failing.

Also, please advise if it would just be easier to convert my services to HAProxy; I'm not against that at all. I am concerned about the availability of HAProxy, but I assume I can add in some IP restrictions for the proxied sites?
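For what it's worth, a hedged sketch of an nginx server block for the orchestrator WebSocket host might look like this; the server_name and upstream address are assumptions based on the /etc/hosts entries from the install output, not a tested configuration.

server {
    listen 80;
    server_name paz-orchestrator-socket.paz;

    location / {
        proxy_pass http://172.17.8.101;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;   # HAProxy routes on the Host header
    }
}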

Bare metal scripts and documentation

I'm interested in trying paz, but I have an existing coreos cluster on bare metal. I assume I just need to wget some unit files to pull down and run paz, but I don't see anything in the documentation about this.

It seems as simple as cloning the repo and running:

scripts/start-runlevel.sh 1 && scripts/start-runlevel.sh 2

but I would expect some documentation if it were that simple. Is the documentation just missing for bare metal?

Slow announcer for orchestrator

$ fleetctl -strict-host-key-checking=false -endpoint=http://172.17.8.101:4001 journal paz-orchestrator-announce.service
Feb 26 17:38:59 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory
Feb 26 17:39:00 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory
Feb 26 17:39:01 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory
Feb 26 17:39:02 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory
Feb 26 17:39:03 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory
Feb 26 17:39:04 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory
Feb 26 17:39:05 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory
Feb 26 17:39:06 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory
Feb 26 17:39:07 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory
Feb 26 17:39:08 core-01 sh[1195]: grep: HostIp:0.0.0.0: No such file or directory

However this timed out and then:

Feb 26 17:41:31 core-01 systemd[1]: paz-orchestrator-announce.service start-pre operation timed out. Terminating.
Feb 26 17:41:31 core-01 systemd[1]: Failed to start paz-orchestrator announce.
Feb 26 17:41:31 core-01 systemd[1]: Unit paz-orchestrator-announce.service entered failed state.
Feb 26 17:41:31 core-01 systemd[1]: paz-orchestrator-announce.service failed.
Feb 26 17:41:31 core-01 systemd[1]: paz-orchestrator-announce.service holdoff time over, scheduling restart.
Feb 26 17:41:31 core-01 systemd[1]: Stopping paz-orchestrator announce...
Feb 26 17:41:31 core-01 systemd[1]: Starting paz-orchestrator announce...
Feb 26 17:41:31 core-01 sh[26833]: Waiting for 49153/tcp...
Feb 26 17:41:31 core-01 systemd[1]: Started paz-orchestrator announce.
Feb 26 17:41:31 core-01 sh[26857]: Connected to 172.17.8.101:49153/tcp and 172.17.8.101:49154, publishing to etcd..

However the service seemed to be up and running:

$ fleetctl -strict-host-key-checking=false -endpoint=http://172.17.8.101:4001 journal paz-orchestrator.service
-- Logs begin at Thu 2015-02-26 16:40:27 UTC, end at Thu 2015-02-26 17:44:04 UTC. --
Feb 26 17:05:35 core-01 systemd[1]: Started paz-orchestrator: Main API for all paz services and monitor of services in etcd..
Feb 26 17:05:37 core-01 bash[16821]: {}
Feb 26 17:05:37 core-01 bash[16821]: { disabled: 'true',
Feb 26 17:05:37 core-01 bash[16821]: provider: 'dnsimple',
Feb 26 17:05:37 core-01 bash[16821]: email: '[email protected]',
Feb 26 17:05:37 core-01 bash[16821]: apiKey: '312487532487',
Feb 26 17:05:37 core-01 bash[16821]: domain: 'paz' }
Feb 26 17:05:37 core-01 bash[16821]: {"name":"paz-orchestrator_log","hostname":"1add37e0c392","pid":9,"level":30,"msg":"Starting server","time":"2015-02-26T17:05:37.887Z","src":{"file":"/usr/src/app/server.js","line":205},"v":0}
Feb 26 17:05:37 core-01 bash[16821]: {"name":"paz-orchestrator_log","hostname":"1add37e0c392","pid":9,"level":30,"msg":"paz-orchestrator is now running on port 9000","time":"2015-02-26T17:05:37.921Z","src":{"file":"/usr/src/app/server.js","line":194},"v":0}
Feb 26 17:05:37 core-01 bash[16821]: {"name":"paz-orchestrator_log","hostname":"1add37e0c392","pid":9,"level":30,"svcdir-url":"http://paz-service-directory.paz","msg":"","time":"2015-02-26T17:05:37.921Z","src":{"file":"/usr/src/app/server.js","line":195},"v":0}
