nchammas / flintrock Goto Github PK

View Code? Open in Web Editor NEW

631.0 33.0 114.0 778 KB

A command-line tool for launching Apache Spark clusters.

License: Apache License 2.0

Shell 1.90% Python 95.32% HCL 2.78%

apache-spark ec2 apache-spark-cluster orchestration spark-ec2

flintrock's Introduction

Flintrock is a command-line tool for launching Apache Spark clusters.

Flintrock around the web

Flintrock has been featured in a few talks, guides, and papers around the web.

Talks:
- Flintrock: A faster, better spark-ec2 (slides)
Guides:
- Running Spark on a Cluster: The Basics (using Flintrock)
  - Part 1: Start a Spark Cluster and Use the spark-shell
  - Part 2: Dependencies, S3, and Deploying via spark-submit
- Spark with Jupyter on AWS
- Building a data science platform for R&D, part 2 – Deploying Spark on AWS using Flintrock
- AWS EC2를 활용 스파크 클러스터 생성
Papers:
- "Birds in the Clouds": Adventures in Data Engineering

Usage

Here's a quick way to launch a cluster on EC2, assuming you already have an AWS account set up. Flintrock works best with Amazon Linux. You can get the latest AMI IDs from here.

flintrock launch test-cluster \
    --num-slaves 1 \
    --spark-version 3.5.0 \
    --ec2-key-name key_name \
    --ec2-identity-file /path/to/key.pem \
    --ec2-ami ami-0588935a949f9ff17 \
    --ec2-user ec2-user

If you persist these options to a file, you'll be able to do the same thing much more concisely:

flintrock configure
# Save your preferences via the opened editor, then...
flintrock launch test-cluster

Once you're done using a cluster, don't forget to destroy it with:

flintrock destroy test-cluster

Other things you can do with Flintrock include:

flintrock login test-cluster
flintrock describe test-cluster
flintrock add-slaves test-cluster --num-slaves 2
flintrock remove-slaves test-cluster --num-slaves 1
flintrock run-command test-cluster 'sudo yum install -y package'
flintrock copy-file test-cluster /local/path /remote/path

To see what else Flintrock can do, or to see detailed help for a specific command, try:

flintrock --help
flintrock <subcommand> --help

That's not all. Flintrock has a few more features that you may find interesting.

Accessing data on S3

We recommend you access data on S3 from your Flintrock cluster by following these steps:

Setup an IAM Role that grants access to S3 as desired. Reference this role when you launch your cluster using the --ec2-instance-profile-name option (or its equivalent in your config.yaml file).
Reference S3 paths in your Spark code using the s3a:// prefix. s3a:// is backwards compatible with s3n:// and replaces both s3n:// and s3://. The Hadoop project recommends using s3a:// since it is actively developed, supports larger files, and offers better performance.
Make sure Flintrock is configured to use Hadoop/HDFS 2.7+. Earlier versions of Hadoop do not have solid implementations of s3a://. Flintrock's default is Hadoop 3.3.6, so you don't need to do anything here if you're using a vanilla configuration.
Call Spark with the hadoop-aws package to enable s3a://. For example:
```
spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.6 my-app.py
pyspark --packages org.apache.hadoop:hadoop-aws:3.3.6
```
If you have issues using the package, consult the hadoop-aws troubleshooting guide and try adjusting the version. As a rule of thumb, you should match the version of hadoop-aws to the version of Hadoop that Spark was built against (which is typically Hadoop 3.2 or 2.7), even if the version of Hadoop that you're deploying to your Flintrock cluster is different.

With this approach you don't need to copy around your AWS credentials or pass them into your Spark programs. As long as the assigned IAM role allows it, Spark will be able to read and write data to S3 simply by referencing the appropriate path (e.g. s3a://bucket/path/to/file).

Installation

Before using Flintrock, take a quick look at the copyright notice and license and make sure you're OK with their terms.

Flintrock requires Python 3.8 or newer, unless you are using one of our standalone packages. Flintrock has been thoroughly tested only on OS X, but it should run on all POSIX systems. A motivated contributor should be able to add Windows support without too much trouble, too.

Release version

To get the latest release of Flintrock, simply install it with pip.

Since Flintrock is a command-line application rather than a library, you may prefer to install it using pipx, which automatically takes care of installing Flintrock to an isolated virtual environment for you.

pipx install flintrock

This will install Flintrock and place it on your path. You should be good to go now!

You'll probably want to get started with the following two commands:

flintrock --help
flintrock configure

Standalone version (Python not required!)

We used to publish standalone versions of Flintrock that don't require you to have Python installed on your machine. Since Flintrock 2.1.0, we have stopped publishing these standalone builds.

If you used these standalone packages, please chime in on this issue and share a bit about your environment and use case.

Community-supported distributions

Flintrock is also available via the following package managers:

Homebrew: brew install flintrock

These packages are not supported by the core contributors and may be out of date. Please reach out to the relevant communities directly if you have trouble using these distributions to install Flintrock. You can always find the latest release of Flintrock on GitHub and on PyPI.

Development version

If you like living on the edge, install the development version of Flintrock:

pipx install git+https://github.com/nchammas/flintrock

If you want to contribute, follow the instructions in our contributing guide on how to install Flintrock.

Use Cases

Experimentation

If you want to play around with Spark, develop a prototype application, run a one-off job, or otherwise just experiment, Flintrock is the fastest way to get you a working Spark cluster.

Performance testing

Flintrock exposes many options of its underlying providers (e.g. EBS-optimized volumes on EC2) which makes it easy to create a cluster with predictable performance for Spark performance testing.

Automated pipelines

Most people will use Flintrock interactively from the command line, but Flintrock is also designed to be used as part of an automated pipeline. Flintrock's exit codes are carefully chosen; it offers options to disable interactive prompts; and when appropriate it prints output in YAML, which is both human- and machine-friendly.

Anti-Use Cases

There are some things that Flintrock specifically does not support.

Managing permanent infrastructure

Flintrock is not for managing long-lived clusters, or any infrastructure that serves as a permanent part of some environment.

For starters, Flintrock provides no guarantee that clusters launched with one version of Flintrock can be managed by another version of Flintrock, and no considerations are made for any long-term use cases.

If you are looking for ways to manage permanent infrastructure, look at tools like Terraform, Ansible, or Ubuntu Juju. You might also find a service like Databricks useful if you're looking for someone else to host and manage Spark for you. Amazon also offers Spark on EMR.

Launching non-Spark-related services

Flintrock is meant for launching Spark clusters that include closely related services like HDFS.

Flintrock is not for launching external datasources (e.g. Cassandra), or other services that are not closely integrated with Spark (e.g. Tez).

If you are looking for an easy way to launch other services from the Hadoop ecosystem, look at the Apache Bigtop project.

Launching out-of-date services

Flintrock will always take advantage of new features of Spark and related services to make the process of launching a cluster faster, simpler, and easier to maintain. If that means dropping support for launching older versions of a service, then we will generally make that tradeoff.

Features

Polished CLI

Flintrock has a clean command-line interface.

flintrock --help
flintrock describe
flintrock destroy --help
flintrock launch test-cluster --num-slaves 10

Configurable CLI Defaults

Flintrock lets you persist your desired configuration to a YAML file so that you don't have to keep typing out the same options over and over at the command line.

To setup and edit the default config file, run this:

flintrock configure

You can also point Flintrock to a non-default config file by using the --config option.

Sample `config.yaml`

provider: ec2

services:
  spark:
    version: 3.5.0

launch:
  num-slaves: 1

providers:
  ec2:
    key-name: key_name
    identity-file: /path/to/.ssh/key.pem
    instance-type: m5.large
    region: us-east-1
    ami: ami-0588935a949f9ff17
    user: ec2-user

With a config file like that, you can now launch a cluster with just this:

flintrock launch test-cluster

And if you want, you can even override individual options in your config file at the command line:

flintrock launch test-cluster \
    --num-slaves 10 \
    --ec2-instance-type r5.xlarge

Fast Launches

Flintrock is really fast. It can launch a 100-node cluster in about three minutes (give or take a few seconds due to AWS's normal performance variability).

Advanced Storage Setup

Flintrock automatically configures any available ephemeral storage on the cluster and makes it available to installed services like HDFS and Spark. This storage is fast and is perfect for use as a temporary store by those services.

Tests

Flintrock comes with a set of automated, end-to-end tests. These tests help us develop Flintrock with confidence and guarantee a certain level of quality.

Low-level Provider Options

Flintrock exposes low-level provider options (e.g. instance-initiated shutdown behavior) so you can control the details of how your cluster is setup if you want.

No Custom Machine Image Dependencies

Flintrock is built and tested against vanilla Amazon Linux and CentOS. You can easily launch Flintrock clusters using your own custom machine images built from either of those distributions.

Anti-Features

Support for out-of-date versions of Python, EC2 APIs, etc.

Supporting multiple versions of anything is tough. There's more surface area to cover for testing, and over the long term the maintenance burden of supporting something non-current with bug fixes and workarounds really adds up.

There are projects that support stuff across a wide cut of language or API versions. For example, Spark supports multiple versions of Java, Scala, R, and Python. The people behind these projects are gods. They take on an immense maintenance burden for the benefit and convenience of their users.

We here at project Flintrock are much more modest in our abilities. We are best able to serve the project over the long term when we limit ourselves to supporting a small but widely applicable set of configurations.

Motivation

Note: The explanation here is provided from the perspective of Flintrock's original author, Nicholas Chammas.

I got started with Spark by using spark-ec2. It's one of the biggest reasons I found Spark so accessible. I didn't need to spend time upfront working through some setup guide before I could work on a "real" problem. Instead, with a simple spark-ec2 command I was able to launch a large, working cluster and get straight to business.

As I became a heavy user of spark-ec2, several limitations stood out and became an increasing pain. They provided me with the motivation for this project.

Among those limitations, the most frustrating ones were:

Slow launches: spark-ec2 cluster launch times increase linearly with the number of slaves being created. For example, it takes spark-ec2 over an hour to launch a cluster with 100 slaves. (SPARK-4325, SPARK-5189)
No support for configuration files: spark-ec2 does not support reading options from a config file, so users are always forced to type them in at the command line. (SPARK-925)
Un-resizable clusters: Adding or removing slaves from an existing spark-ec2 cluster is not possible. (SPARK-2008)
Custom machine images: spark-ec2 uses custom machine images, making it difficult for users to bring their own image. And since the process of updating those machine images is not automated, they have not been updated in years. (SPARK-3821)

I built Flintrock to address all of these shortcomings, which it does.

Why build Flintrock when we have EMR?

I started work on Flintrock months before EMR added support for Spark. It's likely that, had I considered building Flintrock a year later than I did, I would have decided against it.

Now that Flintrock exists, many users appreciate the lower cost of running Flintrock clusters as compared to EMR, as well as Flintrock's simpler interface. And for my part, I enjoy working on Flintrock in my free time.

Why didn't you build Flintrock on top of an orchestration tool?

People have asked me whether I considered building Flintrock on top of Ansible, Terraform, Docker, or something else. I looked into some of these things back when Flintrock was just an idea in my head and decided against using any of them for two basic reasons:

Fun: I didn't have any experience with these tools, and it looked both simple enough and more fun to build something "from scratch".
Focus: I wanted a single-purpose tool with a very limited focus, not a module or set of scripts that were part of a sprawling framework that did a lot of different things.

These are not necessarily the right reasons to build "from scratch", but they were my reasons. If you are already comfortable with any of the popular orchestration tools out there, you may find it more attractive to use them rather than add a new standalone tool to your toolchain.

About the Flintrock Logo

The Flintrock logo was created using Highbrow Cafetorium JNL and this icon. Licenses to use both the font and icon were purchased from their respective owners.

flintrock's People

Contributors

Stargazers

Watchers

Forkers

broxtronix ericmjonas gitter-badger engrean agilemobiledev neilkdev waterytowers marcuscollins ldolberg adotmob revpoint pletelli asafcombo silky aimran sgvandijk serialx jperezdiaz pwangjing hahnicity devsisters ereed-tesla vaishaal arokem jsensarma scotthb pragnesh guilherme-pombo chaordic eshioji lwoloszy tomerk umit lfreeman seufagner akaltsikis park-seungho tahirh ncherel pilgrimkst himansh1306 perados yuj sasikanth-lanka novelari jeffols jinwg josesaribeiro mizie tdsmith heathermiller capdevc rlaabs gdtm86 jungrae-prestolabs dircsem boechat107 sarunar milandhore mmmika lambdaml jitkasempin audreyaltman rahulmod keerath ilovejs joyoyoyoyoyo liangyali pranoot mozachiu tvision-insights dirtscraper rluta devraj21 sfcoy jcabraham kamulau prashantsinghsavvy mdgreenwald bwarminski cloudcomputingcourse iskibinskahfa ralfca pombredanne matthewfranglen thekaar cloud-service-implementation-and-sdk mukaiguy sanjoy-bose malcolmgreaves jreissup veepionyc apronoob88 thecodemasterk amitbansal26 chadac antoinerondelet luhhujbb stjordanis parkopedia

flintrock's Issues

Support launching clusters into private VPCs

Some users work in environments where they want to or have to launch clusters into VPCs with no public subnets.

This seems to be a fairly common use-case, so I think we should support it if it does not add too much complexity.

A design goal for this feature should be to automatically do the appropriate setup whether the subnet we are launching into is public or private. That means, if possible, the user shouldn't have to specify anything.

I think this is possible because we know what VPC we're launching into (either the user's default VPC, or an explicitly specified one), and we know what subnet we're launching into (either the VPC's default subnet, or an explicitly specified one). Flintrock should be able to query AWS for information about the subnet and figure out whether to use public or private addresses automatically.

Raise exceptions instead of sys.exiting in various places

I believe the only place we should explicitly be calling sys.exit() is in our "main" block.

Methods should just raise exceptions when they have issues, and exception-handling code blocks can do their thing before re-raising the exception. A premature sys.exit() would potentially interrupt some exception-handling code higher up in the stack.

Once we know there is no more exception handling possible -- which is in our "main" block -- we can safely exit with an appropriate exit code.

More intuitive error messages for configuration file problems

I'm just returning to using flintrock, and I recloned and edited a new config.yaml, and recived several errors along the way.

Identity file check

the identity file field does not recognize the unix homedir convention of ~/.ssh/foo.bar.baz but will still launch the cluster, and the result can be confusing. Can we explicitly test for the existence of the file?

region / ami mismatch

by default you might want to change a region, but it's not obvious you have to also change your ami and when you run you receive this message:

boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidAMIID.NotFound</Code><Message>The image id '[ami-61bbf104]' does not exist</Message></Error></Errors><RequestID>f27df620-5903-4403-b1e1-77309b265e86</RequestID></Response>

This is a little bit more helpful, but still a bit confusing.

username / ami mismatch

Finally, if you use the wrong user with your ami everything works until flintrock tries to ssh in and you get this confusing paramiko error:

Exception: Error reading SSH protocol banner[Errno 54] Connection reset by peer
Traceback (most recent call last):
  File "/Users/jonas/anaconda/envs/py3k/lib/python3.5/site-packages/paramiko/transport.py", line 1724, in _check_banner
    buf = self.packetizer.readline(timeout)
  File "/Users/jonas/anaconda/envs/py3k/lib/python3.5/site-packages/paramiko/packet.py", line 326, in readline
    buf += self._read_timeout(timeout)
  File "/Users/jonas/anaconda/envs/py3k/lib/python3.5/site-packages/paramiko/packet.py", line 479, in _read_timeout
    x = self.__socket.recv(128)
ConnectionResetError: [Errno 54] Connection reset by peer

Enforce CLI option dependencies in a user-friendly way

If you are launching clusters into EC2 (i.e. when --provider is set to ec2), then --ec2-region and --ec2-ami are required.

At the moment we don't enforce this dependency directly in the CLI, partly because Click doesn't directly support option dependencies like this. So if you forget to specify --ec2-ami, it gets caught somewhere in the bowels of Flintrock as opposed to at the front gate, and some ugly error gets barfed out.

This is a user experience problem, and it will get worse when we add support for other providers like GCE since there will be more opportunities for users to forget to specify required options (hah!).

We should fix this.

Flesh out Flintrock library interface

I love flintrock and here in the @amplab we're evaluating it as a replacement for the standard spark ec2 scripts. Are there any plans to make it available as a library? I'd love to be able to

import flintrock
fc = flintrock.clutser("fooconfig.yaml")
fc.get_master()
fc.get_slaves()

etc for integration into fabric (http://www.fabfile.org/ ) scripts, which we're starting to use extensively to automate devops things.

Properly reconfigure Spark on cluster restart

If you stop and then start a Flintrock cluster you need to do a little work to make sure that Spark comes back up correctly.

During cluster launch Spark is installed via class methods. These methods need to be refined a bit to cover this case (and eventually, related cases like adding or removing nodes).

Allow users to specify instance type (e.g. m3.xlarge) and region (e.g. us-east-1) without having to specify an AMI

It's cumbersome to have to look up the right AMI to use when you don't really care about it.

Many users will just want something that works, and they will care more about the instance type, which defines how much power they get out of the cluster, and the region, which affects their latency and which they may have restricted choices on (e.g. due to regulations, company policy, etc.).

It would be good to follow spark-ec2's example here and make it so that an appropriate AMI is automatically chosen by Flintrock given an instance type and region.

We can do this by having a separate data file (probably in YAML) with mappings of AWS region to AMI. We could probably also get away with just doing this for HVM AMIs, since Amazon seems to be positioning them as the future.

Parallelize tests

This is a follow-up to #58, where we found that we need some work done to xdist to make it usable for running Flintrock's acceptance tests in parallel.

Once that work is done, we'll be able to add a few options here and we should be all set.

Depends on pytest-dev/pytest-xdist#18.

Testing on travis with live aws

There's the possibility that automated testing with travis could get expensive -- on the other hand, I've personally been tripped up by accidentally committing broken code, and as the project gets more traction I probably won't be the only one. Are there any plans to automate the testing of flintrock against real EC2?

Implement `describe --master-hostname-only`

We have an option to describe clusters by their master hostname only, but it's not implemented.

The intention behind this option is that it provides an easily parseable way to get the cluster master for use by other tools, without them needing to parse any YAML.

Package and distribute Flintrock for users and developers

#2 is where, among other things, we settle on the general approach for distributing Flintrock. This issue here is where we sort out the details.

Here is a rough outline of what packaging and distributing Flintrock should cover:

Per #2, we probably want to distribute some kind of binary package so people don't need to have Python installed to run Flintrock. This is for most users.
- We probably need to look into tools like cx_Freeze and family.
- PyInstaller recently made a 3.0 release and seems to have the best momentum and feature set behind it. They fix issues quickly and recently merged in hooks for boto which will be released in 3.1.
For people who want to use Flintrock as a library or otherwise integrate deeply with it, we need to give them some way to install Flintrock with pip. I'm not sure if pip can also cover the binary distribution, or if we'll need two separate ways of installing Flintrock, one for users and one for developers.

So there's some research to be done here:
- Do we need user vs. developer installation methods, or can one method cover both?
- Is it possible to automatically install Flintrock's dependencies to a virtual environment when installing Flintrock via pip? That way developers don't have to worry about doing this themselves.
- Make sure to cover the Windows setuptools config so that Flintrock can be installed on Windows properly.
- Review the Python Packaging User Guide.

Allow users to set the git commit to install Spark from as 'latest'

Oftentimes, I imagine users wanting to install Spark from git may just care about getting the latest version on master to test a bug fix or new feature.

In that case, it's a hassle to have to look up the latest commit when Flintrock can just do it for you. Set the desired commit to latest and Flintrock will lookup the latest commit automatically.

There are 3 approaches we can take to implement a convenience feature like this:

Import some Python library for interacting with git and use that to query the latest version. Advantage: No implicit user dependencies. Disadvantage: It's another Flintrock dependency we have to vet.
Use git at the command line as such:
```
git ls-remote https://github.com/apache/spark master
```
Advantage: Very simple. Disadvantage: We've created an implicit dependency on git being installed on the machine running Flintrock.
Query the GitHub API. Advantage: Still simple. Disadvantage: Now we have a dependency on GitHub for something that is strictly git functionality. This may not be a big deal since Spark itself, and likely most forks of it, are hosted on GitHub.

Depends on #25.

Add shortcut to open Spark web UI

I'm on the fence about this, but it might be good to add shortcuts to open common interfaces to Flintrock.

For example, I often find myself opening the Spark web UI. It would be cool if I could do that with some variation of:

./flintrock login cluster --spark-web-ui

Or perhaps a new command is appropriate here:

./flintrock open cluster spark-web-ui

Flintrock would then open a new browser tab and point it at the Spark web UI. We can do this quite easily with click.launch().

Going down this route opens the door to adding other shortcuts which may be useful.

Add support for Apache Zeppelin module (or alternative web-based notebook)

Would it be useful for users if Flintrock offered support for Apache Zeppelin as an optional module?

I haven't used Zeppelin, but the idea is that Flintrock would install it on the cluster during launch and the user would be able to just point their browser somewhere and get coding. The user works locally and the cluster is just a remote execution environment.

Alternatives to adding a Zeppelin module include:

Adding a Spark Notebook module.
Performing additional configuration to making running IPython against a remote Spark shell easy.
Doing nothing.

I'm not familiar with any of these things (well, apart from doing nothing), so I'll have to play around with them to understand the differences. I like the fact that Zeppelin appears well integrated with Spark and supports coding in multiple languages, not just Scala or just Python.

If you have input on which way to go and why, I'd love to hear it. What does your workflow look like when you are throwing together a quick experiment or prototype and you need a cluster?

Mystery: Why doesn't SPARK_MASTER_IP accept actual IP addresses?

This is a mystery that someone can take on for fun or for glory.

If I change these two blocks of code from this:

master_host=master_instance.public_dns_name,
slave_hosts=[i.public_dns_name for i in slave_instances],

to this:

master_host=master_instance.ip_address,
slave_hosts=[i.ip_address for i in slave_instances],

then Spark fails to launch. master_host, in particular, gets plugged into SPARK_MASTER_IP in this template, which seems to set off the problem.

For whatever reason, DNS names work but IP addresses don't. I'm not sure why. Spark's documentation suggests that IP addresses should work.

I've probably misunderstood something about how to configure Spark. Another possibility is that there is a documentation or code bug in Spark itself that needs to be fixed.

One clue I've come across but not tested out is the fact that SPARK_MASTER_HOST is checked here, even though it is not mentioned anywhere else in the Spark codebase. I have a suspicion that SPARK_MASTER_HOST should instead be SPARK_MASTER_IP.

What I can say for certain is that this file is where some master configurations get set, and I have traced code there from start-master.sh. So it's probably a good place to start digging.

Allow clusters to be assigned to multiple EC2 security groups on launch

If your cluster needs to access other resources on AWS, you'll probably want to assign it to the security groups that give it the appropriate permissions.

Command-line option: --ec2-security-group (Internal destination: ec2_security_groups, with an s)

You can specify this option multiple times.

Config file option:

providers:
ec2:
  security-groups:
    - group1
    - group2

Flintrock does nothing to these groups when you destroy your cluster.

The intention is that all Flintrock-managed infrastructure is short-lived, and long-lived pieces like a special security group get managed elsewhere. Flintrock just takes care of assigning your cluster to that group on launch.

Split out provider- and service-specific code to separate Python modules

We should split up flintrock.py into a few modules that each focus on some area.

I think we should have a module for each provider (e.g. EC2, GCE) which covers all the major methods related to that provider like provisioning nodes, destroying resources, and so forth.

Additionally, we should have a Python module for each Flintrock module. A Flintrock module is a service or application you install on the cluster, like Spark. So methods that install, configure, or otherwise manage given a service like that go in one file.

Allow clusters to be assigned IAM roles during launch

When you launch a cluster you want to access a dataset somewhere, and that somewhere is often S3. You need credentials to access non-public locations on S3.

Instead of copying around AWS keys or doing some other janky setup, the Amazon-recommended way is to assign the instances in your cluster to an appropriate IAM role that lets them access what they need.

We should support this.

Refactor internal representation of cluster

Flintrock currently has a murky internal representation of a cluster. This leads to a bit of a mess in various methods, especially describe_ec2() and get_cluster_instances_ec2().

We need to refactor this representation to clean up the internals and simplify code, especially in describe_ec2(). This will also be a prerequisite to adding any new providers like GCE (#10), since Flintrock will get too messy without its own, provider-agnostic representation of a cluster.

One of the ideas I've thought about is to convert the ClusterInfo named tuple into a FlintrockCluster abstract class. We could then have provider-specific implementations of that class (e.g. EC2, GCE) and move the logic to launch, stop, start, describe clusters and so forth into that class.

Improve root volume setup and add some EBS volume support

There are probably a few things that need to be done here which need to be fleshed out, including:

Tuning the size or type of the root volume (magnetic vs. SSD; ext4 setup; etc.)
Configuring additional EBS volumes (persistent storage)
Configuring additional instance store volumes (ephemeral storage); good as scratch space
Picking good defaults for this stuff
Deciding on how much of it should be configurable

Add support for HDFS module

HDFS is a common part of people's "Big Data" workflows. Typically, their data lives there so that's where their work starts from.

In Flintrock's case, people are more likely to want to use HDFS as a temporary store for data pulled in from other places like S3 to speed up access to it by Spark. spark-ec2 directly supports this use case with both ephemeral and permanent HDFS installations provided on launched clusters.

Flintrock should offer something similar.

Upgrade to boto3

There is a new boto in town, boto3.

Is it worth upgrading? Probably; it looks like the future of boto. And the earlier we do it the easier it'll probably be.

Are there any immediate benefits? How about long-term benefits?

Validate CLI input upfront

Right now if you provide a bad path to your identity file, for example, Flintrock will only tell you after you've requested instances and wasted your money.

Issues like this -- where early validation can help avoid wasted money -- we should catch upfront:

Bad path to identity file (check using Click file path argument)
Bad Spark version
add more here

We can do this with the help of Click callbacks for validation.

This is related to #5 in that it's a CLI user experience issue.

Add GCE as a supported provider

Flintrock was designed from the start to make it possible to support multiple cloud providers within one tool.

Google's Compute Engine (GCE) is most likely the next cloud provider that Flintrock will support.

It will take quite a bit of work to do this, and in the process Flintrock will be forced to refine its internals which will make it more maintainable in the long run. This is a good thing.

Work on this should probably start only after #6 and #8 have been resolved, so that there is a slightly better definition of the APIs and structure that the GCE provider can follow. #11 is another blocker for this issue.

Add support for adding or removing nodes from an existing Flintrock cluster

You have a Flintrock cluster and you just want to add or remove a few nodes. Right now you either have to do that manually or you have to destroy your cluster and start over.

We should allow users to add or remove nodes from an existing cluster. This is a little tricky because the following things need to be take care of:

Spark, HDFS, and other installed services need to be reconfigured correctly to account for the added or removed nodes.
We should not support multiple masters or replacing the master.
Flintrock needs to detect the existing cluster configuration to know how to configure any added slaves. For example, what version of Spark should be installed?
- A new SSH key for intra-cluster communication needs to be generated, or the existing one needs to somehow be reused for the new nodes.
We will probably need to add some basic Flintrock manifest file to the master on launch which captures these details. That way Flintrock can automatically detect the cluster configuration and add nodes correctly.

Work on this issue depends on #8.

Configure Spark by default not to be noisy

Spark by default spits out a lot of INFO messages. I rarely find them helpful. I wonder if the typical person does.

I think Flintrock should configure Spark by default not to spit out these messages. This can be done by setting the root logging level to WARN via a Flintrock template.

Default invocation on python3 anaconda install fails

I run flintrock inside a conda environment with python3, so

venv/bin/python3 -Wdefault::DeprecationWarning flintrock.py "$@"

fails. Thus, alas, flintrock doesn't work out-of-the-box. Is it possible to instead just call python3 ?(see https://github.com/nchammas/flintrock/blob/master/flintrock#L3 )

Add support for debug output that shows stage statistics

Sometimes, especially if you are debugging something, you want to see a little more information on what Flintrock is doing.

One type of information I've often found myself wanting to see, for example, are statistics on how long it took to do the various sub-steps in an operation. How long is it taking to download Hadoop? How long does it take the Spark master to come up? And so forth.

So first we just need some minimal plumbing to support configurable debug output. From there we can add stuff like these statistics.

Evaluate AsyncSSH as a replacement for Paramiko + asyncio thread executor

Right now Flintrock doesn't really use the full power of asyncio. We cheat by making regular SSH calls using Paramiko in asyncio's thread executor. So all we get out of asyncio right now is perhaps a nicer API to the threading library.

The real reason Flintrock uses asyncio is that, as an orchestration tool, it manages remote resources. That means, fundamentally, Flintrock's main job is to do I/O against multiple servers across the network as quickly as possible. Most of Flintrock's interactions with the cluster nodes it manages are carried out over SSH. (The major exception is perhaps the work it does talking to the cloud provider.)

This use case sounds exactly like what asyncio is for.

AsyncSSH is a library that offers SSH over asyncio. It would replace both Paramiko as well as our use of the asyncio thread executor. More importantly, it promises to offer asynchronous SSH without the burden of threading.

I've had my eye on it for some time and have experimented with it on several occasions. The API is clean and the maintainer is active and responsive.

I want to seriously evaluate AsyncSSH as a replacement for Paramiko + thread executor, focusing on the following factors:

Does it make Flintrock easier to maintain, or at least not more complicated to maintain?
Does it make cluster operations, especially cluster launches, faster? How much faster?
What is the risk to depending on this relatively new library that does not yet have as much adoption or popularity as Paramiko?

Covert tests to Python and run in parallel where possible

Both lint.sh and test-integration.sh should be converted into Python so that they can be run on any platform using a Python test runner like pytest.

Secondly, the various tests in test-integration.sh should be parallelized where possible, launching multiple clusters in parallel to cover the same tests in less time.

Do not allow clusters with duplicate names to be launched

Right now Flintrock incorrectly allows you to launch a cluster with a name that is already taken.

Migrate from Amazon Linux to CentOS

Flintrock has thus far been built and tested on the assumption that the cluster nodes will be running Amazon Linux. Since we want to make it possible to support cloud providers other than Amazon, we need to get off Amazon Linux.

CentOS is very similar to Amazon Linux and is the obvious choice for switching. Ubuntu is another possibility but I will not pursue that unless there is some compelling argument for it over CentOS.

Switching to CentOS should mostly involve tweaking the shell scripts that run remotely during cluster launches and testing everything out to make sure everything still works.

Create stand-alone Flintrock executable that people can run without needing Python installed

Many people will find Flintrock annoying to use because they are on older versions of Python or don't have Python at all on their systems, whereas Flintrock requires a very recent version of Python. We can ask these people to install compatible versions of Python so they can run Flintrock, but that's a burden users shouldn't have to bear.

In addition to the usual pip install ..., Flintrock should offer a stand-alone package that people can install and run without needing any specific version of Python installed, or even Python installed at all. This should broaden the audience of users who will be able to use Flintrock.

I will probably use PyInstaller to build this. It has good momentum behind it and supports (I think) everything Flintrock needs for a stand-alone package.

Previously discussed in #15.

Enumerate alternatives to Flintrock

EMR supports Spark. We should add a few lines to our README about when using Flintrock over EMR might make sense.

Potential reasons -- which I need to research and confirm -- include:

Better command-line experience.
Slightly faster support of newer Spark versions. (e.g. ~~EMR is still on 1.5.0.~~ EMR now supports 1.5.2.)
Support for deploying Spark at a specific git hash. (#25)
More control over certain options.
Support for alternate cloud providers. (Well, eventually.)
Lower cost.

If you have your reasons one way or the other (EMR vs. Flintrock), share them here.

Clean up handling of exceptions raised within coroutines

There are a few things that can be improved about how we handle exceptions within the coroutines that do all of the setup work on the remote hosts.

Cleaner error output when a coroutine fails.
Closing the event loop in a finally block, per: http://stackoverflow.com/a/30766124/
Failing as soon as any coroutine errors out.

Example issue: Bad usernames (#65)

Setup automated testing for pull requests

It's hard to test Flintrock since most of what it does it orchestrate remote resources, but for starters we can run the lint script on all incoming pull requests.

I'll probably go with Travis CI for Linux and OS X, and AppVeyor for Windows.

Automated testing status:

Linux

This is working after #12.
OS X

Per #17, I'm gonna push this off until Travis directly supports Python.
Windows

Setting up testing for a Python project on AppVeyor is not trivial, but there are several guides to make it easier:
And here are several appveyor.yml files to review:

Give users an easy way to authorize new addresses to access a cluster

By default, Flintrock only authorizes the host it's running on to access launched clusters. So if you launch a cluster and then go somewhere else where your IP is gonna be different, you won't be able to login to the cluster anymore.

We should offer a way for users to authorize custom addresses on launch, and perhaps also a way for users to authorize their current address to access existing clusters.

Cannot destroy instances in other regions

Currently if we create a cluster with a region specified by yaml we can't delete them. Such as:

(py3k)dhcp-47-226:flintrock jonas$ python flintrock.py --config ../config.yaml launch flintrocktest2

Launching 2 instances...
[52.24.137.212] SSH online.
[52.24.137.212] Installing Spark...
[52.11.113.72] SSH online.
[52.11.113.72] Installing Spark...
All 2 instances provisioned.
[52.24.137.212] Configuring Spark master...
Spark Health Report:
  * Master: ALIVE
  * Workers: 1
  * Cores: 32
  * Memory: 239.2 GB
launch_ec2 finished in 0:02:30.

(py3k)dhcp-47-226:flintrock jonas$ python flintrock.py --config ../config.yaml login flintrocktest2
Warning: Permanently added 'ec2-52-24-137-212.us-west-2.compute.amazonaws.com,52.24.137.212' (RSA) to the list of known hosts.
Last login: Sun Oct 25 02:58:59 2015 from 50.141.29.11

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2015.09-release-notes/
No packages needed for security; 21 packages available
Run "sudo yum update" to apply all updates.
[ec2-user@ip-172-31-18-251 ~]$ w
 03:00:14 up 2 min,  1 user,  load average: 0.17, 0.10, 0.04
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
ec2-user pts/0    50.141.29.11     03:00    2.00s  0.00s  0.00s w
[ec2-user@ip-172-31-18-251 ~]$ logout

Connection to ec2-52-24-137-212.us-west-2.compute.amazonaws.com closed.
(py3k)dhcp-47-226:flintrock jonas$ python flintrock.py --config ../config.yaml destroy flintrocktest2

No such cluster.

Note that we can still describe the cluster


(py3k)dhcp-47-226:flintrock jonas$ python flintrock.py --config ../config.yaml describe flintrocktest2
1 cluster found.

---
flintrocktest2:
  state: running
  node-count: 2
  nodes:
    - ec2-52-24-137-212.us-west-2.compute.amazonaws.com
    - ec2-52-11-113-72.us-west-2.compute.amazonaws.com

Incoming PR to fix it.

Allow for clusters to be launched without Spark installed

Perhaps this is not a common use case, but Flintrock's internals should be cleaned up to properly support launching a cluster that does not have Spark installed.

Currently, there is some ambiguity over how the --no-install-spark option should work. Does it kick in automatically if the user's config.yaml file has no Spark module specified?

Add Flintrock commands for remotely executing shell commands on cluster nodes

Users launching clusters often need to do their own additional setup to install packages they need or the like.

We should offer one, maybe two new Flintrock commands to make it easy for users to run commands remotely on all the nodes of their cluster. I'm imagining they will work like stripped-down, Flintrock-specific versions of pssh.

Here's a proposal for what those commands might look like:

./flintrock run-command my-cluster 'yum install -y package'
./flintrock run-script my-cluster /path/to/local/script.sh

For starters I want to keep these commands as simple as possible. Just run the shell command or script on each of the cluster nodes in parallel and report whether they all passed or not. In the event of a failure, I might not even want to show which specific node or nodes failed, at least for an initial cut of this feature.

Anything more, like getting the output back from each of the nodes, will be pushed off to later, or redirected to utility tools like pssh which are better suited to the task.

Downloading Hadoop from Apache mirrors is fragile

Ok, it was taking me forever to figure out why sometimes my cluster installs were hanging, appearing to never complete. It turns out that some of the different mirrors returned by the mirror checking script are incredibly slow (100kB/sec slow) . This has resulted in clusters taking up to 45min to launch. Is there any way around this mirror search? It is extra complicated because different invocations of the mirror-finding script can return different mirrors on different ec2 nodes in a cluster, resulting in very erratic behavior.

Support running Flintrock from Windows

This is an umbrella issue for all the things that need to be done to support running Flintrock on Windows:

Remove dependency on ssh-keygen.
Remove dependency on ssh for interactive login.
Get AppVeyor build setup.

Add Flintrock command to copy files to all nodes of a cluster

Related to #4, people often do their own customized setup after they launch a cluster. Oftentimes this involves copying files to each node of the cluster, or something similar.

We should perhaps offer a Flintrock command to make this easy:

./flintrock copy-files my-cluster /local/path/spec /remote/path/spec

A counter proposal by @broxtronix is to place a copy-dir command during launch on the cluster master. The command makes it easy to copy something already on the master to all the slaves. This is what spark-ec2 does.

I'll have to better understand the use cases and flows to make an informed decision.

Add ability to run-command against just the master

Similar to copy-file (#31), we should allow users to run commands against just the master.

For example:

./flintrock run-command cluster --master-only -- ./hadoop/bin/hdfs -put /file /

This is useful for some operations that involve HDFS and Spark, where it doesn't make sense to run the same command from multiple nodes.

Settle on supported Python versions and primary release method

I'd like to require Python 3.4+ for Flintrock. The rationale for this is partly covered in the README and CONTRIBUTING docs.

To expand a bit, I want to be able to use the latest and greatest that Python has to offer and not bear the burden and restrictions of supporting multiple Python versions at once. At the same time, I would love not to have to force users to run a specific version of Python.

The best way I know of to accomplish both goals is to release binaries as the primary way that users consume and use Flintrock. That way, we can use what we want to build Flintrock, and users get something that just works without having to worry about what version of Python they have installed. Only people who want to hack on Flintrock or extend it in some way will have to use Python 3.4+. I think this is an acceptable requirement to impose.

I haven't done this before, so I'll have to research and experiment a bit to make sure this is actually practical. If you have any alternatives to suggest please chime in.

Settle on supported Python versions
Settle on a first cut of how to release Flintrock
Add friendly error handling if needed to catch when users use unsupported versions of Python

Add option to download Hadoop from a custom URL

As a follow-up to the discussion in #66, perhaps we should add an option to let users download Hadoop from a custom URL.

Command-line option: --hdfs-download-source

Config file option:

modules:
hdfs:
  version: 2.7.1
  download-source: "http://www.apache.org/dyn/closer.lua/hadoop/common/hadoop-{v}/hadoop-{v}.tar.gz?as_json"

{v} will be replaced by the HDFS version passed in to Flintrock, either via the config file or via the command line.

The .../dyn/closer.lua/... value will be Flintrock's internal default, which the user can replace with a specific Apache mirror, or some other source. The only requirements are that the package be downloadable from the cluster without authentication, and that the URL contain the {v} template somewhere.

I would like to deprecate this option as soon as we have a more robust way of downloading Hadoop quickly and reliably, since that's the main motivation for adding this option in.

Related to #66, #69, #84.

Show plan before executing it; add --dry-run option to just show plan and exit

Not sure how useful this would be, but one feature I've seen elsewhere that might make sense in Flintrock is to have some concept of an "execution plan".

Just as Flintrock is about to do something, it lays out its plan. This makes it clear what options are taking effect, which could be helpful when you have a config file and command-line options to resolve. This would just be some text shown at the start of any operation.

For example:

$ ./flintrock launch shooper-dooper-clushter
Using config file: /path/to/flintrock/config.yaml
Launching:
 - 1 master, 2 slaves
 - Spark installed, HDFS not installed
 - us-east-1 region

This would fit well with the changes proposed in #27.

An additional idea would be to have some kind of --dry-run option which just shows the plan and exits.

Allow Spark to be installed at a given git hash, and maybe even from a different repository

Flintrock currently supports installing only released versions of Spark.

For development and experimentation purposes, we should let users launch clusters with Spark installed at a specific git hash. People may want to test a new change in master that hasn't been released yet, or they may want to test a personal fork they have of Spark.

Flintrock will have to build Spark as part of the cluster launch to support this feature.

Implement a more lightweight display of launch/start/etc. progress

Flintrock's output is already much cleaner compared to spark-ec2's:

Launching 2 instances...
[52.91.67.xxx] SSH online.
[52.91.67.xxx] Installing Spark...
[52.91.213.xxx] SSH online.
[52.91.213.xxx] Installing Spark...
All 2 instances provisioned.
[52.91.67.xxx] Configuring Spark master...
Spark Health Report:
  * Master: ALIVE
  * Workers: 1
  * Cores: 1
  * Memory: 2.7 GB            
launch_ec2 finished in 0:02:52.

But this still needs some improvement. If you were to launch a 100-node cluster, you would get more than 200 lines of output: 100 lines showing when SSH comes online, and 100 lines for installing Spark.

This is noise.

A better user experience would be to show some kind of progress bar for each "stage" of the operation that Flintrock is executing.

So for cluster launches, we would show a single progress bar that advances as SSH becomes available across the nodes of the cluster, and another progress bar that advances as Spark is installed across the nodes of the cluster.

Maybe something like this:

SSH online       [####################..........]  75%
Spark installed  [######........................]  25%

I'll look into Click's progress bar utility, though it looks like I will probably have to find a workaround for this issue with having multiple progress bars being updated simultaneously.

Other considerations to keep in mind:

This must work well when Flintrock's output is redirected to a file.
We need some affordance that lets people see what node, if any, is hanging up an operation.

Support for spot instances

In the todo at the top of the flintrock script, we see:

* Support for spot instances.
        - Show wait reason (capcity oversubscribed, price too low, etc.).

Currently it seems like a lot of the machinery for spot instances is in place, including threading the requested spot price through to launch_ec2. Would you accept a PR that actually made this live? the spark_ec2.py way is to launch the root node as non-spot and the workers as spot instances -- would this work for a first version?