
bryantrobbins / baseball

An upcoming web-based tool for sabermetrics.

License: Apache License 2.0

Shell 7.71% R 0.20% Groovy 4.37% HTML 15.15% JavaScript 38.48% CSS 0.53% Python 31.48% Vue 2.09%

baseball's People

Contributors

bryantrobbins · diffley · gallagherrchris · jacobsheppard · leemarc00 · mchao47 · smxjrz


baseball's Issues

Push UI container to ECR repo

Prove that this can be done from the existing Packer setup, via a post-processor that looks something like this (from Packer Docker documentation):

"post-processors": [
[
{
"type": "docker-tag",
"repository": "12345.dkr.ecr.us-east-1.amazonaws.com/packer",
"tag": "0.7"
},
{
"type": "docker-push",
"login": true,
"login_email": "none",
"login_username": "AWS",
"login_password": "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
"login_server": "https://12345.dkr.ecr.us-east-1.amazonaws.com/"
}
]
]

To support this, we need the ability to provide the ECR repo location and credentials via environment variables, using Packer user variables:

https://www.packer.io/docs/templates/user-variables.html
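
One way to source those values at build time (a sketch, not the current setup) is a small Python helper that fetches a temporary ECR login via boto3 and prints shell exports; the variable names ECR_USERNAME, ECR_PASSWORD, and ECR_SERVER are assumptions:

# Hypothetical helper: fetch a temporary ECR login and print shell exports
# that a build job could eval before invoking "packer build".
import base64
import boto3

def ecr_packer_vars(region="us-east-1"):
    ecr = boto3.client("ecr", region_name=region)
    auth = ecr.get_authorization_token()["authorizationData"][0]
    # The token is base64("AWS:<password>"); split it for Packer's login fields.
    user, password = base64.b64decode(auth["authorizationToken"]).decode().split(":", 1)
    endpoint = auth["proxyEndpoint"]  # e.g. https://12345.dkr.ecr.us-east-1.amazonaws.com
    print(f"export ECR_USERNAME={user}")
    print(f"export ECR_PASSWORD={password}")
    print(f"export ECR_SERVER={endpoint}")

if __name__ == "__main__":
    ecr_packer_vars()

A Jenkins job could then eval this output before running packer build, feeding the values into login_username, login_password, and login_server through {{user `...`}} interpolation.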

Worker: Job Error Propagation

The current worker implementation does not update the DB entry in the case of errors (and stops processing all messages when the first error is encountered).

Things should not be this way. We should wrap each job in a generic try/except so that the job status is always updated and processing continues to the next job as appropriate.

Propagation should also include an error message.
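
A minimal sketch of the intended loop, where poll_queue, process_job, and update_job_status are hypothetical stand-ins for the worker's internals:

# Sketch of the desired worker loop: every job gets a terminal status,
# and one bad job never halts the queue. poll_queue, process_job, and
# update_job_status are hypothetical stand-ins for the real worker code.
import traceback

def run_worker(queue):
    for message in poll_queue(queue):
        job_id = message["job_id"]
        try:
            result = process_job(job_id)
            update_job_status(job_id, status="COMPLETED", result=result)
        except Exception as exc:
            # Propagate the error message into the DB entry instead of dying.
            message_text = "".join(traceback.format_exception_only(type(exc), exc))
            update_job_status(job_id, status="FAILED", error=message_text)
        # Continue to the next message regardless of outcome.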

UI: Viewing Dataset Metadata

There are a number of publicly available datasets with Baseball statistics.

As a key feature of its UI, Baseball Workbench should have a "Metadata Viewer" layout. The UI can retrieve dataset metadata from an API call.

Metadata can be assumed to include:

  • Dataset Name: A unique identifier for this dataset, such as "Lahman.Hitters"
  • Dataset Description: A text description of the dataset
  • Row Description: A text description of what each row in the dataset represents (all rows will represent the same thing, so only one description is required)
  • Column Metadata: A unique Name, text description, and data type (String, Count, Ratio) for each column in the dataset
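
For illustration only, a single dataset's metadata record might look like the following; the field names are assumptions derived from the list above:

# Illustrative metadata record for one dataset; the exact field names
# are assumptions based on the bullet points above.
lahman_hitters_metadata = {
    "name": "Lahman.Hitters",
    "description": "Season-level batting statistics from the Lahman database.",
    "rowDescription": "One player's batting line for a single season.",
    "columns": [
        {"name": "HR", "description": "Home runs", "type": "Count"},
        {"name": "AVG", "description": "Batting average", "type": "Ratio"},
        {"name": "teamID", "description": "Team identifier", "type": "String"},
    ],
}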

Add Consul to BuildHost

The EC2 Container Service from AWS does not handle service discovery. This forces you to explicitly map host ports (e.g., so that dependent containers can communicate over known ports).

I would like to add a Consul server to the BuildHost, and to use Consul to look up container IP and Port information in real-time. This will avoid having to hard-code the EC2 instance host ports when defining ECS-hosted components.

An example and further explanation are given here:
https://aws.amazon.com/blogs/compute/service-discovery-via-consul-with-amazon-ecs/

Deploy Metadata API container

  • Should have an ECR repo created in the base stack.
  • Should have a metadata-image job, which calls Packer (just as ui-image does) to create an image.
  • Should have a Consul Template configured on the BuildServer, defining the API backends as an Nginx "upstream".
  • Should have a service definition added to the DEV template, taking arguments for container count and version, as the UI container configuration does now.

API: SubmitJob

Write a Lambda function which writes a job's configuration info to a DynamoDB table, and places a notification of that job on a queue for processing.
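
A minimal sketch of such a function, assuming a baseball-jobs DynamoDB table, an SQS queue, and API Gateway proxy integration (all names are placeholders):

# Minimal SubmitJob sketch. Table and queue names are assumptions.
import json
import uuid
import boto3

dynamodb = boto3.resource("dynamodb")
sqs = boto3.client("sqs")

JOBS_TABLE = "baseball-jobs"  # assumed table name
JOB_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/12345/baseball-jobs"  # assumed

def handler(event, context):
    job_id = str(uuid.uuid4())
    config = json.loads(event["body"])
    # Persist the job configuration with an initial status.
    dynamodb.Table(JOBS_TABLE).put_item(
        Item={"jobId": job_id, "status": "PENDING", "config": config})
    # Notify the worker pool that a job is ready for processing.
    sqs.send_message(QueueUrl=JOB_QUEUE_URL,
                     MessageBody=json.dumps({"jobId": job_id}))
    return {"statusCode": 202, "body": json.dumps({"jobId": job_id})}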

UI: Default Use Case Flow

The Baseball Workbench UI should allow users to describe, execute, and export statistical analysis. To support this goal, there should be a flow of the following activities:

  • Select Initial Dataset, from a list of available Public datasets. For example: Lahman.Hitters
  • Define one or more new columns, in terms of columns from the Initial Dataset and basic arithmetic (add, subtract, multiply, divide). For example: RC = (H + BB) * TB / (AB + BB)
  • Define one or more row filters, in terms of column names, values, and comparators, to be applied to the updated data set (Initial + New Columns) prior to export. For example: Year > 1955
  • Define Exported Artifact, from a list of available export types and their options. For example: Histogram of RC
  • Click "Generate"
  • Receive temporary link to exported files.

The available datasets are:

  • Individual tables from the Lahman database (e.g., Hitters, Pitchers, Teams, etc.)
  • Retrosheet Gamelogs database (Regular Season, Postseason, or All-Star)

The supported Export Types are:

  • Table ordered by X (ASC or DESC)
  • Histogram of X
  • Scatter Plot of X vs. Y

Create Dataset Metadata API

The system will know about a fixed number of datasets. We need an API for dataset metadata, driven by config files and serving back row and column info to the Metadata Viewer frontend.

Metadata viewing is completely separate from the Job Submission API that will be necessary for manipulating data. It could even be deployed as its own microservice.

UI: Add Export components

The user should be able to select from a pre-defined set of export types, and provide any necessary options for each type.

The only available type for the MVP should be:

  • Ordered Table: requires choice of 1 to 5 columns to be in the table, and requires choice of order by column and order direction (ascending or descending)

Infra: Add database

Trying to decide between Postgres, MongoDB, and Riak. Would like to pick one and add it to the existing infrastructure, for now.

My current favorite is Riak, which is similar to AWS DynamoDB (based on the Dynamo paper).

Requirements:

  • Support for multiple nodes
  • Ability to run nodes in Docker containers, across different Docker hosts
  • No data loss on single node failure
  • Ability to back up and restore from a tarball
  • Low-latency querying interface

API: GetJobInfo

Write a Lambda function which returns a job's info by doing a lookup in a DynamoDB table.
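
A minimal sketch, reusing the assumed baseball-jobs table from the SubmitJob sketch:

# Minimal GetJobInfo sketch; table and key names are assumptions
# matching the SubmitJob sketch above.
import json
import boto3

dynamodb = boto3.resource("dynamodb")
JOBS_TABLE = "baseball-jobs"  # assumed table name

def handler(event, context):
    job_id = event["pathParameters"]["jobId"]
    resp = dynamodb.Table(JOBS_TABLE).get_item(Key={"jobId": job_id})
    item = resp.get("Item")
    if item is None:
        return {"statusCode": 404, "body": json.dumps({"error": "unknown job"})}
    # default=str handles DynamoDB Decimal values during serialization.
    return {"statusCode": 200, "body": json.dumps(item, default=str)}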

Backend proof of concept

Write a backend which:

  • Grabs a pending job's ID from a queue
  • Retrieves the job configuration from a database
  • Validates the job configuration
  • Executes the job in R
  • Pushes output files to S3
  • Updates the job status in a database
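
A condensed sketch of that pipeline; the queue, table, and bucket names, the job.R entry point, and the validate() helper are all assumptions:

# End-to-end sketch of the PoC pipeline described above.
import json
import subprocess
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/12345/baseball-jobs"  # assumed
JOBS_TABLE = "baseball-jobs"       # assumed
OUTPUT_BUCKET = "baseball-output"  # assumed

def run_once():
    # 1. Grab a pending job's ID from the queue.
    msgs = sqs.receive_message(QueueUrl=QUEUE_URL,
                               MaxNumberOfMessages=1).get("Messages", [])
    if not msgs:
        return
    job_id = json.loads(msgs[0]["Body"])["jobId"]
    table = dynamodb.Table(JOBS_TABLE)
    # 2. Retrieve the job configuration from the database.
    config = table.get_item(Key={"jobId": job_id})["Item"]["config"]
    # 3. Validate the job configuration (validate is a hypothetical helper).
    validate(config)
    # 4. Execute the job in R (job.R is an assumed entry-point script).
    out_file = f"/tmp/{job_id}.csv"
    subprocess.run(["Rscript", "job.R", json.dumps(config), out_file], check=True)
    # 5. Push output files to S3.
    s3.upload_file(out_file, OUTPUT_BUCKET, f"{job_id}/output.csv")
    # 6. Update the job status in the database ("status" is a reserved word).
    table.update_item(Key={"jobId": job_id},
                      UpdateExpression="SET #s = :s",
                      ExpressionAttributeNames={"#s": "status"},
                      ExpressionAttributeValues={":s": "COMPLETED"})
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msgs[0]["ReceiptHandle"])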

UI: Equation Editor

To support the definition of new statistics from existing columns in a dataset, Baseball Workbench should have an Equation Editor.

  • Users should be able to easily add references to columns from their datasource.
  • Users should be able to easily add references to custom columns previously defined.
  • Attempts to add references to non-existent columns should result in a client-side error.
  • Users should be able to use simple mathematical operators: Add, Subtract, Multiply, Divide
  • Attempts to use unsupported operators (or include any extraneous characters) should result in a client-side error.

API: GetDatasets

Write a Lambda function which serves back a JSON block of dataset metadata.
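
A minimal sketch serving a static block; the metadata shape mirrors the illustrative record in the Metadata Viewer issue and is likewise an assumption:

# Minimal GetDatasets sketch: serve a static metadata block. The exact
# structure and field names are assumptions.
import json

DATASETS = [
    {
        "name": "Lahman_Batting",
        "description": "Season-level batting statistics from the Lahman database.",
        "rowDescription": "One player's batting line for a single season.",
        "columns": [
            {"name": "HR", "description": "Home runs", "type": "Count"},
            {"name": "yearID", "description": "Season year", "type": "Count"},
            {"name": "lgID", "description": "League identifier", "type": "String"},
        ],
    },
]

def handler(event, context):
    return {"statusCode": 200, "body": json.dumps({"datasets": DATASETS})}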

Automated Build Server Configuration

Acceptance criteria:

  • Single command creates all AWS resources
  • r10k runs to retrieve required puppet modules
  • Local puppet apply runs to configure Jenkins server
  • Single t2.nano server with Jenkins installed
  • Jenkins security enabled, with "admin" and "bryan" accounts created
  • Docker and Packer installed, to support image builds
  • Jenkins plugins installed to support Git checkout
  • Jenkins seed job configured to check out a specified GitHub repo and run a file with a given name

Add encrypted hieradata to local puppet runs

Because the "standard-aws" repo is separate, it would be nice to be able to supply project-specific details to that standard setup during launch, so that the configured server(s) can have project-specific details.

API: Add validation to SubmitJob API

Before successfully writing to DynamoDB and placing a message on the queue, the SubmitJob API call should validate the parameters of the requested job.

Here is a sample JSON configuration object for a job:

{
  "dataset": "Lahman_Batting",
  "transformations": [
    {
      "type": "columnSelect",
      "columns": [
        "HR",
        "lgID"
      ]
    },
    {
      "type": "rowSelect",
      "column": "yearID",
      "operator": ">=",
      "criteria": "2000"
    },
    {
      "type": "columnDefine",
      "column": "custom",
      "expression": "2*(HR)"
    },
    {
      "type": "rowSum",
      "columns": [
        "playerID",
        "yearID",
        "lgID"
      ]
    }
  ],
  "output": {
    "type": "leaderboard",
    "column": "HR",
    "direction": "desc"
  }
}

Below is a list of required validations.

Dataset:

  • Dataset ID should be from the allowed set of datasets (currently just "Lahman_Batting")

Output:

  • Output parameter "type" should be from allowed set of output types (currently just "leaderboard")
  • Output parameter "column" should be the name of a single column from the set of selected and/or defined columns as of the end of all transformations
  • Output parameter "direction" should be one of "desc" or "asc"

ColumnSelect and RowSum Transformation:

  • Entries in the "columns" list should be the name of an existing column, with respect to any previously executed transformations.
  • After the ColumnSelect transformation, all columns not present in the "columns" list are lost.
  • After the RowSum transformation, all string-valued columns not present in the "columns" list are lost.

RowSelect Transformation:

  • "column" should be the name of an existing column, with respect to any previously executed transformations.
  • "operator" should be one of <, >, <=, >=, =, or !=.
  • "criteria" should be either a number or string, and not an expression.
  • The type of the criteria (number or string) should match the type of the corresponding column chosen.

ColumnDefine Transformation:

  • "column" should be a unique name for the new column being defined, and should not conflict with the name of any existing column, with respect to any previously executed transformations
  • "expression" should be a valid mathematical expression using only scalar values (strings or numbers) or the names of existing columns, with respect to any previously executed transformations.
  • "expression" may use the following numerical operators: +, -, *, /, ^
  • After the ColumnDefine transformation, a new column with the given name is added.
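
A condensed sketch of these checks, tracking the set of available columns through each transformation (the hard-coded column list and types are assumptions):

# Condensed validation sketch for SubmitJob. DATASET_COLUMNS and the
# column types are assumptions; expression checking is sketched in the
# Column Define issue below.
DATASET_COLUMNS = {"Lahman_Batting": {"playerID": "string", "yearID": "number",
                                      "lgID": "string", "HR": "number",
                                      "AB": "number"}}
OUTPUT_TYPES = {"leaderboard"}
ROW_OPERATORS = {"<", ">", "<=", ">=", "=", "!="}

def validate_job(job):
    columns = dict(DATASET_COLUMNS[job["dataset"]])  # KeyError -> unknown dataset
    for t in job["transformations"]:
        if t["type"] == "columnSelect":
            unknown = set(t["columns"]) - set(columns)
            assert not unknown, f"unknown columns: {unknown}"
            columns = {c: columns[c] for c in t["columns"]}  # others are lost
        elif t["type"] == "rowSelect":
            assert t["column"] in columns, "unknown column"
            assert t["operator"] in ROW_OPERATORS, "unsupported operator"
            # Criteria type (number vs string) must match the column type.
        elif t["type"] == "columnDefine":
            assert t["column"] not in columns, "column name conflict"
            # Expression must reference only existing columns (see grammar sketch).
            columns[t["column"]] = "number"
        elif t["type"] == "rowSum":
            keep = set(t["columns"])
            assert keep <= set(columns), "unknown columns"
            # String-valued columns outside the key set are lost.
            columns = {c: ty for c, ty in columns.items()
                       if ty == "number" or c in keep}
    out = job["output"]
    assert out["type"] in OUTPUT_TYPES, "unknown output type"
    assert out["column"] in columns, "output column not present after transforms"
    assert out["direction"] in ("asc", "desc"), "bad direction"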

UI: Allow configuration of Column Define transformation

Parameters:

  • Column (name of column)
  • Expression (must conform to grammar)

In general, the expression should be either a string constant OR a mathematical expression which allows:

  • Plus, Minus, Multiply, Divide
  • Exponents
  • Parentheses
  • Column references of the form $('COL')
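
As a sketch, that grammar can be enforced with a reference-extraction pass plus a character whitelist; the exact identifier rule inside $('COL') is an assumption:

# Sketch of expression validation against the grammar above. The identifier
# rule ([A-Za-z_][A-Za-z0-9_]*) inside $('COL') is an assumption, and string
# constants (the other allowed case) are not handled here.
import re

COLUMN_REF = re.compile(r"\$\('([A-Za-z_][A-Za-z0-9_]*)'\)")

def validate_expression(expression, existing_columns):
    # Every $('COL') reference must name an existing column.
    refs = COLUMN_REF.findall(expression)
    unknown = set(refs) - set(existing_columns)
    if unknown:
        return f"unknown columns: {sorted(unknown)}"
    # After replacing references, only numbers, the operators + - * / ^,
    # parentheses, and whitespace may remain.
    remainder = COLUMN_REF.sub("0", expression)
    if not re.fullmatch(r"[0-9.+\-*/^() \t]*", remainder):
        return "expression contains unsupported characters"
    return None  # valid

For example, validate_expression("2*($('HR')+$('BB'))", {"HR", "BB"}) passes, while a reference to an undefined column or a stray character returns an error message that the UI can surface client-side.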
