bryantrobbins / baseball
An upcoming web-based tool for sabermetrics.
License: Apache License 2.0
Available parameters:
Scale backend capacity according to number of pending jobs in the queue
Prove that this can be done from the existing Packer setup, via a post-processor that looks something like this (from Packer Docker documentation):
"post-processors": [
[
{
"type": "docker-tag",
"repository": "12345.dkr.ecr.us-east-1.amazonaws.com/packer",
"tag": "0.7"
},
{
"type": "docker-push",
"login": true,
"login_email": "none",
"login_username": "AWS",
"login_password": "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
"login_server": "https://12345.dkr.ecr.us-east-1.amazonaws.com/"
}
]
]
To support this, we need the ability to provide the ECR repo location and credentials via environment variables.
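One way to wire this up is through Packer user variables whose defaults read from the environment (the `env` function is only valid in variable defaults). The variable and environment-variable names below are assumptions, not an agreed convention:

```json
{
  "variables": {
    "ecr_repo": "{{env `ECR_REPO`}}",
    "ecr_username": "{{env `ECR_USERNAME`}}",
    "ecr_password": "{{env `ECR_PASSWORD`}}"
  },
  "post-processors": [
    {
      "type": "docker-tag",
      "repository": "{{user `ecr_repo`}}",
      "tag": "0.7"
    }
  ]
}
```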
Use validator from shared Python package
AWS Batch looks like it might be a great fit for running the worker jobs in Docker containers:
http://docs.aws.amazon.com/batch/latest/userguide/Batch_GetStarted.html
The current worker implementation does not update the DB entry when an error occurs, and it stops processing all messages once the first error is encountered.
Instead, we should wrap per-job processing in a generic try/except so that job status is always updated and processing continues to the next job as appropriate.
The status update should also include an error message.
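A minimal sketch of that loop, assuming `db.update_job` and `run_job` stand in for the real persistence client and job-execution function (both hypothetical names):

```python
import json
import traceback

def process_messages(messages, db, run_job):
    """Process each queued job; record failures without halting the loop.

    `db` and `run_job` are placeholders for the real DB client and
    job-execution function.
    """
    for message in messages:
        job = json.loads(message.body)
        try:
            run_job(job)
            db.update_job(job["jobId"], status="COMPLETED")
        except Exception as exc:
            # Always record the failure, including an error message,
            # then move on to the next job.
            db.update_job(job["jobId"], status="FAILED",
                          error=f"{exc}\n{traceback.format_exc()}")
```

The broad `except Exception` is deliberate here: any per-job failure should be recorded against that job rather than crash the worker.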
There are a number of publicly available datasets with baseball statistics.
As a key feature of its UI, Baseball Workbench should have a "Metadata Viewer" layout. The UI can retrieve dataset metadata from an API call.
Metadata can be assumed to include:
Possible Limits for Alpha:
The EC2 Container Service from AWS does not handle service discovery. This forces you to explicitly map host ports (e.g., so that dependent containers can communicate over known ports).
I would like to add a Consul server to the BuildHost, and to use Consul to look up container IP and Port information in real-time. This will avoid having to hard-code the EC2 instance host ports when defining ECS-hosted components.
An example and further explanation are given here:
https://aws.amazon.com/blogs/compute/service-discovery-via-consul-with-amazon-ecs/
Should have an ECR repo created in base stack.
Should have a metadata-image job, which calls Packer (just as ui-image does) to create an image.
Should have a Consul Template configured on the BuildServer, defining backends as an Nginx "upstream".
Should have service definition added to the DEV template, taking arguments for container count and version as the UI container configurations do now.
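The Consul Template "upstream" mentioned above could look something like the following; the service name "baseball-backend" is an assumption:

```
# consul-template fragment rendered into the Nginx config on the BuildServer
upstream backend {
  {{range service "baseball-backend"}}
  server {{.Address}}:{{.Port}};{{end}}
}
```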
Need at least an ECS cluster and ECR repository in the Standard template (https://github.com/bryantrobbins/standard-aws).
Create a new CF template specific to baseball which defines ECS Tasks for each Docker-hosted component (UI, API, and Worker containers).
Write a Lambda function which writes a job's configuration info to a DynamoDB table, and places a notification of that job on a queue for processing.
Submission includes:
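A sketch of the Lambda's core logic. The real function would use boto3 (`Table.put_item`, `Queue.send_message`); here the table and queue are injected so the logic is testable, and all names are illustrative:

```python
import json
import uuid

def submit_job(config, table, queue):
    """Persist the job configuration, then enqueue a notification.

    `table` and `queue` stand in for a boto3 DynamoDB Table resource
    and SQS Queue resource.
    """
    job_id = str(uuid.uuid4())
    table.put_item(Item={
        "jobId": job_id,
        "status": "SUBMITTED",
        "config": json.dumps(config),
    })
    # The queued message carries only the job ID; workers fetch the
    # full configuration from the table.
    queue.send_message(MessageBody=json.dumps({"jobId": job_id}))
    return {"jobId": job_id}
```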
Deploy backend into a homogeneous group of Docker containers
The Baseball Workbench UI should allow users to describe, execute, and export statistical analysis. To support this goal, there should be a flow of the following activities:
The available datasets are:
The supported Export Types are:
Allows UI to validate configuration before attempting to submit job
(Uses same validation code from shared Python package)
The system will know about a fixed number of datasets. We need an API for dataset metadata, driven by config files and serving back row and column info to the Metadata Viewer frontend.
Metadata viewing is completely separate from the Job Submission API that will be necessary for manipulating data. It could even be deployed as its own microservice.
The user should be able to select from a pre-defined set of export types, and provide any necessary options for each type.
The only available type for the MVP should be:
Trying to decide between Postgres, MongoDB, and Riak. Would like to pick one and add it to the existing infrastructure, for now.
My current favorite is Riak, which is similar to AWS DynamoDB (both are based on the Dynamo paper).
Requirements:
Parameters:
Note this sums over all Numeric columns and drops any String columns not identified as grouping columns.
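That rowSum behavior can be sketched in plain Python (row dicts rather than a real dataframe; the function name is illustrative):

```python
from collections import defaultdict

def row_sum(rows, group_columns):
    """Sum every numeric column grouped by `group_columns`.

    String columns not named as grouping columns are dropped.
    `rows` is a list of dicts, one per data row.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for row in rows:
        key = tuple(row[c] for c in group_columns)
        for col, value in row.items():
            # Keep numeric columns only; non-grouping strings are dropped.
            if col not in group_columns and isinstance(value, (int, float)):
                totals[key][col] += value
    return [
        {**dict(zip(group_columns, key)), **sums}
        for key, sums in totals.items()
    ]
```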
Write a Lambda function which returns a job's info by doing a lookup in a DynamoDB table
Parameters:
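The lookup itself is a one-call sketch; `table` stands in for a boto3 DynamoDB Table resource (`get_item` returns a dict with an `"Item"` key when the job exists):

```python
def get_job(job_id, table):
    """Return a job's info from a DynamoDB-style `table`, or None
    when the job is unknown."""
    response = table.get_item(Key={"jobId": job_id})
    return response.get("Item")
```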
Write a backend which:
Should work against the data persisted by submitJob.
To support the definition of new statistics from existing columns in a dataset, Baseball Workbench should have an Equation Editor.
Parameters:
Write a Lambda function which serves back a JSON block of dataset metadata.
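Since the metadata is driven by config files, the handler can simply read a per-dataset JSON file. The directory layout and file naming below are assumptions:

```python
import json

def get_metadata(dataset, metadata_dir="metadata"):
    """Serve back the JSON metadata block for `dataset`, read from a
    per-dataset config file bundled with the function."""
    with open(f"{metadata_dir}/{dataset}.json") as f:
        return json.load(f)
```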
Basically, we need this:
The reverse proxy picks up backend services as they register with Consul.
Acceptance criteria:
Because the "standard-aws" repo is separate, it would be nice to supply project-specific details to that standard setup during launch, so that the configured server(s) can be customized per project.
Before successfully writing to DynamoDB and placing a message on the queue, the SubmitJob API call should validate the parameters of the requested job.
Here is a sample JSON configuration object for a job:
{
  "dataset": "Lahman_Batting",
  "transformations": [
    {
      "type": "columnSelect",
      "columns": [
        "HR",
        "lgID"
      ]
    },
    {
      "type": "rowSelect",
      "column": "yearID",
      "operator": ">=",
      "criteria": "2000"
    },
    {
      "type": "columnDefine",
      "column": "custom",
      "expression": "2*(HR)"
    },
    {
      "type": "rowSum",
      "columns": [
        "playerID",
        "yearID",
        "lgID"
      ]
    }
  ],
  "output": {
    "type": "leaderboard",
    "column": "HR",
    "direction": "desc"
  }
}
Below is a list of required validations.
Dataset:
Output:
ColumnSelect and RowSum Transformation:
RowSelect Transformation:
ColumnDefine Transformation:
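A sketch of the validator against the sample configuration above. The exact rule set (supported operators, output types) is an assumption to be replaced by the shared Python package's rules:

```python
SUPPORTED_OPERATORS = {">=", "<=", ">", "<", "==", "!="}
SUPPORTED_OUTPUTS = {"leaderboard"}

def validate_job(config, known_datasets):
    """Return a list of validation errors for a job configuration.

    An empty list means the job may be written to DynamoDB and queued.
    """
    errors = []
    if config.get("dataset") not in known_datasets:
        errors.append("unknown dataset")
    output = config.get("output", {})
    if output.get("type") not in SUPPORTED_OUTPUTS:
        errors.append("unsupported output type")
    for t in config.get("transformations", []):
        kind = t.get("type")
        if kind in ("columnSelect", "rowSum") and not t.get("columns"):
            errors.append(f"{kind}: 'columns' must be a non-empty list")
        elif kind == "rowSelect" and t.get("operator") not in SUPPORTED_OPERATORS:
            errors.append("rowSelect: unsupported operator")
        elif kind == "columnDefine" and not t.get("expression"):
            errors.append("columnDefine: missing expression")
    return errors
```

Because the same code lives in the shared package, the UI-side pre-check and the SubmitJob API apply identical rules.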
Parameters:
In general, the expression should be either a string constant OR a mathematical expression which allows:
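One way to enforce that rule is to whitelist AST node types, so arbitrary code in an expression is rejected before any evaluation. The allowed-operation set here is an assumption:

```python
import ast

# Arithmetic over constants and known column names only.
ALLOWED_NODES = (ast.Expression, ast.BinOp, ast.UnaryOp, ast.Add, ast.Sub,
                 ast.Mult, ast.Div, ast.USub, ast.Constant, ast.Name, ast.Load)

def is_valid_expression(expression, columns):
    """Check that a columnDefine expression is either a constant or a
    mathematical expression over the dataset's known columns."""
    try:
        tree = ast.parse(expression, mode="eval")
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            return False
        if isinstance(node, ast.Name) and node.id not in columns:
            return False
    return True
```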
This setup will allow Docker containers to register with Consul as they come up:
https://aws.amazon.com/blogs/compute/service-discovery-via-consul-with-amazon-ecs/