
bryantrobbins / baseball

An upcoming web-based tool for sabermetrics.

License: Apache License 2.0

Shell 7.71% R 0.20% Groovy 4.37% HTML 15.15% JavaScript 38.48% CSS 0.53% Python 31.48% Vue 2.09%

baseball's People

Contributors

bryantrobbins · diffley · gallagherrchris · jacobsheppard · leemarc00 · mchao47 · smxjrz


baseball's Issues

Push UI container to ECR repo

Prove that this can be done from the existing Packer setup, via a post-processor that looks something like this (from Packer Docker documentation):

"post-processors": [
[
{
"type": "docker-tag",
"repository": "12345.dkr.ecr.us-east-1.amazonaws.com/packer",
"tag": "0.7"
},
{
"type": "docker-push",
"login": true,
"login_email": "none",
"login_username": "AWS",
"login_password": "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
"login_server": "https://12345.dkr.ecr.us-east-1.amazonaws.com/"
}
]
]

To support this, we need the ability to provide the ECR repo location and credentials via environment variables, using Packer user variables:

https://www.packer.io/docs/templates/user-variables.html
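
One way to source those values at build time (a sketch, not the current setup) is a small Python helper that fetches a temporary ECR login via boto3 and prints shell exports; the variable names ECR_USERNAME, ECR_PASSWORD, and ECR_SERVER are assumptions:

# Hypothetical helper: fetch a temporary ECR login and print shell exports
# that a build job could eval before invoking "packer build".
import base64
import boto3

def ecr_packer_vars(region="us-east-1"):
    ecr = boto3.client("ecr", region_name=region)
    auth = ecr.get_authorization_token()["authorizationData"][0]
    # The token is base64("AWS:<password>"); split it for Packer's login fields.
    user, password = base64.b64decode(auth["authorizationToken"]).decode().split(":", 1)
    endpoint = auth["proxyEndpoint"]  # e.g. https://12345.dkr.ecr.us-east-1.amazonaws.com
    print(f"export ECR_USERNAME={user}")
    print(f"export ECR_PASSWORD={password}")
    print(f"export ECR_SERVER={endpoint}")

if __name__ == "__main__":
    ecr_packer_vars()

A Jenkins job could then eval this output before running packer build, feeding the values into login_username, login_password, and login_server through {{user `...`}} interpolation.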

Worker: Job Error Propagation

The current worker implementation does not update the DB entry in the case of errors (and stops processing all messages when the first error is encountered).

Things should not be this way. We should wrap each job in a generic try/except so that the job status is always updated and processing continues to the next job as appropriate.

Propagation should also include an error message.
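
A minimal sketch of the intended loop, where poll_queue, process_job, and update_job_status are hypothetical stand-ins for the worker's internals:

# Sketch of the desired worker loop: every job gets a terminal status,
# and one bad job never halts the queue. poll_queue, process_job, and
# update_job_status are hypothetical stand-ins for the real worker code.
import traceback

def run_worker(queue):
    for message in poll_queue(queue):
        job_id = message["job_id"]
        try:
            result = process_job(job_id)
            update_job_status(job_id, status="COMPLETED", result=result)
        except Exception as exc:
            # Propagate the error message into the DB entry instead of dying.
            message_text = "".join(traceback.format_exception_only(type(exc), exc))
            update_job_status(job_id, status="FAILED", error=message_text)
        # Continue to the next message regardless of outcome.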

UI: Viewing Dataset Metadata

There are a number of publicly available datasets with Baseball statistics.

As a key feature of its UI, Baseball Workbench should have a "Metadata Viewer" layout. The UI can retrieve dataset metadata from an API call.

Metadata can be assumed to include:

  • Dataset Name: A unique identifier for this dataset, such as "Lahman.Hitters"
  • Dataset Description: A text description of the dataset
  • Row Description: A text description of what each row in the dataset represents (all rows will represent the same thing, so only one description is required)
  • Column Metadata: A unique Name, text description, and data type (String, Count, Ratio) for each column in the dataset
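
For illustration only, a single dataset's metadata record might look like the following; the field names are assumptions derived from the list above:

# Illustrative metadata record for one dataset; the exact field names
# are assumptions based on the bullet points above.
lahman_hitters_metadata = {
    "name": "Lahman.Hitters",
    "description": "Season-level batting statistics from the Lahman database.",
    "rowDescription": "One player's batting line for a single season.",
    "columns": [
        {"name": "HR", "description": "Home runs", "type": "Count"},
        {"name": "AVG", "description": "Batting average", "type": "Ratio"},
        {"name": "teamID", "description": "Team identifier", "type": "String"},
    ],
}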

Add Consul to BuildHost

The EC2 Container Service from AWS does not handle service discovery. This forces you to explicitly map host ports (e.g., so that dependent containers can communicate over known ports).

I would like to add a Consul server to the BuildHost, and to use Consul to look up container IP and Port information in real-time. This will avoid having to hard-code the EC2 instance host ports when defining ECS-hosted components.

An example and further explanation are given here:
https://aws.amazon.com/blogs/compute/service-discovery-via-consul-with-amazon-ecs/

Deploy Metadata API container

  • Should have an ECR repo created in the base stack.
  • Should have a metadata-image job, which calls Packer (just as ui-image does) to create an image.
  • Should have a Consul Template configured on the BuildServer, defining the API backends as an Nginx "upstream".
  • Should have a service definition added to the DEV template, taking arguments for container count and version, as the UI container configuration does now.

API: SubmitJob

Write a Lambda function which writes a job's configuration info to a DynamoDB table, and places a notification of that job on a queue for processing.
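
A minimal sketch of such a function, assuming a baseball-jobs DynamoDB table, an SQS queue, and API Gateway proxy integration (all names are placeholders):

# Minimal SubmitJob sketch. Table and queue names are assumptions.
import json
import uuid
import boto3

dynamodb = boto3.resource("dynamodb")
sqs = boto3.client("sqs")

JOBS_TABLE = "baseball-jobs"  # assumed table name
JOB_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/12345/baseball-jobs"  # assumed

def handler(event, context):
    job_id = str(uuid.uuid4())
    config = json.loads(event["body"])
    # Persist the job configuration with an initial status.
    dynamodb.Table(JOBS_TABLE).put_item(
        Item={"jobId": job_id, "status": "PENDING", "config": config})
    # Notify the worker pool that a job is ready for processing.
    sqs.send_message(QueueUrl=JOB_QUEUE_URL,
                     MessageBody=json.dumps({"jobId": job_id}))
    return {"statusCode": 202, "body": json.dumps({"jobId": job_id})}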

UI: Default Use Case Flow

The Baseball Workbench UI should allow users to describe, execute, and export statistical analysis. To support this goal, there should be a flow of the following activities:

  • Select Initial Dataset, from a list of available Public datasets. For example: Lahman.Hitters
  • Define one or more new columns, in terms of columns from the Initial Dataset and basic arithmetic (add, subtract, multiply, divide). For example: RC = (H + BB) * TB / (AB + BB)
  • Define one or more row filters, in terms of column names, values, and comparators, to be applied to the updated data set (Initial + New Columns) prior to export. For example: Year > 1955
  • Define Exported Artifact, from a list of available export types and their options. For example: Histogram of RC
  • Click "Generate"
  • Receive temporary link to exported files.

The available datasets are:

  • Individual tables from the Lahman database (e.g., Hitters, Pitchers, Teams, etc.)
  • Retrosheet Gamelogs database (Regular Season, Postseason, or All-Star)

The supported Export Types are:

  • Table ordered by X (ASC or DESC)
  • Histogram of X
  • Scatter Plot of X vs. Y

Create Dataset Metadata API

The system will know about a fixed number of datasets. We need an API for dataset metadata, driven by config files and serving back row and column info to the Metadata Viewer frontend.

Metadata viewing is completely separate from the Job Submission API that will be necessary for manipulating data. It could even be deployed as its own microservice.

UI: Add Export components

The user should be able to select from a pre-defined set of export types, and provide any necessary options for each type.

The only available type for the MVP should be:

  • Ordered Table: requires choice of 1 to 5 columns to be in the table, and requires choice of order by column and order direction (ascending or descending)

Infra: Add database

Trying to decide between Postgres, MongoDB, and Riak. Would like to pick one and add it to the existing infrastructure, for now.

My current favorite is Riak, which is similar to AWS DynamoDB (based on the Dynamo paper).

Requirements:

  • Support for multiple nodes
  • Ability to run nodes in Docker containers, across different Docker hosts
  • No data loss on single node failure
  • Ability to back up and restore from a tarball
  • Low-latency querying interface

API: GetJobInfo

Write a Lambda function which returns a job's info by doing a lookup in a DynamoDB table.
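
A minimal sketch, reusing the assumed baseball-jobs table from the SubmitJob sketch:

# Minimal GetJobInfo sketch; table and key names are assumptions
# matching the SubmitJob sketch above.
import json
import boto3

dynamodb = boto3.resource("dynamodb")
JOBS_TABLE = "baseball-jobs"  # assumed table name

def handler(event, context):
    job_id = event["pathParameters"]["jobId"]
    resp = dynamodb.Table(JOBS_TABLE).get_item(Key={"jobId": job_id})
    item = resp.get("Item")
    if item is None:
        return {"statusCode": 404, "body": json.dumps({"error": "unknown job"})}
    # default=str handles DynamoDB Decimal values during serialization.
    return {"statusCode": 200, "body": json.dumps(item, default=str)}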

Backend proof of concept

Write a backend which:

  • Grabs a pending job's ID from a queue
  • Retrieves the job configuration from a database
  • Validates the job configuration
  • Executes the job in R
  • Pushes output files to S3
  • Updates the job status in a database
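
A condensed sketch of that pipeline; the queue, table, and bucket names, the job.R entry point, and the validate() helper are all assumptions:

# End-to-end sketch of the PoC pipeline described above.
import json
import subprocess
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/12345/baseball-jobs"  # assumed
JOBS_TABLE = "baseball-jobs"       # assumed
OUTPUT_BUCKET = "baseball-output"  # assumed

def run_once():
    # 1. Grab a pending job's ID from the queue.
    msgs = sqs.receive_message(QueueUrl=QUEUE_URL,
                               MaxNumberOfMessages=1).get("Messages", [])
    if not msgs:
        return
    job_id = json.loads(msgs[0]["Body"])["jobId"]
    table = dynamodb.Table(JOBS_TABLE)
    # 2. Retrieve the job configuration from the database.
    config = table.get_item(Key={"jobId": job_id})["Item"]["config"]
    # 3. Validate the job configuration (validate is a hypothetical helper).
    validate(config)
    # 4. Execute the job in R (job.R is an assumed entry-point script).
    out_file = f"/tmp/{job_id}.csv"
    subprocess.run(["Rscript", "job.R", json.dumps(config), out_file], check=True)
    # 5. Push output files to S3.
    s3.upload_file(out_file, OUTPUT_BUCKET, f"{job_id}/output.csv")
    # 6. Update the job status in the database ("status" is a reserved word).
    table.update_item(Key={"jobId": job_id},
                      UpdateExpression="SET #s = :s",
                      ExpressionAttributeNames={"#s": "status"},
                      ExpressionAttributeValues={":s": "COMPLETED"})
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msgs[0]["ReceiptHandle"])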

UI: Equation Editor

To support the definition of new statistics from existing columns in a dataset, Baseball Workbench should have an Equation Editor.

  • Users should be able to easily add references to columns from their datasource.
  • Users should be able to easily add references to custom columns previously defined.
  • Attempts to add references to non-existent columns should result in a client-side error.
  • Users should be able to use simple mathematical operators: Add, Subtract, Multiply, Divide
  • Attempts to use unsupported operators (or include any extraneous characters) should result in a client-side error.

API: GetDatasets

Write a Lambda function which serves back a JSON block of dataset metadata.
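
A minimal sketch serving a static block; the metadata shape mirrors the illustrative record in the Metadata Viewer issue and is likewise an assumption:

# Minimal GetDatasets sketch: serve a static metadata block. The exact
# structure and field names are assumptions.
import json

DATASETS = [
    {
        "name": "Lahman_Batting",
        "description": "Season-level batting statistics from the Lahman database.",
        "rowDescription": "One player's batting line for a single season.",
        "columns": [
            {"name": "HR", "description": "Home runs", "type": "Count"},
            {"name": "yearID", "description": "Season year", "type": "Count"},
            {"name": "lgID", "description": "League identifier", "type": "String"},
        ],
    },
]

def handler(event, context):
    return {"statusCode": 200, "body": json.dumps({"datasets": DATASETS})}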

Automated Build Server Configuration

Acceptance criteria:

  • Single command creates all AWS resources
  • r10k runs to retrieve required puppet modules
  • Local puppet apply runs to configure Jenkins server
  • Single t2.nano server with Jenkins installed
  • Jenkins security enabled, with "admin" and "bryan" accounts created
  • Docker and Packer installed, to support image builds
  • Jenkins plugins installed to support Git checkout
  • Jenkins seed job configured to check out a specified GitHub repo and run a file with a given name

Add encrypted hieradata to local puppet runs

Because the "standard-aws" repo is separate, it would be nice to be able to supply project-specific details to that standard setup during launch, so that the configured server(s) can have project-specific details.

API: Add validation to SubmitJob API

Before successfully writing to DynamoDB and placing a message on the queue, the SubmitJob API call should validate the parameters of the requested job.

Here is a sample JSON configuration object for a job:

{
  "dataset": "Lahman_Batting",
  "transformations": [
    {
      "type": "columnSelect",
      "columns": [
        "HR",
        "lgID"
      ]
    },
    {
      "type": "rowSelect",
      "column": "yearID",
      "operator": ">=",
      "criteria": "2000"
    },
    {
      "type": "columnDefine",
      "column": "custom",
      "expression": "2*(HR)"
    },
    {
      "type": "rowSum",
      "columns": [
        "playerID",
        "yearID",
        "lgID"
      ]
    }
  ],
  "output": {
    "type": "leaderboard",
    "column": "HR",
    "direction": "desc"
  }
}

Below is a list of required validations.

Dataset:

  • Dataset ID should be from the allowed set of datasets (currently just "Lahman_Batting")

Output:

  • Output parameter "type" should be from allowed set of output types (currently just "leaderboard")
  • Output parameter "column" should be the name of a single column from the set of selected and/or defined columns as of the end of all transformations
  • Output parameter "direction" should be one of "desc" or "asc"

ColumnSelect and RowSum Transformation:

  • Entries in the "columns" list should be the name of an existing column, with respect to any previously executed transformations.
  • After the ColumnSelect transformation, all columns not present in the "columns" list are lost.
  • After the RowSum transformation, all string-valued columns not present in the "columns" list are lost.

RowSelect Transformation:

  • "column" should be the name of an existing column, with respect to any previously executed transformations.
  • "operator" should be one of <, >, <=, >=, =, or !=.
  • "criteria" should be either a number or string, and not an expression.
  • The type of the criteria (number or string) should match the type of the corresponding column chosen.

ColumnDefine Transformation:

  • "column" should be a unique name for the new column being defined, and should not conflict with the name of any existing column, with respect to any previously executed transformations
  • "expression" should be a valid mathematical expression using only scalar values (strings or numbers) or the names of existing columns, with respect to any previously executed transformations.
  • "expression" may use the following numerical operators: +, -, *, /, ^
  • After the ColumnDefine transformation, a new column with the given name is added.
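
A condensed sketch of these checks, tracking the set of available columns through each transformation (the hard-coded column list and types are assumptions):

# Condensed validation sketch for SubmitJob. DATASET_COLUMNS and the
# column types are assumptions; expression checking is sketched in the
# Column Define issue below.
DATASET_COLUMNS = {"Lahman_Batting": {"playerID": "string", "yearID": "number",
                                      "lgID": "string", "HR": "number",
                                      "AB": "number"}}
OUTPUT_TYPES = {"leaderboard"}
ROW_OPERATORS = {"<", ">", "<=", ">=", "=", "!="}

def validate_job(job):
    columns = dict(DATASET_COLUMNS[job["dataset"]])  # KeyError -> unknown dataset
    for t in job["transformations"]:
        if t["type"] == "columnSelect":
            unknown = set(t["columns"]) - set(columns)
            assert not unknown, f"unknown columns: {unknown}"
            columns = {c: columns[c] for c in t["columns"]}  # others are lost
        elif t["type"] == "rowSelect":
            assert t["column"] in columns, "unknown column"
            assert t["operator"] in ROW_OPERATORS, "unsupported operator"
            # Criteria type (number vs string) must match the column type.
        elif t["type"] == "columnDefine":
            assert t["column"] not in columns, "column name conflict"
            # Expression must reference only existing columns (see grammar sketch).
            columns[t["column"]] = "number"
        elif t["type"] == "rowSum":
            keep = set(t["columns"])
            assert keep <= set(columns), "unknown columns"
            # String-valued columns outside the key set are lost.
            columns = {c: ty for c, ty in columns.items()
                       if ty == "number" or c in keep}
    out = job["output"]
    assert out["type"] in OUTPUT_TYPES, "unknown output type"
    assert out["column"] in columns, "output column not present after transforms"
    assert out["direction"] in ("asc", "desc"), "bad direction"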

UI: Allow configuration of Column Define transformation

Parameters:

  • Column (name of column)
  • Expression (must conform to grammar)

In general, the expression should be either a string constant OR a mathematical expression which allows:

  • Plus, Minus, Multiply, Divide
  • Exponents
  • Parentheses
  • Column references of the form $('COL')
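
As a sketch, that grammar can be enforced with a reference-extraction pass plus a character whitelist; the exact identifier rule inside $('COL') is an assumption:

# Sketch of expression validation against the grammar above. The identifier
# rule ([A-Za-z_][A-Za-z0-9_]*) inside $('COL') is an assumption, and string
# constants (the other allowed case) are not handled here.
import re

COLUMN_REF = re.compile(r"\$\('([A-Za-z_][A-Za-z0-9_]*)'\)")

def validate_expression(expression, existing_columns):
    # Every $('COL') reference must name an existing column.
    refs = COLUMN_REF.findall(expression)
    unknown = set(refs) - set(existing_columns)
    if unknown:
        return f"unknown columns: {sorted(unknown)}"
    # After replacing references, only numbers, the operators + - * / ^,
    # parentheses, and whitespace may remain.
    remainder = COLUMN_REF.sub("0", expression)
    if not re.fullmatch(r"[0-9.+\-*/^() \t]*", remainder):
        return "expression contains unsupported characters"
    return None  # valid

For example, validate_expression("2*($('HR')+$('BB'))", {"HR", "BB"}) passes, while a reference to an undefined column or a stray character returns an error message that the UI can surface client-side.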
