Coder Social home page Coder Social logo

artikandri / march_madness Goto Github PK

View Code? Open in Web Editor NEW

This project forked from haanh764/march_madness

0.0 0.0 0.0 4.68 MB

A simple data warehouse system to serve data in form of graphs to user. Written almost entirely in Python. BulmaCSS is used for styling.

Python 96.65% CSS 3.35%

march_madness's Introduction

Data Warehouse Project: March Madness Analytics Documentation

This repository contains source code for our ELT pipeline based on AWS. AWS S3 is used as data lake, there are 3 separated buckets serving:

  • Source data
  • ML models and required data for ML model
  • Dashboard data and dashboard content

buckets.png

The whole architecture is as followed

arch.png

1. Main Components

AWS Components

  • S3: used as data lake
  • Sagemaker: it is a fully managed Machine Learning service, which make it easier to build, train, and deploy ML model. We use it to build, train and deploy ML model to predict the winning probability of each basketball team. The result of probabilities will be shown on dashboard
  • Lambda: is AWS's function work as a service (FaaS) platform. We deploy our dashboard (a web application) serverless as an AWS Lambda function. The AWS Lambda function only runs when it is requested, and it could be scaled down or up based on demand.

Other components

  • Zappa: is a Python serverless microframework for AWS Lambda and API Gateway. It allows us to build, deploy, and manage serverless applications. We use it to deploy and manage our dashboard on AWS Lambda.
  • Dash + Plotly: Dash is a Python framework for building web-based analytical applications. It allows us to build interactive and responsive web applications using Python code. We use Dash server to build our dashboard as an wsgi application. Our charts are created with Plotly (a library provides interactive chart creation ability).

2. Project Structure

march_madness
│   README.md
│   requirements.txt
|   zappa_settings.json
|   app.py
│
└───data_transformation
│   │   transformation.py
|
└───assets
│   
└───ml
    │   handled_new_data.py
    │   model.py
    |   train.py
    |   utils.py
  • requirements.txt: contains required libraries
  • zappa_settings.json: contains zappa config for dashbpard deployment on AWS Lambda
  • app.py: main Dash app for dashboard, contains source code for creating charts
  • data_transformation: folder contains source code to transform source data into data needed for dashboard
  • ml: folder contains source code for ML model
  • assets: contains styling for dashboard

3. Install and Setup

Setup environment:

  • Clone repository: git clone https://github.com/haanh764/march_madness.git
  • Create new conda environment: conda create -n march-madness python=3.9.15
  • Activate march-madness environment: conda activate march-madness
  • Change to project directory and install the requirements.txt: cd march_madness/ && pip install -r requirements.txt
  • Export env path to virtual env path: export VIRTUAL_ENV=/opt/conda/envs/march-madness (the path depends on your computer, you can check it py using which python)

Setup AWS:

  • Signup for AWS account and create AWS Access Keys (access key id and secret access key). Keep the created keys in credentials file, under /home/.aws folder (for Linux). The format of credentials file is as below (change default to your profile name)
[default]
aws_access_key_id = ************************
aws_secret_access_key = ************************
  • Create groups for dashboard and for sagemaker
  • For each group create and assign inliine policy to it. For the custom policy of dashboard, add the policy to create, invoke, and execute Lambda function. The sample of custom policy for it is as below (the *************** is the AWS account number):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "iam:GetRole",
                "iam:PassRole",
                "iam:CreateRole",
                "iam:AttachRolePolicy",
                "iam:PutRolePolicy"
            ],
            "Resource": "arn:aws:iam::***************:role/*-ZappaLambdaExecutionRole"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "lambda:CreateFunction",
                "lambda:ListVersionsByFunction",
                "logs:DescribeLogStreams",
                "events:PutRule",
                "s3:CreateBucket",
                "cloudformation:DescribeStackResource",
                "s3:ListBucket",
                "lambda:GetFunctionConfiguration",
                "apigateway:DELETE",
                "events:ListRuleNamesByTarget",
                "s3:AbortMultipartUpload",
                "apigateway:PATCH",
                "events:ListRules",
                "cloudformation:UpdateStack",
                "events:RemoveTargets",
                "lambda:DeleteFunction",
                "cloudfront:UpdateDistribution",
                "logs:FilterLogEvents",
                "s3:DeleteObject",
                "apigateway:GET",
                "lambda:GetAlias",
                "events:ListTargetsByRule",
                "cloudformation:ListStackResources",
                "events:DescribeRule",
                "s3:ListBucketMultipartUploads",
                "apigateway:PUT",
                "logs:DeleteLogGroup",
                "lambda:InvokeFunction",
                "lambda:GetFunction",
                "route53:ListHostedZones",
                "lambda:UpdateFunctionConfiguration",
                "s3:ListMultipartUploadParts",
                "cloudformation:DescribeStacks",
                "events:DeleteRule",
                "events:PutTargets",
                "s3:PutObject",
                "s3:GetObject",
                "lambda:UpdateFunctionCode",
                "lambda:AddPermission",
                "cloudformation:CreateStack",
                "cloudformation:DeleteStack",
                "apigateway:POST",
                "lambda:DeleteFunctionConcurrency",
                "lambda:RemovePermission",
                "lambda:GetPolicy"
            ],
            "Resource": "*"
        }
    ]
}
  • Create user and attach user to the created group, this user will have the permission rules applied on dashboard application
  • Copy the attached user's keys to the credentials file as describe above.

Setup Zappa

  • The setup for Zappa settings could be done by editing file zappa_settings.json. You can generate a new setting file by running command zappa init (remember to activate the virtual environment venv where zappa is installed first). Edit the "profile_name": "default" to the profile name corresponding to profile name was specified in /.aws/credentials
  • Change the "aws_region" to corresponding aws regions, also define s3_bucket as the bucket that will be used for keeping the dashboard content.
  • The format of zappa_settings.json is as below:
{
    "dev": {
        "app_function": "app.app",
        "profile_name": "default",
        "project_name": "march-madness",
        "runtime": "python3.9",
        "s3_bucket": "dashboard-march-madness",
        "aws_region": "eu-central-1",
        "environment_variables": {"DASH_REQUESTS_PATHNAME_PREFIX": "/dev/"}
    }
}

Deploy Dashboard

  • To deploy the dashboard, run command zappa deploy dev
  • Or, after make change in the dashboard, to re-deploy the dashboard, run the command zappa update dev
  • Check this link to see the update dashboard: https://z8eqecfs2f.execute-api.eu-central-1.amazonaws.com/dev

4. Data Transformation

To transform source files before analysis run transformation.py file in the data_transformation folder. Running this script requires providing credentials for boto3 connection. This script will load data from march-madness-src bucket and prepare it for usage in the dashboard scripts. It will change attributes and values to human readable strings and clean data, by checking and replacing any null or NaN values. Data transformation is very memory intensive, and will fail if there isn't enough memory (~8GB).

5. Dashboard

The dashboard was created with the help of Dash library from Plotly. This library transforms Python code to interactive dashboard based on the React framework. The dashboard is later customized with the help of Bulma CSS Framework to add simple styling. The structure is as below:

app.py
assets
    images
        logo.png
    0.reset.css
    1.bulma.min.css
    2.app.css
    app.js

0.reset.css resets the html to their default styling, while bulma.min.css adds simple styling by Bulma. Additional stylings were done in 2.app.css. Almost all of the dashboard code was written in app.py. In this file, the previously transformed data were loaded, then processed a bit before being loaded as graph with Plotly.

Dashboard Screenshots:

dash1.png dash2.png dash3.png

6. Machine Learning model

Creating model

To create a new model, it is necessary to run train.py script located in ml folder. Note that running this script requires providing credentials for boto3 connection. The script will fetch result files from march-madness-src bucket and divide the data so that all years before 2006 are used just as data aggregation about teams the last four years are for validation and testing. Note that the code is by default run on a GPU and will fail if no GPU instance exists. Running this code will generate a model checkpoint, a scaler that should be used for new data, as well as a new file that contains data aggregated about all matches the model had access to so far that needs to be used for predictions on any new matches. The script will also automatically generate predictions for the matches in the test set (last two years). Note that the winning/losing team is anonymized to Team 1 and Team 2 (assigned randomly) in the new file.

Making predcitions on new data

To make a prediction for a match that did not happen yet, you can use predict_match function located in handle_new_data script. The function needs s3_client, team IDs, season, day, city ID, and location which is a letter that stays whether this game is a home advantage for team provided as team 1, as well as access to the data stored in the march-madness-models bucket. To make predictions on new file with match results, a script named handle_new_data has a function called handle_new_data that takes in a timestamp that was used for creating files with new results. Running this file will load the model, create predictions and update the data aggregation used for predictions. Note that this process may be slow as all predictions try to use as much match results as possible while not seeing the future, so it runs sequentially.

Changing the model

All the model parameters, including training ones can be accessed in the model.py script. Note that any changes to the model require creating a new model from scratch.

march_madness's People

Contributors

haanh764 avatar artikandri avatar daedheldir avatar rikain avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.