mareoraft / tennis

Tennis stats capstone project for The Data Incubator

Home Page: http://learnnation.org/tennis.html

Python 40.15% Dockerfile 4.68% JavaScript 45.54% HTML 2.71% CSS 3.15% SCSS 3.76%
tennis

tennis's Introduction

Tennis Capstone Project

This is the repo for my Tennis Capstone Project for The Data Incubator. The actual deployed web app is here. Both the frontend/ and backend/ directories contain README files for developers. THIS README describes the project itself.

Business Objective

Bring insightful tennis stats to tennis fans.

Tennis is the 4th most popular sport in the world [1]. The objective is to bring valuable information and insights about professional players to fans.

Information is valuable. We will provide both data and predictions to people through interactive visualizations. In addition to tennis hobbyists, people who gamble on tennis would find the info particularly beneficial. Professional tennis players themselves would find it helpful for identifying weaknesses to improve in their own game or to exploit in an opponent's.

The web app could be monetized by offering basic features for free and charging for advanced features. For example, the basic version may only allow you to compare at most 5 players, but the advanced version may allow you to compare 14.

Data Ingestion

Data will be combined, processed, and updated periodically.

The data comes from two CSV files that are posted at [2]. I plan to add match-level stats in the future which will require additional data from [2], [3], [4], [5], or [6].

The data is loaded with pandas, whittled down, combined, and processed into the information we need. In particular, text-splitting and regular expressions are used to pull player info out of 1 column here; maps are used to create new columns from existing column combinations; and then data is aggregated per-player here. For the PageRank algorithm (see code here), point result information is aggregated per player-pair and a weighted directed graph is created. NetworkX then computes the PageRank.
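
As a rough illustration of this cleaning step (the column names below are hypothetical; the real CSVs at [2] have their own schema), the pandas side might look like:

import pandas as pd

# toy stand-in for the raw CSV rows
df = pd.DataFrame({
    'server_info': ['Federer R. (SUI)', 'Karlovic I. (CRO)'],
    'aces': [11, 23],
    'double_faults': [2, 3],
    'serve_points': [80, 95],
})

# text-splitting / regex: pull player name and country out of one column
extracted = df['server_info'].str.extract(r'^(?P<player>.+?) \((?P<country>[A-Z]{3})\)$')
df = pd.concat([df, extracted], axis=1)

# map existing columns into a new one (e.g. serve points that were not aces)
df['non_ace_serve_points'] = df['serve_points'] - df['aces']

# aggregate per player
per_player = df.groupby('player')[['aces', 'double_faults', 'serve_points']].sum()
print(per_player)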

The ingestion pipeline is fully automated (it is enough to run this function) and I plan to rerun it periodically on the latest-and-greatest professional tennis data (the source data is updated every few months).

Visualizations

The project contains a bar chart which is used for both the stats comparisons and the PageRank comparison. There are six controls for interacting with the data, in addition to the zoom interactivity of the amChart itself.

Interactive Website

Users interact with the project via a website. Users explore the data by choosing a (1) statistic, (2) normalization, (3) gender, and some other options. Users can click on info buttons to get explanations of the various choices and methods used to compute the data.

The user interactivity is client-side, and the client will make calls to the server to update the data as necessary. Tools used to achieve this include JavaScript, React, Material-UI, amCharts, Python 3, Flask, Pandas, and Networkx.
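
For concreteness, here is a hedged sketch of the server side of that interaction (the /stats route and its parameters are hypothetical, not the app's actual API):

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/stats')
def stats():
    # hypothetical query parameters mirroring the front-end controls
    stat = request.args.get('stat', 'aces')
    normalization = request.args.get('normalization', 'percent')
    gender = request.args.get('gender', 'men')
    # in the real app this would be computed by the pandas/NetworkX pipeline
    data = [{'player': 'Ivo Karlovic', 'value': 13.5}]
    return jsonify(stat=stat, normalization=normalization, gender=gender, data=data)

The React client then fetches this JSON and hands it to the amCharts bar chart.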

Analysis and Results

The statistics are calculated as follows:

  • Points won: the number of points a player has won. When normalized by percentage, it is divided by the number of points they have played.
  • Service points won: the number of points a player won when serving. As a percentage, the denominator is the total number of service points they played.
  • Aces: the number of aces a player hit. As a percentage, the denominator is the number of service points they played.
  • Double faults: the number of double faults the player had. As a percentage, the denominator is the number of service points they played.

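For example, with made-up career totals for a single player, the percent normalizations above reduce to:

# hypothetical career totals for one player
points_played = 2000
points_won = 1090
service_points_played = 1000
service_points_won = 640
aces = 97
double_faults = 40

points_won_pct     = 100 * points_won / points_played                     # 54.5%
service_points_pct = 100 * service_points_won / service_points_played     # 64.0%
ace_pct            = 100 * aces / service_points_played                   # 9.7%
double_fault_pct   = 100 * double_faults / service_points_played          # 4.0%
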
The GOAT algorithm is the Google PageRank algorithm applied to the following graph definition: Each player is represented by exactly 1 node. If A and B are nodes, then the directed edge (A, B) has an integer weight which is the number of points that player A lost to player B.
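
A minimal sketch of that graph with made-up point counts (the player names and numbers are illustrative only):

import networkx as nx

# edge (A, B) carries the number of points A lost to B
G = nx.DiGraph()
G.add_weighted_edges_from([
    ('Nadal', 'Federer', 1200),     # Nadal lost 1200 points to Federer
    ('Federer', 'Nadal', 1150),
    ('Djokovic', 'Federer', 900),
    ('Federer', 'Djokovic', 870),
])

# PageRank over the weighted directed graph; a high score means a player
# takes many points off opponents who themselves take many points
scores = nx.pagerank(G, weight='weight')
print(sorted(scores.items(), key=lambda kv: -kv[1]))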

The time decay normalization is the same as the percentage normalization with the following difference: more recent points are weighted more heavily than points that happened a long time ago. We use a 1-year half-life exponential decay function, so that a point that occurred 1 year ago is only worth half as much as a point that happened today. In the percent normalization, a single point contributes 1 to the denominator and either 1 or 0 to the numerator. In the time decay normalization, a single point that occurred y years ago contributes (1/2)^y to the denominator and either (1/2)^y or 0 to the numerator.
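
A minimal sketch of that weight, assuming the 1-year half-life (the dates are arbitrary):

from datetime import date

def point_weight(point_date, today):
    """Weight of a single point under a 1-year half-life decay."""
    years_ago = (today - point_date).days / 365.25
    return 0.5 ** years_ago

# a point from one year ago counts about half as much as a point from today
print(point_weight(date(2019, 6, 1), date(2020, 6, 1)))  # ~0.5
print(point_weight(date(2020, 6, 1), date(2020, 6, 1)))  # 1.0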

The following are some selected results from the analysis:

stat             aces                    double-faults           points-won             The GOAT Algorithm
normalization    percent                 percent                 percent                raw count
#1 player        Ivo Karlovic 13.5%      Goran Ivanisevic 4.1%   Evgeny Donskoy 55.7%   Roger Federer 4.5%
#2 player        Goran Ivanisevic 9.7%   Noah Rubin 4.0%         Thomas Muster 54.6%    Rafael Nadal 3.1%
#3 player        John Isner 9.7%         Matthew Ebden 4.0%      Igor Sijsling 54.5%    Novak Djokovic 2.7%

tennis's People

Contributors

mareoraft


tennis's Issues

add **time-decay** normalization

  • 1. Add a date column to df

  • 2. Add a weight column to df, or use a function in aggregation (weight = (1/2)^(today - date))

  • 3. Do an agg where you multiply things by weights before summing (see the sketch below)
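
A sketch of those three steps on a toy DataFrame (the column names are hypothetical, not the repo's actual schema):

import pandas as pd

df = pd.DataFrame({
    'player': ['A', 'A', 'B'],
    'date': pd.to_datetime(['2019-01-01', '2020-01-01', '2020-01-01']),   # step 1
    'won': [1, 0, 1],   # 1 if the player won the point
})

today = pd.Timestamp('2020-06-01')
years_ago = (today - df['date']).dt.days / 365.25
df['weight'] = 0.5 ** years_ago   # step 2

# step 3: multiply by weights before summing, per player
decayed_pct = df.groupby('player').apply(
    lambda g: 100 * (g['won'] * g['weight']).sum() / g['weight'].sum()
)
print(decayed_pct)   # time-decayed "points won" percentage per player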

MemoryError backend

The backend container in PROD hits a MemoryError after switching a few times between different dropdown options on the website. Actually, the FIRST error I see is:

DAMN ! worker 1 (pid: 13) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 1 (new pid: 15)
  • is it taking up more memory for each selection?
  • are we caching something too big?
  • is the server itself out of memory? --> well, I freed up at least 1.3 GB (maybe as much as 2.6 GB), and the issue persists

PROD flask server

We have yet to use a trustworthy server to serve the backend in PRODUCTION.

For development we are using the built-in Flask development server; hence we kick it off with python3 main.py.

For prod I recall trying gunicorn but running into an issue. Let's try something else, and come back to gunicorn if that doesn't work.

CORS error

Below is a copy of my SO question that I'm NOT posting, because switching to a production Flask server and using http://162.243.168.182:5001 instead of http://clementine:5001 appears to fix the CORS error.

subject:
Why is Flask-Cors not detecting my Cross-Origin domain in production?

body:
My website has a separate server for the front-end and back-end, and so my back-end server needs to open up CORS permissions so that the front-end can request data from it.

I am using Flask-Cors successfully in development, but it doesn't work when I deploy to production. (please note that I have looked at other flask-cors questions on SO, but none of them fit my situation)

Here is the relevant code that is working in development:

# 3rd party imports
import flask
from flask import Flask, request, redirect, send_from_directory, jsonify
from flask_cors import CORS

# Create the app
app = Flask(__name__)
CORS(app, origins=[
  'http://localhost:5001',
])

# Define the routes
@app.route('/')
def index():
  # no CORS code was necessary here
  app.logger.info(f'request is: {flask.request}')
  # placeholder response so the view returns something
  return jsonify(status='ok')

What I've tried:

  • Adding my server's IP address 'http://162.243.168.182:5001' to the CORS list is not enough to resolve the issue, although I understand it should be there.
  • It seems that using '*' to allow ALL origins does not work either. (very suspicious!)

Please note that I am using a Docker container, so my environments in development and prod are almost identical. What's different is that I'm on a different server and I've modified the front-end to send the request to the new IP address (resulting in the famous missing “Access-Control-Allow-Origin” header CORS error).

Now I'm wondering if the flask.request object is somehow missing information, and this causes Flask-Cors to not send the Access-Control-Allow-Origin header like it's supposed to. I can provide that logging info if you think it would help!

add SQL Database

Let's make an SQL DB (maybe SQLite) and populate the stats into it, then use SQLAlchemy or pyodbc to pull the data from the database and hand it over.
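
A sketch of one way to do this with SQLite and pandas (the table and column names are made up; the same to_sql/read_sql calls also accept an SQLAlchemy engine):

import sqlite3
import pandas as pd

# pretend this came out of the ingestion pipeline
stats = pd.DataFrame({
    'player': ['Ivo Karlovic', 'Goran Ivanisevic'],
    'ace_pct': [13.5, 9.7],
})

with sqlite3.connect('tennis.db') as conn:
    # populate the stats into the DB ...
    stats.to_sql('player_stats', conn, if_exists='replace', index=False)
    # ... and pull them back out for the API layer
    out = pd.read_sql('SELECT player, ace_pct FROM player_stats ORDER BY ace_pct DESC', conn)

print(out)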

support ngrams

Add a feature to convert CDR3 sequences into overlapping n-gram sequences before comparison, so that the n-gram sequences are compared instead of the CDR3 sequences.

Example:

CDR3 seq     ->    n-gram seq
A,L,P        ->    A,L,P      (n=1)
             ->    AL,LP      (n=2)
             ->    ALP        (n=3)
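
A minimal sketch of that conversion (the function name is hypothetical):

def to_ngrams(seq, n):
    """Convert a sequence of residues into overlapping n-grams."""
    return [''.join(seq[i:i + n]) for i in range(len(seq) - n + 1)]

cdr3 = ['A', 'L', 'P']
print(to_ngrams(cdr3, 1))  # ['A', 'L', 'P']
print(to_ngrams(cdr3, 2))  # ['AL', 'LP']
print(to_ngrams(cdr3, 3))  # ['ALP']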
