Coder Social home page Coder Social logo

portfolio's Introduction

Table of Contents

  1. Data Science Portfolio
    1. Data Science Projects
    2. Shiny Dashboards
    3. Frequently Used Packages
    4. IDEs

Data Science Portfolio

This is a repository for some of my past projects in data science.

Data Science Projects

  • Imbalanced Classification on Bankruptcy Prediction: This project employs several techniques, including SMOTE sampling and ROC curve, to tackle the inherent imbalance in bankruptcy data and builds machine learning models for bankruptcy prediction. Repeated cross-validation and alternative performance measures are used to evaluate models. Best models are selected based on cv ROC AUC values. The project is carried out using R, especially focusing on the usage of model training package caret. Here is a link to the project repository.

  • Deep Dive into Airbnb Listing Data: This is the product of an one-day datathon hosted by Citadel LLC. We had seven hours to come up with a data analysis report using Airbnb listing data. We ended up being the third place out of 24 teams. Our report delves into aspects from investigating the relationships between keywords in listing names and listing price, to using machine learning techniques to predict listing price based on the housing condition.

  • Analyzing Movie Scores on IMDb and Rotten Tomatoes: I love movies. I'm also intrigued by how movie websites rate movies and the fact that their ratings on the same movie can be vastly different. This project uses data visualization in ggplot2 to demonstrate IMDb and RT's different behaviors in reviewing movies. I also made some top 10 lists of directors, actors or movies based movie ratings. Computational tools used in this project include SparkSQL, Python and R. Here is a link to the project repository.

  • Overwatch Player Statistics and Hero Usage Analysis: Scraped data from the Overwatch career profile pages of 16,500 Overwatch players. Used data visualization to investigate the relationships between player performance, hero usage and skill rating. I also performed PCA and clustering analysis to categorize different types of players based on their hero usage patterns. Link to the project repository.

  • Injury Analysis for U of M Soccer Team: I was in a multidisciplinary team of four U of M students. We were tasked by the university athletics department to analyze soccer team's training data and injury log. The objective is to find out correlations between training intensity and injury occurrence.

Shiny Dashboards

These are Shiny web apps I created for some of my projects. The web pages can take some time to load as the underlying data manipulation and plotting can be complex sometimes.

  • MovieStats: An interactive web app to learn insights from movie ratings: This is an interative web application I created using R Shiny. Data comes from MovieLens Latest Datasets, OMDb API, and information scraped from IMDb. The goal is to provide a visually compelling tool to deliver insights into movie ratings, box office, as well as analytics for individual actors and directors. The app also features some interesting movie facts hidden with data. The interactive data visualizations in this app are built with Plotly.

  • Low-value Imaging Measures Specification: A dashboard I created during my internship at the University of Michigan Health System. The purpose of this dashboard is to display information of the measures we used to identify low-value imagings. It contains both high level descriptions of measures and detailed tables of ICD codes used in the measures. The ICD-10 tables are currently faulty due to ongoing development of the underlying R package that I wrote for the project.

Frequently Used Packages

  • R
    • tidyverse: A collection of packages that can tackle complex data manipulation, cleaning and visualization tasks, including ggplot2 for plotting, dplyr and tidyr for data manipulation and cleaning, lubridate for handling date objects and stringr for string operations.
    • caret: Stands for "Classification and Regression Training. A set of functions that streamline the entire process of model fitting. It covers most popular statistical models and machine learning techniques such as logistic regression, random forests, SVM, neural networks, and xgBoost.
    • rvest: A great package that enables powerful and flexible web scraping in R.
    • shinydashboard: All my shiny apps are built around this package. The package includes functions that make creating organized and appealing layouts a very easy and pleasant experience.
  • Python
    • numpy: Vectorization in Python.
    • pandas: Introduces data series and data frames to Python. Makes working with data more pleasant in Python.
    • scikit-learn: Machine learning library for Python. Just like caret in R, includes most tools in model fitting.

IDEs

  • RStudio: I used it when I first started to learn R. It has a great layout and the best R code completion I've seen. But eventually I ditched it for the flexibility and efficiency of Vim and the constant need to go into terminal.
  • Vim + Tmux + Slime: I used Vim for all my text file editing later on. Tmux can help me create custom layout and Slime plugin for Vim can send any code from Vim to a terminal/console. I used this combo for both R and Python. It works well with IPython and Pyspark as well.
  • Spacemacs: Spacemacs is a community-driven Emacs distribution that takes the advantages of Vim and Emacs. It can launch terminals, R and Python sessions all within the same screen. And all commands are just a few key strokes away. Most of my data science work can be done in Spacemacs and I rarely need to go to another screen other than Chrome. On top of that, it has wonderful documentation both online and built-in.

My setup for the last two can be found here.

portfolio's People

Contributors

songxh0424 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.