misc-data-science-projects

Here's my collection of tutorials and annotated notebooks involving data structures, transformation, analysis, visualization, and/or machine learning. I primarily use these notebooks as my own code-snippet glossary, but most are written in an explanatory fashion, and anyone reading this is free to reuse any code here in any manner they wish.

Classification

Text Classification: Predicting Star Ratings Using Yelp Dataset

Link to notebook.

Link to dataset.

Supervised classification on the official Yelp dataset that tries to predict how many stars a review will get based on its text content. Uses word count vectorization, SVM classification, and visualization of word frequency and feature importance. (Python).
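For readers new to word count vectorization, here is a minimal standard-library sketch of the idea. It is not the notebook's code: the function names `tokenize` and `count_vectorize` and the two sample reviews are made up for illustration, and the SVM step is left out.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_vectorize(documents):
    """Return (vocabulary, matrix), where matrix[i][j] is how often
    vocabulary[j] appears in documents[i]."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is every distinct token seen, in sorted order.
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical reviews, not taken from the Yelp dataset.
reviews = ["Great food, great service", "Terrible service"]
vocabulary, matrix = count_vectorize(reviews)
```

Each row of `matrix` is the feature vector for one review, which is the kind of input a classifier such as an SVM would be trained on.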

Binary Classification: Defaulting on Loans with Kaggle Credit Dataset

Link to notebook.

Link to dataset.

Kaggle project: predicting whether or not a prospective borrower will default on a loan based on credit data. A "real world" problem in the sense of throwing multiple models against the wall and testing what quantitatively sticks. Includes different imputation methods, outlier detection, feature engineering, cross-validation, AUC evaluation, and ultimately uses XGBoost. (R).
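As a quick illustration of what AUC measures, here is a small Python sketch (the notebook itself is in R, and this is not its code). It computes AUC as the fraction of positive/negative pairs in which the positive example receives the higher score, counting ties as half. The function name `roc_auc` and the sample labels and scores are assumptions for the example.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen positive example
    is scored above a randomly chosen negative one (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = default) and predicted default probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
```

The pairwise loop is quadratic, so it is only meant for small examples; a library implementation would be the practical choice on real data.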

Image Classification: Recognizing Handwritten Arabic Letters

Link to notebook.

Quick proof-of-concept for an SVC that can identify images of handwritten Arabic letters. Mostly just experimenting with image processing libraries here. (Python).

Linear Regression

Multiple Regression: Boston Housing

Link to notebook.

Link to dataset.

Multiple regression run on the classic Boston Housing dataset. Analyzes how factors such as crime rate, economic status, and commercial zoning may affect housing values across neighborhoods in 1970s Boston. Includes linear regression, multiple regression, feature selection, and learning curves. (Python).
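For reference, the single-predictor case of linear regression has a closed-form least-squares solution, sketched below in plain Python. This is an illustrative sketch rather than the notebook's code; `fit_line` and the toy data are made up, and the notebook's multiple regression would use more than one predictor.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data lying exactly on y = 2x + 1 (not from the Boston Housing dataset).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
```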

Exploring SAT Scores in New York

Link to notebook.

Link to dataset.

Regression analysis of SAT scores in New York Public Schools against factors like income, ethnicity, crime levels, and other variables that may indicate discrimination. Dataset sourced from the New York Open data project. (R).

Exploratory Analysis / Data Visualization

Annotated Heat Maps and Dendrograms in R

Link to notebook.

Heatmap visualization with annotations and dendrograms of categorical-heavy data from a chemistry PhD research experiment. Mostly visual and includes hierarchical clustering. (R).

Choropleths & Other Maps with Spatial Data Visualization

Link to notebook.

Color-coded choropleth mapping with spatial data using two examples: crime levels across neighborhoods in San Francisco and agricultural production across different parishes in Uganda. Techniques used here include fuzzy matching, GIS shapefile manipulation with rgdal, and choropleths and heatmaps with ggplot2. (R).
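Fuzzy matching is typically needed when place names in a dataset do not exactly match the names in a shapefile. The notebook does this in R; the sketch below shows the same idea in Python using the standard library's `difflib`. The helper `match_names`, the 0.8 cutoff, and the sample names are assumptions for illustration.

```python
import difflib


def match_names(raw_names, canonical_names, cutoff=0.8):
    """Map each raw name to its closest canonical name,
    or to None if nothing is similar enough."""
    # Compare case-insensitively, but return the canonical spelling.
    lookup = {name.lower(): name for name in canonical_names}
    matches = {}
    for raw in raw_names:
        candidates = difflib.get_close_matches(
            raw.lower(), list(lookup), n=1, cutoff=cutoff
        )
        matches[raw] = lookup[candidates[0]] if candidates else None
    return matches


# Hypothetical names: the first list stands in for a shapefile's labels,
# the second for messy labels in a data table.
shapefile_names = ["Mission", "Tenderloin", "Bayview"]
data_names = ["mission", "Tenderlion", "Bay View", "Unknown"]
matches = match_names(data_names, shapefile_names)
```

Names that map to `None` are the ones worth checking by hand before joining the table to the map.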

EDA: Investigation on US Substance Trends

Link to notebook.

Link to dataset.

Exploratory analysis of the prevalence of recreational drug use in the US, based on the publicly available National Survey on Drug Use and Health (NSDUH), against various socioeconomic variables. Lots of data visualization in Matplotlib/Seaborn. This dataset contains only a single year, but the NSDUH has a couple of decades of annual data: analyzing trends over time may make a very interesting project extension for someone who can scrape it. (Python).

EDA: Census Comparison Between Three Cities in Tennessee

Link to notebook.

Simple data visualizations of income and demographics from three different cities in Tennessee: one rural town, one mid-sized suburban city, and one urban metropolitan area. Data retrieved from the 2016 American Community Survey released by the US Census Bureau. (Python).

Hash Tables

Arabic Verb Conjugator Tool

Link to notebook.

Complete script for generating Arabic verb conjugations, implemented using a neat (but maybe superfluous) hash table. (Python).
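To give a flavor of the hash-table approach, here is a heavily simplified sketch, not the notebook's actual implementation. It uses a Python `dict` that maps pronouns to past-tense suffixes for a regular Form I verb, written in transliteration. The names `PAST_SUFFIXES` and `conjugate_past` are made up, and only a handful of pronouns are covered.

```python
# Past-tense suffixes for a regular (sound) Form I verb, keyed by pronoun.
# Transliterated, with long vowels doubled; a small subset for illustration.
PAST_SUFFIXES = {
    "ana": "tu",     # I
    "anta": "ta",    # you (masculine singular)
    "anti": "ti",    # you (feminine singular)
    "huwa": "a",     # he
    "hiya": "at",    # she
    "nahnu": "naa",  # we
}


def conjugate_past(stem, pronoun):
    """Look up the suffix for the pronoun and attach it to the stem."""
    try:
        return stem + PAST_SUFFIXES[pronoun]
    except KeyError:
        raise ValueError(f"unsupported pronoun: {pronoun!r}") from None


# "katab" is the past stem of the verb "to write".
forms = {pronoun: conjugate_past("katab", pronoun) for pronoun in PAST_SUFFIXES}
```

The dictionary lookup replaces a chain of `if`/`elif` branches, which is the main appeal of the hash-table design.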

Wrappers and API Demos

Wrapper: Harvard Case Law API

Link to notebook.

Link to dataset.

Demo and Python wrapper contributed to the Harvard Case Law Project, a massive corpus database of "every volume designated as an official report of decisions by a court within the United States." Includes easy functions for pulling, searching, and downloading individual cases through the API. (Python).

Tutorial: Requests with NHL API

Link to notebook.

Tutorial for using the requests library to retrieve data from APIs. Simple notebook for reference and examples. (Python).

Tutorial: Google Cloud Natural Language API

Link to notebook.

Tutorial for the Google Cloud Natural Language API. Demonstrates how to use the API to tokenize texts and parse them for sentiment scores, and suggests some possible ideas for visualizations. (Python).
