chicago-crime-outliers

Look for outliers in Chicago's crime statistics.

Motivation

This repository is my solution to a data science challenge project given by an interviewer.

Idea

I will attempt to use the database of crimes in Chicago to predict, online, which crimes are outliers. For the purposes of this study, a crime is considered an outlier if its type is significantly mispredicted by a trained online predictor.

The outlier detection procedure is as follows. Offline, two steps are performed:

  • Train a classifier to predict the class C of a crime from four input variables (time of day, longitude, latitude, location description).
  • Calculate a windowed running average of the prior proportion of crimes p_C for each class C.
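
The second offline step can be sketched in plain Python. This is only an illustration under assumptions, not the repository's code: the class name `WindowedPrior`, the window size of 4, and the example labels are all hypothetical, and the classifier training from the first step is not shown.

```python
from collections import Counter, deque


class WindowedPrior:
    """Running proportion of each crime class over the last `window` incidents.

    Hypothetical helper for illustration; not taken from the repository.
    """

    def __init__(self, window):
        if window <= 0:
            raise ValueError("window must be positive")
        self.window = window
        self.recent = deque()    # class labels currently inside the window
        self.counts = Counter()  # label -> number of occurrences in the window

    def update(self, label):
        """Add a newly observed label and evict the oldest one if the window is full."""
        self.recent.append(label)
        self.counts[label] += 1
        if len(self.recent) > self.window:
            oldest = self.recent.popleft()
            self.counts[oldest] -= 1
            if self.counts[oldest] == 0:
                del self.counts[oldest]

    def proportion(self, label):
        """Return p_C, the share of `label` inside the window (0.0 if unseen)."""
        if not self.recent:
            return 0.0
        return self.counts[label] / len(self.recent)


# Made-up label stream; with window=4 the first "THEFT" is evicted.
prior = WindowedPrior(window=4)
for label in ["THEFT", "THEFT", "BATTERY", "ARSON", "THEFT"]:
    prior.update(label)
```

After the loop the window holds two thefts, one battery and one arson, so `prior.proportion("THEFT")` is 0.5 and `prior.proportion("ARSON")` is 0.25. A real window would be far larger; the size is a tuning choice the text does not specify.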

Then, for each incoming data point, the following steps are performed:

  • Predict the class probabilities P_C from the classifier.
  • Define an outlier score for the incident as (P_C - p_C) / p_C, where C is the true class of the incident, P_C is its predicted probability, and p_C is its windowed prior proportion.
  • A very low score (close to -1, the minimum, reached when P_C = 0) means the data point is an outlier.
  • Conversely, a very high score means the data point is a "most probable" data point.
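
As a minimal sketch of the score itself, the function below computes (P_C - p_C) / p_C for one incident. The function name, the dictionary-based inputs, and the example numbers are assumptions made for illustration; the repository may structure this differently.

```python
def outlier_score(predicted, priors, true_class):
    """Return (P_C - p_C) / p_C for the true class C of one incident.

    predicted: dict mapping class -> classifier probability P_C
    priors:    dict mapping class -> windowed prior proportion p_C
    Hypothetical helper for illustration only.
    """
    p = priors.get(true_class, 0.0)
    if p <= 0.0:
        # The score is undefined when the true class has no prior mass.
        raise ValueError("prior proportion must be positive for the true class")
    return (predicted.get(true_class, 0.0) - p) / p


# Made-up probabilities for a single incident.
predicted = {"THEFT": 0.70, "BATTERY": 0.25, "ARSON": 0.05}
priors = {"THEFT": 0.50, "BATTERY": 0.25, "ARSON": 0.25}

score_if_arson = outlier_score(predicted, priors, "ARSON")
score_if_theft = outlier_score(predicted, priors, "THEFT")
```

With these numbers, an incident whose true class is arson scores about -0.8 (the classifier gave it far less probability than its prior), while one whose true class is theft scores about 0.4. The score is bounded below by -1 but unbounded above, and the upper bound grows as p_C shrinks.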

Conclusions

I spent around 4-5 hours on this project. Here are some brief conclusions of the study:

  • The results/ directory contains some HTML pages with sample visualizations of crime locations.
  • In retrospect, attempting the project with online learning was difficult given so little available time. Ready-to-use online methods are less widely available and harder to use than offline methods. As a result, midway through the project I switched to an offline method.
  • The top outlier and most-probable data points (see results/) all fall into a few crime types (for instance, Arson tends to be highly represented among the top outliers). This seems to indicate that the normalization in the outlier score is not quite right, and that uncommon crimes could be leading the outlier table despite the attempt at normalization.
  • Before switching to an offline random forest, I tried online logistic regression. As one would expect, with so few features it was highly biased.

Things that can be improved

Obviously, this is a first stab at the problem. Plenty of things can be improved:

  • More features can be used.
  • The mapping of qualitative features (location type) to numeric values is slightly inappropriate, though a random forest should handle this better than some other classifiers.
  • The visualization (map plotting) presently plots only nearby locations. It would be good to also narrow it down to incidents at a similar time.
  • A more formalized approach to outlier detection could be used.
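
For the qualitative-feature point above, one common alternative to assigning arbitrary integers is one-hot encoding, which avoids implying an order between location types. The sketch below is a plain-Python illustration of that idea, not something the repository does; the function name and the example location values are hypothetical.

```python
def one_hot_encoder(categories):
    """Build a function that maps a category to a 0/1 indicator vector.

    Hypothetical helper for illustration only.
    """
    # Sort so that the column order is deterministic.
    index = {cat: i for i, cat in enumerate(sorted(set(categories)))}

    def encode(value):
        vec = [0] * len(index)
        if value in index:  # categories not seen at build time stay all zeros
            vec[index[value]] = 1
        return vec

    return encode


# Made-up location descriptions.
encode = one_hot_encoder(["STREET", "RESIDENCE", "APARTMENT", "STREET"])
street_vec = encode("STREET")
unseen_vec = encode("ALLEY")
```

Here the columns are ordered APARTMENT, RESIDENCE, STREET, so `street_vec` is `[0, 0, 1]` and the unseen value maps to `[0, 0, 0]`. The trade-off is one column per distinct location type, which can get wide if the field has many values.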

Contributors

msipos
