mlab_intuit's Introduction

ML@B-Intuit Collaboration: Predicting Life

Goal: Create a model that can detect when a user is going through an important event in their life using the user's emails.

Dependencies

Quick Start

The main model scripts are random_forest_confusion_matrix.py, linear_model.py, and pca_plot.py. Featurization is handled inside the scripts; for linear_model.py it is selected through command-line options.

The following command will generate a ridge classifier with TF-IDF for text featurization.

python models/linear_model.py

Gathering Data

  • eparser.py

After generating your GYB (Got Your Back) directory, run eparser.py to parse the emails and store them in the unlabeled email collection of your local MongoDB instance.

Usage

python3 eparser/parser.py [path to folder with GYB emails]

Example

python3 eparser/parser.py ~/Documents/Berkeley/ML/Intuit/got-your-back-1.0/[email protected]/2016

Structure

email = {'From': "Email Sender",
        'Subject': "Email Subject",
        'Text': "Email Body Text",
        'To': "Email Recipient",
        '_id': "Datum Id"
        }
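
As a rough illustration, one raw message could be turned into this structure with Python's standard `email` package. The helper name `parse_email` and the way `_id` is supplied are assumptions made for this sketch, not the parser's actual code, and the MongoDB write is left out.

```python
from email import message_from_string
from email.policy import default


def parse_email(raw_message, datum_id):
    """Build a dict shaped like the stored record from one raw message.

    Hypothetical helper: the real parser may extract fields differently.
    """
    msg = message_from_string(raw_message, policy=default)
    # Prefer the plain-text part; fall back to an empty body if none exists.
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    return {
        "From": str(msg["From"] or ""),
        "Subject": str(msg["Subject"] or ""),
        "Text": text,
        "To": str(msg["To"] or ""),
        "_id": datum_id,
    }


raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Moving day\n"
    "\n"
    "We close on the house Friday.\n"
)
record = parse_email(raw, datum_id=1)
```

Missing headers become empty strings here so that every record has the same keys.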

Data

We split our data into 80% training and 20% testing. The data files are Python pickle files pickled with Python 2.7, so to ensure that the data loads properly, run the models with Python 2.7. When loaded, the data is in the form of a list of Python dictionaries.

Training Data: models/data/intuit_data
Testing Data: models/data/intuit_test_data

from pickle import load
with open('models/data/intuit_data', 'rb') as f:
    data = load(f)
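
The split itself is not described beyond the 80/20 ratio. A minimal sketch of one way to produce such a split from a list of email dictionaries follows; the function name `split_train_test`, the random shuffle, and the fixed seed are all assumptions for illustration.

```python
import random


def split_train_test(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of `records` and split it into (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy records standing in for the parsed emails.
emails = [{"_id": i, "Text": "email %d" % i} for i in range(10)]
train, test = split_train_test(emails)
```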

Labeling Data

  • labeller.py

A tool for rapidly labeling emails to build a labeled dataset for supervised learning.

Usage

python labeller.py

Featurizing Data

Example usage

from featurizer import featurize
data = featurize(list_of_texts, mode='tfidf')
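
For readers unfamiliar with TF-IDF, here is a small standard-library sketch of the idea. It is not the implementation behind `featurize`; the tokenization (lowercase, split on whitespace) and the unsmoothed `tf * log(N / df)` weighting are assumptions chosen for brevity.

```python
import math
from collections import Counter


def tfidf(texts):
    """Return one {term: weight} dict per text using tf * log(N / df)."""
    tokenized = [text.lower().split() for text in texts]
    n_docs = len(tokenized)
    # Document frequency: the number of texts that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


vectors = tfidf(["your flight is booked", "your baby shower invite"])
```

A term that appears in every text, such as "your" above, gets weight zero, while terms specific to one text keep a positive weight.
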
Word2Vec Similarity Models

Used as auxiliary features rather than as a standalone model. This model takes as input the words chosen to semantically represent each label and outputs a vector of similarity scores between an email and each label.

from word2vec_model import featurize
feature_vector = featurize(email)
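
The similarity vector described above can be sketched as follows. This is a hedged illustration only: the toy 2-d embeddings stand in for a trained Word2Vec model, averaging word vectors and scoring with cosine similarity are assumptions, and `similarity_features` is a hypothetical name rather than the function in `word2vec_model`.

```python
import math


def mean_vector(words, embeddings):
    """Average the embeddings of the words that have one; None if none do."""
    known = [embeddings[w] for w in words if w in embeddings]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]


def cosine(u, v):
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def similarity_features(email_text, label_words, embeddings):
    """One cosine score per label, in sorted label order."""
    email_vec = mean_vector(email_text.lower().split(), embeddings)
    scores = []
    for label in sorted(label_words):
        label_vec = mean_vector(label_words[label], embeddings)
        if email_vec is None or label_vec is None:
            scores.append(0.0)  # nothing to compare against
        else:
            scores.append(cosine(email_vec, label_vec))
    return scores


# Toy embeddings and label words (illustrative values only).
toy_embeddings = {"wedding": [1.0, 0.0], "bride": [0.9, 0.1],
                  "house": [0.0, 1.0], "mortgage": [0.1, 0.9]}
labels = {"marriage": ["wedding", "bride"], "moving": ["house", "mortgage"]}
features = similarity_features("the bride chose a wedding venue",
                               labels, toy_embeddings)
```

In this toy case the score for "marriage" is much higher than the score for "moving", which is the kind of signal these features are meant to add.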

Clustering

Principal Component Analysis
  • pca_plot.py

Used to investigate the underlying structure of our featurization. We would like to know how many clusters exist intrinsically and see if they align well with our given labels.

Currently we use PCA and look at the clusters formed by the top 2 principal components. The decomposition is run on the TF-IDF and BOW featurizations. Note: to display the plots for the remaining featurizations, close the previous plot window.

Usage

python models/pca_plot.py

K-Means
  • kmeans.py

Used to segment data into 2 clusters, event and non-event. Computes accuracy of TF-IDF and BOW featurizations.

python models/kmeans.py
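
A minimal sketch of the idea, assuming plain Lloyd's algorithm with two centers and an accuracy taken over the better of the two cluster-to-label mappings. How kmeans.py initializes clusters or scores them is not stated here, so the function names and both choices are assumptions; the points are toy data rather than real featurized emails.

```python
def two_means(points, iterations=20):
    """Cluster points into 2 groups with Lloyd's algorithm; returns 0/1 labels."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Deterministic start: the first point and the point farthest from it.
    centers = [points[0], max(points, key=lambda p: dist2(p, points[0]))]
    assignments = []
    for _ in range(iterations):
        assignments = [0 if dist2(p, centers[0]) <= dist2(p, centers[1]) else 1
                       for p in points]
        for k in (0, 1):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old center if a cluster empties
                centers[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments


def cluster_accuracy(assignments, labels):
    """Accuracy under the better of the two cluster-to-label mappings."""
    agree = sum(a == y for a, y in zip(assignments, labels)) / len(labels)
    return max(agree, 1.0 - agree)


points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels = [0, 0, 0, 1, 1, 1]  # 1 = event, 0 = non-event (toy data)
accuracy = cluster_accuracy(two_means(points), labels)
```

Because cluster ids are arbitrary, the accuracy is computed for both possible mappings and the larger value is kept.
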
  • kmeans_pca.py

Used to segment dimension-reduced data into 2 clusters, event and non-event. The PCA version allows the clusters to be plotted. Note: to display the plots for the remaining featurizations, close the previous plot window.

python models/kmeans_pca.py

Model Generation

Random Forests

Random forests are an effective model for limiting overfitting to the training data because they diversify the models in the ensemble. We use them to try to predict life events from the data.

To generate our scored random forest and confusion matrix evaluation, run:

python models/random_forest_confusion_matrix.py
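
For reference, a confusion matrix can be computed from true and predicted labels as in the sketch below. The label names and predictions are made up for illustration, and the script itself may build its matrix differently.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


# Toy predictions, not real model output.
y_true = ["event", "event", "none", "none", "none"]
y_pred = ["event", "none", "none", "none", "event"]
matrix = confusion_matrix(y_true, y_pred, labels=["event", "none"])
# Accuracy is the share of examples on the diagonal.
accuracy = sum(matrix[i][i] for i in range(len(matrix))) / len(y_true)
```
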
Linear Classification

We tried a few linear models for email classification: linear ridge classification and support vector classification. These models can be run with a specific featurization, such as bag of words (BOW) or TF-IDF.

python models/linear_model.py -m [svm/linear] -f [tfidf/bow]
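
A sketch of how these two options could be parsed with `argparse`. The defaults are inferred from the Quick Start, where running the script with no options yields a ridge classifier with TF-IDF; the attribute names `model` and `featurization`, the help text, and the reading of `linear` as the ridge classifier are assumptions, not the script's actual code.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Train a linear email classifier.")
    # Assumed mapping: "linear" selects ridge, "svm" selects support vector.
    parser.add_argument("-m", dest="model", choices=["svm", "linear"],
                        default="linear", help="model type")
    parser.add_argument("-f", dest="featurization", choices=["tfidf", "bow"],
                        default="tfidf", help="text featurization")
    return parser


args = build_parser().parse_args(["-m", "svm", "-f", "bow"])
defaults = build_parser().parse_args([])
```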
