Machine Learning Algorithms from the Udacity Intro to Machine Learning Nanodegree
The Enron fraud is a big, messy and totally fascinating story about corporate malfeasance of nearly every imaginable type. The Enron email and financial datasets are also big, messy treasure troves of information, which become much more useful once you know your way around them a bit. Find out about the Enron data set in the explore_enron_data
Jupyter Notebook.
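A minimal sketch of the structure of that dataset: a Python dict keyed by person name, with a dict of features per person. The feature names follow the course dataset; the numbers below are illustrative, not the real figures.

```python
# Sketch of the Enron data structure used throughout these notebooks.
# Feature names (salary, bonus, poi) follow the course dataset;
# the values here are made up for illustration.
enron_data = {
    "SKILLING JEFFREY K": {"salary": 1_100_000, "bonus": 5_600_000, "poi": True},
    "LAY KENNETH L": {"salary": 1_070_000, "bonus": 7_000_000, "poi": True},
    "METTS MARK": {"salary": 365_000, "bonus": 600_000, "poi": False},
}

n_people = len(enron_data)
n_pois = sum(1 for person in enron_data.values() if person["poi"])
print(n_people, "people,", n_pois, "POIs")
```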
- The largest case of corporate fraud in American history: the Enron Corpus with real emails
- Use it to look for patterns in the emails of people who were persons of interest (POIs) in the fraud case, and see if you can identify those patterns
- Use regression to understand the relationship between Enron employees' salaries and their bonuses
- Apply clustering (a type of unsupervised learning) to the data: who within the organization was a member of the board of directors and who was just a regular employee
- Example: Netflix uses clustering to identify particular types of people by their movie choices (clusters of users)
- Detect and remove outliers: certain lines in the dataset that are essentially bugs and have to be cleaned out manually
- A person of interest (POI) is anyone who:
  - Was indicted
  - Settled without admitting guilt
  - Testified in exchange for immunity
Model continuous data with linear regression and use it to predict financial data for Enron employees and associates in the regressions_enron_data
Jupyter Notebook.
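The salary-to-bonus regression can be sketched with scikit-learn's LinearRegression. The numbers below are synthetic, not real Enron figures.

```python
# Fit bonus as a linear function of salary (synthetic salary/bonus pairs).
import numpy as np
from sklearn.linear_model import LinearRegression

salaries = np.array([[250_000], [400_000], [600_000], [1_000_000]])
bonuses = np.array([300_000, 500_000, 800_000, 1_400_000])

reg = LinearRegression().fit(salaries, bonuses)
print("slope:", reg.coef_[0])
print("r^2:", reg.score(salaries, bonuses))
```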
Outlier detection and removal in the enron_outliers
Jupyter Notebook.
- Fit a regression and take the 10% of points with the largest residuals relative to it
- Remove them
- Re-train
- Get acquainted with some of the outliers in the Enron finance data
- Learn if and how to remove them
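The fit / remove / re-train loop above might look like this, on synthetic data with a few planted outliers:

```python
# Residual-based outlier cleaning: fit, drop the 10% of points with the
# largest residuals, then refit. Data are synthetic, not Enron finance data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + rng.normal(0, 1, 100)
y[:5] += 50  # plant a few gross outliers

reg = LinearRegression().fit(X, y)
residuals = np.abs(y - reg.predict(X))
keep = residuals.argsort()[: int(0.9 * len(y))]  # indices of the best 90%
reg_clean = LinearRegression().fit(X[keep], y[keep])  # re-train without them
print("slope before:", reg.coef_[0], "after:", reg_clean.coef_[0])
```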
Learn what unsupervised learning is and find out how to use scikit-learn's k-means algorithm in the enron_clustering
Jupyter Notebook.
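A minimal k-means sketch with scikit-learn, on two obvious synthetic blobs standing in for the financial features:

```python
# k-means with two clusters on clearly separated synthetic points.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],      # low-value group
              [10, 2], [10, 4], [10, 0]])  # high-value group
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(km.labels_)
```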
Apply MinMaxScaler
to the salary
and exercised_stock_options
features from the Enron dataset in the previous enron_clustering
Jupyter Notebook to make better predictions about POIs.
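A sketch of what MinMaxScaler does to two features with very different ranges (the numbers are illustrative, not real Enron values):

```python
# Rescale two features with very different ranges to [0, 1].
import numpy as np
from sklearn.preprocessing import MinMaxScaler

features = np.array([
    [200_000.0, 1_000_000.0],    # [salary, exercised_stock_options]
    [600_000.0, 5_000_000.0],
    [1_000_000.0, 34_000_000.0],
])
scaled = MinMaxScaler().fit_transform(features)
print(scaled)
```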
Find out how to use text data in your machine learning algorithm. Use sklearn TfidfVectorizer
to convert a collection of raw documents to a matrix of TF-IDF features. Check it out in the enron_text_learning
Jupyter Notebook.
Learn when and why to use feature selection, using a sklearn classifier's feature_importances_
attribute to find outliers in text data in the enron_feature_selection
Jupyter Notebook.
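A sketch of reading feature_importances_ from a decision tree. A single feature with suspiciously dominant importance is the kind of signal used to spot an outlier word; the data here are synthetic.

```python
# Only feature 0 actually determines the label, so the tree should
# assign it nearly all of the importance.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = (X[:, 0] > 0.5).astype(int)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.feature_importances_)
```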
Learn about data dimensionality and reducing the number of dimensions with principal component analysis (PCA) in the eigenfaces
Jupyter Notebook, an example that follows Faces recognition using eigenfaces and SVMs.
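A minimal PCA sketch on strongly correlated synthetic 2-D data, where almost all of the variance falls on the first principal component:

```python
# Two correlated features collapse onto essentially one principal component.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.1 * rng.normal(size=200)])

pca = PCA(n_components=2).fit(X)
print("explained variance ratio:", pca.explained_variance_ratio_)
```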
Learn more about testing, training, cross-validation, and parameter grid searches in the enron_validation
Jupyter Notebook.
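A sketch combining a train/test split with a cross-validated grid search, using the iris data as a stand-in for the Enron features:

```python
# Hold out a test set, grid-search an SVM's parameters with 5-fold CV,
# then score the best model on the held-out data.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print("test accuracy:", grid.score(X_test, y_test))
```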