
mini-project-for-decision-tree-'s Issues

Dataset

#!/usr/bin/python

"""
    starter code for exploring the Enron dataset (emails + finances)
    loads up the dataset (pickled dict of dicts)

    the dataset has the form:
    enron_data["LASTNAME FIRSTNAME MIDDLEINITIAL"] = { features_dict }

    {features_dict} is a dictionary of features associated with that person
    you should explore features_dict as part of the mini-project,
    but here's an example to get you started:

    enron_data["SKILLING JEFFREY K"]["bonus"] = 5600000
"""

import pickle

enron_data = pickle.load(open("../final_project/final_project_dataset.pkl", "r"))

In [6]:

enron_data['SKILLING JEFFREY K']['bonus']

Out[6]:

5600000

In [7]:

len(enron_data)

Out[7]:

146
In [9]:

len(enron_data['SKILLING JEFFREY K'])

Out[9]:

21

In [15]:

count = 0
for user in enron_data:
    if enron_data[user]['poi'] == True:
        count += 1
print count

18
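In Python 3 the same tally can be written as a one-line sum() over a generator expression; the sketch below runs against a hypothetical three-person dict rather than the real pickle, so the names and poi flags are illustrative only:

```python
# Toy stand-in for enron_data: name -> features_dict (illustrative values)
toy_data = {
    "SKILLING JEFFREY K": {"poi": True},
    "METTS MARK": {"poi": False},
    "FASTOW ANDREW S": {"poi": True},
}

# sum() over a generator expression counts the True poi flags
poi_count = sum(1 for person in toy_data if toy_data[person]["poi"])
print(poi_count)  # -> 2
```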

In [20]:

%load ../final_project/poi_email_addresses.py

In [21]:

%load ../final_project/poi_names.txt

In [ ]:

http://usatoday30.usatoday.com/money/industries/energy/2005-12-28-enron-participants_x.htm

(y) Lay, Kenneth
(y) Skilling, Jeffrey
(n) Howard, Kevin
(n) Krautz, Michael
(n) Yeager, Scott
(n) Hirko, Joseph
(n) Shelby, Rex
(n) Bermingham, David
(n) Darby, Giles
(n) Mulgrew, Gary
(n) Bayley, Daniel
(n) Brown, James
(n) Furst, Robert
(n) Fuhs, William
(n) Causey, Richard
(n) Calger, Christopher
(n) DeSpain, Timothy
(n) Hannon, Kevin
(n) Koenig, Mark
(y) Forney, John
(n) Rice, Kenneth
(n) Rieker, Paula
(n) Fastow, Lea
(n) Fastow, Andrew
(y) Delainey, David
(n) Glisan, Ben
(n) Richter, Jeffrey
(n) Lawyer, Larry
(n) Belden, Timothy
(n) Kopper, Michael
(n) Duncan, David
(n) Bowen, Raymond
(n) Colwell, Wesley
(n) Boyle, Dan
(n) Loehr, Christopher

In [18]:

def poiEmails():
    email_list = ["[email protected]", "[email protected]", "[email protected]",
                  "[email protected]", "[email protected]", "[email protected]",
                  "[email protected]", "[email protected]", "[email protected]",
                  "[email protected]", "[email protected]", "[email protected]",
                  "[email protected]", "[email protected]", "[email protected]",
                  "[email protected]", "[email protected]", "[email protected]",
                  "[email protected]" "[email protected]",  # missing comma in the original file: these two strings concatenate into one entry, so the list has 90 elements, not 91
                  "[email protected]", "[email protected]", "[email protected]",
                  "joe'.'[email protected]",
                  "[email protected]", "[email protected]", "[email protected]",
                  "[email protected]", "[email protected]", "[email protected]",
                  "[email protected]", "[email protected]", "[email protected]",
                  "[email protected]", "[email protected]", "[email protected]",
                  "[email protected]", "[email protected]", "[email protected]",
                  "[email protected]",
                  "kevin'.'[email protected]",
                  "[email protected]", "[email protected]", "[email protected]",
                  "[email protected]", "[email protected]",
                  "ken'.'[email protected]",
                  "[email protected]", "[email protected]", "[email protected]",
                  "[email protected]", "[email protected]", "[email protected]",
                  "[email protected]", "[email protected]", "[email protected]",
                  "[email protected]", "[email protected]", "[email protected]",
                  "'[email protected]", "[email protected]", "'david.delainey'@enron.com",
                  "[email protected]", "delainey'.'[email protected]",
                  "[email protected]", "[email protected]", "[email protected]",
                  "ben'.'[email protected]",
                  "[email protected]", "[email protected]", "[email protected]",
                  "lawyer'.'[email protected]",
                  "[email protected]", "[email protected]", "[email protected]",
                  "[email protected]", "[email protected]", "[email protected]",
                  "[email protected]", "[email protected]", "[email protected]",
                  "[email protected]", "[email protected]", "[email protected]",
                  "[email protected]", "[email protected]",
                  "'[email protected]",
                  "[email protected]", "[email protected]", "[email protected]",
                  "[email protected]"]
    return email_list

In [19]:

len(poiEmails())

Out[19]:

90

In [38]:

fo = open('../final_project/poi_names.txt','r')

In [39]:

fr = fo.readlines()

In [40]:

len(fr[2:])

Out[40]:

35

In [41]:

fo.close()

In [43]:

enron_data.keys()

Out[43]:

['METTS MARK', 'BAXTER JOHN C', 'ELLIOTT STEVEN', 'CORDES WILLIAM R', 'HANNON KEVIN P', 'MORDAUNT KRISTINA M', 'MEYER ROCKFORD G', 'MCMAHON JEFFREY', 'HORTON STANLEY C', 'PIPER GREGORY F', 'HUMPHREY GENE E', 'UMANOFF ADAM S', 'BLACHMAN JEREMY M', 'SUNDE MARTIN', 'GIBBS DANA R', 'LOWRY CHARLES P', 'COLWELL WESLEY', 'MULLER MARK S', 'JACKSON CHARLENE R', 'WESTFAHL RICHARD K', 'WALTERS GARETH W', 'WALLS JR ROBERT H', 'KITCHEN LOUISE', 'CHAN RONNIE', 'BELFER ROBERT', 'SHANKMAN JEFFREY A', 'WODRASKA JOHN', 'BERGSIEKER RICHARD P', 'URQUHART JOHN A', 'BIBI PHILIPPE A', 'RIEKER PAULA H', 'WHALEY DAVID A', 'BECK SALLY W', 'HAUG DAVID L', 'ECHOLS JOHN B', 'MENDELSOHN JOHN', 'HICKERSON GARY J', 'CLINE KENNETH W', 'LEWIS RICHARD', 'HAYES ROBERT E', 'MCCARTY DANNY J', 'KOPPER MICHAEL J', 'LEFF DANIEL P', 'LAVORATO JOHN J', 'BERBERIAN DAVID', 'DETMERING TIMOTHY J', 'WAKEHAM JOHN', 'POWERS WILLIAM', 'GOLD JOSEPH', 'BANNANTINE JAMES M', 'DUNCAN JOHN H', 'SHAPIRO RICHARD S', 'SHERRIFF JOHN R', 'SHELBY REX', 'LEMAISTRE CHARLES', 'DEFFNER JOSEPH M', 'KISHKILL JOSEPH G', 'WHALLEY LAWRENCE G', 'MCCONNELL MICHAEL S', 'PIRO JIM', 'DELAINEY DAVID W', 'SULLIVAN-SHAKLOVITZ COLLEEN', 'WROBEL BRUCE', 'LINDHOLM TOD A', 'MEYER JEROME J', 'LAY KENNETH L', 'BUTTS ROBERT H', 'OLSON CINDY K', 'MCDONALD REBECCA', 'CUMBERLAND MICHAEL S', 'GAHN ROBERT S', 'MCCLELLAN GEORGE', 'HERMANN ROBERT J', 'SCRIMSHAW MATTHEW', 'GATHMANN WILLIAM D', 'HAEDICKE MARK E', 'BOWEN JR RAYMOND M', 'GILLIS JOHN', 'FITZGERALD JAY L', 'MORAN MICHAEL P', 'REDMOND BRIAN L', 'BAZELIDES PHILIP J', 'BELDEN TIMOTHY N', 'DURAN WILLIAM D', 'THORN TERENCE H', 'FASTOW ANDREW S', 'FOY JOE', 'CALGER CHRISTOPHER F', 'RICE KENNETH D', 'KAMINSKI WINCENTY J', 'LOCKHART EUGENE E', 'COX DAVID', 'OVERDYKE JR JERE C', 'PEREIRA PAULO V. FERRAZ', 'STABLER FRANK', 'SKILLING JEFFREY K', 'BLAKE JR. NORMAN P', 'SHERRICK JEFFREY B', 'PRENTICE JAMES', 'GRAY RODNEY', 'PICKERING MARK R', 'THE TRAVEL AGENCY IN THE PARK', 'NOLES JAMES L', 'KEAN STEVEN J', 'TOTAL', 'FOWLER PEGGY', 'WASAFF GEORGE', 'WHITE JR THOMAS E', 'CHRISTODOULOU DIOMEDES', 'ALLEN PHILLIP K', 'SHARP VICTORIA T', 'JAEDICKE ROBERT', 'WINOKUR JR. HERBERT S', 'BROWN MICHAEL', 'BADUM JAMES P', 'HUGHES JAMES A', 'REYNOLDS LAWRENCE', 'DIMICHELE RICHARD G', 'BHATNAGAR SANJAY', 'CARTER REBECCA C', 'BUCHANAN HAROLD G', 'YEAP SOON', 'MURRAY JULIA H', 'GARLAND C KEVIN', 'DODSON KEITH', 'YEAGER F SCOTT', 'HIRKO JOSEPH', 'DIETRICH JANET R', 'DERRICK JR. JAMES V', 'FREVERT MARK A', 'PAI LOU L', 'BAY FRANKLIN R', 'HAYSLETT RODERICK J', 'FUGH JOHN L', 'FALLON JAMES B', 'KOENIG MARK E', 'SAVAGE FRANK', 'IZZO LAWRENCE L', 'TILNEY ELIZABETH A', 'MARTIN AMANDA K', 'BUY RICHARD B', 'GRAMM WENDY L', 'CAUSEY RICHARD A', 'TAYLOR MITCHELL S', 'DONAHUE JR JEFFREY M', 'GLISAN JR BEN F']

In [42]:

enron_data['SKILLING JEFFREY K'].keys()

Out[42]:

['salary', 'to_messages', 'deferral_payments', 'total_payments', 'exercised_stock_options', 'bonus', 'restricted_stock', 'shared_receipt_with_poi', 'restricted_stock_deferred', 'total_stock_value', 'expenses', 'loan_advances', 'from_messages', 'other', 'from_this_person_to_poi', 'poi', 'director_fees', 'deferred_income', 'long_term_incentive', 'email_address', 'from_poi_to_this_person']

In [44]:

enron_data['PRENTICE JAMES']['total_stock_value']

Out[44]:

1095040

In [45]:

enron_data['COLWELL WESLEY']['from_this_person_to_poi']

Out[45]:

11

In [46]:

enron_data['SKILLING JEFFREY K']['exercised_stock_options']

Out[46]:

19250000

In [48]:

sorted(enron_data.keys())

Out[48]:

['ALLEN PHILLIP K', 'BADUM JAMES P', 'BANNANTINE JAMES M', 'BAXTER JOHN C', 'BAY FRANKLIN R', 'BAZELIDES PHILIP J', 'BECK SALLY W', 'BELDEN TIMOTHY N', 'BELFER ROBERT', 'BERBERIAN DAVID', 'BERGSIEKER RICHARD P', 'BHATNAGAR SANJAY', 'BIBI PHILIPPE A', 'BLACHMAN JEREMY M', 'BLAKE JR. NORMAN P', 'BOWEN JR RAYMOND M', 'BROWN MICHAEL', 'BUCHANAN HAROLD G', 'BUTTS ROBERT H', 'BUY RICHARD B', 'CALGER CHRISTOPHER F', 'CARTER REBECCA C', 'CAUSEY RICHARD A', 'CHAN RONNIE', 'CHRISTODOULOU DIOMEDES', 'CLINE KENNETH W', 'COLWELL WESLEY', 'CORDES WILLIAM R', 'COX DAVID', 'CUMBERLAND MICHAEL S', 'DEFFNER JOSEPH M', 'DELAINEY DAVID W', 'DERRICK JR. JAMES V', 'DETMERING TIMOTHY J', 'DIETRICH JANET R', 'DIMICHELE RICHARD G', 'DODSON KEITH', 'DONAHUE JR JEFFREY M', 'DUNCAN JOHN H', 'DURAN WILLIAM D', 'ECHOLS JOHN B', 'ELLIOTT STEVEN', 'FALLON JAMES B', 'FASTOW ANDREW S', 'FITZGERALD JAY L', 'FOWLER PEGGY', 'FOY JOE', 'FREVERT MARK A', 'FUGH JOHN L', 'GAHN ROBERT S', 'GARLAND C KEVIN', 'GATHMANN WILLIAM D', 'GIBBS DANA R', 'GILLIS JOHN', 'GLISAN JR BEN F', 'GOLD JOSEPH', 'GRAMM WENDY L', 'GRAY RODNEY', 'HAEDICKE MARK E', 'HANNON KEVIN P', 'HAUG DAVID L', 'HAYES ROBERT E', 'HAYSLETT RODERICK J', 'HERMANN ROBERT J', 'HICKERSON GARY J', 'HIRKO JOSEPH', 'HORTON STANLEY C', 'HUGHES JAMES A', 'HUMPHREY GENE E', 'IZZO LAWRENCE L', 'JACKSON CHARLENE R', 'JAEDICKE ROBERT', 'KAMINSKI WINCENTY J', 'KEAN STEVEN J', 'KISHKILL JOSEPH G', 'KITCHEN LOUISE', 'KOENIG MARK E', 'KOPPER MICHAEL J', 'LAVORATO JOHN J', 'LAY KENNETH L', 'LEFF DANIEL P', 'LEMAISTRE CHARLES', 'LEWIS RICHARD', 'LINDHOLM TOD A', 'LOCKHART EUGENE E', 'LOWRY CHARLES P', 'MARTIN AMANDA K', 'MCCARTY DANNY J', 'MCCLELLAN GEORGE', 'MCCONNELL MICHAEL S', 'MCDONALD REBECCA', 'MCMAHON JEFFREY', 'MENDELSOHN JOHN', 'METTS MARK', 'MEYER JEROME J', 'MEYER ROCKFORD G', 'MORAN MICHAEL P', 'MORDAUNT KRISTINA M', 'MULLER MARK S', 'MURRAY JULIA H', 'NOLES JAMES L', 'OLSON CINDY K', 'OVERDYKE JR JERE C', 'PAI LOU L', 'PEREIRA PAULO V. FERRAZ', 'PICKERING MARK R', 'PIPER GREGORY F', 'PIRO JIM', 'POWERS WILLIAM', 'PRENTICE JAMES', 'REDMOND BRIAN L', 'REYNOLDS LAWRENCE', 'RICE KENNETH D', 'RIEKER PAULA H', 'SAVAGE FRANK', 'SCRIMSHAW MATTHEW', 'SHANKMAN JEFFREY A', 'SHAPIRO RICHARD S', 'SHARP VICTORIA T', 'SHELBY REX', 'SHERRICK JEFFREY B', 'SHERRIFF JOHN R', 'SKILLING JEFFREY K', 'STABLER FRANK', 'SULLIVAN-SHAKLOVITZ COLLEEN', 'SUNDE MARTIN', 'TAYLOR MITCHELL S', 'THE TRAVEL AGENCY IN THE PARK', 'THORN TERENCE H', 'TILNEY ELIZABETH A', 'TOTAL', 'UMANOFF ADAM S', 'URQUHART JOHN A', 'WAKEHAM JOHN', 'WALLS JR ROBERT H', 'WALTERS GARETH W', 'WASAFF GEORGE', 'WESTFAHL RICHARD K', 'WHALEY DAVID A', 'WHALLEY LAWRENCE G', 'WHITE JR THOMAS E', 'WINOKUR JR. HERBERT S', 'WODRASKA JOHN', 'WROBEL BRUCE', 'YEAGER F SCOTT', 'YEAP SOON']

In [49]:

enron_data['SKILLING JEFFREY K']['total_payments']

Out[49]:

8682716

In [50]:

enron_data['LAY KENNETH L']['total_payments']

Out[50]:

103559793

In [51]:

enron_data['FASTOW ANDREW S']['total_payments']

Out[51]:

2424083

In [57]:

enron_data['FASTOW ANDREW S']['deferral_payments']

Out[57]:

'NaN'

In [59]:

count_salary = 0
count_email = 0
for key in enron_data.keys():
    if enron_data[key]['salary'] != 'NaN':
        count_salary += 1
    if enron_data[key]['email_address'] != 'NaN':
        count_email += 1
print count_salary
print count_email

95
111

Dict to Array Conversion

A Python dictionary can't be read directly into an sklearn classification or regression algorithm; it first needs to become a NumPy array or a list of lists, where each element of the outer list is a data point and the elements of each inner list are that point's features.
We've written some helper functions (featureFormat() and targetFeatureSplit() in tools/feature_format.py) that take a list of feature names and the data dictionary and return a NumPy array.
When a feature has no value for a particular person, the function also replaces that feature value with 0 (zero).
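For intuition, a stripped-down sketch of what the helper does is below; the real featureFormat() lives in tools/feature_format.py, and the toy dict and the to_rows name here are made up for illustration:

```python
# Minimal re-creation of the featureFormat idea: pull named features out of
# a dict of dicts, converting the "NaN" string to 0.0 along the way
def to_rows(data_dict, feature_names):
    rows = []
    for person, features in data_dict.items():
        rows.append([0.0 if features[f] == "NaN" else float(features[f])
                     for f in feature_names])
    return rows

toy = {
    "PERSON A": {"salary": 1000, "bonus": "NaN"},
    "PERSON B": {"salary": "NaN", "bonus": 500},
}
print(to_rows(toy, ["salary", "bonus"]))  # -> [[1000.0, 0.0], [0.0, 500.0]]
```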

In [60]:

%load ../tools/feature_format.py

In [ ]:

#!/usr/bin/python

"""
    A general tool for converting data from the
    dictionary format to an (n x k) python list that's
    ready for training an sklearn algorithm

    n--no. of key-value pairs in dictionary
    k--no. of features being extracted

    dictionary keys are names of persons in dataset
    dictionary values are dictionaries, where each
        key-value pair in the dict is the name
        of a feature, and its value for that person

    In addition to converting a dictionary to a numpy array,
    you may want to separate the labels from the features--
    this is what targetFeatureSplit is for

    so, if you want to have the poi label as the target,
    and the features you want to use are the person's
    salary and bonus, here's what you would do:

    feature_list = ["poi", "salary", "bonus"]
    data_array = featureFormat( data_dictionary, feature_list )
    label, features = targetFeatureSplit(data_array)

    the line above (targetFeatureSplit) assumes that the
    label is the first item in feature_list--very important
    that poi is listed first!
"""

import numpy as np

def featureFormat( dictionary, features, remove_NaN=True, remove_all_zeroes=True, remove_any_zeroes=False ):
    """ convert dictionary to numpy array of features
        remove_NaN=True will convert "NaN" string to 0.0
        remove_all_zeroes=True will omit any data points for which
            all the features you seek are 0.0
        remove_any_zeroes=True will omit any data points for which
            any of the features you seek are 0.0
    """

    return_list = []

    for key in dictionary.keys():
        tmp_list = []
        append = False
        for feature in features:
            try:
                dictionary[key][feature]
            except KeyError:
                print "error: key ", feature, " not present"
                return
            value = dictionary[key][feature]
            if value == "NaN" and remove_NaN:
                value = 0
            tmp_list.append( float(value) )

        ### if all features are zero and you want to remove
        ### data points that are all zero, do that here
        if remove_all_zeroes:
            all_zeroes = True
            for item in tmp_list:
                if item != 0 and item != "NaN":
                    append = True

        ### if any features for a given data point are zero
        ### and you want to remove data points with any zeroes,
        ### handle that here
        if remove_any_zeroes:
            any_zeroes = False
            if 0 in tmp_list or "NaN" in tmp_list:
                append = False

        if append:
            return_list.append( np.array(tmp_list) )

    return np.array(return_list)


def targetFeatureSplit( data ):
    """ given a numpy array like the one returned from
        featureFormat, separate out the first feature
        and put it into its own list (this should be the
        quantity you want to predict)

        return targets and features as separate lists

        (sklearn can generally handle both lists and numpy arrays as
        input formats when training/predicting)
    """

    target = []
    features = []
    for item in data:
        target.append( item[0] )
        features.append( item[1:] )

    return target, features

In [62]:

count_NaN_tp = 0
for key in enron_data.keys():
    if enron_data[key]['total_payments'] == 'NaN':
        count_NaN_tp += 1
print count_NaN_tp
print float(count_NaN_tp)/len(enron_data.keys())

21
0.143835616438

In [65]:

count_NaN_tp = 0
for key in enron_data.keys():
    if enron_data[key]['total_payments'] == 'NaN' and enron_data[key]['poi'] == True:
        count_NaN_tp += 1
print count_NaN_tp
print float(count_NaN_tp)/len(enron_data.keys())

0
0.0

In [66]:

len(enron_data.keys())

Out[66]:

146

In [69]:

count = 0
for user in enron_data:
    if enron_data[user]['poi'] == True and enron_data[user]['total_payments'] == 'NaN':
        count += 1
print count

0

Finance regression

#!/usr/bin/python

"""
    starter code for the regression mini-project

    loads up/formats a modified version of the dataset
    (why modified? we've removed some trouble points
    that you'll find yourself in the outliers mini-project)

    draws a little scatterplot of the training/testing data

    you fill in the regression code where indicated
"""

import sys
import pickle
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit
dictionary = pickle.load( open("../final_project/final_project_dataset_modified.pkl", "r") )

### list the features you want to look at--first item in the
### list will be the "target" feature
features_list = ["bonus", "salary"]
data = featureFormat( dictionary, features_list, remove_any_zeroes=True)#, "long_term_incentive"], remove_any_zeroes=True )
target, features = targetFeatureSplit( data )

### training-testing split needed in regression, just like classification
from sklearn.cross_validation import train_test_split
feature_train, feature_test, target_train, target_test = train_test_split(features, target, test_size=0.5, random_state=42)
train_color = "b"
test_color = "b"

### your regression goes here!
### please name it reg, so that the plotting code below picks it up and
### plots it correctly

### draw the scatterplot, with color-coded training and testing points
import matplotlib.pyplot as plt
for feature, target in zip(feature_test, target_test):
    plt.scatter( feature, target, color=test_color )
for feature, target in zip(feature_train, target_train):
    plt.scatter( feature, target, color=train_color )

### labels for the legend
plt.scatter(feature_test[0], target_test[0], color=test_color, label="test")
plt.scatter(feature_test[0], target_test[0], color=train_color, label="train")

### draw the regression line, once it's coded
try:
    plt.plot( feature_test, reg.predict(feature_test) )
except NameError:
    pass
plt.xlabel(features_list[1])
plt.ylabel(features_list[0])
plt.legend()
plt.show()
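For intuition about the 50/50 split above, train_test_split(test_size=0.5, random_state=42) can be approximated with the standard library alone; this is a rough stdlib sketch with an invented helper name, not sklearn's actual algorithm:

```python
import random

def half_split(features, targets, seed=42):
    # Shuffle the indices reproducibly, then cut the data in half,
    # keeping each feature paired with its target
    idx = list(range(len(features)))
    random.Random(seed).shuffle(idx)
    cut = len(idx) // 2
    train, test = idx[:cut], idx[cut:]
    return ([features[i] for i in train], [features[i] for i in test],
            [targets[i] for i in train], [targets[i] for i in test])

f_train, f_test, t_train, t_test = half_split(list(range(10)), list(range(10, 20)))
print(len(f_train), len(f_test))  # -> 5 5
```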

Ages and net worths

%%writefile ages_net_worths.py

import numpy
import random

def ageNetWorthData():

    random.seed(42)
    numpy.random.seed(42)

    ages = []
    for ii in range(100):
        ages.append( random.randint(20,65) )
    net_worths = [ii * 6.25 + numpy.random.normal(scale=40.) for ii in ages]

    ### need to massage the lists into 2d numpy arrays to get them to work in LinearRegression
    ages       = numpy.reshape( numpy.array(ages), (len(ages), 1))
    net_worths = numpy.reshape( numpy.array(net_worths), (len(net_worths), 1))

    from sklearn.cross_validation import train_test_split
    ages_train, ages_test, net_worths_train, net_worths_test = train_test_split(ages, net_worths)

    return ages_train, ages_test, net_worths_train, net_worths_test

Source: http://napitupulu-jon.appspot.com/posts/regression-ud.html
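The same synthetic data can be generated with the standard library alone, with random.gauss standing in for numpy.random.normal (slope 6.25 and noise scale 40 as in the file above; the function name below is an invented stand-in):

```python
import random

def age_net_worth_data(n=100, seed=42):
    rng = random.Random(seed)
    ages = [rng.randint(20, 65) for _ in range(n)]
    # net worth rises linearly with age (slope 6.25) plus Gaussian noise
    net_worths = [age * 6.25 + rng.gauss(0.0, 40.0) for age in ages]
    return ages, net_worths

ages, net_worths = age_net_worth_data()
print(len(ages), len(net_worths))  # -> 100 100
```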

regressionQuiz.py

# %%writefile regressionQuiz.py

import numpy
import matplotlib.pyplot as plt

from ages_net_worths import ageNetWorthData

ages_train, ages_test, net_worths_train, net_worths_test = ageNetWorthData()

from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(ages_train, net_worths_train)

### get Katie's net worth (she's 27)
### sklearn predictions are returned in an array,
### so you'll want to do something like net_worth = predict([27])[0]
### (not exact syntax, the point is that [0] at the end)
km_net_worth = reg.predict([27])[0]  ### fill in the line of code to get the right value

### get the slope
### again, you'll get a 2-D array, so stick the [0][0] at the end
slope = reg.coef_  ### fill in the line of code to get the right value

### get the intercept
### here you get a 1-D array, so stick [0] on the end to access
### the info we want
intercept = reg.intercept_  ### fill in the line of code to get the right value

### get the score on test data
test_score = reg.score(ages_test, net_worths_test)  ### fill in the line of code to get the right value

### get the score on the training data
training_score = reg.score(ages_train, net_worths_train)  ### fill in the line of code to get the right value

def submitFit():
    return {"networth": km_net_worth,
            "slope": slope,
            "intercept": intercept,
            "stats on test": test_score,
            "stats on training": training_score}
Source: http://napitupulu-jon.appspot.com/posts/regression-ud.html
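What reg.coef_ and reg.intercept_ report is the ordinary least-squares fit; for a single feature the closed form is short enough to sketch in plain Python (an illustration of the math, not sklearn's implementation):

```python
def ols_fit(xs, ys):
    # slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# A noise-free line y = 6.25 * x is recovered exactly
slope, intercept = ols_fit([20, 30, 40, 50], [125.0, 187.5, 250.0, 312.5])
print(slope, intercept)  # -> 6.25 0.0
```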
