mattharrison / ml_pocket_reference

Resources for Machine Learning Pocket Reference

Jupyter Notebook 100.00%

ml_pocket_reference's Introduction

Welcome

Here you will find the source code for the book Machine Learning Pocket Reference

Code Examples

Every chapter has a notebook with the code from that chapter.

Thanks!

Thanks to readers for their support. If you enjoyed the book, please consider leaving a review on Amazon, or sharing it on social media.

Comments?

If you have comments or issues with the book, please consider filing an issue. The digital version may receive updates. Bigger updates could be addressed in future versions of the book.

Thanks again! Matt Harrison

ml_pocket_reference's Issues

ch10.ipynb Cell #6

----> 1 from yellowbrick.features.importances import (
2 FeatureImportances,
3 )
4 fig, ax = plt.subplots(figsize=(6, 4))
5 fi_viz = FeatureImportances(lr)

ModuleNotFoundError: No module named 'yellowbrick.features.importances'
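The ch14 traceback further down this page shows `FeatureImportances` living in `yellowbrick\model_selection\importances.py`, which suggests the class moved between yellowbrick releases. A minimal sketch of an import that tolerates both locations follows; `import_first` is a hypothetical helper written for this note, not part of yellowbrick or the book.

```python
import importlib


def import_first(attr_name, module_paths):
    """Return attr_name from the first module in module_paths that provides it."""
    for path in module_paths:
        try:
            module = importlib.import_module(path)
        except ImportError:
            continue  # module missing in this version; try the next candidate
        if hasattr(module, attr_name):
            return getattr(module, attr_name)
    return None


# Newer location first (seen in the ch14 traceback), then the one the notebook uses.
FeatureImportances = import_first(
    "FeatureImportances",
    ["yellowbrick.model_selection", "yellowbrick.features.importances"],
)
if FeatureImportances is None:
    print("yellowbrick is not installed or exposes neither location")
```

If the helper returns a class, the rest of the cell (`FeatureImportances(lr)` and so on) should work unchanged.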

MLPR- Titanic Dataset (URL)

This URL doesn't contain any dataset? I get "URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>" when running this. Please advise.

url = (
"http://biostat.mc.vanderbilt.edu/"
"wiki/pub/Main/DataSets/titanic3.xls"
)
df = pd.read_excel(url)
orig_df = df

New link: "http://biostat.mc.vanderbilt.edu/" ----> "http://biostat.app.vumc.org/"

Hi, Harrison
On page 85 of your great book you write: "For example, to convert the Titanic survival column to a blend of posterior probability of the target and the prior probability given the title (categorical) information, use the following code:"
But in Cell 28 you convert the Title column, in the line te = ce.TargetEncoder(cols="Title"). 1) Do you mean to convert the Title column? 2) In this sentence, does "the target" mean survival? 3) Does "prior probability" mean the probability of survival for each title in the training data? Thanks.
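To make the terms in the question concrete, here is a rough pure-Python sketch of what a target encoder typically computes. It replaces each category (here, a title) with a blend of the prior (the overall mean of the target, i.e. the overall survival rate) and the posterior (the mean of the target within that category). The sigmoid weighting and the parameter names `min_samples_leaf` and `smoothing` mirror options of `ce.TargetEncoder`, but the exact formula used by a given category_encoders version is an assumption here, and the data is made up.

```python
import math
from collections import defaultdict


def target_encode(categories, targets, min_samples_leaf=1, smoothing=1.0):
    """Return (prior, {category: encoded value}) for a binary or numeric target."""
    prior = sum(targets) / len(targets)  # overall mean of the target
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1

    encoding = {}
    for category, n in counts.items():
        posterior = sums[category] / n  # mean of the target within the category
        # More rows in a category -> weight closer to 1 -> trust the posterior more.
        weight = 1 / (1 + math.exp(-(n - min_samples_leaf) / smoothing))
        encoding[category] = prior * (1 - weight) + posterior * weight
    return prior, encoding


# Hypothetical toy data: a title column and the survival target.
titles = ["Mr", "Mr", "Mr", "Mrs", "Mrs", "Miss"]
survived = [0, 0, 1, 1, 1, 1]
prior, encoding = target_encode(titles, survived)
print(prior, encoding)
```

Under this reading, the column being converted is Title, the target is survival, and the prior is the survival rate over all training rows rather than a per-title rate.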

chap3 broken link for titanic3.xls

http://biostat.mc.vanderbilt.edu seems to be down, thus the link http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls does not work.

It might be good to take a version of the dataset hosted on a public library like Zenodo (or other).

I am wondering whether Kaggle holds a license on the dataset that prevents it from being made publicly available, or whether they have a license that prevents such practices.

I used: https://github.com/joanby/python-ml-course/raw/master/datasets/titanic/titanic3.xls instead. It seems that it is the same dataset.
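One way to make the notebook robust against a dead host is to try several sources in order. The sketch below is only an illustration: `load_first_available` is a hypothetical helper, and the second URL assumes the file kept the same path on the new host mentioned in another issue on this page.

```python
def load_first_available(urls, reader):
    """Try reader(url) for each URL; return (url, data) for the first success."""
    errors = []
    for url in urls:
        try:
            return url, reader(url)
        except Exception as exc:  # network and parsing errors differ by reader
            errors.append(f"{url}: {exc}")
    raise RuntimeError("no source could be loaded:\n" + "\n".join(errors))


TITANIC_URLS = [
    # Original host from the book (reported down in this issue).
    "http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls",
    # Assumption: same path on the new host.
    "http://biostat.app.vumc.org/wiki/pub/Main/DataSets/titanic3.xls",
    # Mirror suggested in this issue.
    "https://github.com/joanby/python-ml-course/raw/master/datasets/titanic/titanic3.xls",
]

# Usage (needs pandas, an .xls engine such as xlrd, and network access):
# import pandas as pd
# url, df = load_first_available(TITANIC_URLS, pd.read_excel)
```

Passing the reader in as an argument keeps the helper independent of pandas and easy to try out with a stub.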

ch12.ipynb Cell #16

----> 1 import scikitplot
2 fig, ax = plt.subplots(figsize=(6, 6))
3 y_probas = dt.predict_proba(X_test)
4 scikitplot.metrics.plot_cumulative_gain(
5 y_test, y_probas, ax=ax

ModuleNotFoundError: No module named 'scikitplot'

ch19.ipynb Cell #10

---> 1 X.isna().any()

NameError: name 'X' is not defined

ch13.ipynb Cell #10

3 rf5, X, X.columns, features
4 )
----> 5 fig, _ = pdp.pdp_interact_plot(p, features)

TypeError: clabel() got an unexpected keyword argument 'contour_label_fontsize'

C:\ProgramData\Anaconda3\envs\tf_SSJ_gpu\lib\site-packages\pdpbox\pdp.py in pdp_interact_plot(pdp_interact_out, feature_names, plot_type, x_quantile, plot_pdp, which_classes, figsize, ncols, plot_params)
773 fig.add_subplot(inter_ax)
774 _pdp_inter_one(pdp_interact_out=pdp_interact_plot_data[0], inter_ax=inter_ax, norm=None,
--> 775 feature_names=feature_names_adj, **inter_params)
776 else:
777 wspace = 0.3

C:\ProgramData\Anaconda3\envs\tf_SSJ_gpu\lib\site-packages\pdpbox\pdp_plot_utils.py in _pdp_inter_one(pdp_interact_out, feature_names, plot_type, inter_ax, x_quantile, plot_params, norm, ticks)
330 # for numeric not quantile
331 X, Y = np.meshgrid(pdp_interact_out.feature_grids[0], pdp_interact_out.feature_grids[1])
--> 332 im = _pdp_contour_plot(X=X, Y=Y, **inter_params)
333 elif plot_type == 'grid':
334 im = _pdp_inter_grid(**inter_params)

C:\ProgramData\Anaconda3\envs\tf_SSJ_gpu\lib\site-packages\pdpbox\pdp_plot_utils.py in _pdp_contour_plot(X, Y, pdp_mx, inter_ax, cmap, norm, inter_fill_alpha, fontsize, plot_params)
249 c1 = inter_ax.contourf(X, Y, pdp_mx, N=level, origin='lower', cmap=cmap, norm=norm, alpha=inter_fill_alpha)
250 c2 = inter_ax.contour(c1, levels=c1.levels, colors=contour_color, origin='lower')
--> 251 inter_ax.clabel(c2, contour_label_fontsize=fontsize, inline=1)
252 inter_ax.set_aspect('auto')
253

C:\ProgramData\Anaconda3\envs\tf_SSJ_gpu\lib\site-packages\matplotlib\axes\_axes.py in clabel(self, CS, *args, **kwargs)
6338
6339 def clabel(self, CS, *args, **kwargs):
-> 6340 return CS.clabel(*args, **kwargs)
6341 clabel.__doc__ = mcontour.ContourSet.clabel.__doc__
6342

About Instance parameters, Chapter 10

p.108

multi_class='ovr' Use one versus rest for each class, or for 'multinomial', train one class.
What does "train one class" mean? Thanks.
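As a rough illustration of the two settings (not the book's wording): in scikit-learn's LogisticRegression, 'ovr' fits one binary classifier per class, while 'multinomial' fits a single model whose class probabilities are tied together by a softmax. The sketch below only shows how per-class scores would be turned into probabilities under each scheme; the scores are hypothetical.

```python
import math


def softmax(scores):
    """Multinomial: one joint model, probabilities compete through a softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def ovr_probabilities(scores):
    """One-vs-rest: an independent sigmoid per class, then renormalized."""
    sigmoids = [1 / (1 + math.exp(-s)) for s in scores]
    total = sum(sigmoids)
    return [p / total for p in sigmoids]


scores = [2.0, 0.5, -1.0]  # hypothetical per-class decision scores
multinomial_probs = softmax(scores)
ovr_probs = ovr_probabilities(scores)
print(multinomial_probs)
print(ovr_probs)
```

Both outputs sum to 1 and rank the classes the same way here, but the multinomial probabilities are more peaked because the classes are modeled jointly.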

ch08.ipynb Cell #4

TypeError: _generate_unsampled_indices() missing 1 required positional argument: 'n_samples_bootstrap'

1 import rfpimp
2 rfpimp.plot_dependence_heatmap(
----> 3 rfpimp.feature_dependence_matrix(X_train),
4 value_fontsize=12,
5 label_fontsize=14,
C:\ProgramData\Anaconda3\envs\tf_SSJ_gpu\lib\site-packages\rfpimp.py in feature_dependence_matrix(X_train, rfmodel, zero, sort_by_dependence, n_samples)
712 rf = clone(rfmodel)
713 rf.fit(X,y)
--> 714 imp = permutation_importances_raw(rf, X, y, oob_regression_r2_score, n_samples)
715 """
716 Some importances could come back > 1.0 because removing that feature sends R^2

C:\ProgramData\Anaconda3\envs\tf_SSJ_gpu\lib\site-packages\rfpimp.py in permutation_importances_raw(rf, X_train, y_train, metric, n_samples)
398 rf.fit(X_sample, y_sample)
399
--> 400 baseline = metric(rf, X_sample, y_sample)
401 X_train = X_sample.copy(deep=False) # shallow copy
402 y_train = y_sample

C:\ProgramData\Anaconda3\envs\tf_SSJ_gpu\lib\site-packages\rfpimp.py in oob_regression_r2_score(rf, X_train, y_train)
453 n_predictions = np.zeros(n_samples)
454 for tree in rf.estimators_:
--> 455 unsampled_indices = _generate_unsampled_indices(tree.random_state, n_samples)
456 tree_preds = tree.predict(X[unsampled_indices, :])
457 predictions[unsampled_indices] += tree_preds
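The message indicates that the private scikit-learn helper now expects a third argument, `n_samples_bootstrap`, while the installed rfpimp still calls it with two. Installing a newer rfpimp, or an older scikit-learn, may be the simplest route. For readers who want to see the mismatch, the sketch below shows a signature-aware call; `call_unsampled_indices` and both stand-in helpers are hypothetical and only mimic the two signatures.

```python
import inspect


def call_unsampled_indices(func, random_state, n_samples):
    """Call func with or without n_samples_bootstrap, depending on its signature."""
    params = inspect.signature(func).parameters
    if "n_samples_bootstrap" in params:
        # Assumption: the bootstrap draws as many rows as the training set has.
        return func(random_state, n_samples, n_samples)
    return func(random_state, n_samples)


# Hypothetical stand-ins for the old and new private scikit-learn helper.
def old_helper(random_state, n_samples):
    return ("old", random_state, n_samples)


def new_helper(random_state, n_samples, n_samples_bootstrap):
    return ("new", random_state, n_samples, n_samples_bootstrap)


print(call_unsampled_indices(old_helper, 0, 100))
print(call_unsampled_indices(new_helper, 0, 100))
```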

ch14.ipynb Cell #19

KeyError: 'weight'
2 fi_viz = FeatureImportances(xgr)
3 fi_viz.fit(bos_X_train, bos_y_train)
----> 4 fi_viz.poof()
5 #fig.savefig("images/mlpr_1406.png", dpi=300)

C:\ProgramData\Anaconda3\envs\tf_SSJ_gpu\lib\site-packages\yellowbrick\base.py in poof(self, *args, **kwargs)
259 "this method is deprecated, please use show() instead", DeprecationWarning
260 )
--> 261 return self.show(*args, **kwargs)
262
263 ## ////////////////////////////////////////////////////////////////////

C:\ProgramData\Anaconda3\envs\tf_SSJ_gpu\lib\site-packages\yellowbrick\base.py in show(self, outpath, clear_figure, **kwargs)
239
240 # Finalize the figure
--> 241 self.finalize()
242
243 if outpath is not None:

C:\ProgramData\Anaconda3\envs\tf_SSJ_gpu\lib\site-packages\yellowbrick\model_selection\importances.py in finalize(self, **kwargs)
283
284 # Set the xlabel
--> 285 self.ax.set_xlabel(self._get_xlabel())
286
287 # Remove the ygrid

C:\ProgramData\Anaconda3\envs\tf_SSJ_gpu\lib\site-packages\yellowbrick\model_selection\importances.py in _get_xlabel(self)
332
333 # Label for coefficients
--> 334 if hasattr(self.estimator, "coef_"):
335 if self.relative:
336 return "relative coefficient magnitude"

C:\ProgramData\Anaconda3\envs\tf_SSJ_gpu\lib\site-packages\xgboost\sklearn.py in coef_(self)
714 .format(self.booster))
715 b = self.get_booster()
--> 716 coef = np.array(json.loads(b.get_dump(dump_format='json')[0])['weight'])
717 # Logic for multiclass classification
718 n_classes = getattr(self, 'n_classes_', None)
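The traceback ends inside a `hasattr` check, which can look surprising. In Python 3, `hasattr` only swallows `AttributeError`, so a property that fails with a different exception (here a `KeyError` from the xgboost `coef_` property) propagates to the caller. A small self-contained sketch with a hypothetical class reproduces the behavior; `safe_hasattr` is an illustrative helper, not part of yellowbrick or xgboost.

```python
class TreeBoosterLike:
    """Hypothetical estimator whose coef_ property fails with a KeyError."""

    @property
    def coef_(self):
        dump = {"nodeid": 0}  # stand-in for a tree dump that has no 'weight' key
        return dump["weight"]


def safe_hasattr(obj, name):
    """Like hasattr, but treats any exception from the attribute as 'missing'."""
    try:
        getattr(obj, name)
    except Exception:
        return False
    return True


model = TreeBoosterLike()
try:
    hasattr(model, "coef_")
    outcome = "hasattr returned"
except KeyError as exc:
    outcome = f"hasattr raised KeyError: {exc}"

print(outcome)
print(safe_hasattr(model, "coef_"))
```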

Trouble installing packages

Can anyone provide a requirements.txt or a Pipfile?

It's very difficult to run some of the notebooks without running into library compatibility problems.
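Until a pinned requirements file exists, a quick check of which libraries are missing can save some trial and error. The sketch below is an assumption-laden starting point: the module list is collected from the tracebacks on this page rather than from the book's official requirements, and it does nothing about version compatibility.

```python
import importlib.util

# Import name -> pip package name (collected from the issues on this page).
BOOK_MODULES = {
    "pandas": "pandas",
    "sklearn": "scikit-learn",
    "xgboost": "xgboost",
    "yellowbrick": "yellowbrick",
    "scikitplot": "scikit-plot",
    "pdpbox": "pdpbox",
    "rfpimp": "rfpimp",
    "category_encoders": "category_encoders",
    "pandas_profiling": "pandas-profiling",
    "fastai": "fastai",
}


def missing_packages(modules):
    """Return pip names for top-level modules that cannot be found."""
    return [
        pip_name
        for import_name, pip_name in modules.items()
        if importlib.util.find_spec(import_name) is None
    ]


missing = missing_packages(BOOK_MODULES)
if missing:
    print("pip install " + " ".join(missing))
else:
    print("all listed modules are importable")
```

Note that the import name and the pip name differ for some packages, for example `scikitplot` is installed as `scikit-plot`.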

ch07.ipynb Cell #10

ModuleNotFoundError: No module named 'fastai.structured'
1 X3 = X2.copy()
----> 2 from fastai.structured import scale_vars
3 scale_vars(X3, mapper=None)
4 X3.std()
5 X3.mean()

ch03 error with pandas_profiling

Run on Google Colab:

import pandas_profiling
pandas_profiling.ProfileReport(df)

TypeError Traceback (most recent call last)

in ()
1 import pandas_profiling
----> 2 pandas_profiling.ProfileReport(df)

1 frames

/usr/local/lib/python3.7/dist-packages/pandas_profiling/describe.py in describe(df, bins, check_correlation, correlation_threshold, correlation_overrides, check_recoded, pool_size, **kwargs)
390 if name not in names:
391 names.append(name)
--> 392 variable_stats = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1)
393 variable_stats.columns.names = df.columns.names

data leakage issue

In the notebook for chapter 14, cell 12 scales the variables with the standard scaler. Because the scaler is applied to the whole feature set, there is a possibility of data leakage once the data is split into train and test sets.
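The usual way to avoid this kind of leakage is to learn the scaling statistics from the training split only and reuse them on the test split. The pure-Python sketch below illustrates the idea with made-up numbers; in scikit-learn the same pattern is `scaler.fit(X_train)` followed by `scaler.transform(...)` on each split. The helper names are hypothetical.

```python
import statistics


def fit_scaler(values):
    """Learn mean and population standard deviation from training values only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # avoid dividing by zero
    return mean, std


def transform(values, mean, std):
    """Standardize values with previously learned statistics."""
    return [(v - mean) / std for v in values]


train = [1.0, 2.0, 3.0, 4.0]  # hypothetical feature values
test = [10.0, 12.0]

mean, std = fit_scaler(train)  # statistics come from the training split alone
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)  # test reuses the training statistics
print(mean, std, test_scaled)
```

Fitting on train and test together would pull the mean toward the test values, which is exactly the information the model should not see.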