Comments (13)
@rasbt, I pinged you on here so you can see how I respond to each point as I work on it. Thank you again for your feedback!
from data-analysis-and-machine-learning-projects.
Regarding the images: I pulled them from another repo that was Public Domain. However, looking at the original sources, it seems that they are not attribution free. I will have to fix that.
https://commons.wikimedia.org/wiki/File:Petal-sepal.jpg
http://www.signa.org/index.pl?Display+Iris-setosa+2
http://www.signa.org/index.pl?Display+Iris-virginica+3
Oh, I see that I was a little bit sloppy last night ... Seems like the sentence "On a side-note, but you probably already know this: Most gradient-based optimization algos" got cut off. What I wanted to say is that even if the features are on the same scale (e.g., cm), you still want to standardize them prior to, e.g., gradient descent; it makes the learning easier because the weight updates are more balanced. Going into this would be way too much detail for the tutorial, but I would at least mention that people should check their features prior to using ML algos other than tree-based ones.
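The standardization point could be sketched in a few lines (a minimal, hypothetical example using scikit-learn's StandardScaler; the feature values below are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features, both in cm but on very different ranges.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Standardize each column to zero mean and unit variance so that
# gradient-based learners update all weights at a similar pace.
X_std = StandardScaler().fit_transform(X)

print(X_std.mean(axis=0))  # ~[0. 0.]
print(X_std.std(axis=0))   # ~[1. 1.]
```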
"When you plot the cross-val error, I could also print the standard deviation" I meant "would", not "could" :P
But estimating the variance is actually not that trivial; FYI, see these papers:
- T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural computation, 10(7):1895–1923, 1998.
- Y. Bengio and Y. Grandvalet. No unbiased estimator of the variance of k-fold cross-validation. The Journal of Machine Learning Research, 5:1089–1105, 2004.
When you plot the cross-val error, you could also print the standard deviation
Isn't it better to plot the distribution? I showed the mean in the first couple of examples; perhaps I'll just replace those with a distplot.
"It's obviously a problem that our model performs quite differently depending on the data it's trained on." Maybe it would be too much for this intro, but you could mention high variance (overfitting) and high bias (underfitting). I suspect the high variance here comes from the fact that you are only using 10 trees; in RF you typically use hundreds or thousands of trees, since it is a special case of bagging with unpruned decision trees after all. Also, Iris may not be the best example for RF, since it is a very simple dataset without many features (the random sampling of features is, e.g., the advantage of RF over regular bagging). In general, maybe consider starting this section with an unpruned decision tree instead of random forests, and in the end conclude with random forests and explain why they are typically better (with respect to overfitting). Nice side effect: you can visualize the decision tree with GraphViz. If you decide to stick with RF, consider tuning the n_estimators parameter instead.
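For reference, that n_estimators tuning could look something like this (a sketch on the Iris data via scikit-learn's GridSearchCV; the grid values are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

iris = load_iris()

# More trees generally means lower variance (at higher compute cost);
# the grid below is just an illustration.
param_grid = {'n_estimators': [10, 50, 100, 200]}
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=10)
grid.fit(iris.data, iris.target)

print(grid.best_params_)
```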
I agree that that's a bit more detail than I'd like to go into for this tutorial; I'll leave it to your book to explain that. :-)
Maybe mention that random forests are scale-invariant -- e.g., you could note that a typical step in the data preprocessing pipeline (required by most ML algos) is to scale the features, but that it isn't needed here because you are using decision trees (I believe this is the only scale-invariant algo that is used in ML). Maybe also explain what a decision tree is and how it relates to random forests in a few sentences. On a side-note, but you probably already know this: Most gradient-based optimization algos
This ties in nicely with #7. I'll add a note to that issue and check this one off.
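As an aside, the scale-invariance claim is easy to sanity-check (a sketch on Iris; the assumption is that standardizing the features leaves a decision tree's cross-validation scores essentially unchanged):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(random_state=42)

# Trees split on per-feature thresholds, so monotonically rescaling
# the features should not change which splits are chosen.
raw = cross_val_score(tree, iris.data, iris.target, cv=10)
scaled = cross_val_score(tree,
                         StandardScaler().fit_transform(iris.data),
                         iris.target, cv=10)

print(raw.mean(), scaled.mean())
```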
Isn't it better to plot the distribution? I showed the mean in the first couple of examples; perhaps I'll just replace those with a distplot.
Yes, that's probably even better in this context. I suggested the stddev because you currently have
np.mean(cross_val_score(random_forest_classifier, all_inputs, all_classes, cv=10))
0.95999999999999996
followed by the sentence
Now we have a much more consistent rating of our classifier's general classification accuracy.
The info is basically already contained in the plot, but this would maybe be a nice summary statistic. And it is useful in practice too, when you are tuning parameters via k-fold CV or in nested CV using grid search, e.g., as some sort of tie-breaker.
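For concreteness, the mean-plus-stddev summary could be printed like this (a sketch reusing the notebook's 10-fold setup on Iris; n_estimators=100 is an assumption, not the notebook's value):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

iris = load_iris()
clf = RandomForestClassifier(n_estimators=100, random_state=42)

scores = cross_val_score(clf, iris.data, iris.target, cv=10)

# Mean accuracy plus the spread across the 10 folds.
print('CV accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))
```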
I agree that that's a bit more detail than I'd like to go into for this tutorial; I'll leave it to your book to explain that. :-)
Sure, but I think it would maybe be more worthwhile for the reader to use a basic decision tree instead of the random forest ... the hyperparameter tuning (tree depth) would be more intuitive, I guess. You could print an unpruned tree with good training accuracy but bad generalization performance, and then show how you can address this with pruning (max_depth). But this is just a thought :)
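The max_depth idea could be sketched with a small grid search (the depth values below are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# max_depth=None grows an unpruned tree that can memorize the
# training data; small depths act as a simple form of pre-pruning.
param_grid = {'max_depth': [1, 2, 3, 4, 5, None]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid, cv=10)
grid.fit(iris.data, iris.target)

print(grid.best_params_, round(grid.best_score_, 3))
```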
Alright, check it out now. It starts with a decision tree classifier then builds up to a random forest.
I think this last commit addresses the rest of your points. Please let me know if I missed anything. :-)
Wow, you seem really determined to turn this IPython notebook into an IPython book :)
Haha, if you are not busy enough, I have another batch for you!
- Maybe use a table of contents so that people see in the beginning what to expect; it also helps to navigate through the document, I think:

# Table of Contents
- [Your Markdown Section Header](#Your-Markdown-Section-Header) ...

And then, you could place a little "arrow" or so under each section header to jump back to the overview:

[go back](#Table-of-Contents)
- Maybe mention in a few words that stratified k-fold keeps the class proportions per fold, in contrast to regular k-fold.
- Hm, unfortunately the GraphViz part is not working (rendering) yet; maybe try PNG instead of PDF?
- In general, maybe put a GraphViz plot directly after your first tree so that people know what a decision tree looks like, and maybe a second one after the hyperparameter tuning so that they can see how the model changed?
- [x] "around that limitation by creating a whole bunch of shallow decision trees (hence 'forest')" -- sorry, that's technically not correct: you use shallow trees (aka decision stumps) in boosting, not in bagging & random forests. I would maybe introduce it as (of course with nicer wording):
If we have a decision tree that goes too deep, we saw that it can overinterpret the training data, so that it does not perform well on new, unseen data (e.g., test data). (Decision trees are nonparametric models, where the number of model params depends on the training set.) It is important that we find the optimal tree depth during grid search. A powerful method to overcome this challenge is to build an ensemble of experts -- a large number of deep decision trees -- and combine their votes. This is actually how random forests work: we create many unpruned trees based on different subsets of the training data (note that they are bootstrapped, i.e., drawn with replacement) and different feature combinations, and let the majority vote decide.
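That description maps directly onto scikit-learn's RandomForestClassifier (a minimal sketch; the parameter values are illustrative, not the notebook's):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

# bootstrap=True: each tree is fit on a bootstrap sample of the
# training rows (drawn with replacement); max_features='sqrt'
# limits the features considered at each split (drawn without
# replacement). Predictions are made by majority vote.
forest = RandomForestClassifier(n_estimators=100, bootstrap=True,
                                max_features='sqrt', random_state=42)
forest.fit(iris.data, iris.target)

print(forest.score(iris.data, iris.target))
```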
Haha... oh dear, what have I gotten myself into? ;-)
Good suggestions - I addressed a couple with some quick fixes and will leave the rest for the weekend.
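On the GraphViz rendering point from the list above, one workaround is to export the tree in dot format and render it to PNG rather than PDF (a sketch; the filenames are made up, and rendering assumes the GraphViz `dot` binary is installed):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(iris.data, iris.target)

# Write the tree structure in GraphViz dot format; it can then be
# rendered as a PNG (which notebook viewers display inline) with:
#   dot -Tpng iris_tree.dot -o iris_tree.png
export_graphviz(tree, out_file='iris_tree.dot',
                feature_names=iris.feature_names)
```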
With great resources (for the next gen data scientists) come great responsibilities! :D
Alrighty, finally got around to most of these! Thanks again for the feedback.
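On the stratified k-fold item from the earlier list, the class-proportion behavior is easy to demonstrate (a sketch on Iris, which has 50 samples per class):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

iris = load_iris()

# Each test fold keeps the 1:1:1 class ratio of the full dataset;
# plain KFold would just slice the (class-sorted) rows in order.
skf = StratifiedKFold(n_splits=5)
for _, test_idx in skf.split(iris.data, iris.target):
    print(np.bincount(iris.target[test_idx]))
```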
Wow, looks awesome, and no prob, you are always welcome! Ah, one unfortunate caveat with how the GitHub IPython Nb rendering is implemented is that it doesn't support jumping between sections via internal links (yet) -- but the TOC is still useful anyway :). Haha, I may call you Random F. Olson from now on, but there is maybe one little phrase you can add to make it technically unambiguous: instead of "-- each trained on a random subset of the features", sth. like "-- each trained on a random subset of training samples (drawn with replacement) and features (drawn without replacement)". Otherwise people may think that they'd use the "original" training set for each decision tree in the forest.