
Comments (13)

rhiever commented on May 27, 2024

@rasbt, I pinged you on here so you can see how I respond to each point as I work on it. Thank you again for your feedback!


rhiever commented on May 27, 2024

Regarding the images: I pulled them from another repo that was public domain. However, looking at the original sources, it seems they are not attribution-free. I will have to fix that.

https://commons.wikimedia.org/wiki/File:Petal-sepal.jpg

http://www.signa.org/index.pl?Display+Iris-setosa+2

http://www.signa.org/index.pl?Display+Iris-virginica+3


rasbt commented on May 27, 2024

Oh, I see that I was a little bit sloppy last night ... it seems the sentence "On a side note, but you probably already know this: Most gradient-based optimization algos" got cut off. What I wanted to say is: even if the features are on the same scale (e.g., cm), you still want to standardize them prior to, e.g., gradient descent; it makes the learning easier because you get more balanced weight updates. Going into this would be way too much detail for the tutorial, but I would at least mention that people should check their features before using ML algos other than tree-based ones.
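
(For concreteness, a minimal sketch of the standardization step being described, using scikit-learn's StandardScaler on made-up toy data:)

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X_train = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 3.4]])  # toy features, all in cm
    scaler = StandardScaler().fit(X_train)   # learn per-feature mean and std from training data only
    X_train_std = scaler.transform(X_train)  # zero mean, unit variance per feature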

"When you plot the cross-val error, I could also print the standard deviation" I meant "would", not "could" :P

But estimating the variance is actually not that trivial, FYI; have a look at these papers:

  • T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural computation, 10(7):1895–1923, 1998.
  • Y. Bengio and Y. Grandvalet. No unbiased estimator of the variance of k-fold cross-validation. The Journal of Machine Learning Research, 5:1089–1105, 2004.


rhiever commented on May 27, 2024

When you plot the cross-val error, you could also print the standard deviation

Isn't it better to plot the distribution? I showed the mean in the first couple of examples; perhaps I'll just replace those with a distplot.
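
(A minimal sketch of that plot, using seaborn and current scikit-learn import paths; seaborn's distplot has since been replaced by histplot/displot:)

    import seaborn as sns
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    iris = load_iris()
    scores = cross_val_score(RandomForestClassifier(n_estimators=10),
                             iris.data, iris.target, cv=10)  # ten per-fold accuracies
    sns.histplot(scores)  # show the spread across folds, not just the mean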


rhiever commented on May 27, 2024

"It's obviously a problem that our model performs quite differently depending on the data it's trained on." Maybe it would be too much for this intro, but you could mention high variance (overfitting) and high bias (underfitting); I suspect the high variance here comes from the fact that you are only using 10 trees, in RF you typically use hundreds or thousands of trees since it is a special case of bagging with unpruned decision trees after all. Also, Iris may not be the best example for RF since it is a very simple dataset that does not have many features (the random sampling of features is e.g., the advantage of RF over regular bagging). In general, maybe consider starting this section with an unpruned decision tree instead of random forests. And in the end, conclude with random forests and explain why they are typically better (with respect to overfitting). Nice side effect: you can visualize the decision tree with GraphViz. If you decide to stick with RF, consider tuning the n_estimators parameter instead.

I agree that that's a bit more detail than I'd like to go into for this tutorial; I'll leave it to your book to explain that. :-)
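
(For reference anyway, a minimal sketch of the n_estimators tuning mentioned above, using scikit-learn's GridSearchCV; the grid values are made up:)

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    iris = load_iris()
    search = GridSearchCV(RandomForestClassifier(),
                          param_grid={'n_estimators': [10, 50, 100, 500]},
                          cv=10)  # 10-fold CV for every forest size
    search.fit(iris.data, iris.target)
    print(search.best_params_, search.best_score_)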


rhiever commented on May 27, 2024

Maybe mention that random forests are scale-invariant; e.g., you could note that a typical step in the data preprocessing pipeline (required by most ML algos) is to scale the features, which is unnecessary here because you are using decision trees (I believe this is the only scale-invariant algo used in ML) -- maybe also explain what a decision tree is and how it relates to random forests in a few sentences. On a side note, but you probably already know this: Most gradient-based optimization algos

This ties in nicely with #7. I'll add a note to that issue and check this one off.


rasbt commented on May 27, 2024

Isn't it better to plot the distribution? I showed the mean in the first couple of examples; perhaps I'll just replace those with a distplot.

Yes, that's probably even better in this context. I suggested the stddev because the notebook shows

    >>> import numpy as np
    >>> from sklearn.model_selection import cross_val_score
    >>> np.mean(cross_val_score(random_forest_classifier, all_inputs, all_classes, cv=10))
    0.95999999999999996

followed by the sentence

Now we have a much more consistent rating of our classifier's general classification accuracy.

The info is basically already contained in the plot, but this would maybe be a nice summary statistic. And it is useful in practice, too, when you are tuning parameters, e.g., via k-fold CV or in nested CV using grid search, e.g., as some sort of tie-breaker.
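
(That summary statistic amounts to a one-liner; a sketch on the iris data:)

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    iris = load_iris()
    scores = cross_val_score(RandomForestClassifier(n_estimators=10),
                             iris.data, iris.target, cv=10)
    # report the spread alongside the mean, e.g., as a tie-breaker during tuning
    print('Accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))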

I agree that that's a bit more detail than I'd like to go into for this tutorial; I'll leave it to your book to explain that. :-)

Sure, but I think it would maybe be more worthwhile for the reader to use a basic decision tree instead of the random forest ... the hyperparameter tuning (tree depth) would be more intuitive, I guess. You could print an unpruned tree with good training accuracy but bad generalization performance, and then show how you can address this with pruning (max_depth). But this is just a thought :)
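
(A minimal sketch of that unpruned-vs-pruned comparison; the split and the depth value are illustrative:)

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X_train, X_test, y_train, y_test = train_test_split(
        *load_iris(return_X_y=True), random_state=0)

    unpruned = DecisionTreeClassifier().fit(X_train, y_train)           # grows to pure leaves
    pruned = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)  # depth-limited
    # the unpruned tree tends to score near-perfectly on the training data
    # but worse on the held-out data than the pruned one
    print(unpruned.score(X_train, y_train), unpruned.score(X_test, y_test))
    print(pruned.score(X_train, y_train), pruned.score(X_test, y_test))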


rhiever commented on May 27, 2024

Alright, check it out now. It starts with a decision tree classifier then builds up to a random forest.

I think this last commit addresses the rest of your points. Please let me know if I missed anything. :-)


rasbt commented on May 27, 2024

Wow, you seem really determined to turn this IPython notebook into an IPython book :)

Haha, if you are not busy enough, I have another batch for you!

  • Maybe use a table of contents so that people see at the beginning what to expect; it also helps to navigate through the document, I think.

     # Table of Contents
     - [Your Markdown Section Header](#Your-Markdown-Section-Header)
     ... 

And then, you could place a little "go back" link under each section header to jump back to the overview:

    [go back](#Table-of-Contents)
  • Maybe mention in a few words that stratified k-fold keeps the class proportions per fold, in contrast to regular k-fold (see the sketch after this list).

  • Hm, unfortunately the graphviz part is not working (rendering) yet; maybe try png instead of pdf? (One way to do that is sketched after this list.)

  • in general, maybe put a graphviz part directly after your first tree so that people know what a decision tree looks like, and maybe a second one after the hyperparam tuning so that they can see how the model changed?

  • Regarding the phrase "around that limitation by creating a whole bunch of shallow decision trees (hence "forest")":

Sorry, that's technically not correct: you use shallow trees (aka decision stumps) in boosting, not in bagging & random forests. I would maybe introduce it as follows (of course with nicer wording):

If we have a decision tree that grows too deep, we saw that it can overinterpret the training data and then perform poorly on new, unseen data (e.g., test data). (Decision trees are nonparametric models where the number of model parameters depends on the training set.) It is important that we find the optimal tree depth during grid search. A powerful way to overcome this challenge is to build an ensemble of experts: a large number of deep decision trees whose votes we combine. This is actually how random forests work: we create many unpruned trees based on different subsets of the training data (note that these are bootstrapped, i.e., drawn with replacement) and different feature combinations, and let the majority vote decide.
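
(As referenced in the list above, a quick sketch of the stratified-vs-regular k-fold difference on iris, whose rows are sorted by class:)

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import KFold, StratifiedKFold

    X, y = load_iris(return_X_y=True)
    for _, test_idx in StratifiedKFold(n_splits=5).split(X, y):
        print(np.bincount(y[test_idx]))               # every fold: [10 10 10]
    for _, test_idx in KFold(n_splits=5).split(X):
        print(np.bincount(y[test_idx], minlength=3))  # e.g., [30 0 0] -- folds dominated by one class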
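
(And one way to render the tree to png instead of pdf, as suggested in the list above; this assumes the graphviz Python package is installed:)

    import graphviz
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_graphviz

    X, y = load_iris(return_X_y=True)
    tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
    dot_data = export_graphviz(tree, out_file=None)               # DOT description of the tree
    graphviz.Source(dot_data).render('iris_tree', format='png')  # writes iris_tree.png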


rhiever commented on May 27, 2024

Haha... oh dear, what have I gotten myself into? ;-)

Good suggestions - I addressed a couple with some quick fixes and will leave the rest for the weekend.


rasbt commented on May 27, 2024

With great resources (for the next gen data scientists) come great responsibilities! :D


rhiever commented on May 27, 2024

Alrighty, finally got around to most of these! Thanks again for the feedback.


rasbt commented on May 27, 2024

Wow, looks awesome, and no prob, you are always welcome! Ah, one unfortunate caveat with how the GitHub IPython Nb rendering is implemented: it doesn't support jumping between sections via internal links (yet) -- but the TOC is still useful anyway :). Haha, I may call you Random F. Olson from now on, but there is one little phrase you could add to make it technically unambiguous: instead of "-- each trained on a random subset of the features", something like "-- each trained on a random subset of the training samples (drawn with replacement) and of the features (drawn without replacement)". Otherwise, people may think they'd use the "original" training set for each decision tree in the forest.
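
(In scikit-learn terms, those two kinds of sampling map onto the bootstrap and max_features parameters; a sketch, with n_estimators set high per the discussion above:)

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)
    forest = RandomForestClassifier(
        n_estimators=500,     # many unpruned trees
        bootstrap=True,       # per-tree training samples drawn WITH replacement
        max_features='sqrt',  # feature subset per split, drawn WITHOUT replacement
    ).fit(X, y)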

