
advanced-comp-2017's People

Contributors

betatim, maximeschubiger


advanced-comp-2017's Issues

Definition of correlation of trees in a random forest

Hello,

I am going down the rabbit hole of the definition of correlation of decision trees in a random forest.

For those who don't have time to read this wall of text, here's a quick summary.

tl;dr: what is the exact definition of correlation between trees in a random forest? And how do we interpret this value?

long explanation

At first, I naively thought one could define it as

definition 1: correlation = correlation in the predictions of all the trees in a forest

However, I was having some doubts about my intuition, and Shaina Race's comment on this Quora question confirmed my doubts.
Essentially, the way I understand her comment is: this definition is not in line with intuition, because why would I want correlation in the predictions to be low? If most of the trees get the right answer most of the time, the correlation will be high but the model itself will be pretty good!
Moreover, this definition gives no indication of the robustness or the generalisation power of the ensemble. She seems to suggest another definition:

definition 2: correlation = correlation in the errors

This definition seems nicer, because it is intuitively closer to a notion of robustness of the ensemble as a whole.
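To make the two definitions concrete, here is a small sketch (my own, not from the course) that computes both on a toy dataset with a scikit-learn RandomForestClassifier, taking the mean pairwise correlation across the per-tree outputs:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_tr, y_tr)

# One row per tree: that tree's prediction for every test sample.
per_tree = np.array([tree.predict(X_te) for tree in forest.estimators_])

def mean_pairwise_corr(rows):
    # Average of the off-diagonal entries of the correlation matrix.
    # (A tree with constant output would make this NaN.)
    c = np.corrcoef(rows)
    n = len(rows)
    return (c.sum() - n) / (n * (n - 1))

# Definition 1: correlation of the raw predictions.
corr_predictions = mean_pairwise_corr(per_tree)
# Definition 2: correlation of the errors (True where a tree is wrong).
corr_errors = mean_pairwise_corr(per_tree != y_te)

print(corr_predictions, corr_errors)
```

On a dataset like this the two numbers can differ quite a bit, which is exactly why the choice of definition matters for interpretation.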

Unfortunately, I was down the rabbit hole and could not stop sliding. I started wondering whether the exercise asked for correlation but actually meant variance. What confused me is that in many sources (e.g. the sklearn user guide) random forests are cited as a method to reduce variance, not correlation. Variance in this case has a pretty precise meaning:

variance = variance in the predictions

However, I was a bit lost because I wasn't sure how to extend this notion from a regression problem to a classification problem (especially a multi-class one). I found this paper by P. Domingos about the bias-variance-noise decomposition in a general setting; it seemed quite math-heavy but ultimately proved decently readable. However, I still have questions about it, in particular how the constants in front of the bias and variance terms ($c_1$ and $c_2$ in the paper) affect our interpretation of their values.
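For what it's worth, the way I read Domingos' definitions for zero-one loss: the "main" prediction at a point is the majority vote over models, and the variance is the probability that a single model's prediction differs from it. A toy sketch (the labels and helper names are mine):

```python
import numpy as np

# Predictions of four models for four test points, shape (n_models, n_points);
# the entries are arbitrary class ids from a multi-class problem.
preds = np.array([
    [0, 1, 2, 1],
    [0, 1, 1, 1],
    [0, 2, 2, 1],
    [0, 1, 2, 0],
])

def majority(col):
    """Majority vote over models at one point."""
    vals, counts = np.unique(col, return_counts=True)
    return vals[np.argmax(counts)]

# The "main" prediction at each point.
main = np.array([majority(preds[:, j]) for j in range(preds.shape[1])])

# Variance under zero-one loss: how often a single model disagrees
# with the main prediction, averaged over models and points.
variance = np.mean(preds != main)
print(main, variance)  # → [0 1 2 1] 0.1875
```

Because it counts disagreements with a vote rather than deviations from a numeric mean, this notion carries over to multi-class problems where the regression definition does not.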

Exercise from 24 April

The issue for last week's exercise. Sorry for forgetting to create it; feel free to remind me, or just create one yourself if it doesn't exist.

Final projects

Deadline: 29 May 2017 at 11:00am

If you are working on a final project and need credit for the course please post here with your name and what topic you are working on.

To hand in the project, also post in this thread with a link to your work on GitHub. It should contain the code to run the analysis as well as a short written report. The report should be in the style of a journal article that reports on your research.

Looking forward to seeing the results.

If you are doing a project but don't need credit, you can also post here but please make a little note saying "I don't need credit".

Lecture notes

It would be nice to have a scanned copy of the notes used during the lectures of May 1st and 8th (even a simple photo would work). Is that possible? Thank you.

Question on course: "Useless" variables, decision tree vs neural networks

Hello,

I had a follow-up question to today's discussion, although it may be covered in the next lecture.

Today we saw that for a decision tree / random forest, it is best not to have "useless" variables, i.e. variables that offer little or no discriminating power. Therefore, if we want to implement such an algorithm, we have to study the input variables beforehand and remove the useless ones. Right?

What about deep neural networks? At the end of the course, you mentioned that deep learning works best with raw data instead of high-level features. Can we conclude that it "ignores" useless variables? Or would a large number of useless variables skew the training and result in e.g. overtraining?
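As a quick empirical check of the first part (a toy sketch of my own, not from the lecture): fit a random forest on data with known noise columns and look at `feature_importances_` to see how much attention the trees pay to the useless variables.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 5 informative features plus 15 pure-noise ones; with shuffle=False
# the informative columns come first.
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, n_redundant=0,
                           shuffle=False, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# The importances sum to 1; the informative block should dominate.
imp = forest.feature_importances_
print(imp[:5].sum(), imp[5:].sum())
```

The noise columns still get a nonzero share, which is one way to see how useless variables dilute the splits.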

No module named 'keras'

Hi,
Does keras come together with the installation of Anaconda, or are you supposed to install it separately?

Exercises from 10 April

Post questions in this thread and, if you need credit, a link to your notebook in your repository on GitHub.

Exercise from 3 April

Post questions in this thread and, if you need credit, a link to your notebook in your repository on GitHub.
