lesteve / scikit-learn-tutorial

This repo has moved to https://github.com/INRIA/scikit-learn-mooc/

License: Creative Commons Zero v1.0 Universal

Makefile 0.02% Jupyter Notebook 99.00% Python 0.95% CSS 0.03% HTML 0.01%

scikit-learn-tutorial's People

glemaitre ogrisel billy-odera bharatr21 abhijithanil asdf247 naoms adrientoulouse mathurinm gaelvaroquaux timyb chaoshengt iamokay brospars lfarhi lucyleeow alagarrigue falconzyx ptracton

scikit-learn-tutorial's Issues

Build failure during the pdf export

https://github.com/lesteve/scikit-learn-tutorial/runs/315274429. This has been failing for a few days, see https://github.com/lesteve/scikit-learn-tutorial/actions. There seems to be a TimeoutError: the PDF export takes 50s and does not complete. Maybe there is a timeout that can be increased with GitHub Actions? ping @brospars.

Full error:

Run for f in *.html ; do remarkjs-pdf "$f"; done
Convert file:///home/runner/work/scikit-learn-tutorial/scikit-learn-tutorial/dist/index.html to index.pdf ...
Finished.
Convert file:///home/runner/work/scikit-learn-tutorial/scikit-learn-tutorial/dist/ml_concepts.html to ml_concepts.pdf ...
Finished.
Convert file:///home/runner/work/scikit-learn-tutorial/scikit-learn-tutorial/dist/overfit.html to overfit.pdf ...
TimeoutError: Navigation timeout of 30000 ms exceeded
    at /opt/hostedtoolcache/node/12.13.1/x64/lib/node_modules/remarkjs-pdf/node_modules/puppeteer/lib/LifecycleWatcher.js:142:21
  -- ASYNC --
    at Frame.<anonymous> (/opt/hostedtoolcache/node/12.13.1/x64/lib/node_modules/remarkjs-pdf/node_modules/puppeteer/lib/helper.js:111:15)
    at Page.goto (/opt/hostedtoolcache/node/12.13.1/x64/lib/node_modules/remarkjs-pdf/node_modules/puppeteer/lib/Page.js:675:49)
    at Page.<anonymous> (/opt/hostedtoolcache/node/12.13.1/x64/lib/node_modules/remarkjs-pdf/node_modules/puppeteer/lib/helper.js:112:23)
    at convertPdf (/opt/hostedtoolcache/node/12.13.1/x64/lib/node_modules/remarkjs-pdf/remarkjs-pdf.js:80:14)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async main (/opt/hostedtoolcache/node/12.13.1/x64/lib/node_modules/remarkjs-pdf/remarkjs-pdf.js:69:5) {
  name: 'TimeoutError'
}
##[error]Process completed with exit code 10.

Hackmd URL for plan.md

https://hackmd.io/6fFy3_y6SOWonOYeGaNiHw

How to use hackpad to synchronise with github

cc @GaelVaroquaux

Top right, click on the ... icon then "versions":

A modal appears, the name of the github file is shown on the top left (lesteve/scikit-learn/plan.md) and then on the right hand side the down/up arrows can pull/push to github.

If you have any trouble with that, let me know!

My notes about possible improvements from Euroscipy tutorial

This is not very structured, so feel free to edit, comment, open other issues for bigger chunks of work:

Content

have a TOC per notebook?
tinyurl (or huit.re) with link for easier access to the github repo (README)
first notebook with TOC so that the binder goes directly to this notebook
Too-wide code: the numerical_columns and categorical_columns definitions should wrap at 'capital-loss' and 'marital-status'. I think we should use a dedicated formatter, maybe black (although I feel it takes too much vertical space) or yapf with some nice settings.
Say that education-num is not the number of years of education (I said that one could expect this, but I did not say that it is not the case).
Should we get rid of education-num everywhere, since it carries the same information as education?
Young people work part-time. Say that non-working people (students) are not part of the survey.
Put solutions in a different folder? They interfere with the notebooks: you have to say "open the 02 notebook, but not the one with exercise ...".
Naming: harmonize df vs adult_census; maybe data is good enough.
Harmonize the way to get categorical_columns vs numerical_columns. Some code uses dtype, some code uses explicit column names.
Side comments about the train-test split: the goal is not to memorize. Should there be more details for the MOOC? Or links to the first part about overfitting vs underfitting?
02 exercise 01: use train_test_split, not cross_val_score.
Different kinds of preprocessing: add a link to the user guide. The question was: what happens if the data is not Gaussian?
n_iter_ is a list for some reason ...

  print(
    f"The accuracy using a {model.__class__.__name__} is "
    f"{model.score(data_test, target_test):.3f} with a fitting time of "
    f"{elapsed_time:.3f} seconds in {model[-1].n_iter_} iterations"
  )
  The accuracy using a Pipeline is 0.818 with a fitting time of 0.809 seconds in [13] iterations
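
On the n_iter_ note: scikit-learn documents LogisticRegression.n_iter_ as an array (one entry per class, or a single entry for binary problems), which would explain the [13] in the message. A minimal sketch of reporting the plain integer instead, using synthetic data from make_classification as a stand-in for the census data (an assumption, not the tutorial's code):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (stand-in for the tutorial dataset).
data, target = make_classification(n_samples=200, n_features=5, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(data, target)

# n_iter_ is an array, even for a binary problem where it holds one value.
n_iter = model[-1].n_iter_
print(f"raw attribute: {n_iter!r}")

# Index the single entry to report a plain integer.
print(f"fitted in {n_iter[0]} iterations")
```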

Cross-validation explanation plot: add a legend for blue vs red. There might be better images in the scikit-learn documentation.
Minor: sparse=False in OneHotEncoder is only there for visualization purposes (to see the NumPy array).
handle_unknown='ignore': explain the reason in more detail: if a category seen at test time was not present in the training data, all the one-hot columns for that feature are set to 0.
For exercises, link to the similar worked example, e.g. for OrdinalEncoder link to what we did with OneHotEncoder.
Can we have links from one notebook to another that work locally, on binder, etc.?
Question about the pipeline with the scaler: does it compute the mean on the training set? So you have to explain how the Pipeline works and when it calls .fit and .transform. Maybe you don't have to explain everything: you can just say that the parameters are modified only in .fit (so not in .predict).
Question about the pipeline: why is it useful rather than just writing the code yourself? You have to explain .fit and .fit_transform. Hmmm, maybe you can just add a comment about why the Pipeline is useful in general.
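
The two Pipeline questions above can be illustrated with a small sketch. It assumes a StandardScaler followed by LogisticRegression on synthetic data (not the tutorial's dataset) and checks that the scaler's mean is learned from the training set during .fit and left untouched by .predict on the test set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, assumed here only for illustration.
data, target = make_classification(n_samples=300, n_features=4, random_state=0)
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0
)

model = make_pipeline(StandardScaler(), LogisticRegression())

# .fit calls fit_transform on the scaler, then fit on the classifier.
model.fit(data_train, target_train)
scaler = model[0]
mean_after_fit = scaler.mean_.copy()

# The learned mean is the mean of the training data only.
print(np.allclose(mean_after_fit, data_train.mean(axis=0)))

# .predict only calls transform on the scaler: its parameters do not change.
predictions = model.predict(data_test)
print(np.array_equal(mean_after_fit, scaler.mean_))
```

Written by hand, the same logic means remembering to call fit_transform on the training data and only transform on the test data; the Pipeline keeps those two steps from being mixed up.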

Miscellaneous

Timings are very slow on binder:
0.7 s for a LogisticRegression fit vs 5.6 s on binder.
2 minutes on binder (rather than ~10 s on Olivier's machine) for the "Reference pipeline (no numerical scaling and integer-coded categories)" in 02_basic_preprocessing_exercise_03_solution.ipynb.

Suggestions

I have some minor suggestions:

01_tabular_data_exploration:

adult_census.profile_report() tells us that there are a few duplicate rows. It may be worth explaining whether and how these duplicate entries affect prediction.
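
One way to make the duplicate-row discussion concrete is to count the duplicates with pandas. The small frame below is made up for illustration (the column names mimic the census data, but the values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data; the last row repeats the first one.
adult_census = pd.DataFrame(
    {
        "age": [25, 38, 52, 25],
        "education": ["HS-grad", "Bachelors", "Masters", "HS-grad"],
        "class": ["<=50K", "<=50K", ">50K", "<=50K"],
    }
)

# duplicated() flags every repeat of an earlier row.
n_duplicates = adult_census.duplicated().sum()
print(f"{n_duplicates} duplicated row(s) out of {len(adult_census)}")

# Dropping them is one option; whether it is the right one depends on the
# data: identical rows may be distinct people who share the same values.
deduplicated = adult_census.drop_duplicates()
print(deduplicated.shape)
```

If duplicates are kept, the same row can land in both the training and the test set after a random split, which can make test scores slightly optimistic.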

02_basic_preprocessing

Convergence warning: you explain that it tells us the model stopped learning because it reached the maximum number of iterations allowed, and that scaling the data will help. Can you expand on what convergence means, why increasing the number of allowed iterations is a bad idea, and why scaling the data helps?
Explain what the StandardScaler does? Maybe not everyone knows the equation.
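
For reference, StandardScaler computes z = (x - mean) / std for each column, where mean and std are taken from the data passed to .fit. A short sketch on made-up numbers (the two columns are hypothetical) comparing the manual computation with the scaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for two numerical columns (e.g. age, hours-per-week).
data = np.array([[25.0, 40.0], [38.0, 50.0], [52.0, 20.0], [45.0, 40.0]])

scaler = StandardScaler().fit(data)
scaled = scaler.transform(data)

# Same computation by hand: subtract the column mean, divide by the column
# standard deviation (NumPy's default population std, ddof=0).
manual = (data - data.mean(axis=0)) / data.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```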

04_basic_parameters_tuning:

I think you need to explain more about the hyper-parameter C. Maybe even just give them a useful link to read on regularisation and overfitting?
For the last cell:

model = make_pipeline(
    preprocessor, LogisticRegressionCV(max_iter=1000, solver='lbfgs', cv=5)
)
score = cross_val_score(model, data, target, n_jobs=4, cv=5)

You don't provide a Cs argument here as you do above, so it might be worth mentioning that by default it tests a grid of 10 C values.
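
A quick way to show that default grid is to inspect the Cs_ attribute after fitting. The sketch below uses synthetic data rather than the notebook's preprocessor and census data (an assumption for brevity); per the scikit-learn documentation, an integer Cs gives values on a logarithmic scale between 1e-4 and 1e4:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic data (stand-in for the preprocessed census data).
data, target = make_classification(n_samples=200, n_features=5, random_state=0)

# No Cs argument: the default Cs=10 asks for a grid of 10 values.
model = LogisticRegressionCV(max_iter=1000, cv=5)
model.fit(data, target)

print(model.Cs_)  # the grid that was searched
print(model.C_)   # the value selected by cross-validation

# Compare with a log-spaced grid of 10 values between 1e-4 and 1e4.
expected_grid = np.logspace(-4, 4, 10)
print(np.allclose(model.Cs_, expected_grid))
```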

Inconsistent tree diagram and plot in data exploration notebook

https://nbviewer.jupyter.org/github/lesteve/scikit-learn-tutorial/blob/master/rendered_notebooks/01_tabular_data_exploration.ipynb

You would expect the high-age, high-hours-per-week region to be "high-earning". This is what the 2D plot shows (in red). However, plot_tree (the tree diagram) shows value = [637, 572] for that leaf, in the same class order as all the other leaves.

It is not clear where the problem is: plot_tree (scikit-learn) or plot_tree_decision_function (a function defined in the notebook).
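
One way to narrow this down is to check the class order directly on the fitted tree. In scikit-learn, the entries of a node's value follow the order of classes_, and the prediction for a leaf is the class with the largest entry. A sketch on made-up data (not the notebook's) that verifies this relationship:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: two features, string class labels.
rng = np.random.RandomState(0)
data = rng.uniform(size=(200, 2))
target = np.where(data[:, 0] + data[:, 1] > 1.0, "high", "low")

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data, target)

# Entries of `value` in each node follow the order of classes_.
print(tree.classes_)

# For every sample, find its leaf and the class with the largest value there.
leaf_ids = tree.apply(data)
leaf_values = tree.tree_.value[leaf_ids, 0, :]
majority_class = tree.classes_[leaf_values.argmax(axis=1)]

# The prediction is the majority class of the leaf.
print(np.array_equal(majority_class, tree.predict(data)))
```

If the same check on the notebook's tree agrees with the diagram but not with the colours of the 2D plot, the mismatch is more likely in the plotting helper than in plot_tree.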

RFC Stop pushing the rendered notebook in the main repo?

Maybe instead we could have one CI job that checks that the notebooks render without errors, and another CI job that renders the notebooks only on merge commits in master and pushes the results to a dedicated "rendered" branch, which we then use for nbviewer links. We could also render to HTML to make the rendered notebooks directly browsable as a website at https://lesteve.github.io/scikit-learn-tutorial

This way, pull requests would never contain diffs with spurious output changes.

WDYT?
