
lesteve / scikit-learn-tutorial


This repo has moved to https://github.com/INRIA/scikit-learn-mooc/

License: Creative Commons Zero v1.0 Universal

Makefile 0.02% Jupyter Notebook 99.00% Python 0.95% CSS 0.03% HTML 0.01%


scikit-learn-tutorial's People

Contributors

agramfort, alagarrigue, amueller, brospars, gaelvaroquaux, glemaitre, hackmd-deploy, jrleeman, kastnerkyle, lesteve, lmcinnes, lucyleeow, miaodx, mppaskov, nelson-liu, ogrisel, rasbt, rhiever, scw, stavxyz


scikit-learn-tutorial's Issues

Build failure during the pdf export

See https://github.com/lesteve/scikit-learn-tutorial/runs/315274429. This has been failing for a few days, see https://github.com/lesteve/scikit-learn-tutorial/actions. There seems to be a TimeoutError: the PDF export takes 50s and does not complete. Maybe there is a timeout that can be increased in GitHub Actions? Ping @brospars.

Full error:

Run for f in *.html ; do remarkjs-pdf "$f"; done
Convert file:///home/runner/work/scikit-learn-tutorial/scikit-learn-tutorial/dist/index.html to index.pdf ...
Finished.
Convert file:///home/runner/work/scikit-learn-tutorial/scikit-learn-tutorial/dist/ml_concepts.html to ml_concepts.pdf ...
Finished.
Convert file:///home/runner/work/scikit-learn-tutorial/scikit-learn-tutorial/dist/overfit.html to overfit.pdf ...
TimeoutError: Navigation timeout of 30000 ms exceeded
    at /opt/hostedtoolcache/node/12.13.1/x64/lib/node_modules/remarkjs-pdf/node_modules/puppeteer/lib/LifecycleWatcher.js:142:21
  -- ASYNC --
    at Frame.<anonymous> (/opt/hostedtoolcache/node/12.13.1/x64/lib/node_modules/remarkjs-pdf/node_modules/puppeteer/lib/helper.js:111:15)
    at Page.goto (/opt/hostedtoolcache/node/12.13.1/x64/lib/node_modules/remarkjs-pdf/node_modules/puppeteer/lib/Page.js:675:49)
    at Page.<anonymous> (/opt/hostedtoolcache/node/12.13.1/x64/lib/node_modules/remarkjs-pdf/node_modules/puppeteer/lib/helper.js:112:23)
    at convertPdf (/opt/hostedtoolcache/node/12.13.1/x64/lib/node_modules/remarkjs-pdf/remarkjs-pdf.js:80:14)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async main (/opt/hostedtoolcache/node/12.13.1/x64/lib/node_modules/remarkjs-pdf/remarkjs-pdf.js:69:5) {
  name: 'TimeoutError'
}
##[error]Process completed with exit code 10.

How to use hackpad to synchronise with github

cc @GaelVaroquaux

Top right, click on the ... icon, then "versions":
[screenshot]

A modal appears. The name of the GitHub file is shown at the top left (lesteve/scikit-learn/plan.md), and on the right-hand side the down/up arrows pull from or push to GitHub.
[screenshot]

If you have any trouble with that, let me know!

My notes about possible improvements from Euroscipy tutorial

This is not very structured, so feel free to edit, comment, open other issues for bigger chunks of work:

Content

  • have a TOC per notebook?
  • tinyurl (or huit.re) with link for easier access to the github repo (README)
  • first notebook with TOC so that the binder goes directly to this notebook
  • Too-wide code: the numerical_columns and categorical_columns lists should wrap
    at 'capital-loss' and 'marital-status'. I think we should use a dedicated
    formatter, maybe black (though I feel it takes too much vertical space) or
    yapf with some nice settings.
  • Say that education-num is not the number of years of education (I say that we
    could expect this, but I did not say this was not true)
  • Should we get rid of education-num everywhere, since this is the same as
    education?
  • Young people work part-time. Say that non-working people (students) are not
    part of the survey.
  • Put solutions in a different folder? They interfere with the notebooks: you
    have to say "open the 02 notebook, but not the one with exercise ...".
  • Naming: harmonize df vs adult_census; maybe data is good enough.
  • Harmonize the way to get categorical_columns vs numerical_columns. Some code
    uses the dtype, some code uses explicit column names.
  • Side comments about the train-test split: the goal is not to memorize. Should
    there be more details for the MOOC? Or links to the first part about
    overfitting vs underfitting?
  • 02 exercise 01: use train_test_split rather than cross_val_score.
  • Different kinds of preprocessing: add a link to the user guide. The question
    was: what happens if the data is not Gaussian?
  • n_iter_ is a list for some reason ...
  print(
    f"The accuracy using a {model.__class__.__name__} is "
    f"{model.score(data_test, target_test):.3f} with a fitting time of "
    f"{elapsed_time:.3f} seconds in {model[-1].n_iter_} iterations"
  )
  The accuracy using a Pipeline is 0.818 with a fitting time of 0.809 seconds in [13] iterations
  • Cross-validation explanation plot: add legend for blue vs red. It looks like
    there might be better images from scikit-learn documentation.
  • Minor: sparse=False in OneHotEncoder is there just for visualization purposes
    (to see the numpy array).
  • handle_unknown='ignore': explain the reason in more detail. A category seen at
    test time but not in the training data is encoded as all zeros.
  • For exercises, link to the similar example, e.g. for OrdinalEncoder put a
    link to what we did with OneHotEncoder.
  • Can we have links in notebooks to another notebook that work locally, on
    binder, etc.?
  • Question about whether a pipeline with a scaler computes the mean on the
    training set only. Answering requires explaining how the Pipeline works and
    calls .fit and .transform. Maybe a full explanation is not needed: you can
    just say that the parameters are modified only in .fit (so not in .predict).
  • Question about Pipeline: why is it useful rather than just writing the code
    yourself? You have to explain .fit and .fit_transform. Hmmm, maybe you can
    just add a comment about why the Pipeline is useful in general.
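
The two Pipeline questions above could be answered with a short demonstration. The following is a minimal sketch, assuming a synthetic dataset from make_classification as a stand-in for the numerical census columns (the data and variable names are illustrative, not taken from the notebooks). It checks that the scaler statistics are learned in .fit only and reused unchanged when scoring, and it also shows that n_iter_ is a NumPy array, which is why it prints as [13] in the note above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numerical columns (assumption, not the census data).
data, target = make_classification(n_samples=500, n_features=4, random_state=0)
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0
)

model = make_pipeline(StandardScaler(), LogisticRegression())
# fit: the scaler learns mean/std on the training data, then the classifier
# is fitted on the scaled training data.
model.fit(data_train, target_train)

scaler = model[0]
mean_after_fit = scaler.mean_.copy()

# score (like predict) only calls scaler.transform, so the statistics
# learned during fit are reused and not recomputed on the test data.
accuracy = model.score(data_test, target_test)
mean_after_score = scaler.mean_

# n_iter_ is an array with one element for a binary problem.
n_iter = model[-1].n_iter_
print(f"accuracy={accuracy:.3f}, iterations={int(n_iter[0])}")
```

Writing the same steps by hand means remembering to call fit_transform on the training set and only transform on the test set; the Pipeline makes that mistake harder to make.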

Miscellaneous

  • Timings are very slow on binder:
    0.7 s for the LogisticRegression fit vs 5.6 s on binder.
    2 minutes (rather than ~10 s on Olivier's machine) for "Reference pipeline (no
    numerical scaling and integer-coded categories)" in
    02_basic_preprocessing_exercise_03_solution.ipynb

Suggestions

I have some minor suggestions:

01_tabular_data_exploration:

  • adult_census.profile_report() tells us that there are a few duplicate rows. It may be worthwhile to explain how these duplicate entries may or may not affect prediction.
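
One way to make the duplicate-row point concrete is to count the duplicates directly with pandas. This is a minimal sketch on a hypothetical miniature table (the values are made up for illustration, only the column names come from the dataset). In survey data, identical rows can be distinct people with the same attributes, so dropping them is a judgment call; the concern for evaluation is that identical rows may end up on both sides of a train-test split.

```python
import pandas as pd

# Hypothetical miniature of the census table; values are made up.
adult_census = pd.DataFrame(
    {
        "age": [25, 38, 25, 52, 38],
        "hours-per-week": [40, 50, 40, 45, 50],
        "class": [" <=50K", " >50K", " <=50K", " >50K", " >50K"],
    }
)

# duplicated() flags every repeat of an earlier, fully identical row.
n_duplicates = int(adult_census.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row.
deduplicated = adult_census.drop_duplicates()

print(f"{n_duplicates} duplicate rows, {len(deduplicated)} distinct rows")
```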

02_basic_preprocessing

  • Convergence warning: you explain that this tells us that our model stopped learning because it reached the maximum number of iterations allowed, and that scaling the data will help. Can you expand on what convergence means, why increasing the number of allowed iterations is a bad idea, and why scaling the data helps?
  • Explain what the StandardScaler does? Maybe not everyone knows the equation.
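
For the StandardScaler equation, a hand-written version may be enough: each value x becomes (x - mean) / std, with the mean and standard deviation computed per column on the training data. The sketch below uses only the standard library and made-up values; it uses the population standard deviation, which is what the scikit-learn documentation describes for StandardScaler (equivalent to numpy.std with ddof=0).

```python
from statistics import fmean, pstdev

ages = [25.0, 38.0, 28.0, 44.0, 18.0]  # made-up values for illustration

mean = fmean(ages)
std = pstdev(ages)  # population standard deviation (ddof=0)

# Standardization: subtract the mean, divide by the standard deviation.
scaled = [(x - mean) / std for x in ages]

print([round(value, 3) for value in scaled])
```

After scaling, the column has zero mean and unit variance, so features measured in very different units end up on comparable scales, which generally makes the optimization problem easier for iterative solvers.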

04_basic_parameters_tuning:

  • I think you need to explain more about the hyper-parameter C. Maybe even just give them a useful link to read on regularisation and overfitting?
  • For the last cell:
model = make_pipeline(
    preprocessor, LogisticRegressionCV(max_iter=1000, solver='lbfgs', cv=5)
)
score = cross_val_score(model, data, target, n_jobs=4, cv=5)

You don't provide a Cs argument like you do above, and it might be worth mentioning that by default it tests a grid of 10 C values.
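
The default grid can be inspected after fitting. This is a minimal sketch on a synthetic dataset (an assumption, used here instead of the census data and preprocessor): with Cs left at its default of 10, the fitted Cs_ attribute holds 10 values spaced logarithmically between 1e-4 and 1e4.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in for the preprocessed data (assumption).
data, target = make_classification(n_samples=200, n_features=4, random_state=0)

# No Cs argument: the default Cs=10 is used.
model = LogisticRegressionCV(max_iter=1000, cv=5)
model.fit(data, target)

tested_Cs = model.Cs_                   # the grid of C values that was tried
expected_grid = np.logspace(-4, 4, 10)  # 10 values from 1e-4 to 1e4
best_C = model.C_                       # the value selected by cross-validation

print(tested_Cs)
print(best_C)
```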

Inconsistent tree diagram and plot in data exploration notebook

https://nbviewer.jupyter.org/github/lesteve/scikit-learn-tutorial/blob/master/rendered_notebooks/01_tabular_data_exploration.ipynb

image

You would expect the high-age, high hours-per-week region to be "high-earning". This is what the 2D plot shows (in red). The plot_tree output (tree diagram) says value = [637, 572], so in the same order as all the other leaves.

It is not clear where the problem is: plot_tree (scikit-learn) or plot_tree_decision_function (a function in the notebook).
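
A quick way to rule out a class-ordering mix-up is to check classes_ on the fitted tree, since per-class outputs such as predict_proba columns follow that order. The sketch below uses a tiny made-up dataset rather than the notebook's data, so it only illustrates the check and does not diagnose the issue itself.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up dataset: one feature (age), string labels as in the census target.
X = np.array([[20], [25], [30], [50], [55], [60]])
y = np.array([" <=50K", " <=50K", " <=50K", " >50K", " >50K", " >50K"])

tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

# classes_ is sorted; per-class outputs are reported in this order.
class_order = list(tree.classes_)

# Probabilities for an older sample, one column per entry of classes_.
proba_old = tree.predict_proba([[58]])[0]

print(class_order, proba_old)
```

If a plotting helper assumes a different order (for example, the positive class first), its colors can disagree with the tree diagram even though both are drawn from the same model.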

RFC: Stop pushing the rendered notebooks to the main repo?

Maybe instead we could have one CI job that checks that the notebooks render without errors, and another CI job that renders the notebooks only on merge commits in master and pushes the results into a dedicated "rendered" branch that we then use for nbviewer links. We could also render the HTML to make the rendered notebooks directly browsable as a website on https://lesteve.github.io/scikit-learn-tutorial

This way we would never have diffs with spurious output changes in pull requests.

WDYT?
