
deltaanalytics / machine_learning_for_good


Machine learning fundamentals lesson in interactive notebooks

Home Page: http://www.deltanalytics.org/

License: Creative Commons Attribution 4.0 International

Jupyter Notebook 99.96% Shell 0.01% Python 0.03%
data-analysis data-mining data-science data-visualisation non-profit python python3

machine_learning_for_good's People

Contributors

amankhullar, ayingsu, brianspiering, cloudchaoszero, dmarinere, emekaborisama, erourke23, hannahksong, hlina, jackalack, karlazz, kevinrpan, krisharma, lekeonilude, rosina9700, sarahooker, sydneymwong


machine_learning_for_good's Issues

Linear Regression Module Edits and Additions

Hello! This is an issue with information about my PR: #57

Here are the changes that I made:

  • Fixed package imports to work with the most recent version of the statsmodels package. The change is subtle: statsmodels now uses its .api subpackage when data and labels are passed directly, and .formula.api when a formula and a dataframe are the parameters. I also corrected the capitalization errors.
  • Added intuitive explanations to the regularization module covering what both types of regularization actually do, with a plain-English account of how they perform that regularization.
  • Added detail on why we hold out a test set rather than testing the model on the same data it was trained on. I also incorporated information from the theoretical slides for this module, briefly introducing the concept of overfitting and how holding out a test set lets us check for it.
  • Added labels to every plot that was missing them. I believe this practice is critical when creating notebooks that others will read, and the best way to build the habit is to model it in our own examples :)
  • Added explanations for the technical jargon in the notebooks. Since these notebooks apply the theoretical material in code, students should have that theory readily available while they are programming, so I explained the difference between univariate and multivariate models and described the statistical assumptions being made.
  • Made some minor spelling fixes.
  • Finally, in the first submodule I added a description of why we use regression rather than classification for this task, to ease students into the coding.

What I would also like to do in the future (building on this PR):

  • Add intuition for why standardization works
  • Add explanations for the statistical tests that are performed to validate our statistical assumptions when using linear regression
  • Standardize usage of Seaborn or Matplotlib
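On the first of these, the core intuition can be shown in a few lines (a sketch using a hand-rolled standardizer rather than any particular library):

```python
import numpy as np

# Standardization rescales each feature to mean 0 and standard deviation 1,
# so features measured in very different units contribute comparably to
# distance- and gradient-based methods.
def standardize(X):
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
Z = standardize(X)
print(Z.mean(axis=0))  # ~[0, 0]
print(Z.std(axis=0))   # [1, 1]
```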

I think the code for this module is truly fantastic, it is very comprehensive and visually appealing. I went in with the mindset of "learning linear regression for the first time" and the code flows extremely well and the visualizations allow for an intuitive understanding. As such, I primarily focused on adding aids and guiding students with intuitive descriptions of what we are doing with our program at each step.

NOTE: Please change the y-axis label in the title plot for submodules 3_1 and 3_2! It currently says "Input Feature", whereas that is really the value we are looking to predict. See below:

Please let me know if you have any questions or concerns about my changes.
(Screenshot attached: 2020-03-20 22-17-49)

Conditional Inference Trees

I enjoyed the code that you have here as well as the descriptions in the accompanying slideshow. I have two suggestions.

  1. Would you consider adding a section on conditional inference trees? These trees use probabilistic associations between the features and the outcome variable to make splits, rather than the information-gain (or impurity/error-reduction) criterion used here. Conditional inference trees reduce the variable-selection bias present in CART methods (the decision-tree framework used here), where variables offering more candidate splits are preferentially selected over variables offering fewer. They are also safer to interpret, since each split is based on a statistical association rather than an absolute difference in some information metric. My one worry is obfuscating your very clear presentation of the decision tree algorithm, but this might reduce confusion down the line when people over-interpret CART trees. Note that aside from the splitting criterion, conditional inference trees are identical to CART trees.
  2. In the Module 5 Decision Tree slideshow, on slide 53 (the slide right before "Model Performance"), you state that the RMSE algorithm proceeds by "Calculating the variance for each node" and "calculating the variance for each split as weighted average of each node variance." Unless I'm misunderstanding, I believe this is incorrect: you are calculating the root mean squared error (RMSE) for each node and each split, not the variance. The equations are similar, but in the variance you subtract the average value from each observation in the summation, whereas in the RMSE you subtract the predicted value from each observation.
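The distinction in point 2 is easy to verify numerically (toy numbers of my own choosing):

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])        # observed values in a node
y_pred = np.array([4.0, 4.0, 8.0, 8.0])   # model predictions for that node

# Variance: squared deviations from the *mean* of the observations
variance = np.mean((y - y.mean()) ** 2)

# RMSE: squared deviations from the *predicted* values, then a square root
rmse = np.sqrt(np.mean((y - y_pred) ** 2))

print(variance)  # 5.0
print(rmse)      # 1.0 -- clearly not the same quantity
```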

Thank you for your consideration.

Intro to numpy notebook

I think there should be a notebook that introduces students to NumPy. NumPy is a very important tool in ML, and in teaching NumPy the concepts of vectors and matrices can be visualized, which would be a great help to students when they move on to linear regression and other machine learning models.
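A taste of what such a notebook could open with (a sketch, not the proposed notebook itself):

```python
import numpy as np

# A vector is a 1-D array; a matrix is a 2-D array
v = np.array([1.0, 2.0, 3.0])
M = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0]])

# Matrix-vector product: the core operation behind linear-regression
# predictions (y_hat = X @ w)
print(M @ v)      # [1. 4.]
print(v.dot(v))   # squared length of v: 14.0
```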

Which English? American or British

I sweep something under the rug. You sweep something under the carpet.
Which is it (the eternal question)?

I have a preference for American English but do not care too much. Let's just be consistent.

I guess this applies to the slides also...

Switch from Anaconda to Miniconda

Anaconda is the current build environment. It is great, but it uses a lot of disk space (>1 GB).

We could use Miniconda, the lightweight version. It has a smaller footprint but requires a more explicit package list. It might work better on students' computers.

Thoughts?

Suggested Additions

1. In module_1_introduction_pandas >> 1_1 intro_to_python:

  • In the section on Strings, I propose that other string methods be taught (e.g. find(), startswith(), and join()).
  • In the section on Lists, I propose that other list methods be taught (e.g. count(), sort(), and copy()).
  • In the section on Dictionaries, I propose that other dictionary methods be taught (e.g. keys(), values(), and popitem()).
  • In the section on Sets, I propose that other set methods be taught (e.g. add(), difference(), and issubset()).
  • I propose that a section on tuples be added.
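A quick tour of the methods proposed above, plus a tuple (values are my own toy examples):

```python
# Strings
s = "machine learning"
print(s.find("learn"))        # index of the substring: 8
print(s.startswith("mach"))   # True
print("-".join(["a", "b"]))   # "a-b"

# Lists
nums = [3, 1, 3, 2]
print(nums.count(3))          # 2
nums.sort()
print(nums)                   # [1, 2, 3, 3]
backup = nums.copy()          # independent shallow copy

# Dictionaries
d = {"a": 1, "b": 2}
print(list(d.keys()))         # ['a', 'b']
print(list(d.values()))       # [1, 2]
print(d.popitem())            # removes and returns the last pair: ('b', 2)

# Sets
a = {1, 2, 3}
a.add(4)
print(a.difference({2, 4}))   # {1, 3}
print({1, 2}.issubset(a))     # True

# Tuples: immutable, fixed-length sequences
point = (3, 4)
x, y = point                  # tuple unpacking
```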

2. In module_1_introduction_pandas >> 1_3_intro_to_pandas:

  • I propose that the DataFrame.info() method is introduced.
  • Other pd.read_ methods, like pd.read_excel(), pd.read_json(), and pd.read_sql(), should be introduced in addition to pd.read_csv().
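For instance (column names invented for illustration; note that info() lives on the DataFrame, not on the pd module):

```python
import pandas as pd

df = pd.DataFrame({"amount": [100, 250, 75],
                   "sector": ["retail", "food", "arts"]})

# DataFrame.info() prints column dtypes, non-null counts, and memory usage
df.info()

# Other readers besides pd.read_csv(), each returning a DataFrame:
#   pd.read_excel("loans.xlsx")
#   pd.read_json("loans.json")
#   pd.read_sql(query, connection)
```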

3. In module_1_introduction_pandas >> 1_5_exploratory_data_analysis:

  • In cell 7, num_df has not yet been defined and should be removed as it causes an error.
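If keeping the cell is preferable to removing it, one hypothetical fix is to define num_df first as the numeric subset of the data (column names here are invented):

```python
import pandas as pd

df = pd.DataFrame({"loan_amount": [100.0, 250.0],
                   "sector": ["retail", "food"]})

# Define num_df before it is used: keep only the numeric columns
num_df = df.select_dtypes(include="number")
print(num_df.columns.tolist())  # ['loan_amount']
```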

the data folder is missing

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-2-61495ea1e105> in <module>()
----> 1 df = pd.read_csv("../data/loans_full.zip", index_col=0)
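Until the data folder is restored, a small wrapper could at least fail with a clearer hint (the path is the one from the traceback; the message wording is only a suggestion):

```python
from pathlib import Path
import pandas as pd

def load_loans(path="../data/loans_full.zip"):
    """Load the Kiva loans data, failing with a helpful message if absent."""
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(
            f"{p} is missing. Download the data into the data/ folder "
            "before running the notebooks."
        )
    return pd.read_csv(p, index_col=0)
```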

Handling data

We need a coherent and consistent way of handling the Kiva data. Otherwise, people will be using very different data and might "blow up" the git repo history.

Options:

  • Store no data in the repo. Each person calls API and creates their own personal/local copy.
  • There is a single immutable dataset in the repo. That everyone uses.
  • We store the data in another system, e.g. https://datproject.org/
  • Something else...

^ @jackalack

Switch from matplotlib to Seaborn

Currently, we are using matplotlib. I suggest we switch to Seaborn.

Here are the top reasons:

  1. The API is simpler. Seaborn requires less code and has more intuitive keyword arguments.
  2. Seaborn is designed for statistical plotting; matplotlib is for general-purpose plotting.
  3. Seaborn has more attractive plots. They are more modern in style.
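As a rough illustration of point 1, the same grouped scatter in both libraries (toy data; the Agg backend is used so the snippet runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({"x": [1, 2, 3, 4],
                   "y": [2, 4, 5, 8],
                   "group": ["a", "a", "b", "b"]})

# matplotlib: grouping and legend handled by hand
fig, ax = plt.subplots()
for name, g in df.groupby("group"):
    ax.scatter(g["x"], g["y"], label=name)
ax.legend()

# seaborn: the same plot in one call with keyword arguments
fig2, ax2 = plt.subplots()
sns.scatterplot(data=df, x="x", y="y", hue="group", ax=ax2)
```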

Thoughts?

R_F_Suggestions

Hello all,

I have some suggestions for the Random Forest notebook.

  1. Added information that in a random forest a heuristic is applied to select the best feature at each split.
  2. Added df.head(), which can be very helpful for students to see what kind of data they are dealing with.
  3. Using describe() is also very interesting, even more so for a random forest regressor, to get some statistical information about our data.
  4. I suggest adding information at the beginning of the notebook about which kinds of features we have and their units.
  5. The graphics could be improved, for example by adding a reference line.

I hope this can help :)
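A sketch of suggestions 2, 3, and 5 on synthetic data (column names and values are invented, not the actual Kiva data):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
df = pd.DataFrame({"loan_amount": rng.uniform(100, 1000, 200),
                   "term_months": rng.integers(6, 36, 200)})
df["funded_amount"] = 0.9 * df["loan_amount"] + rng.normal(0, 20, 200)

# Suggestions 2 and 3: look at the data before modeling
print(df.head())
print(df.describe())

X, y = df[["loan_amount", "term_months"]], df["funded_amount"]
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Suggestion 5: a predicted-vs-actual plot would gain from a y = x
# reference line, e.g. ax.axline((0, 0), slope=1) in matplotlib.
print(model.score(X, y))  # in-sample R^2
```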
