
deltaanalytics / machine_learning_for_good


Machine learning fundamentals lesson in interactive notebooks

Home Page: http://www.deltanalytics.org/

License: Creative Commons Attribution 4.0 International

Jupyter Notebook 99.96% Shell 0.01% Python 0.03%
data-analysis data-mining data-science data-visualisation non-profit python python3

machine_learning_for_good's People

Contributors

amankhullar, ayingsu, brianspiering, cloudchaoszero, dmarinere, emekaborisama, erourke23, hannahksong, hlina, jackalack, karlazz, kevinrpan, krisharma, lekeonilude, rosina9700, sarahooker, sydneymwong


machine_learning_for_good's Issues

Linear Regression Module Edits and Additions

Hello! This is an issue with information about my PR: #57

Here are the changes that I made:

  • Fixed package imports to work with the most recent version of the statsmodels package. The change is subtle: statsmodels now uses its .api subpackage when data and labels are passed directly, and .formula.api when a formula and a dataframe are the parameters. I also corrected the capitalization errors.
  • Added intuitive explanations to the regularization module covering what both types of regularization actually do, with a plain-English account of how they perform that regularization.
  • Added detail on why we hold out a test set rather than testing the model on the same data it was trained on. I also incorporated information from the theoretical slides for this module, briefly introducing the concept of overfitting and how holding out a test set lets us check for it.
  • Added labels to every plot that was missing them. I believe this practice is critical when creating notebooks that others will read, and the best way to build the habit is to model it in our own examples :)
  • Added explanations for the technical jargon in the notebooks. Since these notebooks apply the theoretical material in code, students should have that theory readily available while they are programming, so I explained the difference between univariate and multivariate models and described the statistical assumptions being made.
  • Made some minor spelling fixes.
  • Finally, in the first submodule I added a description of why we use regression rather than classification for this task, to ease students into the coding.

What I would also like to do in the future (building on this PR):

  • Add intuition for why standardization works
  • Add explanations for the statistical tests that are performed to validate our statistical assumptions when using linear regression
  • Standardize usage of Seaborn or Matplotlib
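On the first of these, the core intuition can be shown in a few lines (a sketch using a hand-rolled standardizer rather than any particular library):

```python
import numpy as np

# Standardization rescales each feature to mean 0 and standard deviation 1,
# so features measured in very different units contribute comparably to
# distance- and gradient-based methods.
def standardize(X):
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
Z = standardize(X)
print(Z.mean(axis=0))  # ~[0, 0]
print(Z.std(axis=0))   # [1, 1]
```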

I think the code for this module is truly fantastic, it is very comprehensive and visually appealing. I went in with the mindset of "learning linear regression for the first time" and the code flows extremely well and the visualizations allow for an intuitive understanding. As such, I primarily focused on adding aids and guiding students with intuitive descriptions of what we are doing with our program at each step.

NOTE: Please change the y-axis label in the title plot for submodules 3_1 and 3_2! It currently says "Input Feature", whereas that is really the value we are looking to predict. See below:

Please let me know if you have any questions or concerns about my changes.
(Screenshot attached: 2020-03-20 22-17-49)

Conditional Inference Trees

I enjoyed the code that you have here as well as the descriptions in the accompanying slideshow. I have two suggestions.

  1. Would you consider adding a section on conditional inference trees? These trees use probabilistic associations between the features and the outcome variable to make splits, rather than the information-gain (or impurity/error-reduction) criterion used here. Conditional inference trees reduce the variable-selection bias present in CART methods (the decision-tree framework used here), where variables offering more candidate splits are preferentially selected over variables offering fewer. They are also safer to interpret, since each split is based on a statistical association rather than an absolute difference in some information metric. My one worry is obfuscating your very clear presentation of the decision tree algorithm, but this might reduce confusion down the line when people over-interpret CART trees. Note that aside from the splitting criterion, conditional inference trees are identical to CART trees.
  2. In the Module 5 Decision Tree slideshow, on slide 53 (the slide right before "Model Performance"), you state that the RMSE algorithm proceeds by "Calculating the variance for each node" and "calculating the variance for each split as weighted average of each node variance." Unless I'm misunderstanding, I believe this is incorrect: you are calculating the root mean squared error (RMSE) for each node and each split, not the variance. The equations are similar, but in the variance you subtract the average value from each observation in the summation, whereas in the RMSE you subtract the predicted value from each observation.
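The distinction in point 2 is easy to verify numerically (toy numbers of my own choosing):

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])        # observed values in a node
y_pred = np.array([4.0, 4.0, 8.0, 8.0])   # model predictions for that node

# Variance: squared deviations from the *mean* of the observations
variance = np.mean((y - y.mean()) ** 2)

# RMSE: squared deviations from the *predicted* values, then a square root
rmse = np.sqrt(np.mean((y - y_pred) ** 2))

print(variance)  # 5.0
print(rmse)      # 1.0 -- clearly not the same quantity
```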

Thank you for your consideration.

Intro to numpy notebook

I think there should be a notebook that introduces students to NumPy. NumPy is a very important tool in ML, and in teaching NumPy the concepts of vectors and matrices can be visualized, which would be a great help to students when they move on to linear regression and other machine learning models.
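A taste of what such a notebook could open with (a sketch, not the proposed notebook itself):

```python
import numpy as np

# A vector is a 1-D array; a matrix is a 2-D array
v = np.array([1.0, 2.0, 3.0])
M = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0]])

# Matrix-vector product: the core operation behind linear-regression
# predictions (y_hat = X @ w)
print(M @ v)      # [1. 4.]
print(v.dot(v))   # squared length of v: 14.0
```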

Which English? American or British

I sweep something under the rug. You sweep something under the carpet.
Which is it (the eternal question)?

I have a preference for American English but do not care too much. Let's just be consistent.

I guess this applies to the slides also...

Switch from Anaconda to Miniconda

Anaconda is the current build environment. It is great, but it uses a lot of disk space (>1 GB).

We could use Miniconda, the lightweight version. It has a smaller footprint but requires a more explicit package list. It might work better on students' computers.

Thoughts?

Suggested Additions

1. In module_1_introduction_pandas >> 1_1 intro_to_python:

  • In the section on Strings, I propose that other string methods be taught (e.g. find(), startswith(), and join()).
  • In the section on Lists, I propose that other list methods be taught (e.g. count(), sort(), and copy()).
  • In the section on Dictionaries, I propose that other dictionary methods be taught (e.g. keys(), values(), and popitem()).
  • In the section on Sets, I propose that other set methods be taught (e.g. add(), difference(), and issubset()).
  • I propose that a section on tuples be added.
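A quick tour of the methods proposed above, plus a tuple (values are my own toy examples):

```python
# Strings
s = "machine learning"
print(s.find("learn"))        # index of the substring: 8
print(s.startswith("mach"))   # True
print("-".join(["a", "b"]))   # "a-b"

# Lists
nums = [3, 1, 3, 2]
print(nums.count(3))          # 2
nums.sort()
print(nums)                   # [1, 2, 3, 3]
backup = nums.copy()          # independent shallow copy

# Dictionaries
d = {"a": 1, "b": 2}
print(list(d.keys()))         # ['a', 'b']
print(list(d.values()))       # [1, 2]
print(d.popitem())            # removes and returns the last pair: ('b', 2)

# Sets
a = {1, 2, 3}
a.add(4)
print(a.difference({2, 4}))   # {1, 3}
print({1, 2}.issubset(a))     # True

# Tuples: immutable, fixed-length sequences
point = (3, 4)
x, y = point                  # tuple unpacking
```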

2. In module_1_introduction_pandas >> 1_3_intro_to_pandas:

  • I propose that the DataFrame.info() method is introduced.
  • Other pd.read_ methods, like pd.read_excel(), pd.read_json(), and pd.read_sql(), should be introduced in addition to pd.read_csv().
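For instance (column names invented for illustration; note that info() lives on the DataFrame, not on the pd module):

```python
import pandas as pd

df = pd.DataFrame({"amount": [100, 250, 75],
                   "sector": ["retail", "food", "arts"]})

# DataFrame.info() prints column dtypes, non-null counts, and memory usage
df.info()

# Other readers besides pd.read_csv(), each returning a DataFrame:
#   pd.read_excel("loans.xlsx")
#   pd.read_json("loans.json")
#   pd.read_sql(query, connection)
```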

3. In module_1_introduction_pandas >> 1_5_exploratory_data_analysis:

  • In cell 7, num_df has not yet been defined and should be removed as it causes an error.
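If keeping the cell is preferable to removing it, one hypothetical fix is to define num_df first as the numeric subset of the data (column names here are invented):

```python
import pandas as pd

df = pd.DataFrame({"loan_amount": [100.0, 250.0],
                   "sector": ["retail", "food"]})

# Define num_df before it is used: keep only the numeric columns
num_df = df.select_dtypes(include="number")
print(num_df.columns.tolist())  # ['loan_amount']
```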

the data folder is missing

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-2-61495ea1e105> in <module>()
----> 1 df = pd.read_csv("../data/loans_full.zip", index_col=0)
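Until the data folder is restored, a small wrapper could at least fail with a clearer hint (the path is the one from the traceback; the message wording is only a suggestion):

```python
from pathlib import Path
import pandas as pd

def load_loans(path="../data/loans_full.zip"):
    """Load the Kiva loans data, failing with a helpful message if absent."""
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(
            f"{p} is missing. Download the data into the data/ folder "
            "before running the notebooks."
        )
    return pd.read_csv(p, index_col=0)
```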

Handling data

We need a coherent and consistent way of handling the Kiva data. Otherwise, people will be using very different data and might "blow up" the git repo history.

Options:

  • Store no data in the repo. Each person calls API and creates their own personal/local copy.
  • There is a single immutable dataset in the repo. That everyone uses.
  • We store the data in another system, e.g. https://datproject.org/
  • Something else...

^ @jackalack

Switch from matplotlib to Seaborn

Currently, we are using matplotlib. I suggest we switch to Seaborn.

Here are the top reasons:

  1. The API is simpler. Seaborn requires less code and has more intuitive keyword arguments.
  2. Seaborn is designed for statistical plotting; matplotlib is for general-purpose plotting.
  3. Seaborn has more attractive plots. They are more modern in style.
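As a rough illustration of point 1, the same grouped scatter in both libraries (toy data; the Agg backend is used so the snippet runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({"x": [1, 2, 3, 4],
                   "y": [2, 4, 5, 8],
                   "group": ["a", "a", "b", "b"]})

# matplotlib: grouping and legend handled by hand
fig, ax = plt.subplots()
for name, g in df.groupby("group"):
    ax.scatter(g["x"], g["y"], label=name)
ax.legend()

# seaborn: the same plot in one call with keyword arguments
fig2, ax2 = plt.subplots()
sns.scatterplot(data=df, x="x", y="y", hue="group", ax=ax2)
```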

Thoughts?

R_F_Suggestions

Hello all,

I have some suggestions for the Random Forest notebook.

  1. Added information that in a random forest a heuristic is applied to select the best feature at each split.
  2. Added df.head(), which can be very helpful for students to see what kind of data they are dealing with.
  3. Using describe() is also very interesting, even more so for a random forest regressor, to get some statistical information about our data.
  4. I suggest adding information at the beginning of the notebook about which kinds of features we have and their units.
  5. The graphics could be improved, for example by adding a reference line.

I hope this can help :)
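A sketch of suggestions 2, 3, and 5 on synthetic data (column names and values are invented, not the actual Kiva data):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
df = pd.DataFrame({"loan_amount": rng.uniform(100, 1000, 200),
                   "term_months": rng.integers(6, 36, 200)})
df["funded_amount"] = 0.9 * df["loan_amount"] + rng.normal(0, 20, 200)

# Suggestions 2 and 3: look at the data before modeling
print(df.head())
print(df.describe())

X, y = df[["loan_amount", "term_months"]], df["funded_amount"]
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Suggestion 5: a predicted-vs-actual plot would gain from a y = x
# reference line, e.g. ax.axline((0, 0), slope=1) in matplotlib.
print(model.score(X, y))  # in-sample R^2
```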
