This project is part of the Udacity Data Scientist Nanodegree Program, which is composed of:
- Term 1
  - Supervised Learning
  - Deep Learning
  - Unsupervised Learning
- Term 2
  - Write A Data Science Blog Post
  - Disaster Response Pipelines
  - Recommendation Engines
The goal of this project is to apply supervised learning techniques to data collected for the U.S. census to help CharityML (a fictitious charity organization) identify the people most likely to donate to their cause. Udacity provided template code in the finding_donors.ipynb notebook and the visuals.py module, which contains out-of-the-box code needed for the visualizations.
This project uses Python 3.10.4 and the most important packages are:
To create the virtual environment, run: python -m venv .venv
More information is in requirements.txt. I am providing a simplified version of the file and letting pip handle the dependencies to avoid maintenance overhead.
To create a complete requirements file, run: pip freeze > requirements.txt
To install all the Python packages listed in it, run: pip install -r requirements.txt
To set up a new environment and install all requirements, go to the folder others and run setup.cmd
The modified census dataset consists of approximately 32,000 data points, with each datapoint having 13 features. This dataset is a modified version of the dataset published in the paper "Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid", by Ron Kohavi. You may find this paper online, with the original dataset hosted on UCI.
- census.csv: census dataset
  - age: Age
  - workclass: Working Class (Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked)
  - education_level: Level of Education (Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool)
  - education-num: Number of educational years completed
  - marital-status: Marital status (Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse)
  - occupation: Work Occupation (Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces)
  - relationship: Relationship Status (Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried)
  - race: Race (White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black)
  - sex: Sex (Female, Male)
  - capital-gain: Monetary Capital Gains
  - capital-loss: Monetary Capital Losses
  - hours-per-week: Average Hours Per Week Worked
  - native-country: Native Country (United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands)
  - income: Income Class (<=50K, >50K)
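As a rough illustration of how the columns listed above might be read and how the income class can be turned into a binary label, here is a minimal sketch using only the standard library. It assumes the CSV header uses exactly the column names listed in this README; the two rows are made up for illustration and are not taken from census.csv, and load_rows is a hypothetical helper, not a function from the notebook.

```python
import csv
import io
from collections import Counter

# Two made-up rows in the assumed census.csv layout, for illustration only.
SAMPLE_CSV = """age,workclass,education_level,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
52,Self-emp-inc,HS-grad,9,Married-civ-spouse,Exec-managerial,Wife,White,Female,15024,0,40,United-States,>50K
"""

# Columns treated as numeric; every other feature is categorical.
NUMERIC = ("age", "education-num", "capital-gain", "capital-loss", "hours-per-week")


def load_rows(handle):
    """Parse CSV rows, cast numeric columns, and encode income as 0/1."""
    rows = []
    for raw in csv.DictReader(handle):
        row = {key: value.strip() for key, value in raw.items()}
        for name in NUMERIC:
            row[name] = float(row[name])
        # 1 marks the ">50K" class, 0 marks "<=50K".
        row["label"] = 1 if row.pop("income") == ">50K" else 0
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
class_counts = Counter(row["label"] for row in rows)
print(class_counts)
```

To use it on the real file, pass an open file handle for census.csv instead of the in-memory sample; the class counts then show how imbalanced the two income classes are.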
The code is provided in the Jupyter Notebook finding_donors.ipynb. It reads the data from DATA_FOLDER and, step by step, explores the data, trains three models, and compares their performance.
From the project folder run pytest
To run a single test: pytest .\tests\test_configuration.py::test_create_folders
PEP8 is the style guide for Python code, and it's good practice to follow it to ensure your code is readable and consistent.
To check and format my code according to PEP8 I am using:
- pycodestyle: tool to check the code against PEP 8 conventions.
- autopep8: tool to automatically format Python code according to PEP 8 standards.
To run pycodestyle on all files in the project folder and create a report: pycodestyle --statistics --count . > code_styling/report.txt
To run autopep8 on all files in the project folder: autopep8 --recursive --in-place .
I prefer to check and update one file at a time because the recursive commands above also affect the files in .venv\. For example:
pycodestyle .\utils\configuration.py > .\code_styling\configuration_report.txt
autopep8 --in-place .\utils\configuration.py
You can also go to the folder code_styling and run format_and_lint.cmd
You can open finding_donors.ipynb, run each cell, and check the results.
You can also run the whole notebook with: ipython -c "%run finding_donors.ipynb"
To convert the notebook to HTML format, run: jupyter nbconvert finding_donors.ipynb --to html
I have compared the results of three models:
- Decision Trees
- Support Vector Machines
- AdaBoost
Because the dataset is fairly imbalanced, I have used the F-1 Score to evaluate the results. As explained wonderfully in this post, accuracy is not a suitable metric for imbalanced datasets.
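To make the point about accuracy concrete, here is a small, self-contained sketch that computes accuracy and the F-score by hand. The labels are hypothetical (20 positives out of 100, with 1 standing for ">50K") and are not results from this project; the helper names are illustrative only.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, _, _, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def f_beta(y_true, y_pred, beta=1.0):
    """F-beta score; beta=1.0 gives the F-1 Score."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical imbalanced labels: 20 positives, 80 negatives.
y_true = [1] * 20 + [0] * 80

# A "model" that always predicts the majority class.
always_negative = [0] * 100

# A model that finds 15 of the 20 positives and raises 10 false alarms.
better_model = [1] * 15 + [0] * 5 + [1] * 10 + [0] * 70

print(accuracy(y_true, always_negative), f_beta(y_true, always_negative))
print(accuracy(y_true, better_model), f_beta(y_true, better_model))
```

The majority-class predictor reaches 0.8 accuracy while its F-1 Score is 0.0, because it never identifies a single positive. The second predictor is only slightly more accurate (0.85), yet its F-1 Score is about 0.67, which reflects that it actually finds most of the positive class.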
AdaBoost has the best F-1 Score, and using GridSearchCV I have tuned the model to further improve its performance.
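A minimal sketch of this kind of tuning with scikit-learn follows. It runs under stated assumptions: the data is synthetic (generated with make_classification) rather than the census set, and the parameter grid is a hypothetical example, not the grid used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the census data (assumption: about 75% / 25%).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.75, 0.25], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hypothetical grid, kept small so the search runs quickly.
param_grid = {"n_estimators": [25, 50], "learning_rate": [0.5, 1.0]}

# Cross-validated search that selects the combination with the best F-1 Score.
grid = GridSearchCV(
    AdaBoostClassifier(random_state=42),
    param_grid,
    scoring="f1",
    cv=3,
)
grid.fit(X_train, y_train)

# Evaluate the refitted best estimator on the held-out test split.
best_model = grid.best_estimator_
test_f1 = f1_score(y_test, best_model.predict(X_test))
print(grid.best_params_, round(test_f1, 3))
```

Scoring the search with "f1" keeps the tuning objective aligned with the metric used to compare the models, and evaluating on a split that the search never saw gives a less optimistic estimate than the cross-validation score alone.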
Finally I have extracted the feature importance:
In the TODO file you can find the list of tasks and ongoing activities.
Thanks Udacity for the dataset.
I hope this repository was interesting, and thank you for taking the time to check it out. On my Medium you can find a more in-depth story, and on my Blogspot you can find the same post in Italian. Let me know if you have any questions, and if you like the content that I create, feel free to buy me a coffee.