The dat_homework from ashishterp

HW1 Feedback

Status: Pass (Homework is graded on "Pass" or "Needs Improvement")
Comments:
Great work Ashish! Your methods are efficient and easy to follow, with great supporting visuals which are clearly labeled and effective. I would encourage you to add some written analysis to each question to communicate your conclusions to your reader. Even when the numbers are "obvious", it can go a long way. Well done!

HW2 Feedback

Status: Pass (Homework is graded on "Pass" or "Needs Improvement")
Comments:
Great work Ashish, comments on each section below!

Describe the content of the dataset and its goals
Good! The data dictionary is clear, with effective visuals and observations. It would be good to also state a brief "goal", such as "using the data to predict diabetes", to frame the objective of the rest of your analysis.
Describe the features and formulate a hypothesis on which might be relevant in predicting diabetes
Nice, good idea to group by class and visualize the change in distributions.
Describe the missing/NULL values. Decide if you should impute or drop them and justify your choice.
Great, your process is well organized and easy to understand. Some people also tried replacing 0 values for Insulin. It's a bit unclear from the data whether 0 insulin is impossible. One idea would be to try with and without adjusted insulin.
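One way to set up the "with and without adjusted insulin" comparison is to keep the original column and build an adjusted copy next to it. The sketch below is only an illustration: the helper name replace_zeros_with_median and the sample values are hypothetical, and it assumes that 0 is being treated as a missing reading, which the data does not confirm.

```python
from statistics import median


def replace_zeros_with_median(values):
    """Return a copy of values with zeros replaced by the median of the non-zero entries."""
    nonzero = [v for v in values if v != 0]
    if not nonzero:
        # Nothing to impute from, so return the values unchanged.
        return list(values)
    fill = median(nonzero)
    return [fill if v == 0 else v for v in values]


# Hypothetical insulin readings, with 0 standing in for a possibly missing value.
insulin = [0, 94, 168, 0, 88, 543, 0]
adjusted = replace_zeros_with_median(insulin)

print(insulin)   # original column, left untouched
print(adjusted)  # adjusted column, to be used in a second model run
```

Fitting the same model once on each version of the column shows whether the adjustment changes the scores at all.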
Come up with a benchmark for the minimum performance that an algorithm should have on this dataset
35% would be a good benchmark for making sure you detect all cases of diabetes (which IS important here, so I don't disagree). However, if you make a dummy classifier which predicts NOT DIABETIC for everyone, you'd score 65%! As a rule of thumb, the "dumb" benchmark for classification accuracy would be the percentage of your largest group ("not diabetic" in this case).
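The rule of thumb can be checked with a few lines of plain Python. This is a minimal sketch: the function name majority_baseline_accuracy is hypothetical, and the labels are made up to match the rough 65/35 split mentioned above rather than taken from the actual dataset.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Hypothetical labels: 65 "not diabetic" (0) and 35 "diabetic" (1).
labels = [0] * 65 + [1] * 35
label, accuracy = majority_baseline_accuracy(labels)

print(label, accuracy)
```

Any real model should beat this accuracy before its other scores are worth comparing.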
What's the best performance you can get with kNN? Is kNN a good choice for this dataset?
Great use of classification_report! Depending on the dataset, some scores may be more relevant than others. Many people found better performance with n (the number of neighbors) higher than 3. In general, it's good to try gridsearch for each model on a new dataset.
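A grid search over the number of neighbors could look roughly like the sketch below. It uses scikit-learn's GridSearchCV on synthetic data from make_classification as a stand-in for the diabetes dataset; the candidate values, the scaling step, and the 5-fold setting are assumptions, not what was used in the homework.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 8 numeric features, roughly a 65/35 class split.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.65, 0.35], random_state=0
)

# Scale features first, since kNN relies on distances between points.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Candidate neighbor counts (assumed values, chosen only for illustration).
param_grid = {"kneighborsclassifier__n_neighbors": [3, 5, 7, 9, 11, 15, 21]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print(best_k, round(search.best_score_, 3))
```

Changing the scoring argument (for example to "recall") lets the search optimize for whichever score matters most for the problem.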
What's the best performance you can get with Naive Bayes? Is NB a good choice for this dataset?
Good use of gridsearch here! However, you should try Gaussian NB instead of Multinomial NB here, since you are dealing with continuous numerical inputs rather than counts or categorical inputs. You might see some improvement!
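Swapping in Gaussian NB is a small change. The following is a minimal sketch with scikit-learn's GaussianNB and cross_val_score, run on synthetic numeric data as a placeholder for the real features; the sample size and 5-fold setting are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data with continuous features (may include negatives,
# which MultinomialNB would reject but GaussianNB handles).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# GaussianNB models each feature as normally distributed within each class.
scores = cross_val_score(GaussianNB(), X, y, cv=5)

print(scores.mean())
```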
What's the best performance you can get with Logistic Regression? Is LR a good choice for this dataset?
You can try gridsearch here with both L1 and L2, as well as a range of C values; some people saw better outcomes with L1. Another good thing to try with LR is to check the coefficients, as they can indicate feature importance, and can help you down the line!
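A sketch of that search, again on synthetic stand-in data, might look like this. The liblinear solver is chosen here because it accepts both penalties; the C values and the scaling step are assumptions, and newer scikit-learn releases may warn about the penalty argument.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 8 numeric features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Scaling puts the coefficients on a comparable footing across features.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(solver="liblinear"))

param_grid = {
    "logisticregression__penalty": ["l1", "l2"],
    "logisticregression__C": [0.01, 0.1, 1, 10, 100],  # assumed range
}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

# Coefficients of the best model: larger magnitudes suggest more influential
# features, and L1 can shrink some of them to exactly zero.
best_model = search.best_estimator_.named_steps["logisticregression"]
coefficients = best_model.coef_[0]

print(search.best_params_)
print(coefficients)
```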
What's the best performance you can get with Random Forest? Is RF a good choice for this dataset?
Excellent, great to loop through the param values, as well as good use of RF to check feature importance.
If you could only choose one, which classifier from the above that you already ran is best? How do you define best? (hint: could be prediction accuracy, running time, interpretability, etc)
Clear reasoning for your selection, weighing performance and time. Nice work!
