dat_homework's Issues
HW1 Feedback
Status: Pass (Homework is graded on "Pass" or "Needs Improvement")
Comments:
Great work Ashish! Your methods are efficient and easy to follow, with great supporting visuals that are clearly labeled and effective. I would encourage you to add some written analysis to each question to communicate your conclusions to your reader. Even when the numbers are "obvious", it can go a long way. Well done!
HW2 Feedback
Status: Pass (Homework is graded on "Pass" or "Needs Improvement")
Comments:
Great work Ashish, comments on each section below!
Describe the content of the dataset and its goals
Good! The data dictionary is clear, with effective visuals and observations. It would also be good to state a brief "goal", such as "using the data to predict diabetes", to frame the objective of the rest of your analysis.
Describe the features and formulate a hypothesis on which might be relevant in predicting diabetes
Nice, good idea to group by class and visualize the change in distributions.
Describe the missing/NULL values. Decide if you should impute or drop them and justify your choice.
Great, your process is well organized and easy to understand. Some people also tried replacing 0 values for Insulin. It's a bit unclear from the data whether 0 insulin is impossible. One idea would be to try with and without adjusted insulin.
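One way to set up the "with and without adjusted insulin" comparison is sketched below. It is only an illustration: the readings are made-up values, `impute_zeros_with_median` is a hypothetical helper, and median imputation is just one of several reasonable choices.

```python
from statistics import median


def impute_zeros_with_median(values):
    """Return a copy where 0 entries are replaced by the median of the non-zero entries."""
    nonzero = [v for v in values if v != 0]
    if not nonzero:
        # Nothing to estimate the median from, so leave the data unchanged.
        return list(values)
    fill = median(nonzero)
    return [fill if v == 0 else v for v in values]


# Made-up insulin readings (not from the homework dataset); 0 may mean "not measured".
insulin_raw = [0, 94, 168, 0, 88, 543, 0, 846, 175, 0]
insulin_adjusted = impute_zeros_with_median(insulin_raw)

print(insulin_raw)
print(insulin_adjusted)
```

Training the same model once on the raw column and once on the adjusted column, then comparing cross-validated scores, would show whether the adjustment actually helps.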
Come up with a benchmark for the minimum performance that an algorithm should have on this dataset
35% would be a good benchmark for making sure you detect all cases of diabetes (which IS important here, so I don't disagree). However, if you make a dummy classifier that predicts NOT DIABETIC for everyone, you'd score 65% accuracy! As a rule of thumb, the "dumb" benchmark for classification is the percentage of samples in your largest class ("not diabetic" in this case).
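The majority-class benchmark can be computed in a few lines. This is a minimal sketch with illustrative labels that only mimic the 65/35 split mentioned above; `majority_class_baseline` is a hypothetical helper name.

```python
from collections import Counter


def majority_class_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Illustrative labels only: 0 = not diabetic, 1 = diabetic, in a 65/35 split.
labels = [0] * 65 + [1] * 35
label, accuracy = majority_class_baseline(labels)

print(label, accuracy)
```

scikit-learn's `DummyClassifier(strategy="most_frequent")` gives the same baseline inside a normal fit/predict workflow. Note that this baseline never predicts the diabetic class, so its recall on that class is 0, which is why accuracy alone is not the whole story here.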
What's the best performance you can get with kNN? Is kNN a good choice for this dataset?
Great use of classification_report! Depending on the dataset, some scores may be more relevant than others. Many people found better performance with n higher than 3. In general, it's good to try gridsearch for each model on a new dataset.
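A grid search over the number of neighbors might look like the sketch below. The data is synthetic (generated with `make_classification` as a stand-in for the homework dataset), and the scaling step and the candidate values are assumptions added for illustration, not part of the original submission.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 8 numeric features, roughly a 65/35 class split.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    weights=[0.65, 0.35], random_state=0,
)

# kNN is distance based, so features are scaled inside the pipeline
# (an assumption for this sketch) to avoid leaking test-fold statistics.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Candidate neighbor counts are illustrative.
param_grid = {"knn__n_neighbors": [3, 5, 7, 9, 11, 15, 21]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Changing `scoring` to `"recall"` or `"f1"` would tune for detecting the positive class rather than for overall accuracy.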
What's the best performance you can get with Naive Bayes? Is NB a good choice for this dataset?
Good use of gridsearch here! However, you should try Gaussian NB instead of Multinomial NB, since you are dealing with continuous numerical inputs rather than counts or categorical inputs. You might see some improvement!
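Swapping in Gaussian NB is a small change. The sketch below uses synthetic data from `make_classification` as a stand-in for the homework dataset, so the scores it prints say nothing about the real data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data with continuous features.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    weights=[0.65, 0.35], random_state=0,
)

# GaussianNB models each feature as normally distributed within each class,
# which suits continuous measurements better than a count-based model.
scores = cross_val_score(GaussianNB(), X, y, cv=5, scoring="accuracy")
mean_accuracy = scores.mean()

print(scores.round(3), round(mean_accuracy, 3))
```

If you still want to tune something, `GaussianNB` exposes a `var_smoothing` parameter that can go into a grid search.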
What's the best performance you can get with Logistic Regression? Is LR a good choice for this dataset?
You can try gridsearch here with both L1 and L2, as well as a range of C values; some people saw better outcomes with L1. Another good thing to try with LR is to check the coefficients, as they can indicate feature importance and can help you down the line!
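The search over penalty and C, followed by a look at the coefficients, could be sketched as follows. The data is synthetic, the `liblinear` solver is chosen here only because it supports both L1 and L2, and the C grid and feature scaling are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 8 numeric features, roughly a 65/35 class split.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    weights=[0.65, 0.35], random_state=0,
)

# Scaling puts the features on a common footing so coefficient sizes are comparable.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("lr", LogisticRegression(solver="liblinear")),
])

# Illustrative grid: both penalties and a range of regularization strengths.
param_grid = {
    "lr__penalty": ["l1", "l2"],
    "lr__C": [0.01, 0.1, 1, 10, 100],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

# Coefficients of the best model, one per feature.
coefficients = search.best_estimator_.named_steps["lr"].coef_[0]

print(search.best_params_, round(search.best_score_, 3))
print(coefficients.round(3))
```

With standardized inputs, larger absolute coefficients suggest more influential features, and an L1 penalty may push some coefficients to exactly zero.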
What's the best performance you can get with Random Forest? Is RF a good choice for this dataset?
Excellent, great to loop through the param values, as well as good use of RF to check feature importance.
If you could only choose one, which classifier from the above that you already ran is best? How do you define best? (hint: could be prediction accuracy, running time, interpretability, etc)
Clear reasoning for your selection, weighing performance and time. Nice work!