Kaggle Data Science & Machine Learning Survey Analysis

This is a capstone project for Introduction to Data Science (DS-GA 1001) at the NYU Center for Data Science.

Project Intro/Objective

The purpose of this project is to understand the state of Data Science and Machine Learning across industries and technologies.

Methods Used

Inferential Statistics
Data Visualization
Predictive Modeling
Clustering
Supervised Classification

Technologies

Python
pandas, Jupyter
scikit-learn

Project Description

We use responses to Kaggle's annual Data Science and Machine Learning survey from 2022 and previous years. Data Sources:

Inferential Statistics:

Hypothesis: Which Data Science jobs have the highest salaries?
Hypothesis: Is the share of advanced degrees (Master's and above) among data professionals increasing over the years?
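
One way to test the second question is a one-sided two-proportion z-test comparing the share of advanced-degree holders in an earlier and a later survey year. The sketch below is only an illustration under that assumption; the source does not say which test the project used, the function name is hypothetical, and the counts are placeholders rather than survey results.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """One-sided z-test of H1: proportion in group b > proportion in group a."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under H0 (both groups share one proportion)
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Upper-tail probability of the standard normal distribution
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Placeholder counts for illustration only, NOT taken from the survey:
# group a = earlier year, group b = later year.
z, p = two_proportion_z_test(successes_a=500, n_a=1000, successes_b=560, n_b=1000)
print(f"z = {z:.3f}, one-sided p = {p:.4f}")
```

A small p-value would support the claim that the share increased between the two years; with more than two years, a trend test across all years would be a natural extension.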

Salary prediction: Predict the data science salaries of individuals in the United States from covariates such as job title, industry, skill set, and experience. Results:

	Model	Train R²	Test R²	Adj Train R²	Adj Test R²
1	OLS	0.159	0.18	0.389	0.393
2	Ridge	0.151	0.186	0.388	0.393
3	Lasso	0.157	0.184	0.386	0.392
4	Linear SVR	0.092	0.113	0.306	0.298
5	Random Forest	0.863	0.165	0.674	0.399
6	XGBoost	0.433	0.178	0.516	0.404
7	Poisson GLM	0.182	0.223	0.396	0.401
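
For reference, the R² values above can be read as the coefficient of determination, 1 - SS_res / SS_tot. The following is a minimal sketch of that computation, assuming the standard definition (the same quantity scikit-learn's `r2_score` reports); the function name and the toy values are illustrative and not drawn from the project.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after using the model's predictions
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot

# Toy values for illustration only
y_true = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y_true, [1.0, 2.0, 3.0, 4.0]))  # perfect predictions
print(r_squared(y_true, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean
```

Computing this separately on the training and test splits gives the two kinds of columns in the table; a large gap between them, as in the Random Forest row, is the pattern usually associated with overfitting.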

Clustering: Group the survey respondents into clusters based on their skills and experience. Approach:

Traditional KMeans clustering using only numerical data.
KPrototypes clustering, which handles mixed numerical and categorical data as described by Huang (1998).
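
The difference between the two approaches comes down to the distance measure. K-prototypes combines squared Euclidean distance on numerical features with a weighted count of mismatches on categorical features. The sketch below illustrates that dissimilarity and the nearest-prototype assignment step only; it is not the project's code, and the function names, feature values, and `gamma` weight are hypothetical.

```python
def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma):
    """Squared Euclidean distance on numeric features plus
    gamma times the number of mismatched categorical features."""
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    categorical_part = sum(a != b for a, b in zip(x_cat, proto_cat))
    return numeric_part + gamma * categorical_part

def nearest_prototype(x_num, x_cat, prototypes, gamma):
    """Index of the prototype with the smallest mixed dissimilarity."""
    distances = [
        mixed_dissimilarity(x_num, x_cat, p_num, p_cat, gamma)
        for p_num, p_cat in prototypes
    ]
    return min(range(len(prototypes)), key=distances.__getitem__)

# Hypothetical respondent: (years coding, years of ML), (language, degree)
prototypes = [
    ([2.0, 1.0], ["Python", "Masters"]),
    ([15.0, 6.0], ["R", "Doctorate"]),
]
cluster = nearest_prototype([3.0, 2.0], ["Python", "Bachelors"], prototypes, gamma=0.5)
print(cluster)
```

The weight `gamma` controls how much a categorical mismatch counts relative to numerical distance, so numerical features are usually scaled before clustering.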

Classification: Identify the most suitable jobs for a user based on the user’s responses about their skills, exposure and experience. Results:

Model	Accuracy	F1-Score	Top 2 accuracy
Dummy Classifier	35%	0.10	65%
Decision Tree	41%	0.41	51.9%
KNN Classifier	42%	0.33	62.6%
Random Forest	48%	0.44	69.0%
SVM	47%	0.45	78.6%
Multinomial Logistic Regression	48%	0.45	71.8%
Multi-Layer Perceptron	54%	0.43	78.8%
XGBoost	54%	0.45	79.1%
LightGBM	55%	0.45	79.62%
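
Top 2 accuracy counts a prediction as correct when the true job title is among the two classes the model scores highest. Here is a minimal sketch of that metric, assuming per-class scores are available as a mapping from class label to probability; the function name, labels, and scores are made up for illustration (scikit-learn offers `top_k_accuracy_score` for the same purpose).

```python
def top_k_accuracy(y_true, class_scores, k=2):
    """Fraction of samples whose true label is among the k highest-scored classes."""
    hits = 0
    for label, scores in zip(y_true, class_scores):
        # Class labels sorted by descending score, truncated to the top k
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        hits += label in top_k
    return hits / len(y_true)

# Illustrative labels and scores only, not model output from the project
y_true = ["Data Scientist", "Data Analyst", "ML Engineer"]
class_scores = [
    {"Data Scientist": 0.5, "Data Analyst": 0.3, "ML Engineer": 0.2},  # top-1 hit
    {"Data Scientist": 0.6, "Data Analyst": 0.3, "ML Engineer": 0.1},  # top-2 hit
    {"Data Scientist": 0.5, "Data Analyst": 0.4, "ML Engineer": 0.1},  # miss
]
print(top_k_accuracy(y_true, class_scores, k=1))
print(top_k_accuracy(y_true, class_scores, k=2))
```

This metric suits a job recommendation setting, where suggesting two plausible roles is often as useful as a single exact match.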

Findings and analysis: Report

Team Members

Name	Github
Sargun Nagpal	sargun-nagpal
Harsha Koneru	harshakoneru
Sharad Dargan	sharaddargan
