by Amrita Bhatia (Roll no. 102017017)
Sampling is a statistical technique used to select a representative subset of a population. An ideal sample has all the characteristics of the population it is derived from. While an ideal sample is not feasible to derive, different sampling techniques can be used to obtain samples close to an ideal sample.
This project applies various sampling techniques to different machine learning models and evaluates their performance using the accuracy metric.
Sampling techniques used:
S. No. | Technique |
---|---|
S1 | Simple random sampling (w/0 replacement) |
S2 | Simple random sampling (with replacement) |
S3 | Systematic sampling |
S4 | Stratified sampling |
S5 | Cluster sampling |
Models used:
S. No. | Model |
---|---|
M1 | Logistic regression |
M2 | Naive Bayes classifier |
M3 | Support vector classifier |
M4 | Decision tree |
M5 | Random forest |
- The data was highly imbalanced with a 98.8% majority class. Therefore, SMOTE was used to balance the class distribution.
- The five sampling techniques were then applied to select samples from the data.
- These samples were then used to train the machine learning models. The models were tested on the remaining data not included in the sample.
The accuracies of the models for each sampling technique are as follows:
logreg | nb | svc | dt | rf | |
---|---|---|---|---|---|
S1 | 0.818182 | 0.792208 | 0.935065 | 0.948052 | 1.000000 |
S2 | 0.917866 | 0.897544 | 0.911092 | 0.970364 | 0.990686 |
S3 | 0.906557 | 0.796721 | 0.906557 | 0.947541 | 0.991803 |
S4 | 0.922162 | 0.752432 | 0.915676 | 0.966486 | 0.992432 |
S5 | 0.458849 | 0.541151 | 0.490168 | 0.446468 | 0.448653 |
- The highest accuracy (100%) is seen in the random forest model using simple random sampling without replacement.
- The worst performing model is the decision tree using the cluster sampling technique with an accuracy of 44.65%.