Polycystic ovary syndrome (PCOS) is a common condition that affects women and is characterized by at least two of the following features: irregular periods; excess male hormones, which may lead to excess facial and body hair growth; and polycystic ovaries (enlarged ovaries containing fluid-filled sacs called follicles).
The dataset used in this project contains numerous physical and clinical parameters used to assess PCOS and infertility-related issues. The data was collected from 10 different hospitals across Kerala, India, and is freely available at https://www.kaggle.com/datasets/prasoonkottarathil/polycystic-ovary-syndrome-pcos
I was diagnosed with PCOS in 2022. However, I had been experiencing symptoms for many years before that and had already visited other doctors. In India, there is a strong tendency to ignore PCOS in unmarried women, even among experienced gynecologists. The fact that it drastically reduces quality of life for many women is not taken into account, and doctors tend to do no more than advise exercise and healthy eating. Therefore, I became interested in exploring ways to use my data science skills to improve healthcare for women with PCOS.
The aim of this project is to use an appropriate classification model to diagnose PCOS. In situations like this, machine learning can help process large amounts of data to support an accurate diagnosis, and thus possibly help reduce healthcare costs.
This project uses several R packages to perform data analysis and modeling. Below is a brief description of each package and its purpose.
- readxl: The readxl package provides functions for reading data from Excel files into R.
- tidyverse: The tidyverse package is a collection of packages that provide tools for data manipulation, visualization, and modeling. It includes popular packages like dplyr, ggplot2, and tidyr.
- plyr: The plyr package provides functions for splitting, applying, and combining data in R.
- dplyr: The dplyr package provides functions for data manipulation, including filtering, sorting, grouping, and summarizing data.
- ggplot2: The ggplot2 package provides a powerful system for creating graphics in R, with an emphasis on creating aesthetically pleasing and informative visualizations.
- Hmisc: The Hmisc package provides functions for data analysis and modeling, including descriptive statistics, regression modeling, and survival analysis.
- stats: The stats package is a core R package that provides functions for statistical analysis and modeling, including hypothesis testing, regression modeling, and time series analysis.
- corrplot: The corrplot package provides functions for creating correlation matrix plots in R.
- psych: The psych package provides functions for psychometrics and personality research, including factor analysis and correlations.
- DescTools: The DescTools package provides functions for descriptive statistics and data visualization, including various summary statistics, contingency tables, and graphical displays.
- caret: The caret package provides functions for machine learning and predictive modeling, including feature selection, model training, and model evaluation.
- tree: The tree package provides functions for creating classification and regression trees in R.
- rpart: The rpart package provides functions for creating decision trees in R.
- rattle: The rattle package provides a graphical user interface (GUI) for data mining and machine learning tasks in R. It includes tools for data preprocessing, feature selection, and model evaluation.
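
The packages listed above could be checked and attached with something like the following. This is a minimal sketch, not necessarily the setup used in this project: the variable names (`pkgs`, `is_installed`, `missing_pkgs`) are illustrative, and the loading order is an assumption. plyr is placed before tidyverse because attaching plyr after dplyr masks dplyr functions such as `summarise()` and `mutate()`. dplyr and ggplot2 are not listed separately because `library(tidyverse)` attaches them, and stats is attached by default in every R session.

```r
# Packages from the list above. plyr comes before tidyverse (which attaches
# dplyr and ggplot2) so that dplyr's verbs are not masked by plyr's.
pkgs <- c("readxl", "plyr", "tidyverse", "Hmisc", "corrplot", "psych",
          "DescTools", "caret", "tree", "rpart", "rattle")

# Check which packages are installed, without attaching them yet
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Report anything missing instead of stopping with an error
missing_pkgs <- pkgs[!is_installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach the available packages in the order given above
for (p in pkgs[is_installed]) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}
```

Any package reported as missing can be installed with `install.packages()` before rerunning the block.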