This project aims to analyze synthetic healthcare data and build a machine learning model to predict test results based on various factors such as age, gender, and medical condition.
The dataset contains synthetic healthcare records with the following fields:
- Patient Name
- Age
- Gender
- Blood Type
- Medical Condition
- Date of Admission
- Doctor Name
- Hospital Name
- Insurance Provider
- Billing Amount
- Room Number
- Admission Type
- Discharge Date
- Medication
- Test Results
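The field list above can double as a schema check before any analysis starts. The following is a minimal sketch, not code from the notebook: the names `EXPECTED_COLUMNS` and `missing_columns` are hypothetical, and it assumes the CSV headers are spelled exactly as the fields are listed here.

```python
import pandas as pd

# Field names as listed above. That the CSV headers match this spelling
# exactly is an assumption; adjust the list to the actual file.
EXPECTED_COLUMNS = [
    "Patient Name", "Age", "Gender", "Blood Type", "Medical Condition",
    "Date of Admission", "Doctor Name", "Hospital Name", "Insurance Provider",
    "Billing Amount", "Room Number", "Admission Type", "Discharge Date",
    "Medication", "Test Results",
]


def missing_columns(df: pd.DataFrame) -> list:
    """Return the expected fields that are absent from df, in list order."""
    # Strip stray whitespace so " Age" and "Age" count as the same header.
    present = {str(col).strip() for col in df.columns}
    return [col for col in EXPECTED_COLUMNS if col not in present]
```

An empty result from `missing_columns(df)` means every listed field is present in the loaded DataFrame.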
- Data Cleaning: Standardize column names, correct data types, and handle missing values.
- Exploratory Data Analysis (EDA): Visualize the distribution of age, gender, medical conditions, admission types, and test results.
- Feature Engineering: Encode categorical variables, create new features such as the length of hospital stay, and drop unnecessary columns.
- Correlation Analysis: Calculate and visualize the correlation matrix to understand relationships between features.
- Model Building and Evaluation: Train a Random Forest classifier and evaluate its performance using accuracy, classification report, and confusion matrix. Visualize feature importance.
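The cleaning and feature engineering steps above could look roughly like the sketch below. It is an illustration under stated assumptions rather than the notebook's code: the function names `clean` and `engineer`, the snake_case column names, the choice to drop incomplete rows, the set of columns treated as unnecessary, and the use of integer category codes for encoding are all hypothetical.

```python
import pandas as pd


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize column names, fix data types, and drop incomplete rows."""
    out = df.copy()
    # "Date of Admission" -> "date_of_admission"
    out.columns = out.columns.str.strip().str.lower().str.replace(" ", "_")
    # Unparseable dates and ages become NaT/NaN so they can be handled below.
    for col in ("date_of_admission", "discharge_date"):
        out[col] = pd.to_datetime(out[col], errors="coerce")
    out["age"] = pd.to_numeric(out["age"], errors="coerce")
    # Assumption: drop rows with missing values; imputation is an alternative.
    return out.dropna().reset_index(drop=True)


def engineer(df: pd.DataFrame) -> pd.DataFrame:
    """Add length of stay, drop identifier columns, and encode categoricals."""
    out = df.copy()
    out["length_of_stay"] = (out["discharge_date"] - out["date_of_admission"]).dt.days
    # Assumption: identifiers and raw dates are the "unnecessary" columns.
    out = out.drop(
        columns=["patient_name", "doctor_name", "hospital_name",
                 "room_number", "date_of_admission", "discharge_date"],
        errors="ignore",
    )
    # Encode every remaining non-numeric column as integer category codes.
    for col in out.select_dtypes(exclude="number").columns:
        out[col] = out[col].astype("category").cat.codes
    return out
```

Because `engineer` returns only numeric columns, calling `.corr()` on its result gives a correlation matrix of the kind the correlation analysis step describes.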
- Python 3.x
- Required libraries: pandas, matplotlib, seaborn, scikit-learn
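A quick way to confirm the libraries listed above are available is to look them up with the standard library before opening the notebook. This is a small sketch; the names `REQUIRED` and `missing_packages` are hypothetical. Note that scikit-learn is installed under that name but imported as `sklearn`.

```python
from importlib.util import find_spec

# pip package name -> import name. Only scikit-learn differs.
REQUIRED = {
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "scikit-learn": "sklearn",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be located."""
    return [
        pip_name
        for pip_name, module in required.items()
        if find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```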
- Clone the repository.
- Install the required libraries using `pip install -r requirements.txt`.
- Run the `healthcare_analysis.ipynb` notebook to see the analysis and model building steps.
The analysis provides insights into the data distribution and the relationships between features. The Random Forest classifier predicts test results; its accuracy, classification report, and confusion matrix are computed in the notebook, and the feature importance plot shows which features contribute most to the predictions.
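The modeling and evaluation step can be sketched as follows. This is an illustrative outline, not the notebook's exact code: the function name `train_and_evaluate`, the target column name `test_results`, the 80/20 stratified split, and the hyperparameters are assumptions, and the input is expected to be an all-numeric feature table.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split


def train_and_evaluate(features: pd.DataFrame, target: str = "test_results", seed: int = 42):
    """Fit a Random Forest and return the model, metrics, and feature importances."""
    X = features.drop(columns=[target])
    y = features[target]
    # Hold out 20% of rows, keeping class proportions equal in both splits.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    report = {
        "accuracy": accuracy_score(y_test, y_pred),
        "classification_report": classification_report(y_test, y_pred, zero_division=0),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }
    # Impurity-based importances, sorted so the most influential feature is first.
    importances = pd.Series(model.feature_importances_, index=X.columns)
    return model, report, importances.sort_values(ascending=False)
```

With matplotlib installed, `importances.plot.barh()` is one way to draw the feature importance plot from the returned Series.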
This project demonstrates the process of cleaning and analyzing healthcare data, performing exploratory data analysis, and building a predictive model using machine learning techniques.
This analysis is still being updated regularly with new insights.