ML pipeline (Python):
- Dataset organization. Age, sex, and the genes common to all cohorts will be used for data analysis.
- Train/test splitting in an 80%/20% ratio, with shuffling and stratification by outcome.
- As preprocessing, robust scaling will be considered to reduce the potential negative effects of outliers. The scaler will be fitted on the training set only, after the train/test split, to avoid data leakage.
- To reduce the computational cost, the number of candidate features will be reduced after a correlation analysis (Spearman rank correlation), with a threshold of 0.8 for feature elimination.
- Removal of low-variance features will also be considered as a feature selection step.
- For outcome prediction, we plan to apply Logistic Regression, Random Forest, and Support Vector Machine from the sklearn library, together with LightGBM and XGBoost, which are separate libraries with sklearn-compatible interfaces. For each model, two variants will be considered: the model with default hyperparameters, and the model after hyperparameter optimization by grid search (cv = 5, with recall or accuracy as the optimization metric).
- The better model from each pair will serve as a base learner in a Stacking algorithm, which will itself be considered in default-hyperparameter and optimized-hyperparameter variants.
- Model performance will be estimated using accuracy, balanced accuracy, F1-score, recall, precision, the ROC curve, and the PR curve.
- The best of the 12 proposed models will be used for feature importance analysis with permutation, drop-column, and SHAP techniques to find the top 20 features.
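
The splitting, scaling, and feature-filtering steps could be sketched roughly as follows. This is a minimal illustration, not the final implementation: the data are synthetic, the column names (`gene_0`, `gene_dup`, ...) are hypothetical, the correlation filter uses the absolute Spearman coefficient, and the variance threshold of 0.0 is an assumed placeholder.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the cohort data (hypothetical, not the study data).
X_arr, y = make_classification(n_samples=200, n_features=30,
                               n_informative=5, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"gene_{i}" for i in range(X_arr.shape[1])])
# Add a near-duplicate feature so the correlation filter has something to remove.
rng = np.random.default_rng(0)
X["gene_dup"] = 2.0 * X["gene_0"] + 1e-3 * rng.normal(size=len(X))

# 80%/20% split with shuffling and stratification by outcome.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=0)

# Robust scaling: statistics are learned from the training set only.
scaler = RobustScaler().fit(X_train)
X_train_s = pd.DataFrame(scaler.transform(X_train),
                         columns=X_train.columns, index=X_train.index)
X_test_s = pd.DataFrame(scaler.transform(X_test),
                        columns=X_test.columns, index=X_test.index)

# Spearman correlation filter (assumed: absolute value, threshold 0.8).
# For each highly correlated pair, the later column is dropped.
corr = X_train_s.corr(method="spearman").abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]

# Low-variance filter, fitted on the training set only.
remaining = X_train_s.columns.drop(to_drop)
vt = VarianceThreshold(threshold=0.0).fit(X_train_s[remaining])
kept = remaining[vt.get_support()]

X_train_sel = X_train_s[kept]
X_test_sel = X_test_s[kept]
```

Every selection decision here is derived from the training set and then applied unchanged to the test set, which is the same leakage-avoidance principle stated for the scaler.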
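
The default-versus-tuned comparison and the stacking step might look like the sketch below. Several choices are assumptions made for illustration: only the three sklearn classifiers are shown (LightGBM and XGBoost would be added to the same dictionary), the parameter grids are hypothetical, "better" is decided by mean 5-fold cross-validated recall on the training set, and the stacking meta-learner is a default Logistic Regression.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (GridSearchCV, cross_val_score,
                                     train_test_split)
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed data (hypothetical).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Model with default hyperparameters plus a small, hypothetical grid for each.
candidates = {
    "logreg": (LogisticRegression(), {"C": [0.1, 1.0, 10.0]}),
    "rf": (RandomForestClassifier(random_state=0),
           {"n_estimators": [50, 100], "max_depth": [None, 5]}),
    "svm": (SVC(), {"C": [0.1, 1.0, 10.0]}),
}

best_of_pair = []  # (name, estimator) for the better model of each pair
for name, (model, grid) in candidates.items():
    # Score of the default model under 5-fold cross-validation.
    default_score = cross_val_score(
        model, X_train, y_train, cv=5, scoring="recall").mean()
    # Grid search with the same folds and the same metric.
    search = GridSearchCV(model, grid, cv=5, scoring="recall")
    search.fit(X_train, y_train)
    if search.best_score_ >= default_score:
        best_of_pair.append((name, search.best_estimator_))
    else:
        best_of_pair.append((name, clone(model)))

# Stacking over the selected base learners (default-hyperparameter variant).
stack = StackingClassifier(estimators=best_of_pair,
                           final_estimator=LogisticRegression(), cv=5)
stack.fit(X_train, y_train)
test_accuracy = stack.score(X_test, y_test)
```

The optimized stacking variant could be obtained by wrapping `stack` in another `GridSearchCV`, for example over `final_estimator__C`, at the price of nested cross-validation and a longer run time.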
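
For the evaluation and feature importance steps, a rough sketch under stated assumptions is given below. A Random Forest stands in for whichever model turns out to be best, the data are synthetic, and balanced accuracy is an assumed scoring choice for the importance calculations. SHAP is left out because it needs a separate package; only the permutation and drop-column techniques are shown.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, precision_recall_curve,
                             precision_score, recall_score, roc_curve)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed data (hypothetical).
X, y = make_classification(n_samples=300, n_features=30,
                           n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Placeholder for the best of the proposed models.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]

# Threshold-based metrics on the held-out test set.
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "balanced_accuracy": balanced_accuracy_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
}
# Points of the ROC and PR curves (ready for plotting).
fpr, tpr, _ = roc_curve(y_test, y_score)
pr_precision, pr_recall, _ = precision_recall_curve(y_test, y_score)

# Permutation importance: score drop when one feature is shuffled.
perm = permutation_importance(model, X_test, y_test, n_repeats=10,
                              scoring="balanced_accuracy", random_state=0)
top20_perm = np.argsort(perm.importances_mean)[::-1][:20]

# Drop-column importance: score drop when the model is refitted
# without one feature.
baseline = metrics["balanced_accuracy"]
drop_importance = np.empty(X_train.shape[1])
for j in range(X_train.shape[1]):
    reduced = clone(model).fit(np.delete(X_train, j, axis=1), y_train)
    reduced_pred = reduced.predict(np.delete(X_test, j, axis=1))
    drop_importance[j] = baseline - balanced_accuracy_score(y_test,
                                                            reduced_pred)
top20_drop = np.argsort(drop_importance)[::-1][:20]
```

The two rankings need not agree, especially for correlated features, so comparing `top20_perm` with `top20_drop` (and later with a SHAP ranking) gives a rough check on how stable the top 20 list is.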