Prerequisites
Description
Hello sir, Arun here. I have worked as an ML research intern for AIIMS on behalf of my college and have worked in health research domains, which is why I am raising this issue. It proposes the following improvements to make the model more reliable and efficient.
Handle Random Sampling:
Randomly selecting 80% of the training data may shift the class distribution, especially for rare classes, which can introduce bias and affect model performance. Instead, consider stratified sampling, which maintains the class distribution while selecting a subset of the given data.
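A minimal sketch of stratified subsampling, assuming scikit-learn is available. The toy DataFrame and the column names `feature` and `label` are hypothetical stand-ins for the project's actual data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 'label' stands in for the real target column,
# with a 90/10 class imbalance.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "label": np.array([0] * 900 + [1] * 100),
})

# Keep 80% of the rows while preserving the label proportions.
subset, _ = train_test_split(
    df, train_size=0.8, stratify=df["label"], random_state=42
)

full_ratio = df["label"].mean()      # share of class 1 in the full data
subset_ratio = subset["label"].mean()  # share of class 1 in the subset
print(full_ratio, subset_ratio)
```

Comparing the two ratios is a quick check that the subset still reflects the original class balance.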
Optimize Memory Usage:
Use pandas' astype method to downcast numeric columns (for example, int64 to int32 or float64 to float32 where the values allow it) and consider using sparse data structures where applicable to further reduce memory consumption.
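One possible way to downcast, sketched with `pd.to_numeric(..., downcast=...)`, which picks the smallest fitting dtype instead of hand-choosing one for `astype`. The helper name `downcast_numeric` and the example columns are hypothetical.

```python
import numpy as np
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with integer and float columns downcast where possible."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        # float64 -> float32 loses precision; only use it if that is acceptable.
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Hypothetical example frame.
df = pd.DataFrame({
    "count": np.arange(100, dtype=np.int64),
    "score": np.linspace(0.0, 1.0, 100),
})
small = downcast_numeric(df)

before = df.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(small.dtypes, before, after)
```

Downcasting floats trades precision for memory, so it is worth checking that model metrics do not change afterwards.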
HTML Decoding Function:
Evaluate the performance impact of the decode_html function that decodes HTML-encoded characters. Optimize the function, if necessary, to improve efficiency and minimize processing time.
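The current body of decode_html is not shown here, so the following is only a sketch of how its cost could be measured and reduced. It assumes the function is roughly equivalent to the standard library's `html.unescape`; `decode_cached` and the sample strings are hypothetical.

```python
import html
import timeit
from functools import lru_cache

@lru_cache(maxsize=None)
def decode_cached(text: str) -> str:
    # Repeated strings are decoded once and then served from the cache.
    return html.unescape(text)

# Hypothetical sample with many duplicate values.
reviews = ["It&#039;s great &amp; cheap", "no entities here"] * 5000

decoded = [decode_cached(r) for r in reviews]

# Time the uncached and cached paths on the same input.
plain_time = timeit.timeit(lambda: [html.unescape(r) for r in reviews], number=5)
cached_time = timeit.timeit(lambda: [decode_cached(r) for r in reviews], number=5)
print(f"plain: {plain_time:.4f}s, cached: {cached_time:.4f}s")
```

Caching only helps when values repeat; for mostly unique text, profiling the existing function first is the more useful step.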
Missing Value Handling:
Assess the impact of using SimpleImputer for handling missing values and explore alternative strategies such as data imputation based on domain knowledge or techniques like K-nearest neighbors (KNN) or IterativeImputer.
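A small sketch comparing the three imputers side by side, assuming scikit-learn. The array `X` is made-up data; in practice the comparison should be judged by downstream validation scores rather than by the filled values themselves.

```python
import numpy as np
# IterativeImputer is still experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Made-up data with two missing entries.
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [np.nan, 8.0],
    [5.0, 10.0],
])

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=2),
    "iterative": IterativeImputer(random_state=0),
}

# Fit each imputer and fill the missing entries.
filled = {name: imp.fit_transform(X) for name, imp in imputers.items()}
for name, values in filled.items():
    print(name, values[1, 1], values[3, 0])
```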
Feature Encoding:
Check the effectiveness of alternative feature encoding methods such as one-hot encoding, target encoding, or entity embeddings to capture complex relationships and improve model performance.
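As an illustration, one-hot and a naive mean target encoding can be sketched with pandas alone. The columns `group` and `target` are hypothetical, and this target encoding is computed on the full frame for brevity; in a real pipeline it should be fitted on the training fold only to avoid target leakage.

```python
import pandas as pd

# Hypothetical categorical feature and numeric target.
df = pd.DataFrame({
    "group": ["a", "b", "a", "c", "b", "a"],
    "target": [8, 3, 9, 6, 5, 7],
})

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["group"], prefix="group")

# Naive target encoding: replace each category by its mean target value.
category_means = df.groupby("group")["target"].mean()
df["group_te"] = df["group"].map(category_means)

print(one_hot.columns.tolist())
print(df)
```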
Regularization:
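Consider adding L1 (Lasso) or L2 (Ridge) regularization to the regression and classification models to handle potential overfitting and improve generalization. Assuming these are the scikit-learn estimators, LinearRegression itself has no regularization option, so this would mean switching to Lasso, Ridge, or ElasticNet, while LogisticRegression already applies an L2 penalty by default and its strength can be tuned through the C parameter.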
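A rough sketch of what this could look like with scikit-learn, on synthetic data where only the first feature matters. The `alpha` and `C` values are arbitrary placeholders and would need tuning, for example with cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, LogisticRegression, Ridge

# Synthetic data: 10 features, only the first one drives the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X, y)   # no regularization
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty shrinks coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty can zero out coefficients

# LogisticRegression uses an L2 penalty by default; smaller C means stronger
# regularization.
clf = LogisticRegression(C=0.1).fit(X, (y > 0).astype(int))

print("OLS norm:  ", np.linalg.norm(ols.coef_))
print("Ridge norm:", np.linalg.norm(ridge.coef_))
print("Lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))
```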
Screenshots
No response
Code of Conduct