This project was my solution for the SPIT hackathon, which I attended during my final year of engineering.
It detects malware using machine learning on features extracted from Windows PE files.
The problem statement is uploaded.
The focus has to be on data pre-processing and normalization, especially text processing in the columns “ImportedDlls” and “ImportedSymbols”. My approach is to extract the filenames and words by splitting on the appropriate delimiter, then apply a one-hot encoder that turns each filename into its own column (1414 columns), then apply dimensionality reduction techniques to reduce the number of columns, and finally train the classifier on the result.
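The tokenize-then-encode step described above can be sketched roughly as follows. This is only a minimal illustration, not the code from this project: the sample rows, the delimiter set (commas, semicolons, whitespace), and the helper names `tokenize` and `multi_hot` are all assumptions, and the real ImportedDlls / ImportedSymbols cells may be formatted differently.

```python
import re

# Hypothetical sample cells; the real column format and delimiters may differ.
rows = [
    "KERNEL32.dll,USER32.dll,ADVAPI32.dll",
    "kernel32.dll;WS2_32.dll",
    "",
]

def tokenize(cell):
    """Split a cell on commas, semicolons, or whitespace and lowercase each name."""
    return [tok.lower() for tok in re.split(r"[,;\s]+", cell.strip()) if tok]

def multi_hot(cells):
    """Return (sorted vocabulary, binary matrix) with one column per distinct token."""
    token_lists = [tokenize(c) for c in cells]
    vocab = sorted({t for toks in token_lists for t in toks})
    index = {t: i for i, t in enumerate(vocab)}
    matrix = []
    for toks in token_lists:
        row = [0] * len(vocab)
        for t in toks:
            row[index[t]] = 1  # mark the token as present in this sample
        matrix.append(row)
    return vocab, matrix

vocab, matrix = multi_hot(rows)
print(vocab)
print(matrix)
```

Lowercasing merges `KERNEL32.dll` and `kernel32.dll` into one column, which may or may not be desirable for this dataset. The resulting wide binary matrix is what a dimensionality reduction step (for example scikit-learn's `TruncatedSVD`, named here only as one possible option) would then compress.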
I am still learning and may be wrong. Any other suggestions or approaches are welcome!
No processing was done on the numeric columns. Any suggestions? Should they be normalized?
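One possible answer to the normalization question is standard (z-score) scaling. The sketch below uses scikit-learn's `StandardScaler` on made-up numbers; the two feature columns are hypothetical and not taken from the dataset. Note that tree-based models such as decision trees and random forests split on thresholds and are largely insensitive to feature scaling, so this would mostly matter if a scale-sensitive model or a reduction step like PCA is added.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric PE features (e.g. a section count and a size in bytes).
X_num = np.array([
    [3.0, 1024.0],
    [5.0, 4096.0],
    [7.0, 16384.0],
])

# Rescale each column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_num)
print(X_scaled)
```

In a real pipeline the scaler should be fitted on the training split only and then applied to the test split with `scaler.transform`, to avoid leaking test statistics into training.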
Feature selection was performed using SelectKBest.
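For reference, a minimal sketch of how `SelectKBest` is typically used. The data here is synthetic, and both the `chi2` score function and `k=5` are assumptions chosen for illustration; the scoring function and `k` actually used in this project are not stated here.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)

# Synthetic stand-in for one-hot DLL columns: 100 samples, 20 binary features.
X = rng.integers(0, 2, size=(100, 20))
y = rng.integers(0, 2, size=100)
X[:, 0] = y  # make feature 0 perfectly informative for the demo

# Keep the 5 features with the highest chi-squared score against the label.
selector = SelectKBest(score_func=chi2, k=5)
X_sel = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)  # indices of the retained columns
print(X_sel.shape, kept)
```

`chi2` requires non-negative features, which fits one-hot columns; for the raw numeric columns a score function such as `f_classif` or `mutual_info_classif` may be a better fit.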
Classifiers used: Decision Tree, Random Forest, XGBoost, Gradient Boosting, etc.
Much more data processing can be done to improve the accuracy of the model, and many more techniques can be applied. I am still working on it. Suggestions welcome!
The recommendation engine mentioned in the problem statement is still pending. I will try using confidence scores for classification.
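A rough sketch of what confidence scores could look like, assuming a scikit-learn classifier that exposes `predict_proba`. The data and labels are synthetic, the random forest settings are arbitrary, and how these scores would feed the recommendation engine is left open.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the processed feature matrix; label 1 plays "malware".
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)           # shape: (n_samples, n_classes)
confidence = proba.max(axis=1)              # probability of the predicted class
predicted = clf.classes_[proba.argmax(axis=1)]

for label, conf in list(zip(predicted, confidence))[:5]:
    print(f"predicted={label} confidence={conf:.2f}")
```

For a random forest these probabilities are averaged over the trees and are not necessarily well calibrated, so low-confidence predictions are better treated as "needs a closer look" than as exact probabilities.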
I am currently working on EDA and text processing for the two text columns (ImportedDlls and ImportedSymbols).