- Pull the dataset of benign APK files and store them in <this repo>/benign_apk/
http://205.174.165.80/CICDataset/CICMalAnal2017/Dataset/APKs/Benign-APKs-2017.zip
- Pull the datasets of malicious APK files and store them in <this repo>/malicious_apk/
http://205.174.165.80/CICDataset/MalDroid-2020/Dataset/APKs/ (exclude Benign.tar.gz)
- Rename the files so that each one has the .apk extension.
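If you would rather script the rename than do it by hand, something like the following may help. It is only a sketch: the function name add_apk_extension and the list of suffixes to skip are assumptions made for illustration, not part of this repo, so check them against what the downloaded archives actually contain.

```python
from pathlib import Path

# Hypothetical skip list: archives and text files are left alone.
SKIP_SUFFIXES = {".apk", ".zip", ".gz", ".txt"}


def add_apk_extension(folder):
    """Append ".apk" to every regular file in `folder` that lacks it.

    Returns the new file names, sorted, so the result can be reviewed.
    """
    renamed = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or path.suffix.lower() in SKIP_SUFFIXES:
            continue
        target = path.with_name(path.name + ".apk")
        if target.exists():
            continue  # never overwrite an existing file
        path.rename(target)
        renamed.append(target.name)
    return renamed
```

For example, add_apk_extension("benign_apk") followed by add_apk_extension("malicious_apk") would cover both dataset folders.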
Run extract_apks_parallel.sh
This unpacks the .apk files into folders and processes some of the data therein.
You can monitor its progress from another shell by running:
watch "wc -l benign_apk/valid_apks.txt; wc -l malicious_apk/valid_apks.txt"
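If watch is not available on your system, the same counts can be read from Python. This is a minimal sketch that assumes, as the wc -l command above suggests, that each valid_apks.txt holds one line per processed APK. The helper names count_valid and progress are made up for this example.

```python
from pathlib import Path


def count_valid(listing):
    """Return the number of lines in `listing`, or 0 if it does not exist yet."""
    path = Path(listing)
    if not path.exists():
        return 0
    with path.open() as handle:
        return sum(1 for _ in handle)


def progress(folders=("benign_apk", "malicious_apk")):
    """Map each dataset folder to the line count of its valid_apks.txt."""
    return {folder: count_valid(Path(folder) / "valid_apks.txt") for folder in folders}
```

Calling progress() from the repo root returns a dict with one count per folder. Call it again later to see how far the extraction has advanced.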
Run one of the following scripts to generate feature vectors:
- parse_xml.py for permissions. "app_permission_vectors.json" is generated.
- parse_maline_output.py for syscalls. "app_syscall_vectors.json" is generated. You will have to run maline first for this to work.
- parse_disassembled.py for API calls. "app_method_vectors.json" is generated.
- parse_ssdeep.py for fuzzy hashes. "app_hash_vectors.json" is generated. You will have to run ssdeep first for this to work.
- combine_features.py for a combination of the top weighted features. "app_feature_vectors.json" is generated. This only works if you've previously trained a network on the specified features, and the feature weights files are named appropriately.
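Before training, it can be worth checking that the generated file is non-empty. The layout of these JSON files is not documented here, so the sketch below deliberately assumes nothing beyond "it is valid JSON" and only reports the top-level type, the number of entries, and a few keys. The function name summarize_json is made up for this example.

```python
import json


def summarize_json(path):
    """Describe the top level of a JSON file without assuming its schema."""
    with open(path) as handle:
        data = json.load(handle)
    summary = {"type": type(data).__name__}
    if isinstance(data, (dict, list)):
        summary["entries"] = len(data)
    if isinstance(data, dict):
        # JSON object keys are always strings, so sorting them is safe.
        summary["sample_keys"] = sorted(data)[:3]
    return summary
```

For example, summarize_json("app_permission_vectors.json") gives a quick sanity check after running parse_xml.py.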
Run run_trials.sh app_feature_vectors.json (or whichever JSON file you want). This runs the tensorflow_learn.py script (where the ML happens) a number of times and puts the results into a folder. It also runs plot_data.py and match_features.py to create a plot and a list of top weighted features, respectively.
Change the parameters or input data and repeat the run_trials.sh step. Each run should be non-destructive, so you can compare the results of different runs.
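To compare several feature sets in one go, the trial runs can be queued from Python. This is a rough sketch, not part of the repo: it assumes run_trials.sh is executable from the current directory and takes the JSON file as its only argument, as in the command above. The helper names are hypothetical, and dry_run defaults to True so that nothing is executed until you opt in.

```python
import subprocess

# Output files named in the feature-extraction step above.
FEATURE_FILES = [
    "app_permission_vectors.json",
    "app_syscall_vectors.json",
    "app_method_vectors.json",
    "app_hash_vectors.json",
    "app_feature_vectors.json",
]


def build_commands(feature_files, script="./run_trials.sh"):
    """Return one argument list per feature file."""
    return [[script, name] for name in feature_files]


def run_all(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it in order."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True stops the batch at the first failing run.
            subprocess.run(command, check=True)
```

Calling run_all(build_commands(FEATURE_FILES)) only prints the commands. Pass dry_run=False once the list looks right.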
Note: If you want to use an SVM instead of a neural network, use sklearn_svm.py in place of tensorflow_learn.py. You can also use sklearn_tree.py to use a decision tree.