spacecase123 / android-permission-extraction-and-dataset-creation-with-python

This project forked from saket-upadhyay/android-permission-extraction-and-dataset-creation-with-python

One script to create a permission-based dataset of android applications for your next ML Malware Detection gizmo.

License: MIT License

Shell 47.77% Python 52.23%

Update Note: This script works best on Linux. The binaries used for internal processing, such as JADX, are ELF executables, and the OS-specific commands are for *nix systems. It should also be easy to run on Windows, though you might need to tweak about 10-15% of the code. If you get it working on Windows, I would really appreciate it if you shared it with everyone here; feel free to create a PR ^_^. I plan to do this myself, since many people work in Windows environments and are running into problems, but only when I get time or a vacation from my current academics. :-)

Android Permission Extraction and Dataset Creation with Python

About:

This script extracts permission information from the malware and benign applications in their respective folders, then creates a single comma-separated values (.csv) file that stores everything in one place, ready to be fed into ML algorithms.
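To illustrate what "extracting permission information" means, here is a minimal sketch. It is not the code used by ExtractorAIO.py. It assumes the app's AndroidManifest.xml has already been decoded to plain-text XML (for example by a tool such as JADX); the function name `permissions_from_manifest` and the sample manifest are hypothetical.

```python
import xml.etree.ElementTree as ET

# XML namespace Android uses for attributes such as android:name.
ANDROID_NS = "http://schemas.android.com/apk/res/android"


def permissions_from_manifest(manifest_xml):
    """Return the set of permission names requested via <uses-permission> tags."""
    root = ET.fromstring(manifest_xml)
    names = set()
    for tag in root.iter("uses-permission"):
        name = tag.get("{%s}name" % ANDROID_NS)
        if name:
            names.add(name)
    return names


# Hypothetical decoded manifest, for illustration only.
SAMPLE_MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.demo">
    <uses-permission android:name="android.permission.CALL_PHONE"/>
    <uses-permission android:name="android.permission.READ_PHONE_STATE"/>
    <application android:label="Demo"/>
</manifest>"""

print(sorted(permissions_from_manifest(SAMPLE_MANIFEST)))
```

Collecting the names in a set means a permission declared twice is counted once, which suits a 0/1 feature encoding.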

How to use?

Copy the malware and benign applications you want to train your ML model on into their respective folders, then run the script with the following command in a terminal:
python3 ExtractorAIO.py

The script will do the rest.

This can take several minutes depending on the size and number of your APK files.

How to use generated data?

The generated data is in .csv format and can be parsed with many prebuilt libraries or modules. The pandas module in Python is suggested.
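As a rough sketch of the pandas route, the snippet below loads a tiny in-memory CSV shaped like the generated file. The two app names and the reduced set of permission columns are made up for illustration; with the real output you would pass the path of the generated .csv file to `pd.read_csv` instead of the `StringIO` object.

```python
import io

import pandas as pd

# Hypothetical two-row dataset in the same NAME / permissions / CLASS layout.
SAMPLE_CSV = """NAME,android.permission.CALL_PHONE,android.permission.READ_PHONE_STATE,CLASS
benign_app.apk,0,1,0
malware_app.apk,1,1,1
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Basic sanity checks before training anything.
print(df.shape)                    # (rows, columns)
print(df["CLASS"].value_counts())  # samples per class

# How many apps request each permission.
permission_usage = df.drop(columns=["NAME", "CLASS"]).sum()
print(permission_usage)
```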

Formatting

The data is formatted in the following way:
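| NAME | android.permission.ACCESS_KEYGUARD_SECURE_STORAGE | android.permission.ACCESS_NETWORK_STATE | android.permission.CALL_PHONE | android.permission.READ_PHONE_STATE | android.permission.WRITE_EXTERNAL_STORAGE | CLASS |
| --- | --- | --- | --- | --- | --- | --- |
| a.SurlyProjectFinal.apk | 0 | 1 | 1 | 1 | 1 | 0 |
| ae.gov.dha.dha.apk | 0 | 1 | 0 | 1 | 1 | 0 |
| aero.zztrop.apk | 0 | 0 | 0 | 0 | 0 | 0 |
| a5starapps.com.drkalamquotes.apk | 0 | 1 | 0 | 0 | 0 | 1 |
| ackman.placemarks.apk | 0 | 1 | 0 | 0 | 1 | 1 |
| ackmaniac.currencyfxrates.apk | 0 | 1 | 0 | 0 | 0 | 1 |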

This is a sample dataset of 6 applications (3 malware and 3 benign).

With thousands of samples, the table can become too big for general office tools to open.

The first column contains the name of each application, and the last column, "CLASS", records whether the application comes from the benign or the malware family of the training set. [0=Benign, 1=Malware]

In between are all the permissions (the common ones plus all those found in the first phase), each with an information bit. [0=The application does not use this permission, 1=The application uses this permission]
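The layout described above can be sketched in a few lines. This is an illustration of the format only, not the script's actual implementation: the helper `build_rows` and the sample apps are hypothetical, and the permission columns here come solely from the permissions found in the input, whereas the script also includes a list of common permissions.

```python
import csv
import io


def build_rows(apps):
    """Build CSV rows from (name, permission_set, label) tuples.

    label is 0 for benign and 1 for malware.
    """
    # Every permission seen in any app becomes one column.
    all_permissions = sorted(set().union(*(perms for _, perms, _ in apps)))
    rows = [["NAME"] + all_permissions + ["CLASS"]]
    for name, perms, label in apps:
        bits = [1 if p in perms else 0 for p in all_permissions]
        rows.append([name] + bits + [label])
    return rows


# Hypothetical apps, for illustration only.
apps = [
    ("benign_app.apk", {"android.permission.ACCESS_NETWORK_STATE"}, 0),
    ("malware_app.apk", {"android.permission.ACCESS_NETWORK_STATE",
                         "android.permission.CALL_PHONE"}, 1),
]

rows = build_rows(apps)
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Sorting the permission names keeps the column order stable between runs, so datasets built at different times line up column for column as long as they see the same permissions.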

Importing Data Example in SKlearn

The following example loads the generated dataset and prepares it for use with a scikit-learn model such as RandomForest.
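import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the generated dataset.
df = pd.read_csv("data.csv")

# Features are every column except NAME and CLASS
# (also skip a saved index column, if the file has one).
feature_names = [c for c in df.columns
                 if c not in ("NAME", "CLASS") and not c.startswith("Unnamed")]
label_name = "CLASS"

X = np.asarray(df[feature_names])  # feature matrix
Y = np.asarray(df[label_name])     # label vector

# Hold out 20% of the samples for testing.
train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.2)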

The code above drops the NAME column, then stores the feature matrix (every column after NAME up to the second-to-last column) in X and the label vector (the CLASS column) in Y. These can then be split into the desired training and testing sets.
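To go one step further and actually fit a RandomForest, a minimal sketch could look like the following. It uses the six sample rows shown earlier as an in-memory CSV so that it runs on its own; the hyperparameters (`n_estimators=10`, `random_state=0`) are arbitrary choices for illustration, and a dataset this small says nothing about real detection performance.

```python
import io

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# The six-application sample from the Formatting section, as CSV text.
SAMPLE_CSV = """NAME,android.permission.ACCESS_KEYGUARD_SECURE_STORAGE,android.permission.ACCESS_NETWORK_STATE,android.permission.CALL_PHONE,android.permission.READ_PHONE_STATE,android.permission.WRITE_EXTERNAL_STORAGE,CLASS
a.SurlyProjectFinal.apk,0,1,1,1,1,0
ae.gov.dha.dha.apk,0,1,0,1,1,0
aero.zztrop.apk,0,0,0,0,0,0
a5starapps.com.drkalamquotes.apk,0,1,0,0,0,1
ackman.placemarks.apk,0,1,0,0,1,1
ackmaniac.currencyfxrates.apk,0,1,0,0,0,1
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Everything except NAME and CLASS is a 0/1 permission feature.
feature_names = [c for c in df.columns if c not in ("NAME", "CLASS")]
X = df[feature_names].to_numpy()
y = df["CLASS"].to_numpy()

# Fixed random_state so the split and the forest are reproducible.
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(train_x, train_y)

predictions = model.predict(test_x)
print("predicted classes:", predictions)
print("accuracy on held-out samples:", model.score(test_x, test_y))
```

With the real dataset, replace the `StringIO` object with the path to the generated .csv file; `model.feature_importances_` then gives a rough view of which permissions the forest relies on.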

More

- This is used in the PACE project.
- This can be used to reproduce the work in
  • A. Kumar, V. Agarwal, S. K. Shandilya, A. Shalaginov, S. Upadhyay and B. Yadav, "PACE: Platform for Android Malware Classification and Performance Evaluation," 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 2019, pp. 4280-4288, doi: 10.1109/BigData47090.2019.9006557.
    • Abstract: Android malware has become the topmost threat for ubiquitous and useful Android eco-system. Multiple solutions leveraging big data and machine learning capabilities to detect android malware are being constantly developed. Too often, many of these solutions are either limited to the research output or remain isolated and unable to reach to end-users or malware researchers. In this paper, we propose, PACE, a unified solution to offer open and easy implementation access to several machine learning-based Android malware detection techniques that make most of the research in this domain reproducible. The benefits of PACE are offered using three interfaces i.e. through REST API, Web Interface and ADB interface. Multiple interfaces enable users with different expertise such as IT administrator, security practitioners, malware researcher, etc. to avail its offered services. A community-accepted dataset is used for testing of all the techniques to provide a better comparison of performance. A prototype of the proposed platform is introduced and our vision is that it will help malware analysts to tackle challenges and reduce the amount of manual work.
    • Keywords: Android (operating system); Big Data; invasive software; learning (artificial intelligence); pattern classification; software performance evaluation; big data; malware analysts; Android malware classification; performance evaluation; PACE; machine learning-based Android malware detection; Malware; Androids; Humanoid robots; Feature extraction; Smart phones; Machine learning; Security; Android Malware; Reproducible Research; Machine Learning; Cyber Threat Intelligence
    • URL: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9006557&isnumber=9005444

Please cite the above paper if you are using this tool:

@INPROCEEDINGS{9006557,
  author={A. {Kumar} and V. {Agarwal} and S. K. {Shandilya} and A. {Shalaginov} and S. {Upadhyay} and B. {Yadav}},
  booktitle={2019 IEEE International Conference on Big Data (Big Data)},
  title={PACE: Platform for Android Malware Classification and Performance Evaluation},
  year={2019},
  volume={},
  number={},
  pages={4280-4288},
}


