
deepdf's Introduction

Detecting malicious PDF using CNN

Description

This repository contains the code accompanying the paper: Detecting malicious PDF using CNN. It is implemented using PyTorch.

Setup

To set up, create a virtual environment with Python 3 and run:

pip install -r requirements.txt

Prerequisites

In order to train a model, you need to have:

  • The PDF files you want to train on, in a local folder
  • A CSV file containing information about the files. The format should be the same as samples.csv
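
Before starting a training, it can help to check that every file listed in the CSV is actually present in the local folder, since the help text of train.py says the file names must match the hash in the CSV. The following is a minimal sketch, not code from this repository. The helper `find_missing_files` and the column name `hash` are assumptions made for illustration; adjust the column name to whatever header your CSV really uses.

```python
import csv
from pathlib import Path


def find_missing_files(files_csv, data_path, hash_column="hash"):
    """Return the hashes listed in the CSV that have no matching file in data_path.

    `hash_column` is a hypothetical column name; change it to the real header.
    """
    data_dir = Path(data_path)
    missing = []
    with open(files_csv, newline="") as handle:
        for row in csv.DictReader(handle):
            file_hash = row[hash_column].strip()
            # Each file is expected to be named exactly after its hash.
            if not (data_dir / file_hash).is_file():
                missing.append(file_hash)
    return missing
```

For example, `find_missing_files("training_files.csv", "data/pdfs/")` returns an empty list when every listed file is present.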

Downloading experiment files

In order to reproduce our experiments you can:

  • Download the list of files given in training_files.csv from VirusTotal
  • Download the contagio dump for PDF files and use training_contagio.csv

Run a training

To train a model, run train.py.

Example:
python3 train.py ModelB training_files.csv data/pdfs/ --name training1 --gpu cuda:3
It saves:
  • The model in the folder trainings/. It should be loaded with torch.load().
  • The logs in the folder logs/.
  • A PNG file containing the ROC on the test set in the current working directory.
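
To make the ROC output easier to interpret, here is a small sketch of how ROC points and the area under the curve can be computed from test labels and model scores. It is not the plotting code used by train.py. The function names are hypothetical, and the convention that label 1 means malicious is an assumption.

```python
def roc_curve_points(labels, scores):
    """Return (fpr, tpr) lists obtained by sweeping a threshold over the scores.

    Assumes labels are 0/1, with 1 as the positive (malicious) class.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative sample")

    # Highest score first: lowering the threshold admits samples in this order.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for index, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample sharing this score (ties).
        is_last = index == len(ranked) - 1
        if is_last or ranked[index + 1][0] != score:
            fpr.append(fp / negatives)
            tpr.append(tp / positives)
    return fpr, tpr


def area_under_curve(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )
```

A curve that hugs the top-left corner (area close to 1) means the model ranks malicious files above benign ones almost everywhere.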
Usage:
usage: train.py [-h] [--name NAME] [--gpu GPU] [--resample] [--cont]
                [--contagio]
                model files_csv data_path

positional arguments:
model        Model to use, should be either 'ModelA', 'ModelB', or 'ModelC'
files_csv    CSV containing the files for the training and some info. Format
            should be the same as sample.csv
data_path    Directory in which the files are stored, the name of the files
            must be to the hash in the csv file.

optional arguments:
-h, --help   show this help message and exit
--name NAME  Name of the training (for the log file, the model object and
            the ROC picture)
--gpu GPU    Which GPU to use, default will be cuda:0
--resample   Whether to resample the train set
--cont       Whether to continue old training
--contagio   Split train test for contagio dataset
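
For readers who want to script around this interface, the sketch below shows an `argparse` parser that accepts the same arguments as the usage message above. It is a reconstruction from the help text only, not the actual contents of train.py; the function name `build_parser` is hypothetical, and the defaults other than `cuda:0` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented train.py command line."""
    parser = argparse.ArgumentParser(prog="train.py")
    # Positional arguments, in the order shown in the usage message.
    parser.add_argument("model", help="Model to use: 'ModelA', 'ModelB', or 'ModelC'")
    parser.add_argument("files_csv", help="CSV describing the training files")
    parser.add_argument("data_path", help="Directory in which the files are stored")
    # Optional arguments; the help text states that the GPU default is cuda:0.
    parser.add_argument("--name", help="Name of the training")
    parser.add_argument("--gpu", default="cuda:0", help="Which GPU to use")
    # Assumed to be boolean switches, since the usage shows no value for them.
    parser.add_argument("--resample", action="store_true", help="Resample the train set")
    parser.add_argument("--cont", action="store_true", help="Continue old training")
    parser.add_argument("--contagio", action="store_true", help="Split for contagio dataset")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Parsing the example command line from above yields `args.model == "ModelB"`, `args.name == "training1"`, and `args.gpu == "cuda:3"`, with the three switches left at `False`.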


deepdf's Issues

Dataset

Hi, I am currently trying to do some research on CNN malware detection. Do you think you could provide the dataset you used and the samples.csv file?

Data set and samples.csv

Hi,
I am currently working on my master's thesis and would like to try your implementation.
Unfortunately, samples.csv is missing and I don't have access to the contagio data set or your data set.
Would it be possible to train on the contagio data set and then try to classify unseen PDFs with the trained model? If so, could you maybe guide me a bit on how to achieve that?
I would highly appreciate it!
Kind regards!

About the lack of sample.csv

In your README.md, you say that the CSV should have the same format as sample.csv, but there is no file named sample.csv in your repository.
I look forward to your answer and the dataset.
Thanks

About the dataset used in Detection-of-Malicious-PDF-Files

I am a graduate student from China, majoring in malicious document detection. I learned about the paper "DETECTING MALICIOUS PDF USING CNN" from your GitHub, and it caught my interest. I wonder if I could get the document dataset you used in the paper; I would be grateful if you agree. I guarantee that I will use the dataset only for personal learning and research, and that I will not share it with others or use it maliciously. You could transfer the data via Google Drive, OneDrive, or other means.
Thanks.
You can contact me by email:
base64.b64decode("c24yMDE0MTExMjMwMTVAZ21haWwuY29t")

Dataset

I would greatly appreciate it if you could provide the dataset, as well as the samples.csv file that you used for training the model. Thank you in advance.

Utils.py missing function

Hello, looking at the code provided, it looks like utils.py is supposed to have a function "find_detection_at" that is missing. Is that intentional?
