Protecting the World against Tyranny

Please consider donating to the fundraiser run by the National Bank of Ukraine (NBU), the central bank of Ukraine. Your donation helps freedom fighters and Ukrainian civilians caught in a humanitarian crisis.

If you are against funding bombs and arms, you could also donate to Come Back Alive Charity, which helps the Ukrainian armed forces with defense, including medical assistance and rehabilitation. The organization is transparent about the donations it receives and how it spends them.

Come Back Alive Charity

Thank you for your support.

What

This is a collection of Python notebooks from my ongoing research on deep packet inspection.

V2Ray Deep Packet Inspection Notebook demonstrates my work on using deep packet inspection to classify V2Ray traffic. For more details, please visit my blog post.
Adversarial Examples for Traffic Classifiers Notebook demonstrates white-box adversarial examples that evade the V2Ray classifier. For more details, please visit my blog post.
A Classification Model for V2Ray TLS + WebSocket settings is a pre-trained model for the TLS + WebSocket configuration. For more details, please visit my blog post.

Wrong way to split the dataset

An important result of this repository is that V2Ray traffic can be easily detected by a CNN model. However, the way the dataset is split appears to be wrong. The following code, from this file, undersamples both the training set and the validation set. Undersampling the training set is fine, but undersampling the validation/test set is not a good choice: it lets the classifier cheat, because the ratio of positive to negative examples in the test set has been artificially adjusted. In machine learning, the test set must stay independent of the training set, and of any resampling applied to it, so that the reported metrics reflect the data the model will actually see.



import numpy as np
import math
import os.path
from pathlib import Path
import glob
from tensorflow.keras.utils import Sequence

FIXED_PACKET_SIZE = 1500
NUM_OF_PACKETS_PER_FILE = 16
RESCALE_FACTOR = 1./255

# v2ray traffic tag
TRAINING_DATA_PERCENTAGE = 0.8
PACKET_FILE_EXT = '*.bin'


def rglob(data_root, file_ext):
    files = list()
    for filePath in Path(data_root).rglob(file_ext):
        files.append(str(filePath))
    return files


def binary_classification(packet_path, match_string=V2RAY_HOST_TAG):
    """Binary network traffic classification function

    :param packet_path: file path to packet
    :param match_string: substring of the file path that marks v2ray traffic
    :return: 1, if it is v2ray traffic. 0, otherwise.
    """
    if packet_path.find(match_string) != -1:
        return 1
    else:
        return 0


def generate_train_validation_packet_path_list(data_root, training_pct=TRAINING_DATA_PERCENTAGE, eqaul_size=True):
    file_list = rglob(data_root, PACKET_FILE_EXT)
    v2ray_file_list = [file_path for file_path in file_list if binary_classification(file_path) == 1]
    non_v2ray_file_list = [file_path for file_path in file_list if binary_classification(file_path) == 0]

    if eqaul_size:
        cut_off_count = min(len(v2ray_file_list), len(non_v2ray_file_list))
        v2ray_file_size = cut_off_count
        non_v2ray_file_size = cut_off_count
    else:
        v2ray_file_size = len(v2ray_file_list)
        non_v2ray_file_size = len(non_v2ray_file_list)

    v2ray_indexes = np.arange(len(v2ray_file_list))
    np.random.shuffle(v2ray_indexes)
    non_v2ray_indexes = np.arange(len(non_v2ray_file_list))
    np.random.shuffle(non_v2ray_indexes)

    training_file_list = [v2ray_file_list[index]
                          for index in v2ray_indexes[:math.ceil(v2ray_file_size * training_pct)]] + \
                         [non_v2ray_file_list[index]
                          for index in non_v2ray_indexes[:math.ceil(non_v2ray_file_size * training_pct)]]

    validation_file_list = [v2ray_file_list[index]
                            for index in v2ray_indexes[math.ceil(v2ray_file_size * training_pct): v2ray_file_size]] + \
                           [non_v2ray_file_list[index]
                            for index in non_v2ray_indexes[math.ceil(non_v2ray_file_size * training_pct): non_v2ray_file_size]]

    print("Statistics: ")
    print("Total V2ray traffic %d, Total non-V2ray traffic %d" % (len(v2ray_file_list), len(non_v2ray_file_list)))
    print("Output train traffic %d, Total validation traffic %d" % (len(training_file_list), len(validation_file_list)))

    return training_file_list, validation_file_list


# Generate training data and validation file list
train_file_list, val_file_list = generate_train_validation_packet_path_list(data_root=DATA_ROOT, eqaul_size=True)
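
One possible way to avoid the problem described above is to split each class into training and validation parts first, and only then undersample the majority class inside the training part. The sketch below is not the repository's code: the function name `split_then_undersample_train`, its parameters, and the `seed` argument are hypothetical, and it assumes the two per-class file lists have already been built the way `generate_train_validation_packet_path_list` builds them.

```python
import math
import random


def split_then_undersample_train(pos_files, neg_files, training_pct=0.8, seed=0):
    """Split each class first, then balance only the training part.

    pos_files / neg_files are lists of file paths for the two classes
    (hypothetical inputs, e.g. the V2Ray and non-V2Ray file lists).
    """
    rng = random.Random(seed)
    pos = list(pos_files)
    neg = list(neg_files)
    rng.shuffle(pos)
    rng.shuffle(neg)

    # Per-class split points, computed on the full class sizes.
    pos_cut = math.ceil(len(pos) * training_pct)
    neg_cut = math.ceil(len(neg) * training_pct)

    train_pos, val_pos = pos[:pos_cut], pos[pos_cut:]
    train_neg, val_neg = neg[:neg_cut], neg[neg_cut:]

    # Undersample the majority class in the training part only.
    n = min(len(train_pos), len(train_neg))
    train = train_pos[:n] + train_neg[:n]

    # Validation keeps every held-out file, so its class ratio is untouched.
    val = val_pos + val_neg
    return train, val
```

With 10 positive and 100 negative files, this yields a balanced training list of 16 files and a validation list of 22 files that keeps the original 1:10 class ratio. No file appears in both lists.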

Related issue/discussion:

v2fly/v2ray-core#557
v2ray/discussion#569

Ref:
https://datascience.stackexchange.com/questions/61858/oversampling-undersampling-only-train-set-only-or-both-train-and-validation-set
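
To illustrate why a rebalanced validation set can flatter a classifier, here is a small worked example with made-up numbers. The 99% true positive rate, the 1% false positive rate, and the 1% share of V2Ray flows are assumptions chosen for illustration, not measurements from this repository. The same classifier is scored once on a balanced set and once on a set where V2Ray flows are rare.

```python
def precision(tpr, fpr, prevalence):
    """Precision implied by a true/false positive rate at a given positive share."""
    true_pos = tpr * prevalence            # fraction of all samples that are true positives
    false_pos = fpr * (1.0 - prevalence)   # fraction of all samples that are false positives
    return true_pos / (true_pos + false_pos)


# Same classifier, two different class ratios in the evaluation set.
balanced = precision(tpr=0.99, fpr=0.01, prevalence=0.5)
rare = precision(tpr=0.99, fpr=0.01, prevalence=0.01)

print(f"balanced validation set: {balanced:.2f}")
print(f"1% positive share:       {rare:.2f}")
```

With these assumed rates, precision drops from about 0.99 on the balanced set to about 0.50 when only 1% of flows are V2Ray, even though the classifier itself has not changed.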

rickyzhang82 / v2ray-deep-packet-inspection
