kahramankostas / iotdevidv2

A Behavior-Based Device Identification Method for the IoT

License: MIT License

Jupyter Notebook 97.95% Python 2.05%
machine-learning device-identification iot fingerprinting

iotdevidv2's Introduction

IoTDevID: A Behavior-Based Device Identification Method for the IoT

Overview

In this repository you will find a Python implementation of the methods in the paper IoTDevID: A Behavior-Based Device Identification Method for the IoT.

Kahraman Kostas, Mike Just, and Michael A. Lones. IoTDevID: A Behavior-Based Device Identification Method for the IoT, IEEE Internet of Things Journal, 2022.

What is IoTDevID?

Device identification is one way to secure a network of IoT devices, whereby devices identified as suspicious can subsequently be isolated from a network. In this study, we present a machine learning-based method, IoTDevID, that recognises devices through characteristics of their network packets. As a result of using a rigorous feature analysis and selection process, our study offers a generalizable and realistic approach to modelling device behavior, achieving high predictive accuracy across two public datasets. The model's underlying feature set is shown to be more predictive than existing feature sets used for device identification, and is shown to generalise to data unseen during the feature selection process. Unlike most existing approaches to IoT device identification, IoTDevID is able to detect devices using non-IP and low-energy protocols.

Fig 1 - A brief overview of the IoTDevID methodology.

Requirements and Infrastructure:

The application files were created with Wireshark and Python 3.6. Before running them, make sure that Wireshark, Python 3.6 or later, and the following libraries are installed.

Library      Task
Scapy        Packet (pcap) crafting
tshark       Packet (pcap) crafting
Sklearn      Machine learning and data preparation
xverse       Feature importance/voting
NumPy        Mathematical operations
Pandas       Data analysis
Matplotlib   Graphics and visualization
Seaborn      Graphics and visualization
graphviz     Graphics and visualization

The technical specifications of the computer used for experiments are given below.

Central Processing Unit : Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz 2.90 GHz
Random Access Memory : 8 GB (7.74 GB usable)
Operating System : Windows 10 Pro 64-bit
Graphics Processing Unit : AMD Radeon(TM) 530

Implementation:

The implementation phase consists of 5 steps, which are:

  • Feature Extraction
  • Feature Selection
  • Algorithm Selection
  • Performance Evaluation
  • Comparison with Previous Work

Each of these steps is implemented in one or more Python files. Every file is provided with both "py" and "ipynb" extensions, and the code in the two versions is identical. The ipynb version additionally stores the state and screen output of its last run, so the output can be inspected without re-running the file. Files with the ipynb extension can be run using Jupyter Notebook.

01 Feature Extraction (PCAP2CSV)

Section III.C in the article

There are four files relevant to this section:

These files convert pcap files into packet-based fingerprint files in CSV format (the IoT Sentinel, IoTSense, and IoTDevID individual-packet feature sets) and create the labels.
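The labeling idea can be sketched as follows. This is a minimal illustration and not the repository's extraction code: the packet dictionaries, the feature names (pck_size, ttl), the MAC addresses, and the packets_to_csv helper are all hypothetical, and it assumes that each packet is labeled by looking up its source MAC address in a MAC-to-device list.

```python
import csv
import io

# Hypothetical MAC-to-device mapping; real mappings depend on the dataset.
MAC_LABELS = {
    "aa:bb:cc:00:00:01": "camera",
    "aa:bb:cc:00:00:02": "plug",
}


def packets_to_csv(packets, mac_labels, columns):
    """Write one CSV row per packet and append a Label column.

    packets: iterable of dicts holding already-extracted per-packet fields.
    Packets whose source MAC is not in mac_labels are skipped.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns + ["Label"], lineterminator="\n")
    writer.writeheader()
    for pkt in packets:
        label = mac_labels.get(pkt.get("src_mac"))
        if label is None:
            continue  # unknown device: no label available
        row = {c: pkt.get(c, 0) for c in columns}  # missing fields default to 0
        row["Label"] = label
        writer.writerow(row)
    return buf.getvalue()


sample_packets = [
    {"src_mac": "aa:bb:cc:00:00:01", "pck_size": 60, "ttl": 64},
    {"src_mac": "ff:ff:ff:ff:ff:ff", "pck_size": 10, "ttl": 1},
]
print(packets_to_csv(sample_packets, MAC_LABELS, ["pck_size", "ttl"]))
```

In a real pipeline the packet dictionaries would come from a pcap parser such as Scapy or tshark rather than from hand-written samples.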

The processed datasets are shared in the repository. However, raw versions of the datasets used in the study and their addresses are given below.

Dataset                  Capture year   Number of devices   Type
Aalto University         2016           31                  Benign
UNSW-Sydney IEEE TMC     2016           31                  Benign
UNSW-Sydney ACM SOSR     2018           28                  Benign & Malicious
CIC-IoT-22*              2022           60                  Benign & Malicious
LSIF**                   2020           22                  Benign

*: The IoTDevID method was applied to this dataset as part of another study [Code]-[Paper]

**: The IoTDevID method was applied to this dataset as part of another study [Code]-[Paper-see Chapter 4]

Since the UNSW data are very large, we filter the data on a device and session basis. You can access the Pcap files obtained from this filtering process from this link (Used Pcap Files).

In addition, the CSVs.zip file contains the feature sets that are the output of this step and that we used in our experiments. These files are:

  • Aalto_test_IoTDevID.csv
  • Aalto_train_IoTDevID.csv
  • Aalto_IoTSense_Test.csv
  • Aalto_IoTSense_Train.csv
  • Aalto_IoTSentinel_Test.csv
  • Aalto_IoTSentinel_Train.csv
  • UNSW_test_IoTDevID.csv
  • UNSW_train_IoTDevID.csv
  • UNSW_IoTSense_Test.csv
  • UNSW_IoTSense_Train.csv
  • UNSW_IoTSentinel_Test.csv
  • UNSW_IoTSentinel_Train.csv

02 Feature Selection

Section IV.A in the article

There are three files relevant to this section.

  • 02.1 Feature importance voting and pre-assessment of features: This file calculates the importance scores for each feature using six feature score calculation methods. It then votes for features using these scores. It lists the feature scores and the votes they have received and shows them on a plot. The six feature importance score calculation methods used are as follows.

    • Information Value using Weight of evidence.
    • Variable Importance using Random Forest.
    • Recursive Feature Elimination.
    • Variable Importance using Extra trees classifier.
    • Chi-Square best variables.
    • L1-based feature selection.
  • 02.2 Comparison of isolated data and CV methods: In this file, the results of the isolated test-training data and the cross-validated data are compared.

  • 02.3 Feature selection process using genetic algorithm: In this file, feature selection is performed by using a genetic algorithm.
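The voting step described in 02.1 can be illustrated with a short sketch. It is a simplification and not the xverse implementation used in the notebook: it assumes that each scoring method casts one vote for each of its top_k highest-scoring features. The method names, feature names, scores, and the vote_features helper are hypothetical.

```python
def vote_features(scores_by_method, top_k):
    """Count, for each feature, how many methods rank it in their top_k.

    scores_by_method: {method_name: {feature_name: importance_score}}
    Returns a dict ordered by votes (descending), then by feature name.
    """
    votes = {}
    for scores in scores_by_method.values():
        ranked = sorted(scores, key=scores.get, reverse=True)
        for feature in ranked[:top_k]:
            votes[feature] = votes.get(feature, 0) + 1
        for feature in scores:
            votes.setdefault(feature, 0)  # keep features that got no vote
    return dict(sorted(votes.items(), key=lambda kv: (-kv[1], kv[0])))


# Hypothetical scores from three of the six methods.
example_scores = {
    "random_forest": {"pck_size": 0.5, "TCP_dport": 0.3, "IP_ttl": 0.2},
    "extra_trees": {"pck_size": 0.4, "TCP_dport": 0.1, "IP_ttl": 0.5},
    "chi_square": {"pck_size": 0.6, "TCP_dport": 0.3, "IP_ttl": 0.1},
}
print(vote_features(example_scores, top_k=2))
```

Features with many votes are candidates to keep; how the threshold is chosen is a separate decision that this sketch does not cover.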

03 Algorithm Selection

Section IV.B in the article

There are two files relevant to this section.

04 Performance Evaluation

Section V in the article

There are four files relevant to this section. In the experiments above, we found that DT (decision tree) offers the best balance between predictive performance and inference time among the machine learning methods compared. Therefore, only DT is used in all subsequent experiments.

05 Comparison with Previous Work

Section VI in the article

There are two files relevant to this section.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Citations

If you use the source code please cite the following paper:

@article{kostas2022iot,
title = "{IoTDevID}: A Behavior-Based Device Identification Method for the {IoT}",
author = "Kahraman Kostas and Mike Just and Lones, {Michael Adam}",
year = "2022",
month = dec,
day = "1",
doi = "10.1109/JIOT.2022.3191951",
language = "English",
volume = "9",
pages = "23741--23749",
journal = "IEEE Internet of Things Journal",
issn = "2327-4662",
publisher = "IEEE",
number = "23",
}

Contact: Kahraman Kostas [email protected]

iotdevidv2's People

Contributors

kahramankostas, michaellones

iotdevidv2's Issues

02.3: Issue with variable value and creation of ./100 folder

Hi!

I ran into an error in 02.3: I can't find the values of the dataset, step and mixed variables. They do not seem to be defined at that point in the notebook. I noticed that they are assigned in a later portion of the notebook, but even using that information I can't generate the ./100 folder.

Your help in this regard will be highly appreciated!

About "01.1 Aalto feature extraction IoTDevID", no module "scapy"

Dear Sir,
My working platform is ubuntu 20.04,
the env is anaconda,
the version of python is 3.6.2
I ran "pip3 install scapy" in my virtual env, but I still get the error "ModuleNotFoundError: No module named 'scapy'".

Could you help me or give me some advice?
Thanks a lot.
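
One possible cause, which this thread does not confirm, is that pip3 installed Scapy into a different Python interpreter than the one running the notebook. The sketch below reports which interpreter is active and whether it can locate a given module; module_status is a hypothetical helper name. Run it in the same notebook or environment that raises the error.

```python
import importlib.util
import sys


def module_status(name):
    """Report whether the running interpreter can find a top-level module."""
    spec = importlib.util.find_spec(name)
    return {
        "interpreter": sys.executable,  # the Python binary actually in use
        "module": name,
        "found": spec is not None,
        "location": getattr(spec, "origin", None),
    }


print(module_status("scapy"))
```

If "found" is False, installing with the interpreter shown in the output (for example by running python -m pip install scapy with that same binary) makes sure the package lands in the environment that is actually used.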

03.2 Classification of Individual packets for Aalto Dataset.ipynb - AttributeError: 'NoneType' object has no attribute 'split' (KNN)

AttributeError Traceback (most recent call last)
Cell In[31], line 13
11 output_csv=dataset+str(sayac)+""+str(step)+""+str(mixed)+".csv"
12 target_names=target_name(test)
---> 13 ML(train,test,output_csv,feature,step,mixed,dataset[2:-1]+"_"+str(step))

Cell In[29], line 67, in ML(loop1, loop2, output_csv, cols, step, mixed, dname)
65 train_time=(float((time.time()-second)) )
66 second=time.time()
---> 67 predict =clf.predict(X_test)
68 test_time=(float((time.time()-second)) )
69 if step==1:
File ~.conda\envs\IoTDevIDv2_0\lib\site-packages\sklearn\neighbors_classification.py:237, in KNeighborsClassifier.predict(self, X)
235 neigh_dist = None
236 else:
--> 237 neigh_dist, neigh_ind = self.kneighbors(X)
239 classes_ = self.classes_
240 _y = self._y

File ~.conda\envs\IoTDevIDv2_0\lib\site-packages\sklearn\neighbors_base.py:824, in KNeighborsMixin.kneighbors(self, X, n_neighbors, return_distance)
817 use_pairwise_distances_reductions = (
818 self._fit_method == "brute"
819 and ArgKmin.is_usable_for(
820 X if X is not None else self.fit_X, self.fit_X, self.effective_metric
821 )
822 )
823 if use_pairwise_distances_reductions:
--> 824 results = ArgKmin.compute(
825 X=X,
826 Y=self.fit_X,
827 k=n_neighbors,
828 metric=self.effective_metric_,
829 metric_kwargs=self.effective_metric_params_,
830 strategy="auto",
831 return_distance=return_distance,
832 )
834 elif (
835 self._fit_method == "brute" and self.metric == "precomputed" and issparse(X)
836 ):
837 results = _kneighbors_from_graph(
838 X, n_neighbors=n_neighbors, return_distance=return_distance
839 )

File ~.conda\envs\IoTDevIDv2_0\lib\site-packages\sklearn\metrics_pairwise_distances_reduction_dispatcher.py:277, in ArgKmin.compute(cls, X, Y, k, metric, chunk_size, metric_kwargs, strategy, return_distance)
196 """Compute the argkmin reduction.
197
198 Parameters
(...)
274 returns.
275 """
276 if X.dtype == Y.dtype == np.float64:
--> 277 return ArgKmin64.compute(
278 X=X,
279 Y=Y,
280 k=k,
281 metric=metric,
282 chunk_size=chunk_size,
283 metric_kwargs=metric_kwargs,
284 strategy=strategy,
285 return_distance=return_distance,
286 )
288 if X.dtype == Y.dtype == np.float32:
289 return ArgKmin32.compute(
290 X=X,
291 Y=Y,
(...)
297 return_distance=return_distance,
298 )

File sklearn\metrics_pairwise_distances_reduction_argkmin.pyx:95, in sklearn.metrics._pairwise_distances_reduction._argkmin.ArgKmin64.compute()

File ~.conda\envs\IoTDevIDv2_0\lib\site-packages\sklearn\utils\fixes.py:139, in threadpool_limits(limits, user_api)
137 return controller.limit(limits=limits, user_api=user_api)
138 else:
--> 139 return threadpoolctl.threadpool_limits(limits=limits, user_api=user_api)

File ~.conda\envs\IoTDevIDv2_0\lib\site-packages\threadpoolctl.py:171, in threadpool_limits.__init__(self, limits, user_api)
167 def __init__(self, limits=None, user_api=None):
168 self._limits, self._user_api, self._prefixes = \
169 self._check_params(limits, user_api)
--> 171 self._original_info = self._set_threadpool_limits()

File ~.conda\envs\IoTDevIDv2_0\lib\site-packages\threadpoolctl.py:268, in threadpool_limits._set_threadpool_limits(self)
265 if self._limits is None:
266 return None
--> 268 modules = _ThreadpoolInfo(prefixes=self._prefixes,
269 user_api=self._user_api)
270 for module in modules:
271 # self._limits is a dict {key: num_threads} where key is either
272 # a prefix or a user_api. If a module matches both, the limit
273 # corresponding to the prefix is chosed.
274 if module.prefix in self._limits:

File ~.conda\envs\IoTDevIDv2_0\lib\site-packages\threadpoolctl.py:340, in _ThreadpoolInfo.__init__(self, user_api, prefixes, modules)
337 self.user_api = [] if user_api is None else user_api
339 self.modules = []
--> 340 self._load_modules()
341 self._warn_if_incompatible_openmp()
342 else:
File ~.conda\envs\IoTDevIDv2_0\lib\site-packages\threadpoolctl.py:485, in _ThreadpoolInfo._find_modules_with_enum_process_module_ex(self)
482 filepath = buf.value
484 # Store the module if it is supported and selected
--> 485 self._make_module_from_path(filepath)
486 finally:
487 kernel_32.CloseHandle(h_process)

File ~.conda\envs\IoTDevIDv2_0\lib\site-packages\threadpoolctl.py:515, in _ThreadpoolInfo._make_module_from_path(self, filepath)
513 if prefix in self.prefixes or user_api in self.user_api:
514 module_class = globals()[module_class]
--> 515 module = module_class(filepath, prefix, user_api, internal_api)
516 self.modules.append(module)

File ~.conda\envs\IoTDevIDv2_0\lib\site-packages\threadpoolctl.py:606, in _Module.__init__(self, filepath, prefix, user_api, internal_api)
604 self.internal_api = internal_api
605 self._dynlib = ctypes.CDLL(filepath, mode=_RTLD_NOLOAD)
--> 606 self.version = self.get_version()
607 self.num_threads = self.get_num_threads()
608 self._get_extra_info()

File ~.conda\envs\IoTDevIDv2_0\lib\site-packages\threadpoolctl.py:646, in _OpenBLASModule.get_version(self)
643 get_config = getattr(self._dynlib, "openblas_get_config",
644 lambda: None)
645 get_config.restype = ctypes.c_char_p
--> 646 config = get_config().split()
647 if config[0] == b"OpenBLAS":
648 return config[1].decode("utf-8")

AttributeError: 'NoneType' object has no attribute 'split'
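
The last frame shows where this fails: get_config() evidently returned None (either from the lambda: None fallback or from the OpenBLAS library itself), and .split() was then called on it. The sketch below shows a guarded version of that parsing step. It is an illustration of the failure mode only, not threadpoolctl's actual code or fix, and parse_openblas_version is a hypothetical name. Upgrading threadpoolctl is a commonly suggested remedy for this error, although this thread does not confirm that it resolves the case above.

```python
def parse_openblas_version(config):
    """Extract the version from an OpenBLAS config string, tolerating None.

    config: bytes such as b"OpenBLAS 0.3.21 ...", or None when the library
    does not report a configuration.
    """
    if config is None:
        return None  # nothing to parse; avoids calling .split() on None
    parts = config.split()
    if len(parts) > 1 and parts[0] == b"OpenBLAS":
        return parts[1].decode("utf-8")
    return None


print(parse_openblas_version(None))
print(parse_openblas_version(b"OpenBLAS 0.3.21 NO_LAPACKE"))
```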

02.2 Comparison of isolated and CV methods.ipynb , TypeError: Could not convert ['0 empty0 empty0 empty0 empty0 empty0 empty0 empty0 empty0 empty0 empty' 'DTDTDTDTDTDTDTDTDTDT'] to numeric

I found a workaround: replace any value that cannot be converted to a number with NaN. The changed code is:

import pandas as pd

def average_values(name_list):
    """Average every column of each CSV file; returns one row per file."""
    flag = 1
    for i in name_list:
        df = pd.read_csv(i)
        col = i[14:-4]  # Extract column name from file path

        # My change: any value that cannot be converted to a number becomes NaN
        df = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))

        temp = pd.DataFrame(df.mean(), columns=[col])  # Assign column name to DataFrame

        if flag:
            std = temp
            flag = 0
        else:
            std[col] = temp[col]
    tt = std.T
    return tt

##isolated

name_list=find_the_way('./isolated/','.csv')
iso=average_values(name_list)
iso = iso.drop(['Dataset' , 'ML algorithm'],axis=1)# this is my change

name_list=find_the_way('./crossval/','.csv')
cv=average_values(name_list)
cv=cv.drop(['Dataset' , 'ML_algorithm'], axis=1)# this is my change

Is this correct? I ask because my resulting graph is not the same as the one in the paper.

[Image: my result]

[Image: result from the paper]

feature_extraction

Hello,
I have set up an IoT network consisting of three low-energy devices that support IPv6 and IEEE 802.15.4 MAC addresses. From inspecting your PCAP files, it appears that you use only IPv4 devices. When I added my MAC address (example: '01:12:74:01:00:01:01:01':'lab1') to the MAC list and ran the code, it did not work correctly.

Could you please clarify whether your approach is based solely on IPv4 devices, or whether the code also supports IPv6 devices that work with all low-energy protocols?

02.1 Feature importance voting and pre-assessment of features

I am sorry to trouble you, but this part of the code in file 02.1 does not run successfully. Following the solution in the issues, I ran pip install git+https://github.com/kahramankostas/XuniVerse, which reported: Successfully installed contourpy-1.2.0 cycler-0.12.1 fonttools-4.44.3 joblib-1.3.2 kiwisolver-1.4.5 matplotlib-3.8.2 packaging-23.2 patsy-0.5.3 pillow-10.1.0 pyparsing-3.1.1 scikit-learn-1.3.2 scipy-1.11.4 statsmodels-0.14.0 threadpoolctl-3.2.0 xverse-1.0.5. However, it did not take effect.

I hope for an early reply. Thanks!
My pandas version is 2.1.3.

AttributeError Traceback (most recent call last)
Cell In[22], line 14
12 clf = VotingSelector()
13 print(X, y)
---> 14 clf.fit(X, y)
15 #Selected features
16 temp="./results/"+i[18:-4]+"FI.csv"

File D:\Python_env\Lib\site-packages\xverse\ensemble_voting.py:224, in VotingSelector.fit(self, X, y)
222 #start training on the data
223 temp_X = X[self.use_features]
--> 224 self.feature_importances_, self.feature_votes_ = self.train(temp_X, y)
226 return self

File D:\Python_env\Lib\site-packages\xverse\ensemble_voting.py:285, in VotingSelector.train(self, X, y)
283 #handle categorical values with either 'woe' or 'le'
284 if self.handle_category == 'woe':
--> 285 transformed_X, self.mapping, iv_df = self.woe_information_value(X, y) #woe transformed_X
286 elif self.handle_category == 'le':
287 transformed_X = X.copy(deep=True)

File D:\Python_env\Lib\site-packages\xverse\ensemble_voting.py:115, in VotingSelector.woe_information_value(self, X, y)
112 def woe_information_value(self, X, y):
114 clf = WOE()
--> 115 clf.fit(X, y)
117 return clf.transform(X), clf.woe_bins, clf.iv_df

File D:\Python_env\Lib\site-packages\xverse\transformer_woe.py:137, in WOE.fit(self, X, y)
132 if self.monotonic_binning:
133 self.mono_bin_clf = MonotonicBinning(feature_names=self.mono_feature_names,
134 max_bins=self.mono_max_bins, force_bins=self.mono_force_bins,
135 cardinality_cutoff=self.mono_cardinality_cutoff,
136 prefix=self.mono_prefix, custom_binning=self.mono_custom_binning)
--> 137 X = self.mono_bin_clf.fit_transform(X, y)
138 self.mono_custom_binning = self.mono_bin_clf.bins
140 #identify the variables to tranform and assign the bin mapping dictionary

File D:\Python_env\Lib\site-packages\sklearn\utils_set_output.py:157, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
155 @wraps(f)
156 def wrapped(self, X, *args, **kwargs):
--> 157 data_to_wrap = f(self, X, *args, **kwargs)
158 if isinstance(data_to_wrap, tuple):
159 # only wrap the first output for cross decomposition
160 return_tuple = (
161 _wrap_data_with_container(method, data_to_wrap[0], X, self),
162 *data_to_wrap[1:],
163 )

File D:\Python_env\Lib\site-packages\xverse\transformer_binning.py:257, in MonotonicBinning.fit_transform(self, X, y)
256 def fit_transform(self, X, y):
--> 257 return self.fit(X, y).transform(X)

File D:\Python_env\Lib\site-packages\xverse\transformer_binning.py:122, in MonotonicBinning.fit(self, X, y)
118 raise ValueError("The input feature(s) should be numeric type. Some of the input features
119 has character values in it. Please use a encoder before performing monotonic operations.")
121 #apply the monotonic train function on dataset
--> 122 fit_X.apply(lambda x: self.train(x, y), axis=0)
123 return self

File D:\Python_env\Lib\site-packages\pandas\core\frame.py:10034, in DataFrame.apply(self, func, axis, raw, result_type, args, by_row, **kwargs)
10022 from pandas.core.apply import frame_apply
10024 op = frame_apply(
10025 self,
10026 func=func,
(...)
10032 kwargs=kwargs,
10033 )

10034 return op.apply().finalize(self, method="apply")

File D:\Python_env\Lib\site-packages\pandas\core\apply.py:837, in FrameApply.apply(self)
834 elif self.raw:
835 return self.apply_raw()
--> 837 return self.apply_standard()

File D:\Python_env\Lib\site-packages\pandas\core\apply.py:963, in FrameApply.apply_standard(self)
962 def apply_standard(self):
--> 963 results, res_index = self.apply_series_generator()
965 # wrap results
966 return self.wrap_results(results, res_index)

File D:\Python_env\Lib\site-packages\pandas\core\apply.py:979, in FrameApply.apply_series_generator(self)
976 with option_context("mode.chained_assignment", None):
977 for i, v in enumerate(series_gen):
978 # ignore SettingWithCopy here in case the user mutates
--> 979 results[i] = self.func(v, *self.args, **self.kwargs)
980 if isinstance(results[i], ABCSeries):
981 # If we have a view on v, we need to make a copy because
982 # series_generator will swap out the underlying data
983 results[i] = results[i].copy(deep=False)

File D:\Python_env\Lib\site-packages\xverse\transformer_binning.py:122, in MonotonicBinning.fit.<locals>.<lambda>(x)
118 raise ValueError("The input feature(s) should be numeric type. Some of the input features
119 has character values in it. Please use a encoder before performing monotonic operations.")
121 #apply the monotonic train function on dataset
--> 122 fit_X.apply(lambda x: self.train(x, y), axis=0)
123 return self

File D:\Python_env\Lib\site-packages\xverse\transformer_binning.py:170, in MonotonicBinning.train(self, X, y)
165 """
166 Execute this block when monotonic relationship is not identified by spearman technique.
167 We still want our code to produce bins.
168 """
169 if len(bins_X_grouped) == 1:
--> 170 bins = algos.quantile(X, np.linspace(0, 1, force_bins)) #creates a new binnning based on forced bins
171 if len(np.unique(bins)) == 2:
172 bins = np.insert(bins, 0, 1)

AttributeError: module 'pandas.core.algorithms' has no attribute 'quantile'
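
The traceback shows xverse calling algos.quantile, a helper inside pandas.core.algorithms that the installed pandas (2.1.3 according to the report) does not provide. Installing a pandas release that still has this helper, or patching the xverse call, are two possible directions; neither is confirmed in this thread. The sketch below shows what the failing line computes, under the assumption that it takes linearly interpolated quantiles at evenly spaced probabilities. linear_quantiles is a hypothetical stand-in written in plain Python, not the pandas function.

```python
def linear_quantiles(values, probs):
    """Return linearly interpolated quantiles of values at each probability."""
    data = sorted(float(v) for v in values)
    if not data:
        raise ValueError("values must not be empty")
    last = len(data) - 1
    result = []
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        pos = p * last              # fractional index into the sorted data
        lo = int(pos)               # floor, since pos is non-negative
        hi = min(lo + 1, last)
        frac = pos - lo
        result.append(data[lo] + (data[hi] - data[lo]) * frac)
    return result


# Three evenly spaced probabilities, as np.linspace(0, 1, 3) would give.
print(linear_quantiles([5, 1, 4, 2, 3], [0.0, 0.5, 1.0]))
```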
