IoTDevID: A Behavior-Based Device Identification Method for the IoT

Overview

In this repository you will find a Python implementation of the methods in the paper IoTDevID: A Behavior-Based Device Identification Method for the IoT.

Kahraman Kostas, Mike Just, and Michael A. Lones. IoTDevID: A Behavior-Based Device Identification Method for the IoT, IEEE Internet of Things Journal, 2022.

What is IoTDevID?

Device identification is one way to secure a network of IoT devices, whereby devices identified as suspicious can subsequently be isolated from a network. In this study, we present a machine learning-based method, IoTDevID, that recognises devices through characteristics of their network packets. As a result of using a rigorous feature analysis and selection process, our study offers a generalizable and realistic approach to modelling device behavior, achieving high predictive accuracy across two public datasets. The model's underlying feature set is shown to be more predictive than existing feature sets used for device identification, and is shown to generalise to data unseen during the feature selection process. Unlike most existing approaches to IoT device identification, IoTDevID is able to detect devices using non-IP and low-energy protocols.

Fig 1 - A brief overview of the IoTDevID methodology.

Requirements and Infrastructure:

The application files were created with Wireshark and Python 3.6. Before running them, make sure that Wireshark, Python 3.6+, and the following libraries are installed.

Library	Task
Scapy	Packet (pcap) crafting
tshark	Packet (pcap) crafting
Sklearn	Machine learning and data preparation
xverse	Feature importance/voting
Numpy	Mathematical operations
Pandas	Data analysis
Matplotlib	Graphics and visualization
Seaborn	Graphics and visualization
graphviz	Graphics and visualization
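A quick way to check the requirements above before opening the notebooks is sketched below. The import names are assumptions (the table lists project names, and for example Sklearn is imported as `sklearn`); tshark is treated as a command-line tool rather than a Python module.

```python
import importlib.util
import shutil

# Assumed import names for the libraries in the table above.
REQUIRED_MODULES = ["scapy", "sklearn", "xverse", "numpy", "pandas",
                    "matplotlib", "seaborn", "graphviz"]


def find_missing(modules):
    """Return the modules that the current interpreter cannot locate."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
print("Missing Python modules:", missing or "none")
# tshark ships with Wireshark and must be on the PATH to be found here.
print("tshark on PATH:", shutil.which("tshark") is not None)
```

Run it with the same interpreter (or notebook kernel) that will run the project files, since each environment has its own set of installed packages.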

The technical specifications of the computer used for experiments are given below.


Central Processing Unit	:	Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz 2.90 GHz
Random Access Memory	:	8 GB (7.74 GB usable)
Operating System	:	Windows 10 Pro 64-bit
Graphics Processing Unit	:	AMD Radeon (TM) 530

Implementation:

The implementation consists of five steps:

Feature Extraction
Feature Selection
Algorithm Selection
Performance Evaluation
Comparison with Previous Work

Each of these steps is implemented in one or more Python files. Every file is saved with both the "py" and "ipynb" extensions, and the code in the two versions is identical. The ipynb version additionally stores the state and screen output of its last run, so the output can be viewed without re-running the file. Files with the ipynb extension can be run using Jupyter Notebook.

01 Feature Extraction (PCAP2CSV)

Section III.C in the article

There are four files relevant to this section:

These files convert pcap files into CSV fingerprint files in which each row describes a single packet (the IoT Sentinel, IoTSense, and IoTDevID individual-packet feature sets), and they create the labels.
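The labelling part of this step can be pictured with the minimal sketch below. It is not the repository's extraction code (which relies on Scapy and tshark); it only assumes that each packet row carries a source MAC address and that labels come from a MAC-to-device dictionary. The field names, MAC addresses, and device names are all hypothetical.

```python
import csv
import io

# Hypothetical MAC-to-device map; real values come from the dataset documentation.
MAC_LABELS = {
    "aa:bb:cc:00:00:01": "camera",
    "aa:bb:cc:00:00:02": "smart_plug",
}


def label_rows(rows, mac_labels, mac_field="src_mac", default="unknown"):
    """Return copies of the packet rows with a 'Label' column added."""
    labelled = []
    for row in rows:
        mac = row[mac_field].lower()  # MAC comparison should ignore case
        labelled.append({**row, "Label": mac_labels.get(mac, default)})
    return labelled


def rows_to_csv(rows):
    """Serialise the rows as CSV text, one packet per line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Two made-up packets: one from a known device, one from an unlisted MAC.
packets = [
    {"src_mac": "AA:BB:CC:00:00:01", "pck_size": 60},
    {"src_mac": "aa:bb:cc:00:00:09", "pck_size": 342},
]
labelled = label_rows(packets, MAC_LABELS)
print(rows_to_csv(labelled))
```

Packets from a MAC address that is not in the dictionary receive the default label, which makes unlisted devices easy to spot in the output.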

The processed datasets are shared in the repository. The raw datasets used in the study, with their sources, are listed below.

Dataset	Capture year	Number of devices	Type
Aalto University	2016	31	Benign
UNSW-Sydney IEEE TMC	2016	31	Benign
UNSW-Sydney ACM SOSR	2018	28	Benign & Malicious
CIC-IoT-22*	2022	60	Benign & Malicious
LSIF**	2020	22	Benign

*: The IoTDevID method was applied to this dataset as part of another study [Code]-[Paper]

**: The IoTDevID method was applied to this dataset as part of another study [Code]-[Paper-see Chapter 4]

Because the UNSW data are very large, we filtered them by device and session. The pcap files obtained from this filtering process are available from this link (Used Pcap Files).

In addition, the CSVs.zip file contains the feature sets that are the output of this step and that we used in our experiments. These files are:

Aalto_test_IoTDevID.csv
Aalto_train_IoTDevID.csv
Aalto_IoTSense_Test.csv
Aalto_IoTSense_Train.csv
Aalto_IoTSentinel_Test.csv
Aalto_IoTSentinel_Train.csv
UNSW_test_IoTDevID.csv
UNSW_train_IoTDevID.csv
UNSW_IoTSense_Test.csv
UNSW_IoTSense_Train.csv
UNSW_IoTSentinel_Test.csv
UNSW_IoTSentinel_Train.csv

02 Feature Selection

Section IV.A in the article

There are three files relevant to this section.

02.1 Feature importance voting and pre-assessment of features: This file calculates the importance scores for each feature using six feature score calculation methods. It then votes for features using these scores. It lists the feature scores and the votes they have received and shows them on a plot. The six feature importance score calculation methods used are as follows.
- Information Value using Weight of evidence.
- Variable Importance using Random Forest.
- Recursive Feature Elimination.
- Variable Importance using Extra trees classifier.
- Chi-Square best variables.
- L1-based feature selection.
02.2 Comparison of isolated data and CV methods: In this file, the results of the isolated test-training data and the cross-validated data are compared.
02.3 Feature selection process using genetic algorithm: In this file, feature selection is performed by using a genetic algorithm.
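The voting idea in 02.1 can be illustrated with the sketch below. It assumes a simple rule in which each scoring method votes for its `top_k` highest-scoring features; the rule actually used by xverse may differ. The method names, feature names, and scores are made up for illustration.

```python
def vote_features(scores_by_method, top_k):
    """Count, for each feature, how many methods rank it in their top_k."""
    votes = {}
    for scores in scores_by_method.values():
        for feature in scores:
            votes.setdefault(feature, 0)  # every feature appears, even with 0 votes
        ranked = sorted(scores, key=scores.get, reverse=True)
        for feature in ranked[:top_k]:
            votes[feature] += 1
    # Most-voted features first; ties are broken alphabetically.
    return dict(sorted(votes.items(), key=lambda item: (-item[1], item[0])))


# Hypothetical importance scores from three of the six methods.
scores_by_method = {
    "random_forest": {"pck_size": 0.50, "ttl": 0.30, "dport_class": 0.20},
    "extra_trees": {"pck_size": 0.45, "ttl": 0.15, "dport_class": 0.40},
    "chi_square": {"pck_size": 12.0, "ttl": 9.0, "dport_class": 1.0},
}
votes = vote_features(scores_by_method, top_k=2)
print(votes)
```

Because only the rank within each method matters, scores on different scales (for example chi-square statistics and tree-based importances) can be combined without normalisation.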

03 Algorithm Selection

Section IV.B in the article

There are two files relevant to this section.

03.1 Hyperparameter Optimization: In this file, hyperparameter optimization is applied to the machine learning models via scikit-learn's RandomizedSearchCV. These machine learning models are:
- Decision Trees (DT)
- Naïve Bayes (NB)
- Gradient Boosting (GB)
- k-Nearest Neighbours (kNN)
- Random Forest (RF)
- Support Vector Machine (SVM)
03. 2 Classification of Individual packets for Aalto Dataset: This file trains machine learning models on the individual packets of the Aalto University dataset, using the methods listed above with the optimised hyperparameters.
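A randomized hyperparameter search of the kind described in 03.1 could look like the sketch below. It uses synthetic data and a Decision Tree only; the search space, number of iterations, number of folds, and scoring metric are assumptions, not the settings used in the notebooks.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a packet feature table with three device classes.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# Assumed search space; the notebooks may explore different ranges.
param_distributions = {
    "max_depth": list(range(2, 21)),
    "min_samples_split": list(range(2, 11)),
    "criterion": ["gini", "entropy"],
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,            # number of random parameter combinations to try
    cv=3,                 # 3-fold cross-validation for each combination
    scoring="f1_macro",   # macro F1 treats all classes equally
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
print(round(search.best_score_, 3))
```

The same pattern applies to the other models in the list by swapping the estimator and its parameter distributions.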

04 Performance Evaluation

Section V in the article

There are four files relevant to this section. In the experiments above, we found that DT offers the best balance between predictive performance and inference time among the machine learning methods tested. Therefore, only DT is used in all subsequent experiments.

04.1 Determination of aagregetion size: In this file, different aggregation sizes are tested. For this purpose, groups of different sizes (from 2 to 25) are formed and the performance results of these groups are observed.
04.2 Classification of ind-aag-mixed packets for Aalto Dataset: In this file, results are obtained for the Aalto dataset using individual, aggregated and mixed methods. A group size of 13 was used in the aggregation operations.
04.3 Classification of ind-aag-mixed packets for UNSW Dataset: In this file, results are obtained for the UNSW dataset using individual, aggregated and mixed methods. A group size of 13 was used in the aggregation operations.
04.4 Aalto results with combined labels: In this file, to deal with lower performance caused by the fact that the Aalto dataset contains many very similar devices, these similar devices are considered as a group and collected under the same label.
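The aggregation step can be sketched as follows. This is a minimal illustration that assumes packets are grouped by source MAC address in arrival order and that every packet in a group receives the group's most common predicted label; the grouping key and the tie-breaking rule are assumptions, and the notebooks remain the reference. The README's group size of 13 is the default here.

```python
from collections import Counter, defaultdict


def aggregate_predictions(macs, predictions, group_size=13):
    """Replace each prediction with the majority label of its per-MAC group."""
    by_mac = defaultdict(list)
    for index, mac in enumerate(macs):
        by_mac[mac].append(index)  # packet positions, in arrival order

    aggregated = list(predictions)
    for indices in by_mac.values():
        for start in range(0, len(indices), group_size):
            group = indices[start:start + group_size]
            # most_common(1) returns the label seen most often in the group
            majority = Counter(predictions[i] for i in group).most_common(1)[0][0]
            for i in group:
                aggregated[i] = majority
    return aggregated


# Toy example with a group size of 3 so that the effect is visible.
macs = ["m1", "m1", "m2", "m1", "m2", "m1"]
predictions = ["cam", "plug", "hub", "cam", "hub", "cam"]
result = aggregate_predictions(macs, predictions, group_size=3)
print(result)
```

In the toy example, the single "plug" prediction for MAC m1 is outvoted by the two "cam" predictions in the same group, which is how aggregation smooths out isolated per-packet errors.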

05 Comparison with Previous Work

Section VI in the article

There are two files relevant to this section.

05.1 Aalto IoTSense & IoTSentinel Normal, Aagregeted, Mixed Results: This file trains machine learning models using Aalto University data for 3 studies (IoTDevID, IoTSense, IoT Sentinel) with an individual, aggregated and mixed approach in order to compare the feature set performances.
05.2 UNSW IoTSense & IoTSentinel Normal, Aagregeted, Mixed Results: This file trains machine learning models using UNSW data for 3 studies (IoTDevID, IoTSense, IoT Sentinel) with an individual, aggregated and mixed approach in order to compare the feature set performances.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Citations

If you use the source code please cite the following paper:

@article{kostas2022iot,
title = "IoTDevID: A Behavior-Based Device Identification Method for the IoT",
author = "Kahraman Kostas and Mike Just and Lones, {Michael Adam}",
year = "2022",
month = dec,
day = "1",
doi = "10.1109/JIOT.2022.3191951",
language = "English",
volume = "9",
pages = "23741--23749",
journal = "IEEE Internet of Things Journal",
issn = "2327-4662",
publisher = "IEEE",
number = "23",
}

Contact: Kahraman Kostas [email protected]

iotdevidv2's Issues

02.3: Issue with variable value and creation of ./100 folder

Hi!

I ran into an error in 02.3: I can't find the values of the dataset, step and mixed variables. They do not seem to be defined before they are used in this notebook. I noticed that the values are given later in the notebook, but even using that information I can't generate the ./100 folder.

Your help in this regard will be highly appreciated!

About "01.1 Aalto feature extraction IoTDevID", no module "scapy"

Dear Sir,
My working platform is Ubuntu 20.04, the environment is Anaconda, and the Python version is 3.6.2.
I ran "pip3 install scapy" in my virtual environment, but I still get the error "ModuleNotFoundError: No module named 'scapy'".

Could you help me or give me some advice?
Thanks a lot.
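One possible cause, offered here as an assumption rather than a confirmed diagnosis, is that `pip3` installs into a different interpreter than the one running the notebook. The sketch below prints the active interpreter, checks whether `scapy` is visible to it, and builds a pip command that targets that same interpreter; `pip_install_command` is a hypothetical helper name.

```python
import importlib.util
import sys


def pip_install_command(package):
    """Build a pip command that targets the interpreter running this code."""
    return [sys.executable, "-m", "pip", "install", package]


print("Interpreter:", sys.executable)
print("scapy importable:", importlib.util.find_spec("scapy") is not None)
print("Suggested command:", " ".join(pip_install_command("scapy")))
```

Running this inside the notebook shows which environment the kernel actually uses, which can then be compared with the environment where `pip3` was run.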

03. 2 Classification of Individual packets for Aalto Dataset.ipynb -AttributeError: 'NoneType' object has no attribute 'split' (KNN)

AttributeError Traceback (most recent call last)
Cell In[31], line 13
11 output_csv=dataset+str(sayac)+""+str(step)+""+str(mixed)+".csv"
12 target_names=target_name(test)
---> 13 ML(train,test,output_csv,feature,step,mixed,dataset[2:-1]+"_"+str(step))

Cell In[29], line 67, in ML(loop1, loop2, output_csv, cols, step, mixed, dname)
65 train_time=(float((time.time()-second)) )
66 second=time.time()
---> 67 predict =clf.predict(X_test)
68 test_time=(float((time.time()-second)) )
69 if step==1:
File ~.conda\envs\IoTDevIDv2_0\lib\site-packages\sklearn\neighbors_classification.py:237, in KNeighborsClassifier.predict(self, X)
235 neigh_dist = None
236 else:
--> 237 neigh_dist, neigh_ind = self.kneighbors(X)
239 classes_ = self.classes_
240 _y = self._y

File ~.conda\envs\IoTDevIDv2_0\lib\site-packages\sklearn\neighbors_base.py:824, in KNeighborsMixin.kneighbors(self, X, n_neighbors, return_distance)
817 use_pairwise_distances_reductions = (
818 self._fit_method == "brute"
819 and ArgKmin.is_usable_for(
820 X if X is not None else self.fit_X, self.fit_X, self.effective_metric
821 )
822 )
823 if use_pairwise_distances_reductions:
--> 824 results = ArgKmin.compute(
825 X=X,
826 Y=self.fit_X,
827 k=n_neighbors,
828 metric=self.effective_metric,
829 metric_kwargs=self.effective_metric_params,
830 strategy="auto",
831 return_distance=return_distance,
832 )
834 elif (
835 self._fit_method == "brute" and self.metric == "precomputed" and issparse(X)
836 ):
837 results = _kneighbors_from_graph(
838 X, n_neighbors=n_neighbors, return_distance=return_distance
839 )

File ~.conda\envs\IoTDevIDv2_0\lib\site-packages\sklearn\metrics_pairwise_distances_reduction_dispatcher.py:277, in ArgKmin.compute(cls, X, Y, k, metric, chunk_size, metric_kwargs, strategy, return_distance)
196 """Compute the argkmin reduction.
197
198 Parameters
(...)
274 returns.
275 """
276 if X.dtype == Y.dtype == np.float64:
--> 277 return ArgKmin64.compute(
278 X=X,
279 Y=Y,
280 k=k,
281 metric=metric,
282 chunk_size=chunk_size,
283 metric_kwargs=metric_kwargs,
284 strategy=strategy,
285 return_distance=return_distance,
286 )
288 if X.dtype == Y.dtype == np.float32:
289 return ArgKmin32.compute(
290 X=X,
291 Y=Y,
(...)
297 return_distance=return_distance,
298 )

File sklearn\metrics_pairwise_distances_reduction_argkmin.pyx:95, in sklearn.metrics._pairwise_distances_reduction._argkmin.ArgKmin64.compute()

File ~.conda\envs\IoTDevIDv2_0\lib\site-packages\sklearn\utils\fixes.py:139, in threadpool_limits(limits, user_api)
137 return controller.limit(limits=limits, user_api=user_api)
138 else:
--> 139 return threadpoolctl.threadpool_limits(limits=limits, user_api=user_api)
File ~.conda\envs\IoTDevIDv2_0\lib\site-packages\threadpoolctl.py:171, in threadpool_limits.__init__(self, limits, user_api)
167 def __init__(self, limits=None, user_api=None):
168 self._limits, self._user_api, self._prefixes = \
169 self._check_params(limits, user_api)
--> 171 self._original_info = self._set_threadpool_limits()

File ~.conda\envs\IoTDevIDv2_0\lib\site-packages\threadpoolctl.py:268, in threadpool_limits._set_threadpool_limits(self)
265 if self._limits is None:
266 return None
--> 268 modules = _ThreadpoolInfo(prefixes=self._prefixes,
269 user_api=self._user_api)
270 for module in modules:
271 # self._limits is a dict {key: num_threads} where key is either
272 # a prefix or a user_api. If a module matches both, the limit
273 # corresponding to the prefix is chosed.
274 if module.prefix in self._limits:

File ~.conda\envs\IoTDevIDv2_0\lib\site-packages\threadpoolctl.py:340, in _ThreadpoolInfo.__init__(self, user_api, prefixes, modules)
337 self.user_api = [] if user_api is None else user_api
339 self.modules = []
--> 340 self._load_modules()
341 self._warn_if_incompatible_openmp()
342 else:
File ~.conda\envs\IoTDevIDv2_0\lib\site-packages\threadpoolctl.py:485, in _ThreadpoolInfo._find_modules_with_enum_process_module_ex(self)
482 filepath = buf.value
484 # Store the module if it is supported and selected
--> 485 self._make_module_from_path(filepath)
486 finally:
487 kernel_32.CloseHandle(h_process)

File ~.conda\envs\IoTDevIDv2_0\lib\site-packages\threadpoolctl.py:515, in _ThreadpoolInfo._make_module_from_path(self, filepath)
513 if prefix in self.prefixes or user_api in self.user_api:
514 module_class = globals()[module_class]
--> 515 module = module_class(filepath, prefix, user_api, internal_api)
516 self.modules.append(module)

File ~.conda\envs\IoTDevIDv2_0\lib\site-packages\threadpoolctl.py:606, in _Module.__init__(self, filepath, prefix, user_api, internal_api)
604 self.internal_api = internal_api
605 self._dynlib = ctypes.CDLL(filepath, mode=_RTLD_NOLOAD)
--> 606 self.version = self.get_version()
607 self.num_threads = self.get_num_threads()
608 self._get_extra_info()

File ~.conda\envs\IoTDevIDv2_0\lib\site-packages\threadpoolctl.py:646, in _OpenBLASModule.get_version(self)
643 get_config = getattr(self._dynlib, "openblas_get_config",
644 lambda: None)
645 get_config.restype = ctypes.c_char_p
--> 646 config = get_config().split()
647 if config[0] == b"OpenBLAS":
648 return config[1].decode("utf-8")

AttributeError: 'NoneType' object has no attribute 'split'

02.2 Comparison of isolated and CV methods.ipynb , TypeError: Could not convert ['0 empty0 empty0 empty0 empty0 empty0 empty0 empty0 empty0 empty0 empty' 'DTDTDTDTDTDTDTDTDTDT'] to numeric

I found a workaround: replace any value that cannot be converted to a number with NaN.

This is my solution:

def average_values(name_list):
    flag = 1
    for i in name_list:
        df = pd.read_csv(i) 
        col = i[14:-4]  # Extract column name from file path

        # My change: any value that cannot be converted to numeric becomes NaN
        df = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))

        temp = pd.DataFrame(df.mean(), columns=[col])  # Assign column name to DataFrame
        
        if flag:
            std = temp
            flag = 0
        else:
            std[col] = temp[col]
    tt = std.T
    return tt

##isolated

name_list=find_the_way('./isolated/','.csv')
iso=average_values(name_list)
iso = iso.drop(['Dataset' , 'ML algorithm'],axis=1)# this is my change

name_list=find_the_way('./crossval/','.csv')
cv=average_values(name_list)
cv=cv.drop(['Dataset' , 'ML_algorithm'], axis=1)# this is my change
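To see what the coercion in this workaround does, here is a small stand-alone illustration on a hypothetical two-column frame (the column name `Acc` and the values are made up). It only demonstrates the behaviour of `pd.to_numeric` with `errors="coerce"`; it does not show whether the resulting averages match the paper.

```python
import pandas as pd

# Hypothetical results table: one numeric column stored as text, one label column.
df = pd.DataFrame({
    "Acc": ["0.90", "0.80"],
    "ML algorithm": ["DT", "DT"],
})

# Values that cannot be parsed as numbers become NaN instead of raising an error.
numeric = df.apply(lambda column: pd.to_numeric(column, errors="coerce"))
means = numeric.mean()
print(means)
```

The text column turns into all-NaN values, so its mean is NaN rather than an error, which would explain why the label columns are dropped after averaging.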

Is this correct? I ask because my graph does not match the one in the paper.

This is my result:

Paper result:

feature_extraction

Hello,
I have set up an IoT network consisting of three low-energy devices that support IPv6 and IEEE 802.15.4 MAC addresses. Upon inspecting your PCAP files, it appears that you are using only IPv4 devices. I noticed that when I included my MAC address (example: '01:12:74:01:00:01:01:01':'lab1') in the MAC list and executed the code, it appeared to be malfunctioning.

Could you please clarify whether your approach is based solely on IPv4 devices or if the code supports IPv6 devices that work with all low-energy protocols?

02.1 Feature importance voting and pre-assessment of features

I am sorry to trouble you, but this part of the code in file 02.1 does not run successfully. I followed the solution in the issues and ran pip install git+https://github.com/kahramankostas/XuniVerse, which reported: Successfully installed contourpy-1.2.0 cycler-0.12.1 fonttools-4.44.3 joblib-1.3.2 kiwisolver-1.4.5 matplotlib-3.8.2 packaging-23.2 patsy-0.5.3 pillow-10.1.0 pyparsing-3.1.1 scikit-learn-1.3.2 scipy-1.11.4 statsmodels-0.14.0 threadpoolctl-3.2.0 xverse-1.0.5. However, it did not take effect.

I hope for an early reply. Thanks!
My pandas version is 2.1.3.

AttributeError Traceback (most recent call last)
Cell In[22], line 14
12 clf = VotingSelector()
13 print(X, y)
---> 14 clf.fit(X, y)
15 #Selected features
16 temp="./results/"+i[18:-4]+"FI.csv"

File D:\Python_env\Lib\site-packages\xverse\ensemble_voting.py:224, in VotingSelector.fit(self, X, y)
222 #start training on the data
223 temp_X = X[self.use_features]
--> 224 self.feature_importances_, self.feature_votes_ = self.train(temp_X, y)
226 return self

File D:\Python_env\Lib\site-packages\xverse\ensemble_voting.py:285, in VotingSelector.train(self, X, y)
283 #handle categorical values with either 'woe' or 'le'
284 if self.handle_category == 'woe':
--> 285 transformed_X, self.mapping, iv_df = self.woe_information_value(X, y) #woe transformed_X
286 elif self.handle_category == 'le':
287 transformed_X = X.copy(deep=True)

File D:\Python_env\Lib\site-packages\xverse\ensemble_voting.py:115, in VotingSelector.woe_information_value(self, X, y)
112 def woe_information_value(self, X, y):
114 clf = WOE()
--> 115 clf.fit(X, y)
117 return clf.transform(X), clf.woe_bins, clf.iv_df

File D:\Python_env\Lib\site-packages\xverse\transformer_woe.py:137, in WOE.fit(self, X, y)
132 if self.monotonic_binning:
133 self.mono_bin_clf = MonotonicBinning(feature_names=self.mono_feature_names,
134 max_bins=self.mono_max_bins, force_bins=self.mono_force_bins,
135 cardinality_cutoff=self.mono_cardinality_cutoff,
136 prefix=self.mono_prefix, custom_binning=self.mono_custom_binning)
--> 137 X = self.mono_bin_clf.fit_transform(X, y)
138 self.mono_custom_binning = self.mono_bin_clf.bins
140 #identify the variables to tranform and assign the bin mapping dictionary

File D:\Python_env\Lib\site-packages\sklearn\utils_set_output.py:157, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
155 @wraps(f)
156 def wrapped(self, X, *args, **kwargs):
--> 157 data_to_wrap = f(self, X, *args, **kwargs)
158 if isinstance(data_to_wrap, tuple):
159 # only wrap the first output for cross decomposition
160 return_tuple = (
161 _wrap_data_with_container(method, data_to_wrap[0], X, self),
162 *data_to_wrap[1:],
163 )

File D:\Python_env\Lib\site-packages\xverse\transformer_binning.py:257, in MonotonicBinning.fit_transform(self, X, y)
256 def fit_transform(self, X, y):
--> 257 return self.fit(X, y).transform(X)

File D:\Python_env\Lib\site-packages\xverse\transformer_binning.py:122, in MonotonicBinning.fit(self, X, y)
118 raise ValueError("The input feature(s) should be numeric type. Some of the input features
119 has character values in it. Please use a encoder before performing monotonic operations.")
121 #apply the monotonic train function on dataset
--> 122 fit_X.apply(lambda x: self.train(x, y), axis=0)
123 return self

File D:\Python_env\Lib\site-packages\pandas\core\frame.py:10034, in DataFrame.apply(self, func, axis, raw, result_type, args, by_row, **kwargs)
10022 from pandas.core.apply import frame_apply
10024 op = frame_apply(
10025 self,
10026 func=func,
(...)
10032 kwargs=kwargs,
10033 )

> 10034 return op.apply().__finalize__(self, method="apply")

File D:\Python_env\Lib\site-packages\pandas\core\apply.py:837, in FrameApply.apply(self)
834 elif self.raw:
835 return self.apply_raw()
--> 837 return self.apply_standard()

File D:\Python_env\Lib\site-packages\pandas\core\apply.py:963, in FrameApply.apply_standard(self)
962 def apply_standard(self):
--> 963 results, res_index = self.apply_series_generator()
965 # wrap results
966 return self.wrap_results(results, res_index)

File D:\Python_env\Lib\site-packages\pandas\core\apply.py:979, in FrameApply.apply_series_generator(self)
976 with option_context("mode.chained_assignment", None):
977 for i, v in enumerate(series_gen):
978 # ignore SettingWithCopy here in case the user mutates
--> 979 results[i] = self.func(v, *self.args, **self.kwargs)
980 if isinstance(results[i], ABCSeries):
981 # If we have a view on v, we need to make a copy because
982 # series_generator will swap out the underlying data
983 results[i] = results[i].copy(deep=False)

File D:\Python_env\Lib\site-packages\xverse\transformer_binning.py:122, in MonotonicBinning.fit.<locals>.<lambda>(x)
118 raise ValueError("The input feature(s) should be numeric type. Some of the input features
119 has character values in it. Please use a encoder before performing monotonic operations.")
121 #apply the monotonic train function on dataset
--> 122 fit_X.apply(lambda x: self.train(x, y), axis=0)
123 return self

File D:\Python_env\Lib\site-packages\xverse\transformer_binning.py:170, in MonotonicBinning.train(self, X, y)
165 """
166 Execute this block when monotonic relationship is not identified by spearman technique.
167 We still want our code to produce bins.
168 """
169 if len(bins_X_grouped) == 1:
--> 170 bins = algos.quantile(X, np.linspace(0, 1, force_bins)) #creates a new binnning based on forced bins
171 if len(np.unique(bins)) == 2:
172 bins = np.insert(bins, 0, 1)

AttributeError: module 'pandas.core.algorithms' has no attribute 'quantile'

02.2 Comparison of isolated and CV methods.ipynb , ML_isolated(train,test,output_csv,features,step,flexible,i) give me error ValueError: could not convert string to float: 'b'

I found this in the Aalto train IoTDevID CSV data:

02.2 Comparison of isolated and CV methods.ipynb , ['sport'] col not found Aalto_train_IoTDevID.CSV data

I found this:

Is it possible to extend the feature set by adding the five columns related to 'sport'?

Note: I use the CSV files from CSVs.rar, not files produced by feature extraction.

02.1 Feature importance voting and pre-assessment of features.ipynb Can‘t run successfully

I am sorry to trouble you, but this part of the code in file 02.1 does not run successfully. I hope for an early reply. Thanks!