Deep Learning-Based Hybrid Intelligent Intrusion Detection System
1
Department of Information and Communication Engineering, Dongguk University, Seoul,
100-715, Korea
2
Department of Electronics Engineering, IoT and Big-Data Research Center, Incheon
National University, Incheon, Korea
*
Corresponding Author: Yangwoo Kim. Email: [email protected]
Received: 01 December 2020; Accepted: 24 January 2021
Abstract: Machine learning (ML) algorithms are often used to design effective
intrusion detection (ID) systems for appropriate mitigation and effective
detection of malicious cyber threats at the host and network levels. However,
cybersecurity attacks are still increasing. An ID system can play a vital role in
detecting such threats. Existing ID systems often fail to detect malicious
threats, primarily because they rely on traditional ML techniques that pay
little attention to accurate classification and feature selection. Thus,
developing an accurate and intelligent ID system is a priority. The main
objective of this study was to develop a hybrid intelligent intrusion
detection system (HIIDS) that learns crucial feature representations
efficiently and automatically from massive unlabeled raw network traffic data.
Many ID datasets are publicly available to the cybersecurity research
community. We used Spark MLlib (machine learning library)-based robust
classifiers, such as logistic regression (LR) and extreme gradient boosting
(XGB), for anomaly detection, and a state-of-the-art DL technique, a long
short-term memory autoencoder (LSTMAE), for misuse attack detection, to
develop an efficient HIIDS that detects and classifies unpredictable attacks.
Our approach used the LSTM to detect temporal features and the AE to detect
global features more efficiently. To evaluate the efficacy of the proposed
approach, experiments were conducted on a publicly available dataset, the
contemporary real-life ISCX-UNB dataset. The simulation results demonstrate
that the proposed Spark MLlib and LSTMAE-based HIIDS significantly
outperformed existing ID approaches, achieving an accuracy of up to 97.52% on
the ISCX-UNB dataset under 10-fold cross-validation. The proposed HIIDS is a
promising candidate for large-scale use in real-world settings.
1 of 18 8/16/22, 13:21
Deep Learning-Based Hybrid Intelligent Intrusion Detect... https://www.techscience.com/cmc/v68n1/41825/html
1 Introduction
An intrusion detection (ID) system is a well-established solution for detecting
malicious activities in a network. The types of malicious network attacks have
grown exponentially, and the ID system has become an essential component of
defense alongside the network security infrastructure. In 1980, James P.
Anderson published the first significant paper on ID, Computer Security Threat
Monitoring and Surveillance [1]. An ID system usually monitors all internal and
external packets of a network to detect whether a packet shows signs of
intrusion. A well-designed ID system can determine the properties of numerous
malicious activities and automatically respond to them by sending alerts.
In general, there are three common classes of ID system, distinguished by
detection approach. The first is the signature-based system (SBS), which
includes the misuse detection technique. The second is the anomaly-based
system (ABS), also known simply as “anomaly.” The third is stateful protocol
analysis detection [2]. An SBS relies on pattern matching: it compares the
signatures present in the observed data against a database of known attack
signatures and raises an alarm when a match is identified. Because an SBS
detects attacks based on existing knowledge, the misuse detection technique is
also known as a knowledge-based technique. It features a low false alarm rate
and high accuracy; it cannot, however, identify previously unseen attacks. The
behavior-based ID system, also known as ABS, detects intrusions by comparing
observed behavior against a profile of normal behavior and flagging
deviations. The stateful protocol ID method compares observed protocol
activity against known malicious activities and identifies deviations in
protocol behavior, taking advantage of both anomaly and signature-based ID
techniques.
ID systems can be further categorized into three types according to their
architectures: the network-based detection system (NIDS), the host-based
detection system (HIDS), and the hybrid approach [3]. In a HIDS, the software
is installed on the host computer, which plays an important role in evaluating
and monitoring system behavior, and event log files play an active role in ID
[4]. Unlike a HIDS, which analyzes each host separately, a NIDS analyzes the
packets that flow over the network. This gives the NIDS an edge over the HIDS
because it can monitor the whole network with a single system. However, while
the NIDS is superior in terms of installation cost and deployment time, it is
vulnerable to intrusions that spread through the network and affect it as a
whole. The hybrid IDS combines the HIDS and NIDS to provide stronger security
mechanisms. It combines spatial sensors to identify vulnerabilities that can
occur at a particular point or over the whole network.
ID systems are also divided into two types according to their deployment
structure: distributed and non-distributed. A distributed structure involves
several ID subsystems that communicate with each other over an extensive
network. In contrast, a non-distributed system is installed at a single
location; the open-source Snort is an example.
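The contrast between signature-based and anomaly-based detection described above can be illustrated with a minimal Python sketch. The signature list, the baseline values, and the threshold of three standard deviations are all hypothetical choices made for illustration; they are not taken from this paper or from any real ID product.

```python
from statistics import mean, stdev
from typing import Sequence

# Hypothetical signature database: byte patterns of known attacks.
KNOWN_SIGNATURES = [b"/etc/passwd", b"' OR 1=1"]


def signature_alert(payload: bytes) -> bool:
    """Misuse (signature-based) check: alert when a known pattern is present."""
    return any(sig in payload for sig in KNOWN_SIGNATURES)


def anomaly_alert(value: float, baseline: Sequence[float], k: float = 3.0) -> bool:
    """Anomaly-based check: alert when a measurement deviates from the
    normal baseline by more than k sample standard deviations."""
    mu, sigma = mean(baseline), stdev(baseline)
    return abs(value - mu) > k * sigma


# Illustrative baseline of packets per second observed during normal operation.
normal_rate = [100, 102, 98, 101, 99]
print(signature_alert(b"GET /etc/passwd HTTP/1.1"))  # known pattern
print(anomaly_alert(500, normal_rate))               # far from the baseline
```

The signature check can only flag payloads that contain a stored pattern, whereas the anomaly check can flag behavior it has never seen, at the cost of depending on how well the baseline represents normal traffic.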
Most approaches currently used in ID systems are unable to deal with the
complex and dynamic nature of malicious threats on computer networks.
Therefore, effective adaptive methods, such as several ML techniques, can
achieve a higher intrusion detection rate (DR), a low false alarm rate (FAR),
art DL techniques, such as LSTMAE, which merges both shallow and deep
networks to overcome their analytical overheads and exploit their benefits.
This HIIDS investigates how to solve the class imbalance problem that
usually occurs in ISCX ID datasets.
• Further investigation of the packet capture file directly on Spark; prior
studies did not evaluate the raw packet dataset.
• Comparison of the HIIDS with other conventional ML methods. The
simulation results demonstrate that the HIIDS approach is highly appropriate
for malicious traffic detection. It has higher attack detection accuracy and
was found to correctly detect network misuses in 97.52% of the cases
through 10-fold cross-validation.
The rest of this paper is organized as follows. The background of ID and related
work are briefly reviewed in Section 2. A brief overview of our proposed HIIDS
and a detailed description of the dataset that was used for classification are
provided in Section 3. A simulation of our proposed framework with performance
metrics is discussed in Section 4. The paper is concluded with a possible direction
for future work described in Section 5.
2 Related Work
Over the last two decades, several researchers have proposed applying machine
learning (ML) and deep learning (DL) to intrusion detection (ID) systems, and
various models have been developed for network intrusion detection (NID) using
conventional ML techniques. Examples include K nearest neighbors (KNN), as
suggested by Khammassi et al. [11]; logistic regression (LR), as suggested by
Moustafa et al. [12]; support vector machine (SVM), as suggested by Khan et al.
[13]; random forest (RF), as suggested by Farnaaz et al. [14]; decision tree
(DT), as suggested by Sindhu et al. [15]; Naïve Bayes (NB), as suggested by
Buczak et al. [16]; and artificial neural networks (ANN), as suggested by
Vincent et al. [17]. However, these techniques demonstrate inadequate
classification performance, with a high false alarm rate (FAR) and a low attack
detection rate (DR). Kim et al. [18] developed a hybrid system that
incorporates misuse and anomaly detection using the supervised ML classifiers
SVM and DT, respectively, and assessed it using the older NSL_KDD data. The
authors attributed the improved attack detection accuracy to the hybrid ID
system. Paulauskas et al. [19] developed an ensemble approach for ID that
combines several weak learners, each of which has low detection accuracy on its
own; the weak learners used were J48, C5.0, Naïve Bayes, and rule-based
classifiers. Zaman et al. [20] used an improved ID algorithm known as the
enhanced support vector decision function (ESVDF) and evaluated their proposed
IDS using the DARPA dataset; the proposed IDS was found to be superior to other
conventional ID approaches.
Although the above ID approaches have demonstrated reasonable accuracy,
improvements such as decreasing the FAR and increasing the ID accuracy are
still necessary. In this regard, DL is a powerful technique. DL is a branch of
ML that has become increasingly dominant in various fields, such as speech
recognition and natural language processing (NLP). DL’s popularity is due to
two fundamental characteristics: (a) hierarchical feature representations and
(b) handling of long-term dependencies of sequential
patterns. Today, state-of-the-art DL approaches that are used for NID include
auto-encoders (AE), deep belief networks (DBNs), deep neural networks (DNNs),
and restricted Boltzmann machines (RBMs) as well as variants of these
approaches. An overview of the state-of-the-art approaches is presented in Tab. 1.
DL has been shown to exceed conventional approaches in attack detection
accuracy in the ID domain [31]. Erfani et al. [32] developed an approach that
combined a one-class linear SVM with a DBN for ID and evaluated it on various
benchmark ID datasets. Fiore et al. [33] proposed a discriminative RBM that
learns compressed attributes from the data attributes; these compressed
attributes are then fed into a softmax classifier for binary classification of
benign and malicious network behaviors. Wang et al. [34] presented an AE-based
DL IDS for detecting network traffic from the raw dataset and achieved a very
high ID performance. Javaid et al. [35] used the DNN technique for anomaly
detection; their evaluation found the DNN to be an effective approach for ID in
software-defined networks (SDNs). Yin et al. [36] introduced a neural network
(NN) DL-based NIDS, which was tested using the NSL_KDD dataset and found to
outperform conventional ML-based classification techniques. Khan et al.
[37] presented the hybrid DL approach for ID and applied it to real-time ID data.
Their simulation outcomes showed that the hybrid DL-based IDS was superior in
terms of attack classification accuracy and performance. Alrawashdeh et al. [38]
proposed a DL-based IDS using the DBN of RBM with four and one hidden layers
for attribute reduction purposes; the weights of the DBN were restructured during
fine-tuning, and attack classification was accomplished using an LR classifier. The
developed methodology was evaluated on the benchmark KDD99 data and
attained an attack classification accuracy of up to 97.9% with a FAR of 0.5%.
However, attack classification accuracy evaluated on such a dated dataset is
not sufficient to show that this is a robust approach for NID. Shone et al. [39]
proposed a non-symmetric deep AE-based ID and evaluated the proposed
framework with the benchmark KDD99 dataset, achieving an attack classification
accuracy of 97.87% and a FAR of 2.15%. In [40], the authors built a deep
neural network (DNN) with 100 hidden units. To improve performance, they
trained it on a GPU using the KDD99 dataset. The authors suggested that
recurrent neural network (RNN) and long short-term memory (LSTM) models are
better suited to enhancing attack detection accuracy. These DL-based ID systems
were found to be superior to traditional approaches; the authors also presented
various ideas for combining DL and ML techniques, with the primary goal of
developing an efficient and robust ID system. Wang et al. [41] developed a novel
approach for ID by combining fuzzy clustering and ANN; they tested this hybrid
approach on the KDD99 dataset and demonstrated that their FC-ANN approach
outperforms traditional ML approaches in terms of ID. Mukkamala et al. [42]
used a hybrid approach combining SVM and ANN and evaluated it on the benchmark
KDD99 dataset. Here, SVM and ANN were used for classification tasks and data
patterns, respectively. Various researchers have used the ISCX-2012 ID dataset
for system validation. However, there is still much room for improvement, such
as increasing attack detection accuracy and reducing FAR [43–48]. Various
scholars have developed the ABS and the SBS using separate classification
methods. These methods fail to provide efficient attack detection, so
developing a hybrid ID system is an important research challenge. ML-based
techniques have mostly been used to develop an ABS, which builds a model by
comparing normal with abnormal behavior and then attempts to classify new
incoming packets as “attack” or “normal.” DL is enormously valuable for ID
systems because it automatically extracts features for the specific problem
without requiring extensive prior knowledge. The main downside of using a DL
model in the ID domain is the training time; obtaining the right model is
time-consuming.
The issue of class imbalance has drawn substantial attention from the research
community [49]. Class imbalance arises from an uneven data distribution: one
class contains most of the samples, while the others contain comparatively few.
The classification problem becomes more complicated as data dimensionality
increases due to unbounded data values and unbalanced classes. Bedi et al. [50]
utilized numerous ML approaches to deal with the class imbalance issue. Thabtah
et al. [51] also evaluated various approaches to the class imbalance problem.
Most algorithms focus on the majority samples and miss the minority samples,
even though minority samples appear rarely but persistently. The main
techniques for addressing unbalanced data are data preprocessing and feature
selection, and each has both benefits and shortcomings. The ID dataset has a
high-dimensional imbalance problem that includes missing features of interest,
missing feature values, or the presence of only aggregate data. The data are
also noisy, containing errors and outliers, and inconsistent, with
discrepancies in codes or names. We used over-sampling to resolve the
imbalance; this involved enlarging the number of instances in the minority
class by randomly replicating them to increase the presence of the minority
class in the sample. Although this procedure carries some risk of overfitting,
no information was lost, and the over-sampling approach was found to outperform
the under-sampling alternative.
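The random over-sampling step described above can be sketched in a few lines of Python. This is a minimal illustration that assumes in-memory lists of samples and labels; the function name `random_oversample` and the toy data are hypothetical and are not the code used in the experiments.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Replicate randomly chosen minority-class samples until every class
    has as many samples as the largest class. No original sample is removed."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        # Draw with replacement from the existing samples of this class.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_x.extend(extra)
        out_y.extend([cls] * len(extra))
    return out_x, out_y


# Toy data: four benign records and one malicious record.
X = [[0], [1], [2], [3], [9]]
y = ["normal", "normal", "normal", "normal", "attack"]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Because the added records are exact copies, the balanced set carries no new information, which is why the overfitting risk mentioned above applies. In practice the replication would normally be applied to the training portion only, so that copies of a record never appear in both training and evaluation data.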
With the accelerated growth of big data, DL approaches have flourished and
have been widely utilized in numerous domains. In contrast to previous studies,
we took a hybrid approach, the Anomaly-Misuse ID method, with two-stage
classification to overcome the limitations faced by separate classification
methods. We used Spark MLlib and the LSTMAE DL approach for ID on the
well-known, contemporary real-time dataset ISCX-2012.
3 Proposed Approach
Fig. 1 presents the proposed ID framework. It comprises two learning stages.
For this HIIDS, we planned to construct a two-stage ID system, in such a way that
The architecture of the hybrid IDS is shown in Fig. 1. Network traffic was
first arranged and preprocessed. During preprocessing, all necessary
conversions were made for both the Stage-1 Spark MLlib and Stage-2 LSTMAE-based
modules of the HIIDS, since each stage has its own supported data format. For
our hybrid ID experiment, we used 1,512,000 network traffic packets obtained
from the ISCX-2012 dataset to demonstrate the effectiveness of the proposed
HIIDS.
3.2 Datasets
Choosing a suitable ID dataset plays a significant role in testing the ID system;
therefore, the simulation of the proposed HIIDS approach was carefully
planned.
Fig. 1. The ISCX-2012 dataset was evaluated; after preprocessing, it comprised
seven consecutive days of traffic captured under systematic and practical
conditions reflecting network attacks. The data were labeled as malicious or
benign streams, with 68,792 and 2,381,532 records in the respective classes.
Attacks were identified in the initial traffic data, which were divided into
two classes: benign/normal and abnormal/malicious.
Furthermore, several multi-stage malicious intrusion scenarios were executed to
generate various attack traces (e.g., HTTP, DoS, brute force SSH, infiltration from
the interior, DDoS via an IRC botnet). The detailed descriptions of training and
testing data distributions are presented in Tabs. 3 and 4.
The core idea of this research was to evaluate the reliability of the hybrid
system against anomalies and the unknown, via the misuse approach. Tab. 4
presents the testing and training network traffic data for misuse attack detection
using a state-of-the-art DL approach, such as an LSTMAE.
ISCX-2012 dataset with higher ID accuracy and low FAR. 80% of the data was used
for training with 10-fold cross-validation, and the model was evaluated on the
remaining 20% held-out data.
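The evaluation protocol stated above (an 80/20 split, with 10-fold cross-validation inside the training portion) can be sketched with index lists in plain Python. The helper names `holdout_split` and `k_fold` are hypothetical, and the sketch does not stratify by class, which the paper may or may not have done.

```python
import random


def holdout_split(n, test_fraction=0.2, seed=0):
    """Shuffle record indices 0..n-1 and set aside a held-out test portion."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]  # (training indices, held-out indices)


def k_fold(indices, k=10):
    """Yield (train, validation) index lists; each fold validates once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, val


train_idx, test_idx = holdout_split(1000)
for fold_no, (tr, val) in enumerate(k_fold(train_idx, k=10), start=1):
    # A model would be fitted on `tr` and validated on `val` here.
    pass
print(len(train_idx), len(test_idx))
```

The held-out indices are never touched during cross-validation, so the final evaluation on them reflects data the model selection process has not seen.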
Research has shown that LSTM demonstrates high confidence and effectiveness
for resolving issues of video classification [58], sentiment analysis [59], emotion
recognition [60], and abnormal activities [61].
In this module, the LSTMAE is used as the misuse attack detection technique.
It aims to further categorize the abnormal data from Stage-1 into the
corresponding attack classes: DoS, Scan, HTTP, and R2L. For misuse ID, the
LSTMAE was first trained on the abnormal traffic to create a model that
provides a baseline profile for abnormal traffic. A test set is then input to
the trained model, which determines whether the traffic is malicious (abnormal)
or normal. An alarm goes off when a match is found. Compared with hand-crafted
techniques, the LSTMAE can extract more internal information effectively.
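The two-stage flow described above, in which only traffic flagged as abnormal at Stage 1 is passed to the Stage-2 misuse classifier, can be sketched as a small routing function. The two callables stand in for the trained Stage-1 (LR or XGB) and Stage-2 (LSTMAE) models; the toy rules and the record fields `bytes` and `pkts` are hypothetical placeholders, not the models or features used in the paper.

```python
from typing import Any, Callable

# Attack classes named in the text for Stage 2.
ATTACK_CLASSES = ("DoS", "Scan", "HTTP", "R2L")


def hybrid_detect(
    record: Any,
    is_abnormal: Callable[[Any], bool],
    classify_attack: Callable[[Any], str],
) -> str:
    """Stage 1 decides normal vs. abnormal; Stage 2 labels abnormal traffic."""
    if not is_abnormal(record):
        return "normal"
    label = classify_attack(record)
    if label not in ATTACK_CLASSES:
        raise ValueError(f"unexpected attack class: {label!r}")
    return label


# Toy stand-ins for the trained models (assumptions for illustration only).
def stage1(record):
    return record["bytes"] > 1000


def stage2(record):
    return "DoS" if record["pkts"] > 100 else "Scan"


print(hybrid_detect({"bytes": 200, "pkts": 3}, stage1, stage2))
print(hybrid_detect({"bytes": 5000, "pkts": 400}, stage1, stage2))
```

Keeping the stages behind plain callables means either model can be replaced without changing the routing logic, and the more expensive Stage-2 model only runs on the traffic that Stage 1 flags.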
4 Experimental Evaluations
This section discusses the experimental results in detail. The performance of
the proposed HIIDS was analyzed through experiments on the ISCX-2012 ID
dataset, in terms of normal and attack classification, false positives, false
negatives, true positives, attack detection accuracy, and error rate.
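The quantities listed above are related through the confusion matrix. The sketch below uses the standard definitions of accuracy, detection rate (recall), false alarm rate, precision, and F1 score; the paper's own formulas are not reproduced in this section, so these definitions are an assumption, and the counts in the example are made up for illustration.

```python
def detection_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts.

    tp: attacks flagged as attacks      fn: attacks missed
    tn: normal traffic passed           fp: normal traffic flagged as attack
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    detection_rate = tp / (tp + fn)      # also called recall or TPR
    false_alarm_rate = fp / (fp + tn)    # also called FPR
    precision = tp / (tp + fp)
    f1 = 2 * precision * detection_rate / (precision + detection_rate)
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "detection_rate": detection_rate,
        "false_alarm_rate": false_alarm_rate,
        "precision": precision,
        "f1": f1,
    }


# Illustrative counts only; these are not results from the paper.
for name, value in detection_metrics(tp=90, tn=880, fp=20, fn=10).items():
    print(f"{name}: {value:.4f}")
```

On a heavily imbalanced dataset, accuracy alone can look high even when many attacks are missed, which is why the detection rate and false alarm rate are reported alongside it.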
However, the most significant improvement that we observed was with
state-of-the-art DL approaches such as LSTMAE, which correctly identified
misuse in up to 97.0% of cases. This improvement was due to the extraction of
temporal features by the LSTM and of more important internal information by
the AE.
In this article, the HIIDS was developed using Spark MLlib and the LSTMAE deep
learning approach as an efficient cybersecurity method. We trained the HIIDS
using the ISCX-2012 dataset. We implemented the HIIDS using several robust
classification algorithms, such as LR and XGB, for anomaly detection at
Stage 1 and the LSTMAE deep learning technique for misuse detection at Stage 2.
The proposed HIIDS, based on DL classification, combines the benefits of both
signature-based (SB) and anomaly-based (AB) approaches, reducing
computational complexity and increasing ID accuracy and DR.
Both conventional ML and LSTMAE deep learning models were evaluated using
well-known classification metrics, such as F1 score, precision, recall, DR, and
classification accuracy.
We believe that our approach can be extended to other domains in the future.
For example, misuse and anomalies could be recognized in real-time image data,
with an emphasis on exploring deep learning as a feature extraction mechanism
that learns informative data representations for other anomaly detection
problems in modern real-time datasets.
Conflicts of Interest: The authors declare that they have no conflicts of interest
to report regarding the present study.
References
1. X. C. Shen, J. X. Du and F. Zhang. (2018). “An intrusion detection system
using a deep neural network with gated recurrent units,” IEEE Access, vol. 6, pp.
48697–48707.
2. K. Liu, S. Xu, G. Xu, M. Zhang, D. Sun et al. (2020). “A review of android
malware detection approaches based on machine learning,” IEEE Access, vol. 8,
pp. 124579–124607.
3. M. A. Khan and J. Kim. (2020). “Toward developing efficient Conv-AE-based
intrusion detection system using the heterogeneous dataset,” Electronics, vol. 9,
no. 11, pp. 1–17.
4. J. Kim and H. Kim. (2017). “An effective intrusion detection classifier using
long short-term memory with gradient descent optimization,” in Proc. Platform
Technology and Service (PlatCon), Busan, South Korea, pp. 1–5.
5. G. E. Hinton, S. Osindero and Y. W. Teh. (2006). “A fast learning algorithm for
deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554.
6. H. Alqahtani, I. H. Sarker, A. Kalim, S. M. Hossain, S. Ikhlaq et al. (2020).
“Cyber intrusion detection using machine learning classification techniques,” in
Proc. Computing Science, Communication and Security, Gujarat, India, pp.
121–131.
7. N. Kaloudi and L. Jingyue. (2020). “The AI-based cyber threat landscape: A
survey,” ACM Computing Surveys, vol. 53, no. 1, pp. 1–34.
8. B. Li, Y. Wu, J. Song, R. Lu, T. Li et al. (2020). “DeepFed: Federated deep
learning for intrusion detection in industrial cyber-physical systems,” IEEE
Transactions on Industrial Informatics, vol. 1, pp. 1–10.