Cyber Intrusion Detection Using Machine Learning Classification Techniques
Hamed Alqahtani1,7 , Iqbal H. Sarker2(B) , Asra Kalim3 , Syed Md. Minhaz Hossain2,4 ,
Sheikh Ikhlaq5 , and Sohrab Hossain2,6(B)
1 King Khalid University, Abha, Saudi Arabia
2 Chittagong University of Engineering and Technology, Chittagong, Bangladesh
[email protected]
3 Jazan University, Jizan, Saudi Arabia
4 Premier University, Chittagong, Bangladesh
5 Accenture Solutions Private Limited, Mumbai, India
6 East Delta University, Chittagong, Bangladesh
[email protected]
7 Macquarie University, Sydney 2109, Australia
1 Introduction
In recent years, cyber-security and protection against numerous cyber-attacks have become
a pressing concern. The main reason is the tremendous growth of computer networks and the
vast number of applications used by individuals and groups for personal or commercial
purposes, especially since the adoption of the Internet-of-Things (IoT). Cyber-attacks cause
severe damage and financial losses in large-scale networks [25]. Existing solutions such as
hardware and software firewalls, user authentication, and data encryption are not sufficient
to meet the upcoming demand and cannot protect computer networks against many
cyber-threats. These conventional security structures are inadequate safeguards because
intrusion techniques evolve rapidly [13, 26, 27]. A firewall only controls access between
networks; it gives no signal in the case of an internal attack. It is therefore essential to
develop accurate defense techniques, such as a machine learning-based intrusion detection
system (IDS), for the system’s security.
In general, an intrusion detection system (IDS) is a system or software that detects
malicious activities and policy violations in a network or system. An IDS identifies
inconsistencies and abnormal behavior during the day-to-day operation of a network or
system and is used to detect risks or attacks related to network security, such as
denial-of-service (DoS). An intrusion detection system also helps to locate, assess, and
control unauthorized system behavior such as unauthorized access, modification, or
destruction [12, 31]. From the user perspective, there are different types of intrusion
detection systems, for instance host-based and network-based IDS [25]. Their scope ranges
from single computers to large networks. A host-based intrusion detection system (HIDS)
resides on an individual system and monitors operating system files for inconsistencies and
abnormal activity. In contrast, a network intrusion detection system (NIDS) examines
connections in the network for unwanted traffic. There are also two approaches to detection:
signature-based and anomaly-based [25, 18]. A signature-based IDS looks for known byte
patterns in network traffic, such as the malicious instruction sequences used by malware.
The term comes from antivirus software, which refers to such detected patterns as signatures.
A signature-based IDS cannot detect attacks for which no pattern is available in advance. An
anomaly-based IDS examines the behavior of the network, automatically builds a data-driven
model that profiles the expected behavior, and detects deviations from that profile as
anomalies [18]. The merit of an anomaly-based IDS is that it can detect new and previously
unseen inconsistencies or cyber-attacks, such as denial-of-service.
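The contrast between the two detection approaches can be illustrated with a minimal Python sketch. It is not the implementation of any particular IDS: the byte patterns in SIGNATURES, the connections-per-minute profile, and the three-standard-deviation threshold are all hypothetical values chosen for illustration.

```python
from statistics import mean, stdev

# Hypothetical byte signatures; a real signature database is far larger.
SIGNATURES = {
    "example-nop-sled": b"\x90\x90\x90\x90",
    "example-traversal": b"../../etc/passwd",
}


def match_signatures(payload):
    """Signature-based detection: name every known pattern found in the payload."""
    return [name for name, pattern in SIGNATURES.items() if pattern in payload]


class AnomalyDetector:
    """Anomaly-based detection: profile normal behavior, then flag deviations."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold  # allowed distance from the mean, in standard deviations
        self.mu = 0.0
        self.sigma = 1.0

    def fit(self, normal_values):
        """Build the profile from measurements taken under normal operation."""
        self.mu = mean(normal_values)
        self.sigma = stdev(normal_values) or 1.0  # guard against a zero spread

    def is_anomalous(self, value):
        return abs(value - self.mu) / self.sigma > self.threshold


# A payload containing a known pattern is caught; an unknown one is not.
known = match_signatures(b"GET /../../etc/passwd HTTP/1.1")
unknown = match_signatures(b"GET /index.html HTTP/1.1")

# Assumed profile: connections per minute observed under normal load.
detector = AnomalyDetector()
detector.fit([98, 102, 100, 97, 103, 101, 99, 100])
flood = detector.is_anomalous(5000)   # far outside the profile
regular = detector.is_anomalous(101)  # within the profile
```

The signature matcher returns nothing for a payload it has no pattern for, whereas the anomaly detector needs no pattern at all: it flags the flood only because the value deviates from the learned profile.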
Developing computational methods to identify various cyber-attacks requires analyzing
different incident patterns and eventually predicting threats from cyber-security data.
Such a system is known as a data-driven intelligent intrusion detection system [25].
To build a data-driven intrusion detection model, knowledge of artificial intelligence,
particularly machine learning techniques, is essential. However, predicting cyber-attacks
with machine learning algorithms is challenging because different classifiers produce
different results in different contexts, depending on the characteristics of the data [23].
For this reason, we analyze several machine learning algorithms for intrusion detection
using cyber-security data. For this purpose, we employ various popular machine learning
classification techniques, such as Bayesian Network (BN), Naive Bayes (NB), Random Forest
(RF), Decision Tree (DT), Random Tree (RT), Decision Table (DTb), and Artificial Neural
Network (ANN), for providing intelligent services in the domain of cyber-security,
particularly for intrusion detection. Finally, the effectiveness is tested by conducting
numerous experiments on
(DT), Random Tree (RT), Decision Table (DTb), and Artificial Neural Network (ANN)
for classifying cyber-attacks and make a comparative analysis with experiments.
This section presents our data-driven IDS model, which is built with several machine
learning techniques. It incorporates several steps: dataset exploration, data processing,
and machine learning-based security modeling. We discuss these steps in order below.
– DoS: Denial of service (DoS) is a kind of attack in which legitimate users are denied
access to system and network resources. Services such as online banking and email may
be affected. DoS attacks include the SYN flood attack and the Smurf attack.
– R2L: Remote to Local (R2L) is an attack in which an attacker tries to gain access to the
victim machine without having an account on it.
– U2R: User to Root (U2R) is an attack in which an attacker who already has local access
to the victim machine tries to gain root privileges.
– PROBE: In a Probe attack, the attacker targets a host and tries to gather information
about it.
We first prepare the dataset, including these attack categories and the available attributes,
for developing machine learning-based IDS models. The dataset uses four types of features:
Basic Features, Content Features, Time-based Traffic Features, and Host-based Traffic
Features. The attributes are extracted from TCP/IP connections. Traffic features are computed
over a window interval and fall into two groups, ‘same host features’ and ‘same service
features’; both are called time-based features. Some probing attacks, however, scan more
slowly than the time window, at intervals longer than 2 s. To handle this case, the ‘same
host features’ and ‘same service features’ are recomputed over a window of connections
instead; these are called connection-based features. DoS and probing attacks may involve
several connections to one or more hosts within a period. Table 2 summarizes these
categories of attacks. In contrast, Remote to Local (R2L) and User to Root (U2R) attacks
generally require only a single connection. Content-based features
have been used to detect these attacks. We then process these features according to the
requirements and design the target machine learning-based IDS model. This data-driven,
pattern-based decision analysis plays a useful role in providing intelligent
cyber-security services.
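The difference between time-based and connection-based traffic features can be sketched in Python as follows. This is an illustration under stated assumptions, not the dataset’s actual feature extractor: the Conn record, the function names, and the window size of 100 connections are hypothetical, and the 2 s time window is taken from the probing example above.

```python
from collections import namedtuple

# Hypothetical minimal connection record; real records carry many more attributes.
Conn = namedtuple("Conn", "time dst_host service")


def _same_host_and_service(recent, current):
    """Count connections in `recent` that share the current host or service."""
    same_host = sum(c.dst_host == current.dst_host for c in recent)
    same_service = sum(c.service == current.service for c in recent)
    return same_host, same_service


def time_based_counts(history, current, window_s=2.0):
    """'Same host' and 'same service' counts over the last window_s seconds."""
    recent = [c for c in history if 0 <= current.time - c.time <= window_s]
    return _same_host_and_service(recent, current)


def connection_based_counts(history, current, window_n=100):
    """The same counts over the last window_n connections, whatever their age."""
    return _same_host_and_service(history[-window_n:], current)


# A slow probe: one connection to the same host every 5 s, each on a new service.
history = [
    Conn(0.0, "10.0.0.5", "http"),
    Conn(5.0, "10.0.0.5", "ftp"),
    Conn(10.0, "10.0.0.5", "smtp"),
]
current = Conn(15.0, "10.0.0.5", "telnet")

slow_scan_time = time_based_counts(history, current)        # nothing inside the 2 s window
slow_scan_conn = connection_based_counts(history, current)  # all three earlier probes counted
```

In this toy trace the time-based counts miss the scan entirely, while the connection-based counts still see three earlier connections to the same host, which is the motivation for recomputing the features over a connection window.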
– Artificial Neural Network: In addition to the above classical machine learning
techniques, we also consider a neural network learning model. The most commonly used
neural network architecture is the Multilayer Perceptron, which has an input layer
consisting of several inputs, one or more hidden layers that typically use sigmoid
activation functions, and one output layer that predicts the attack. The network is
trained with backpropagation [8, 29].
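To make the Multilayer Perceptron description concrete, the following is a minimal pure-Python sketch of a network with one sigmoid hidden layer and one sigmoid output unit, trained by backpropagation on a squared-error loss. It is not the configuration used in the experiments: the class name TinyMLP, the layer sizes, the learning rate, the number of epochs, and the four toy samples are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyMLP:
    """One hidden layer of sigmoid units and a single sigmoid output unit."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                         for _ in range(n_hidden)]
        self.b_hidden = [0.0] * n_hidden
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b_out = 0.0

    def forward(self, x):
        """Return the hidden activations and the output activation for input x."""
        hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w_hidden, self.b_hidden)]
        output = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden)) + self.b_out)
        return hidden, output

    def train_step(self, x, target, lr=0.5):
        """One backpropagation update for the loss 0.5 * (output - target) ** 2."""
        hidden, output = self.forward(x)
        # Error signal at the output unit (loss derivative times sigmoid derivative).
        delta_out = (output - target) * output * (1.0 - output)
        # Propagate the error to the hidden units before the output weights change.
        delta_hidden = [delta_out * w * h * (1.0 - h)
                        for w, h in zip(self.w_out, hidden)]
        for j, h in enumerate(hidden):
            self.w_out[j] -= lr * delta_out * h
        self.b_out -= lr * delta_out
        for j, row in enumerate(self.w_hidden):
            for i, xi in enumerate(x):
                row[i] -= lr * delta_hidden[j] * xi
            self.b_hidden[j] -= lr * delta_hidden[j]


def mean_squared_error(model, samples):
    return sum((model.forward(x)[1] - t) ** 2 for x, t in samples) / len(samples)


# Toy data (assumed): two scaled features per connection; 1 = attack, 0 = normal.
samples = [([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.9, 0.8], 1), ([0.8, 0.9], 1)]
model = TinyMLP(n_inputs=2, n_hidden=3)
loss_before = mean_squared_error(model, samples)
for _ in range(2000):
    for x, target in samples:
        model.train_step(x, target)
loss_after = mean_squared_error(model, samples)
```

A multi-class IDS would use one output unit per attack class rather than the single unit shown here; the single-output form keeps the update rule short enough to read in full.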
We discuss our machine learning-based intrusion detection model, which is built on
four main components:
– Attack Class Label: Each distinct type of threat is treated as a separate class label
in the intrusion detection model. For instance, the attack types DoS, U2R, R2L, and
PROBE shown in Table 2 are represented as the distinct classes Class 1, Class 2,
Class 3, and Class 4, respectively.
– Security Features or Attributes: These are the independent variables used to predict
the above cyber threats. They include features such as protocol type, service, duration,
and error rate, shown in Table 1, on which the cyber-attack class labels depend.
– Training and Testing Dataset: The dataset is split into two parts: a training dataset
and a testing dataset. The training dataset is used to train the IDS model, and the
testing dataset is used to evaluate how well that model generalizes. We use the larger
portion of the cybersecurity data mentioned above for developing the IDS model and the
rest for testing purposes.
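The class-label encoding and the train/test split described in these components can be sketched in Python as below. The class numbers follow the mapping given above; everything else is an assumption made for illustration: the specific attack names in ATTACK_CATEGORY, the 80/20 split ratio, the fixed seed, and the toy records. Normal (non-attack) traffic would need its own class and is left out here.

```python
import random

# Class numbers as given in the component list above.
ATTACK_CLASS = {"DoS": 1, "U2R": 2, "R2L": 3, "PROBE": 4}

# Assumption: a few specific attack names and the category each belongs to.
ATTACK_CATEGORY = {
    "smurf": "DoS",
    "neptune": "DoS",  # a SYN flood attack
    "buffer_overflow": "U2R",
    "guess_passwd": "R2L",
    "portsweep": "PROBE",
}


def encode_label(attack_name):
    """Map a specific attack name to the numeric class of its category."""
    return ATTACK_CLASS[ATTACK_CATEGORY[attack_name]]


def split_dataset(records, train_fraction=0.8, seed=42):
    """Shuffle a copy of the records and cut it into training and testing parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy records (assumed): a feature dictionary paired with an attack name.
names = ["smurf", "neptune", "buffer_overflow", "guess_passwd", "portsweep"] * 2
records = [({"protocol_type": "tcp", "duration": i}, name)
           for i, name in enumerate(names)]

train, test = split_dataset(records)
train_labels = [encode_label(name) for _, name in train]
```

Fixing the seed makes the split reproducible, so every classifier can be trained and evaluated on exactly the same two subsets.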
4 Experimental Evaluation
This section defines the performance metrics in terms of intrusion detection and dis-
cusses the outcome by conducting experiments on cybersecurity datasets with different
Fig. 1. Performance comparison with respect to accuracy for the machine learning-based
IDS models.
To evaluate the performance of each classifier-based IDS model, Fig. 1 compares accuracy
and Fig. 2 compares precision, recall, and F1-score. For evaluation, we use the same
training and testing data for every classification-based IDS model.
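The four metrics compared in the figures can be computed as in the following Python sketch. It uses the standard definitions, with precision, recall, and F1-score taken for one class against all others; whether the reported figures use per-class, macro, or weighted averages is not stated here, so that choice, the function name, and the toy labels are assumptions.

```python
def classification_metrics(y_true, y_pred, positive):
    """Accuracy over all classes; precision, recall, and F1 for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels (assumed) for eight test connections.
y_true = ["DoS", "DoS", "DoS", "PROBE", "normal", "normal", "R2L", "DoS"]
y_pred = ["DoS", "DoS", "PROBE", "PROBE", "normal", "DoS", "R2L", "DoS"]
dos_metrics = classification_metrics(y_true, y_pred, positive="DoS")
```

For the DoS class in this toy example there are three true positives, one false positive, and one false negative, so precision and recall are both 3/4; six of the eight predictions are correct, so accuracy is also 3/4.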
Fig. 2. Performance comparison with respect to precision, recall, and F1-score for the
machine learning classification-based IDS models.
From Fig. 1 and Fig. 2, we find that the Random Forest classifier-based IDS model
consistently performs better than the other classifiers in detecting intrusions. In
particular, Random Forest gives the best results in terms of accuracy, precision, recall,
and F1-score. The reason is that the Random Forest classifier first generates several
decision trees and thereby derives a set of rules in the forest. Every tree in a Random
Forest model acts as a separate classifier, so the forest generates more logic rules and
produces its outcome by majority voting over these trees. For this reason, the Random
Forest model performs better in precision, recall, F1-score, and accuracy. Overall, the
machine learning classifier-based IDS model discussed above is fully data-oriented and
reflects the behavioral patterns of various cyber-attacks. Although we consider data-driven
prediction according to the patterns available in a given dataset using machine learning
techniques, a recency-based model [19] could be more effective for developing a data-driven
intrusion detection system. Moreover, incorporating contextual information and its
analysis [21, 16] could play an important role in building a smart intrusion detection system.
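The majority-voting step attributed to Random Forest above can be sketched in Python as follows. Only the voting is shown: the three hand-written rules stand in for trained decision trees, and the feature names and thresholds are hypothetical. A real Random Forest learns each tree from a bootstrap sample of the training data with random feature selection [7].

```python
from collections import Counter


# Hypothetical stand-ins for trained decision trees; each encodes one simple rule.
def tree_a(x):
    return "DoS" if x["count"] > 100 else "normal"


def tree_b(x):
    return "DoS" if x["serror_rate"] > 0.5 else "normal"


def tree_c(x):
    return "PROBE" if x["diff_srv_rate"] > 0.5 else "normal"


def forest_predict(trees, x):
    """Return the class predicted by the largest number of trees."""
    votes = Counter(tree(x) for tree in trees)
    # On a tie, most_common keeps the class that received its first vote earliest.
    return votes.most_common(1)[0][0]


forest = [tree_a, tree_b, tree_c]
flood_like = {"count": 250, "serror_rate": 0.9, "diff_srv_rate": 0.1}
quiet = {"count": 3, "serror_rate": 0.0, "diff_srv_rate": 0.0}

flood_prediction = forest_predict(forest, flood_like)  # two votes for DoS, one for normal
quiet_prediction = forest_predict(forest, quiet)       # three votes for normal
```

Because the final class needs support from most trees, a single tree that misfires on a connection is outvoted by the others.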
References
1. KDD Cup 99. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html. Accessed 20 Oct
2019
2. Aftergood, S.: Cybersecurity: the cold war online. Nature 547(7661), 30 (2017)
3. Ammar, A., Michael, H., Jemal, A., Moutaz, A.: Using feature selection for intrusion detection
system. In: 2012 International Symposium on Communications and Information Technologies
(ISCIT), pp. 296–301. IEEE (2012)
4. Amit, Y., Geman, D.: Shape quantization and recognition with randomized trees. Neural
Comput. 9(7), 1545–1588 (1997)
5. Shahid, A., et al.: From intrusion detection to an intrusion response system: fundamentals,
requirements, and future directions. Algorithms, 10(2), 39 (2017)
6. Breiman, L.: Bagging predictors. Mach. Learn. 24(2), 123–140 (1996)
7. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
8. Han, J., Pei, J., Kamber, M.: Data Mining: Concepts and Techniques. Elsevier, Amsterdam
(2011)
9. John, G.H., Langley, P.: Estimating continuous distributions in bayesian classifiers. In: Pro-
ceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 338–345.
Morgan Kaufmann Publishers Inc. (1995)
10. Khraisat, A., Gondal, I., Vamplew, P., Kamruzzaman, J.: Survey of intrusion detection systems:
techniques, datasets and challenges. Cybersecurity 2(1), 20 (2019)
11. Liao, H.-J., Lin, C.-H.R., Lin, Y.-C., Tung, K.-Y.: Intrusion detection system: a comprehensive
review. J. Netw. Comput. Appl. 36(1), 16–24 (2013)
12. Milenkoski, A., Vieira, M., Kounev, S., Avritzer, A., Payne, B.D.: Evaluating computer intru-
sion detection systems: a survey of common practices. ACM Comput. Surv. (CSUR) 48(1),
1–41 (2015)
13. Mohammadi, S., Mirvaziri, H., Ghazizadeh-Ahsaee, M., Karimipour, H.: Cyber intrusion
detection by combined feature selection algorithm. J. Inf. Secur. Appl. 44, 80–88 (2019)
14. Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1(1), 81–106 (1986)
15. Quinlan, J.R.: C4.5: programs for machine learning. Machine Learning (1993)
16. Sarker, I.H.: Context-aware rule learning from smartphone data: survey, challenges and future
directions. J. Big Data 6(1), 1–25 (2019). https://doi.org/10.1186/s40537-019-0258-4
17. Sarker, I.H.: A machine learning based robust prediction model for real-life mobile phone
data. Internet of Things 5, 180–193 (2019)
18. Sarker, I.H., Abushark, Y.B., Alsolami, F., Khan, A.I.: Intrudtree: a machine learning-based
cyber security intrusion detection model. Symmetry 12, 754 (2020)
19. Sarker, I.H., Colman, A., Han, J.: Recencyminer: mining recency-based personalized behavior
from contextual smartphone data. J. Big Data 6(1), 49 (2019)
20. Sarker, I.H., Colman, A., Han, J., Khan, A.I., Abushark, Y.B., Salah, K.: Behavdt: a behavioral
decision tree learning to build user-centric context-aware predictive model. Mobile Netw.
Appl. 1, 1–11 (2019)
21. Sarker, I.H., Colman, A., Kabir, M.A., Han, J.: Individualized time-series segmentation
for mining mobile phone user behavior. The Comput. J., 61(3), 349–368 (2018). Oxford
University, UK
22. Sarker, I.H., Kabir, M.A., Colman, A., Han, J.: An improved naive bayes classifier-based
noise detection technique for classifying user phone call behavior (2017)
23. Sarker, I.H., Kayes, A., Watters, P.: Effectiveness analysis of machine learning classification
models for predicting personalized context-aware smartphone usage. Journal of Big Data
(2019)
24. Sarker, I.H., Salim, F.D.: Mining user behavioral rules from smartphone data through
association analysis. In: Phung, D., Tseng, V.S., Webb, G.I., Ho, B., Ganji, M., Rashidi, L. (eds.)
PAKDD 2018. LNCS (LNAI), vol. 10937, pp. 450–461. Springer, Cham (2018).
https://doi.org/10.1007/978-3-319-93034-3_36
25. Sarker, I.H., et al.: Cybersecurity data science: an overview from machine learning perspective
(2020)
26. Tapiador, J.E., Orfila, A., Ribagorda, A., Ramos, B.: Keyrecovery attacks on kids, a keyed
anomaly detection system. IEEE Trans. Dependable Sec. Comput. 12(3), 312–325 (2013)
27. Tavallaee, M., Stakhanova, N., Ghorbani, A.A.: Toward credible evaluation of anomaly-
based intrusion-detection methods. IEEE Trans. Syst. Man Cybern. Part C (Applications and
Reviews), 40(5), 516–524 (2010)
28. Viegas, E., Santin, A.O., Franca, A., Jasinski, R., Pedroni, V.A., Oliveira, L.S.: Towards
an energy-efficient anomaly-based intrusion detection engine for embedded systems. IEEE
Trans. Comput. 66(1), 163–177 (2016)
29. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques.
Morgan Kaufmann, Burlington (2005)
30. Witten, I.H., Frank, E., Trigg, L.E., Hall, M.A., Holmes, G., Cunningham, S.J.: Weka: practical
machine learning tools and techniques with java implementations (1999)
31. Xin, Y., et al.: Machine learning and deep learning methods for cybersecurity. IEEE Access
6, 35365–35381 (2018)