
Selecting Features for Intrusion Detection:

A Feature Relevance Analysis on KDD 99 Intrusion Detection Datasets


H. Güneş Kayacık, A. Nur Zincir-Heywood, Malcolm I. Heywood
Dalhousie University, Faculty of Computer Science,
6050 University Avenue, Halifax, Nova Scotia. B3H 1W5
{kayacik, zincir, mheywood}@cs.dal.ca

Abstract

KDD 99 intrusion detection datasets, which are based on the DARPA 98 dataset, provide labeled data for researchers working in the field of intrusion detection and are the only such labeled datasets publicly available. Numerous researchers employed the datasets of the KDD 99 intrusion detection competition to study the utilization of machine learning for intrusion detection and reported detection rates up to 91% with false positive rates less than 1%. To substantiate the performance of machine learning based detectors that are trained on KDD 99 training data, we investigate the relevance of each feature in the KDD 99 intrusion detection datasets. To this end, information gain is employed to determine the most discriminating features for each class.

Keywords
Intrusion detection, KDD 99 intrusion detection datasets, feature relevance, information gain
1 Introduction

Along with the benefits, the Internet also created numerous ways to compromise the stability and security of the systems connected to it. Although static defense mechanisms such as firewalls and software updates can provide a reasonable level of security, more dynamic mechanisms such as intrusion detection systems (IDSs) should also be utilized. Intrusion detection systems are typically classified as host based or network based. A host based IDS monitors resources such as system logs, file systems and disk resources, whereas a network based IDS monitors the data passing through the network. Different detection techniques can be employed to search for attack patterns in the monitored data. Misuse detection systems try to find attack signatures in the monitored resource. Anomaly detection systems typically rely on knowledge of normal behavior and flag any deviation from it. Intrusion detection systems currently in use typically require human input to create attack signatures or to determine effective models for normal behavior. Support for learning algorithms provides a potential alternative to expensive human input. The main task of such a learning algorithm is to discover appropriate models from the training data for characterizing normal and attack behavior. The ensuing model is then used to make predictions regarding unseen data.

One of the biggest challenges in network-based intrusion detection is the extensive amount of data collected from the network. Therefore, before feeding the data to a machine learning algorithm, raw network traffic should be summarized into higher-level events such as connection records. Each higher-level event is described with a set of features. Selecting good features is a crucial activity and requires extensive domain knowledge.

Given the significance of the intrusion detection problem, there have been various initiatives that attempt to quantify the current state of the art. In particular, MIT Lincoln Lab's DARPA intrusion detection evaluation datasets have been employed to design and test intrusion detection systems. In 1999, recorded network traffic from the DARPA 98 Lincoln Lab dataset [1] was summarized into network connections with 41 features per connection. This formed the KDD 99 intrusion detection benchmark in the International Knowledge Discovery and Data Mining Tools Competition [2]. Although not without its drawbacks [3, 12], the KDD 99 benchmark provides the only publicly available labeled datasets for comparing IDS systems of which the authors are aware.

The detection results reported by research that employed machine learning algorithms (such as decision trees [4, 5], neural network algorithms [6], and clustering and support vector machine approaches [11]) on the KDD 99 intrusion detection datasets indicate that denial of service attacks and probes are detected accurately, whereas attacks involving content have substantially lower detection rates. Sabhnani et al. [10] investigated the deficiencies of the KDD 99 intrusion detection datasets and concluded that it is not possible to achieve a high detection rate on attacks involving content (user to root and remote to local attacks). Given the detection rates of recent research, our objective is to perform a feature relevance analysis to substantiate the performance of machine learning IDSs. Therefore, we aim to investigate the relevance of the 41 features with respect to the dataset labels. That is, for normal behavior and each type of attack (i.e. class labels), we determine the most relevant feature, which best discriminates the given class from the others. To do so, information gain, which is the underlying feature selection measure for constructing decision trees, is employed. For a given class, the feature with the highest information gain is considered the most discriminative feature. Although information gain is indirectly employed on the KDD 99 intrusion detection dataset by the use of decision trees, our objective is to perform a relevance analysis rather than to train a detector.

The remainder of the paper is organized as follows. Section 2 provides the methodology of the work. Results are reported in Section 3 and conclusions are drawn in Section 4.
2 Methodology

As indicated in the introduction, the basic objective of this work is to determine the contribution of the 41 features in the KDD 99 intrusion detection datasets to attack detection (or discrimination of normal behavior from attacks). To this end, an approach based on information gain is employed. Based on the entropy of a feature, information gain measures the relevance of a given feature, in other words its role in determining the class label. If the feature is relevant, in other words highly useful for an accurate determination, the calculated entropies will be close to 0 and the information gain will be close to 1. Since information gain is calculated for discrete features, continuous features are discretized with the emphasis of providing sufficient discrete values for detection.

2.1 KDD dataset

The KDD 99 intrusion detection datasets are based on the 1998 DARPA initiative, which provides designers of intrusion detection systems (IDS) with a benchmark on which to evaluate different methodologies [1]. To do so, a simulation is made of a fictitious military network consisting of three 'target' machines running various operating systems and services. An additional three machines are then used to spoof different IP addresses to generate traffic. Finally, there is a sniffer that records all network traffic using the TCP dump format. The total simulated period is seven weeks. Normal connections are created to profile that expected in a military network, and attacks fall into one of four categories: User to Root, Remote to Local, Denial of Service, and Probe.

• Denial of Service (dos): the attacker tries to prevent legitimate users from using a service.
• Remote to Local (r2l): the attacker does not have an account on the victim machine, hence tries to gain access.
• User to Root (u2r): the attacker has local access to the victim machine and tries to gain super user privileges.
• Probe: the attacker tries to gain information about the target host.

In 1999, the original TCP dump files were preprocessed for utilization in the Intrusion Detection System benchmark of the International Knowledge Discovery and Data Mining Tools Competition [2]. To do so, packet information in the TCP dump file is summarized into connections. Specifically, "a connection is a sequence of TCP packets starting and ending at some well defined times, between which data flows from a source IP address to a target IP address under some well defined protocol" [2]. This process is completed using the Bro IDS [7], resulting in 41 features for each connection, which are detailed in Appendix 1. Features are grouped into four categories:

• Basic Features: basic features can be derived from packet headers without inspecting the payload. Basic features are the first six features listed in Appendix 1.
• Content Features: domain knowledge is used to assess the payload of the original TCP packets. This includes features such as the number of failed login attempts.
• Time-based Traffic Features: these features are designed to capture properties that mature over a 2 second temporal window. One example of such a feature would be the number of connections to the same host over the 2 second interval.
• Host-based Traffic Features: these utilize a historical window estimated over the number of connections – in this case 100 – instead of time. Host based features are therefore designed to assess attacks that span intervals longer than 2 seconds.

The KDD 99 intrusion detection benchmark consists of three components, which are detailed in Table 1. In the International Knowledge Discovery and Data Mining Tools Competition, only the "10% KDD" dataset is employed for the purpose of training [8]. This dataset contains 22 attack types and is a more concise version of the "Whole KDD" dataset. It contains more examples of attacks than normal connections, and the attack types are not represented equally. Because of their nature, denial of service attacks account for the majority of the dataset. On the other hand, the "Corrected KDD" dataset provides a dataset with different statistical distributions than either "10% KDD" or "Whole KDD" and contains 14 additional attacks. The list of class labels and their corresponding categories for "10% KDD" are detailed in Table 2. Since "10% KDD" is employed as the training set in the original competition, we performed our analysis on the "10% KDD" dataset.

Table 1. Basic characteristics of the KDD 99 intrusion detection datasets in terms of number of samples

Dataset          DoS      Probe  u2r  r2l    Normal
"10% KDD"        391458   4107   52   1126   97277
"Corrected KDD"  229853   4166   70   16347  60593
"Whole KDD"      3883370  41102  52   1126   972780
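For concreteness, the following minimal Python sketch (our illustration, not part of the original benchmark tooling) loads the "10% KDD" file as distributed at the UCI KDD archive [8] and tallies the class labels, which should reproduce the counts in Table 2. We assume the usual distribution format: a CSV file with 41 feature columns followed by a trailing label column; the file name kddcup.data_10_percent is an assumption, so adjust the path as needed.

```python
import csv
from collections import Counter

# Path to the "10% KDD" file from the UCI KDD archive [8]
# (file name assumed; adjust to your local copy).
PATH = "kddcup.data_10_percent"

def load_kdd(path):
    """Yield (features, label) pairs: 41 feature values plus the class label."""
    with open(path, newline="") as f:
        for row in csv.reader(f):
            # Labels in the raw file end with '.', e.g. 'smurf.'
            yield row[:41], row[41].rstrip(".")

label_counts = Counter(label for _, label in load_kdd(PATH))
print(label_counts.most_common())  # expect smurf and neptune to dominate (Table 2)
```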
Table 2. Class labels that appear in the "10% KDD" dataset

Attack           # Samples  Category
smurf            280790     dos
neptune          107201     dos
back             2203       dos
teardrop         979        dos
pod              264        dos
land             21         dos
normal           97277      normal
satan            1589       probe
ipsweep          1247       probe
portsweep        1040       probe
nmap             231        probe
warezclient      1020       r2l
guess_passwd     53         r2l
warezmaster      20         r2l
imap             12         r2l
ftp_write        8          r2l
multihop         7          r2l
phf              4          r2l
spy              2          r2l
buffer_overflow  30         u2r
rootkit          10         u2r
loadmodule       9          u2r
perl             3          u2r

2.2 Information Gain

Let S be a set of training set samples with their corresponding labels. Suppose there are m classes, the training set contains s_i samples of class i, and s is the total number of samples in the training set. The expected information needed to classify a given sample is calculated by:

I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2\left(\frac{s_i}{s}\right)   (1)

A feature F with values {f_1, f_2, ..., f_v} can divide the training set into v subsets {S_1, S_2, ..., S_v}, where S_j is the subset which has the value f_j for feature F. Furthermore, let S_j contain s_{ij} samples of class i. The entropy of the feature F is:

E(F) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})   (2)

The information gain for F can be calculated as:

Gain(F) = I(s_1, \ldots, s_m) - E(F)   (3)

In our experiments, information gain is calculated for class labels by employing a binary discrimination for each class. That is, for each class, a dataset instance is considered in-class if it has the same label, and out-class if it has a different label. Consequently, as opposed to calculating one information gain as a general measure of the relevance of a feature for all classes, we calculate an information gain for each class. This signifies how well the feature can discriminate the given class (i.e. normal or an attack type) from the other classes.
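To make the per-class computation concrete, the following is a minimal sketch of equations (1)-(3) with the binary in-class/out-class discrimination described above. It is our illustration rather than the authors' original code, written in Python with numpy; it assumes the feature values have already been discretized (see Section 2.3) and are passed as numpy arrays.

```python
import numpy as np

def entropy(labels):
    """I(s1,...,sm): expected information of a label vector (equation 1)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """Gain(F) = I(s1,...,sm) - E(F) for one discrete feature (equations 2, 3)."""
    gain = entropy(labels)
    for value in np.unique(feature):
        subset = labels[feature == value]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

def per_class_gains(feature, labels, classes):
    """Binary discrimination: in-class vs. out-class gain for each class."""
    return {c: information_gain(feature, labels == c) for c in classes}
```

For a given class, the feature with the highest value in per_class_gains is then reported as the most discriminative feature for that class (Table 3).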
2.3 Preprocessing

Since information gain is calculated for discrete features, continuous features should be discretized. To this end, continuous features are partitioned into equal-sized partitions by utilizing the equal frequency intervals method [13]. In this method, the feature space is partitioned into an arbitrary number of partitions, where each partition contains the same number of data points. That is to say, the range of each partition is adjusted to contain N dataset instances. If a value occurs more than N times in a feature space, it is assigned a partition of its own. In the "10% KDD" dataset, certain classes such as denial of service attacks and normal connections occur in the magnitude of hundreds of thousands, whereas other classes such as R2L and U2R attacks occur in the magnitude of tens or hundreds. Therefore, to provide sufficient resolution for the minor classes, N is set to 10 (i.e. a maximum of 50,000 partitions).
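As an illustration, a simple equal frequency discretization consistent with the description above might look as follows. This is a sketch under the stated assumptions rather than the exact procedure of [13]: values occurring more than N times receive a partition of their own, the remaining instances are cut into rank-ordered groups of N, and ties at group boundaries may be split.

```python
import numpy as np

def equal_frequency_discretize(values, n=10):
    """Assign each instance a partition id using equal frequency intervals.

    Values occurring more than n times form their own partition; the
    remaining instances are pooled, sorted, and cut every n instances."""
    values = np.asarray(values, dtype=float)
    uniq, counts = np.unique(values, return_counts=True)
    own = {v: i for i, v in enumerate(uniq[counts > n])}   # own partitions
    part = np.zeros(len(values), dtype=int)
    rare = np.flatnonzero([v not in own for v in values])  # pooled instances
    # Sort the pooled instances and start a new partition every n of them.
    for rank, i in enumerate(rare[np.argsort(values[rare], kind="stable")]):
        part[i] = len(own) + rank // n
    for i, v in enumerate(values):
        if v in own:
            part[i] = own[v]
    return part
```

With N = 10 and the roughly 494,000 connection records of "10% KDD", this yields at most about 50,000 partitions per feature, consistent with the figure quoted above.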
3 Results

Results are presented in terms of the classes that achieved good levels of discrimination from the others in the training set, together with an analysis of feature relevance in the training set. Table 3 details the most relevant feature for each class and provides the corresponding information gain measures.

Three classes (namely normal, neptune and smurf) stand out from the others with high information gain, hence a high degree of discrimination. As indicated before, recent literature based on machine learning algorithms [4, 5, 6, 11] achieved approximately 90% detection rates with low false alarm rates (~2%). Given that the normal, neptune and smurf classes correspond to 98% of the training data, the majority of the training set can be easily classified; therefore the high detection and low false positive rates of IDSs trained on the "10% KDD" dataset are questionable, because the dataset is unrealistically simple. Moreover, for 14 of the 23 classes, the amount of data exchange (i.e. source and destination bytes) is the most discriminating feature. This is expected for denial of service and probe category attacks, where the nature of the attack involves very short or very long connections. However, for content based attacks (such as ftp_write, back and phf), basing the decision on a feature that is unrelated to content will lead to unjustified detection of an attack. Furthermore, as expected, feature 7, which is related to the land attack, is selected as the most discriminating feature for the land class.
Table 3. Most relevant features for each class label and information gain measures

Class Label      Info. Gain  Feature #  Feature Name
smurf            0.9859      5          source bytes
neptune          0.7429      30         diff srv rate
normal           0.6439      5          source bytes
back             0.0411      6          destination bytes
satan            0.0257      27         rerror rate
ipsweep          0.0222      37         dst host srv diff host rate
teardrop         0.0206      5          source bytes
warezclient      0.0176      5          source bytes
portsweep        0.0163      4          flag
pod              0.0065      5          source bytes
nmap             0.0024      4          flag
guess_passwd     0.0015      5          source bytes
buffer_overflow  0.0007      6          destination bytes
land             0.0007      7          land
warezmaster      0.0006      6          destination bytes
imap             0.0003      3          service
loadmodule       0.0002      6          destination bytes
rootkit          0.0002      5          source bytes
perl             0.0001      16         # root
ftp_write        0.0001      5          source bytes
phf              0.0001      6          destination bytes
multihop         0.0001      6          destination bytes
spy              0.0001      39         dst host srv serror rate

Figure 1 shows the maximum information gain for each feature. In addition, Table 4 details the most discriminative class label for each feature. For the majority of the features (31 of 41), normal, smurf and neptune are the most discriminative classes. That is to say, there are many features that can discriminate these classes accurately. There are 9 features with very small maximum information gain (i.e. smaller than 0.001), which contribute very little to intrusion detection. Moreover, features 20 and 21 (outbound command count for FTP sessions and hot login, respectively) do not show any variation in the training set, therefore they have no relevance to intrusion detection.

[Figure 1. Information gain of each feature]

Table 4. List of features for which the class is selected most relevant

Class Label      Relevant Features
normal           1, 6, 12, 15, 16, 17, 18, 19, 31, 32, 37
smurf            2, 3, 5, 23, 24, 27, 28, 36, 40, 41
neptune          4, 25, 26, 29, 30, 33, 34, 35, 38, 39
land             7
teardrop         8
ftp_write        9
back             10, 13
guess_passwd     11
buffer_overflow  14
warezclient      22

4 Conclusion

In this paper, a feature relevance analysis is performed on the KDD 99 training set, which is widely used by machine learning researchers. Feature relevance is expressed in terms of information gain, which gets higher as the feature gets more discriminative. In order to obtain a feature relevance measure for all classes in the training set, information gain is calculated on a binary classification for each feature, resulting in a separate information gain per class. Recent research employed decision trees, artificial neural networks and a probabilistic classifier and reported, in terms of detection and false alarm rates, that user to root and remote to local attacks are very difficult to classify. The contribution of this work is that it analyzes the contribution of each feature to classification.
Our results indicate that the normal, neptune and smurf classes are highly related to certain features that make their classification easier. Since these three classes make up 98% of the training data, it is very easy for a machine learning algorithm to achieve good results. Moreover, the amount of data exchange in a connection seems to be a discriminating feature for the majority of the classes. On the other hand, certain features make no contribution to intrusion detection, which indicates that not all features are useful. Although the test data shows different characteristics from the training data, since "10% KDD" is the training data in the competition, our analysis of the training data sheds light on the performance of machine learning based intrusion detection systems trained on the KDD 99 intrusion detection datasets in general.
Future work will include additional measures of feature relevance and will extend the analysis to the other KDD 99 intrusion detection datasets.

Acknowledgments
This work was supported in part by NSERC and CFI.
All research was conducted at the NIMS Laboratory,
http://www.cs.dal.ca/projectx/.

References
[1] "The 1998 Intrusion Detection Off-line Evaluation Plan", MIT Lincoln Lab., Information Systems Technology Group, http://www.ll.mit.edu/IST/ideval/docs/1998/id98-eval-ll.txt, 25 March 1998.
[2] "Knowledge Discovery in Databases DARPA Archive: Task Description", http://www.kdd.ics.uci.edu/databases/kddcup99/task.html
[3] J. McHugh, "Testing intrusion detection systems: A critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory", ACM Transactions on Information and System Security, 3(4), pp. 262-294, 2001.
[4] B. Pfahringer, "Winning the KDD99 Classification Cup: Bagged Boosting", SIGKDD Explorations, 1(2), pp. 65-66, 2000.
[5] I. Levin, "KDD-99 Classifier Learning Contest: LLSoft's Results Overview", SIGKDD Explorations, 1(2), pp. 67-75, 2000.
[6] G. Kayacik, N. Zincir-Heywood, M. Heywood, "On the Capability of an SOM based Intrusion Detection System", in Proceedings of the International Joint Conference on Neural Networks, 2003.
[7] V. Paxson, "Bro: A System for Detecting Network Intruders in Real-Time", Computer Networks, 31(23-24), pp. 2435-2463, 14 Dec. 1999.
[8] S. Hettich, S. D. Bay, "The UCI KDD Archive", University of California, Irvine, Department of Information and Computer Science, http://kdd.ics.uci.edu, 1999.
[9] J. Han, M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2000, ISBN 7-04-010041, Ch. 5.
[10] M. Sabhnani, G. Serpen, "Why Machine Learning Algorithms Fail in Misuse Detection on KDD Intrusion Detection Data Set", Journal of Intelligent Data Analysis, 2004.
[11] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, S. Stolfo, "A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data", in Applications of Data Mining in Computer Security, Chapter 4, D. Barbara and S. Jajodia (editors), Kluwer, ISBN 1-4020-7054-3, 2002.
[12] G. H. Kayacik, A. N. Zincir-Heywood, "Analysis of Three Intrusion Detection System Benchmark Datasets Using Machine Learning Algorithms", in Proceedings of the IEEE ISI 2005, Atlanta, USA, May 2005.
[13] A. K. C. Wong, D. K. Y. Chiu, "Synthesizing statistical knowledge from incomplete mixed-mode data", IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-9(6), pp. 796-805, November 1987.
Appendix 1. Description of KDD 99 Intrusion Detection Dataset Features

Table A.1. List of features with their descriptions and data types (summarized from [2])

#   Feature                      Description                                                                          Type
1   duration                     Duration of the connection                                                           Cont.
2   protocol type                Connection protocol (e.g. tcp, udp)                                                  Disc.
3   service                      Destination service (e.g. telnet, ftp)                                               Disc.
4   flag                         Status flag of the connection                                                        Disc.
5   source bytes                 Bytes sent from source to destination                                                Cont.
6   destination bytes            Bytes sent from destination to source                                                Cont.
7   land                         1 if connection is from/to the same host/port; 0 otherwise                           Disc.
8   wrong fragment               Number of wrong fragments                                                            Cont.
9   urgent                       Number of urgent packets                                                             Cont.
10  hot                          Number of "hot" indicators                                                           Cont.
11  failed logins                Number of failed logins                                                              Cont.
12  logged in                    1 if successfully logged in; 0 otherwise                                             Disc.
13  # compromised                Number of "compromised" conditions                                                   Cont.
14  root shell                   1 if root shell is obtained; 0 otherwise                                             Cont.
15  su attempted                 1 if "su root" command attempted; 0 otherwise                                        Cont.
16  # root                       Number of "root" accesses                                                            Cont.
17  # file creations             Number of file creation operations                                                   Cont.
18  # shells                     Number of shell prompts                                                              Cont.
19  # access files               Number of operations on access control files                                         Cont.
20  # outbound cmds              Number of outbound commands in an ftp session                                        Cont.
21  is hot login                 1 if the login belongs to the "hot" list; 0 otherwise                                Disc.
22  is guest login               1 if the login is a "guest" login; 0 otherwise                                       Disc.
23  count                        Number of connections to the same host as the current connection in the past
                                 two seconds                                                                          Cont.
24  srv count                    Number of connections to the same service as the current connection in the past
                                 two seconds                                                                          Cont.
25  serror rate                  % of connections that have "SYN" errors                                              Cont.
26  srv serror rate              % of connections that have "SYN" errors                                              Cont.
27  rerror rate                  % of connections that have "REJ" errors                                              Cont.
28  srv rerror rate              % of connections that have "REJ" errors                                              Cont.
29  same srv rate                % of connections to the same service                                                 Cont.
30  diff srv rate                % of connections to different services                                               Cont.
31  srv diff host rate           % of connections to different hosts                                                  Cont.
32  dst host count               Count of connections having the same destination host                                Cont.
33  dst host srv count           Count of connections having the same destination host and using the same service     Cont.
34  dst host same srv rate       % of connections having the same destination host and using the same service         Cont.
35  dst host diff srv rate       % of different services on the current host                                          Cont.
36  dst host same src port rate  % of connections to the current host having the same src port                        Cont.
37  dst host srv diff host rate  % of connections to the same service coming from different hosts                     Cont.
38  dst host serror rate         % of connections to the current host that have an S0 error                           Cont.
39  dst host srv serror rate     % of connections to the current host and specified service that have an S0 error     Cont.
40  dst host rerror rate         % of connections to the current host that have an RST error                          Cont.
41  dst host srv rerror rate     % of connections to the current host and specified service that have an RST error    Cont.
