A decision based on a feature that is unrelated to the content will lead to unjustified detection of an attack. Furthermore, as expected, feature 7, which is related to the land attack, is selected as the most discriminating feature for the land class.
Table 3. Most relevant features for each class label and information gain measures

Class Label        Info. Gain   Feature #   Feature Name
smurf              0.9859        5          source bytes
neptune            0.7429       30          diff srv rate
normal             0.6439        5          source bytes
back               0.0411        6          destination bytes
satan              0.0257       27          rerror rate
ipsweep            0.0222       37          dst host srv diff host rate
teardrop           0.0206        5          source bytes
warezclient        0.0176        5          source bytes
portsweep          0.0163        4          flag
pod                0.0065        5          source bytes
nmap               0.0024        4          flag
guess_passwd       0.0015        5          source bytes
buffer_overflow    0.0007        6          destination bytes
land               0.0007        7          land
warezmaster        0.0006        6          destination bytes
imap               0.0003        3          service
loadmodule         0.0002        6          destination bytes
rootkit            0.0002        5          source bytes
perl               0.0001       16          # root
ftp_write          0.0001        5          source bytes
phf                0.0001        6          destination bytes
multihop           0.0001        6          destination bytes
spy                0.0001       39          dst host srv serror rate

Figure 1. Information gain of each feature

Table 4. List of features for which the class is selected as the most relevant

Class Label        Relevant Features
normal             1, 6, 12, 15, 16, 17, 18, 19, 31, 32, 37
smurf              2, 3, 5, 23, 24, 27, 28, 36, 40, 41
neptune            4, 25, 26, 29, 30, 33, 34, 35, 38, 39
land               7
teardrop           8
ftp_write          9
back               10, 13
guess_passwd       11
buffer_overflow    14
warezclient        22
Figure 1 shows the maximum information gain for each feature. In addition, Table 4 details the most discriminative class label for each feature. For the majority of the features (31 out of 41), normal, smurf and neptune are the most discriminative classes. That is to say, there are many features that can discriminate these classes accurately. There are 9 features with a very small maximum information gain (i.e. smaller than 0.001), which contribute very little to intrusion detection. Moreover, features 20 and 21 (outbound command count for FTP sessions and hot login, respectively) show no variation in the training set; therefore, they have no relevance to intrusion detection.

4 Conclusion

In this paper, a feature relevance analysis is performed on the KDD 99 training set, which is widely used by machine learning researchers. Feature relevance is expressed in terms of information gain, which gets higher as the feature becomes more discriminative. In order to obtain a feature relevance measure for all classes in the training set, information gain is calculated on a binary classification basis for each feature, resulting in a separate information gain value per class. Recent research employed decision trees, artificial neural networks and a probabilistic classifier, and reported, in terms of detection and false alarm rates, that user-to-root and remote-to-local attacks are very difficult to classify. The contribution of this work is that it analyzes the involvement of each feature in classification.
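As a concrete illustration, the following is a minimal sketch of this one-vs-rest information gain computation. The function names are our own, and the sketch assumes the continuous features have already been discretized; only the computation of the measure itself is shown.

import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def info_gain(feature_values, labels):
    """Information gain of a discretized feature with respect to labels:
    IG = H(labels) - sum over values v of P(v) * H(labels | feature = v)."""
    n = len(labels)
    partitions = defaultdict(list)
    for v, y in zip(feature_values, labels):
        partitions[v].append(y)
    h_cond = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - h_cond

def per_class_info_gain(feature_values, labels, target):
    """One-vs-rest gain: binarize the labels (target vs. rest) so that each
    class receives its own information gain value for the feature."""
    binary = ["target" if y == target else "rest" for y in labels]
    return info_gain(feature_values, binary)

if __name__ == "__main__":
    # Toy example: a perfectly discriminating feature yields gain = H(labels).
    feature = ["a", "a", "b", "b"]
    labels = ["smurf", "smurf", "normal", "neptune"]
    print(per_class_info_gain(feature, labels, "smurf"))  # 1.0 bit

Under this scheme, Table 3 reports, for each class, the feature attaining the highest such gain; Figure 1 plots the maximum gain over all classes for each feature; and Table 4 lists, for each feature, the class at which that maximum is attained.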
Our results indicate that the normal, neptune and smurf classes are highly related to certain features that make their classification easier. Since these three classes make up 98% of the training data, it is very easy for a machine learning algorithm to achieve good results. Moreover, the amount of data exchanged in a connection seems to be a discriminating feature for the majority of the classes. On the other hand, certain features make no contribution to intrusion detection, which indicates that not all features are useful. Although the test data shows different characteristics than the training data, since "10% KDD" is the training data in the competition, our analysis of the training data sheds light on the performance of machine learning based intrusion detection systems trained on the KDD 99 intrusion detection datasets in general.
Future work will include additional measures of feature relevance and will extend the analysis to the other KDD 99 intrusion detection datasets.
Acknowledgments
This work was supported in part by NSERC and CFI.
All research was conducted at the NIMS Laboratory,
http://www.cs.dal.ca/projectx/.
References
[1] The 1998 intrusion detection off-line evaluation plan. MIT Lincoln Laboratory, Information Systems Technology Group. http://www.ll.mit.edu/IST/ideval/docs/1998/id98-eval-ll.txt, 25 March 1998.
[2] Knowledge discovery in databases DARPA archive. Task description. http://www.kdd.ics.uci.edu/databases/kddcup99/task.html
[3] McHugh J., "Testing intrusion detection systems: A critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory", ACM Transactions on Information and System Security, 3(4), pp. 262-294, 2000.
[4] Pfahringer B., "Winning the KDD99 Classification Cup: Bagged Boosting", SIGKDD Explorations, 1(2), pp. 65-66, 2000.
[5] Levin I., "KDD-99 Classifier Learning Contest: LLSoft's Results Overview", SIGKDD Explorations, 1(2), pp. 67-75, 2000.
[6] Kayacik G., Zincir-Heywood N., Heywood M., "On the Capability of an SOM based Intrusion Detection System", in Proceedings of the International Joint Conference on Neural Networks, 2003.
[7] Paxson V., "Bro: A System for Detecting Network Intruders in Real-Time", Computer Networks, 31(23-24), pp. 2435-2463, 14 Dec. 1999.
[8] Hettich S., Bay S.D., The UCI KDD Archive. Irvine, CA: University of California, Department of Information and Computer Science, http://kdd.ics.uci.edu, 1999.
[9] Han J., Kamber M., "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2000, ISBN 7-04-010041, Ch. 5.
[10] Sabhnani M., Serpen G., "Why Machine Learning Algorithms Fail in Misuse Detection on KDD Intrusion Detection Data Set", Journal of Intelligent Data Analysis, 2004.
[11] Eskin E., Arnold A., Prerau M., Portnoy L., Stolfo S., "A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data", in Applications of Data Mining in Computer Security, Chapter 4, D. Barbara and S. Jajodia (editors), Kluwer, ISBN 1-4020-7054-3, 2002.
[12] Kayacik G.H., Zincir-Heywood A.N., "Analysis of Three Intrusion Detection System Benchmark Datasets Using Machine Learning Algorithms", in Proceedings of the IEEE ISI 2005, Atlanta, USA, May 2005.
[13] Wong A.K.C., Chiu D.K.Y., "Synthesizing statistical knowledge from incomplete mixed-mode data", IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-9(6), pp. 796-805, November 1987.
Appendix 1. Description of KDD 99 Intrusion Detection Dataset Features
Table A.1. List of features with their descriptions and data types (summarized from [2])
1. duration (Cont.): duration of the connection
2. protocol type (Disc.): connection protocol (e.g. tcp, udp)
3. service (Disc.): destination service (e.g. telnet, ftp)
4. flag (Disc.): status flag of the connection
5. source bytes (Cont.): bytes sent from source to destination
6. destination bytes (Cont.): bytes sent from destination to source
7. land (Disc.): 1 if connection is from/to the same host/port; 0 otherwise
8. wrong fragment (Cont.): number of wrong fragments
9. urgent (Cont.): number of urgent packets
10. hot (Cont.): number of "hot" indicators
11. failed logins (Cont.): number of failed logins
12. logged in (Disc.): 1 if successfully logged in; 0 otherwise
13. # compromised (Cont.): number of "compromised" conditions
14. root shell (Cont.): 1 if root shell is obtained; 0 otherwise
15. su attempted (Cont.): 1 if "su root" command attempted; 0 otherwise
16. # root (Cont.): number of "root" accesses
17. # file creations (Cont.): number of file creation operations
18. # shells (Cont.): number of shell prompts
19. # access files (Cont.): number of operations on access control files
20. # outbound cmds (Cont.): number of outbound commands in an ftp session
21. is hot login (Disc.): 1 if the login belongs to the "hot" list; 0 otherwise
22. is guest login (Disc.): 1 if the login is a "guest" login; 0 otherwise
23. count (Cont.): number of connections to the same host as the current connection in the past two seconds
24. srv count (Cont.): number of connections to the same service as the current connection in the past two seconds
25. serror rate (Cont.): % of connections that have "SYN" errors
26. srv serror rate (Cont.): % of connections that have "SYN" errors
27. rerror rate (Cont.): % of connections that have "REJ" errors
28. srv rerror rate (Cont.): % of connections that have "REJ" errors
29. same srv rate (Cont.): % of connections to the same service
30. diff srv rate (Cont.): % of connections to different services
31. srv diff host rate (Cont.): % of connections to different hosts
32. dst host count (Cont.): count of connections having the same destination host
33. dst host srv count (Cont.): count of connections having the same destination host and using the same service
34. dst host same srv rate (Cont.): % of connections having the same destination host and using the same service
35. dst host diff srv rate (Cont.): % of different services on the current host
36. dst host same src port rate (Cont.): % of connections to the current host having the same src port
37. dst host srv diff host rate (Cont.): % of connections to the same service coming from different hosts
38. dst host serror rate (Cont.): % of connections to the current host that have an S0 error
39. dst host srv serror rate (Cont.): % of connections to the current host and specified service that have an S0 error
40. dst host rerror rate (Cont.): % of connections to the current host that have an RST error
41. dst host srv rerror rate (Cont.): % of connections to the current host and specified service that have an RST error
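To make Table A.1 easier to use in practice, the following minimal sketch loads the comma-separated "10% KDD" records and flags zero-variance columns such as features 20 and 21 noted in the conclusion. The file path is illustrative, and the short column names are our rendering of the names used in the UCI distribution's kddcup.names file.

import csv

# Illustrative short names for the 41 features of Table A.1, in order.
FEATURES = [
    "duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes",
    "land", "wrong_fragment", "urgent", "hot", "num_failed_logins",
    "logged_in", "num_compromised", "root_shell", "su_attempted", "num_root",
    "num_file_creations", "num_shells", "num_access_files",
    "num_outbound_cmds", "is_host_login", "is_guest_login", "count",
    "srv_count", "serror_rate", "srv_serror_rate", "rerror_rate",
    "srv_rerror_rate", "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate",
    "dst_host_count", "dst_host_srv_count", "dst_host_same_srv_rate",
    "dst_host_diff_srv_rate", "dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate", "dst_host_serror_rate",
    "dst_host_srv_serror_rate", "dst_host_rerror_rate",
    "dst_host_srv_rerror_rate",
]

def zero_variance_features(path):
    """Return the features that take a single value over the whole file,
    e.g. features 20 and 21 in the "10% KDD" training set."""
    seen = [set() for _ in FEATURES]
    with open(path) as fh:
        for row in csv.reader(fh):
            # zip truncates at 41 entries, so the trailing label is ignored.
            for values, cell in zip(seen, row):
                values.add(cell)
    return [name for name, values in zip(FEATURES, seen) if len(values) <= 1]

print(zero_variance_features("kddcup.data_10_percent"))  # illustrative path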