A Deeper Dive into the NSL-KDD Data Set

Have you ever wondered how your computer or network avoids being infected by malware and
malicious traffic from the internet? It can do so because systems are in place to protect the
valuable information held on your computer or network. The systems that detect malicious
traffic are called Intrusion Detection Systems (IDS), and they are trained on records of internet
traffic. One of the most commonly used data sets for this purpose is the NSL-KDD, a widely used
benchmark for intrusion detection on internet traffic.

The NSL-KDD data set is not the first of its kind. The KDD Cup is an international Knowledge
Discovery and Data Mining Tools Competition. The 1999 edition of this competition was built
around internet traffic records: the task was to build a network intrusion detector, a
predictive model capable of distinguishing between “bad” connections, called intrusions or
attacks, and “good” normal connections. For this competition, a large number of internet
traffic records were collected and bundled into a data set called the KDD’99. The NSL-KDD data
set was later created at the University of New Brunswick as a revised, cleaned-up version of
the KDD’99.

This data set is made up of four sub data sets: KDDTest+, KDDTest-21, KDDTrain+, and
KDDTrain+_20Percent, although KDDTest-21 and KDDTrain+_20Percent are subsets of KDDTest+ and
KDDTrain+ respectively. From now on, KDDTrain+ will be referred to as train and KDDTest+ will
be referred to as test. KDDTest-21 is a subset of test with the records that have a Score of 21
removed, and KDDTrain+_20Percent is a subset of train whose record count makes up 20% of the
entire train data set. In other words, the traffic records in KDDTest-21 and
KDDTrain+_20Percent already exist in test and train respectively; they are not new records
held out of either data set.
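
The subset relationship described above can be sketched in a few lines of Python. This is a
minimal illustration, not the procedure used to build the official files: it assumes the Score
is the last field of each record, and the three shortened records are made up for the example.

```python
def drop_score(records, score=21):
    """Return the records whose last field (the Score) differs from `score`."""
    return [rec for rec in records if int(rec[-1]) != score]


# Made-up, shortened records of the form (label, score).
# Real records have 43 fields; only the last one matters here.
test_records = [
    ("normal", "21"),
    ("neptune", "18"),
    ("guess_passwd", "7"),
]

test_21 = drop_score(test_records)

print(len(test_21))                                  # 2
print(all(rec in test_records for rec in test_21))   # True: no new records appear
```

Because the subset is produced purely by filtering, every record it contains is also present
in the parent set.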

These data sets contain records of the internet traffic seen by a simple intrusion detection
network; they are the traces left behind by the traffic that a real IDS encountered. Each
record contains 43 features: 41 describe the traffic input itself, and the last two are the
label (whether the record is normal or an attack) and the Score (the severity of the traffic
input itself).
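
As a rough sketch of this record layout, the snippet below splits one record into its 41
traffic features, its label, and its Score. It assumes the records are stored as
comma-separated lines with no header row; the record shown is synthetic and the helper name
split_record is hypothetical.

```python
import csv
import io

N_FEATURES = 41  # features describing the traffic input itself


def split_record(fields):
    """Split a 43-field record into (features, label, score)."""
    if len(fields) != N_FEATURES + 2:
        raise ValueError(f"expected {N_FEATURES + 2} fields, got {len(fields)}")
    features = fields[:N_FEATURES]
    label = fields[N_FEATURES]
    score = int(fields[N_FEATURES + 1])
    return features, label, score


# Synthetic record: 4 leading fields, 37 zero placeholders, then label and Score.
line = ",".join(["0", "tcp", "http", "SF"] + ["0"] * 37 + ["normal", "21"])
fields = next(csv.reader(io.StringIO(line)))

features, label, score = split_record(fields)
print(len(features), label, score)  # 41 normal 21
```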

Within the data set exist four different classes of attacks: Denial of Service (DoS), Probe,
User to Root (U2R), and Remote to Local (R2L). A brief description of each attack can be seen
below:

• DoS is an attack that tries to shut down traffic flow to and from the target system. The
system is flooded with an abnormal amount of traffic that it can’t handle, and it shuts down
to protect itself. This prevents normal traffic from reaching the network. A comparable
situation is an online retailer getting flooded with orders on the day of a big sale:
because the network can’t handle all the requests, it shuts down, preventing paying
customers from purchasing anything. This is the most common attack in the data set.
• Probe, or surveillance, is an attack that tries to get information from a network. The goal
here is to act like a thief and steal important information, whether it be personal
information about clients or banking information.
• U2R is an attack that starts off with a normal user account and tries to gain access to the
system or network as a super-user (root). The attacker attempts to exploit vulnerabilities
in the system to gain root privileges.
• R2L is an attack that tries to gain local access to a remote machine. The attacker does not
have local access to the system or network and tries to “hack” their way in.

Note from the descriptions above that DoS acts differently from the other three attacks: DoS
attempts to shut down a system to stop traffic flow altogether, whereas the other three
attempt to quietly infiltrate the system undetected.
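
In practice, each record carries the name of a specific attack rather than one of these four
classes, so a lookup is needed to group records by class. The sketch below shows one way to do
that. The mapping is deliberately partial and is an assumption based on attack names commonly
found in KDD’99-style labels; it is not the author's full subclass table.

```python
# Partial, illustrative mapping from attack name to attack class.
ATTACK_CLASS = {
    "neptune": "DoS",
    "smurf": "DoS",
    "back": "DoS",
    "teardrop": "DoS",
    "ipsweep": "Probe",
    "nmap": "Probe",
    "portsweep": "Probe",
    "satan": "Probe",
    "buffer_overflow": "U2R",
    "rootkit": "U2R",
    "guess_passwd": "R2L",
    "warezclient": "R2L",
}


def to_class(label):
    """Map a record label to Normal, one of the four attack classes, or Unknown."""
    if label == "normal":
        return "Normal"
    return ATTACK_CLASS.get(label, "Unknown")


print(to_class("neptune"))       # DoS
print(to_class("normal"))        # Normal
print(to_class("not_in_table"))  # Unknown
```

Returning "Unknown" instead of raising an error keeps the lookup usable on labels that the
partial table does not cover.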

The table below breaks down the different subclasses of each attack that exist in the data
set:

Although these attacks exist in the data set, their distribution is heavily skewed. A
breakdown of the record distribution can be seen in the table below. Essentially, more than
half of the records in each data set are normal traffic, and the share of U2R and R2L records
is extremely low. Low as it is, this reflects the distribution of modern-day internet traffic
attacks, where the most common attack is DoS and U2R and R2L are hardly ever seen.
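
A skew like this is easy to measure once every record has a class. The following sketch
computes the share of each class in a list of class names; the sample list is made up and does
not reproduce the real record counts.

```python
from collections import Counter


def class_shares(classes):
    """Return the fraction of records that fall into each class."""
    counts = Counter(classes)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


# Made-up sample: 6 normal records, 3 DoS records, 1 Probe record.
sample = ["Normal"] * 6 + ["DoS"] * 3 + ["Probe"] * 1

shares = class_shares(sample)
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name}: {share:.0%}")
```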

The features in a traffic record provide the information about the encounter with the traffic input
by the IDS and can be broken down into four categories: Intrinsic, Content, Host-based, and
Time-based. Below is a description of the different categories of features:

• Intrinsic features can be derived from the header of the packet without looking into the
payload itself, and hold the basic information about the packet. This category contains
features 1–9.
• Content features hold information about the original packets, as they are sent in multiple
pieces rather than one. With this information, the system can access the payload. This
category contains features 10–22.
• Time-based features hold an analysis of the traffic input over a two-second window and
contain information such as how many connections it attempted to make to the same host.
These features are mostly counts and rates rather than information about the content of
the traffic input. This category contains features 23–31.
• Host-based features are similar to Time-based features, except that instead of analyzing
over a two-second window, they analyze over a series of connections (for example, how many
requests were made to the same host over x connections). These features are designed to
assess attacks that span longer than a two-second window. This category contains features
32–41.
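
The index ranges listed above translate directly into a small lookup. This is only a sketch;
the function name feature_category is hypothetical, and feature numbers are 1-based as in the
text.

```python
# (category, first feature, last feature), using the ranges given in the text.
CATEGORY_RANGES = [
    ("Intrinsic", 1, 9),
    ("Content", 10, 22),
    ("Time-based", 23, 31),
    ("Host-based", 32, 41),
]


def feature_category(index):
    """Return the category of a traffic feature given its 1-based number."""
    for name, first, last in CATEGORY_RANGES:
        if first <= index <= last:
            return name
    raise ValueError(f"feature {index} is not one of the 41 traffic features")


print(feature_category(5))   # Intrinsic
print(feature_category(23))  # Time-based
print(feature_category(41))  # Host-based
```

Features 42 and 43 (the label and the Score) are rejected on purpose, since they do not
describe the traffic input.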

The feature types in this data set can be broken down into 4 types:

• 4 Categorical (Features: 2, 3, 4, 42)
• 6 Binary (Features: 7, 12, 14, 20, 21, 22)
• 23 Discrete (Features: 8, 9, 15, 23–41, 43)
• 10 Continuous (Features: 1, 5, 6, 10, 11, 13, 16, 17, 18, 19)

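
As a quick sanity check on the grouping above, the short sketch below stores the feature
numbers per type and confirms that the four groups cover features 1 through 43 exactly once.
The dictionary name FEATURE_TYPES is chosen for illustration.

```python
# Feature numbers per type, copied from the list above (1-based).
FEATURE_TYPES = {
    "Categorical": [2, 3, 4, 42],
    "Binary": [7, 12, 14, 20, 21, 22],
    "Discrete": [8, 9, 15, *range(23, 42), 43],
    "Continuous": [1, 5, 6, 10, 11, 13, 16, 17, 18, 19],
}

counts = {name: len(indices) for name, indices in FEATURE_TYPES.items()}
all_indices = sorted(i for indices in FEATURE_TYPES.values() for i in indices)

print(counts)                              # {'Categorical': 4, 'Binary': 6, 'Discrete': 23, 'Continuous': 10}
print(all_indices == list(range(1, 44)))   # True: every feature appears exactly once
```
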
A breakdown of the possible values for the categorical features can be seen in the table below.
There are 3 possible Protocol Type values, 60 possible Service values, and 11 possible Flag
values.
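
Categorical values such as these usually have to be turned into numbers before a model can
use them. One common option, shown here as a sketch rather than as the author's method, is
one-hot encoding. The three Protocol Type values used below (tcp, udp, icmp) are an assumption
about the table's contents.

```python
# Assumed Protocol Type vocabulary; the order fixes the position of each value.
PROTOCOL_TYPES = ["tcp", "udp", "icmp"]


def one_hot(value, vocabulary):
    """Encode `value` as a list with a single 1 at its position in `vocabulary`."""
    if value not in vocabulary:
        raise ValueError(f"unknown categorical value: {value!r}")
    return [1 if value == item else 0 for item in vocabulary]


print(one_hot("udp", PROTOCOL_TYPES))  # [0, 1, 0]
```

The same function applies to Service and Flag once their vocabularies are listed, at the cost
of one output column per possible value.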

Unlike Protocol Type and Service, whose values are self-explanatory (they describe the
connection), Flag is not very easy to understand. The Flag feature describes the status of the
connection and whether a flag was raised or not. Each value in Flag represents a status a
connection had, and the explanations of each value can be found in the table below.

A description of each feature and a breakdown of the data set can be seen in the Google
spreadsheet here.
