
computers & security 82 (2019) 156–172


Flow-based network traffic generation using Generative Adversarial Networks

Markus Ring a,∗, Daniel Schlör b, Dieter Landes a, Andreas Hotho b

a Faculty of Electrical Engineering and Informatics, Coburg University of Applied Sciences and Arts, Coburg 96450, Germany
b Data Mining and Information Retrieval Group, University of Würzburg, Würzburg 97074, Germany

Article info

Article history:
Received 25 July 2018
Revised 21 November 2018
Accepted 23 December 2018
Available online 26 December 2018

Keywords: GANs, WGAN-GP, TTUR, NetFlow, Generation, IDS

Abstract

Flow-based data sets are necessary for evaluating network-based intrusion detection systems (NIDS). In this work, we propose a novel methodology for generating realistic flow-based network traffic. Our approach is based on Generative Adversarial Networks (GANs), which achieve good results for image generation. A major challenge lies in the fact that GANs can only process continuous attributes. However, flow-based data inevitably contain categorical attributes such as IP addresses or port numbers. Therefore, we propose three different preprocessing approaches for flow-based data in order to transform them into continuous values. Further, we present a new method for evaluating the generated flow-based network traffic which uses domain knowledge to define quality tests. We use the three approaches for generating flow-based network traffic based on the CIDDS-001 data set. Experiments indicate that two of the three approaches are able to generate high-quality data.

© 2018 Elsevier Ltd. All rights reserved.

1. Introduction

Detecting attacks within network-based traffic has been of great interest in the data mining community for decades. Recently, Buczak and Guven (2016) presented an overview of the community effort with regard to this issue. However, there are still open challenges (e.g., the high cost of false positives or the lack of publicly available labeled data sets) for the successful use of data mining algorithms for anomaly-based intrusion detection (Catania and Garino, 2012; Sommer and Paxson, 2010). In this work, we focus on a specific challenge within that setting.

Problem statement. For network-based intrusion detection, few labeled data sets are publicly available which contain realistic user behavior and up-to-date attack scenarios. Available data sets are often outdated or suffer from other shortcomings. Typically, network traffic is captured in packet-based or flow-based format. This work focuses on flow-based network traffic. Using real flow-based network traffic is problematic due to the missing ground truth. Since flow-based data sets contain millions up to billions of flows, manual labeling of real network traffic is difficult even for security experts and extremely time-consuming. As another disadvantage, real network traffic often cannot be shared within the community due to privacy concerns. However, labeled data sets are necessary for training supervised data mining methods (e.g., classification algorithms) and provide the basis for evaluating the performance of supervised as well as unsupervised anomaly-based intrusion detection methods.

Objective. Large training data sets with high variance can increase the robustness of anomaly-based intrusion detection

∗ Corresponding author.
E-mail addresses: [email protected] (M. Ring), [email protected] (D. Schlör), [email protected] (D. Landes), [email protected] (A. Hotho).
https://doi.org/10.1016/j.cose.2018.12.012

methods. Therefore, we intend to build a generative model which allows us to generate realistic flow-based network traffic. The generated data can be used to improve the training of anomaly-based intrusion detection methods as well as for their evaluation. To that end, we propose an approach that is able to learn the characteristics of collected network traffic and generates new flow-based network traffic with the same underlying characteristics.

Approach and contributions. Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) are a popular method to generate synthetic data by learning from a given set of input data. GANs consist of two neural networks, a generator network G and a discriminator network D. The generator network G is trained to generate synthetic data from noise. The discriminator network D is trained to distinguish generated synthetic data from real-world data. The generator network G is trained by the output signal gradient of the discriminator network D. G and D are trained iteratively until the generator network G is able to fool the discriminator network D. GANs achieve remarkably good results in image generation (Arjovsky et al., 2017; Isola et al., 2017; Ledig et al., 2017; Radford et al., 2016). Furthermore, GANs have also been used for generating text (Yu et al., 2017) or molecules (Preuer et al., 2018).

This work uses GANs to generate complete flow-based network traffic with all typical attributes. To the best of our knowledge, this is the first work that uses GANs for this purpose. GANs can only process continuous input attributes. This poses a major challenge since flow-based network data consist of continuous and categorical attributes (e.g., IP addresses or port numbers). Therefore, we analyze different preprocessing strategies to transform categorical attributes of flow-based network data into continuous attributes. The first method simply treats attributes like IP addresses and ports as numerical values. The second method creates binary attributes from categorical attributes. The third method uses IP2Vec (Ring et al., 2017a) to learn meaningful vector representations of categorical attributes. After preprocessing, we use Improved Wasserstein GANs (WGAN-GP) (Gulrajani et al., 2017) with the two time-scale update rule (TTUR) proposed by Heusel et al. (2017) to generate new flow-based network data based on the public CIDDS-001 (Ring et al., 2017b) data set. Then, we evaluate the quality of the generated data with several evaluation measures. The proposed approach is able to generate realistic flows but does not consider the temporal dependencies of flow sequences. As a consequence, this approach can be used to generate additional training data for intrusion detection methods which process flows individually, like Najafabadi et al. (2014, 2016) or Tran et al. (2012), and for all approaches which operate on data sets with no timestamps, like KDD CUP 99 (e.g., Wagner et al. (2011)) or NSL-KDD (e.g., Cao et al. (2016)). However, it cannot be used to generate additional training data for approaches which operate on time windows or sequences of flows, like Garcia et al. (2014) or Ring et al. (2018).

The paper has several contributions. The main contribution is the generation of flow-based network data using GANs. We propose three different preprocessing approaches and a new evaluation method which uses domain knowledge to evaluate the quality of generated data. In addition to that, we extend IP2Vec (Ring et al., 2017a) such that IP2Vec is able to learn similarities between the flow attributes bytes, packets, and duration.

Structure. The next section of the paper describes flow-based network traffic, GANs, and IP2Vec in more detail. In Section 3, we present three different approaches for transforming flow-based network data. An experimental evaluation of these approaches is given in Section 4 and the results are discussed in Section 5. Section 6 analyzes related work on network traffic generators and GANs applied to the domain of IT security. A summary and an outlook on future work conclude the paper.

Table 1 – Overview of typical attributes in flow-based data like NetFlow (Claise, 2004) or IPFIX (Claise, 2008). The third column provides the type of the attributes and the last column shows exemplary values for these attributes.

#   Attribute                Type                Example
1   date first seen          timestamp           2018-03-13 12:32:30.383
2   duration                 continuous          0.212
3   transport protocol       categorical         TCP
4   source IP address        categorical         192.168.100.5
5   source port              categorical         52128
6   destination IP address   categorical         8.8.8.8
7   destination port         categorical         80
8   bytes                    numeric             2391
9   packets                  numeric             12
10  TCP flags                binary/categorical  .A..S.

2. Foundations

This section starts with analyzing the underlying flow-based network traffic. Then, GANs are explained in more detail. Finally, we explain IP2Vec (Ring et al., 2017a), which is the basis of our third data transformation approach.

2.1. Flow-based network traffic

We focus on flow-based network traffic in unidirectional NetFlow format (Claise, 2004). Flows contain header information about network connections between two endpoint devices like servers, workstation computers, or mobile phones. Each flow is an aggregation of transmitted network packets which share some properties (Claise, 2008). Typically, all transmitted network packets with the same source IP address, source port, destination IP address, destination port, and transport protocol within a time window are aggregated into one flow. NetFlow (Claise, 2004) aggregates all network packets which share these five properties into one flow until an active or inactive timeout is reached. In order to consolidate contiguous streams, the aggregation of network packets stops if no further packet is received within a time window of α seconds (inactive timeout). The active timeout stops the aggregation of network packets after β seconds, even if further network packets are observed, to avoid overly long entries.

Table 1 shows the typical attributes of unidirectional NetFlow (Claise, 2004) data. NetFlow is heterogeneous data which consists of continuous, numeric, categorical, and binary attributes. Most attributes like IP addresses and ports are categorical. Further, there is a timestamp attribute (date first seen), a continuous attribute (duration), and numeric attributes like bytes or packets. We define the type of TCP flags as binary/categorical: TCP flags can either be interpreted as six binary attributes (e.g., isSYN flag, isACK flag, etc.) or as one categorical value.
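The attribute types above can be illustrated with a short sketch. The record layout and the helper name `parse_tcp_flags` are our own illustration (not part of the NetFlow format itself); the flag order follows the `.A..S.`-style display used in Table 1.

```python
# Hypothetical sketch of one unidirectional NetFlow record with the
# attributes from Table 1; field names are our own, not a NetFlow standard.
flow = {
    "date_first_seen": "2018-03-13 12:32:30.383",  # timestamp
    "duration": 0.212,                             # continuous
    "proto": "TCP",                                # categorical
    "src_ip": "192.168.100.5", "src_port": 52128,  # categorical
    "dst_ip": "8.8.8.8", "dst_port": 80,           # categorical
    "bytes": 2391, "packets": 12,                  # numeric
    "tcp_flags": ".A..S.",                         # binary/categorical
}

def parse_tcp_flags(flags: str) -> dict:
    """Interpret the six-character flag string (e.g. '.A..S.') as six
    binary attributes -- the 'binary' reading of the binary/categorical
    type, where a dot means the flag is not set."""
    names = ["isURG", "isACK", "isPSH", "isRES", "isSYN", "isFIN"]
    return {name: int(ch != ".") for name, ch in zip(names, flags)}

print(parse_tcp_flags(flow["tcp_flags"]))
```

The alternative, categorical reading would treat the whole string `.A..S.` as a single value.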

2.2. GANs

Discriminative models classify objects into predefined classes (Han et al., 2011) and are often used for intrusion detection (e.g., in Wagner et al. (2011), Beigi et al. (2014), or Stevanovic and Pedersen (2015)). In contrast to discriminative models, generative models are used to generate data like flow-based network traffic. Many generative models build on likelihood maximization for a parametric probability distribution. As the likelihood function is often unknown or the likelihood gradient is computationally intractable, some models like Deep Boltzmann Machines (Salakhutdinov and Larochelle, 2010) use approximations to solve this problem. Other models avoid this problem by not explicitly representing the likelihood. Generative Stochastic Networks, for example, learn the transition operation of a Markov chain whose stationary distribution estimates the data distribution. GANs avoid Markov chains by estimating the data distribution through a game-theoretic approach: the generator network G tries to mimic samples from the data distribution, while the discriminator network D has to differentiate real and generated samples. Both networks are trained iteratively until the discriminator D cannot distinguish real samples from generated samples any more. Besides computational advantages, the generator G is never updated with real samples. Instead, the generator network G is fed with an input vector of noise z. The generator is trained using only the discriminator's gradients through backpropagation. Therefore, it is less likely that the generator G overfits by memorizing and reproducing real samples. Fig. 1 illustrates the generation process.

Fig. 1 – Architecture of GANs.

Goodfellow et al. say that "another advantage of adversarial networks is that they can represent very sharp, even degenerate distributions" (Goodfellow et al., 2014), which is the case for some NetFlow attributes. However, the original (vanilla) GANs (Goodfellow et al., 2014) require the visible units to be differentiable, which is not the case for categorical attributes like IP addresses in NetFlow data. Further, vanilla GANs are sensitive to parameter tuning and their loss function often does not correlate with the quality of the generated data. Gulrajani et al. (2017) show that Wasserstein GANs (WGANs), besides other advantages, are capable of modeling discrete distributions over a continuous latent space and that their loss function correlates with the quality of generated data. In contrast to vanilla GANs, WGANs (Arjovsky et al., 2017) use the Earth Mover (EM) distance as a value function, replacing the classifying discriminator network with a critic network estimating the EM distance. While the original WGAN approach uses weight clipping to guarantee differentiability almost everywhere, Gulrajani et al. (2017) improve the training of WGANs by using a gradient penalty as a soft constraint to enforce the Lipschitz constraint. One research frontier in the area of GANs is to solve the issue of non-convergence (Goodfellow, 2016). Heusel et al. (2017) propose a two time-scale update rule (TTUR) for training GANs with arbitrary loss functions. The authors prove that TTUR converges under mild assumptions to a stationary local Nash equilibrium.

For those reasons, we use Improved Wasserstein Generative Adversarial Networks (WGAN-GP) (Gulrajani et al., 2017) with the two time-scale update rule (TTUR) from Heusel et al. (2017) in this work.

2.3. IP2Vec

2.3.1. Motivation of IP2Vec
IP2Vec (Ring et al., 2017a) is inspired by Word2Vec (Mikolov et al., 2013a; 2013b) and aims at transforming IP addresses into a continuous feature space R^m such that standard similarity measures can be applied. Ring et al. (2017a) show that IP2Vec is able to transform IP addresses to semantic vector representations in R^m based on their network behavior. The vector representations in Ring et al. (2017a) are used to distinguish clients from servers and infected hosts from non-infected hosts. Since the results from Ring et al. (2017a) are promising, we want to investigate the suitability of IP2Vec for our work. IP2Vec transforms IP addresses to vector representations in R^m using available context information from flow-based network traffic. IP addresses which appear frequently in similar contexts will be close to each other in the feature space R^m. More precisely, similar contexts imply to IP2Vec that the devices associated with these IP addresses establish similar network connections. Fig. 2 illustrates the basic idea.

Fig. 2 – Idea of IP2Vec.

Arrows in Fig. 2 denote network connections from three IP addresses, namely 192.168.20.1, 192.168.20.2, and 192.168.20.3. Colors indicate different services. Consequently, IP2Vec leads to the following result:

sim(192.168.20.1, 192.168.20.2) > sim(192.168.20.1, 192.168.20.3),   (1)

where sim(X, Y) is an arbitrary similarity function (e.g., cosine similarity) between the IP addresses X and Y. IP2Vec considers the IP addresses 192.168.20.1 and 192.168.20.2 as more similar than 192.168.20.1 and 192.168.20.3 because the IP addresses 192.168.20.1 and 192.168.20.2 refer to the same targets and use the same services. In contrast to that, the IP address 192.168.20.3 targets different servers and uses different services (e.g., SSH traffic).

2.3.2. Model
IP2Vec is based upon a fully connected neural network with a single hidden layer (see Fig. 3).

Fig. 3 – Architecture of the neural network used by IP2Vec.

The attributes extracted from flow-based network traffic constitute the neural network's input. These attributes (IP addresses, destination ports, and transport protocols) define the input vocabulary, which contains all IP addresses, destination ports, and transport protocols that appear in the flow-based data set. Since neural networks cannot be fed with categorical attributes, each value of our input vocabulary is represented as a one-hot vector. The length of the one-hot vector is equal to the size of the vocabulary. Each neuron in the input and output layer is assigned a specific value of the vocabulary (see Fig. 3). Let us assume the training data set contains 100,000 different IP addresses, 20,000 different destination ports, and 3 different transport protocols. Then, the size of the one-hot vector is 120,003 and only one component is 1, while all others are 0. Input and output layers comprise exactly the same number of neurons, which is equal to the size of the vocabulary. The output layer uses a softmax classifier which indicates, for each value of the vocabulary, the probability that it appears in the same flow (context) as the input value to the neural network. The softmax classifier (Buduma and Locascio, 2017) normalizes the output of all output neurons such that the sum of the outputs is 1. The number of neurons in the hidden layer is much smaller than the number of neurons in the input layer.

2.3.3. Training
The neural network is trained using captured flow-based network traffic. IP2Vec uses only the source IP address, destination IP address, destination port, and transport protocol of flows. Table 2 outlines the generation of training samples.

IP2Vec generates five training samples from each flow. Each training sample consists of an input value and an expected output value. In the first step, IP2Vec selects an input value for the training sample. The selected input value is highlighted with cyan background in Table 2. The expected output values for the corresponding input value are highlighted with gray background. Table 2 shows that IP2Vec generates three training samples where the source IP address is the input
Table 2 – Generation of training samples in IP2Vec (Ring et al., 2017a). Input values are highlighted with cyan background and expected output values are highlighted with gray background. The following abbreviations are used: src IP addr. (source IP address), dst IP addr. (destination IP address), dst port (destination port), proto (transport protocol).

value, one training sample where the destination port is the input value, and one training sample where the transport protocol is the input value.

In the training process, the neural network is fed with the input value and tries to predict the probabilities of the other values from the vocabulary. For training samples, the probability of the concrete output value is 1, and 0 for all other values. In general, the output layer indicates, for each value of the input vocabulary, the probability that it appears in the same flow as the given input value.

The network uses backpropagation for learning. This kind of training, however, could take a lot of time. Let us assume that the hidden layer comprises 32 neurons and the training data set encompasses one million different IP addresses and ports. This results in 32 million weights in each layer of the network. Consequently, training such a large neural network is going to be slow. To make things worse, a huge amount of training flows is required for adjusting that many weights and for avoiding overfitting. Consequently, we have to update millions of weights for millions of training samples. Therefore, IP2Vec attempts to reduce the training time by using Negative Sampling in a similar way as Word2Vec does (Mikolov et al., 2013a). In Negative Sampling, each training sample modifies only a small percentage of the weights, rather than all of them. More details on Negative Sampling may be found in Mikolov et al. (2013b).

2.3.4. Continuous representation of IP addresses
After the training phase, IP2Vec uses the weights of the hidden layer as m-dimensional vector representations of IP addresses. That means a 32-dimensional continuous representation of each IP address, transport protocol, and port is obtained if the hidden layer comprises 32 neurons.

Intuition. Why does this approach work? If two IP addresses refer to similar destination IP addresses, destination ports, and transport protocols, then the neural network needs to output similar results for these IP addresses. One way for the neural network to learn similar output values for different input values is to learn similar weights in the hidden layer of the network. Consequently, if two IP addresses exhibit similar network behavior, IP2Vec attempts to learn similar weights (which are the vectors of the target feature space R^m) in the hidden layer.

3. Transformation approaches

This section describes three different methods to transform the heterogeneous flow-based network data such that they may be processed by Improved Wasserstein Generative Adversarial Networks (WGAN-GP).

3.1. Preliminaries

In general, we use the same preprocessing steps in all three methods for the attributes date first seen, transport protocol, and TCP flags (see Table 1).

Usually, the concrete timestamp is marginal for generating realistic flow-based network data. Instead, many intrusion detection systems derive additional information from the timestamp, like "is today a working day or weekend day" or "does the event occur during typical working hours or at night". Therefore, we do not generate timestamps. Instead, we create two attributes, weekday and daytime. To be precise, we extract the weekday information of flows and generate seven binary attributes isMonday, isTuesday, and so on. Then, we interpret the daytime as seconds [0,86400) and normalize them to the interval [0,1]. We transform the transport protocol (see #3 in Table 1) to three binary attributes, namely isTCP, isUDP, and isICMP. The same procedure is followed for TCP flags (see #10 in Table 1), which are transformed to six binary attributes isURG, isACK, isPSH, isSYN, isRES, and isFIN.

3.2. Method 1 – numeric transformation

Although IP addresses and ports look like real numbers, they are actually categorical. Yet, the simplest approach is to interpret them as numbers after all and treat them as continuous attributes. We refer to this method as Numeric-based Improved Wasserstein Generative Adversarial Networks (short: N-WGAN-GP). This method transforms each octet of an IP address to the interval [0,1], e.g., 192.168.220.14 is transformed to four continuous attributes: (ip_1) 192/255 = 0.7529, (ip_2) 168/255 = 0.6588, (ip_3) 220/255 = 0.8627, and (ip_4) 14/255 = 0.0549. We follow a similar procedure for ports by dividing them by the highest port number, e.g., the source port 80 is transformed to one continuous attribute 80/65535 = 0.00122.
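The numeric transformation can be sketched in a few lines; the helper names below are our own illustration of the octet and port scaling described above.

```python
# Sketch of the numeric (N-WGAN-GP) preprocessing: IP octets and ports
# are reinterpreted as continuous values in [0, 1].
def numeric_ip(ip: str) -> list:
    """Map each octet of an IPv4 address to the interval [0, 1]."""
    return [int(octet) / 255 for octet in ip.split(".")]

def numeric_port(port: int) -> float:
    """Divide a port by the highest possible port number."""
    return port / 65535

print([round(v, 4) for v in numeric_ip("192.168.220.14")])
print(round(numeric_port(80), 5))
```

Running this reproduces the example values from the text (0.7529, 0.6588, 0.8627, 0.0549 for the four octets, and 0.00122 for port 80).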
Table 3 – Preprocessing of flow-based data. The first column provides the original flow attributes and exemplary values; the other columns show the extracted features (column Attr.) and the corresponding values (column Value) for each preprocessing method.

Attribute / Value: date first seen = 2018-05-28 11:39:23 (a Monday)
  All three methods: isMonday = 1; isTuesday ... isSunday = 0; daytime = 41963/86400 = 0.485

Attribute / Value: duration = 1.503
  N-WGAN-GP: norm_dur = (1.503 − dur_min)/(dur_max − dur_min)
  B-WGAN-GP: norm_dur = (1.503 − dur_min)/(dur_max − dur_min)
  E-WGAN-GP: dur_1 ... dur_m = embedding (e_1, ..., e_m)

Attribute / Value: transport protocol = TCP
  All three methods: isTCP = 1; isUDP = 0; isICMP = 0

Attribute / Value: IP address = 192.168.210.5
  N-WGAN-GP: ip_1 = 192/255 = 0.7529; ip_2 = 168/255 = 0.6588; ip_3 = 210/255 = 0.8235; ip_4 = 5/255 = 0.0196
  B-WGAN-GP: ip_1 to ip_8 = 1,1,0,0,0,0,0,0; ip_9 to ip_16 = 1,0,1,0,1,0,0,0; ip_17 to ip_24 = 1,1,0,1,0,0,1,0; ip_25 to ip_32 = 0,0,0,0,0,1,0,1
  E-WGAN-GP: ip_1 ... ip_m = embedding (e_1, ..., e_m)

Attribute / Value: port = 53872
  N-WGAN-GP: pt = 53872/65535 = 0.8220
  B-WGAN-GP: pt_1 to pt_8 = 1,1,0,1,0,0,1,0; pt_9 to pt_16 = 0,1,1,1,0,0,0,0
  E-WGAN-GP: pt_1 ... pt_m = embedding (e_1, ..., e_m)

Attribute / Value: bytes = 144
  N-WGAN-GP: norm_byt = (144 − byt_min)/(byt_max − byt_min)
  B-WGAN-GP: byt_1 to byt_24 = 0 (all); byt_25 to byt_32 = 1,0,0,1,0,0,0,0
  E-WGAN-GP: byt_1 ... byt_m = embedding (e_1, ..., e_m)

Attribute / Value: packets = 1
  N-WGAN-GP: norm_pck = (1 − pck_min)/(pck_max − pck_min)
  B-WGAN-GP: pck_1 to pck_24 = 0 (all); pck_25 to pck_32 = 0,0,0,0,0,0,0,1
  E-WGAN-GP: pck_1 ... pck_m = embedding (e_1, ..., e_m)

Attribute / Value: TCP flags = .A..S.
  All three methods: isURG = 0; isACK = 1; isPSH = 0; isRES = 0; isSYN = 1; isFIN = 0
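The binary encodings in Table 3 can be reproduced with a short sketch; the helper names are our own illustration of the 8-bit-per-octet, 16-bit port, and 32-bit counter encodings.

```python
# Sketch of the binary (B-WGAN-GP) encoding from Table 3.
def ip_to_bits(ip: str) -> list:
    """32 binary attributes: each IPv4 octet as 8 bits."""
    return [int(b) for octet in ip.split(".")
            for b in format(int(octet), "08b")]

def to_bits(value: int, width: int) -> list:
    """Fixed-width binary representation: 16 bits for ports,
    32 bits for bytes and packets."""
    return [int(b) for b in format(value, f"0{width}b")]

print(ip_to_bits("192.168.210.5")[:8])  # first octet: 192 -> 1,1,0,0,0,0,0,0
print(to_bits(53872, 16))               # the example port from Table 3
```

For the table's example IP address 192.168.210.5 this yields exactly the 32 bits listed under B-WGAN-GP, and 53872 yields 1101001001110000.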

The attributes duration, bytes, and packets (see attributes #2, #8, and #9 in Table 1) are normalized to the interval [0,1]. Table 3 provides examples and compares the three transformation methods.

3.3. Method 2 – binary transformation

The second method creates several binary attributes for IP addresses, ports, bytes, and packets. We refer to this method as Binary-based Improved Wasserstein Generative Adversarial Networks (short: B-WGAN-GP). Each octet of an IP address is mapped to an 8-bit binary representation. Consequently, IP addresses are transformed into 32 binary attributes, e.g., 192.168.220.14 is transformed to 11000000 10101000 11011100 00001110. Ports are converted to their 16-bit binary representation, e.g., the source port 80 is transformed to 00000000 01010000. For representing bytes and packets, we transform them to a binary representation as well and limit their length to 32 bits. The attribute duration is normalized to the interval [0,1]. Table 3 shows an example of this transformation procedure.

3.4. Method 3 – embedding transformation

The third method transforms IP addresses, ports, duration, bytes, and packets into so-called embeddings in an m-dimensional feature space R^m, following the ideas in Section 2.3. We refer to this method as Embedding-based Improved Wasserstein Generative Adversarial Networks (short: E-WGAN-GP).

E-WGAN-GP extends IP2Vec (see Section 2.3) for learning embeddings not only for IP addresses, ports, and transport protocols, but also for the attributes duration, bytes, and packets. To that end, the input vocabulary of IP2Vec is extended by the values of the latter three attributes and additional training pairs are extracted from each flow. Table 4 presents the extended training sample generation.

Table 4 – Extended generation of training samples in IP2Vec. Input values are highlighted with cyan background and expected output values are highlighted with gray background. The following abbreviations are used: src IP addr. (source IP address), dst IP addr. (destination IP address), dst port (destination port), proto (transport protocol).

input value   →  output value(s)
src IP addr.  →  dst IP addr.; src port; proto
dst IP addr.  →  src IP addr.; dst port; proto
src port      →  src IP addr.
dst port      →  dst IP addr.
bytes         →  packets; duration
packets       →  bytes; duration
duration      →  packets

Each flow produces 13 training samples, each of which consists of an input and an expected output value. The input values are highlighted with cyan background in Table 4. The expected output values for the corresponding input value are highlighted with gray background. Our adapted training sample generation extracts further training samples for the attributes bytes, packets, and duration. Further, we also create training pairs with the destination IP address as input. Ring et al. (2017a) argue that it is not necessary to extract training samples with destination IP addresses as input when working on unidirectional flows. Yet, in this case, IP2Vec does not learn meaningful representations for multicast and broadcast IP addresses, which only appear as destination IP addresses in flow-based network traffic. Table 3 shows the result of an exemplary transformation.

E-WGAN-GP maps flows to embeddings which need to be re-transformed to the original space after generation. To that end, generated values are replaced by the closest embeddings produced by IP2Vec. For instance, we calculate the cosine similarity between the generated output for the source IP address and all existing IP address embeddings generated by IP2Vec. Then, we replace the output with the IP address which has the highest similarity.

4. Experiments

This section provides an experimental evaluation of our three approaches N-WGAN-GP, B-WGAN-GP, and E-WGAN-GP for synthetic flow-based network traffic generation.

4.1. Data set

We use the publicly available CIDDS-001 data set (Ring et al., 2017b), which contains unidirectional flow-based network traffic as well as detailed information about the networks and IP addresses within the data set. Fig. 4 shows an overview of the emulated business environment of the CIDDS-001 data set. In essence, the CIDDS-001 data set contains four internal subnets which can be identified by their IP address ranges: a developer subnet (dev) with exclusively Linux clients, an office subnet (off) with exclusively Windows clients, a management subnet (mgt) with mixed clients, and a server subnet (srv). This additional knowledge facilitates the evaluation of the generated data (see Section 4.3).

The CIDDS-001 data set contains four weeks of network traffic. We consider only the network traffic which was captured at the network device within the OpenStack environment (see Fig. 4) and divide the network traffic into two parts: week1 and week2–4. The first two weeks contain normal user behavior and attacks, whereas week3 and week4 contain only normal user behavior and no attacks. We use this kind of splitting in order to obtain a large training data set (week2–4) for our generative models and simultaneously provide a reference data set (week1) which contains normal and malicious network behavior. Overall, week2–4 contains around 22 million flows and week1 contains around 8.5 million flows. We consider only the TCP, UDP, and ICMP flows and remove the 895 IGMP flows from the data set.

4.2. Definition of a baseline

As a baseline for our experiments, we build a generative model which creates new flows based on the empirical probability distribution of the input data. The baseline estimates the probability distribution for each attribute by counting occurrences in the input data. New flows are generated by drawing from the empirical probability distributions. Each attribute is drawn independently from the other attributes.

Fig. 4 – Overview of the simulated network environment from the CIDDS-001 data set (Ring et al., 2017b).

4.3. Evaluation methodology

Evaluation of generative models is challenging and an open research topic: Borji (2018) analyzed different evaluation measures for GANs. Images generated with GANs are often presented to human judges and evaluated by visual comparison. Another well-known evaluation measure for images is the Inception Score (IS) (Salimans et al., 2016). IS classifies generated images into 1000 different classes using the Inception Net v3 (Szegedy et al., 2016). IS, however, is not applicable in our scenario since the Inception Net v3 can only classify images, not flow-based network traffic.

In the IT security domain, there is neither consensus on how to evaluate network traffic generators, nor a standardized methodology (Molnár et al., 2013). Glasser and Lindauer (2013) discuss the problem of evaluating synthetic data. The authors conclude that, in the absence of a clear definition of realism, synthetic data will only be realistic in some limited and measurable dimensions. Therefore, Glasser and Lindauer use feedback from human domain experts to evaluate the quality of generated data for anomaly detection. Stiborek et al. (2015) use an anomaly score to evaluate their generated data. Siska et al. (2010) and Iannucci et al. (2017) build graphs and evaluate the diversity of the generated traffic by comparing the number of nodes and edges between generated and real network traffic. Other flow-based network traffic generators often focus on specific aspects in their evaluation, e.g. distributions of bytes or packets are compared with real NetFlow data in Sommers and Barford (2004) or Botta et al. (2012).

Since there is no single widely accepted evaluation methodology, we use several evaluation approaches to assess the quality of the generated data from different views. To evaluate the diversity and distribution of the generated data, we visualize attributes (see Section 4.4.2) and compute the Euclidean distances between generated and real flow-based network data (see Section 4.4.3). To evaluate the quality of the content and the relationships between attributes within a flow, we introduce domain knowledge checks (see Section 4.4.4) as a new evaluation method. This method builds on the basic idea of Glasser and Lindauer (2013): while they use feedback from human experts, the domain knowledge checks are automated test procedures on the basis of domain knowledge.

Fig. 5 – Temporal distribution of flows per hour.

4.4. Generation of flow-based network data

Now, we evaluate the quality of the data generated by the baseline (see Section 4.2), N-WGAN-GP, B-WGAN-GP, and E-WGAN-GP (see Section 3).

4.4.1. Parameter configuration
For all four approaches, we use week2–4 of the CIDDS-001 data set as training input and generate 8.5 million flows with each approach.

We configured N-WGAN-GP, B-WGAN-GP and E-WGAN-GP to use a feed-forward neural network as generator and discriminator. Furthermore, we used the default parameter configuration of Heusel et al. (2017) and trained the networks for 5 epochs. An epoch is one training iteration over the complete training data set (Buduma and Locascio, 2017). Consequently, each flow of the training data set is used five times for training the neural networks. We observed that a higher number of epochs neither increases the quality of the generated data nor reduces the loss values of the GANs. To identify a suitable number of neurons per hidden layer, we set up a small parameter study in which we varied the number of neurons from 8 to 192. We found that 80 neurons in each hidden layer were sufficient for B-WGAN-GP and E-WGAN-GP. Similar numbers of neurons (e.g., 64 or 96) in each hidden layer led to no significant changes in the quality of the generated data. For N-WGAN-GP, we set the number of neurons in the hidden layer to 24 since the numerical representation of flows is much smaller than for B-WGAN-GP or E-WGAN-GP.

Additionally, we have to learn embeddings for E-WGAN-GP in a preceding step. To that end, we configured IP2Vec to use 20 neurons in the hidden layer and trained the network like Ring et al. (2017a) for 10 epochs.

4.4.2. Visualization
Fig. 5 shows the temporal distribution of the generated flows and of the reference week week1. The y-axis shows the flows per hour as a percentage of total traffic, and the three lines represent the reference week (week1), the generated data of the baseline (baseline), and the generated data of the E-WGAN-GP approach (E-WGAN-GP). Since all three transformation approaches process the attribute date first seen in the same way, only E-WGAN-GP is included for the sake of brevity. E-WGAN-GP reflects the essential temporal distribution of flows. In the CIDDS-001 data set, the emulated users exhibit common behavior including lunch breaks and offline work, which results in temporally limited network activities and a jagged curve (e.g. around 12:00 on working days). In contrast to that, the curve of E-WGAN-GP is smoother than the curve of the original traffic week1.

In the following, we use different visualization plots in order to get a deeper understanding of the generated data.

Figs. 6 and 7 show the real distributions (first row) sampled from week1, the distributions generated by our maximum likelihood estimator baseline (second row), and the distributions generated by our WGAN-GP models using different data representations (third to fifth row). Each violin plot shows the data distribution of the attribute source port (Fig. 6) respectively the attribute destination IP address (Fig. 7) for the different source IP addresses grouped by their subnet (see Section 4.1). IP addresses from different subnets come along with different network behavior. For instance, IP addresses from the mgt subnet are typically clients which use services, while IP addresses from the srv subnet are servers which offer services. This knowledge was not explicitly modeled during data generation.
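The IP2Vec embeddings trained above are also used after generation: E-WGAN-GP maps each generated vector back to the vocabulary entry with the most similar embedding via cosine similarity (Section 3). A minimal sketch of this lookup, with function names and the dict-of-lists embedding store as our assumptions:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def closest_value(generated, embeddings):
    """Return the vocabulary entry (e.g. an IP address) whose embedding is
    most similar to the raw generator output vector."""
    return max(embeddings, key=lambda value: cosine_similarity(generated, embeddings[value]))
```

Replacing the raw output with its nearest embedding snaps every generated value back onto the training vocabulary, which is why E-WGAN-GP cannot emit previously unseen IP addresses.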

Fig. 6 – Distribution of the attribute source port for the subnets. The rows show in order: (1) data sampled from real data
(week 1) and data generated by (2) baseline, (3) N-WGAN-GP, (4) E-WGAN-GP and (5) B-WGAN-GP.

We will now briefly discuss the conditional distribution of source ports (Fig. 6). In the first row, we can clearly distinguish typical client-port (dev, mgt, off) and server-port (ext, srv) distributions. As expected, the maximum likelihood baseline is not able to capture the differences of the distributions depending on the subnet of the source IP address and models a distribution which is a combination of all five subnets from the input data. In contrast, B-WGAN-GP and E-WGAN-GP capture the conditional probability distributions for the source port given the subnet of the source IP address very well.

N-WGAN-GP is incapable of representing the distributions properly. Note that almost exclusively flows with external source IP addresses are generated in the selected samples. In-depth analysis of the generated data suggests that numeric representations fail to match the designated subnets exactly. As nearly all generated data is assigned to the ext subnet, it comes as no surprise that the distribution represents a combination of all five subnets from the input data for both source ports (Fig. 6) and destination IP addresses (Fig. 7).

For the attribute destination IP address, the distribution is a mixture of external and internal IP addresses for the dev, mgt

Fig. 7 – Distribution of the attribute destination IP address for the subnets. The rows show in order: (1) data sampled from real data (week 1) and data generated by (2) baseline, (3) N-WGAN-GP, (4) E-WGAN-GP and (5) B-WGAN-GP.

and off subnets (see reference week week1). This matches the user roles, surfing on the internet (external) as well as accessing internal services (e.g., printers). For external subnets, the destination IP address has to be within the internal IP address range. Traffic from external sources to external targets does not run through the simulated network environment of the CIDDS-001 data set. Consequently, there is no flow within the CIDDS-001 data set which has both a source IP address and a destination IP address from the ext subnet. This fact can be seen for week1 in Fig. 7, where flows which have their origin in the ext subnet only address a small range of destination IP addresses which reflect the range of internal IP addresses. E-WGAN-GP and B-WGAN-GP capture this property very well, while the baseline and N-WGAN-GP fail to capture it.

4.4.3. Euclidean distances
The second evaluation compares the distribution of the generated and real flow-based network data in each attribute independently. Therefore, we calculate Euclidean distances between the probability distributions of the generated data and the input flow-based network data (week2–4) in each attribute. We choose the Euclidean distance over the Kullback–Leibler divergence in order to avoid calculation

Table 5 – Euclidean distances between the training data (week2–4) and the generated flow-based network traffic in each
attribute.

Attribute Baseline N-WGAN-GP B-WGAN-GP E-WGAN-GP week1


duration 0.0002 0.4764 0.4764 0.0525 0.0347
transport protocol 0.0001 0.0014 0.0042 0.0015 0.0223
source IP address 0.0003 0.1679 0.0773 0.0988 0.1409
source port 0.0003 0.5658 0.0453 0.0352 0.0436
destination IP address 0.0003 0.1655 0.0632 0.1272 0.1357
destination port 0.0003 0.5682 0.0421 0.0327 0.0437
bytes 0.0002 0.5858 0.0391 0.0278 0.0452
packets 0.0004 1.0416 0.0578 0.0251 0.0437
TCP flags 0.0003 0.0217 0.0618 0.0082 0.0687

problems where the probability of generated data is zero. Table 5 highlights the results. We refrain from calculating the Euclidean distance for the attribute date first seen since exact matches of timestamps (considering seconds and milliseconds) do not make sense. At this point, we refer to Fig. 5, which analyzes the temporal distribution of the generated timestamps.

Network traffic is subject to concept drift, and exact reproduction of probability distributions is not desirable. This fact can be seen in Table 5, where the Euclidean distances between the probability distributions from week1 and week2–4 of the CIDDS-001 data set are between 0.02 and 0.14. Consequently, generated network traffic should have Euclidean distances to the training data similar to those of the reference week week1. However, it should be mentioned that there is no perfect distance value which indicates the correct amount of concept drift. The generated data of E-WGAN-GP tends to have similar distances to the training data (week2–4) as the reference data set week1. Table 5 shows that the baseline has the lowest distance to the training data in each attribute. The generated data of N-WGAN-GP differs considerably from the training data set in some attributes. This is because N-WGAN-GP often does not generate the exact values but a large number of new values. The binary approach B-WGAN-GP has small distances in most attributes (except for the attribute duration). This may be caused by the distribution of duration in the training data, as most flows in the training data set have very small values in this attribute. Further, the normalization of the duration to the interval [0,1] entails that almost all flows have very low values in this attribute. N-WGAN-GP and B-WGAN-GP tend to generate the smallest possible duration (0.000 seconds) for all flows.

4.4.4. Domain knowledge checks
We use domain knowledge checks to evaluate the intrinsic quality of the generated data. To that end, we derive several properties that generated flow-based network data need to fulfill in order to be realistic. We use the following seven heuristics as sanity checks:

• Test 1: If the transport protocol is UDP, the flow must not have any TCP flags.
• Test 2: The CIDDS-001 data set is captured within an emulated company network. Therefore, at least one IP address (source IP address or destination IP address) of each flow must be internal (starting with 192.168.XXX.XXX).
• Test 3: If the flow describes normal user behavior and the source port or destination port is 80 (HTTP) or 443 (HTTPS), the transport protocol must be TCP.
• Test 4: If the flow describes normal user behavior and the source port or destination port is 53 (DNS), the transport protocol must be UDP.
• Test 5: If a multi- or broadcast IP address appears in the flow, it must be the destination IP address.
• Test 6: If the flow represents a NetBIOS message (destination port is 137 or 138), the source IP address must be internal (192.168.XXX.XXX) and the destination IP address must be an internal broadcast (192.168.XXX.255).
• Test 7: TCP, UDP and ICMP packets have a minimum and maximum packet size. Therefore, we check the relationship between bytes and packets in each flow according to the following rule:

42 ∗ packets ≤ bytes ≤ 65,535 ∗ packets

Table 6 shows the results of checking the generated data against these rules.

The reference data set week1 achieves 100 percent in each test, which is not surprising since the data is real flow-based network traffic captured in the same environment as the training data set. The baseline approach does not capture dependencies between flow attributes and achieves worse results. This can especially be observed in Tests 1, 4, and 6. Since multi- and broadcast IP addresses appear only in the attribute destination IP address, the baseline cannot fail Test 5 and achieves 100 percent.

For our generative models, E-WGAN-GP achieves the best results on average. The usage of embeddings leads to more meaningful similarities within categorical attributes and facilitates the learning of interrelationships. Embeddings, however, also reduce the possible resulting space since no new values can be generated. B-WGAN-GP generates flows which achieve high accuracy in Tests 1–4. However, this approach shows weaknesses in Tests 5 and 6, where several internal relationships must be considered. The numerical approach N-WGAN-GP has the lowest accuracy in the tests. In particular, Test 4 shows that normalization of source port or destination port to a single continuous attribute is inappropriate.

Table 6 – Results of the domain knowledge checks in percentage. Higher values indicate better results.

Baseline N-WGAN-GP B-WGAN-GP E-WGAN-GP week1


Test 1 14.08 96.46 97.88 99.77 100.0
Test 2 81.26 0.61 98.90 99.98 100.0
Test 3 86.90 95.45 99.97 99.97 100.0
Test 4 15.08 7.14 99.90 99.84 100.0
Test 5 100.0 25.79 47.13 99.80 100.0
Test 6 0.07 0.00 40.19 92.57 100.0
Test 7 71.26 100.0 85.32 99.49 100.0

Straightforward mapping of 2^16 different port values to one continuous attribute leads to too many values for a good reconstruction. In contrast to that, the binary representation of B-WGAN-GP leads to better results in that test.

5. Discussion

Flow-based network traffic consists of heterogeneous data and GANs can only process continuous input values. To solve this problem, we analyze three methods to transform categorical into continuous attributes. The advantages and disadvantages of these approaches are discussed in the following.

N-WGAN-GP is a straightforward numeric method, but it leads to unwanted similarities between categorical values which are not similar in real data. For instance, this transformation approach assesses the IP addresses 192.168.220.10 and 191.168.220.10 as highly similar although the first IP address 192.168.220.10 is private and the second IP address 191.168.220.10 is public. Hence, the two addresses should be ranked as fairly dissimilar. Obviously, even small errors in the generation process can cause significant errors. This effect can be observed in Test 2 (see Table 6), where N-WGAN-GP has problems with the generation of private IP addresses. Instead, this approach often generates non-private IP addresses such as 191.168.X.X or 192.167.X.X. In image generation, the original application domain of GANs, small errors do not have serious consequences. A brightness of 191 instead of 192 in a generated pixel has nearly no effect on the image and the error is (normally) not visible to human eyes. Further, N-WGAN-GP normalizes the numeric attributes bytes and packets to the interval [0,1]. The generated data are then de-normalized using the original training data. Here, we can observe that real flows often have typical byte sizes like 66 bytes which are not matched exactly. This results in higher Euclidean distances in these attributes (see Table 5). Overall, the first method N-WGAN-GP does not seem to be suitable for generating realistic flow-based network traffic.

B-WGAN-GP extracts binary attributes from categorical attributes and converts numerical attributes to their binary representation. Using this transformation, additional structural information (e.g., subnet information) of IP addresses can be maintained. Further, B-WGAN-GP assigns larger value ranges to categorical values in the transformed space than N-WGAN-GP. While N-WGAN-GP uses a single continuous attribute to represent a source port, B-WGAN-GP uses 16 binary attributes for representation. These two aspects support B-WGAN-GP in generating better categorical values of a flow, as can be observed in the results of the domain knowledge checks (see e.g. Test 2 and Test 4 in Table 6). Further, Figs. 6 and 7 indicate that B-WGAN-GP captures the internal structure of the traffic very well even though it is less restricted than E-WGAN-GP with respect to the treatment of previously unseen values.

E-WGAN-GP learns embeddings for IP addresses, ports, bytes, packets, and duration. These embeddings are continuous vector representations and take contextual information into account. As a consequence, the generation of flows is less error-prone, as small variations in the embedding space generally do not change the outcome in input space much. For instance, if a GAN introduces a small error in IP address generation, it could find the embedding of the IP address 192.168.220.5 as nearest neighbor instead of the embedding of the expected IP address 192.168.220.13. Since both IP addresses are internal clients, the error has nearly no effect. As a consequence, E-WGAN-GP achieves the best results of the generative models in the evaluation. Yet, this approach (in contrast to N-WGAN-GP and B-WGAN-GP) cannot generate previously unseen values due to the embedding translation. This is not a problem for the attributes bytes, packets and duration. Given enough training data, embeddings for all (important) values of bytes, duration and packets are available. For example, consider the attribute bytes. We assume that the available embedding values b1, b2, b3, ..., bk−1, bk sufficiently cover the possible value range of the attribute bytes. As specific byte values have no particular meaning, we are only interested in the magnitude of the attribute. Therefore, non-existing values bx can be replaced with available embedding values without adversely affecting the meaning.

The situation may be different for IP addresses and ports. IP addresses represent hosts with a distinct, complex network behavior, for instance as a web server, printer, or Linux client. Generating new IP addresses goes along with the invention of a new host with new network behavior. To answer the question whether the generation of new IP addresses is necessary (or desired), the purpose for which the generated data shall be used later needs to be considered. If the training set comprises more than 10,000 or 100,000 different IP addresses, there is probably no need to generate new IP addresses for an IDS evaluation data set. However, this does not hold generally. Instead, one should ask the following two questions: (1) are there enough different IP addresses in the training data set and (2) is there a need to

generate previously unseen IP addresses? If previously unseen IP addresses are required, E-WGAN-GP is not suitable as transformation method; otherwise, E-WGAN-GP will generate better flows than all other approaches.

The situation for ports is similar to IP addresses. Generally, there are 65,536 different ports and most of these ports should appear in the training data set. Generating new port values is also associated with generating new behavior. If the training data set comprises SSH connections (port 22) and HTTP connections (port 80), but no FTP connections (ports 20 and 21), generators are not able to produce realistic FTP connections since they have never seen such connections. Since the network behavior of FTP differs greatly from SSH and HTTP, it does not make much sense to generate unseen service ports. However, the situation is different for typical client ports.

Generally, GANs capture the implicit conditional probability distributions very well, given that a proper data representation is chosen, which is the case for E-WGAN-GP and B-WGAN-GP (see Figs. 6 and 7). While the visual differences between binary and embedded data representations are subtle, the domain knowledge checks show larger quality differences. Overall, this analysis suggests that E-WGAN-GP and B-WGAN-GP are able to generate good flow-based network traffic. While E-WGAN-GP achieves better evaluation results, B-WGAN-GP is not limited in the value range and is able to generate previously unseen values.

6. Related work

This work targets the generation of flow-based network traffic using GANs. Therefore, we provide an overview of flow-based traffic generators before we review the general use of GANs in the application domain IT security.

6.1. Traffic generators

Molnár et al. (2013) give a comprehensive overview of network traffic generators, categorize them with respect to their purposes, and analyze the validation measures used. The authors conclude that there is no consensus on how to validate network traffic generators. Since the proposed approach aims to generate flow-based network traffic, the following overview considers primarily flow-based network traffic generators. Neither approaches which emulate computer networks and capture their network traffic, like Ring et al. (2017b) or Shiravi et al. (2012), nor static intrusion detection data sets like DARPA 98 or KDD CUP 99 will be considered in the following. We categorize flow-based network traffic generators into (I) Replay Engines, (II) Maximum Throughput Generators, (III) Attack Generators, and (IV) High-Level Generators.

Category (I) As the name suggests, Replay Engines use previously captured network traffic and replay the packets from it. Often, the aim of Replay Engines is to preserve the original inter-packet time (IPT) behavior between the network packets. TCPReplay (Turner) and TCPivo (Feng et al., 2003) are well-known representatives of this category. Since network traffic is subject to concept drift, replaying already known network traffic only makes limited sense for generating IDS evaluation data sets. Instead, a good network traffic generator for our purpose should be able to generate new synthetic flow-based network traffic.

Category (II) Maximum Throughput Generators usually aim to test end-to-end network performance (Molnár et al., 2013). Iperf (Jon et al.) is such a generator and can be used for testing bandwidth, delay jitter, and loss ratio characteristics. Consequently, methods from this category primarily aim at evaluating network bandwidth performance.

Category (III) Attack Generators use real network traffic as input and combine it with synthetically created attacks. FLAME (Brauckhoff et al., 2008) is a generator for malicious network traffic. The authors use rule-based approaches to inject e.g. port scan attacks or denial of service attacks. Vasilomanolakis et al. (2016) present ID2T, a similar approach which combines real network traffic with synthetically created malicious network traffic. For creating malicious network traffic, the authors use rule-based scripts or manipulate parameters of the input network traffic. Sperotto et al. (2009) analyze ssh brute force attacks on flow level and use a Hidden Markov Model to model their characteristics. However, their model generates only the number of bytes, packets and flows during a typical attack scenario and does not generate complete flow-based data.

Category (IV) High-Level Generators aim to generate new synthetic network traffic which contains realistic network connections. Stiborek et al. (2015) propose three statistical methods to model host-based behavior on flow-based network level. The authors use real network traffic as input and extract typical inter- and intra-flow relations of host behavior. New flow-based data is generated based on a time-variant joint probability model which considers the extracted user behavior. Siska et al. (2010) propose a graph-based method to generate flow-based network traffic. The authors use real network traffic as input and extract traffic templates. Traffic templates are extracted for each service port (e.g., port 80 (HTTP) or 53 (DNS)) and contain structural properties as well as the value distributions of other flow attributes (e.g., log-normal distribution of transmitted bytes). These traffic templates can be combined with user-defined traffic templates. A flow generator selects flow attributes from the traffic templates and generates new network traffic. Iannucci et al. (2017) propose PGPBA and PGSK, two synthetic flow-based network traffic generators. Their generators are based on the graph generation algorithms Barabasi-Albert (PGPBA) and Kronecker (PGSK). The authors initialize their graph-based approaches with network traffic in packet-based format. When generating new traffic, the authors first compute the probability of the attribute bytes. All other attributes of flow-based data are calculated based on the conditional probability of the attribute bytes. To evaluate the quality of their generated traffic, Iannucci et al. (2017) analyze the degree and pagerank distributions of their graphs to show the veracity of the generated data.

The approach presented here does not simply replay existing network traffic like category (I). In fact, traffic generators from the first two categories have a different objective. Our approach belongs to category (IV) and generates new synthetic

network traffic and is not limited to generating only malicious resolution in image processing. Despite they are combining
network traffic like category (III). While Siska et al. (2010) and generative adversarial networks with the generation of fine-
Iannucci et al. (2017) use domain knowledge to generate flows grained network traffic, their approach is very different to ours
by defining conditional dependencies between flow attributes, since they only work with traffic as aggregated continuous at-
we use GAN-based approaches which learn all dependencies tribute, not with network data at flow-level and rely on coarse-
between the flow attributes inherently. grained information as input.
As can be seen in this section, GANs are already used in the
6.2. GANs domain IT security and prove their general suitability. How-
ever, existing works are only applied to specific application
This section analyses how GANs were recently introduced in scenarios and consider only continuous attributes. In contrast
the domain IT security. A more general discussion about at- to that, the proposed approach aims to generate network data
tacks and defenses for deep learning against adversarial ex- in standard flow-based NetFlow format and considers all typi-
amples may be found in Yuan et al. (2017). Rigaki and Garcia cal categorical attributes like IP addresses or port numbers.
(2016) use a GAN based approach to modify malware commu-
nication in order to avoid detection. The authors evaluate their
method using an Intrusion Prevention System (IPS) which is
based on a Markov model. The IPS considers the bytes, dura- 7. Summary
tion and time-delta of flows for determining malicious network
traffic. Therefore, Rigaki and Garcia use a GAN which learns Labeled flow-based data sets are necessary for evaluating and
to imitate Facebook chat traffic characteristics based on these comparing anomaly-based intrusion detection methods. Eval-
flow attributes. For capturing the time-delta of the flows, the uation data sets like DARPA 98 and KDD Cup 99 cover several
generator and discriminator of the GAN are Recurrent Neu- attack scenarios as well as normal user behavior. These data
ronal Networks (RNNs). After the training phase, the authors sets, however, were captured at some point in time such that
use the GAN to generate legitimate Facebook traffic character- concept drift of network traffic causes static data sets to be-
istics and adapt the malware to match these traffic patterns. come obsolete sooner or later.
Following this approach, the malware is able to successfully In this paper, we proposed three synthetic flow-based net-
bypass the IPS. Hu and Tan (2017) present a GAN based ap- work traffic generators which are based on Improved Wasser-
proach named MalGAN in order to generate synthetic mal- stein GANs (WGAN-GP) (Gulrajani et al., 2017) using the two
ware examples which are able to bypass anomaly-based de- time scale update rule from (Heusel et al., 2017). Our genera-
tection methods. Malware examples are represented as 160- tors are initialized with flow-based network traffic and then
dimensional binary attributes. generate new synthetic flow-based network traffic. In con-
Anderson et al. (2016) developed a character-based GAN trast to previous high-level generators, our GAN-based ap-
to mimic domain generation algorithms (DGA) as used by proaches learn all internal dependencies between attributes
malware to contact command and control servers. The authors train an auto-encoder to generate domain names and reassemble encoder and decoder into an adversarial setting that fools DGA-detection classifiers. DeepDGA generates domain names, which are categorical data; however, since the domain name is the only attribute generated, their setting is hardly comparable to our flow-based network data generation task.

Yin et al. (2018) propose Bot-GAN, a framework which generates synthetic network data in order to improve botnet detection methods. However, their framework does not consider the generation of categorical attributes like IP addresses and ports, which is one of the key contributions of our work.

Zheng et al. (2018) use a generative adversarial network based approach for fraud detection in bank transfers. To be precise, the authors use a deep denoising autoencoder and two Gaussian Mixture Models (GMMs). The encoder and one GMM act as discriminator, and the decoder acts as generator. The second GMM, in combination with a threshold, classifies the bank transfers into the classes normal or fraud. Zheng et al. achieve good results with this approach and are able to beat non-GAN-based approaches. However, their input data differ significantly from flow-based data and consist primarily of continuous attributes like the amount of transferred money, the balance of the account, or the frequency of transfers.

For the analysis of mobile traffic, Zhang et al. (2017) propose ZipNet-GAN, a GAN-based approach for fine-grained pattern extraction from coarse-grained network data, similar to super-

inherently and no additional knowledge has to be modeled. Flow-based network traffic consists of heterogeneous data, but GANs can only process continuous input data. To overcome this challenge, we proposed three different methods to handle flow-based network data. In the first approach, N-WGAN-GP, we interpreted IP addresses and ports as continuous input values and normalized numeric attributes like bytes and packets to the interval [0,1]. In the second approach, B-WGAN-GP, we created binary attributes from categorical and numerical attributes. For instance, we converted ports to their 16-bit binary representation and extracted 16 binary attributes. B-WGAN-GP is able to retain more information (e.g., subnet information of IP addresses) from the categorical input data. The third approach, E-WGAN-GP, learns meaningful continuous representations of categorical attributes like IP addresses using IP2Vec (Ring et al., 2017a). The preprocessing of E-WGAN-GP is inspired by the text mining domain, which also has to deal with non-continuous input values. Then, we generated new flow-based network traffic based on the CIDDS-001 data set (Ring et al., 2017b) in an experimental evaluation. Our experiments indicate that especially E-WGAN-GP is able to generate realistic data which achieves good evaluation results. B-WGAN-GP achieves similarly good results and, in contrast to E-WGAN-GP, is able to create new (unseen) values. The quality of network data generated by N-WGAN-GP is less convincing, which indicates that a straightforward numeric transformation is not appropriate.
Our research indicates that GANs are well suited for generating flow-based network traffic. We plan to extend our approach in order to generate sequences of flows instead of individual flows. Therefore, we want to evaluate further network structures (e.g., LSTMs or CNNs) which are able to learn temporal relationships of flow sequences. In addition, we want to work on the development of further evaluation methods.

Acknowledgments

M.R. was supported by the BayWISS Consortium Digitization. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

Supplementary material

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.cose.2018.12.012.

R E F E R E N C E S

Anderson HS, Woodbridge J, Filar B. DeepDGA: adversarially-tuned domain generation and detection. In: Proceedings of the 2016 ACM workshop on artificial intelligence and security. ACM; 2016. p. 13–21.

Arjovsky M, Chintala S, Bottou L. Wasserstein generative adversarial networks. In: Proceedings of the international conference on machine learning (ICML); 2017. p. 214–23.

Beigi EB, Jazi HH, Stakhanova N, Ghorbani AA. Towards effective feature selection in machine learning-based botnet detection approaches. In: Proceedings of the IEEE conference on communications and network security. IEEE; 2014. p. 247–55.

Borji A. Pros and cons of GAN evaluation measures. arXiv preprint 2018. arXiv:1802.03446.

Botta A, Dainotti A, Pescapé A. A tool for the generation of realistic network workload for emerging networking scenarios. Comput Netw 2012;56(15):3531–47.

Brauckhoff D, Wagner A, May M. FLAME: a flow-level anomaly modeling engine. In: Proceedings of the workshop on cyber security experimentation and test (CSET). USENIX Association; 2008. p. 1:1–1:6.

Buczak AL, Guven E. A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Commun Surv Tutor 2016;18(2):1153–76.

Buduma N, Locascio N. Fundamentals of deep learning: designing next-generation machine intelligence algorithms. O'Reilly Media; 2017.

Cao VL, Nicolau M, McDermott J. A hybrid autoencoder and density estimation model for anomaly detection. In: Proceedings of the international conference on parallel problem solving from nature. Springer; 2016. p. 717–26.

Catania CA, Garino CG. Automatic network intrusion detection: current techniques and open issues. Comput Electr Eng 2012;38(5):1062–72.

Claise B. Cisco Systems NetFlow services export version 9. RFC 3954; 2004.

Claise B. Specification of the IP flow information export (IPFIX) protocol for the exchange of IP traffic flow information. RFC 5101; 2008.

Feng Wc, Goel A, Bezzaz A, Feng WC, Walpole J. TCPivo: a high-performance packet replay engine. In: Proceedings of the ACM workshop on models, methods and tools for reproducible network research. ACM; 2003. p. 57–64.

Garcia S, Grill M, Stiborek J, Zunino A. An empirical comparison of botnet detection methods. Comput Secur 2014;45:100–23.

Glasser J, Lindauer B. Bridging the gap: a pragmatic approach to generating insider threat data. In: Proceedings of the security and privacy workshops (SPW). IEEE; 2013. p. 98–104.

Goodfellow I. NIPS 2016 tutorial: generative adversarial networks. arXiv preprint 2016. arXiv:1701.00160.

Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. In: Proceedings of the advances in neural information processing systems (NIPS); 2014. p. 2672–80.

Gulrajani I, Ahmed F, Arjovsky M, Dumoulin V, Courville AC. Improved training of Wasserstein GANs. In: Proceedings of the advances in neural information processing systems (NIPS); 2017. p. 5769–79.

Han J, Pei J, Kamber M. Data mining: concepts and techniques. 3rd ed. Elsevier; 2011.

Heusel M, Ramsauer H, Unterthiner T, Nessler B, Hochreiter S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Proceedings of the advances in neural information processing systems (NIPS); 2017. p. 6629–40.

Hu W, Tan Y. Generating adversarial malware examples for black-box attacks based on GAN. arXiv preprint 2017. arXiv:1702.05983.

Iannucci S, Kholidy HA, Ghimire AD, Jia R, Abdelwahed S, Banicescu I. A comparison of graph-based synthetic data generators for benchmarking next-generation intrusion detection systems. In: Proceedings of the IEEE international conference on cluster computing (CLUSTER). IEEE; 2017. p. 278–89.

Isola P, Zhu JY, Zhou T, Efros AA. Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). IEEE; 2017. p. 5967–76.

Jon D, Seth E, Bruce MA, Jeff P, Kaustubh P. Iperf: the TCP/UDP bandwidth measurement tool. (Date last accessed 14-June-2018).

Ledig C, Theis L, Huszár F, Caballero J, Cunningham A, Acosta A, Aitken A, Tejani A, Totz J, Wang Z, Shi W. Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). IEEE; 2017. p. 105–14.

Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint 2013a. arXiv:1301.3781.

Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In: Proceedings of the advances in neural information processing systems (NIPS); 2013b. p. 3111–19.

Molnár S, Megyesi P, Szabo G. How to validate traffic generators? In: Proceedings of the IEEE international conference on communications workshops (ICC). IEEE; 2013. p. 1340–4.

Najafabadi MM, Khoshgoftaar TM, Kemp C, Seliya N, Zuech R. Machine learning for detecting brute force attacks at the network level. In: Proceedings of the IEEE international conference on bioinformatics and bioengineering (BIBE). IEEE; 2014. p. 379–85.

Najafabadi MM, Khoshgoftaar TM, Napolitano A, Wheelus C. RUDY attack: detection at the network level and its important features. In: Proceedings of the international Florida artificial intelligence research society conference; 2016. p. 288–93.

Preuer K, Renz P, Unterthiner T, Hochreiter S, Klambauer G. Fréchet ChemblNet distance: a metric for generative models for molecules. CoRR 2018;abs/1803.09518.

Radford A, Metz L, Chintala S. Unsupervised representation learning with deep convolutional generative adversarial networks. In: Proceedings of the international conference on learning representations (ICLR); 2016.
Rigaki M, Garcia S. Bringing a GAN to a knife-fight: adapting malware communication to avoid detection. In: Proceedings of the first deep learning and security workshop, San Francisco, USA; 2018.

Ring M, Landes D, Dallmann A, Hotho A. IP2Vec: learning similarities between IP addresses. In: Proceedings of the workshop on data mining for cyber security (DMCS), international conference on data mining. IEEE; 2017a. p. 657–66.

Ring M, Landes D, Hotho A. Detection of slow port scans in flow-based network traffic. PLOS ONE 2018;13(9):1–18. doi:10.1371/journal.pone.0204507.

Ring M, Wunderlich S, Grüdl D, Landes D, Hotho A. Flow-based benchmark data sets for intrusion detection. In: Proceedings of the European conference on cyber warfare and security (ECCWS). ACPI; 2017b. p. 361–9.

Salakhutdinov R, Larochelle H. Efficient learning of deep Boltzmann machines. In: Proceedings of the international conference on artificial intelligence and statistics; 2010. p. 693–700.

Salimans T, Goodfellow I, Zaremba W, Cheung V, Radford A, Chen X. Improved techniques for training GANs. In: Proceedings of the advances in neural information processing systems (NIPS); 2016. p. 2234–42.

Shiravi A, Shiravi H, Tavallaee M, Ghorbani AA. Toward developing a systematic approach to generate benchmark datasets for intrusion detection. Comput Secur 2012;31(3):357–74.

Siska P, Stoecklin MP, Kind A, Braun T. A flow trace generator using graph-based traffic classification techniques. In: Proceedings of the international wireless communications and mobile computing conference (IWCMC). ACM; 2010. p. 457–62. doi:10.1145/1815396.1815503.

Sommer R, Paxson V. Outside the closed world: on using machine learning for network intrusion detection. In: Proceedings of the IEEE symposium on security and privacy. IEEE; 2010. p. 305–16.

Sommers J, Barford P. Self-configuring network traffic generation. In: Proceedings of the ACM internet measurement conference (ACM IMC). ACM; 2004. p. 68–81.

Sperotto A, Sadre R, de Boer PT, Pras A. Hidden Markov model modeling of SSH brute-force attacks. In: Proceedings of the international workshop on distributed systems: operations and management. Springer; 2009. p. 164–76.

Stevanovic M, Pedersen JM. An analysis of network traffic classification for botnet detection. In: Proceedings of the IEEE international conference on cyber situational awareness, data analytics and assessment (CyberSA). IEEE; 2015. p. 1–8.

Stiborek J, Rehák M, Pevný T. Towards scalable network host simulation. In: Proceedings of the international workshop on agents and cybersecurity; 2015. p. 27–35.

Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016. p. 2818–26.

Tran QA, Jiang F, Hu J. A real-time NetFlow-based intrusion detection system with improved BBNN and high-frequency field programmable gate arrays. In: Proceedings of the international conference on trust, security and privacy in computing and communications. IEEE; 2012. p. 201–8.

Turner A. Tcpreplay. (Date last accessed 14-June-2018).

Vasilomanolakis E, Cordero CG, Milanov N, Mühlhäuser M. Towards the creation of synthetic, yet realistic, intrusion detection datasets. In: Proceedings of the IEEE network operations and management symposium (NOMS). IEEE; 2016. p. 1209–14.

Wagner C, François J, Engel T, et al. Machine learning approach for IP-flow record anomaly detection. In: Proceedings of the international conference on research in networking. Springer; 2011. p. 28–39.

Yin C, Zhu Y, Liu S, Fei J, Zhang H. An enhancing framework for botnet detection using generative adversarial networks. In: Proceedings of the international conference on artificial intelligence and big data (ICAIBD); 2018. p. 228–34. doi:10.1109/ICAIBD.2018.8396200.

Yu L, Zhang W, Wang J, Yu Y. SeqGAN: sequence generative adversarial nets with policy gradient. In: Proceedings of the conference on artificial intelligence (AAAI). AAAI Press; 2017. p. 2852–8.

Yuan X, He P, Zhu Q, Bhat RR, Li X. Adversarial examples: attacks and defenses for deep learning. arXiv preprint 2017. arXiv:1712.07107.

Zhang C, Ouyang X, Patras P. ZipNet-GAN: inferring fine-grained mobile traffic patterns via a generative adversarial neural network. In: Proceedings of the thirteenth international conference on emerging networking experiments and technologies. ACM; 2017. p. 363–75.

Zheng YJ, Zhou XH, Sheng WG, Xue Y, Chen SY. Generative adversarial network based telecom fraud detection at the receiving bank. Neural Netw 2018;102:78–86.

Markus Ring is a research associate at Coburg University of Applied Sciences and Arts where he is working on his doctoral thesis. He previously studied Informatics at Coburg and worked as a network administrator at T-Systems Enterprise GmbH. His research interests include the generation of realistic flow-based network data and the application of data-mining methods for intrusion detection.

Daniel Schlör is a Ph.D. student at the Department of Computer Science at the University of Würzburg, Germany. He is also a member of the junior research group Computational Literary Stylistics at the Department of Computational Philology. His research interests include data and text mining, machine learning and general applications of computer science methods in the field of digital humanities.

Dieter Landes is a professor of software engineering and database systems at Coburg University of Applied Sciences and Arts. He holds a diploma in Informatics from the University of Erlangen-Nuremberg, and a doctorate in Knowledge-Based Systems from the University of Karlsruhe. After several years working in industry, including time with Daimler Research, he joined Coburg in 1999. He has published 70 papers in journals, books, and at conferences. His research interests include requirements engineering, software-engineering education, learning analytics, and data mining.

Andreas Hotho is a professor at the University of Würzburg. He holds a Ph.D. from the University of Karlsruhe, where he worked from 1999 to 2004 at the Institute of Applied Informatics and Formal Description Methods (AIFB) in the areas of text, data, and web mining, semantic web and information retrieval. From 2004 to 2009 he was a senior researcher at the University of Kassel. He joined the L3S in 2011. Since 2005 he has been leading the development of the social bookmark and publication sharing platform BibSonomy. Andreas Hotho has published over 100 articles in journals and at conferences, co-edited several special issues and books, and co-chaired several workshops. He worked as a reviewer for journals and was a member of many international conference and workshop program committees. His general research area is Data Science, with a focus on the combination of data mining, information retrieval and the semantic web. More specifically, he is interested in the analysis of social media systems, especially tagging, sensor data emerging through ubiquitous and social activities, security, and the application of text mining to historic literature.