
Electric Power Systems Research 130 (2016) 139–147


Pattern-based local linear regression models for short-term load forecasting

Grzegorz Dudek*
Department of Electrical Engineering, Czestochowa University of Technology, Al. Armii Krajowej 17, 42-200 Czestochowa, Poland
* Corresponding author. Tel.: +48 343250896; fax: +48 343250803. E-mail address: [email protected]

Article history: Received 2 March 2015; Received in revised form 2 September 2015; Accepted 4 September 2015

Keywords: Linear regression; Partial least-squares regression; Patterns of seasonal cycles; Short-term load forecasting; Time series

Abstract

In this paper univariate models for short-term load forecasting based on linear regression and patterns of daily cycles of load time series are proposed. The patterns used as input and output variables simplify the forecasting problem by filtering out the trend and the seasonal variations of periods longer than the daily one. The nonstationarity in mean and variance is also eliminated. The simplified relationship between variables (patterns) is modeled locally in the neighborhood of the current input using linear regression. The load forecast is constructed from the forecasted output pattern and the current values of variables describing the load time series. The proposed stepwise and lasso regressions reduce the number of predictors to a few. In the principal components regression and partial least-squares regression only one predictor is used. This allows us to visualize the data and the regression function. The performance of the proposed methods was compared with that of other models based on ARIMA, exponential smoothing, neural networks and the Nadaraya–Watson estimator. Application examples confirm the valuable properties of the proposed approaches and their high accuracy.

© 2015 Elsevier B.V. All rights reserved.

1. Introduction

Short-term load forecasting (STLF) is necessary for economic generation of power and for system security. It refers to forecasts of system load from hours to several days ahead. Accurate load forecasts lead to lower operating costs, which contributes to savings in electric utilities. STLF accuracy is also important for the deregulated electricity markets: the amount of energy which the utility has to buy or sell in the real-time market at unfavorable prices depends on the forecast error. Thus STLF is a very important problem for electric utilities, regional transmission organizations, energy suppliers and financial institutions. This is reflected in the literature by the many forecasting methods that have been applied, including conventional methods and new computational intelligence and machine learning methods. The large research activity in the field of STLF is related to the problem complexity: the load time series is nonstationary in mean and variance, and exhibits a trend, multiple seasonal variations (daily, weekly and annual) and random noise. In addition, load is affected by many external factors such as weather, time, demography, economy, electricity prices, geographical conditions, and consumer types and their habits.

Among the conventional STLF methods the most commonly employed are the Holt–Winters exponential smoothing (ES) and the autoregressive integrated moving average (ARIMA) models [1]. In ES the time series is decomposed into a trend component (expressed by level and growth terms) and seasonal components, which can be combined additively or multiplicatively. ES allows us to model nonlinear and heteroscedastic time series, but exogenous variables cannot be introduced into the model. Other important disadvantages of ES are overparameterization and a large number of starting values to estimate. In [2] a new ES formulation called parsimonious seasonal ES was proposed to reduce the dimension of the model, but there are still dozens or hundreds of terms to initialize and update. The recently developed exponentially weighted methods in application to STLF are presented in [3].

ARIMA processes are a very rich class of models and allow us to model multiple seasonal cycles. In practice the stochastic nature of load is often modeled with seasonal ARIMA models. A disadvantage of ARIMA models is their linear nature. The order selection process of ARIMA is usually considered subjective and difficult to apply, which is a main obstacle to using these models. To simplify the forecasting problem the time series is often decomposed. The components (trend, seasonal components and an irregular component), showing less complexity than the original series, are modeled independently (e.g. [4]). Another time series decomposition method, using the lifting scheme (the second-generation wavelet transform), was described in [5].

The most popular computational intelligence methods applied in STLF are neural networks. They have many attractive features, such as the universal approximation property, learning capabilities, massive parallelism, robustness in the presence of noise, and fault tolerance. The drawbacks of neural networks include disruptive and unstable training, difficulty in matching the network structure to the problem complexity, weak extrapolation capacity and many parameters to estimate (hundreds of weights). Some examples of using neural networks in STLF are: [6], where the complexity of the network is controlled by the Bayesian approach; [7], where a new hybrid forecasting method composed of wavelet transform, multilayer perceptron and an evolutionary algorithm is proposed; [8], where a generic framework combining similar day selection, wavelet decomposition and multilayer perceptron is presented; and [9], where the neural network generates prediction intervals.

Another branch of computational intelligence, fuzzy logic, allows us to enter information through facts and rules formulated verbally by experts, describing the behavior of complex systems using linguistic expressions. With the help of fuzzy rules, imprecise, incomplete and ambiguous information can be introduced into STLF models. When it is difficult to gain knowledge directly from the experts, the neuro-fuzzy approach, which learns from examples, is applied to generate a set of if-then rules. But the neuro-fuzzy system structure is complex and the number of parameters is usually large (it depends on the problem dimensionality and complexity), so the learning is difficult and does not guarantee convergence to the global minimum. Examples of STLF models based on fuzzy logic are: [10], where a neuro-fuzzy system is used to adjust the results of load forecasting obtained by a radial basis function neural network; [11], where two neuro-fuzzy networks are proposed: a wavelet fuzzy neural network using fuzzified wavelet features as the inputs, and a fuzzy neural network employing the Choquet integral as the output; [12], where an integrated approach combining a self-organizing fuzzy neural network learning method with a bilevel optimization method is described; and [13], where the forecasting model combines fuzzy logic, wavelet transform and a neural network. Other useful computational intelligence tools for STLF are support vector machines (SVM) [14,15], ensembles of models [16,17] and artificial immune systems [18] (descriptions of more STLF models can be found on the website http://gdudek.el.pcz.pl/publications).

It is noteworthy that many of the STLF models developed in recent years are hybrid solutions. They combine data preprocessing methods (e.g. wavelet transform) with approximation methods (such as neural and neuro-fuzzy networks or SVM) and optimization or learning methods (e.g. evolutionary and swarm algorithms). The disadvantages of the above-mentioned complex forecasting models with many parameters are hard and time-consuming training, problems with generalization, unclear structure and uninterpretable parameters. Most often, time series with multiple seasonal cycles and trend, exhibiting nonstationarity in mean and variance, cannot be modeled directly, and additional treatments such as detrending, deseasonalization or decomposition are needed.

In contrast to the complex models commonly used in STLF, in this work simple methods of linear regression are proposed. The number of parameters is small and they can be estimated using a simple least-squares approach. The key element of the proposed methods is data preprocessing: defining patterns of seasonal cycles. This simplifies the STLF problem by eliminating nonstationarity and filtering out the trend and the seasonal cycles longer than the daily one.

The paper is organized in a theoretical and an empirical part. In the beginning the patterns of daily cycles of the load time series are defined. Then the main concepts of the linear regression models for STLF are introduced. In the last section real load data are used to provide examples of model building and forecasting in practice. The results of the proposed methods are compared to the results of other STLF methods: ARIMA, ES, the multilayer perceptron and the Nadaraya–Watson estimator.

2. Patterns of the time series seasonal cycles

Data preprocessing based on patterns simplifies the forecasting of time series with multiple seasonal cycles. In our case patterns of the daily cycles are introduced: the input patterns x and the output patterns y. The input pattern is a vector x = [x1 x2 ... xn]T ∈ X = Rn, representing the vector of loads in successive timepoints of the daily period: L = [L1 L2 ... Ln]T, where n = 24 for hourly load time series, n = 48 for half-hourly load time series and n = 96 for quarter-hourly load time series. The functions mapping the time series elements L into patterns depend on the time series (trend, seasonal variations), the forecast period and the horizon. They should maximize the model quality. In this study the input pattern xi, representing the ith daily period, is defined as follows:

x_{i,t} = \frac{L_{i,t} - \bar{L}_i}{\sqrt{\sum_{l=1}^{n} \left(L_{i,l} - \bar{L}_i\right)^2}},   (1)

where i = 1, 2, ..., N is the daily period number, N is the number of days in the time series, t = 1, 2, ..., n is the time series element number in the period i, Li,t is the tth time series element (load) in the period i, and L̄i is the mean load value in the period i.

According to definition (1), first we subtract the mean of the vector Li from its components and then we divide the resulting vector by its length. As a result we get normalized vectors xi with length 1, zero mean and the same variance. Note that the time series which is nonstationary in mean and variance is now represented by x-patterns having the same mean and variance. The trend and the additional seasonal variations (weekly and annual ones in our case) are filtered out. The x-patterns contain information only about the shapes of the daily curves.

Whilst the x-patterns represent the input variables (predictors), i.e. the loads for the day i, the y-patterns represent the output variables, i.e. the forecasted loads for the day i + τ, where τ is the forecast horizon in days. The components of the n-dimensional output pattern yi = [yi,1 yi,2 ... yi,n]T ∈ Y = Rn, representing the load vector Li+τ, are defined as follows:

y_{i,t} = \frac{L_{i+\tau,t} - \bar{L}_i}{\sqrt{\sum_{l=1}^{n} \left(L_{i,l} - \bar{L}_i\right)^2}},   (2)

where i = 1, 2, ..., N and t = 1, 2, ..., n.

This equation is similar to (1), but in this case we do not use the mean load of the day i + τ (L̄i+τ) in the numerator and \sqrt{\sum_{l=1}^{n} (L_{i+\tau,l} - \bar{L}_{i+\tau})^2} in the denominator, because these values are not known at the moment of forecasting. We use the known values of L̄i and \sqrt{\sum_{l=1}^{n} (L_{i,l} - \bar{L}_i)^2} instead. This is very important, because when the forecast of pattern yi is generated by the model we can determine the forecast of the vector Li+τ using the transformed Eq. (2):

\hat{L}_{i+\tau,t} = \hat{y}_{i,t} \sqrt{\sum_{l=1}^{n} \left(L_{i,l} - \bar{L}_i\right)^2} + \bar{L}_i,   (3)

where ŷi,t is the forecasted tth component of the pattern yi.

Note that L̄i and the value of the square root in (3) are known at the time of forecasting and can be used for decoding ŷi,t to get L̂i+τ,t. Note also that \sqrt{\sum_{l=1}^{n} (L_{i,l} - \bar{L}_i)^2} is the carrier of the dispersion of the current daily cycle. Using this square root in the denominator of (1) we unify the dispersion of the x-patterns. Using L̄i in the numerator of (1) we unify the level of the patterns. So the x-patterns are filtered versions of the daily curves. They carry information about the shape of the daily curves. This is shown in Fig. 1(a) and (b). It can be seen from these figures that the daily cycles of different days of the week δ ∈ {Monday, ..., Sunday} and from different periods of the year are represented by x-patterns having the same mean and variance. So we can simply compare the shapes of different days.

Fig. 1. Two fragments of the load time series (a) and their x-patterns (b) and y-patterns (c).

In the case of y-patterns, using transformation (2) we unify the patterns for each type of the day of the week δ separately. The patterns of different days can be incomparable. This is because in (2) we use the mean and dispersion of the ith day to encode Li+τ. So the y-patterns of Mondays for τ = 1 are located higher than the y-patterns of other days, because we use L̄i of Sundays in (2), which are usually lower than L̄i+τ of Mondays. For analogous reasons the y-patterns of Saturdays and Sundays are located at a lower level than the y-patterns of weekdays. This is shown in Fig. 1(c).

Because the position (level) of a y-pattern depends on the day type δ, the forecasting models are built for the particular day type using a training set Φ containing patterns corresponding to this day type. For example, when we build the model for Monday and for τ = 1 (next-day forecast), we learn it on the training set containing x-patterns of Sundays and y-patterns of the corresponding Mondays. If τ = 2 (two-days-ahead forecast), we use x-patterns of Saturdays and y-patterns of Mondays in the training set. This approach and the unification of the input and output variables using patterns simplify the forecasting model: we do not need to implement the weekly and annual cycles in it. The information about the position of the daily period in the weekly and annual cycles, which is contained in L̄i, is introduced into the forecast L̂i+τ,t in (3) by adding L̄i; the information about the current dispersion of the time series is introduced by multiplying ŷi,t by \sqrt{\sum_{l=1}^{n} (L_{i,l} - \bar{L}_i)^2}. So when we forecast the load time series using the pattern-based approach, first we filter out the information about the position of the days i and i + τ in the weekly and annual cycles ((1) and (2)). Then we build the model on patterns and we generate the forecast of the y-pattern. Finally, we introduce the information about the position of the forecasted day in the weekly and annual cycles using (3).
hours in the weekly and annual cycles using (3).
(b) 0.6 More functions defining patterns you can find in [19]. A fragment
of time series represented by x-patterns does not have to coincide
0.4 with the daily period (e.g. x-pattern can represent loads at hours 13
0.2 to 24 of the day i – 1 and hours 1 to 12 of the day i). It can include
several adjacent daily cycles (e.g. two days preceding the day of
0
forecast) or a part of one cycle (e.g. t = 1, 2, . . ., 12 h of the day i). It
x

-0.2 does not have to include the contiguous sequence of elements. We


can select elements to the input pattern, e.g. loads at hours 2, 5, 14
-0.4
and 22. We can also use the feature extraction methods to create
new pattern components from the original time series.
0 24 48 72 96 120 144 168
hours
3. Linear regression models for STLF
(c) 1.5
The relationship between x- and y-patterns can be nonlinear.
1 In our approach this function is approximated locally in the neigh-
borhood around the current input pattern for which we want to get
0.5 the forecast (we call this pattern a query pattern x*). By the neigh-
y

borhood of x* in the simplest case we mean the set of its k nearest


0
neighbors defined as the k x-patterns from the history which are
closest to x* in terms of Euclidean distance and which represent the
-0.5
same day type ı as x*. In the neighborhood of x* the target function
0 24 48 72 96 120 144 168 mapping x-patterns to y-patterns is less complex than in the entire
hours range of x variation. It is assumed that for small k this function can
be approximated locally using linear regression (in the experimen-
Fig. 1. Two fragments of the load time series (a) and their x-patterns (b) and y- tal part of the work it is assumed k = 12). To simplify the regression
patterns (c). model the problem of approximating the vector-valued function
g: X → Y is decomposed into a set of problems of approximating
(1) we unify the dispersion of x-patterns. Using L̄i in the numerator the scalar-valued functions gt : X → Yt , t = 1, 2, . . ., n. Now instead of
of (1) we unify the level of patterns. So the x-patterns are filtered multivariate linear regression model the multiple linear regression
versions of the daily curves. They carry information about the shape models can be used, one for each component of y.
of the daily curves. This is shown in Fig. 1(a) and (b). It can be seen The idea of the proposed pattern-based linear regression in Fig. 2
from these figures that the daily cycles of different days of the week is presented and summarized in the following steps:
ı ∈ {Monday, . . ., Sunday} and from different periods of the year are
represented by x-patterns having the same mean and variance. So 1. Mapping the original time series elements to patterns x and y
we can simply compare the shapes of different days. using (1) and (2).
In the case of y-patterns using transformation (2) we unify the 2. Selection of the k nearest neighbors of the query pattern x* and
patterns for each type of the day of the week ı separately. The pat- creation of the training set ˚ = {(xi , yi,t )}, where xi are the nearest
terns of different days can be incomparable. This is because in (2) neighbors of x* representing the same day of the week as x*.
we use the mean and dispersion of the ith day to encode Li+ . So the 3. Construction of the linear regression model M mapping X → Yt
y-patterns of Mondays for  = 1 are located higher than y-patterns based on ˚.

of other days because we use L̄i of Sundays in (2), which are usually 4. Determination of the forecasted yt value for x* using M.
 
lower than L̄i+ of Mondays. For analogous reasons y-patterns of 5. Decoding yt to get the forecast Lt using (3).
Saturdays and Sundays are located at a lower level than y-patterns
In step 1 the load time series is preprocessed. The input and output patterns are determined for the successive daily periods, up to the current period. Pattern xi represents the ith daily period from the history, and yi, paired with it, represents the (i + τ)th daily period. The query pattern x* represents the current daily period. When the day type of x* is δ and we forecast the load at hour t for the next day, we find the k patterns xi most similar to the pattern x* (step 2). We include these x-patterns and the tth components of the y-patterns paired with them in the training set Φ. Then, in step 3, the linear model is built. The estimation of the model parameters and the model optimization procedures (described in Sections 3.1–3.5) are performed on the training set Φ. The response of the model to the query pattern x* is the forecasted value of the tth component of the y-pattern, ŷt (step 4). In step 5, to get the forecasted load at hour t for the next day, we decode ŷt according to (3). For this we use the current, known parameters of the time series: L̄i and \sqrt{\sum_{l=1}^{n} (L_{i,l} - \bar{L}_i)^2}.

Fig. 2. The flowchart of pattern-based linear regression for STLF.
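Putting the two sketches together, an end-to-end next-day forecast of the load at one hour could look like this (illustrative only; L_days and day_type_idx are assumed inputs: an (N × n) load array and the indices of historical days with the required day type):

```python
# End-to-end use of the two sketches above for one day and one hour.
x_hist, y_hist, means, disps = encode_patterns(L_days, tau=1)
i = x_hist.shape[0] - 1                    # the current daily period
x_query = x_hist[i]
y_hat = forecast_component(x_hist[:i], y_hist[:i], x_query,
                           same_day_type=day_type_idx, t=12, k=12)
load_hour_12 = decode_forecast(y_hat, means[i], disps[i])   # Eq. (3)
```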

The proposed linear models do not include exogenous variables such as weather factors or the price of electricity. To take them into account, an additional model can be built that corrects the load forecast generated by the base linear model depending on the exogenous variables. This is the subject of future work. An example of such an approach is presented in [10]. Atypical days such as public holidays are not handled by the proposed models. This is because there is often no information about the atypical daily curve of the forecasted day in the previous day, which is represented by the x-pattern. So the proposed forecasting models cannot predict an atypical y-pattern using this x-pattern as input.

3.1. Multiple linear regression

The multiple linear regression (MLR [20]) model for STLF is of the form:

y = β0 + β1x1 + · · · + βnxn + ε,   (4)

where β0, β1, ..., βn are coefficients and ε is a random disturbance or error.

The coefficients are estimated using a least-squares fit. Notice that in the local approach the number of points used to build a model (k) can be less than their dimensionality and the number of free parameters of the model. In such a case the model is oversized: it has too many degrees of freedom in relation to the problem complexity, which is expressed by only a few training points. In m-dimensional space (in model (4) m = n + 1), we need at least m points to define a hyperplane. When m > k we get an infinite number of solutions of regression model (4), i.e. the least-squares coefficients βj are not uniquely defined.

It is also worth noticing that the components of x-patterns representing subsequent elements of the time series are usually strongly correlated (see Fig. 3). Correlations between predictors indicate that some of them are linear combinations of others (multicollinearity). Building a model on collinear predictors leads to imprecise estimates of the coefficients and obscures the importance of the predictors. If predictors carry similar information about the response variable, some of them can be ignored. To select the most informative predictors, the stepwise and lasso regression procedures described in Sections 3.2 and 3.3 are used, respectively. Another cure for collinearity is ridge regression, which reduces the absolute values of the coefficients. As a result their estimates have lower variance, which may lead to better prediction; however, this does not lead to dimensionality reduction. Yet another way to deal with collinearity and excessive dimensionality is the creation of new predictors combining the original ones. Two ways of extracting predictors, principal component regression and partial least-squares regression, are presented in Sections 3.4 and 3.5, respectively. The new predictors are uncorrelated, and fewer of them are needed to explain the variability of the response.

Fig. 3. The autocorrelation function of the load time series of the Polish power system.

3.2. Stepwise regression

Stepwise regression [21] is a method for selecting the best subset of predictors for the linear model by adding and removing predictors in a systematic way.
The linear model in this case is of the form:

y = β0 + w1β1x1 + · · · + wnβnxn + ε,   (5)

where w1, ..., wn are binary weights.

The goal is to find the binary weight vector w = [w1 w2 ... wn]T and the coefficient values βj for the predictors with nonzero weights. The criterion for adding or removing a predictor is based on the p-value of an F-test of the change in the sum of squared errors. When we analyze the introduction of the tth predictor into the model, the null hypothesis is that this predictor would have a zero weight if added to the model. If the null hypothesis is rejected, the predictor is added to the model (wt = 1). When we analyze the removal of the tth predictor from the model, the null hypothesis is that this predictor has a zero weight. If there is insufficient evidence to reject the null hypothesis, the predictor is removed from the model (wt = 0).

The selection procedure starts with an empty set of predictors in the model. At each step it adds the predictor with the smallest p-value, until there are no predictors with a p-value less than an entrance tolerance (0.05 was used). Then the elimination procedure is activated, which removes from the model the predictor with the largest p-value, if it is greater than an exit tolerance (0.1). The selection and elimination procedures are repeated alternately until there is no predictor to remove.
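A sketch of the selection phase is given below (our illustration of the partial F-test logic described above, with SciPy supplying the F distribution; the optional max_terms cap mirrors the limit of ten predictors used later in Section 4):

```python
import numpy as np
from scipy import stats

def forward_select(X, y, alpha_in=0.05, max_terms=None):
    """Forward phase of stepwise regression: repeatedly add the predictor
    with the smallest partial F-test p-value while it is below alpha_in."""
    N, n = X.shape
    selected = []

    def rss(cols):
        # residual sum of squares of an OLS fit with intercept + cols
        A = np.column_stack([np.ones(N)] + [X[:, c] for c in cols])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r = y - A @ beta
        return r @ r

    while True:
        base = rss(selected)
        best_p, best_j = 1.0, None
        for j in range(n):
            if j in selected:
                continue
            full = rss(selected + [j])
            df2 = N - len(selected) - 2   # residual dof of the larger model
            if df2 <= 0 or full <= 0:
                continue
            F = (base - full) / (full / df2)
            p = stats.f.sf(F, 1, df2)     # p-value of the partial F-test
            if p < best_p:
                best_p, best_j = p, j
        if best_j is None or best_p >= alpha_in:
            break
        selected.append(best_j)
        if max_terms and len(selected) == max_terms:
            break
    return selected
```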
Some other criteria can be used to add and remove predictors, such as [20]: R-squared, Mallows' Cp, the Akaike information criterion or the Bayes information criterion. The solutions generated by stepwise regression are suboptimal, and there is no guarantee that a different initial model or a different sequence of steps will not lead to a better model.

3.3. Regularized least-squares regressions

In regularized least-squares regression the minimized criterion is composed of the usual regression criterion and a penalty term dependent on the coefficient values. Thus the coefficients are shrunk toward zero. This can greatly reduce the variance, resulting in a better mean-squared error. Ridge regression estimates the coefficients by minimizing a criterion containing the sum of squared coefficients as a penalty term [22]:

\hat{\beta} = \arg\min_{\beta} \left( \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{t=1}^{n} x_{i,t}\beta_t \right)^2 + \lambda \sum_{t=1}^{n} \beta_t^2 \right),   (6)

where λ ≥ 0 is a parameter that controls the amount of shrinkage. A large value of λ leads to more shrinkage. We get different coefficient estimates for different values of λ, but the coefficient values are never set exactly to zero, and therefore ridge regression cannot perform predictor selection in the linear model.

An alternative way of regularization is the lasso (least absolute selection and shrinkage operator). This is a shrinkage method like ridge, with a subtle but important difference: the penalty term in the lasso is a sum of absolute values of the coefficients. The lasso estimate is defined as [22]:

\hat{\beta} = \arg\min_{\beta} \left( \frac{1}{2} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{t=1}^{n} x_{i,t}\beta_t \right)^2 + \lambda \sum_{t=1}^{n} |\beta_t| \right).   (7)

The nature of the penalty term in the lasso causes some coefficients to be shrunk exactly to zero; thus the lasso is able to perform predictor selection in the linear model. As λ increases, more and more coefficients are set to zero (see Fig. 6, right). In the experimental part of the work the value of λ in the ridge and lasso regressions was tuned in a leave-one-out procedure.
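In code, the leave-one-out tuning of λ can be expressed, for example, with scikit-learn (a sketch; the paper's experiments were implemented in Matlab, scikit-learn calls the shrinkage parameter alpha, and the grid below is our assumption):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV, LeaveOneOut

def fit_shrinkage(X, y, model="lasso"):
    """Tune the shrinkage parameter by leave-one-out cross-validation,
    as in the experimental part of the paper."""
    est = Lasso(max_iter=10000) if model == "lasso" else Ridge()
    grid = {"alpha": np.logspace(-4, 2, 25)}      # assumed search grid
    search = GridSearchCV(est, grid, cv=LeaveOneOut(),
                          scoring="neg_mean_absolute_error")
    return search.fit(X, y).best_estimator_
```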
3.4. Principal components regression

Principal component regression (PCR) [20,22] produces new predictors (principal components) which are linear combinations of the original ones and are linearly uncorrelated. The first principal component has the largest sample variance. Subsequent principal components have the highest variances possible under the constraint that they are orthogonal to the preceding components. The principal components are used in place of the original predictors in the regression model:

y = β0 + β1z1 + · · · + βczc + ε,   (8)

where zj is the jth principal component and c ≤ n is the number of components included in the model.

There is no need to use all principal components in the model, but only the first few (c), because usually they explain most of the variability in the response variable. So the components with the lowest variance can be discarded.

3.5. Partial least-squares regression

Partial least-squares regression (PLSR) has some relationship to PCR. It also constructs new predictors by linear combinations of the original ones, but unlike PCR it also uses the response variable to do so. The new predictors, called latent variables, are the best orthogonal linear combinations of xt for predicting y (they explain the response best). PLSR searches for orthogonal directions to project the x-points on that have the highest variance and the highest correlation with the response. The number of predictors used in the final model is a parameter of PLSR, as in PCR. Both methods are useful when there are more predictors than observations and when there is multicollinearity among the predictors. The algorithms of partial least squares and the connections between PLSR and PCR can be found in [23].
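As an illustration of Sections 3.4 and 3.5, both one-component models can be assembled in a few lines, for example with scikit-learn (a sketch under our assumptions; again, the paper's own experiments used Matlab):

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# PCR: project the collinear x-patterns onto the first principal
# component (z1 in Eq. (8)), then regress y on that single score.
pcr = make_pipeline(PCA(n_components=1), LinearRegression())

# PLSR: one latent variable, built using both the predictors and the
# response, as described in Section 3.5.
plsr = PLSRegression(n_components=1)

# Both are fitted locally on the k = 12 nearest neighbors of x*:
# pcr.fit(X_neighbors, y_neighbors);  y_hat = pcr.predict(x_query[None, :])
# plsr.fit(X_neighbors, y_neighbors); y_hat = plsr.predict(x_query[None, :])
```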
4. Simulation examples

In the first example the proposed linear models are examined in the tasks of load forecasting of the Polish power system for the next day (τ = 1). The hourly load time series is from the period 2002–2004 (see Fig. 4; these data can be downloaded from the website http://gdudek.el.pcz.pl/varia/stlf-data). The test samples are from January 2004 (without the atypical 1 January; atypical days such as public holidays are not handled by the proposed methods) and July 2004, i.e. we forecast the loads in the successive days of January and July. Models are constructed using the 12 nearest neighbors of the query point x* from the history, i.e. they are selected from the period from 1 January 2002 until the day before the day of the forecast. The Euclidean distance is used to select the nearest x-patterns. For each hour of the day of forecast a separate model is built, so for our test samples (30 + 31)·24 = 1464 models were constructed. Because m > k (m = 25, k = 12), before using model (4) ten predictors were selected using stepwise regression (with the selection procedure only, without the elimination procedure). The algorithm starts with an empty set of predictors and at each step adds the predictor with the smallest p-value, until the number of predictors reaches 10. A similar approach to reducing the initial number of predictors was used in the ridge and lasso regressions.

Fig. 4. The hourly electricity demand in Poland in three-year (a) and one-week (b) intervals.

The frequencies of the predictors selected in the stepwise and lasso regressions and the frequencies of the numbers of selected predictors are shown in Fig. 5. Among all 24 predictors included in the x-patterns, the most often selected predictor represents the load at hour 24. It means that this predictor, which is the nearest in time to the forecasted variable among all predictors, carries much information about this variable. Fig. 3 suggests that a good candidate for a predictor could be the one with lag 168, which has a high value of the autocorrelation function. But in the proposed approach the x-patterns were defined on the basis of the last daily period from the history.

Fig. 5. The frequencies of predictors (left) and the frequencies of the predictor numbers (right) in stepwise and lasso regressions.

The final models of stepwise regression were most often based on only one predictor, and in many cases they included only an intercept. In the case of the lasso, in over 30% of cases the final model included only an intercept. In such a case for some query pattern x* we get the forecast as ŷ = β̂0. It means that the y-values of the nearest neighbors of x* are similar to each other, i.e. the approximating hyperplane is parallel to all the x-axes of the coordinate system. Remember that this linear model is valid only for this query point. For another query point we determine another set of neighbors and the hyperplane changes.

In Fig. 6 the ridge and lasso traces (simultaneous graphs of the regression coefficients plotted against the parameter λ) for one of the forecasting tasks are shown. The lasso selects three predictors in this case: x7, x10 and x18, at the optimal value of λ = 0.0092.

Fig. 6. The ridge (left) and lasso (right) traces for the forecasting task of July 1, 2004, hour 12.

The small number of predictors selected in the stepwise and lasso regressions confirms the assumption that the same information about the response variable is repeated in many predictors. Thus there is no sense in generating many orthogonal components in the PCR and PLSR models. Preliminary tests performed on the different load time series showed that although the training error decreases with the number of principal components, the test error, in general, increases. So the number of components was limited to only one, but it is worth remembering that this new component compresses information extracted from all the original predictors. The local regressions using PCR and PLSR for one of the forecasting tasks are shown in Fig. 7. In this case the MAPE (mean absolute percentage error) for PCR was 1.15 and for PLSR it was 1.49.

Fig. 7. The local regressions using PCR (left) and PLSR (right) for the forecasting task of July 1, 2004, hour 12 (o – query point).

In Table 1 the forecast errors (MAPE = 100·mean(|(forecast − actual value)/actual value|)) for the test samples are presented. MAPE is traditionally used as an error measure in STLF. As a measure of error dispersion, interquartile ranges (IQR) were used. As we can see from this table, the lowest errors were achieved by PLSR and PCR. It is noteworthy that for MLR the errors are about twice as large as for the other models. The density functions of the percentage errors (PE = 100·(forecast − actual value)/actual value) are presented in Fig. 8. These functions are very similar for PLSR and PCR, as well as for the stepwise and lasso regressions.

Table 1. Forecast errors and their interquartile ranges in the first example.

Linear model | MAPEtst (Jan) | IQRtst (Jan) | MAPEtst (Jul) | IQRtst (Jul) | MAPEtst (avg) | IQRtst (avg)
MLR | 2.37 | 2.44 | 2.63 | 2.42 | 2.50 | 2.45
Stepwise | 1.52 | 1.44 | 1.14 | 1.20 | 1.33 | 1.28
Ridge | 1.59 | 1.50 | 1.23 | 1.23 | 1.41 | 1.29
Lasso | 1.51 | 1.39 | 1.06 | 1.02 | 1.28 | 1.18
PCR | 1.36 | 1.21 | 0.94 | 0.99 | 1.15 | 1.09
PLSR | 1.18 | 1.29 | 1.00 | 1.03 | 1.09 | 1.14

Fig. 8. The probability density functions of percentage errors.
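The error measures used in this section are easy to restate in code (a small sketch; we assume the IQR is computed over the percentage errors PE):

```python
import numpy as np

def mape(forecast, actual):
    """MAPE = 100 * mean(|(forecast - actual) / actual|)."""
    return 100.0 * np.mean(np.abs((forecast - actual) / actual))

def pe_iqr(forecast, actual):
    """Interquartile range of PE = 100 * (forecast - actual) / actual."""
    pe = 100.0 * (forecast - actual) / actual
    q1, q3 = np.percentile(pe, [25, 75])
    return q3 - q1
```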
parameters: smoothing parameters or bandwidts h1 , h2 , . . ., hn are
In the second example the best linear models, PCR and PLSR, were examined in STLF problems on four time series:

• PL: time series of the hourly load of the Polish power system from the period 2002–2004 (this time series was used in the first example). The test sample includes data from 2004 with the exception of 13 atypical days (e.g. public holidays),
• FR: time series of the half-hourly load of the French power system from the period 2007–2009. The test sample includes data from 2009 except for 21 atypical days,
• GB: time series of the half-hourly load of the British power system from the period 2007–2009. The test sample includes data from 2009 except for 18 atypical days,
• VC: time series of the half-hourly load of the power system of Victoria, Australia, from the period 2006–2008. The test sample includes data from 2008 except for 12 atypical days.

Our models were compared with other popular STLF models: ARIMA, exponential smoothing (ES) and the multilayer perceptron (MLP), as well as with a nonparametric regression model, the Nadaraya–Watson estimator (N–WE).

In ARIMA and ES the time series were decomposed into n series, i.e. for each t a separate series was created. This eliminates the daily seasonality and simplifies the forecasting problem. The ARIMA and ES parameters were estimated for each forecasting task (forecast of the system load at time t of day i) using 12-week time series fragments immediately preceding the forecasted day. Atypical days in these fragments were replaced with the corresponding days from the previous weeks. Due to using short time series fragments for parameter estimation (much shorter than the annual period) and due to the time series decomposition into n series, we do not have to take the annual and daily seasonality into account in the models. In such a case the number of parameters is much smaller and they are easier to estimate compared to models with triple seasonality.

For each forecasting task the seasonal ARIMA(p, d, q) × (P, D, Q)v model was created (with the period of the seasonal pattern v = 7, i.e. a one-week period), as well as the ES state space model. ES models are classified into 30 types [24] depending on how the seasonal, trend and error components are taken into account (they can be expressed additively or multiplicatively, and the trend can be damped or not). To estimate the parameters of ARIMA and ES, the stepwise procedures for traversing the model spaces implemented in the forecast package for the R environment for statistical computing [25] were used. These automatic procedures return the optimal models with the lowest Akaike information criterion value.
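As a rough illustration of this per-t decomposition (one weekly-seasonal model per intraday time point), a SARIMAX fit might look as follows; this is not the paper's implementation, which relied on the R forecast package with automatic order selection, and the orders shown are placeholders:

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

def fit_series_t(loads_at_t):
    """loads_at_t: loads at a fixed time t on consecutive days
    (the 12-week fragment preceding the forecasted day)."""
    # Placeholder orders; the paper selects them automatically by AIC.
    model = SARIMAX(loads_at_t, order=(1, 1, 1),
                    seasonal_order=(1, 0, 1, 7))  # weekly seasonality, v = 7
    return model.fit(disp=False)

# one-day-ahead forecast of the load at time t:
# forecast = fit_series_t(history_at_t).forecast(steps=1)
```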

The MLP model is learned locally [26] using training patterns selected from the neighborhood of the query pattern. These are the same 12 patterns that are used to construct the proposed linear models. For each forecasting task a separate MLP is learned. To prevent overfitting, the MLP is trained using the Levenberg–Marquardt algorithm with Bayesian regularization [27]. Since the target function is approximated locally using a small number of learning points, a rather simple form of this function should be expected. This implies a small number of neurons. Based on the research reported in [26], a network composed of only one neuron with a bipolar sigmoid (hyperbolic tangent) activation function was chosen as the optimal architecture.
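A rough scikit-learn stand-in for this network is sketched below; scikit-learn provides neither Levenberg–Marquardt nor Bayesian regularization, so an assumed L2 penalty (alpha) plays the role of the overfitting guard here:

```python
from sklearn.neural_network import MLPRegressor

# One hidden tanh (bipolar sigmoid) neuron, trained on the same 12
# nearest neighbors as the linear models; alpha is an assumed L2 penalty.
mlp = MLPRegressor(hidden_layer_sizes=(1,), activation="tanh",
                   solver="lbfgs", alpha=1e-2, max_iter=5000)
# mlp.fit(X_neighbors, y_neighbors)
# y_hat = mlp.predict(x_query.reshape(1, -1))
```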

The pattern-based STLF model using the Nadaraya–Watson estimator was proposed in [28]. This is a representative of pattern-similarity-based forecasting models [29]. The model parameters, i.e. the smoothing parameters (bandwidths) h1, h2, ..., hn, are estimated in a grid search procedure using the same training sample as in the linear models and the MLP. In the grid search the starting point is determined using Scott's rule, and then the neighborhood of this point is searched in an iterative process (see [28] for details). To avoid overfitting, the model was optimized using leave-one-out cross-validation.
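The estimator itself is a kernel-weighted average of the training responses; a minimal sketch with a product Gaussian kernel and Scott's rule-of-thumb starting bandwidths (the kernel form is our assumption):

```python
import numpy as np

def scott_bandwidths(X):
    """Scott's rule of thumb: the starting point of the grid search."""
    N, n = X.shape
    return X.std(axis=0, ddof=1) * N ** (-1.0 / (n + 4))

def nadaraya_watson(X, y, x_query, h):
    """Kernel regression estimate of y at x_query with bandwidths h."""
    u = (X - x_query) / h
    w = np.exp(-0.5 * np.sum(u * u, axis=1))  # Gaussian product kernel
    return np.sum(w * y) / np.sum(w)          # weighted average of responses
```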

In Table 2 the errors for one-day-ahead load forecasting are presented. The errors generated by the naïve model, in which the forecasted daily curve is the same as that of seven days ago, are also shown in this table. The best results are marked with an asterisk and the second best results with a double asterisk (the best results were confirmed by the Wilcoxon rank sum test at the 5% significance level). As we can see from this table, PLSR takes first place among the tested models for the FR and GB data and second place for the PL and VC data. Note that N–WE generates the best results for three datasets, but the difference in errors between this model and PLSR is small, except for the FR data, where PLSR is better. The conventional forecasting models, ARIMA and ES, work significantly worse than the pattern-based models. To see how PCR and PLSR work on more recent data, they were tested on the time series of the hourly load of the Polish power system from the period 2012–2014 (the test sample includes data from 2014 with the exception of 14 atypical days). The results for PLSR did not differ from those for the PL data: MAPEtst = 1.34. For PCR the results were a little worse: MAPEtst = 1.44.

Table 2. Forecast errors and their interquartile ranges (MAPEtst/IQRtst) in the second example.

Model | PL | FR | GB | VC
PCR | 1.35/1.33** | 1.71/1.78 | 1.60/1.68** | 3.00/2.70
PLSR | 1.34/1.32** | 1.57/1.61* | 1.54/1.61* | 2.83/2.60**
ARIMA | 1.82/1.71 | 2.32/2.53 | 2.02/2.07 | 3.67/3.42
ES | 1.66/1.57 | 2.10/2.29 | 1.85/1.84 | 3.52/3.35
MLP | 1.44/1.41 | 1.64/1.70** | 1.65/1.70** | 2.92/2.69
N–WE | 1.30/1.30* | 1.66/1.67 | 1.55/1.63* | 2.82/2.56*
Naïve | 3.43/3.42 | 5.05/5.96 | 3.52/3.82 | 4.88/4.55

In Fig. 9 the errors for forecast horizons up to 7 days are compared. For the longer horizons the linear regression models generated good results compared to the reference models. For the VC data and horizons of more than two days, the conventional models ARIMA and ES outperformed the other models.

Fig. 9. Errors for different forecast horizons.

In Fig. 10 the time efficiency of the forecasting models is compared. The times presented in this figure include the time of model optimization and forecasting for all hours of the next day for the PL data (computing environment: Intel Core 2 Quad CPU Q9550 2.83 GHz, 4 GB RAM, Matlab R2012b). In the MLR, ridge and lasso regressions, where stepwise regression is used first to reduce the initial number of predictors, the algorithm spends most of its time in the stepwise phase (about 62 s). The most time-efficient models are PLSR and PCR. When using these models, the process of model building and forecasting for 24 h of the next day takes less than half a second.

Fig. 10. The total time of building the forecasting models for 24 h of the next day for PL data.

5. Conclusions

The major contribution of this work is to propose new simple univariate linear regression models based on patterns of daily cycles for STLF. Patterns allow the forecasting problem to be simplified by filtering out the trend and the annual and weekly cycles. The relationship between input and output patterns is approximated locally in the neighborhood of the query pattern using linear regression. Thus we resign from the global modeling of the target function over its entire range, creating a locally competent model for the region around the query point. Since the local complexity is lower than the global one, we can use a simple model. This model brings good results for the current query point, but we have to construct new models for other query points.

A similar approach based on patterns and local modeling was used earlier in other STLF models: MLP and N–WE. Although these models are nonlinear, the proposed linear models have a better extrapolation property. Because the linear models are not as flexible as neural networks and nonparametric regression models, there is no problem with overfitting, and the cumbersome, time-consuming procedures to prevent overfitting are unnecessary. In the application examples the STLF methods based on patterns and local modeling outperform the conventional models, ARIMA and exponential smoothing, especially for shorter horizons.

Using principal component regression or partial least-squares regression, the number of predictors can be reduced to only one, which allows us to visualize the regression function. In this case the models have only two parameters, simply estimated using the least-squares approach. This is a great advantage in comparison to the complex STLF models based on ARIMA, exponential smoothing, neural and neuro-fuzzy networks or SVM, where there are dozens or hundreds of parameters and their estimation requires advanced optimization methods.

Acknowledgments

I am very grateful to James W. Taylor of the Saïd Business School, University of Oxford, for supplying the British and French data, and to Shu Fan and Rob J. Hyndman of the Business and Economic Forecasting Unit, Monash University, for supplying the Australian data.

References

[1] R. Weron, Modeling and Forecasting Electricity Loads and Prices, Wiley, Chichester, 2006.
[2] J.W. Taylor, R.D. Snyder, Forecasting intraday data with multiple seasonal cycles using parsimonious seasonal exponential smoothing, Omega 40 (6) (2012) 748–757.
[3] J.W. Taylor, Short-term load forecasting with exponentially weighted methods, IEEE Trans. Power Syst. 27 (1) (2012) 458–464.
[4] J. Nowicka-Zagrajek, R. Weron, Modeling electricity loads in California: ARMA models with hyperbolic noise, Signal Process. 82 (2002) 1903–1915.
[5] C.-M. Lee, C.-N. Ko, Short-term load forecasting using lifting scheme and ARIMA models, Expert Syst. Appl. 38 (2011) 5902–5911.
[6] H.S. Hippert, J.W. Taylor, An evaluation of Bayesian techniques for controlling model complexity and selecting inputs in a neural network for short-term load forecasting, Neural Netw. 23 (2010) 386–395.
[7] N. Amjady, F. Keynia, Short-term load forecasting of power systems by combination of wavelet transform and neuro-evolutionary algorithm, Energy 34 (2009) 46–57.
[8] Y. Chen, P.B. Luh, C. Guan, Y. Zhao, L.D. Michel, M.A. Coolbeth, P.B. Friedland, S.J. Rourke, Short-term load forecasting: similar day-based wavelet neural networks, IEEE Trans. Power Syst. 25 (1) (2010) 322–330.
[9] H. Quan, D. Srinivasan, A. Khosravi, Short-term load and wind power forecasting using neural network-based prediction intervals, IEEE Trans. Neural Netw. Learn. Syst. 25 (2) (2014) 303–315.
[10] Z. Yun, Z. Quan, S. Caixin, L. Shaolan, L. Yuming, S. Yang, RBF neural network and ANFIS-based short-term load forecasting approach in real-time price environment, IEEE Trans. Power Syst. 23 (3) (2008) 853–858.
[11] M. Hanmandlu, B.K. Chauhan, Load forecasting using hybrid models, IEEE Trans. Power Syst. 26 (1) (2011) 20–29.
[12] H. Mao, X.-J. Zeng, G. Leng, Y.-J. Zhai, J.A. Keane, Short-term and midterm load forecasting using a bilevel optimization model, IEEE Trans. Power Syst. 24 (2) (2009) 1080–1090.
[13] D.K. Chaturvedi, A.P. Sinha, O.P. Malik, Short term load forecast using fuzzy logic and wavelet transform integrated generalized neural network, Int. J. Electr. Power Energy Syst. 67 (2015) 230–237.
[14] Q. Wu, Power load forecasts based on hybrid PSO with Gaussian and adaptive mutation and Wv-SVM, Expert Syst. Appl. 37 (2010) 194–201.
[15] E. Ceperic, V. Ceperic, A. Baric, A strategy for short-term load forecasting by support vector regression machines, IEEE Trans. Power Syst. 28 (4) (2013) 4356–4364.
[16] M. De Felice, Short-term load forecasting with neural network ensembles: a comparative study, IEEE Comput. Intell. Mag. 6 (3) (2011) 47–56.
[17] R. Zhang, Z.Y. Dong, Y. Xu, K. Meng, K.P. Wong, Short-term load forecasting of Australian National Electricity Market by an ensemble model of extreme learning machine, IET Gener. Transm. Distrib. 7 (4) (2013) 391–397.
[18] G. Dudek, Artificial immune system for forecasting time series with multiple seasonal cycles, Trans. Comput. Collect. Intell. XI, LNCS 8065 (2013) 176–197.
[19] G. Dudek, Pattern similarity-based methods for short-term load forecasting – Part 1: Principles, Appl. Soft Comput. 37 (2015) 277–287.
[20] S. Chatterjee, A.S. Hadi, Regression Analysis by Example, John Wiley & Sons, Hoboken, New Jersey, 2006.
[21] N. Draper, H. Smith, Applied Regression Analysis, Wiley, New York, 1981.
[22] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag, New York, 2009.
[23] R. Rosipal, N. Krämer, Overview and recent advances in partial least squares, in: Subspace, Latent Structure and Feature Selection, LNCS 3940, Springer, Berlin/Heidelberg, 2006, pp. 34–51.
[24] R.J. Hyndman, A.B. Koehler, J.K. Ord, R.D. Snyder, Forecasting with Exponential Smoothing: The State Space Approach, Springer Series in Statistics, Springer, Berlin, 2008.
[25] R.J. Hyndman, Y. Khandakar, Automatic time series forecasting: the forecast package for R, J. Stat. Softw. 27 (3) (2008) 1–22.
[26] G. Dudek, Forecasting time series with multiple seasonal cycles using neural networks with local learning, in: Artificial Intelligence and Soft Computing, ICAISC 2013, LNCS 7894, Springer, Berlin/Heidelberg, 2013, pp. 52–63.
[27] F.D. Foresee, M.T. Hagan, Gauss–Newton approximation to Bayesian regularization, in: Proc. Int. Joint Conference on Neural Networks, 1997, pp. 1930–1935.
[28] G. Dudek, Short-term load forecasting based on kernel conditional density estimation, Przegl. Elektrotech. 86 (8) (2010) 164–167.
[29] G. Dudek, Pattern-Similarity Machine Learning Models for Short-Term Load Forecasting, Academic Publishing House "Exit", Warsaw, 2012 (in Polish).
