Article info

Article history:
Received 2 March 2015
Received in revised form 2 September 2015
Accepted 4 September 2015

Keywords:
Linear regression
Partial least-squares regression
Patterns of seasonal cycles
Short-term load forecasting
Time series

Abstract

In this paper univariate models for short-term load forecasting based on linear regression and patterns of daily cycles of load time series are proposed. The patterns used as input and output variables simplify the forecasting problem by filtering out the trend and seasonal variations of periods longer than the daily one. The nonstationarity in mean and variance is also eliminated. The simplified relationship between variables (patterns) is modeled locally in the neighborhood of the current input using linear regression. The load forecast is constructed from the forecasted output pattern and the current values of variables describing the load time series. The proposed stepwise and lasso regressions reduce the number of predictors to a few. In the principal components regression and partial least-squares regression only one predictor is used. This allows us to visualize the data and regression function. The performance of the proposed methods was compared with that of other models based on ARIMA, exponential smoothing, neural networks and Nadaraya–Watson estimator. Application examples confirm valuable properties of the proposed approaches and their high accuracy.

© 2015 Elsevier B.V. All rights reserved.
decomposition method using lifting scheme (the second generation wavelet transform) was described in [5].

The most popular computational intelligence methods applied in STLF are neural networks. They have many attractive features such as: universal approximation property, learning capabilities, massive parallelism, robustness in the presence of noise, and fault tolerance. The drawbacks of neural networks include: disruptive and unstable training, difficulty in matching the network structure to the problem complexity, weak capacity of extrapolation and many parameters to estimate (hundreds of weights). Some examples of using neural networks in STLF are: [6], where the complexity of the network is controlled by the Bayesian approach, [7], where a new hybrid forecasting method composed of wavelet transform, multilayer perceptron and evolutionary algorithm is proposed, [8], where a generic framework combining similar day selection, wavelet decomposition, and multilayer perceptron is presented, and [9], where the neural network generates the prediction intervals.

Another branch of computational intelligence, fuzzy logic, allows us to enter information by facts and rules formulated verbally by experts and describing the behavior of complex systems by using linguistic expressions. With the help of fuzzy rules the imprecise, incomplete and ambiguous information can be introduced into the STLF models. When it is difficult to gain knowledge directly from the experts, the neuro-fuzzy approach, which learns from examples, is applied to generate a set of if-then rules. But the neuro-fuzzy system structure is complex and the number of parameters is usually large (it depends on the problem dimensionality and complexity), so the learning is difficult and does not guarantee convergence to the global minimum. Examples of STLF models based on fuzzy logic are: [10], where the neuro-fuzzy system is used to adjust the results of load forecasting obtained by radial basis function neural network, [11], where two neuro-fuzzy networks are proposed: a wavelet fuzzy neural network using the fuzzified wavelet features as the inputs, and fuzzy neural network employing the Choquet integral as the outputs, [12], where an integrated approach which combines a self-organizing fuzzy neural network learning method with a bilevel optimization method is described, and [13], where the forecasting model combines fuzzy logic, wavelet transform and neural network. Other useful computational intelligence tools for STLF are: support vector machines (SVM) [14,15], ensembles of models [16,17] and artificial immune systems [18] (descriptions of more STLF models can be found on the website).

It is noteworthy that many of the STLF models developed in recent years are hybrid solutions. They combine data preprocessing methods (e.g. wavelet transform) with approximation methods (such as neural and neuro-fuzzy networks or SVM) and optimization or learning methods (e.g. evolutionary and swarm algorithms).

The disadvantages of the above mentioned complex forecasting models with many parameters are: hard and time-consuming training, problems with generalization, unclear structure and uninterpretable parameters. Most often time series with multiple seasonal cycles and trend, expressing nonstationarity in mean and variance, cannot be modeled directly and additional treatments such as detrending, deseasonalizing or decomposition are needed.

In contrast to the complex models commonly used in STLF, in this work simple methods of linear regression are proposed. The number of parameters here is small and they can be estimated using a simple least squares approach. The key element of the proposed methods is data preprocessing: defining patterns of seasonal cycles. This simplifies the STLF problem by eliminating nonstationarity and filtering out the trend and seasonal cycles longer than the daily one.

The paper is organized in a theoretical and an empirical part. In the beginning the patterns of daily cycles of load time series are defined. Then the main concepts of the linear regression models for STLF are introduced. In the last section the real load data are used to provide examples of model building and forecasting in practice. The results of the proposed methods are compared to results of other STLF methods: ARIMA, ES, multilayer perceptron and Nadaraya–Watson estimator.

2. Patterns of the time series seasonal cycles

Data preprocessing based on patterns simplifies the forecasting of time series with multiple seasonal cycles. In our case the patterns of the daily cycles are introduced: the input patterns $\mathbf{x}$ and output ones $\mathbf{y}$. The input pattern is a vector $\mathbf{x} = [x_1\ x_2 \ldots x_n]^T \in X = \mathbb{R}^n$, representing the vector of loads in successive timepoints of the daily period: $\mathbf{L} = [L_1\ L_2 \ldots L_n]^T$, where n = 24 for hourly load time series, n = 48 for half-hourly load time series and n = 96 for quarter-hourly load time series. The functions mapping the time series elements $L$ into patterns are dependent on the time series (trend, seasonal variations), the forecast period and horizon. They should maximize the model quality. In this study the input pattern $\mathbf{x}_i$, representing the ith daily period, is defined as follows:

$$x_{i,t} = \frac{L_{i,t} - \bar{L}_i}{\sqrt{\sum_{l=1}^{n} \left(L_{i,l} - \bar{L}_i\right)^2}}, \tag{1}$$

where i = 1, 2, ..., N is the daily period number, N is the number of days in the time series, t = 1, 2, ..., n is the time series element number in the period i, $L_{i,t}$ is the tth time series element (load) in the period i, and $\bar{L}_i$ is the mean load value in the period i.

According to definition (1), first we subtract the mean of the vector $\mathbf{L}_i$ from its components and then we divide the resulting vector by its length. As a result we get the normalized vectors $\mathbf{x}_i$ with length 1, zero mean and the same variance. Note that the time series which is nonstationary in mean and variance is represented now by x-patterns having the same mean and variance. The trend and additional seasonal variations (weekly and annual ones in our case) are filtered. The x-patterns contain information only about the shapes of daily curves.

Whilst the x-patterns represent input variables (predictors), i.e. the loads for the day i, the y-patterns represent the output variables, i.e. the forecasted loads for the day i + τ, where τ is a forecast horizon in days. The components of the n-dimensional output pattern $\mathbf{y}_i = [y_{i,1}\ y_{i,2} \ldots y_{i,n}]^T \in Y = \mathbb{R}^n$, representing the load vector $\mathbf{L}_{i+\tau}$, are defined as follows:

$$y_{i,t} = \frac{L_{i+\tau,t} - \bar{L}_i}{\sqrt{\sum_{l=1}^{n} \left(L_{i,l} - \bar{L}_i\right)^2}}, \tag{2}$$

where i = 1, 2, ..., N, t = 1, 2, ..., n.

This equation is similar to (1), but in this case we do not use the mean load of the day i + τ, $\bar{L}_{i+\tau}$, in the numerator and $\sqrt{\sum_{l=1}^{n} (L_{i+\tau,l} - \bar{L}_{i+\tau})^2}$ in the denominator, because these values are not known at the moment of forecasting. We use the known values of $\bar{L}_i$ and $\sqrt{\sum_{l=1}^{n} (L_{i,l} - \bar{L}_i)^2}$ instead. This is very important because when the forecast of pattern $\mathbf{y}_i$ is generated by the model we can determine the forecast of vector $\mathbf{L}_{i+\tau}$ using transformed Eq. (2):

$$\hat{L}_{i+\tau,t} = \hat{y}_{i,t} \sqrt{\sum_{l=1}^{n} \left(L_{i,l} - \bar{L}_i\right)^2} + \bar{L}_i, \tag{3}$$

where $\hat{y}_{i,t}$ is the forecasted tth component of the pattern $\mathbf{y}_i$.

Note that $\bar{L}_i$ and the value of the square root in (3) are known at the time of forecasting and can be used for decoding of $\hat{y}_{i,t}$ to get $\hat{L}_{i+\tau,t}$. Note also that $\sqrt{\sum_{l=1}^{n} (L_{i,l} - \bar{L}_i)^2}$ is the carrier of the dispersion of the current daily cycle. Using this square root in the denominator of
[Figure: load curves for winter and summer (horizontal axis: hours, 0 to 168; vertical axis scaled by 10^4).]

and annual cycles. The information about the position of the daily period in the weekly and annual cycles, which is contained in $\bar{L}_i$, is introduced into the forecast $\hat{L}_{i+\tau,t}$ in (3) by adding $\bar{L}_i$. The information about the current dispersion of the time series is introduced by multiplying $\hat{y}_{i,t}$ by $\sqrt{\sum_{l=1}^{n} (L_{i,l} - \bar{L}_i)^2}$. So when we forecast the load time series using the pattern-based approach, first we filter out the information about the position of the days i and i + τ in the weekly and annual cycles ((1) and (2)). Then we build the model on patterns and we generate the forecast of the y-pattern. Finally we introduce the information about the position of the forecasted day in the weekly and annual cycles using (3).
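The encoding and decoding steps of Eqs. (1)–(3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `encode_x`, `encode_y` and `decode_y` are hypothetical and do not come from the paper, and a daily cycle is assumed to be a plain list of load values.

```python
import math


def encode_x(load_day):
    """Map the loads of day i to an x-pattern as in Eq. (1).

    Returns the pattern together with the mean and the dispersion
    (square root of the sum of squared deviations) of day i, because
    both are reused for the y-pattern and for decoding.
    """
    mean = sum(load_day) / len(load_day)
    disp = math.sqrt(sum((v - mean) ** 2 for v in load_day))
    if disp == 0.0:
        raise ValueError("a flat daily cycle cannot be normalized")
    return [(v - mean) / disp for v in load_day], mean, disp


def encode_y(load_target_day, mean, disp):
    """Map the loads of day i + tau to a y-pattern as in Eq. (2).

    The mean and dispersion are those of day i, not of day i + tau.
    """
    return [(v - mean) / disp for v in load_target_day]


def decode_y(y_pattern, mean, disp):
    """Turn a (forecasted) y-pattern back into loads as in Eq. (3)."""
    return [y * disp + mean for y in y_pattern]
```

Because `encode_y` and `decode_y` use the same mean and dispersion of day i, decoding an exactly forecasted y-pattern returns the original loads of day i + τ.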
More functions defining patterns can be found in [19]. A fragment of time series represented by x-patterns does not have to coincide with the daily period (e.g. an x-pattern can represent loads at hours 13 to 24 of the day i − 1 and hours 1 to 12 of the day i). It can include several adjacent daily cycles (e.g. two days preceding the day of forecast) or a part of one cycle (e.g. t = 1, 2, ..., 12 h of the day i). It
[Figure: scheme of the pattern-based forecasting procedure. Recoverable labels: original load time series, x mapping, y mapping, similar pattern, decoding.]
on the training set Φ. The response of the model to the query pattern x* is the forecasted value of the tth component of the y-pattern, $\hat{y}_t$ (step 4). In step 5, to get the forecasted load at hour t for the next day, we decode $\hat{y}_t$ according to (3). We use for this current, known
predictors in a systematic way. The linear model in this case is of the form:

$$y = \beta_0 + w_1 \beta_1 x_1 + \cdots + w_n \beta_n x_n + \varepsilon, \tag{5}$$

where $w_1, \ldots, w_n$ are binary weights.

The goal is to find the binary weight vector $\mathbf{w} = [w_1\ w_2 \ldots w_n]^T$ and the coefficient values $\beta_j$ for predictors with nonzero weights. The criterion of adding or removing the predictor is based on the p-value for an F test of the change in sum of squared error. When we analyze the introduction of the tth predictor to the model, the null hypothesis is that this predictor would have a zero weight if added to the model. If the null hypothesis is rejected, the predictor is added to the model ($w_t = 1$). When we analyze removing the tth predictor from the model, the null hypothesis is that this predictor has a zero weight. If there is insufficient evidence to reject the null hypothesis, the predictor is removed from the model ($w_t = 0$).

The selection procedure starts with an empty set of predictors in the model. At each step it adds the predictor with the smallest p-value until there are no predictors having a p-value less than an entrance tolerance (0.05 was used). Then the elimination procedure is activated, which removes from the model the predictor with the largest p-value, if it is greater than an exit tolerance (0.1). The selection and elimination procedures are repeated alternately until there is no predictor for removal.

Some other criteria can be used to add and remove predictors, like [20]: R-squared, Mallows $C_p$, Akaike information criterion or Bayes information criterion. The solutions generated by the stepwise regression are suboptimal and there is no guarantee that a different initial model or different sequence of steps will not lead to a better model.

3.3. Regularized least-squares regressions

In the regularized least-squares regression the minimized criterion is composed of the usual regression criterion and a penalty term dependent on the coefficient values. Thus the coefficients are shrunk toward zero. This can greatly reduce the variance, resulting in a better mean-squared error. The ridge regression estimates coefficients by minimizing the criterion containing the sum of squared coefficients as a penalty term [22]:

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \left( \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{t=1}^{n} x_{i,t} \beta_t \Big)^2 + \lambda \sum_{t=1}^{n} \beta_t^2 \right), \tag{6}$$

where λ ≥ 0 is a parameter that controls the amount of shrinkage. A larger value of λ leads to more shrinkage. We get different coefficient estimates for different values of λ. But the coefficient values are never set to zero exactly, and therefore ridge regression cannot perform predictor selection in the linear model.

The alternative way of regularization is lasso (least absolute shrinkage and selection operator). This is a shrinkage method like ridge, with a subtle but important difference: the penalty term in lasso is a sum of absolute values of coefficients. The lasso estimate is defined as [22]:

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \left( \frac{1}{2} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{t=1}^{n} x_{i,t} \beta_t \Big)^2 + \lambda \sum_{t=1}^{n} |\beta_t| \right). \tag{7}$$

The nature of the penalty term in lasso causes some coefficients to be shrunk to zero exactly, thus the lasso is able to perform predictor selection in the linear model. As λ increases, more and more coefficients are set to zero (see Fig. 6, right). In the experimental part of the work the value of λ in the ridge and lasso regressions was tuned in the leave-one-out procedure.

3.4. Principal components regression

Principal component regression (PCR) [20,22] produces new predictors (principal components) which are linear combinations of the original ones and are linearly uncorrelated. The first principal component has the largest sample variance. Subsequent principal components have the highest variances possible under the constraint that they are orthogonal to the preceding components. The principal components are used in place of the original predictors in the regression model:

$$y = \beta_0 + \beta_1 z_1 + \cdots + \beta_c z_c + \varepsilon, \tag{8}$$

where $z_j$ is the jth principal component and c ≤ n is the number of components included in the model.

There is no need to use all principal components in the model but only the first few ones (c), because usually they explain most of the variability in the response variable. So the components with the lowest variance can be discarded.

3.5. Partial least-squares regression

Partial least-squares regression (PLSR) has some relationship to PCR. It also constructs new predictors by linear combination of original ones, but unlike PCR it also uses the response variable to do so. The new predictors, called latent variables, are the best orthogonal linear combinations of $x_t$ for predicting y (they explain the response best). PLSR searches for such orthogonal directions to project x-points that have the highest variance and highest correlation with the response. The number of predictors used in the final model is a parameter of PLSR, like in PCR. Both methods are useful when there are more predictors than observations and when there is multicollinearity among predictors. The algorithms of partial least-squares and connections between PLSR and PCR can be found in

4. Simulation examples

In the first example the proposed linear models are examined in the tasks of load forecasting of the Polish power system for the next day (τ = 1). The hourly load time series is from the period 2002–2004 (see Fig. 4; these data can be downloaded from the website). The test samples are from January 2004 (without atypical 1 January; atypical days such as public holidays are not handled by the proposed methods) and July 2004, i.e. we forecast loads in the successive days of January and July. Models are constructed using 12 nearest neighbors of the query point x* from the history, i.e. they are selected from the period from 1 January 2004 until the day before the day of the forecast. The Euclidean distance is used to select the nearest x-patterns. For each hour of the day of forecast a separate model is built. So for our test samples (30 + 31)·24 = 1464 models were constructed. Because m > k (m = 25, k = 12), before using model (4) ten predictors were selected using stepwise regression (only with the selection procedure and without the elimination procedure). The algorithm starts with an empty set of predictors and adds at each step the predictor with the smallest p-value until the number of predictors reaches 10. A similar approach for reducing the initial number of predictors was used in the ridge and lasso regressions.

The frequencies of predictors selected in the stepwise and lasso regressions and the predictor number frequencies are shown in Fig. 5. Among all 24 predictors included in x-patterns the most often selected predictor represents the load at hour 24. It means that this predictor, which is the nearest in time to the forecasted variable among all predictors, carries much information about this variable. Fig. 3 suggests that a good candidate for a predictor could be the one with lag 168, which has a high value of the autocorrelation function.
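The local modeling described in this section (a separate model for each query pattern, built on its 12 nearest x-patterns in the Euclidean sense) can be illustrated for PCR with a single component, c = 1 in Eq. (8). The sketch below is a minimal pure-Python version under stated assumptions: the function name `local_pcr_forecast` is hypothetical, the first principal direction is found by power iteration (a substitute chosen here for brevity, not the procedure used in the paper), and `y` holds one component of the y-patterns, i.e. one forecasted hour.

```python
import math
import random


def local_pcr_forecast(X, y, x_query, k=12, n_iter=200):
    """Forecast one y-pattern component with one-component PCR fitted
    on the k training x-patterns nearest to x_query."""
    # 1. Select the k nearest neighbours of the query pattern.
    order = sorted(range(len(X)), key=lambda i: math.dist(X[i], x_query))[:k]
    Xk = [X[i] for i in order]
    yk = [y[i] for i in order]
    n = len(x_query)

    # 2. Centre predictors and response on the neighbourhood.
    x_mean = [sum(row[j] for row in Xk) / len(Xk) for j in range(n)]
    y_mean = sum(yk) / len(yk)
    Xc = [[row[j] - x_mean[j] for j in range(n)] for row in Xk]

    # 3. First principal direction v by power iteration on Xc^T Xc.
    #    A seeded random start is used because x-patterns have zero mean,
    #    so an all-ones start would lie in the null space.
    rng = random.Random(0)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(n_iter):
        scores = [sum(r[j] * v[j] for j in range(n)) for r in Xc]
        w = [sum(Xc[i][j] * scores[i] for i in range(len(Xc))) for j in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            return y_mean  # neighbours are identical: no direction to fit
        v = [c / norm for c in w]

    # 4. Simple least squares of y on the single predictor z1.
    z = [sum(r[j] * v[j] for j in range(n)) for r in Xc]
    szz = sum(s * s for s in z)
    if szz == 0.0:
        return y_mean
    b1 = sum(s * (t - y_mean) for s, t in zip(z, yk)) / szz

    # 5. Project the query on v and evaluate the two-parameter model.
    z_query = sum((x_query[j] - x_mean[j]) * v[j] for j in range(n))
    return y_mean + b1 * z_query
```

The returned value is a forecast of a y-pattern component and still has to be decoded with Eq. (3) to obtain a load. Since z1 is centered on the neighborhood, the intercept of the fitted line equals the neighborhood mean of y, so the model has the two parameters mentioned in the text.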
Table 1
Forecast errors and their interquartile ranges in the first example.
Columns: Linear model; January (MAPEtst, IQRtst); July (MAPEtst, IQRtst); Average (MAPEtst, IQRtst).

Fig. 5. The frequencies of predictors (left; horizontal axis: hour) and the frequencies of the predictor numbers (right; horizontal axis: number of predictors) in stepwise and lasso regressions.

Fig. 6. The ridge (left) and lasso (right) traces for the forecasting task of July 1, 2004, hour 12 (horizontal axis: λ).

Fig. 7. The local regressions using PCR (left) and PLSR (right) for the forecasting task of July 1, 2004, hour 12 (o: query point; horizontal axis: z1).
linear models. For each forecasting task a separate MLP is trained. To prevent overfitting, the MLP is trained using the Levenberg–Marquardt algorithm with Bayesian regularization [27]. Since the target function is approximated locally using a small number of learning
[Figure panels: MAPEtst versus forecast horizon τ (1 to 7 days).]
level). As we can see from this table PLSR takes the first place among tested models for FR and GB data and second place for PL and VC data. Note that N–WE generates the best results for three datasets, but the difference in errors between this model and PLSR is small except for FR data, where PLSR is better. The conventional forecasting models, ARIMA and ES, work significantly worse than pattern-based models. To see how PCR and PLSR work on more recent data they were tested on the time series of hourly load of the Polish power system from the period of 2012–2014 (the test sample includes data from 2014 with the exception of 14 atypical days). The results for PLSR did not differ from those for PL data: MAPEtst = 1.34. For PCR the results were a little worse: MAPEtst = 1.44.

In Fig. 9, the errors for forecast horizons up to 7 days are compared. For longer horizons the linear regression models generated good results compared to the reference models. For VC data and horizons of more than two days the conventional models ARIMA and ES outperformed other models.

In Fig. 10, the time efficiency of the forecasting models is compared. The times presented in this figure include the time of the model optimization and forecasting for all hours of the next day for PL data (computing environment: Intel Core 2 Quad CPU Q9550 2.83 GHz, 4 GB RAM, Matlab R2012b). In the MLR, ridge and lasso regressions, where stepwise regression is used first to reduce the initial number of predictors, the algorithm spends most of the time in the stepwise phase (about 62 s). The most time efficient models are PLSR and PCR. When using these models the process of model building and forecasting for 24 h of the next day takes less than half a second.

Fig. 10. The total time of building the forecasting models for 24 h of the next day for PL data (models compared: PLSR, PCR, MLP, N-WE, Stepwise, ARIMA, MLR, Ridge, Lasso; horizontal axis: time in seconds).

5. Conclusions

The major contribution of this work is to propose new simple univariate linear regression models based on patterns of daily cycles for STLF. Patterns allow the forecasting problem to be simplified by filtering out the trend, annual and weekly cycles. The relationship between input and output patterns is approximated locally in the neighborhood of the query pattern using linear regression. Thus we give up global modeling of the target function over its entire range and create a locally competent model for the region around the query point. Since the local complexity is lower
than the global one, we can use a simple model. This model brings good results for the current query point, but we have to construct new models for other query points.

A similar approach based on patterns and local modeling was used earlier in other STLF models: MLP and N–WE. Although these models are nonlinear, the proposed linear models have a better extrapolation property. Because the linear models are not as flexible as neural networks and nonparametric regression models, there is no problem with overfitting. The cumbersome and time-consuming procedures to prevent overfitting are unnecessary. In the application examples the STLF methods based on patterns and local modeling outperform the conventional models, ARIMA and exponential smoothing, especially for shorter horizons.

Using principal component regression or partial least-squares regression the number of predictors can be reduced to only one, which allows us to visualize the regression function. In this case the models have only two parameters, simply estimated using the least-squares approach. This is a great advantage in comparison to the complex STLF models based on ARIMA, exponential smoothing, neural and neuro-fuzzy networks or SVM, where there are dozens or hundreds of parameters and their estimation requires advanced optimization methods.

Acknowledgments

I am very grateful to James W. Taylor with the Saïd Business School, University of Oxford for supplying British and French data, and Shu Fan and Rob J. Hyndman with the Business and Economic Forecasting Unit, Monash University for supplying Australian data.

References

[8] Y. Chen, P.B. Luh, C. Guan, Y. Zhao, L.D. Michel, M.A. Coolbeth, P.B. Friedland, S.J. Rourke, Short-term load forecasting: similar day-based wavelet neural networks, IEEE Trans. Power Syst. 25 (1) (2010) 322–330.
[9] Hao Quan, D. Srinivasan, A. Khosravi, Short-term load and wind power forecasting using neural network-based prediction intervals, IEEE Trans. Neural Netw. Learn. Syst. 25 (2) (2014) 303–315.
[10] Z. Yun, Z. Quan, S. Caixin, L. Shaolan, L. Yuming, S. Yang, RBF neural network and ANFIS-based short-term load forecasting approach in real-time price environment, IEEE Trans. Power Syst. 23 (3) (2008) 853–858.
[11] M. Hanmandlu, B.K. Chauhan, Load forecasting using hybrid models, IEEE Trans. Power Syst. 26 (1) (2011) 20–29.
[12] H. Mao, X.-J. Zeng, G. Leng, Y.-J. Zhai, J.A. Keane, Short-term and midterm load forecasting using a bilevel optimization model, IEEE Trans. Power Syst. 24 (2) (2009) 1080–1090.
[13] D.K. Chaturvedi, A.P. Sinha, O.P. Malik, Short term load forecast using fuzzy logic and wavelet transform integrated generalized neural network, Int. J. Electr. Power Energy Syst. 67 (2015) 230–237.
[14] Q. Wu, Power load forecasts based on hybrid PSO with Gaussian and adaptive mutation and Wv-SVM, Expert Syst. Appl. 37 (2010) 194–201.
[15] E. Ceperic, V. Ceperic, A. Baric, A strategy for short-term load forecasting by support vector regression machines, IEEE Trans. Power Syst. 28 (4) (2013) 4356–4364.
[16] M. De Felice, Short-term load forecasting with neural network ensembles: a comparative study, IEEE Comput. Intell. Mag. 6 (3) (2011) 47–56.
[17] Rui Zhang, Zhao Yang Dong, Yan Xu, Ke Meng, Kit Po Wong, Short-term load forecasting of Australian National Electricity Market by an ensemble model of extreme learning machine, IET Gener. Transm. Distrib. 7 (4) (2013) 391–397.
[18] G. Dudek, Artificial immune system for forecasting time series with multiple seasonal cycles, Trans. Comput. Collec. Intell. XI LNCS 8065 (2013) 176–197.
[19] G. Dudek, Pattern similarity-based methods for short-term load forecasting - Part 1: Principles, Appl. Soft Comput. 37 (2015) 277–287, ISSN 1568-4946.
[20] S. Chatterjee, A.S. Hadi, Regression Analysis by Example, John Wiley & Sons, Inc., Hoboken, New Jersey, 2006.
[21] N. Draper, H. Smith, Applied Regression Analysis, Wiley, New York, 1981.
[22] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning. Data Mining, Inference, and Prediction, Springer-Verlag, New York, 2009.
[23] R. Rosipal, N. Kramer, Overview and recent advances in partial least squares, in: Subspace, Latent Structure and Feature Selection, LNCS 3940, Springer, Berlin Heidelberg, 2006, pp. 34–51.
