
Energy Conversion and Management 49 (2008) 1156–1166

Bayesian neural network approach to short time load forecasting

Philippe Lauret a,*, Eric Fock a, Rija N. Randrianarivony b, Jean-François Manicom-Ramsamy a

a Laboratoire de Physique du Bâtiment et des Systèmes, Université de La Réunion, BP 7151, 15 Avenue René Cassin, 97715 Saint-Denis, France
b IME – Institut pour la maîtrise de l'énergie, University of Antananarivo – BP 566, Madagascar

* Corresponding author. Tel.: +262 93 81 27; fax: +262 93 86 65.

Received 23 January 2007; accepted 9 September 2007; available online 23 October 2007
doi:10.1016/j.enconman.2007.09.009

Abstract

Short term load forecasting (STLF) is an essential tool for efficient power system planning and operation. We propose in this paper the use of Bayesian techniques in order to design an optimal neural network based model for electric load forecasting. The Bayesian approach to modelling offers significant advantages over classical neural network (NN) learning methods. Among others, one can cite the automatic tuning of regularization coefficients, the selection of the most important input variables, the derivation of an uncertainty interval on the model output and the possibility to perform a comparison of different models and, therefore, select the optimal model. The proposed approach is applied to real load data.

© 2007 Elsevier Ltd. All rights reserved.

Keywords: Load modelling; Short term load forecasting; Neural networks; Bayesian inference; Model selection

1. Introduction

Short term load forecasting (STLF) is essential for planning the day to day operation of an electric power system [1]. Accurate forecasts of the system load on an hour by hour basis, from one day to a week ahead, help the system operator to accomplish a variety of tasks such as economic scheduling of generating capacity, scheduling of fuel purchases, etc. In particular, forecasting the peak demand is important, as the generation capacity of an electric utility must meet this requirement. Since such forecasting leads to more secure operating conditions and economic cost savings, numerous techniques have been used to improve STLF [2]. Among these techniques, the use of neural networks (NNs) is particularly predominant in the load forecasting field [2,3]. Indeed, the availability of historical load data in utility databases, and the fact that NNs are data driven approaches capable of performing a non-linear mapping between sets of input and output variables, make this modelling tool very attractive.

However, as stated by Refs. [2,4], NNs are such flexible models that the task of designing a NN for a particular application is far from easy. This statement stems from the fact that NNs are able to approximate any continuous function to arbitrary accuracy, provided the number of hidden neurons is sufficient [5]. However, this ability has a downside: such a close approximation can become an approximation to the noise. As a consequence, the model yields solutions that generalize poorly when new data are presented. In the NN community, this problem is called overfitting and may come about because the NN model is too complex (i.e. it possesses too many parameters). In fact, it is necessary to match the complexity of the NN to the problem being solved. The complexity determines the generalization capability (measured by the generalization or test error) of the model, since a NN that is either too simple or too complex will give poor predictions. (As an illustration, consider a polynomial model whose complexity is controlled by the number of coefficients: a too low-order polynomial will be unable to capture the underlying trends in the data, whilst a too high-order polynomial will model the noise on the data.)

There are mainly two approaches to controlling NN complexity, namely architecture selection and regularization techniques. Architecture selection controls the complexity by varying the number of NN parameters (called weights and biases). One of the simplest ways involves the use of networks with a single hidden layer, in which the number of free parameters is controlled by adjusting the number of hidden units. Other approaches consist in growing or pruning the network structure during the training process. The approach taken by the pruning methods is to start with a relatively large network and gradually remove either connections or complete hidden units [6–8]. The technique of regularization encourages smoother network mappings by favouring small values for the NN parameters. Indeed, it has been shown [6] that small values for the weights decrease the tendency of the model to overfit. One of the simplest forms of regularizer is called weight decay, and a regularization coefficient (also called weight decay term) allows controlling the degree of regularization.

However, each of these techniques requires tuning of a control parameter (i.e. regularization coefficient, number of hidden units, pruning parameter) in order to maximize the generalization performance of the NN. Classically, the setting of this control parameter is done by using so called cross-validation (CV) techniques. Indeed, CV provides an estimate of the generalization error and, therefore, offers a way to select the best architecture or the optimal regularization coefficient. Unfortunately, CV presents several disadvantages. First, the CV technique needs a separate data set (hence fewer data for the training set), named the validation set, in order to evaluate the variation of the generalization error (called here the validation error) as the number of hidden neurons or the value of the regularization coefficient is changed. The optimal number of hidden nodes or the optimal value of the regularization coefficient corresponds to the minimum validation error. Secondly, because intrinsic noise exists in real datasets and because of the limited amount of data, one has to repeat the CV experiment multiple times using different divisions of the data into training and validation sets. This leads to well known CV techniques like k-fold cross-validation or the leave-one-out method [6]. Consequently, CV may become computationally demanding and tedious, and, regarding for instance the regularization technique, only a small range of weight decay coefficients is usually tested.

Another critical issue is the determination of the relevant input variables. Indeed, too many input variables, some of which are irrelevant to estimation of the output, could hamper the model. This is particularly true in the case of limited data sets, where random correlations between the output and the input variables can be modelled. Again, unfortunately, classical NN methods are unable to treat this non-trivial task in a satisfactory way. Most researchers in the realm of STLF have emphasized the need to design the NN model properly but deplored the lack of a consistent method that allows deriving the optimal NN model [2,4,9].

As a consequence, Hippert [2] stated that some researchers are sceptical and believe that no systematic evidence exists that NNs outperform standard forecasting methods (such as those based on time series ...). Furthermore, Alves da Silva [10] stated that one should not produce a forecast of any kind without an idea of its reliability, and that point predictions are meaningless when the time series is noisy.

In this paper, we argue that in order to obtain a good model for electric load forecasting, emphasis has to be put on the design of the NN. In other words, conventional NN learning methods must be improved. For this purpose, we propose a probabilistic interpretation of NN learning by using Bayesian techniques. MacKay [11] originally developed Bayesian methods for NNs. The Bayesian approach to modelling offers significant advantages over the classical NN learning process. Among others, one can cite automatic tuning of the regularization coefficient using all the available data, and selection of the most important input variables through a specific technique called automatic relevance determination (ARD). In addition, the reliability of the forecast is taken into account, as the method computes an error bar on the model output. As we emphasize NN modelling, it is important (in our view) to search for the optimal NN structure (i.e. the optimal number of hidden nodes). Again, we will see that the Bayesian method offers a means to select the optimal NN model (by performing a model comparison). In this study, the Bayesian approach to neural learning is applied to real load data. The data were provided by EDF, the French electricity utility.

2. Model description and context of study

In order to assess the feasibility of the proposed approach, we designed a NN whose goal is to forecast the next day's load at the same hour. This model constitutes an hourly module that determines the non-linear relationship between each hour's load and the past load and weather readings for the same hour. This hourly module is part of a global forecaster (that yields the complete load profile for the next day) obtained by combining the 24 hourly modules. A bibliographic survey helped us to retain, a priori, the input variables given in Table 1. The set of inputs typically contains exogenous variables (weather related variables), indicators such as the week end or holiday flag, and past values of the load. Obviously, there is a strong correlation between the load demand and the weather related variables. According to Ref. [2], load series are known to be non-linear functions of the exogenous variables. For instance, a U shaped scatter plot between load demand and temperature has been reported by Ref. [9]. The data were collected from a micro-region in the south of Reunion Island (21.06 S, 55.36 E) in 2001. The database contains up to 1074 hourly records.


Table 1
Inputs of the NN model

Input   Symbol     Description
x1      wf_{d-1}   Week end or holiday flag for day d-1
x2      h          Hour
x3      Wd_{d-1}   Yesterday's actual wind direction at this hour
x4      Ws_{d-1}   Yesterday's actual wind speed at this hour
x5      T_{d-1}    Yesterday's actual temperature at this hour
x6      G_{d-1}    Yesterday's actual global solar irradiance at this hour
x7      L_{d-1}    Yesterday's actual load at this hour
x8      wf_d       Week end or holiday flag for day d
x9      Wd_d       Actual wind direction at this hour
x10     Ws_d       Actual wind speed at this hour
x11     T_d        Actual temperature at this hour
x12     G_d        Actual global solar irradiance at this hour
x13     L_d        Actual load at this hour
x14     T_{d+1}    Temperature forecast for the next day at this hour
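To make the structure of Table 1 concrete, the sketch below assembles the 14-dimensional input vector for one hourly module. It is only an illustration: the record fields, the numerical values and the HourlyRecord/build_input_vector names are assumptions for the example, not part of the paper.

```python
from dataclasses import dataclass

@dataclass
class HourlyRecord:
    """One hourly observation; field names are illustrative, not from the paper."""
    weekend_flag: float   # wf_d, 1.0 for week end/holiday, else 0.0
    hour: float           # h, 0..23
    wind_dir: float       # Wd_d (degrees)
    wind_speed: float     # Ws_d (m/s)
    temperature: float    # T_d (deg C)
    irradiance: float     # G_d (W/m2)
    load: float           # L_d (kW)

def build_input_vector(yesterday: HourlyRecord,
                       today: HourlyRecord,
                       temp_forecast_next_day: float) -> list[float]:
    """Assemble the 14 inputs x1..x14 of Table 1 for one hourly module."""
    return [
        yesterday.weekend_flag,   # x1:  wf_{d-1}
        today.hour,               # x2:  h
        yesterday.wind_dir,       # x3:  Wd_{d-1}
        yesterday.wind_speed,     # x4:  Ws_{d-1}
        yesterday.temperature,    # x5:  T_{d-1}
        yesterday.irradiance,     # x6:  G_{d-1}
        yesterday.load,           # x7:  L_{d-1}
        today.weekend_flag,       # x8:  wf_d
        today.wind_dir,           # x9:  Wd_d
        today.wind_speed,         # x10: Ws_d
        today.temperature,        # x11: T_d
        today.irradiance,         # x12: G_d
        today.load,               # x13: L_d
        temp_forecast_next_day,   # x14: T_{d+1}
    ]

# Illustrative usage with arbitrary values
yesterday = HourlyRecord(0.0, 14.0, 120.0, 4.2, 24.5, 610.0, 930.0)
today = HourlyRecord(0.0, 14.0, 135.0, 3.8, 25.1, 655.0, 955.0)
x = build_input_vector(yesterday, today, temp_forecast_next_day=26.0)
print(len(x))  # 14
```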

3. Neural network approach to STLF

We chose a NN to model the electric load. Indeed, in the case of NN based models, and as opposed, for instance, to polynomial regression techniques, no explicit knowledge of the functional form between the inputs and the output is needed. The most popular form of NN is the so called multilayer perceptron (MLP) structure. The MLP structure consists of an input layer, one or several hidden layers and an output layer. The input layer gathers the model's input vector x, while the output layer yields the model's output vector y. In our case, the input vector x is given by the hourly values of the variables x1 to x14 given in Table 1, and the output vector y consists of a single output y, which is the corresponding forecast of the next day's load at the same hour. Fig. 1 represents a one hidden layer MLP.

Fig. 1. Sketch of a MLP with d inputs and h hidden units; in our case, d = 14 (see Table 1). The output y is the next day's load at the same hour.

The hidden layer is characterized by several non-linear units (or neurons). The non-linear function (also called activation function) is usually the hyperbolic tangent function f(x) = tanh(x) = (e^x - e^{-x})/(e^x + e^{-x}). Therefore, a NN with d inputs, h hidden neurons and a single linear output unit defines a non-linear parameterized mapping from an input x to an output y given by the following relationship:

y = y(\mathbf{x}; \mathbf{w}) = \sum_{j=0}^{h} w_j \, f\!\left( \sum_{i=0}^{d} w_{ji} x_i \right)    (1)
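To make Eq. (1) concrete, here is a minimal NumPy sketch of the forward pass of such a one-hidden-layer MLP with tanh hidden units and a linear output. It is an illustration only; the function name, the weight layout (biases stored in column/entry 0) and the toy dimensions are assumptions, not the authors' implementation.

```python
import numpy as np

def mlp_forward(x, W_hidden, w_out):
    """Minimal forward pass for the one-hidden-layer MLP of Eq. (1).

    x        : (d,) input vector (x1..x14 of Table 1)
    W_hidden : (h, d+1) hidden-layer weights; column 0 holds the biases (i = 0)
    w_out    : (h+1,) output weights; entry 0 is the output bias (j = 0)
    Returns the scalar load forecast y.
    """
    x_aug = np.concatenate(([1.0], x))     # prepend x0 = 1 for the hidden biases
    z = np.tanh(W_hidden @ x_aug)          # hidden activations f(sum_i w_ji * x_i)
    z_aug = np.concatenate(([1.0], z))     # prepend z0 = 1 for the output bias
    return float(w_out @ z_aug)            # single linear output unit

# Toy usage with random weights (d = 14 inputs, h = 5 hidden units)
rng = np.random.default_rng(0)
d, h = 14, 5
x = rng.normal(size=d)
W_hidden = rng.normal(scale=0.1, size=(h, d + 1))
w_out = rng.normal(scale=0.1, size=h + 1)
print(mlp_forward(x, W_hidden, w_out))
```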

The parameters of the NN model are given by the so called weights and biases that connect the layers between them (notice that in Eq. (1), the biases are denoted by the subscripts i = 0 and j = 0 and are not represented in Fig. 1). The NN parameters, denoted by the parameter vector w, govern the non-linear mapping.

The NN parameters w are estimated during a phase called the training or learning phase. During this phase, the NN is trained using a dataset (called the training set) of N input and output examples, pairs of the form D = \{x_i, t_i\}_{i=1}^{N}. The vector x contains samples of each of the 14 input variables described in Table 1. The variable t, also called the target variable, is the corresponding measurement of the electric load. The training phase consists in adjusting w so as to minimize an error function E_D, which is usually the sum of squares error between the experimental or actual output t_i and the network output y_i = y(x_i; w):

E_D(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} \{ y_i - t_i \}^2 = \frac{1}{2} \sum_{i=1}^{N} e_i^2    (2)

The second phase, called the generalization phase, consists of evaluating the ability of the NN to generalize, that is to say, to give correct outputs when it is confronted with examples that were not seen during the training phase. Notice that these examples are part of a data set called the test set. In the NN community, the performance measure (also called the generalization error) is usually given by the mean squared error, MSE = \frac{1}{N} \sum_{i=1}^{N} e_i^2, or the root mean squared error, RMSE = \sqrt{MSE}. In the electricity supply industry, the mean absolute percent error, MAPE = \frac{1}{N} \sum_{i=1}^{N} \frac{|e_i|}{t_i} \times 100, is usually reported; a short computational sketch of these measures is given below.

A good generalization (i.e. good predictions for new inputs) is obtained through control of the NN complexity by using techniques such as pruning or regularization [6,8,11]. Nonetheless, it must be noted that, contrary to architecture selection methods like pruning, regularization does not explicitly delete the irrelevant weights (or connections, see Fig. 1) but drives to zero (or nearly zero) the parameters that do not participate in the relationship. In order to better introduce the Bayesian optimization of regularization coefficients, we describe below the principles of the regularization technique.
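As a brief illustration of the error measures just defined, the following sketch computes E_D of Eq. (2) together with MSE, RMSE and MAPE for a small set of targets and predictions. It is a generic sketch, not code from the paper; the function and variable names are arbitrary.

```python
import numpy as np

def error_measures(t, y):
    """Sum-of-squares training error (Eq. (2)) and the usual generalization metrics."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    e = y - t                                # residuals e_i = y_i - t_i
    E_D = 0.5 * np.sum(e ** 2)               # sum of squares error, Eq. (2)
    mse = np.mean(e ** 2)                    # mean squared error
    rmse = np.sqrt(mse)                      # root mean squared error
    mape = 100.0 * np.mean(np.abs(e) / t)    # mean absolute percent error (t_i != 0 assumed)
    return E_D, mse, rmse, mape

# Toy usage: hourly loads in kW
targets = [850.0, 910.0, 1020.0, 980.0]
predictions = [830.0, 925.0, 1000.0, 1005.0]
print(error_measures(targets, predictions))
```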


As mentioned above, the technique of regularization encourages smoother network mappings by adding a penalty term (also known as a weight decay term) to the preceding objective function (see Eq. (2)), resulting in the new objective function:

S(\mathbf{w}) = E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})    (3)

The additional term E_W(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{m} w_i^2 (where m is the total number of parameters) penalizes large values for the weights, which are known to be responsible for excessive curvature in the model [6]. The parameter \lambda, called the regularization coefficient, controls the degree of regularization. If \lambda is too large, then the model cannot fit the data well. Conversely, if \lambda is too small, then overfitting occurs. Thus, there exists an optimal value of \lambda that gives the best trade off between overfitting and underfitting. Rigorously, this trade off is called the bias variance trade off; the interested reader should refer to Refs. [6,11]. Classically, the optimal value of the weight decay coefficient \lambda must be found through cross-validation techniques and, given the drawbacks of cross-validation discussed above, only a limited, pre-specified range of regularization coefficients is usually explored in a discrete manner. In order to overcome these limits, we propose a probabilistic interpretation of NN learning that allows automatically controlling the NN complexity.
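As a small illustration of Eq. (3), the sketch below evaluates the penalized objective S(w) for a flattened weight vector. It is a generic sketch under stated assumptions; the function name and the chosen regularization value are arbitrary, not taken from the paper.

```python
import numpy as np

def penalized_objective(w, t, y, reg_coef):
    """Weight-decay objective S(w) = E_D(w) + lambda * E_W(w) of Eq. (3).

    w        : flattened vector of all m network parameters
    t, y     : targets and corresponding network outputs for the training set
    reg_coef : regularization (weight decay) coefficient lambda
    """
    e = np.asarray(y, dtype=float) - np.asarray(t, dtype=float)
    E_D = 0.5 * np.sum(e ** 2)                     # data term, Eq. (2)
    E_W = 0.5 * np.sum(np.asarray(w, dtype=float) ** 2)   # weight decay term
    return E_D + reg_coef * E_W

# Toy usage
w = np.array([0.3, -1.2, 0.05, 0.8])
t = np.array([850.0, 910.0])
y = np.array([830.0, 925.0])
print(penalized_objective(w, t, y, reg_coef=0.01))
```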

4. Bayesian neural network approach to STLF

4.1. Principle of Bayesian NN learning: a probabilistic approach to NN learning

In this paper, the principles of Bayesian reasoning [12–15] are outlined and applied to the estimation of the NN parameters. The remainder of this section summarizes the Bayesian approach as stated by Ref. [11]. It will be shown that the overfitting problem can be solved by using a Bayesian approach to control the model complexity. The Bayesian approach considers a probability density function (pdf) over the weight space. This pdf represents the degrees of belief attached to the different values of the weight vector. This pdf is set initially to some prior distribution and converted into a posterior distribution once the data have been observed, through the use of Bayes' theorem [14]. So, instead of the single 'best' set of weights computed by the classical approach of maximum likelihood (through minimization of an error function), Bayesian methods yield a complete distribution for the NN parameters. This posterior distribution can then be used, for instance, to infer predictions of the network for new values of the input variables.

4.1.1. The prior

As we have a priori little idea of what the weight values should be, the prior is chosen as a rather broad distribution. This can be done by expressing the prior pdf as a Gaussian distribution with a large variance:

p(\mathbf{w} \mid \alpha) = \frac{1}{Z_W(\alpha)} e^{-\alpha E_W}    (4)

where \alpha represents the inverse of the variance on the set of weights and biases and Z_W(\alpha) represents the normalization constant of the pdf. In the Bayesian framework, \alpha is called a hyper-parameter, as it controls the distribution of other parameters. The choice of Gaussian distributions simplifies the analysis and allows proceeding analytically. Further, the choice of a Gaussian prior for the weights leads to a probabilistic interpretation of the preceding regularizer E_W (see Eq. (3)). Indeed, the regularizer can be interpreted as minus the logarithm of the prior probability distribution over the parameters. It is important to note that this distribution of weights is defined for a given value of \alpha. Hence, for the moment, we shall assume that its value is known; in a later paragraph, we shall relax this assumption. As we chose a Gaussian prior, the normalization factor Z_W(\alpha) is given by

Z_W(\alpha) = \int e^{-\alpha E_W} \, d\mathbf{w} = \left( \frac{2\pi}{\alpha} \right)^{m/2}    (5)

We recall that m is the total number of NN parameters.
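As a small numerical illustration of Eqs. (4) and (5), the sketch below evaluates the log of this Gaussian prior for a given weight vector. It is a generic sketch; the function name and the chosen value of alpha are assumptions for the example.

```python
import numpy as np

def log_prior(w, alpha):
    """log p(w | alpha) for the Gaussian prior of Eqs. (4)-(5).

    E_W(w) = 0.5 * sum(w_i^2) and Z_W(alpha) = (2*pi/alpha)**(m/2),
    so log p(w | alpha) = -alpha * E_W(w) - (m/2) * log(2*pi/alpha).
    """
    w = np.asarray(w, dtype=float)
    m = w.size                                    # total number of parameters
    E_W = 0.5 * np.sum(w ** 2)
    log_Z_W = 0.5 * m * np.log(2.0 * np.pi / alpha)
    return -alpha * E_W - log_Z_W

# Toy usage: a broad prior (small alpha means large variance 1/alpha)
w = np.array([0.3, -1.2, 0.05, 0.8])
print(log_prior(w, alpha=0.01))
```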

4.1.2. The likelihood function or the noise model

The derivation of the likelihood function is linked to the definition of the noise model. Given a training dataset D of N examples of the form D = \{x_i, t_i\}_{i=1}^{N}, the goal of NN learning is to find a relationship R between x_i and t_i. Since there are uncertainties in this relation, as well as noise or phenomena that are not taken into account, this relation becomes

t_i = R(x_i) + e_i    (6)

where the noise e_i is an expression of the various uncertainties. In our case, we want to approximate R(x) by y(x; w), i.e. a non-linear regression model given by a MLP. Hence, in the following, we assume that the ith target variable t_i (or measurement) is given by some deterministic function of the input vector x with added independent Gaussian noise. Further, if we assume that the errors have a normal distribution with zero mean and variance \sigma^2 = 1/\beta, the distribution of the noise is given by

p(e_i \mid \beta) = \sqrt{\frac{\beta}{2\pi}} \exp\!\left( -\frac{\beta}{2} e_i^2 \right)    (7)

Assuming independent noise, the joint probability of a set of N noise values can be written as

p(\mathbf{e} \mid \beta) = p(e_1, \ldots, e_N \mid \beta) = \prod_{i=1}^{N} p(e_i \mid \beta) = \left( \frac{\beta}{2\pi} \right)^{N/2} \exp\!\left( -\frac{\beta}{2} \sum_{i=1}^{N} e_i^2 \right)    (8)

One can then take the difference between the data and the model, e_i = t_i - y_i, and substitute this into Eq. (8) to obtain the likelihood function:

p(D \mid \mathbf{w}, \beta) = \left( \frac{\beta}{2\pi} \right)^{N/2} \exp\!\left( -\beta E_D(\mathbf{w}) \right)

where E_D(w) is the sum of squares error of Eq. (2).
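To connect Eqs. (7) and (8) with the error function of Eq. (2), here is a short sketch of the Gaussian log-likelihood log p(D | w, beta) for a set of targets and network outputs. It is an illustrative sketch only; the function and variable names are assumptions, not code from the paper.

```python
import numpy as np

def log_likelihood(t, y, beta):
    """log p(D | w, beta) for the Gaussian noise model of Eqs. (7)-(8).

    t    : (N,) measured loads (targets)
    y    : (N,) network outputs y(x_i; w)
    beta : noise precision, beta = 1 / sigma^2
    Returns (N/2) * log(beta / (2*pi)) - beta * E_D(w), with E_D from Eq. (2).
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    N = t.size
    E_D = 0.5 * np.sum((y - t) ** 2)
    return 0.5 * N * np.log(beta / (2.0 * np.pi)) - beta * E_D

# Toy usage
targets = np.array([850.0, 910.0, 1020.0])
predictions = np.array([830.0, 925.0, 1000.0])
print(log_likelihood(targets, predictions, beta=1e-3))
```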
...

