Editor's note: Mike Fassino is president of EnVision Knowledge Products, Media, Pa. This is the final installment of a three-part series on neural networks. The first part, "Understanding back-propagation," appeared in the April issue of QMRR. Part two, "Unsupervised learning neural nets," appeared in the May issue.
The third and final installment in our series on neural networks in market research deals specifically with forecasting the future from knowledge of the past. The first article described some of the basics of neural networks, like back-propagation of errors to create a mapping from input to output and the use of hidden processing units to fit highly non-linear relationships between the inputs and outputs. The neural networks described in the first article, known as supervised learning networks, can be used in contexts of conjoint analysis, discrete choice, customer satisfaction studies and in any condition where one would use regression or discriminant analysis or the various forms of CHAID and logistic regression.
The second article addressed unsupervised learning neural networks that can be used in segmentation and perceptual mapping contexts and in any condition where one would use cluster analysis, factor analysis, multidimensional scaling or latent class analysis. Our focus in the third article is specifically on time series forecasting and methods that supplement ARIMA and regression models. In these models, the explicit goal is to predict the future based on properties of the past. The techniques described are appropriate for sales forecasting, inventory management, demand management, price-elasticity of demand modeling, econometric modeling, advertising impact and advertising allocation, industry structure analysis and any context in which one has sampled one or more variables at consistent intervals of time and wishes to predict the most likely value of the variable at future points in time.
In time series forecasting with neural networks, one typically has two or more networks yoked together and, as I will illustrate, a great deal of time is spent preprocessing the data.
To keep the discussion from becoming abstract, I will use a single example where the goal is to generate accurate sales forecasts four months into the future and to assess the relationship between sales and specific indicators of general economic health.
The data
In this case we have 72 periods of data with which to work. If the series is based on monthly data, it would represent six years of monthly data. If it were a quarterly series, it would represent 72 quarters, or 18 years. The time unit really doesn't matter that much - it could be 72 days of sales or, as is common in financial models of the stock market, it could be 72 minutes of trading volume for some particular stock, bond or futures contract. All that matters is that the series is sampled at equal intervals of time. For the sake of this discussion, we will say that the data is monthly, so we have six years of monthly data as shown in Figure 1.
This is a very difficult series to forecast because it is non-stationary. That is, the mean increases over time. It also has a seasonality that changes over time. Early in the series, peaks occur in July; later in the series, peaks occur in December. In short, the structure of the series changes over time, perhaps due to advertising, marketing or competitive activity.
The goal, as mentioned, is to predict the next four periods in the series. The traditional approach to this problem is to use a type of regression analysis called ARIMA modeling. ARIMA stands for "autoregressive integrated moving average"; the "integrated" part refers to differencing the series to remove trend before modeling. ARIMA models view future values of a series as arising from two components:
- past values of the series, the AR component;
- a moving average of previous errors of prediction (referred to as shocks), the MA component.
In the world of ARIMA models, the best prediction of a point in the future is built up from past values of the series and how far off our forecast of past values was. ARIMA modeling is still very much an art superimposed on a science. The analyst must determine how far back in the past the AR component extends. Are July 1997 sales influenced by June 1997 sales, by July 1996 sales, or by sales in January through March? Similarly, are recent errors more important than errors long ago? Is there seasonality, and is this seasonality constant over time?
Typically, one inspects a variety of graphs that show various forms of correlations to answer these questions. One then comes up with a pretty good guess about the lag length of the AR and MA terms, runs the ARIMA using these guesses, calculates prediction errors, adjusts the AR and MA terms and tries again . . . and again . . . and again.
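To make that identification cycle concrete, here is a minimal sketch using the statsmodels library (not mentioned in the article) that fits a handful of candidate ARIMA orders and keeps the one with the best fit statistic. The series name `sales`, the single difference, and the grid of orders are illustrative assumptions, not a prescription.

```python
# Minimal sketch: try several candidate ARIMA orders and keep the best fit.
# Assumes `sales` is a pandas Series of 72 equally spaced monthly observations.
import itertools
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def identify_arima(sales: pd.Series, max_p: int = 3, max_q: int = 3):
    best_order, best_aic, best_fit = None, float("inf"), None
    for p, q in itertools.product(range(max_p + 1), range(max_q + 1)):
        try:
            # d=1 differences the series once to deal with the rising mean
            fit = ARIMA(sales, order=(p, 1, q)).fit()
        except Exception:
            continue  # some orders simply fail to converge
        if fit.aic < best_aic:
            best_order, best_aic, best_fit = (p, 1, q), fit.aic, fit
    return best_order, best_fit

# order, model = identify_arima(sales)
# print(order, model.forecast(steps=4))  # a four-period-ahead forecast
```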
There are well-established conventions about what various parameters will look like when you have a good ARIMA model, and one keeps trying, tuning and testing until these conventions are reached. It takes a lot of practice, experience, time and patience to generate a really good ARIMA forecast. In recent years, software has evolved that uses expert systems to do a lot of the initial model identification, but even with the best of this software (and the best is PC-EXPERT from Scientific Computing Associates), taking the model the last few steps so that it makes forecasts with very small confidence intervals is, to say the very least, onerous.
If nothing else, the neural network approach to time series analysis is easier than ARIMA and, when done correctly, the forecast will be just as precise.
The basic strategy for harnessing neural networks to make time series forecasts is shown in Figure 2. I will describe and illustrate each of the steps with our 72 periods of sales data.
1. The time series
You will recall from the first article that all that is necessary to build a supervised learning neural net are values of the input (or independent) variables and values of the output (or dependent) variables. A neural network uses layers of processing units whose weights are iteratively adjusted until the network's prediction of the output variable(s), based on the input variable(s), is as close to the actual output variable(s) as possible. Once this state is reached, the network is said to be trained. With our time series data, we easily meet these conditions. We have input variables (generally previous values of the series) and output variables (the value of the series at points in time later than the inputs).
2. Fourier transform
A Fourier transform is a very specific data transformation that reveals short- and long-term trends. The basic equation of the Fourier transformation is shown in Exhibit 1. For our purposes, the important thing to notice about Exhibit 1 is that the transformed series, Wj, is made up of sines and cosines of the original series, the Xks. You will recall from high school trigonometry that the most outstanding feature of sines and cosines is that they are periodic - they repeat at very regular intervals, as shown in Figure 3 where we simply take the sine and cosine of the integers from 1 to 20. A Fourier transformation re-expresses the time series, no matter what it looks like originally, into a set of pure sine and cosine waves, thereby revealing any periodicity, or patterns that repeat at various intervals of time. Figure 4 shows a particular series of monthly data and Figure 5 shows a Fourier transformation of this data, revealing that the series contains three robust cycles: one that repeats every 12 months, another that repeats every six months, and a third that repeats every three months.
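Exhibit 1 itself is not reproduced here, but the transformation it describes is the standard discrete Fourier transform, W_j = sum over k of X_k[cos(2*pi*j*k/N) - i*sin(2*pi*j*k/N)]. The sketch below is an illustration rather than anything from the article: it uses numpy to compute that transform for a monthly series and report its strongest cycles; the series name is assumed.

```python
# Sketch: compute the discrete Fourier transform of a monthly series and
# report the periods (in months) of its strongest cycles.
import numpy as np

def dominant_cycles(x, top: int = 3):
    x = np.asarray(x, dtype=float)
    spectrum = np.fft.rfft(x - x.mean())       # drop the mean (zero-frequency term)
    freqs = np.fft.rfftfreq(len(x), d=1.0)     # frequencies in cycles per month
    power = np.abs(spectrum) ** 2
    strongest = np.argsort(power)[::-1][:top]
    return [(1.0 / freqs[i], power[i]) for i in strongest if freqs[i] > 0]

# For a series like the one in Figure 4, this should surface periods near
# 12, 6 and 3 months:
# print(dominant_cycles(sales.values))
```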
3. Splitting the signal
One of the important features of the Fourier transform is that once you've identified the various components of the series, like the slow 12-month and fast three-month cycles in Figure 4, you can do basic arithmetic on the series. For instance, you can split the original time series into two or three separate series, each with very specific attributes (like long-term trend), and model each of these series separately. The equal sign in the Fourier transformation means that once the series is split up into its basic components, adding all the components back together again results in the original series. Similarly, adding together forecasts based on each of the components results in a forecast for the entire series. This idea is fundamental to the neural network approach to forecasting:
- split the series into components with specific properties using the Fourier transform;
- train a separate neural network on each component;
- forecast future values of each component (since each component has a specific property, each forecast will be ruthlessly precise);
- add the forecasts together.
The equal sign in the Fourier transformation then assures one of having a forecast of the raw time series - and a much better forecast than would ever have been achieved by trying to forecast the raw signal rather than its constituents.
4. Splitting the series
Splitting the time series is done with filters. There are all kinds of filters, but the three most useful in market research are highpass, lowpass and bandpass filters. The idea is very simple. A highpass filter of a time series only lets through those components that repeat at high frequencies. Conversely, a lowpass filter only lets through those components of a series that repeat at low frequencies. The low frequency components are the long-term trend. Long-term trend occurs at low frequency because if you have a three-year trend in a database that spans six years, it will repeat with a frequency of two. If there is also a quarterly cycle in this data, it will repeat 24 times - a much higher frequency than the three-year trend. The highpass components are the quickly fluctuating, oftentimes random, noise in the data. A bandpass filter is a combination of a lowpass and a highpass filter, letting through only components that repeat between a low and a high cutoff point.
Once we know the spectral components of the series from the Fourier transformation, we can split our signal into a variety of pieces and train separate neural networks on each of the pieces. Some pieces, especially the lowpass component, are very easy for a neural network to learn. Other components will be more difficult. More importantly, though, external variables that we might wish to include in our model will selectively influence components of the signal. For instance, if we are forecasting demand for refrigerators, the number of new housing starts will probably affect the low frequency, long-term trend component, while price discounts, advertising and promotion, etc., will probably affect the higher frequency, more transient aspects of the data. Splitting the series into components where external variables can differentially impact each of the components results in models of much greater precision (i.e., much higher R2) and vastly more accurate forecasts.
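As an illustration of the splitting step (a sketch, not the author's exact filters), the FFT makes a lowpass/highpass split straightforward: zero out the fast components to get the lowpass series, zero out the slow ones to get the highpass series, and the two add back to the original. The cutoff frequency used here is an assumption.

```python
# Sketch: split a series into lowpass and highpass components with an FFT filter.
# `cutoff` is the boundary frequency in cycles per period; the value is illustrative.
import numpy as np

def split_series(x, cutoff: float = 1.0 / 12.0):
    x = np.asarray(x, dtype=float)
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0)
    low_spec = np.where(freqs <= cutoff, spectrum, 0.0)    # keep only slow components
    high_spec = np.where(freqs > cutoff, spectrum, 0.0)    # keep only fast components
    lowpass = np.fft.irfft(low_spec, n=len(x))
    highpass = np.fft.irfft(high_spec, n=len(x))
    return lowpass, highpass    # lowpass + highpass reconstructs the original series

# low, high = split_series(sales.values)   # roughly, Figures 6 and 7
```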
Returning to our example, Figure 6 shows the output of a lowpass filter. The long-term trend in the data is now quite clear:
- Sales have a clear 12-month seasonality that peaks in July and troughs in February.
- Sales have been trending upward from 1989 to 1994.
- In 1994, something occurred to shift the curve upward such that the basic seasonality continues, but now around a higher baseline.
Even without a neural network, it is pretty easy to predict the August point of Figure 6.
Figure 7 shows some higher frequency, more transient, aspects of the series. Casual observation will convince you that even this high-frequency series is far from random. In fact, a neural network will have very little difficulty learning this series.
Figure 8 shows the two series together. If you simply add the two lines together you will get Figure 1. Since these two series with very different properties are mixed together in Figure 1, it should be obvious that forecasting each of the two filtered series is easier than forecasting the combined series. Moreover, the equal sign in the Fourier transform gives us a way to put the two forecasts back together. In this example, we have split the signal into only two pieces. It is not unusual to split a signal into four or six pieces. Knowing how many filters and how the filters should be set comes from studying the Fourier transform. For the signal shown in Figures 4 and 5, splitting into three pieces - a very low frequency cycle, an intermediate six-month cycle, and a high frequency series - is required.
5. Exogenous variables
Exogenous variables are variables that arise outside of a model. That is, the model does not describe their cause. In our refrigerator example, housing starts are exogenous since the model has nothing to say about what causes increases and decreases in housing starts.
We've already covered this in our discussion of why the Fourier transform is used. Frequently in market research forecasting, we want to know how other variables affect the series - how does price affect demand, how does advertising affect sales, and so forth. The neural net offers a simple way to factor all these variables, especially survey-based variables, into the model. As described above with refrigerator sales, part of the great beauty of the neural network approach to time series forecasting is that it allows exogenous variables to differentially impact various components of the series. A variable like customer satisfaction could be highly correlated with one specific component of the series, yet when you look at its correlation with the entire series, the relationship may be completely lost.
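A small sketch of that last point, with entirely illustrative variable names: compare an exogenous variable's correlation with the raw series against its correlation with each filtered component.

```python
# Sketch: an exogenous variable may correlate strongly with one filtered
# component of the series even when its correlation with the raw series is weak.
import numpy as np

def component_correlations(exog, sales, low, high):
    corr = lambda a, b: np.corrcoef(a, b)[0, 1]
    return {
        "raw series": corr(exog, sales),
        "lowpass component": corr(exog, low),
        "highpass component": corr(exog, high),
    }

# print(component_correlations(satisfaction, sales.values, low, high))
```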
6. Training and test samples
At this point we're ready to actually build a neural net model from the data. We will have two distinct neural networks, one predicting long-term trend in sales and the other predicting the high frequency transients. As with any supervised learning network, we need two things in our data: values of the independent variables and values of the dependent variable(s). The network will learn a nonlinear mapping between the two. In a time series analysis, the data line is a little weird. To make what the network sees concrete, let's talk about the input for January 1995. The January 1995 data consists of three elements:
1. Since we want a four-month forecast, the January 1995 line would contain February, March, April and May 1995 sales. These are the dependent variables, the values the neural network will be trained to predict.
2. Sales for the 12 months ending with January 1995. We will base each four-month forecast on sales of the previous 12 months, so the value of sales in each of those 12 months is on the input line. This is part of the independent variable.
3. The value of the exogenous variables in January 1995. In this particular application, there are 16 exogenous variables as shown in Figure 9. This is part of the independent variable also. One could also add lagged values of the exogenous variables.
Each line of input contains 32 data values. We randomly split our data into two pools, a training pool and a testing pool. We will use the training pool to train the network, and we will use the testing pool to see how well it does with data it has never seen before.
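Here is a sketch of how those input lines might be assembled and split. The variable names and the random split fraction are assumptions; `exog` stands for an array with one row per month and 16 columns.

```python
# Sketch: build one input line per month -- the 12 months of sales ending with
# the current month plus that month's 16 exogenous values as inputs, and the
# next 4 months of sales as outputs -- then split the lines into training and
# testing pools at random.
import numpy as np

def build_lines(sales, exog, n_lags: int = 12, horizon: int = 4):
    X, y = [], []
    for t in range(n_lags - 1, len(sales) - horizon):
        lags = sales[t - n_lags + 1:t + 1]          # the 12 months ending with month t
        X.append(np.concatenate([lags, exog[t]]))   # 12 + 16 = 28 inputs
        y.append(sales[t + 1:t + 1 + horizon])      # the next 4 months (outputs)
    return np.array(X), np.array(y)                 # 28 + 4 = 32 values per line

def split_pools(X, y, test_frac: float = 0.25, seed: int = 0):
    idx = np.random.default_rng(seed).permutation(len(X))
    cut = int(len(X) * (1 - test_frac))
    return X[idx[:cut]], y[idx[:cut]], X[idx[cut:]], y[idx[cut:]]
```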
7. Recurrent neural network
We can now train our network using the technique of back-propagation described in my first article. We will have an input layer consisting of elements two and three above, and an output layer with element one above. We will also have some hidden processing elements. In this case, some of the hidden processing elements take on a very special form and rather than being connected to all of the input data, they are connected only to lagged sales values that appear on the input line. This special form makes these elements act like a memory in the network: the network is able to remember past values of sales and past predictions. That is the recurrent part of a recurrent neural network, and these special processing units are usually referred to as a context layer.
Alternatively, one can load past values of the series onto a single line of the input file, as we have done in our example, and use a technique known as time-delayed neural networks (TDNN). In either case, back-propagation works like it did before: the network adjusts weights from the input units to the hidden units, and from the hidden units to the output units, until its estimate of the output best matches the actual outputs. Notice, however, that the outputs in this case are the level of sales four months in the future. By this specific arrangement, the network is being trained to minimize the error between its four-month ahead forecast and the actual four-month ahead data. The weights that lead to the best match between inputs and outputs show the impact of the exogenous variables and previous values of sales on sales four months in the future.
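Here is a minimal sketch of the training step. The article's own software is not specified; scikit-learn's MLPRegressor is used below purely as a stand-in for a TDNN-style feed-forward network trained by back-propagation, and the hidden-layer size and iteration limit are guesses.

```python
# Sketch: train a feed-forward network by back-propagation on the input lines
# built above (28 inputs, 4 outputs = sales one to four months ahead).
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_forecaster(X_train, y_train):
    net = make_pipeline(
        StandardScaler(),                          # put all 28 inputs on one scale
        MLPRegressor(hidden_layer_sizes=(10,),     # one small hidden layer
                     max_iter=5000, random_state=0),
    )
    net.fit(X_train, y_train)
    return net

# One such network is trained on the lowpass lines, another on the highpass
# lines, and their four-month-ahead forecasts are summed.
```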
Networks for each of the outputs that resulted from splitting the signal are trained separately. When each is done training, adding their forecasts together gives a four-month ahead forecast. Since exogenous variables have been included, it is a simple matter to show how strongly they impact current and future sales, as shown in Figure 9. Figure 9 shows the relative impact of a variety of econometric measures on our sales forecast. The large positive weight on personal consumption expenditures means that if this measure increases in January, sales will increase in May. Similarly, if interest rates go up in January, sales will go down in May because of the large negative impact of prime rate. These impact scores are easily obtained from the neural network since they are nothing more than the weights with which the variables connect to the output layer. The ability to factor in exogenous variables in such an easy and powerful way is one of the most important features of neural net based forecasting.
I will leave the details to the reader, but the fundamentals of the approach I've just outlined can be used for survey-based pricing studies, in which case the weights show fully non-linear price elasticity and cross-elasticity of demand coefficients. By including other econometric variables, you could see how the larger economic environment impacts price sensitivity. If some of the exogenous variables concern advertising, the weights show advertising elasticities of demand. Neural networks have been extensively used to model all the elements of the marketing mix. Similarly, a discrete choice experiment can be set up where the hidden units' weights are equivalent to those obtained from multinomial logit. In this case, the input lines consist of choice probabilities and dummy-coded design information.
8. Error correction filter
Figure 10 shows a six-month forecast and actual sales volume for our series. The forecast was made in August 1996, before values for September 1996 through February 1997 were available. Each month's forecast is within $100,000 of actual, even in December, when sales surged to $120 million, their highest level ever. Figure 11 shows the output from what is known as an error correction filter. The name is really bad, since the filter doesn't correct any errors. All it does is calculate how likely the forecast is to be wrong. The output of the error correction filter can be turned into confidence intervals, as shown in Figure 11. The combination of all the neural nets yoked together is 90 percent confident that the true sales value for August 1997 will lie between $79 million and $96 million. As intuition would demand, the area within the confidence intervals increases the further into the future we look - we are more confident about our predictions for the immediate future than we are about predictions concerning the distant future.
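The article does not spell out how the error correction filter is built, so the sketch below is only a rough stand-in: it derives empirical confidence bands from the spread of errors on the testing pool, one band per forecast horizon, which in practice tend to widen the further ahead one forecasts.

```python
# Rough stand-in for the error correction filter (whose details are not given):
# empirical confidence bands from test-pool errors, one per forecast horizon.
import numpy as np

def confidence_bands(net, X_test, y_test, latest_line, coverage: float = 0.90):
    errors = y_test - net.predict(X_test)            # shape: (n_lines, 4 horizons)
    tail = (1 - coverage) / 2
    lower = np.quantile(errors, tail, axis=0)        # e.g., 5th percentile error
    upper = np.quantile(errors, 1 - tail, axis=0)    # e.g., 95th percentile error
    forecast = net.predict(latest_line.reshape(1, -1))[0]
    return forecast + lower, forecast, forecast + upper
```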
Valuable to researchers
In summary, the neural network approach to time series forecasting has several properties that should make it valuable to market researchers:
1. Compared to ARIMA and other econometric forecasting methods, it is very, very easy to implement. Software for Fourier transformations is available in all of the major statistical packages, such as SPSS and SAS.
2. The neural network approach makes integrating the effect of exogenous variables a snap. This step is very cumbersome in ARIMA models.
3. In ARIMA and regression models, you initially guess at the order of the AR and MA terms. The neural network approach essentially solves for the correct order of these terms through training.
4. A neural network model can learn from its past mistakes. When new data becomes available, ARIMA models must be developed from scratch.
On the downside:
1. I quickly glossed over developing confidence intervals for a forecast. The precise details of how the error correction filter is built and used require a strong working knowledge of calculus.
2. In the example, the neural network was trained to forecast four time periods into the future. Usually, one forecasts only one period into the future and then uses this forecasted value to build a forecast for the second period and so on. This is called bootstrapping and can be moderately cumbersome to implement.
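For what it is worth, here is a minimal sketch of that iterated, one-step-ahead scheme. The variable names are illustrative, and holding the exogenous values at their last known levels is an assumption the article does not make.

```python
# Sketch: forecast one month ahead with a single-output network, append the
# forecast to the history, and repeat -- the iterated scheme described above.
import numpy as np

def iterate_forecast(one_step_net, history, last_exog, steps: int = 4, n_lags: int = 12):
    history = list(history)
    forecasts = []
    for _ in range(steps):
        lags = np.array(history[-n_lags:])                     # most recent 12 values
        line = np.concatenate([lags, last_exog]).reshape(1, -1)
        next_value = float(one_step_net.predict(line)[0])      # one-step-ahead forecast
        forecasts.append(next_value)
        history.append(next_value)                             # feed it back in
    return forecasts
```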