Editor’s note: Kent Leahy is vice president of TMI Associates, Inc., a North Caldwell, N.J., research firm.
Predictive segmentation modeling techniques vary in the size of the sample needed to achieve a given level of reliability. This is because some methods have a greater propensity to capitalize on chance, or “overfit,” the sample data on which the model is derived. A model that overfits the data fits not only the systematic variation inherent in the population but also the sampling error present in that particular sample. The result is a model that can fare less well, sometimes dramatically so, when applied to a new sample or a different set of data.
The goal of model building is to obtain a predictive model that generalizes across many such samples to the universe at large, and not merely to the sample at hand. We want our model to be as effective as possible when we use it on future data, and not just on the sample on which it was based.
To accomplish this goal, we need to develop our model so that it captures innate systematic processes rather than discrete, temporal events or features peculiar to a given sample. We thus need to restrain our model from fitting the sample too well. Prediction methods such as neural networks, classification trees, rough sets, and those based upon fractal geometry or chaos theory all have a tendency toward sample-dependence or overfitting. Standard statistical methods such as classical regression analysis, logistic regression (logit), and discriminant analysis, on the other hand, are less susceptible to the overfitting problem and thus tend to be more reliable.
The reason is that these latter, parametric methods explicitly specify, on an a priori basis, the assumed functional form of the relationship between the dependent variable and the respective predictors. In statistical terminology, this has the effect of decreasing the variances of the estimated model coefficients or weights, reducing the likelihood that the coefficient values will vary substantially from sample to sample or from sample to population. Such techniques thus tend to generate parameter values (or effects) that are stable across samples, or otherwise reliable.
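To see what that stability looks like, here is a minimal sketch in Python using the NumPy library (the library and the simulated data are illustrative assumptions, not part of the article). The same pre-specified straight-line model is fit to many independent samples drawn from one hypothetical population, and the spread of the estimated slope across those samples turns out to be small.

import numpy as np

rng = np.random.default_rng(0)

def draw_sample(n=500):
    """Draw one sample from a hypothetical population with a known linear signal."""
    x = rng.uniform(0, 1, n)
    y = 2.0 + 3.0 * x + rng.normal(0, 1, n)   # true relationship plus random noise
    return x, y

# Fit the same pre-specified linear model to many independent samples
slopes = []
for _ in range(200):
    x, y = draw_sample()
    slope, intercept = np.polyfit(x, y, deg=1)   # ordinary least squares, straight line
    slopes.append(slope)

# A tight spread around the true slope (3.0) illustrates stable, reliable coefficients
print(f"mean slope = {np.mean(slopes):.2f}, std. dev. across samples = {np.std(slopes):.2f}")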
This is not to imply that classical regression and logit are necessarily better at predicting future outcomes than nonparametric methods such as neural networks. In fact, a glaring disadvantage of the parametric techniques is their inflexibility when faced with data that diverge from the model specification - for example, when a linear specification is imposed on data that are strongly and unexpectedly nonlinear. In such a situation regression or logit would likely perform worse than neural networks.
Furthermore, all modeling techniques, parametric as well as nonparametric, have the potential for overfitting, including classical regression and logit. However, the simpler the model specification (absent any knowledge or theory pointing to a more complex form of the relationship), the less problematic overfitting tends to be. Because a model is based upon limited sample information that invariably contains error, it is not unreasonable to expect that the more unrestrained the technique and the more complex the model, the greater the tendency to incorporate or subsume such error into the model itself.
Consider, for example, the comparative potential for overfitting of linear regression versus neural networks. Since linear regression is constrained to fitting the data with a straight line, it is less likely to incorporate sampling error than a more powerful technique such as a neural network, which can contort itself into extremely complex forms.
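The contrast can be made concrete with a short, hypothetical sketch in Python using the scikit-learn library (neither the library nor the simulated data come from the article). A linear regression and a deliberately large neural network are fit to the same small, noisy sample whose true structure is linear; the network’s tighter in-sample fit would be expected to shrink more when both models are scored on fresh data from the same population.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
true_weights = np.array([2.0, -1.0, 0.5, 0.0, 0.0])

# A small, noisy modeling sample whose true relationship is linear
X = rng.uniform(0, 1, (150, 5))
y = X @ true_weights + rng.normal(0, 1, 150)

# Fresh data from the same population, standing in for "future" records
X_new = rng.uniform(0, 1, (2000, 5))
y_new = X_new @ true_weights + rng.normal(0, 1, 2000)

# A constrained, pre-specified form versus a flexible network left to train at length
linear = LinearRegression().fit(X, y)
net = MLPRegressor(hidden_layer_sizes=(100, 100), max_iter=5000, random_state=1).fit(X, y)

for name, model in [("linear regression", linear), ("neural network", net)]:
    fit_sample = r2_score(y, model.predict(X))
    fit_future = r2_score(y_new, model.predict(X_new))
    # Expect the network's in-sample fit to be higher but to fall further on new data
    print(f"{name}: R^2 in sample = {fit_sample:.2f}, on new data = {fit_future:.2f}")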
Paradoxically then, the strength of complex data-fitting techniques is also their potential weakness: the inability to differentiate actual relationships or systematic structure from random perturbations in the sample. Thus, a trade-off exists between fitting the data with an ever-higher degree of exactitude and the confidence one can have in the closeness of the fit.
Given this potential for overfitting, what steps can be taken to decrease its likelihood? One obvious step is to increase the size of the sample on which the model is built. In fact, the ultimate theoretical solution to sample-error overfitting would be to use the entire universe or population: there would then be no sampling error to overfit, and the issue would be moot no matter how closely our model fit the observed data.
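A rough, simulated illustration of the effect of sample size (a NumPy sketch with hypothetical data, not taken from the article): the gap between a flexible model’s error on fresh data and its error on the sample it was fit to - one crude measure of overfitting - tends to shrink as the building sample grows.

import numpy as np

rng = np.random.default_rng(2)

def population(n):
    """A hypothetical population: a mild curve plus irreducible noise."""
    x = rng.uniform(-1, 1, n)
    y = x + 0.5 * x**2 + rng.normal(0, 0.5, n)
    return x, y

def mse(x, y, coefs):
    return float(np.mean((np.polyval(coefs, x) - y) ** 2))

x_future, y_future = population(10000)   # stand-in for the wider universe

for n in (50, 500, 5000):
    x, y = population(n)
    coefs = np.polyfit(x, y, deg=10)      # a deliberately flexible model
    gap = mse(x_future, y_future, coefs) - mse(x, y, coefs)
    # The gap between future error and in-sample error shrinks as the sample grows
    print(f"n = {n:5d}: overfitting gap = {gap:.3f}")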
In many if not most applications, however, increasing the size of the sample, let alone to the size of the universe, is simply not feasible. And for a model of even moderate complexity, the sample size needed to decrease the likelihood of overfitting can be quite large.
Second, and even more important, sampling error is but one component of the error that differentiates our sample from our future target universe, and in many instances it may well be the least pronounced. Unforeseen changes not incorporated into the model can and do occur, introducing nonsampling error that a larger sample will not protect us against. In addition, there is always inherent, non-systematic random variation present even in the population.
An especially problematic source of such extra-sampling error occurs when the end-user violates the assumptions under which the model was built. Marketing personnel, for example, may use the model on different audiences, with a different product or product options, with different creatives, with different price lines and so forth. A model built without allowing for the likely possibility of such violations may very well disappoint.
Alternative steps
Since increasing the size of the sample is not always feasible, or even effective, in addressing the overfitting problem, what additional and/or alternative steps can we take? Cross-validation, in which the model is developed on half the sample and tested or validated on the other half, is frequently used as a reliability check in model building. Known as “split-half” validation, this method, or some variation of it, can be effective in remedying the overfitting problem when the source of error is primarily if not exclusively sampling error. However, when substantial amounts of nonsampling error are present, a model can give the appearance of being reliable when cross-validated, but still be ineffective when used on new data.
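In outline, a split-half validation run might look like the following Python sketch (scikit-learn and the simulated response file are assumptions for illustration, not part of the article): the model is built on one random half and scored on the other, and a sharp drop in performance from the building half to the holdout half is the warning sign of overfitting.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)

# A hypothetical mailing file: a few predictors and a 0/1 response flag
n = 4000
X = rng.normal(size=(n, 8))
p = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))
y = rng.binomial(1, p)

# Split-half validation: build on one half, validate on the other
X_build, X_valid, y_build, y_valid = train_test_split(X, y, test_size=0.5, random_state=3)

model = LogisticRegression().fit(X_build, y_build)

auc_build = roc_auc_score(y_build, model.predict_proba(X_build)[:, 1])
auc_valid = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])

# A large drop from the building half to the validation half signals overfitting
print(f"building-half AUC = {auc_build:.3f}, validation-half AUC = {auc_valid:.3f}")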
The effectiveness of cross-validation can also be compromised by going back and forth between the training and validation samples, revising the model after each pass in order to achieve a maximally acceptable fit in both. The more often this process is repeated, the greater the likelihood that the model is an artifact of random variation and the less likely it is to predict well beyond the two samples. The problem is especially acute when many predictors are available with which to model.
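The danger can be illustrated with another simulated NumPy sketch (hypothetical data, for illustration only): when many pure-noise candidate predictors are screened against the validation piece, whichever one happens to look best there will appear predictive, yet the apparent relationship collapses on a sample that played no part in the selection.

import numpy as np

rng = np.random.default_rng(4)

# Pure-noise "predictors": none has any real relationship to the response
n, n_candidates = 3000, 200
X = rng.normal(size=(n, n_candidates))
y = rng.binomial(1, 0.1, n)

# Three equal pieces: a building piece, a validation piece, and an untouched final sample
build, valid, test = np.split(np.arange(n), 3)

def corr(idx, j):
    """Absolute correlation between candidate j and the response on the given rows."""
    return abs(np.corrcoef(X[idx, j], y[idx])[0, 1])

# Going back and forth: keep whichever candidate looks best on the validation piece
best = max(range(n_candidates), key=lambda j: corr(valid, j))

print(f"chosen predictor, validation correlation = {corr(valid, best):.3f}")
# On data never used in the selection loop, the apparent relationship collapses
print(f"chosen predictor, untouched-sample correlation = {corr(test, best):.3f}")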
Since the tendency to overfit varies both within and between each of the respective modeling techniques, in the final analysis the best safeguard against its more egregious manifestations is an awareness or sensitivity on the part of the person building the model. This would include a knowledge of which technique or class of techniques, parametric or nonparametric, is the most appropriate for any given application, as for example one containing many sources of potential error, sampling and nonsampling, versus one where only sampling errors are likely to be present.
Likewise, a knowledge of the options available to limit the likelihood of overfitting within each of the respective techniques is also important. For example, with neural networks, analysts can limit the number of “input” and “hidden” nodes used in addition to the number of “training epochs,” all of which have the effect of reducing the complexity of the model.
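As a rough illustration, the scikit-learn settings below (an assumed toolkit, not one named above) show how such options might be expressed: fewer hidden nodes, a cap on training passes, and early stopping against a held-out slice all pull the network toward a simpler fit, while limiting “input” nodes simply means feeding it fewer predictors.

from sklearn.neural_network import MLPClassifier

# An unconstrained configuration: many hidden nodes, many training passes
flexible_net = MLPClassifier(hidden_layer_sizes=(200, 200), max_iter=2000)

# A constrained configuration that deliberately limits model complexity
constrained_net = MLPClassifier(
    hidden_layer_sizes=(8,),   # fewer hidden nodes, hence a simpler response surface
    max_iter=100,              # a cap on training epochs (passes over the data)
    early_stopping=True,       # stop once an internal validation slice stops improving
    validation_fraction=0.2,
    random_state=0,
)

# Limiting 'input' nodes happens upstream, by passing the network fewer predictor columns.
# constrained_net.fit(X_build[:, selected_columns], y_build)   # names here are hypothetical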
Regardless of the technique one chooses for a given application, there are compelling reasons to adhere to the law of parsimony, or the “simpler is better” maxim. Simple or less complex models not only absorb less sampling error, but are also better able to withstand external shocks or unanticipated sources of error than models that are tightly built around the sample data. The best approach to building a viable prediction model, therefore, is to obtain the closest fit to the sample data without over-specializing: the model should be just complex enough to provide a good fit to the data, yet flexible enough to allow for nonstructural or non-systematic error variance.
Explanatory vs. predictive
Our discussion of overfitting thus far has focused exclusively on the issue of reliability, or on how well model predictions hold up. There is another area where overfitting is an issue, and that is when models are developed for explanatory rather than for predictive purposes. Whereas predictive models are designed to predict future values of the dependent variable given known values of the predictor variables, explanatory (or interpretive) models are developed to gain insight into the interrelationships existing among the predictor variables as they relate to the dependent variable. A model developed by marketing research to determine how selected variables relate to product purchase is an example of an explanatory model, while a model designed to save mailing costs by predicting which customers on a file are most likely to respond is an example of a predictive model.
These two purposes are by no means mutually exclusive. However, certain constraints or assumptions attach to interpretive or explanatory models that are not required of a predictive model, the first being that the model be correctly, or at least reasonably well, specified. This means there should be some knowledge of which variables belong in the model, based either on a priori theory or intuition, or discovered empirically through a judicious examination of the data via OLAP, standard EDA, or certain data mining EDA software modules. If such knowledge is lacking, the model effects or coefficient values may be biased to one degree or another, and spurious relationships may well result.
In addition, explanatory models should be simple enough to allow for comprehensibility, since highly complex models are by their very nature difficult to interpret. An overfit explanatory model thus captures too many nuances in the data, or otherwise embodies excessive, confusing detail that prevents us from discerning the essence of the data. Or, to borrow from Lotfi Zadeh, creator of the concept of fuzzy logic: “As complexity rises, precise statements lose meaning, and meaningful statements lose precision.”