Editor's note: Gary M. Mullet, Ph.D., is principal of Gary Mullet Associates, Inc., a suburban Atlanta, Ga., statistical consulting/analysis firm.
Much has recently been written, on these pages and elsewhere, about using regression analysis in marketing research. In fact, a while back in Quirk's Marketing Research Review there was a very enlightening, informative and somewhat heated series on using (or not using) regression analysis to determine derived attribute importance weights in customer/client satisfaction measurement studies (cf. "Regression-based satisfaction analyses: Proceed with caution," October 1992 QMRR; "Appropriate use of regression in customer satisfaction analyses: a response to William McLauchlan," February 1993 QMRR).
The following few paragraphs will not enter further into that fray. Instead, we'll go back into some of the more fundamental underpinnings of regression analysis, irrespective of particular applications. This is motivated by some recent questions, comments and observations from a variety of sources, all aimed at a better understanding of this fundamental statistical analysis tool. Of course, a lot of these issues have been treated elsewhere, too, but the time seems opportune for a memory refresher.
Missing data (or How come I paid for 750 interviews and your regression analysis only used 37 of 'em?)
Unlike when you took your introductory statistics course (or vice versa), real respondents frequently fail to answer every question on a survey. Consequently, many times when we run a regression analysis using canned software programs, we end up with many fewer respondents than anticipated. Why? Because the packages treat regression analysis as a multivariate procedure (many statisticians don't, by the way) and drop from the analysis any respondent who fails to answer even a single item from your set of independent or dependent variables.
What to do, what to do? I'm not sure that there is a definitive answer, but here's what some analysts do. First, there's always the option of using just the respondents who've answered everything. This has the effect of basing your research report on 50 percent or so of the respondents, sometimes less. Many times this is (subjectively) appealing, since you have the assurance that your model is based on those who answered each and every question. It's not so appealing, however, when you start with a sample of 1,000 or so and end up with only 100 of them dictating your regression results - especially if the model is to determine compensation or to forecast sales.
Many analysts prefer to use mean substitution for any values which are missing. If you tell them to, the software packages automatically substitute the mean of everyone who did answer a particular question for the missing answers of those who, for whatever reason, didn't. Here's what can happen when you use this option (both problems are spelled out in the short sketch after this list):
- If there is a given question that, say, 90% don't answer, the mean answer of the remaining 10% who did is substituted and used as if that's what the other 90% said.
- A respondent who answers few, or no, questions can still be included in the analysis, unless the analyst overrides the automatic substitution of means, since a mean value is also used for the dependent regression variable. Oops!
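To see both of these in action, here's a minimal sketch in Python with pandas - my choice of tool, not anything the packages themselves dictate - and the respondents, ratings and column names are invented purely for illustration.

```python
# A minimal sketch of listwise deletion versus grand-mean substitution.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "overall":     [9, 7, np.nan, 8, 6],                  # the dependent variable
    "convenience": [8, np.nan, 6, 7, 5],
    "value":       [np.nan, np.nan, np.nan, 9, np.nan],   # answered by only 1 of 5
})

# Option 1 - listwise deletion: only fully complete respondents survive.
print(len(df.dropna()), "of", len(df), "respondents left")  # 1 of 5

# Option 2 - grand-mean substitution: every blank gets the column mean,
# including blanks on the dependent variable and on the question that
# only one respondent answered.
print(df.fillna(df.mean()))
```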
Now, both of these things happen - probably rarely, but certainly more often than they should. It seems obvious that we should look at each question to see what the item nonresponse is and, if there are questions that are particularly high on item nonresponse, try the regression without them. Common sense also says to exclude respondents who fail to answer the question we're using as a dependent variable (overall opinion, overall satisfaction, or some such) and to exclude any who don't answer some specified minimum number of the independent variable ratings - 75 percent or whatever criterion you decide on. Luckily, both of these screens are easily applied with the current software. Sadly, as noted, they aren't always.
An alternative that seems to be gaining favor is to use only those who've answered the dependent variable (kind of a no-brainer in most circles) and then substitute the respondent's own mean over the items he or she did answer for the ones he or she didn't. Again, you'll probably only want to do this for those who've answered a majority of the items. This variation takes scale usage into account and is appealing because to some respondents "there ain't no tens."
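Here's a sketch of that own-mean variation under the same caveats - invented ratings, plus an assumed cutoff of answering at least 75 percent of the attribute items:

```python
# Screen on the dependent variable and a minimum-answered rule,
# then fill remaining blanks with each respondent's own mean.
import numpy as np
import pandas as pd

dependent = "overall"
attributes = ["convenience", "value", "staff", "price"]

df = pd.DataFrame({
    "overall":     [9, 7, np.nan, 8],
    "convenience": [8, np.nan, 6, 7],
    "value":       [9, 8, np.nan, np.nan],
    "staff":       [7, 8, 5, 9],
    "price":       [6, np.nan, 4, 8],
})

# Keep only respondents who rated the dependent variable ...
kept = df[df[dependent].notna()].copy()

# ... and who rated at least 75% of the attribute items.
kept = kept[kept[attributes].notna().mean(axis=1) >= 0.75]

# Fill each remaining blank with that respondent's own mean across
# the attributes he or she did rate.
row_means = kept[attributes].mean(axis=1)
kept[attributes] = kept[attributes].apply(lambda col: col.fillna(row_means))
print(kept)
```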
Significance testing (or If they're all significant, how come no single one is?)
Regression software packages generally test two types of statistical hypotheses simultaneously. The first type has to do with all of the independent variables as a group. It can be worded in several equivalent ways: Is the percentage of variance of the dependent variable that is explained by all of the independent variables (taken as a bunch) greater than zero? Or, can I do a better job of predicting/explaining the dependent variable using all of these independent variables than using none of them? Either way, you'll generally see an F-statistic and its attendant significance level, which helps you decide whether or not the variables, as a group, help out. This, by the way, is a one-sided alternative hypothesis, since you can't explain a negative portion of the variance.
Next, there's usually a table which shows a regression coefficient for each variable in the model, a t-statistic for each and a two-sided significance level. (The latter can be converted to a one-sided significance level by dividing by two, which you'll need to do if, for example, you've posited a particular direction or sign, a priori, for a given regression coefficient. The halving shortcut only works when the estimated coefficient actually falls on the side you posited; otherwise the one-sided significance level is one minus half the two-sided value.)
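If you like to see that rule written down, here's a tiny sketch; the function name and numbers are mine, purely illustrative, and the conversion assumes you picked the direction before peeking at the data.

```python
# Convert a printed two-sided significance level to a one-sided one.
def one_sided_p(two_sided_p: float, coefficient: float, expect_positive: bool = True) -> float:
    estimate_matches = (coefficient > 0) == expect_positive
    return two_sided_p / 2 if estimate_matches else 1 - two_sided_p / 2

print(one_sided_p(0.08, coefficient=1.4))    # 0.04 - significant at .05, one-sided
print(one_sided_p(0.08, coefficient=-1.4))   # 0.96 - the estimate has the "wrong" sign
```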
Now here's the funny thing: You will sometimes see regression results in which the overall regression (the F-statistic, above) is significant, but none of the individual coefficients are accompanied by a t-statistic which is even remotely significant. This is especially common when you are not using stepwise regression and are forcing the entire set of independent variables into the equation. How can this be? It can be because of the correlation between the independent variables. If they are highly correlated, then as a set they can have a significant effect on the dependent variable. Individually they may not.
Look at it this way. Let's say that we are measuring temperature in both degrees Fahrenheit and degrees Celsius, measuring with thermometers, rather than measuring one and calculating the other using the formula most of us hoped we'd never have to remember beyond high school chemistry. (By measuring, given the inaccuracies of most thermometers, the computer won't give us nasty messages. It probably would if we measured only one and calculated the other.) Next let's say we're going to use these two temperatures to predict something else we've experimentally determined, like pressure. Clearly, the two temperatures together will explain a significant proportion of the variation in pressure in a closed container. Also, maybe not as clearly, neither will individually be significant, because they are correlated with each other and each is redundant given the other. That last clause is the kicker.
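If you'd like to watch this happen, here's a small simulation in Python with numpy and statsmodels; the sample size, measurement noise and pressure relationship are all invented, so treat it as a sketch of the pattern rather than a lab result.

```python
# Two noisy thermometers reading the same temperature, used together
# to predict a "pressure" that depends on that temperature.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 30
true_temp_c = rng.uniform(10, 40, n)               # the quantity both thermometers chase

celsius = true_temp_c + rng.normal(0, 0.2, n)                    # thermometer 1
fahrenheit = true_temp_c * 9 / 5 + 32 + rng.normal(0, 0.4, n)    # thermometer 2

pressure = 100 + 0.35 * true_temp_c + rng.normal(0, 1.5, n)      # what we try to predict

X = sm.add_constant(np.column_stack([celsius, fahrenheit]))
fit = sm.OLS(pressure, X).fit()

print("R-squared:", round(fit.rsquared, 3))
print("overall F p-value:", fit.f_pvalue)          # typically very small: the pair helps
print("individual t p-values:", fit.pvalues[1:])   # usually neither is anywhere near .05
```

Refit with either temperature column by itself and its t-statistic should be comfortably significant; it's only the partial, given-the-other test that comes up empty.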
When you look at the significance of a regression coefficient (or at the coefficient itself, for that matter), you are seeing the effect of that particular variable given all of the others in the model. This is properly called the significance of a partial regression coefficient, and the B is the partial regression coefficient itself. Either degrees Fahrenheit or degrees Celsius alone would be a significant predictor of pressure (this is the total regression), but either one given the other - the partial effect - will not be significant.
This makes sense, I hope, and will help explain some seemingly strange regression outputs. It also leads us to the next issue.
Wrong signs (or How can the slope be negative when the correlation isn't?)
This happens all the time. You know that overall satisfaction and convenience are positively correlated - higher ratings on one of these go with higher ratings on the other, and the lower ratings go together, too. Yet in a multiple regression, the sign of the coefficient for convenience is negative. How come?
There are a couple of things which can be going on. First, the t-statistic for the coefficient may not be statistically significant. We interpret this as an indication that the coefficient is not significantly different from zero and, hence, that the sign (and magnitude, for that matter) of the coefficient is spurious. Fully half the time, when the true effect is zero, the estimated sign will be "wrong."
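A quick, entirely made-up simulation makes the point: when the true effect is zero, the estimated slope comes out negative about half the time.

```python
# Count how often a truly useless predictor gets a negative estimated slope.
import numpy as np

rng = np.random.default_rng(1)
trials = 2000
wrong_sign = 0
for _ in range(trials):
    x = rng.normal(size=200)
    y = rng.normal(size=200)          # y does not depend on x at all
    slope = np.polyfit(x, y, 1)[0]    # simple regression slope
    wrong_sign += slope < 0           # call a negative estimate "wrong"
print(wrong_sign / trials)            # hovers around 0.5
```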
The other thing that can be happening is the partialling effect noted above. It could be that the slope is negative given the effect of the other variables in the regression (partial) even though all by itself the variable shows a positive correlation and slope (total).
Try this. Here's a small data set.
Resp. # | OVERALL | Package | Value | Taste | Color |
1 | 5 | 6 | 9 | 13 | 11 |
2 | 3 | 8 | 9 | 13 | 9 |
3 | 9 | 8 | 9 | 11 | 11 |
4 | 7 | 10 | 9 | 11 | 9 |
5 | 13 | 10 | 11 | 9 | 11 |
6 | 11 | 12 | 11 | 9 | 9 |
7 | 17 | 12 | 11 | 7 | 11 |
8 | 15 | 14 | 11 | 7 | 9 |
Let OVERALL be the dependent variable and run various regressions, first with each independent variable by itself and then with various combinations of the independent variables. Some things to look for are:
- Correlations between all variables, both in magnitude and sign;
- Sign and magnitude of regression coefficients (B) when each variable is by itself as a predictor;
- Sign and magnitude of regression coefficients for the variables when they are working with the other variable(s).
You should see some interesting things with respect to both the magnitude and sign of some of the regression coefficients. Are the signs sometimes "wrong"? If so, are they wrong when the variable is used alone or wrong in the later models? Possibly they are wrong neither time; they just happen not to agree. By the way, checking the F-statistic for the models including more than one predictor, along with the individual t-statistics, will also hammer home some of the earlier points. The sketch below runs through these regressions.
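Here's one way to run the exercise, in Python with pandas and statsmodels (any regression package will do); the combinations chosen below are just a starting point.

```python
# The article's small data set, each predictor alone and a couple of pairs.
import pandas as pd
import statsmodels.api as sm

data = pd.DataFrame({
    "OVERALL": [5, 3, 9, 7, 13, 11, 17, 15],
    "Package": [6, 8, 8, 10, 10, 12, 12, 14],
    "Value":   [9, 9, 9, 9, 11, 11, 11, 11],
    "Taste":   [13, 13, 11, 11, 9, 9, 7, 7],
    "Color":   [11, 9, 11, 9, 11, 9, 11, 9],
})

# Simple correlations: check sign and size before any multiple regressions.
print(data.corr().round(2))

y = data["OVERALL"]
models = [["Package"], ["Value"], ["Taste"], ["Color"],
          ["Package", "Value"],   # compare the overall F with the individual t's
          ["Package", "Taste"]]   # watch Package's sign; this pair happens to fit
                                  # these eight ratings exactly, so expect an
                                  # eye-popping R-squared as well

for predictors in models:
    fit = sm.OLS(y, sm.add_constant(data[predictors])).fit()
    print(predictors,
          "B:", fit.params.drop("const").round(2).to_dict(),
          "t p-values:", fit.pvalues.drop("const").round(3).to_dict(),
          "F p-value:", round(fit.f_pvalue, 3),
          "R-squared:", round(fit.rsquared, 3))
```

In particular, compare Package's coefficient when it stands alone with its coefficient once Taste joins it, and compare the overall F significance with the individual t significances in the Package-and-Value model.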
Conclusions
The preceding paragraphs have addressed three or four pragmatic issues in using regression analysis. While they do more than scratch the surface, there are certainly other regression fundamentals that were neglected. However, taking the small data set provided and running a thorough set of regressions should allay many of the qualms that regression analysts face.