Editor’s note: Gary M. Mullet is president of Gary Mullet Associates, a Lawrenceville, Ga., data analysis and consulting firm.
In response to a reporter’s question, a member of one of the Sweet 16 basketball teams in this year’s NCAA tourney discussed and defined regression analysis. His answer had to do with the idea that if he (or an opponent) had a bad game, his next game would probably be better, due to the regression effect. The question was posed to the student-athlete after it was learned that he had to take an examination on regression analysis while on the road for this session of March Madness. As most readers know, his example of the regression effect closely parallels that of the relationship between fathers’ and sons’ heights, which some sources say gave rise to the term regression analysis over 100 years ago.
What follows are answers to several other questions, some dealing with regression analysis, some not. The pages of this column have covered a wide variety of topics and some of the answers below are borrowed liberally from them and some aren’t. (Remember: stealing from one author is plagiarism; stealing from several is research. Since I stole that statement from only one source, it’s plagiarism. I’d love to give credit but honestly don’t know the original source.)
Why do some regression coefficients have the wrong sign?
What exactly is meant by the "wrong sign"? Computationally, hardware and software are at the stage where, for a given data set, the signs are undeniably calculated correctly. (Such was not necessarily the case back in the days of punched cards, about which more later.) There could be a couple of things going on. First, your theory could be wrong; that is, the sign might not really be wrong. Second, it could be a statistically non-significant result, in which case the sign of the coefficient is meaningless. Third, it might be that, due primarily to collinearity in the data set, you are comparing the sign of a partial regression coefficient with expectations from a total regression relationship. The partial coefficient measures the relationship between the criterion and the particular independent variable of interest after accounting for what goes on with the other independent variables. The total coefficient ignores the other independent variables and looks only at the relationship between the criterion variable and a single predictor.
It’s not at all uncommon to see a positive sign attached to a correlation coefficient involving a single predictor and a single dependent variable (predictee?) and yet find that the regression coefficient for this same predictor, when other predictors are in the equation, is negative. Run a handful of regressions with a larger handful of predictors and you will almost assuredly see several such "wrong" signs. They may or may not be cause for concern, depending on the intent of the study for which you are doing regression analysis in the first place. If you are among those who use beta coefficients to allocate relative "importance," you might be in for a headache due to these sign reversals.
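If you'd like to see a "wrong" sign manufactured on demand, here is a minimal simulation sketch in Python (my own toy example with invented numbers, not anything from a real study): two collinear predictors and a criterion built so that the simple correlation with the second predictor is positive while that predictor's coefficient in the two-predictor equation is negative.

```python
# A minimal sketch of a "wrong sign": the simple correlation between y and x2
# is positive, yet x2's coefficient in the two-predictor regression is negative.
# All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two highly collinear predictors (correlation about 0.9).
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)

# True model: y depends positively on x1 and negatively on x2.
y = 2.0 * x1 - 0.5 * x2 + rng.normal(size=n)

print("simple correlation of y with x2:", np.corrcoef(y, x2)[0, 1])

# Multiple regression of y on x1 and x2 (with an intercept).
X = np.column_stack([np.ones(n), x1, x2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print("partial (multiple-regression) coefficient on x2:", coefs[2])
```

The simple correlation comes out clearly positive while the multiple-regression coefficient sits near its true value of -0.5; neither sign is "wrong," they are simply answering different questions.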
How does sample size impact regression and multiple correlation?
Well, we see a couple of things going in opposite directions here. A smaller sample will usually show a larger R2 and a smaller number of statistically significant predictors than will a larger one from the same population. It’s a degrees-of-freedom phenomenon and makes at least a modicum of sense.
Remembering back to when you took geometry (or, in my case, vice versa), you saw that two points perfectly determine a straight line, three points determine a plane, four points determine a hyperplane in four-dimensional space, and so on. Regression analysis is really doing nothing other than estimating the coefficients of the equations for those lines, planes and hyperplanes. It should be fairly easy to see that the fewer data points we have, the better the fit of the planes to the data, usually. Thus, R2 will be larger with fewer data points, generally speaking. That’s why, if you use R2 to judge the goodness of the model as many are wont to do, some of your models look "better" when you analyze small subsets of the sample than when you analyze the total sample. Economists were among the first to recognize this and in most introductory econometrics texts you’ll find a definition of R2-adjusted-for-degrees-of-freedom. This adjusted R2 is routinely shown as part of the output of most current software packages. It’s very disconcerting when this value shows up as negative. If it does, you are woefully short on sample!
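The usual adjustment is adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the sample size and p the number of predictors. Here is a small sketch in Python (my own illustration, with arbitrary sample sizes and pure-noise data) of the degrees-of-freedom effect the paragraph above describes.

```python
# Regress pure noise on several noise predictors and compare R-squared and
# adjusted R-squared for a small and a larger sample. With the small sample,
# R2 looks impressive; adjusted R2 tells the truer, near-zero story and can
# even come out negative.
import numpy as np

def r_squared(n, p, rng):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
    y = rng.normal(size=n)                      # y is unrelated to X
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    r2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
    adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # adjusted for degrees of freedom
    return r2, adj

rng = np.random.default_rng(1)
for n in (15, 500):
    r2, adj = r_squared(n, p=8, rng=rng)
    print(f"n = {n:3d}:  R2 = {r2:.2f}   adjusted R2 = {adj:.2f}")
```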
As for the number of significant predictors, as the number of observations increases, the denominator of the statistic that determines whether a regression coefficient is significant decreases, other things remaining constant. Thus, it’s "easier" for a coefficient to be statistically significant and, with bigger samples, more will be declared significant than when you are analyzing smaller samples.
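For the curious, the statistic in question is the ratio of the estimated coefficient to its standard error; in the usual textbook notation (my notation, not anything from the column itself),

$$
t_j = \frac{\hat{\beta}_j}{\operatorname{SE}(\hat{\beta}_j)}, \qquad
\operatorname{SE}(\hat{\beta}_j) = \frac{s}{\sqrt{(1 - R_j^2)\,\sum_i (x_{ij} - \bar{x}_j)^2}},
$$

where $s$ is the residual standard deviation and $R_j^2$ measures how collinear predictor $j$ is with the other predictors. The sum in the denominator grows as observations are added, so the standard error shrinks and the $t$-ratio grows, other things being equal.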
As you are no doubt aware, samples that are inordinately large are troublesome in other statistical analyses, too. Even with simple t-tests for independent means, big samples will show that even minuscule sample mean differences are significant. In these cases, as well as those above, the real issue is whether the results are substantive in addition to being statistically significant. The answer may be that they are not, simply because the sample was too large.
Why are my R2 values so lousy when I use yes-no predictors?
Yes-no predictors, or dummy variables, come about when we use qualitative variables (e.g., gender, brand used most often, education category, etc.) rather than quantitative variables as predictors in a regression analysis. Low R2 values are pretty much par for the course when the predictors are dummies. You won’t find much about this in print, but Michael Greenacre wrote about it, maybe with a proof, a few years ago in a Journal of the American Statistical Association paper. For our purposes here, it’s something to acknowledge and, in part, it points out what many consider the folly of comparing the goodness of regression models by using R2.
If you have some statistically significant regression coefficients and your regression equation makes a degree of substantive sense, then you might want to ignore the magnitude of R2 when using dummy predictors; the little sketch below illustrates the point. See the next question for more on small correlation coefficients.
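Here is a small simulation sketch in Python (the effect size, noise level and sample size are invented for illustration): one yes-no predictor with a real effect, plenty of respondent-to-respondent noise, a highly significant coefficient and a "lousy" R2.

```python
# One significant yes/no (dummy) predictor plus lots of individual noise:
# the coefficient is clearly significant, yet R-squared stays small.
import numpy as np

rng = np.random.default_rng(2)
n = 1000
brand_a = rng.integers(0, 2, size=n)          # 1 = uses Brand A, 0 = does not
y = 5.0 + 0.5 * brand_a + rng.normal(size=n)  # real half-point shift, sd-1 noise

X = np.column_stack([np.ones(n), brand_a])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
s2 = resid @ resid / (n - 2)                  # residual variance
se_b1 = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])

r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
print(f"R2 = {r2:.3f}, t for the dummy = {b[1] / se_b1:.1f}")
# Typical run: R2 around 0.06 with a t-ratio near 8 -- highly significant, tiny R2.
```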
Why do performance items with higher means commonly seem less "important" than those with lower means in CSM studies where we find derived importance?
While this is not always the case, what you’ll see as often as not is that items with higher variance are more correlated with the overall measure than those with lower variance. This is because the whole idea of correlation has to do with joint variation. Thus, a measure with low variance usually can’t explain or account for as much variation in the criterion or overall measure as will one with a higher variance. So an item on which performance is universally high in the minds of respondents can’t explain much variance in (or be strongly correlated with) the overall measure just because it (the item) doesn’t vary. However, an item on which mean performance is so-so usually has a lot of individual variation (some respondents think you perform great, others think just the opposite, hence a middling mean) and, thus, can account for a lot of shared variation in the overall measure. Correlation is really non-directional and is looking at nothing other than shared variation, much as we’d like to read a dependence relationship into it.
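A quick way to convince yourself is to simulate it. The sketch below (Python, with invented means, variances and weights) builds an overall measure from two equally weighted items, one rated near the top of a 10-point scale by nearly everyone and one with a middling mean; the high-mean, ceiling-squeezed item shows the weaker correlation.

```python
# Two items that "drive" overall satisfaction equally, but one sits near the
# top of the scale for nearly everyone (low variance) and so correlates less
# with the overall measure.
import numpy as np

rng = np.random.default_rng(3)
n = 2000

item_mid = rng.normal(6.0, 2.0, size=n).clip(1, 10)   # middling mean, big spread
item_high = rng.normal(9.3, 2.0, size=n).clip(1, 10)  # high mean, squeezed at the ceiling

overall = 0.5 * item_mid + 0.5 * item_high + rng.normal(0, 1.0, size=n)

print("variances:", item_mid.var().round(2), item_high.var().round(2))
print("corr with overall:", np.corrcoef(overall, item_mid)[0, 1].round(2),
      np.corrcoef(overall, item_high)[0, 1].round(2))
```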
The same thing can occur when you correlate a lot of stated importance items with an overall measure. The higher the stated mean importance, the lower the correlation. Again, recognize that this is not a universal phenomenon, but something those of you doing lots of CSM work might want to chew on.
Some say never use individual respondent data in multiple correspondence analysis (MCA). Why?
Beats me, although it’s undeniably easier to use a big crosstab table when doing an MCA than to use the proper individual respondent data. In fact, without a lot of inventiveness there’s no way that some data sets can even be put into an appropriate crosstab table prior to MCA. The major reason one should use individual respondent-level data is that you then see the effect of correlated answers in the resulting perceptual map. If the various categories (age, sex, BUMO, etc.) are truly statistically independent, then it’s O.K. to use summary data for an MCA. Otherwise, be very careful.
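For what it’s worth, here is a tiny illustration (Python with pandas; the variables and answers are invented) of the difference between the individual-level input to MCA (one row per respondent, coded into 0/1 indicator columns) and a single summary crosstab, which has already collapsed away how a respondent’s answers line up across the other questions.

```python
# Individual-level indicator matrix versus a summary crosstab.
import pandas as pd

respondents = pd.DataFrame({
    "age":  ["18-34", "18-34", "35-54", "55+", "35-54", "55+"],
    "sex":  ["F", "M", "F", "M", "M", "F"],
    "bumo": ["Brand A", "Brand A", "Brand B", "Brand C", "Brand B", "Brand C"],
})

# Individual-level input: one row per respondent, one 0/1 column per category,
# so correlated answers (here, age and BUMO track each other) stay visible.
indicator = pd.get_dummies(respondents)
print(indicator)

# Summary input: a single two-way table. Any association between, say, BUMO
# and sex has already been collapsed away.
print(pd.crosstab(respondents["age"], respondents["bumo"]))
```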
Does respondent order really make a difference in segmentation studies?
This can be answered with an unequivocal, "maybe, maybe not." If you’re using an older clustering program, you might see some differences in your segmentation results depending on the order of the respondent data. On the other hand, if you are using a newer program that does iterations (such as the k-means program found in PC-MDS and elsewhere), input order is pretty much immaterial. We’ve run several analyses where we’ve taken a data set, clustered it, randomized the order of the data, reclustered, and so on. The differences in the segments were so minor as to be negligible. They certainly weren’t substantive. Using the data in the order collected probably works as well as any other order. The check is easy enough to run yourself; a sketch follows.
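Here is roughly what that cluster-shuffle-recluster check looks like with a current iterative program. This sketch uses Python and scikit-learn’s k-means on made-up rating data (my choice of tool, not the PC-MDS program mentioned above); an agreement score of 1.0 means the two segmentations match exactly, up to relabeling.

```python
# Cluster, shuffle the respondent order, cluster again, and compare.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 8))                 # 300 respondents, 8 rating items

labels_original = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

order = rng.permutation(len(X))               # randomize respondent order
labels_shuffled = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X[order])

# Put the shuffled labels back in the original order before comparing.
unshuffled = np.empty_like(labels_shuffled)
unshuffled[order] = labels_shuffled

print("agreement (adjusted Rand):", adjusted_rand_score(labels_original, unshuffled))
```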
What is "card image" data?
This question really makes me feel old and I hear it frequently. It seems like only yesterday when data were input into a computer via a card reader. The card reader read information that was punched into cards, commonly called IBM-cards but more properly known as Hollerith cards (although Hollerith was one of the founders of IBM), which had 80 columns. Into each column we could put either a number, a letter or a special symbol by punching a small rectangle out of one or more of 12 possible positions down the column by using a machine with a keyboard similar to a typewriter (another archaic instrument). No wonder the cards were/are also sometimes called punch(ed) cards. These cards were also used to enter the computer programs. For the budget-conscious, one could correct errors in the cards by filling unwanted holes with a wax-like substance - less expensive but harder than just retyping the whole thing.
In marketing research, then, card-image data refers to data that consist of one or more "cards" of 80 columns per respondent. Of course, we can’t call them cards since they aren’t, so the new nomenclature is "records." Thus, a study using card-image data with five records per respondent means that we have up to 400 columns of data per respondent, arranged in blocks of 80 columns each. If we arrange this same data into one long record of 400 columns per respondent, we have what some call string data, which is not the same as what some computer programs mean by the term "string data." In either case, the tough job is to tell the computer what can be found where so you can get the answers that you seek before the Friday afternoon deadline.
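For the mechanically inclined, here is a toy sketch in Python of what "telling the computer what can be found where" amounts to; the five-records-per-respondent layout and the column positions are invented for illustration.

```python
# Read card-image data: five 80-column "cards" (records) per respondent,
# stitched into one 400-column string per respondent.
CARD_WIDTH = 80
CARDS_PER_RESPONDENT = 5

def read_card_image(path):
    respondents = []
    with open(path) as f:
        # Pad or trim each record to exactly 80 columns.
        cards = [line.rstrip("\n").ljust(CARD_WIDTH)[:CARD_WIDTH] for line in f]
    for i in range(0, len(cards), CARDS_PER_RESPONDENT):
        respondents.append("".join(cards[i:i + CARDS_PER_RESPONDENT]))
    return respondents

# Pulling a field is then just slicing by column position, e.g. a two-digit
# age punched in columns 11-12 of card 2 (columns 91-92 of the long record):
# age = int(record[90:92])
```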
Also, note that while most programs used in marketing research data processing can handle card-image data, not all of them can handle data that are not in card-image form.
By the way, punched cards run through a card sorter, sorting and matching on particular punches in particular columns, served as an early version of computer matching of potential dates (for social activities, not for calendars). Finally, an interesting item to put on a list for a scavenger hunt is a "punch card." Good luck finding one!
Any ideas for some readable reference material on statistical analysis?
Most anything that Jim Myers has ever published is well worth reading. His papers and books are very pragmatic, offer alternative viewpoints and make the reader think. I personally like materials that are not dogmatic and recognize that, particularly in marketing research, there may be two or more ways of looking at a particular type of analysis. (Jim also remembers what punched cards were, I’m sure.)
In our business, applications articles are interesting and sometimes directly useful. However, recognize that if a company or individual comes up with a truly unique way to solve a thorny data analysis problem it will probably never be shown to the general research community, instead remaining proprietary. That said, of course, not every technique that is put forth as proprietary is necessarily a unique problem-solving tool.
The heavy duty technical articles and books are probably more cutting edge, but difficult for most of us to read and even harder to directly apply. This is not to say that they should be discontinued, but there may be a long time between the publication of such an article and when you can use it in an ATU study, say.
I enjoy the articles in columns such as this one, but the reader has to recognize that these papers are not peer reviewed and so may contain some things that are not 100 percent factual. That’s O.K. as long as the reader doesn’t take them as gospel, but notes that they are merely opinions about how to solve certain problems. You should also use a large salt shaker when pulling material from the Internet. I don’t hesitate to ask others where to find information on such-and-such; you shouldn’t either. There is no perfect source of information.