Editor’s note: Ken Faro is VP of research and Elie Ohana is a researcher at Hill Holliday's Decision Science practice, Boston.
During the early 2000s, the field of psychology saw a rise in the number of cases of scientific misconduct. In fact, in a 2009 paper on scientific misconduct, 34 percent of researchers admitted to engaging in some form of misconduct, and that figure jumped to 72 percent when they were asked about their colleagues’ questionable research practices. The ugly problem of fabricated research and data has been, and continues to be, a very real threat to the validity of science.
While these instances of fraud were primarily found in academic psychology, those of us practicing in applied settings are also at risk. In part one of this two-part article we will look at a number of cases of observed misconduct. These examples are a reminder for all of us in MR that the work we do should be held to a high standard.
False or misleading data
The misconduct observed in academic psychology in the early 2000s fell into one of three categories:
- Doing the research and massaging the data. In this version of scientific misconduct, a researcher plans and executes their research, but then begins to “smooth over” the data – shaping and molding it into a form more likely to show the results the researcher wants. This often happens in the data-cleaning phase. While some data cleaning is required to meet certain statistical assumptions, the phase can also be abused by deleting cases that run counter to the pattern the researcher is looking for (see the sketch following this list). The research is actually done, and the data may be real, but because statistical and theoretical hypotheses are not tested so much as selected for, this is a blatant case of scientific misconduct.
- Doing the research but adding fabricated data. In this version of scientific misconduct, the research is actually done and the data is real – mostly. The researcher might collect the data, clean it properly and test their hypotheses, only to find the results trending toward significance (a low p-value that still misses the conventional threshold). To push the results over the line from trending to truly significant, the researcher adds some fake respondent data.
- Not doing the research and making up all of the data. The last case of scientific misconduct, and by far the worst, is where the researcher doesn’t do the research and doesn’t have any real data. Instead, they fabricate the statistical tests, the p-values and everything else.
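To make the first category concrete, here is a minimal sketch in Python (our own illustration; the groups, sample sizes and cutoff are invented) of why selective case deletion is so corrosive: two groups are drawn from the same distribution, so there is no real difference, yet dropping the cases that run against the hoped-for pattern can manufacture a “significant” one.

```python
# Hypothetical illustration of "massaging the data" via selective case deletion.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two groups drawn from the SAME distribution: any observed difference is noise.
group_a = rng.normal(loc=5.0, scale=1.0, size=200)
group_b = rng.normal(loc=5.0, scale=1.0, size=200)

# With all the data included, the test will usually (and correctly) come back
# non-significant.
print("All data:", stats.ttest_ind(group_a, group_b).pvalue)

# The "massage": delete the 50 cases in group B that most contradict the
# hoped-for pattern (here, the highest scores), then re-test. The p-value
# will typically fall well below .05 even though no real effect exists.
group_b_trimmed = np.sort(group_b)[:150]
print("After selective deletion:", stats.ttest_ind(group_a, group_b_trimmed).pvalue)
```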
Table 1 catalogues some of the bigger cases of scientific misconduct over the past 18 years. Note how many papers had to be retracted as a result.
Table 1

| Name | Year of Discovery | Retractions | Reference |
| --- | --- | --- | --- |
| Karen Ruggiero | 2001 | 2 | http://www.apa.org/ |
| Marc Hauser | 2002 | 1 | https://www.bostonglobe.com/ |
| Roxana Gonzales | 2008 | 2 | https://www.the-scientist.com/ |
| Diederik Stapel | 2011 | 58 | https://www.chronicle.com/ |
| Dirk Smeesters | 2012 | 4+ | https://www.sciencemag.org/ |
| Michael J. Lacour | 2015 | 1 | https://fivethirtyeight.com/ |
| Jens Förster | 2015 | 3 | http://blogs.discovermagazine.com/ |
The bad: questionable research practices
So how can marketing researchers and clients learn from the questionable practices observed in psychology? First, clients must ask themselves if they feel that their vendors have processes in place to prevent scientific misconduct. After all, imagine making business decisions based on data that resulted from such delinquency. We like to think no vendor would go as far as some of the examples we’ve seen in psychology – making up data, doctoring data or massaging data. However, there are questionable practices that exist in MR that are smaller but still lead to erroneous conclusions. We’ve outlined two of them here in hopes that clients and vendors will be more aware of the practices of bad science.
Mindless data mining
Mindless data mining, commonly referred to as data dredging, is looking at data with no theory and no hypothesis – just the goal of exploration. This is the opposite of the proper scientific process: we are taught to review the literature, derive hypotheses from it and test them in a well-designed study capable of falsifying them. But the truth is that there isn’t always literature out there. For new fields or new problems, someone has to be the first to explore possible relationships among variables. So while there is nothing inherently wrong with exploratory research and data mining, several problems arise if researchers are not aware of two methodological issues:
- P-hacking: Let’s say we want to test the difference between males and females on purchase intent for a new car. Many researchers and analysts are taught that testing for a significant difference comes down to one value – the p-value. If the p-value yielded by your test is equal to or less than .05, there is a significant difference. Unfortunately, this thumbs-up/thumbs-down approach to p-values reflects a superficial understanding of significance testing – one that is likely to cause problems for those busy exploring the depths of their data set. A p-value is the probability of obtaining a test statistic as extreme as, or more extreme than, the one observed in the data, under the assumption that the null hypothesis is true. So a p-value of .05 means that, if there were truly no difference, a result this extreme would occur only about 5 percent of the time by chance. But that is still a probability: about five out of every 100 such tests will look significant purely by chance. Why is this a problem for mindless data mining? If over the course of a day at work you run 500 uncorrected hypothesis tests to explore your data set, roughly 25 of them can be expected to be false positives (see the simulation sketched after this list). That’s a huge amount. No one wants to base business decisions on false positive results.
- Non-replications: With false positives being a probable occurrence in our daily research, one would expect researchers to replicate their findings before concluding that they are true. However, only in the past few years have academics begun paying close attention to the replicability of findings. For example, in 2015 the Open Science Collaboration conducted the first large-scale reproducibility study of psychological research. Using study designs similar to the originals, the group estimated the reproducibility of 100 experimental and correlational studies in psychology. The authors reported several ways of assessing replicability, including whether the replication was also statistically significant. They found that while 97 percent of the original studies reported significant results, only 36 percent of the replications did. In short, finding a statistically significant result does not mean the effect exists. Failing to build replication into the research process puts us at risk of accepting results that are simply not true.
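To see how quickly uncorrected exploratory testing produces spurious “findings,” here is a short simulation sketch in Python (our own illustration; the scenario and sample sizes are invented). Every test compares two samples drawn from the same population, so every “significant” result is a false positive – and with 500 tests at the .05 level we should expect roughly 25 of them.

```python
# Hypothetical illustration of false positives from mindless data mining.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_tests = 500   # e.g., a day of exploratory subgroup comparisons
alpha = 0.05
false_positives = 0

for _ in range(n_tests):
    # Both "segments" come from the same population, so no real difference exists.
    males = rng.normal(loc=3.5, scale=1.0, size=100)
    females = rng.normal(loc=3.5, scale=1.0, size=100)
    if stats.ttest_ind(males, females).pvalue <= alpha:
        false_positives += 1

# On average this prints a number near alpha * n_tests = 25.
print(f"{false_positives} of {n_tests} tests were 'significant' by chance alone")
```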
Statistically significant but not practically significant
To some, a statistically significant result is deemed both important and true. However, not everything that is statistically significant is practically important. Although practical significance is rarely given the same weight, it is arguably more important than statistical significance: with a big enough sample size, and enough statistical power, it is easy to find significant differences in the smallest deviations. In a world of big data, we recommend paying close attention to whether results have practical significance.
To understand practical significance, think about an ad’s effect on purchase intent. An ad can have a small effect, bumping purchase intent up by a negligible amount. Or, at the other end of the spectrum, it can have a large effect, perhaps shifting a respondent from a one (“I would not consider purchasing this product at all”) to a seven (“I would definitely consider purchasing this product”). Market researchers can quantify this with an effect size, the most common measure of practical significance.
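As a rough sketch of the distinction, the Python example below (our own illustration; the lift, scale and sample sizes are invented) simulates an ad that raises mean purchase intent by only 0.05 points on a seven-point scale. With 50,000 respondents per cell the difference comes out statistically significant, yet Cohen’s d, a common effect-size measure, shows the effect is practically negligible.

```python
# Hypothetical illustration: statistically significant but practically trivial.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Purchase intent on a 1-7 scale; the "ad" lifts the mean by just 0.05 points.
control = rng.normal(loc=4.00, scale=1.5, size=50_000).clip(1, 7)
exposed = rng.normal(loc=4.05, scale=1.5, size=50_000).clip(1, 7)

t_stat, p_value = stats.ttest_ind(exposed, control)

# Cohen's d: the mean difference scaled by the pooled standard deviation.
pooled_sd = np.sqrt((control.var(ddof=1) + exposed.var(ddof=1)) / 2)
cohens_d = (exposed.mean() - control.mean()) / pooled_sd

print(f"p-value: {p_value:.2g}")      # typically far below .05 at this sample size
print(f"Cohen's d: {cohens_d:.3f}")   # around 0.03, a negligible effect in practice
```

By Cohen’s widely cited rules of thumb, a d of 0.2 is already considered small, so an effect of this size would be hard to defend as practically meaningful no matter how tiny the p-value.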
Practices to avoid
A number of articles list other, smaller forms of scientific misconduct. For example, Leslie K. John, George Loewenstein and Drazen Prelec recapped a list of practices researchers should avoid, published by the Association for Psychological Science. These include:
- failing to report all of the dependent measures in a study;
- deciding whether or not to collect more data after looking to see if you have significant results;
- failing to report aspects of the research design which could shed light on the validity or reliability of findings;
- stopping data collection because you found the result you were looking for;
- rounding p-values down to .05 (e.g., rounding .054 to .05);
- only reporting studies with significant results;
- deciding whether or not to exclude data only after you’ve looked at their effects on the significance of the finding;
- reporting unexpected findings as if you had predicted them all along;
- claiming results are unaffected by demographic (or other) variables without actually testing the claim; and
- falsifying data.
In part two we will provide a set of best practices for MR design, documentation, analysis and reporting.