Editor's note: Pete Cape is director, global knowledge in the London office of research firm SSI.

So you get your results back. The Japanese survey rates your client’s hotel chain an 8.34 out of 10 on cleanliness. Your American survey reveals a 7.46. The difference is statistically significant. What do you tell your client? Are the Japanese employees better at keeping the hotel clean than the American ones? It seems so from the data but there is something nagging in the back of your mind: Do Japanese respondents always rate more highly?

If that were the case wouldn’t it be great if there were some simple way of converting a Japanese rating into an American one? Well, it would be great but unfortunately it is neither simple nor easy. If it were, we would all be doing it!

This article outlines the challenges and suggests some directions toward practical solutions. Given the importance of scales in questionnaire design and the increase in multicountry research, we hope to provide a clear understanding of the issues as a foundation for research that will result in more powerful and reliable solutions than we have today.

There are mathematical ways of dealing with the issue but they are not without their own problems. (We address some of them in the sidebar on normalization and standardization but you might want to read on first.) Why is it neither simple nor easy? Because it is a multifaceted problem, with even more facets than you might think.

If we look at the way the problem is presented there seems to be only one issue at stake: Do Japanese consistently rate the same thing more highly than Americans? And with two variables we have four possible outcomes:

1. Japanese and Americans rate the same way; the hotels are different.
2. Japanese and Americans rate the same way; the hotels are the same.
3. Japanese and Americans rate differently; the hotels are different.
4. Japanese and Americans rate differently; the hotels are the same.

The data would seem to suggest that Option 2 cannot be true. But recall what a test of statistical significance actually tells us: At the 95 percent confidence level we conclude that the numbers are different (i.e., Option 1), while accepting a 5 percent risk of seeing a difference this large when in truth there is none (i.e., Option 2).
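To make the significance test concrete, here is a minimal sketch of a two-sided test on the two means, using a normal approximation. The means (8.34 and 7.46) come from the example above; the standard deviations and sample sizes are purely hypothetical, as is the function name `two_sample_z_test`.

```python
import math

def two_sample_z_test(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Two-sided test of a difference in means (normal approximation).

    Returns the z statistic and the p-value, i.e. the probability of a
    difference at least this large if the true means were equal.
    """
    # Standard error of the difference, allowing unequal variances
    se = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    z = (mean_a - mean_b) / se
    # Two-sided tail probability of the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Means are from the article; SDs and sample sizes are assumed for illustration
z, p_value = two_sample_z_test(8.34, 1.5, 400, 7.46, 1.8, 400)
significant = p_value < 0.05
```

Note that a small p-value only addresses sample error. It says nothing about whether the two groups use the scale in the same way.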

Should either Option 3 or Option 4 be true, that may lead to different courses of action by the business. And therein lies our dilemma and our big unknown.

But is it our only unknown? There are other dimensions to consider.

Cultural meaning in scales. It is common in market research to use numeric scales to represent degrees of difference. It seems intuitively obvious that this type of scale does what we want a scale to do. The spaces between the points are equal (just as we would like a verbal scale to be) and the progression of the numbers from low to high suggests an improving picture from bad to good. To aid the respondent we often anchor a scale with words to convey what it means. Or rather, what it is supposed to mean, for numeric scales used in this way are not equivalent to centimeters or inches on a ruler. The meaning of numeric ratings is steeped in our cultures. For most of us it begins at school. In the U.S. the grade point average runs from 0 to 4, although of course children receive their actual grades as letters from A to F. This is by no means universal. In France the scoring is 0-20, in Italy 1-10, in Russia 1-5. Now, put some pressure on someone to make decisions repeatedly and quickly and they will eventually revert to heuristics. This is the reality of data collection today. We ask people to respond to multiple items in a grid very quickly. So a 7 in Italy is good; in France it is not so good. Take this to another extreme and consider that in Germany students are graded from 6 to 1, with 1 being the highest! Education scoring also sheds light on the meaning of the spaces between the numbers. They are not equal. Take this example from France:

16-20: very good (très bien)
14-15.9: good (bien)
12-13.9: satisfactory (assez bien)
10-11.9: tolerable (passable)
0-9.9: fail (insuffisant)

Both the top and the bottom brackets are wider than the middle three and the “good” grades start at 70 percent of the scale (14 out of 20). In the U.S. one might expect good grades to start at around 80 percent.
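The unequal spacing is easy to check. The sketch below encodes the French brackets listed above and computes each bracket's width and the point on the scale where “good” begins. The names `FRENCH_BANDS`, `french_label` and `band_widths` are ours, chosen for illustration only.

```python
SCALE_MAX = 20.0

# (lower bound, label) for each bracket, highest first, as listed above
FRENCH_BANDS = [
    (16.0, "very good (très bien)"),
    (14.0, "good (bien)"),
    (12.0, "satisfactory (assez bien)"),
    (10.0, "tolerable (passable)"),
    (0.0, "fail (insuffisant)"),
]

def french_label(score):
    """Return the verbal label for a score on the 0-20 scale."""
    if not 0.0 <= score <= SCALE_MAX:
        raise ValueError("score must be between 0 and 20")
    for lower, label in FRENCH_BANDS:
        if score >= lower:
            return label

def band_widths():
    """Width of each bracket, in scale points, highest bracket first."""
    widths = []
    upper = SCALE_MAX
    for lower, _label in FRENCH_BANDS:
        widths.append(upper - lower)
        upper = lower
    return widths

widths = band_widths()               # top and bottom brackets are the widest
good_starts_at = 14.0 / SCALE_MAX    # share of the scale where "good" begins
```

A respondent who carries this mental map into a survey grid is unlikely to treat the distance from 10 to 12 as the same as the distance from 0 to 2.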

Problems are fewer with verbal scales but there are, of course, issues in translation. Perhaps assez bien in French is better (or worse) than the translation “satisfactory” is in English? Note also that the French scale stops at “very good” where a market research scale may go to “extremely good” – Google Translate makes no distinction between these two phrases in translation into French – both are rendered as très bien.

Cultural meaning in items. We must also concern ourselves with the relative meaning of the items that are being rated, their meaning within the culture. While it is outside the scope of this article, it is worth thinking about the relative importance and resonance of “cleanliness” (in our example) in the two cultures. It is not only in the behavioral sciences that the risk of generalizing from WEIRD (Western, educated, industrialized, rich, democratic) samples to human populations exists. Researchers (and marketers) also tend to be WEIRD and will happily write questionnaires from their own perspective. Importance and saliency are probably the two most valuable questions that are routinely left out of surveys. If we have this data we can use it to up- or down-weight how much attention we pay to the item differences we observe.

More problems

So, to our first two problems – Do they always rate more highly? Is the data the truth? – we can also add: Do the scales mean the same thing? And: Is the item being measured in any way meaningful? Just taking the scale meaning alone increases our set of possible outcomes from four to eight.

And there is one more issue that is often ignored: the question of expectations. These are also culturally driven. If my expectation of cleanliness is low then a “somewhat good” level of cleanliness (from Culture A’s perspective) is going to look “extremely good” from mine. So exactly the same level of absolute cleanliness would be rated differently from the two cultures’ perspectives. This issue is solved in part by considering service “gaps” rather than absolutes; that is, the gap between expectation and delivery. Couple this with a relative importance measure and you have a powerful tool to allocate resources and measure progress (see the sidebar “Gaps not ratings”).
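As a rough illustration of the gap idea, the sketch below scores each item as delivery minus expectation and weights it by relative importance. This is one plausible reading of the approach rather than a prescribed formula. The class `ItemScore`, the function `prioritize` and all the numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ItemScore:
    item: str
    expectation: float  # expected level, on the same scale as delivery
    delivery: float     # perceived level actually delivered
    importance: float   # relative importance weight (assumed to sum to 1)

    @property
    def gap(self):
        # Negative gap: delivery falls short of expectation
        return self.delivery - self.expectation

    @property
    def weighted_gap(self):
        return self.importance * self.gap

def prioritize(scores):
    """Order items so the largest importance-weighted shortfall comes first."""
    return sorted(scores, key=lambda s: s.weighted_gap)

# Hypothetical ratings for one country
scores = [
    ItemScore("cleanliness", expectation=9.0, delivery=8.0, importance=0.5),
    ItemScore("check-in speed", expectation=7.0, delivery=6.5, importance=0.25),
    ItemScore("breakfast", expectation=6.0, delivery=7.0, importance=0.25),
]
ranked = [s.item for s in prioritize(scores)]
```

Because both expectation and delivery are rated by the same respondents, a general tendency to rate high or low affects both and partly cancels in the gap.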

Adding in the issue of expectation versus service delivery takes our set of possible real outcomes from eight to 16!
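The doubling is simple combinatorics: each additional yes/no unknown multiplies the number of possible states of the world by two. A short sketch, with labels of our own choosing for the unknowns:

```python
from itertools import product

# Each unknown is treated as a simple two-way split (an assumption of this sketch)
UNKNOWNS = {
    "rating style": ("same", "different"),
    "hotels": ("same", "different"),
    "scale meaning": ("same", "different"),
    "expectations": ("same", "different"),
}

def possible_outcomes(names):
    """Every combination of states for the named unknowns."""
    return list(product(*(UNKNOWNS[name] for name in names)))

all_names = list(UNKNOWNS)
# Number of outcomes with the first two, three and four unknowns in play
counts = [len(possible_outcomes(all_names[:k])) for k in (2, 3, 4)]
```

With the first two unknowns this reproduces the four options listed earlier; adding scale meaning and then expectations gives eight and 16.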

It is no wonder that the hard-pressed researcher falls back on considering only one problem – that of sample error. At least he has theory, formulae and a calculable probability of being correct; and is safe in the knowledge that the real truth, in opinion research, can never really be known.

Identify variables

Once we understand the dimensions of the issue it is obvious how the cultural effect of scale usage should be measured and that it cannot simply be done through observation of survey data. Firstly we need to identify some variables or dimensions that are culturally neutral. They should be something that we all, the world over, agree on as an essential human truth that we should all rate in the same way. Then we need to find some measuring stick (our scale) that we can agree has the same meaning at each of its points. Any observed difference between cultures on their ratings must then be due to cultural bias in response (plus sample error; and we have a paradigm for dealing with sample error). The size of that difference gives us a calibration factor. Applying the calibration factor to any given rating will adjust the scores, making them equivalent.
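The procedure can be sketched in a few lines, under some strong assumptions: that suitable culturally neutral anchor items exist, that the bias is a simple additive shift and that the scale runs from 0 to 10. None of these is established here, and the function names and anchor ratings below are invented for illustration.

```python
from statistics import mean

def calibration_offset(anchor_reference, anchor_target):
    """Estimate response bias from culturally neutral anchor items.

    Each argument is a list of mean ratings, one per anchor item, in the
    same item order. The offset is reference minus target, so adding it
    to a target-country rating expresses it in reference-country terms.
    """
    return mean(anchor_reference) - mean(anchor_target)

def calibrate(rating, offset, low=0.0, high=10.0):
    """Apply an additive offset and keep the result on the scale."""
    return min(high, max(low, rating + offset))

# Hypothetical mean ratings on three anchor items
us_anchors = [6.0, 7.0, 5.0]
japan_anchors = [6.5, 7.5, 5.5]

offset = calibration_offset(us_anchors, japan_anchors)
japan_cleanliness_in_us_terms = calibrate(8.34, offset)
```

An additive shift is the simplest possible form. If cultures also differ in how widely they spread their ratings, a multiplicative term would be needed as well, and the estimate of the offset carries its own sample error.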

While easily stated, it is not a simple matter to find either the scale or the culturally neutral items. One thing is for sure: With the plethora of scales in use today and the modern trend for questionnaire design without any pre-survey qualitative work, the chances that you have hit on a culturally neutral item and that your scale is cross-culturally valid must be small. We need to look outside our own market research frame of reference to find these cross-cultural items that ought to be responded to equally. The worlds of social values and moral psychology may be fertile ground for us.

Be less sure

If researchers work with cross-culturally-validated scales and can take the time to ensure all their items are equally salient then the probabilities work in their favor that what they observe (given a confidence interval) is actually the truth. But they should be less sure than they think they are: Their possibility of being wrong is higher than their statistical test suggests.

If we knew for sure how to calibrate our scales for any cultural bias in usage then we would be in a better position to advise companies where to concentrate their scarce resources. And that would be good for market research the world over.

REFERENCES
1. Joseph Henrich, Steven J. Heine and Ara Norenzayan. “The weirdest people in the world?” Behavioral and Brain Sciences. Vol. 33. Issue 2-3. June 2010. pp. 61-83.