Editor’s note: Stefan Althoff is marketing research manager at Lufthansa Technik, Hamburg, Germany, a subsidiary of Lufthansa German Airlines.

The author would like to thank the following colleagues for their input, thoughts, suggestions and cooperation: Udo Dumke, senior consultant at topcom, Hamburg; Bill MacElroy, president of Socratic Technologies, San Francisco; and Jeffrey Kerr, senior vice president of Socratic Technologies, Chicago.

In July 2006 I made a presentation at the 13th PUMa conference (Plenum of Business Market Researchers; produced by the German market research journal planung & analyse) in Frankfurt. The presentation was related to international market research and the discussion afterward revolved around the topic of scaling and the analytical challenges that can result. This was the starting point for the revival of an old idea.

I was planning an employee survey that had some difficult challenges, so I contacted Udo Dumke, a senior consultant at topcom, a consulting agency in Hamburg, Germany, with whom I had already corresponded some 10 years earlier about a very old method: magnitude estimation scaling (MES). We had long discussed how this technique could help with certain difficult question response conditions, and I thought that this might be the right time to try it out.

The survey was about possible improvements to the intranet site of the Lufthansa Technik (LHT) marketing and sales department. To evaluate an intranet site seemed an ideal application for MES.

Topcom already had considerable experience with concept evaluation and the use of MES analysis in similar offline studies. The company offers a tabulation program with which MES studies can be evaluated.

I had some previous experience in designing and programming online studies. Lufthansa holds a company license for the German software Globalpark, a do-it-yourself package for producing our own internal and external surveys. Our experience has been that online surveys are well-suited to employee surveys. And more importantly, the LHT employees find online surveys to be familiar and acceptable.

So, after a few conversations, we thought, why not perform an online MES employee survey together, and see if we can get better results? In August 2006 the first work on the joint project began.

Understand context

Although MES is not the exact topic of this article, it is necessary to explain the method in order to understand the context of the whole project and where we netted out.

MES goes back to research work on auditory perception by American psychologist S.S. Stevens. Stevens stated that physical sensations are felt subjectively. For example: A doubling of loudness does not mean that a sound is perceived as being twice as loud. To measure each individual’s perception of loudness, Stevens’ respondents worked with levers (similar to the brake levers of bicycles). The louder the sound, the tighter the respondent would squeeze. With this variable measurement device, Stevens could solve the problem of rating scales being used differently for the same stimulus.

Stevens used the MES method for other physical sensations as well (e.g., for the taste perception of salt) and also applied his results to market research problems. But the procedure never received much attention because of the complexity of its execution.

Without going into too much detail, the research application of MES is a convention for transforming classical five- or seven-point interval scales into relative estimation measurements. This is desirable because otherwise it would not be possible to calculate means and make other comparisons of the results.

Topcom found that by using MES it is possible to detect slight differences between products or concepts - which is exactly the problem we were facing with the intranet evaluation. The company uses a simplified version of MES, which is applied like this:

1. An initial ad or concept is presented to the survey participant for consideration.

2. Next the respondent has to evaluate the concept or ad on a series of attributes, but instead of using a five- or seven-point scale the participant provides ratings using a value along their own individualized scale (a whole number value greater than zero, open-ended).

3. Then the participant indicates whether he or she likes or dislikes the ad or concept with regard to the attribute being discussed (likes/dislikes).

4. After this, the next concept is presented, and the respondent evaluates it in the same way.

This technique allows the respondent - if the individual’s scale choice is not too restrictive - to adjust his or her ratings upward and downward with a very fine level of precision. After all of the evaluations are collected, the data analysis for all items involves computing the arithmetic mean for each individual’s scale and the likes and dislikes in percent.
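
To make this concrete, here is a minimal sketch in Python - not topcom’s tabulation program, and the indexing step is an assumption on my part - of one common way such open-ended ratings can be made comparable: each rating is divided by the respondent’s own mean, so every personal scale is centered on 1.0 before the items are averaged across respondents.

from statistics import mean

# Hypothetical raw data: respondent -> {item: rating on that person's own open-ended scale}
raw = {
    "resp_1": {"layout": 20, "navigation": 40, "content": 30},
    "resp_2": {"layout": 3, "navigation": 9, "content": 6},
}

# Assumption: index each rating against the respondent's personal mean; the article
# only states that the arithmetic mean of each individual scale is constructed.
normalized = {}
for resp, ratings in raw.items():
    personal_mean = mean(ratings.values())
    normalized[resp] = {item: value / personal_mean for item, value in ratings.items()}

# Aggregate: arithmetic mean of the normalized ratings per item across respondents.
for item in raw["resp_1"]:
    print(item, round(mean(normalized[r][item] for r in normalized), 2))

On a scale indexed this way, a value above 1.0 simply means the aspect was rated above that respondent’s personal average.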

MES works best if you start each individual with a benchmark exercise. An evaluation of the LHT intranet homepage (not the key area for the evaluation) was selected for this task.

So far as our project team knew, MES had never been integrated into an online survey in Germany before. Because this was an important project for management, and because we weren’t completely confident that the technique would yield the results we were looking for, we decided to run a split-sample test, with half of the survey sample taking the survey in a traditional way, using a seven-point scale, and the other half using MES.

Interesting artifact

By starting with the LHT homepage as a benchmark within our MES test design, we discovered an interesting artifact of the MES technique. It turns out that it is necessary for the respondent to see the ratings they gave to each previous concept evaluation. Only then is maximum differentiation between subtle differences achieved, resulting in a more precise evaluation. It also helps to maintain control between stimuli by creating a constant basis for comparison between items, one of which has already been evaluated.

Again, as far as the project team knew, this too had not been used in online surveys, although the power of the Web makes such applications easy. We felt that customer expectations and the degree to which expectations were being fulfilled (common elements in most customer satisfaction studies) would also be measurable using this technique.

So as a part of the MES experiment, we looked at this new, exciting innovation and dubbed it previous rating displayed (PRD). The team formulated the following hypotheses regarding its use:

  • PRD would increase the comfort of respondents taking online surveys;
  • using PRD, the scatter of the ratings should be lower.

In this study, the previous rating was displayed below the items for the intranet site aspect currently being evaluated.

We eventually wound up creating three survey groups, largely because a pre-test with LHT trainees left us with the feeling that acceptance of the MES method (the free-form ratings) could be in trouble. So rather than put too many eggs in the MES basket, we decided to hedge our bets and create a third group that we could use to isolate the effects of PRD. The splits were as follows:

  • 40 percent magnitude estimation scaling with PRD built in (MES group).
  • 30 percent for a group with a seven-point rating scale, with PRD in addition (PRD group).
  • Finally, one group - 30 percent - with a seven-point rating scale without PRD, our so-called classical group.

This survey finally became the biggest online survey ever conducted internally at LHT, with more than 4,600 invited employees.

More compelling

While the results from the tested intranet pages using MES in an online survey were interesting, the PRD results were even more compelling.

In order to judge our hypotheses, the participants were asked about their perceptions of the PRD technique:

  • 56 percent of all participants within the PRD split mentioned that the display of the previous rating was helpful.
  • 32 percent of all participants of the non-PRD split mentioned that the display of the previous rating would have been helpful.
  • Within both splits only a few participants went back to the previous page to correct the ratings.

So our first hypothesis was confirmed: PRD increases comfort with an online survey. But what about the scatter of the data?

After several statistical tests - 62 F-tests on variances for independent samples (a minimal version of such a test is sketched after this list) - the team found that the PRD technique yielded only a slight improvement in the consistency of the ratings:

  • Seven instances were significantly different in the PRD group at the 95-percent confidence level.
  • Three instances were significantly different in the classical group at the same level.
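
For readers who want to reproduce this kind of check, here is a minimal sketch in Python of a variance-ratio F-test for two independent samples, of the kind run 62 times here (once per rated item). The data are hypothetical and this is not the team’s actual analysis code.

import numpy as np
from scipy import stats

def variance_f_test(a, b):
    # Two-sided F-test for equality of variances of two independent samples.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    f = a.var(ddof=1) / b.var(ddof=1)        # ratio of the sample variances
    dfn, dfd = len(a) - 1, len(b) - 1        # degrees of freedom
    p_one_sided = stats.f.sf(f, dfn, dfd) if f > 1 else stats.f.cdf(f, dfn, dfd)
    return f, min(2 * p_one_sided, 1.0)      # two-sided p-value

# Hypothetical ratings of one item from the PRD group and the classical group.
prd_group = [5, 6, 5, 6, 5, 6, 5, 5, 6, 5]
classical_group = [3, 7, 4, 6, 2, 7, 5, 6, 3, 7]

f_stat, p_value = variance_f_test(prd_group, classical_group)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")  # p below 0.05 -> variances differ at the 95 percent level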

This was in October 2006. Even though the results were clear, I was still concerned that the results of this one-off experiment might have been merely coincidental. Dumke and I had some discussions about the findings. Even after several checks we were not sure that we hadn’t made any analysis mistakes.

Research-on-research

In November 2006, I talked about PRD with Bill MacElroy of Socratic Technologies in San Francisco at the IIR European Research Event in London. MacElroy was interested in our findings, and he promised to continue the research-on-research work when he could find a suitable survey.

In February 2007 Jeffrey Kerr, senior vice president of Socratic Technologies in Chicago, integrated a methodical test of PRD within an online beverage survey (see Figures 1 and 2 for screen shots).

For this concept test we used Socratic Technologies’ online panel in the U.S. and completed 362 surveys with people who had been screened for beverage usage.

  • Concept evaluation with PRD: 181 test persons (= 50 percent).
  • Concept evaluation without PRD: 181 test persons (= 50 percent).

The participants had to evaluate three different beverage concepts. As before, when the PRD group was evaluating the second concept, the ratings of the first concept were presented. But there were two important differences:

  • Within the LHT online survey the participants evaluated the intranet site aspect by aspect, whereas in the Socratic survey the panelists did concept-by-concept ratings. Within the latter design the usage of PRD is much more obvious to the respondent.
  • The Socratic programming was much more sophisticated: Participants had the chance to correct their ratings by clicking on a rating, which opened a pop-up. If a participant used this feature, the corrected rating was displayed after the pop-up was closed.

In this survey as well, participants were asked whether the display of the previous rating was helpful. The results were very similar to those of the German survey.

  • 53 percent of all participants within the PRD split mentioned that the display of the previous rating was helpful.
  • 33 percent of all participants of the non-PRD split mentioned that the display of the previous rating would have been helpful.
  • Only a few participants corrected their ratings.

After data analysis, Kerr stated that PRD does not result in findings which are significantly different from the classic method, but the degree to which it made comparisons of stimuli with very similar characteristics possible was a definite plus. We are continuing to study the sub-samples for further differences and to see if certain groups respond to the technique better than others.

Increases comfort

We were able to show that the PRD technique increases respondent comfort with online surveys when small differences between stimuli are involved: For more than half of all participants in both surveys, the display of the previous rating was helpful.

It produces the same general results: the PRD technique does not influence the relative concept ratings themselves, since the findings with PRD are equivalent to those without PRD.

But it is also slightly more difficult to execute: the PRD technique requires more programming effort.

So our overall conclusion is that PRD is a relatively easy technique. Given the general tendency of declining participation in online surveys, this technique could be one method of boosting cooperation levels - particularly if the rating tasks are complex or difficult.

But there are still some open questions:

  • How does the PRD technique work with longer item lists?
  • Is it possible to use PRD in combination with a sophisticated random rotation algorithm for the item lists?

The last point is important. In both surveys it was not possible to rotate the attributes. This was a result of some restrictions due to the hierarchy of choice. If the items on the previous page and the comparison page rotated in different ways, it would be very confusing for the respondents. This means that further basic research about PRD variations is necessary.

But my overall conclusion is straightforward: it works!