Editor's note: The following article is taken from the book "Handbook of Package Design Research," edited by Walter Stern. Copyright 1981 John Wiley & Sons, Inc. Reprinted by permission of the publisher. Donald Morich is principal of Consumer and Professional Research, Inc.

A consumer package normally goes through many tests before it gets out of the design laboratory and onto the production line. Much time, energy, and money are spent in completing routine tests to ensure the package physically works for the product (i.e., it holds the exact amount of product required, it protects the product up to acceptable standards, it withstands certain strains and stresses, and it dispenses the product without difficulty). However, once the label is fixed to the package, further testing is often bypassed—the package is technically satisfactory and it looks "finished."

It's the unusual company that takes the next step and consumer tests its finished or final stage packages. More often, a label design is judgmentally selected and the product/package is produced in quantity. Reasons generally cited for skipping the consumer test phase fall into one of these categories:

  • There's no time. The people who have created the package have finished their job and the marketing group is anxious to get it to the marketplace.
  • There's no money. Thousands and thousands have already been spent on package/label development.
  • The package will work. It's mechanically well-constructed and the label was designed to tight specifications concerning the precise way in which the product will be positioned to the consumer.
  • Label design is a creative endeavor. The design will influence consumers both consciously and unconsciously and therefore its total effect will be impossible to measure.

There is, of course, truth in all of these statements. But to use them as excuses for not undertaking a consumer research program is to underestimate the contribution a well-planned and well-executed consumer test can make to the package/label design development process. This final, consumer evaluation step should be part of this process, not an adjunct to it.

At the "moment of truth," that point at which the consumer finally decides whether or not to purchase the product, the package is the summation of the communications efforts the company has made on behalf of the product. It's at this point of sale that the package must accomplish several important things:

  1. It must tell the prospective buyer what the product is called; it must communicate the brand name quickly and clearly.
  2. It must tell the prospective buyer what the product is and what it does—the person looking at the package should be able to note quickly what type of product it is and what it's for (a non-aerosol spray antiperspirant that keeps you dry). It's also important to communicate clearly the ways in which this product is different from/better than others the prospective customer may be considering—in other words, the package must "sell" itself to the consumer. These objectives are often accomplished by graphic treatment/design elements as well as words.
  3. It must inform the prospective buyer how nutritious the product is, what ingredients it contains, and, in some cases, what each ingredient's purpose is.
  4. The package carries the Universal Product Code, which tells the consumer nothing, and a price sticker or stamp, which tells consumers what they really want to know.

Thus, as a communications vehicle, the package/design must transmit a tremendous amount of information about the product to the prospective buyer, some of it spelled out in copious detail, some of it implied, but all of it consistent with the basic product positioning/level of performance built into the product and its marketing strategy.

At the same time, there are some constraints on package design development. A package is a means for consumers to recognize the products they know and use. One of the axioms in package design is that a new or revised package should not be drastically different from the label/package that consumers are familiar with. This is not to say packages should not be changed, rather that they should be modified on a gradual, step-by-step basis.

Quaker Oats oatmeal has had a number of significant package graphics changes over the years, yet each one maintained the integrity of the previous design—a design that consumers knew well and could easily recognize in the store. Even on those occasions when the product itself has undergone significant reformulation and the marketer is anxious to have consumers recognize they have a new reason to try the brand, a drastic design change is a questionable business tactic. It seems that consumers are willing to accept reformulation of the product as an important improvement, but they consider package/label design changes unnecessary.

Also, certain package configurations and/or colors are traditionally acceptable for certain products. Catsup, for example, illustrates this point. Virtually everyone instantly recognizes the shape of a catsup package, and in controlled use-test situations nearly everyone will agree that this type of container is extremely difficult to use. But the package configuration is traditional, it is catsup, and any other shape simply would not be suitable for the product.

It's generally agreed that the package is an extremely important element in the marketing mix for most consumer products. It has multiple functions to perform and, therefore, no single consumer research measure can do an acceptable job of evaluating the total effectiveness of a consumer package. There are at least three measures that are important considerations in evaluating consumer package/label designs:

  • Impact. How intrusive is the package/label design; does the package have the ability to "jump off" the shelf and be recognized quickly?
  • Image. What does the package communicate to consumers; what kinds of impressions and/or perceptions does the package create?
  • Preference. Does the consumer like the package; is it aesthetically pleasing to the prospective customer?

In order to be maximally effective, these three measures must be in balance. They must work in concert to produce the kind of impact that's necessary for the brand to be noticed (quickly) on the store shelf without sacrificing those design elements that help communicate the kinds of positive product impressions (image) that are important to the success of the brand. It is entirely possible to have a tremendous amount of intrusiveness, to communicate certain product characteristics very effectively, or to have people generally agree that the package is attractive. However, if any of these are accomplished at the expense of the others, the wrong balance has been struck.

It's easy to get the wrong balance. For example, it's possible to achieve a high degree of impact using certain design devices—fluorescent ink colors or unusual package configurations will invariably produce high recognition. However, the use of these overt attention-getting devices may distract the consumer from the package's communications effort in other directions. Bright colors often are interpreted as harsh or too strong, and unique package shapes could prove difficult for consumers to handle. Or, as is the case with catsup, a new, different package shape might not be acceptable to a large proportion of consumers. Conversely, pastel colors and swirly, curly graphics often are interpreted as mild or gentle, but could well be so soft that the package would fail to achieve impact at the point of sale. Finally, simple consumer preference, often based on an individual's assessment of the attractiveness of the design, may have little or no relationship to which package the consumer will purchase—they want the product inside, not the package. They recognize this and so must we.

The effective package design, then, is the one that is most successful in integrating these three properties—impact, image, and preference. The consumer research methodology must therefore be designed to measure these same properties effectively. We have developed a research methodology that measures consumer response to test package/label designs in terms of these three decision criteria.

This system is an experiment in the sense that the environment in which the package alternatives are studied is one in which the variables (stimuli) can be controlled and manipulated. The respondent is not in his or her normal context (a supermarket) but in a tightly managed interviewing station or facility. Also, s/he is often asked to react to stimuli that represent the actual package(s), for example, 35mm slides or photographs, rather than the packages themselves. This makes it possible to isolate and control key variables and to study the consumers' response to these variables. Since all package alternatives included in the test are exposed to this identical treatment, the absence of a real-life environment affects all packages equally relative to the others in the test scheme. By studying respondents' reactions as the test variables are manipulated, it's possible to make judgments concerning the extent to which these variables affect a respondent's perceptions of the packages being studied.

Impact measure

The first measure in this consumer research system is an impact measure. How intrusive is the package design? What elements of the design do consumers notice? How quickly do they notice certain package/label features? This is the kind of information developed in the initial section of the research methodology. An impact measure is the first step in the research, not because it is the most important element to consider, but because impact, in order to be correctly measured, must have a clean or uncontaminated exposure. Visual or verbal cues of any type have a tremendous influence on this measure, so much so that results tainted by prior cues are virtually meaningless.

A tachistoscope (T-scope) is used to measure the impact of salient visual elements of each test package. It is a device that precisely controls the amount of time a stimulus (package) is exposed to the respondent. By strictly controlling both the number of exposures and the length of time of each, the packages in the test are ensured equal exposure time. Thus any differences in the time required by respondents to pick out salient print or graphic elements are ascribed to the test variable, the package itself.

The T-scope used in this portion of the research is electronically controlled for increased accuracy and ease of operation. Each exposure interval is clearly marked and the control knob has a definite click stop point for each interval. Once the instrument is set up it can be operated even in total darkness.

The T-scope exposes the test package to the respondent a total of eight times. The first flash exposure is at a speed of 1/150 second, and each succeeding exposure is lengthened until the maximum exposure of 2 seconds is reached. The T-scope is set at the following speeds:

Exposure 1: 1/150 second
Exposure 2: 1/100 second
Exposure 3: 1/50 second
Exposure 4: 1/25 second
Exposure 5: 1/10 second
Exposure 6: 1/2 second
Exposure 7: 1 second
Exposure 8: 2 seconds
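
For analysis purposes, this schedule is simply a fixed configuration. The short Python sketch below (the encoding is illustrative, not part of the original test apparatus) lists each interval and shows how much longer it is than the one before:

```python
from fractions import Fraction

# Illustrative encoding of the eight exposure intervals listed above.
SCHEDULE = [Fraction(1, 150), Fraction(1, 100), Fraction(1, 50),
            Fraction(1, 25), Fraction(1, 10), Fraction(1, 2),
            Fraction(1), Fraction(2)]

for i, t in enumerate(SCHEDULE, start=1):
    note = "" if i == 1 else f"  ({float(t / SCHEDULE[i - 2]):.1f}x longer)"
    print(f"Exposure {i}: {t} second(s){note}")
```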

Immediately after each of the eight exposure intervals, the respondent is asked to report exactly what he or she saw. The verbatim comments are recorded, coded, and tabulated to form the basis of this portion of the analysis. The impact measure provides these types of information:

  • Brand name recognition.
  • Product description playback.
  • Product/brand name misidentification.
  • Recognition of symbols/logos.
  • Identification of other salient graphics.

It's possible to calculate average recognition times or mean scores for each of the above points, and it's also possible to display the test results cumulatively to determine the pattern that respondents follow in viewing the test package(s).
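
As an illustration of how these tabulations might be computed, the sketch below assumes each respondent's coded record stores the exposure number (1 through 8) at which the brand name was first correctly identified, or None if it never was; the coding convention and the sample data are hypothetical:

```python
# Exposure lengths, in seconds, matching the schedule above.
EXPOSURE_SECONDS = [1/150, 1/100, 1/50, 1/25, 1/10, 1/2, 1.0, 2.0]

def mean_recognition_score(first_correct):
    """Average exposure length (seconds) at first correct brand name
    identification. One plausible convention: respondents who never
    identify the brand (None) are excluded from the average."""
    times = [EXPOSURE_SECONDS[i - 1] for i in first_correct if i is not None]
    return sum(times) / len(times)

def cumulative_recognition(first_correct):
    """Cumulative percentage of respondents recognizing the brand name
    by each of the eight exposures."""
    n = len(first_correct)
    return [round(100.0 * sum(1 for i in first_correct
                              if i is not None and i <= k) / n, 1)
            for k in range(1, 9)]

# Hypothetical coded records for ten respondents:
data = [2, 3, 3, 4, 1, 5, 2, None, 3, 4]
print(f"Mean Recognition Score: {mean_recognition_score(data):.3f} s")
print("Cumulative % by exposure:", cumulative_recognition(data))
```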

Table 1 is an example of the brand name impact measure for four different test packages/brands. It shows the percentage of respondents in each test who were able to identify correctly the brand name of the package at each exposure interval. The Mean Recognition Score, the average length of time it took respondents in this test to identify correctly the brand name of the package, is also presented in the table.

The second way to examine results of the T-scope questioning series is to accumulate responses for each successive exposure interval. Table 2 shows the cumulative percentage of respondents who mention various visual elements of the package as the test progresses.

The T-scope device, administered and tabulated in this manner, provides a total recognition profile of a test package, by time period. The analyst can use the results to identify which package elements are noticed more quickly and can also track the path or pattern the respondent is following in his or her attempt to complete this test successfully—that is, to report exactly what he or she sees at each exposure interval.

The technique is especially useful in comparing test package alternatives for two important reasons—it produces consistent results in a test-retest situation and it is able to discriminate among test alternatives. If the same stimulus (package) is tested more than once, the results will, with a high degree of probability, be identical each time. Second, if the test packages are capable of producing different levels of impact or intrusiveness, this testing technique will measure the difference.

Table 3 illustrates the ability of this measurement system to replicate its findings. Brand Name Recognition Scores are shown for two consumer packages, each of which was tested on three different occasions with three different samples of respondents. The brand name recognition score is the average length of time, in seconds, it takes a sample of respondents to identify correctly the brand name of the test package. For Brand M, results differed by only 0.12 seconds in the three tests; for Brand S, the scores differed by only 0.06 seconds in three successive tests. If the same packages are tested, it's likely the results of the impact measure will be the same. The table also indicates that the T-scope test is able to discriminate between different package designs. Brand S was correctly identified about 1/3 second sooner than Brand M (Average Recognition Score of 0.78 versus 1.10).
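
A minimal sketch of this replication check, using hypothetical per-test values patterned on the averages and spreads quoted above (Table 3 itself is not reproduced here):

```python
def replication_spread(scores):
    """Spread (max - min), in seconds, of a brand's mean recognition
    scores across repeated tests; a small spread indicates the impact
    measure replicates well."""
    return max(scores) - min(scores)

# Hypothetical replication scores (averages of 1.10 and 0.78 as cited):
brand_m = [1.04, 1.10, 1.16]
brand_s = [0.75, 0.78, 0.81]
print(f"Brand M spread: {replication_spread(brand_m):.2f} s")
print(f"Brand S spread: {replication_spread(brand_s):.2f} s")
```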

Another illustration of the test's ability to discriminate between packages is shown in Table 4. In Test 1, four packages from the same product category were tested. Results indicate Brand A was correctly identified five times faster than Brands B or C and 13 times faster than Brand D. Test 2a included two packages with identical label graphics but different brand names; Test 2b included two packages with identical brand names but different graphics. Results clearly show significant differences in impact. These test scores also show that both brand name and graphics can influence recognition. With identical graphics, Brand F was correctly identified about twice as quickly as Brand E; yet with identical brand names, Graphic Y performed much better on correct name recognition.

The conditions under which the packages are exposed to respondents are always capable of generating long discussions. There are two basic conditions under which a package may be tested using this T-scope impact measure:

  1. An individual package, by itself, can be studied without reference to a competitive frame. The respondent is exposed to only a single package stimulus.
  2. The test package can be studied as it relates to its competitive frame or environment. It can be shown to the respondent as part of a "typical" shelf array in which the test package is one of several brands/packages displayed and collectively viewed by respondents.

Results of previous tests indicate the more viable testing technique is to deal with a single package exposure rather than a multiple package situation. Exposing the respondent to only a single package generates results that probe the test package in greater depth. In a single package test design, it is likely that information will be volunteered on the attention-getting characteristics of such design elements as:

  • Colors and/or combinations of colors.
  • Unusual or distinctive package shapes.
  • Brand name recognition.
  • Playback of product category description(s).
  • Package graphic elements such as illustrations or photographs.
  • Corporate symbols or logos.

It's also likely that test results from a single package testing technique could more readily identify elements of the package that are being misread or are in some way confusing to the consumers participating in the test.

In a shelf array test situation, respondents are asked to look at a number of different brands as well as a number of packages or shelf facings for each brand. This normally results in a display of 20 or more packages. When asked to concentrate on this type of stimulus, respondents tend to focus their attention on picking out the brand name of each group of packages shown. They will be sure they have correctly identified the brand name on one set of packages, then move their attention to the second set, then on to the third until they are confident they have correctly noted each separate brand in the shelf array. In a practical sense, this procedure leaves respondents little or no time to comment on any other aspects of the packages displayed.

Also, since there are a large number of packages to look at, respondents tend to be a bit cautious about relating what it is they see until they are reasonably sure of themselves—thus their recognition times for correct brand name identification are usually much slower than in the single package stimulus tests. As a result, the dispersion of mean recognition scores is not as great as in the single package test mode. This point is illustrated by the Mean Recognition Scores summarized below.

In the 24 Single Package T-scope tests, the fastest time recorded was just under 1/20 of a second, nearly 20 times faster than the slowest mean recognition time recorded. In the 28 Shelf Array T-scope tests, the fastest time recorded for correct brand name identification was just under 1/2 second, about three times faster than the slowest score of 1 1/2 seconds.

Another point in favor of selecting a single package mode for testing the impact of packages is that this system better handles the built-in bias that consumers have for brand names they know well. Packages that carry familiar brand names tend to achieve much higher recognition scores than packages with brand names that are less well-known by respondents. Thus tests in which some of the packages carry new names will invariably indicate that the established brands achieve faster recognition scores. This will of course occur in both single package format and in shelf array test situations. The single package scheme does have greater ability, however, to discriminate between test packages and is more likely to show which of the new brand name/package alternatives included in the testing scheme has the most impact.

Finally, the most compelling argument in favor of using a single package mode rather than a shelf array exposure is that test results are parallel for both situations. If several different brands are included in a test, the "winner" is likely to be the same brand/package in both situations, the nearest competitor would be in second position in both testing schemes, and the slowest recognition speeds would probably be the same packages. The rank order of test results, in terms of speed of brand name recognition (impact), would probably be the same.

Thus if two test designs do an equally adequate job of measuring the speed of brand name recognition of one test package relative to other test packages within one test series, but one of the testing modes also provides information concerning the impact of other package elements, the methodology that can generate the additional information is the logical choice.

Table 5 shows the similarity of results for four brand names/packages tested under both a single package mode and a shelf array situation. Clearly, Brand A "wins" in both methodologies, Brands B and C are distant seconds and Brand D does the least effective job of gaining correct brand identification.

The next point to consider in completing this type of package design research is the physical form the stimuli (packages) used in the test should take. There are three basic considerations:

  1. In what form is it practical to produce the package/design alternatives for testing purposes?
  2. Will the materials used to execute the test allow for geographical flexibility?
  3. Can the test be easily administered in the field or does it require specialized training and/or equipment to function properly?

A testing methodology that utilizes 35mm slides to represent the test package is usually a good choice. This method is portable, takes little mechanical aptitude to operate, and requires no special equipment other than a T-scope device and a 35mm projector/viewing screen. Most important, 35mm slides can represent accurately the actual test packages in terms of color reproduction and are much less expensive to produce than tight mock-up packages. For example, if a new product package is under consideration, only one of each of the test packages need be produced. That model can be photographed and the test can be completed in several markets simultaneously without fear of destroying the prototype model.

Yet the testing system must be flexible enough to vary from this methodology when it's appropriate. If, for example, one or more of the test packages in the research use fluorescent ink colors, it would not be possible to use 35mm slides because these inks cannot be reproduced adequately in full color photography. In fact, the only way to represent this type of test package accurately would be the test package itself. In one testing situation in which these types of packages were included as alternatives, a shadow box was built, and the T-scope procedure was modified so that the actual package inside the shadow box was illuminated in a controlled fashion.

A final consideration prior to conducting this type of impact/image/preference test relates to sample selection. A package test is not unlike any other form of market research investigation in the sense that sampling methods and selection criteria are important to the successful completion of the study. For the majority of consumer package tests the sample should be structured to include a high percentage of product category users. Establishing a subquota of users of certain brands is also a good plan. It was noted that brand familiarity influences how quickly respondents are able to recognize brand names. Familiarity/experience with the brand also influences a consumer's image perceptions of the product. Thus it is important that this sampling variable be controlled from test cell to test cell. If one-half the respondents in Test Cell A are users of the test brand, one-half the respondents in Test Cell B should also be users. Since the package test is essentially an experimental design, the researcher can tightly control the sampling procedures to be more certain that important differences that are noted in the test results are traceable to the package variations, and not the result of sampling variation.
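
One way to hold the user/non-user split constant from cell to cell is stratified assignment. The sketch below is a hypothetical illustration; the respondent structure and function name are ours, not part of the methodology as published:

```python
import random

def assign_balanced_cells(respondents, n_cells, seed=0):
    """Assign respondents to test cells so that each cell receives the
    same proportion of brand users and non-users (stratified, round-robin
    assignment within each stratum)."""
    rng = random.Random(seed)
    cells = [[] for _ in range(n_cells)]
    for flag in (True, False):
        stratum = [r for r in respondents if r["is_user"] == flag]
        rng.shuffle(stratum)
        for idx, r in enumerate(stratum):
            cells[idx % n_cells].append(r)
    return cells

# Hypothetical sample: 100 respondents, half of them brand users.
people = [{"id": i, "is_user": i < 50} for i in range(100)]
for c, cell in enumerate(assign_balanced_cells(people, n_cells=2), start=1):
    users = sum(r["is_user"] for r in cell)
    print(f"Cell {c}: {len(cell)} respondents, {users} brand users")
```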

The physical place in which the T-scope portion of this testing system is completed is almost always a small conference room setting in a central location interviewing facility. The room is arranged so that the 35mm projector is set up to flash approximately "life size" sequential exposures of the test package on the viewing screen.

Respondents who meet the sampling eligibility requirements and agree to participate in the test are brought into the room and seated about six feet in front of the screen. Participants are told they will see a picture flashed on the screen very rapidly and will be asked to report exactly what they saw each time a picture appears on the screen. If they normally wear glasses or contacts while shopping, they are asked to wear them during this test. The interviewer in the room with the respondent controls the T-scope mechanism and also records each respondent's verbatim commentary after each package exposure.

This type of impact measure is a purely physical measure of how quickly respondents can pick out or notice characteristics of the test package shown to them. Thus mechanical variations in the test setting can greatly influence the test results. For this reason, this portion of the test must be closely controlled and monitored. As much as possible, the physical characteristics of the test environment should be identical for each location or city used in the test. Instructions given the respondents are quite detailed and are read, word for word, to each participant.

In this type of T-scope methodology, respondents begin to "learn" the technique as the subsequent exposures are shown to them. They try to win the game in the sense that the first few exposures have taught them what to look for on the screen. For this reason, exposing any single respondent to more than one test package (or shelf array) in this type of testing system is not an acceptable procedure. To do so would only introduce a bias into the test that could well blur the results. It is much wiser to deal only with respondents who have not been preconditioned to the mechanics of the test procedures.

A series of eight T-scope exposures seems to be the optimal number for this testing procedure. The fastest speed, 1/150 second, is the starting place because it is the point at which a few people can actually pick out certain package elements—speeds quicker than this are just a flash to respondents. The slowest speed, 2 seconds, is long enough for all salient package graphics to be noticed—slowing it down further only results in redundancies. Dividing this 1/150-second to 2-second range into eight intervals provides a reasonably high level of discrimination between the points.

The absolute scores generated via this T-scope methodology are only useful when a number of packages are tested and their scores are directly compared. The emphasis of this measurement is on how well each test package performed relative to others included in this or previous tests. Knowing a test package achieved a Mean Brand Name Recognition Score of 0.39 seconds doesn't mean much unless norms or other directly comparable impact scores are available.

After the respondent has seen the eight exposures of the test stimulus, the T-scope impact measure is complete. Each participant is then asked to move to a second interviewing station, where an entirely different set of questions is administered.

Imagery analysis

The objective of this section of the testing system is to determine what types of images and/or impressions are communicated by the package/label designs being tested. This is done by having each respondent rate the test package on a long series of attributes or dimensions that might be used to describe the package/product. Normally 25–30 of these attribute statements are included in this semantic differential rating scale technique.

These attributes are designed to reflect the opinions consumers may have about the test package/brand. The list may be generated from prior consumer market research studies such as focus group sessions on the brand or product being tested. New product concept studies or product positioning studies are also a useful source for constructing this attribute list. Ultimately, however, brand management and the research analyst have the responsibility to anticipate the consumer's response to the brand/package and to be certain the final list reflects these possibilities as well as those dimensions known to be important to consumers in deciding to buy/not buy the test brand/product. The importance of the attribute list cannot be stressed enough. It forms the basis of the imagery analysis portion of the package research study. If the "right" attributes are not included in the list, the "right" consumer response pattern will never be measured.

The attribute list should cover three broad dimensions:

  1. Product efficacy dimensions. These attributes measure the extent to which respondents believe the product inside the package will live up to performance expectations. Examples of these types of attributes are:
  • Cleans pots and pans without scrubbing.
  • Makes silverware sparkle.
  • Rinses off easily.
  • Especially effective in removing grease.

Other product-related dimensions focus on things such as:

  • Would be economical to use.
  • It's convenient to use.
  • It's a modern product.

  2. Dimensions related to aesthetic assessment of the package:
  • It's an attractive package.
  • The colors are cheerful.
  • It's an eye-catching design.
  • Sprays on easily.

  3. Statements of a self-referral nature that reflect the respondent's personal interest in the product:
  • It's my kind of product.
  • I'd use this product every day.

In this section of the test the actual package makes an ideal stimulus. Respondents can hold it, shake it, and examine it, front and back, straight up and upside down. However, it is often not practical to work with actual packages or package prototypes. In these instances, full-color 8 x 10 inch photographs of the test package can be used. This is an inexpensive means of reproducing original designs. It can be done quickly, and respondents can generally accept photography as a reasonable substitute for the real thing. The only proviso is that the photographs must reflect the original package/label design accurately. Colors must be very close to the original design specifications, and the size of the single package in the photograph should come close to the actual size of the package.

The impact measure (T-scope) portion of this testing system places emphasis on purely physical measurement. The imagery measure emphasizes consumer opinions/perceptions—to what extent does the test package communicate certain attributes or dimensions. In the first instance, sampling error, in a statistical sense, tends to stabilize relatively quickly and a sample size of 50–60 respondents is adequate. The imagery measure requires a large sample base before the test results begin to stabilize. A base of 150 respondents is the minimum sample requirement.

Unlike the impact measure, where each respondent can view only one test package, this section of the test can accommodate up to three package exposures. Previous studies have shown that a consumer's opinions of one package do not strongly influence his or her opinion of a second or third package. Thus if each respondent is asked to rate three packages in the imagery section of the study, it's possible to keep the two sections of the package test in balance in terms of sample base requirements. Reliable test results will still be generated for both the impact and imagery measures. The test design takes this form:

If more than three test packages are included in the test design, rotations can be established so that each alternative gets equal exposure. This approach is preferable to having each respondent rate four or more packages in the imagery section. The latter approach is too time-consuming, and respondents could easily lose interest in the process. If that happens, results are suspect.
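
A simple cyclic rotation accomplishes this equal-exposure requirement. The sketch below is a hypothetical illustration, assuming five test packages and three ratings per respondent; over one full cycle every package is rated equally often and appears once in each rating position:

```python
def build_rotations(n_packages, per_respondent=3):
    """A cyclic (Latin-square-style) rotation plan: over one full cycle,
    every package is rated equally often and appears once in each of the
    rating positions."""
    plans = []
    for start in range(n_packages):
        plan = [(start + offset) % n_packages
                for offset in range(per_respondent)]
        plans.append([f"Package {p + 1}" for p in plan])
    return plans

# Example: five test packages, three ratings per respondent.
for i, plan in enumerate(build_rotations(5), start=1):
    print(f"Rotation {i}: {plan}")
```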

The actual form of the scale administered in the imagery analysis section of the questionnaire is the next consideration that must be addressed. The options are almost limitless. Many companies already have strong preferences about the exact form/wording of rating scales used in their consumer research studies. As a consequence, several scale variations have been used in this section of the study, all with good results. This simply means that consumers understood the scale was designed to give direction and intensity to their opinions, and that these scales were sensitive enough to pick up differences in consumer perceptions between package design alternatives. The five examples above are semantic differential scales that effectively measure consumer attitude in terms of both direction and intensity.

Thus the key considerations in choosing a scale are:

  • Do respondents understand the scale?
  • Is the scale sensitive enough to measure different opinions?
  • Does the company have a track record or norms for the scale?

Once a scale rating device has been selected, the administration of that device in the field is essentially the same for all scales. The interviewer reads the list of statements and has the participant respond with his or her rating for each item on the list. It's a good idea to have the actual scaling device printed on a separate card so the respondent may refer to it as each statement is read. Having the field interviewer administer these scale ratings, as opposed to having them self-administered, has several advantages:

  • It's faster.
  • Rotations and/or starting points can be used.
  • Respondents are less likely to mark the same rating point for all attributes; the presence of an interviewer seems to force participants to be more thoughtful and consider their responses more carefully.
  • Far fewer "No Answers" are recorded.
  • It avoids misinterpretation and misunderstanding.

After the respondent has rated each of the 25–30 attributes for one package alternative, the process is repeated until a maximum of three packages have been rated. Mean scores are then profiled, or compared, among the test alternatives included in the study. Variations in these rating scores are indicative of each package's ability to convey different impressions to those consumers seeing it.

There are occasions in which the use of mean scores will not tell the entire story. In a situation where the distribution of a particular response is skewed (i.e., a normal frequency distribution curve would not properly describe the response pattern of consumers on a particular attribute or statement), the analyst can use a "top-box" score rather than the mean score in reporting and/or analyzing test results. For example, if the scale ratings for many of the attributes included in the study show a bimodal distribution pattern (concentrations of responses at both ends of the scale), a mean score would be misleading. In this instance, reporting the percentage of respondents marking the attributes at the highest scale rating point (the "top-box" score) provides a more accurate picture of the consumers' response to the package being studied.
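
The distinction is easy to demonstrate. In the hypothetical sketch below, responses on a 5-point scale pile up at both ends, so the mean sits near the scale midpoint and masks the split that the top-box score reveals:

```python
def mean_score(ratings):
    return sum(ratings) / len(ratings)

def top_box(ratings, top=5):
    """Percentage of respondents giving the highest scale point."""
    return 100.0 * sum(1 for r in ratings if r == top) / len(ratings)

# Hypothetical bimodal pattern on a 5-point scale: responses concentrate
# at both extremes, so the mean looks deceptively 'average'.
ratings = [1] * 40 + [2] * 5 + [4] * 5 + [5] * 50
print(f"Mean score: {mean_score(ratings):.2f}")  # 3.20 -- near midpoint
print(f"Top-box:    {top_box(ratings):.0f}%")    # 50% chose the top point
```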

The scale rating results developed in this section of the study provide two important pieces of information:

  • They pinpoint those attributes that are highly consistent with the respondents' attitude toward the product/package design. Attributes that show a high degree of agreement are indicative of those properties of the product/package that consumers will accept as truthful and realistic product claims. Conversely, those attributes that are scored low by consumers are dimensions that respondents find hard to believe about the package/product being tested. It would, for example, make little sense to position the product as a high quality or expensive entry if the product's package clearly conveys a low cost/low quality image.
  • They provide a profile of one test package versus the other test packages included in the test. By examining the scores on a side-by-side basis the analyst can identify the areas on which consumer perceptions of the packages differ. The decision as to which package is the best fit for the product positioning has to be made on the basis of a thorough examination and understanding of these profile scores.

There are, of course, no right or wrong answers in interpreting these imagery analysis scores. The data must be analyzed in concert with what the stated objectives of the package test are. What have the packages/labels been designed to communicate? How well does each package perform in light of these criteria? These are the key questions addressed by the imagery analysis section of the package test system.

Table 6 is an example of the tabular detail generated via this questioning sequence. Brand A is the leading seller in the product category; Brands B-1 and B-2 are design variations of one of the newer brands in the category. Notice two things: First, there is a reasonably wide range of agreement/disagreement with the attributes listed on the chart (the "top-box" scores range from 72% down to 7%), an indication that the scale rating device is measuring differing consumer opinions/perceptions relative to these attributes. Second, the scaling technique is able to differentiate between the packages tested—Brand A clearly has a different profile than either Brand B-1 or B-2, and even between these two alternatives a somewhat different attitude profile is evident.

Throughout the test, respondents are asked to deal with the test stimulus (packages) on a monadic basis. In the impact portion of the test they are exposed to only one package. In the imagery analysis portion they do handle three alternatives, but only one at a time. At no time are they asked to directly compare the relative merits of the test packages. Careful study of the impact and imagery measures will often supply sufficient consumer feedback to make solid judgments concerning the effectiveness of the test package(s).

One useful analytical device is to review the results for each package tested in terms of the four quadrants of Grid 1.

On occasion there are situations when the test scores for particular packages are very similar. In these instances, it is useful to try to exaggerate whatever small differences do exist in the minds of consumers. This is accomplished by asking respondents to "pick a winner" from among the alternatives presented.

Direct comparison preference

In this section of the study, each of the respondents is asked to compare directly the test packages for the first time. A limited number of statements or dimensions related to their impressions of the "image" communicated by each package are read to them, and they are asked to select the one package/label design alternative they prefer for each statement. This list often includes such factors as: most economical product, most effective product, most attractive package, highest quality product.

It's important to keep in mind that this questioning technique tends to force respondents into choices that they might not otherwise be in a position to make. It's not likely a consumer will ever see three different packages for the same brand on sale at the same time. The intention of these questions is not to determine a "best choice," but to magnify the differences in consumer perceptions that may exist. When read in conjunction with the other information collected in the testing system, the discrimination produced by this questioning technique can produce meaningful information.

A final preference question is normally included in this section of the study. Respondents are asked to distribute a total of 11 votes among the packages in the test. Votes are distributed on the basis of how strongly each respondent feels about his or her preferences—it's a measure of the intensity of preference. It's possible one package is so well-liked it gets all 11 votes, or votes may be more evenly distributed across the alternatives, an indication that preference for one package versus another is very slight.
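
Tabulating such a constant-sum question amounts to totaling the votes each package received across the sample and expressing them as shares. A minimal sketch, with hypothetical tallies:

```python
def constant_sum_shares(vote_tallies):
    """Summarize a constant-sum question: convert each package's total
    votes into its share of all votes cast across the sample."""
    total = sum(vote_tallies.values())
    return {pkg: 100.0 * v / total for pkg, v in vote_tallies.items()}

# Hypothetical tallies from 150 respondents (150 x 11 = 1,650 votes):
tallies = {"Package 1": 480, "Package 2": 380,
           "Package 3": 470, "Package 4": 320}
for pkg, share in constant_sum_shares(tallies).items():
    print(f"{pkg}: {share:.1f}% of all votes cast")
```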

Table 7 is an example of this question sequence. Four alternatives of the same brand were tested. Package 1 clearly is preferred on those dimensions related to product performance (safer, more effective, easier to use, and best overall), but it is not the most attractive package nor is it the most expensive package in the test. Though the individual preference scores for each dimension are quite strong, the constant sum scale question shows the actual intensity of preference to be quite weak. Package 1 and Package 3 are, in fact, equal in terms of consumer preference, with the other two alternatives not far behind.

Summary

This article has discussed a model for testing consumer packages or label designs. It is a reliable, proven methodology for evaluating consumer response to different packages using three separate criteria:

  1. Impact (T-scope technique).
  2. Imagery (semantic differential).
  3. Preference (forced choice).

Incorporating these three discrete measures into one single testing system offers the opportunity to examine and analyze the test results in an integrated, systematic way. The methodology is flexible, both in terms of geography and in terms of the types of packages that can be studied, and the time and financial commitment on the part of the sponsor is modest. The test design is summarized in the chart above.

When read in concert, these measures provide a maximum amount of information on which to base important packaging decisions for consumer brands.