Editor's note: Katie Eddy is research analyst and Natasha Elsner is director of research strategy at the HSM Group, a Scottsdale, Ariz., research firm. They can be reached at eddy@hsmgroup.com and elsner@hsmgroup.com, respectively.
In market research, time and budget figure heavily in establishing the research method. Efficiencies typically involve negotiating trade-offs between cost, quality and time.
Using multiple coders for open-ended survey responses is a fielding efficiency that our firm, the HSM Group, relies on for large, ongoing health care research projects built around semi-structured interviews. For our B2B research, in-house researchers conduct telephone interviews and then code the open-ended comments, so the data are captured analysis-ready. And while all interviewers are trained in the subject area, we understand that discrepancies can occur among even the most intensively trained coders because interpreting and classifying open-ended responses is inherently subjective. We pay careful attention to the quality of verbatim coding and review coded responses on a regular basis. A more systematic way to assess the quality of response coding is to measure inter-rater reliability, which we recently undertook to gauge the group's collective accuracy.
Inter-rater reliability (IRR) is a measure of the level of agreement between the independent coding choices of two or more coders (Hallgren, 2012). It gauges the consistency of coding and can be used to establish how far a coder's choices deviate from the ideal or "true" codes (those that garner general consensus among multiple coders). A variety of statistics can express inter-rater reliability, but one of the most common, and the most appropriate for the study in question, is Cohen's kappa coefficient. Cohen's kappa is considered one of the most robust measures of IRR and is used widely in scholarly work (Carletta, 1996). It is calculated from the proportion of responses on which two coders agree, adjusted for the agreement that would be expected by chance.
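To make the chance adjustment concrete, here is a minimal sketch in Python of how kappa can be computed for two coders. The codes and responses are hypothetical and the function is illustrative rather than a production implementation.

```python
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders assigning one code per response."""
    n = len(codes_a)
    # Observed agreement: share of responses both coders coded identically.
    p_observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Chance agreement: for each code, the product of the two coders'
    # marginal probabilities of choosing it, summed over all codes.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    p_chance = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical example: two coders classify eight comments into codes A, B and C.
coder_1 = ["A", "A", "B", "C", "B", "A", "C", "B"]
coder_2 = ["A", "B", "B", "C", "B", "A", "C", "A"]
print(round(cohens_kappa(coder_1, coder_2), 2))  # about 0.62: raw agreement is 0.75, then chance-adjusted
```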
Possible kappa scores range from -1 to 1, where -1 indicates perfect disagreement, 0 indicates agreement no better than chance and 1 indicates "perfect agreement," with no difference between the coding choices of two (or more) coders (Hallgren, 2012). Social scientists generally accept that achieving a perfect score of 1 is highly unlikely in any research undertaking and becomes even less likely as code choices and interview responses grow in complexity and variability. With this in mind, scholars have established several guidelines by which researchers can evaluate their coding consistency. According to Landis and Koch (1977), a kappa coefficient between 0.61 and 0.80 represents substantial agreement, while a kappa of 0.81 or above can be viewed as "near perfect" agreement. And while 0.80 is viewed by many as a cutoff point for the viability of data (Hallgren, 2012), in practice, kappa coefficients below 0.80 are often still accepted in both academic and market research studies. Fleiss (1981) employs a simpler guideline, treating 0.75 as the cutoff: anything above 0.75 should be viewed as an "excellent" level of agreement.
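As a quick reference, the cutoffs cited above can be expressed in a few lines of code; the function below simply encodes the Landis and Koch (1977) and Fleiss (1981) thresholds discussed in this section and is offered only as an illustration.

```python
def interpret_kappa(kappa):
    """Label a kappa score using the cutoffs discussed above."""
    if kappa >= 0.81:
        landis_koch = "near perfect (Landis and Koch)"
    elif kappa >= 0.61:
        landis_koch = "substantial (Landis and Koch)"
    else:
        landis_koch = "below the substantial range"
    fleiss = "excellent (Fleiss)" if kappa > 0.75 else "below Fleiss's 0.75 cutoff"
    return landis_koch, fleiss

print(interpret_kappa(0.92))  # ('near perfect (Landis and Koch)', 'excellent (Fleiss)')
```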
HSM has kept these guidelines in mind when evaluating the quality of our verbatim coding. We felt that with our knowledge of the health care industry and our experience conducting research in this field, it should be possible to exceed industry standards.
HSM maintains many ongoing research projects for a range of clients. In the spring of 2014, we felt it was time to analyze our own inter-rater reliability to make certain that an "excellent" or "near perfect" level of agreement was being maintained for a long-time client. For this exercise, we compared each of our individual coding choices with an ideal code, determined in consultation with our client and therefore the most accurate reflection of that content expert's thought process and mind-set. Factoring in the probability of agreement due to chance, we achieved a kappa coefficient of 0.92 for this particular project. This score reflects our measured accuracy across four possible response-code options for a single question. The analysis was performed on a sample of 298 interviews, representing three months, or one-quarter, of our annual data collection for this client.
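For readers who want to run this kind of check themselves, the sketch below shows one way to compare a coder's choices against an ideal code set using scikit-learn's kappa implementation. The four code labels and the data are invented for illustration; they are not HSM's actual codes or results.

```python
from sklearn.metrics import cohen_kappa_score

# Four illustrative response-code options (hypothetical, not HSM's actual codes).
CODE_OPTIONS = ["cost", "quality", "access", "other"]

# Hypothetical data: the client-validated "ideal" codes and one coder's choices.
ideal_codes = ["cost", "quality", "access", "cost", "other", "quality", "access", "cost"]
coder_codes = ["cost", "quality", "access", "cost", "quality", "quality", "access", "cost"]

kappa = cohen_kappa_score(ideal_codes, coder_codes, labels=CODE_OPTIONS)
print(f"kappa vs. ideal codes: {kappa:.2f}")
```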
Discrete and independent
Fundamentals of survey design dictate that for single-choice questions, response categories should be mutually exclusive and collectively exhaustive. Translating this principle to multiple-response questions and to codes for open-ended comments suggests that response categories should be discrete and independent of each other to the extent possible. Researchers are aware, though, that even discrete codes have unique relationships with each other.
In health care, complexities abound where issues are intimately related to and affected by other (seemingly disparate) phenomena. For example, one set of comments might present the idea of patient satisfaction separately from the idea of quality reporting, yet in other sections of text the two may appear as embedded concepts, since patient satisfaction is included in CMS quality metrics.
Therefore, developing codes or response categories that are truly independent of one another, without recognizing those inherent relationships, would not do justice to the complexity of the issues at hand. Particularly in health care, where government regulations, payment structures and delivery systems all tend to inform each other, HSM has chosen not to create firm boundaries between these interlocking topics.
Rather, recognizing that the complexity of these relationships has the potential to decrease coding consistency, we conducted further analysis of the 43 codes used for the open-ended question under review and identified 12 sets of codes whose topics relate as part and whole, relate as cause and effect, or contain related or similar concepts.
As the issues discussed in the interviews grow more complex and code meanings shift with changes in the health care landscape, creating new codes, revisiting existing ones and staying mindful of the interconnectedness of topics should be an ongoing collaborative effort between the market research team and its clients. At the same time, broad interview questions lend themselves to a wide array of coding options, so we anticipate that some discrepancies in coding will inevitably occur. This is not an anomaly but an expected feature of qualitative data, especially where 20 or more codes are in play at a time, as was the case in this study, though that is atypical of the standard market research open-ended question.
As always, the interests and needs of the clients should be paramount. We value our clients’ thoughts and opinions on code development and overall research strategy and want to involve them in the research process. Measuring and monitoring inter-rater reliability is just one way we as research suppliers can validate the quality of our work.
REFERENCES
Carletta, J. "Assessing agreement on classification tasks: the kappa statistic." Computational Linguistics. 1996;22(2):249-254.
Fleiss, J.L. Statistical Methods for Rates and Proportions. New York: John Wiley; 1981.
Hallgren, K.A. "Computing inter-rater reliability for observational data: an overview and tutorial." Tutorials in Quantitative Methods for Psychology. 2012;8(1):23-34.
Landis, J.R., Koch, G.G. "The measurement of observer agreement for categorical data." Biometrics. 1977;33(1):159-174.