How to catch a cheat
Editor’s note: Bill MacElroy is president of Socratic Technologies, Inc., a San Francisco research firm.
In the early days of Internet research, a commonly heard objection was that “you can’t tell if the person you’re corresponding with is really the person he or she appears to be.” Despite many improvements to Web-based survey data collection techniques and security technologies, this early perception persists. This article discusses many of the modern tools that are being used to make sure that the vast majority of “survey cheaters” are identified and deleted from the final analysis.
First of all, let’s define what constitutes a cheater. Because many Web surveys offer some sort of incentive (and since individual pay-per-completes are becoming more common), there can be quite a bit of motivation to attempt to submit multiple surveys and/or to take surveys but not pay attention to the tasks or questions. We consider both of these acts to be inappropriate survey behavior and attempt, wherever possible, to withhold rewards for such mischief.
Some cheaters are very easy to catch and remove from the data; others require a bit more stealth and internal logic checking. But before we discuss actual techniques for defeating cheaters, you might find it interesting to know that while we inform cheaters that they’ve been busted and won’t be getting any incentive, we rarely tell them how we caught them. This is because, just like in Vegas, once you tell a cheater how you can tell they’re cheating, they become even more innovative - something we’d all like to avoid. With that in mind, we’ll assume that no survey cheaters read Quirk’s Marketing Research Review.
Three points
To begin, there are three points at which you can catch online survey cheaters: prior to the invitation, during the survey and in the analysis of incentive fulfillment data.
All major research associations (IMRO, AMA, MRA, ARF, CASRO and others) have adopted codes of ethics that require lists of potential respondents (not including Web site intercepts) to have one of two characteristics: they must have either a prior opt-in for contact, or the individuals on the list must have a prior, existing business relationship with the sender through which an e-mail contact would not be considered a random, unsolicited broadcast commercial e-mail (spam). This requires most researchers to rely on online panels, customer databases or opt-in lists.
Providers of these panelists and lists regularly scan profiling and opt-in data for duplicated Internet server addresses, series of similar addresses (abc@hotmail, bcd@hotmail, cde@hotmail, etc.), replicated mailing addresses (for incentive checks), and other data that might indicate multiple sign-ups by the same individual.
Sophisticated systems now use an algorithm to compare the values of many profiling fields to produce a cheating probability score (CPS) for each panelist. Each profile data point is assigned a weight and the number of similarities between the data points will contribute to a total CPS; the higher the score, the higher the probability that a membership is an attempted duplicate. People with high CPSs are quietly removed on a regular basis, without notification, in order to prevent the possibility of multiple invitations to the same individual. Many companies using their own client lists or building their own internal panels follow similar precautions.
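To make the weighting idea concrete, here is a minimal sketch of how a CPS might be computed. The field names, the weights and the choice to score each panelist by his or her closest match are our own illustrative assumptions, not a description of any particular vendor's algorithm.

```python
from itertools import combinations

# Hypothetical weights (summing to 100): how strongly a match on each
# profile field suggests that two memberships belong to one person.
FIELD_WEIGHTS = {
    "mailing_address": 40,
    "ip_address": 30,
    "phone": 20,
    "birth_date": 10,
}

def normalize(value):
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(str(value).lower().split())

def pair_score(a, b, weights=FIELD_WEIGHTS):
    """Sum the weights of the fields on which two profiles match."""
    score = 0
    for field, weight in weights.items():
        va, vb = a.get(field), b.get(field)
        if va and vb and normalize(va) == normalize(vb):
            score += weight
    return score

def cheating_probability_scores(profiles, weights=FIELD_WEIGHTS):
    """Score each panelist by the closest match against any other panelist."""
    scores = {p["id"]: 0 for p in profiles}
    for a, b in combinations(profiles, 2):
        s = pair_score(a, b, weights)
        scores[a["id"]] = max(scores[a["id"]], s)
        scores[b["id"]] = max(scores[b["id"]], s)
    return scores
```

In this sketch, two memberships that share a mailing address and an IP address would each score 70 out of 100. Where to set the removal cutoff is a judgment call that depends on the panel.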
At the time of the survey, a common solution is to run several simultaneous tests of Internet addresses and tags to screen for cheaters. Note: the use of cookies as the sole means of preventing multiple submissions is now considered inadequate because so many people know how to delete or defeat these tracking tags. However, cookies in combination with IP detection and a time threshold per IP address (e.g., no multiple submissions from the same address within x number of minutes) can defeat many cheating attempts.
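The combination of cookie and IP checks might look something like the following sketch. The class name, the 10-minute default window and the in-memory storage are assumptions made for illustration; a production system would keep this state in a database.

```python
import time

class SubmissionGate:
    """Reject a submission if its cookie was seen before, or if the same
    IP address submitted within the last `window_seconds`."""

    def __init__(self, window_seconds=600):
        self.window_seconds = window_seconds
        self.seen_cookies = set()
        self.last_accepted_by_ip = {}

    def allow(self, ip, cookie_id=None, now=None):
        now = time.time() if now is None else now
        # Cookie check: easy to defeat on its own, so it is only one test.
        if cookie_id is not None and cookie_id in self.seen_cookies:
            return False
        # IP check: block repeat submissions inside the time window.
        last = self.last_accepted_by_ip.get(ip)
        if last is not None and now - last < self.window_seconds:
            return False
        # Record only accepted submissions.
        if cookie_id is not None:
            self.seen_cookies.add(cookie_id)
        self.last_accepted_by_ip[ip] = now
        return True
```

The window matters because many legitimate respondents share an address (an office or a household, for example), so an outright ban on repeat IP addresses would turn away honest participants.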
In addition, most survey organizations now use a seeded database approach to inviting participants by e-mail. This is called a handshake: a real-time information exchange between the interactive database and the survey taker. It entails creating a series of unique URL links to the Web survey, one of which is embedded in each e-mail invitation. Once a unique code link is used, no one else should be able to re-use that link.
These unique URLs should be different enough that invitees can’t easily figure out how to copy the link and change the address to gain entrance multiple times. For example, some do-it-yourself survey programs use a series of consecutive numbers at the end of the URL, which can be easily copied, pasted and then incremented or decremented to gain access (e.g., using a URL like “http://surveyorg.com/survey/?pid=101” is an open invitation to paste this address into the browser and try “?pid=102”). Using tagging components that are random, complex and/or contain symbols prevents most multiple submission attempts.
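A simple way to produce links of this kind is sketched below, using Python's standard secrets module to generate the random tag. The function names and the in-memory token table are hypothetical; the base URL is the example address used above.

```python
import secrets

BASE_URL = "http://surveyorg.com/survey/"  # example address from the text

def issue_tokens(emails):
    """Create one random, hard-to-guess token per invitee."""
    tokens = {}
    for email in emails:
        token = secrets.token_urlsafe(16)  # 16 random bytes, URL-safe text
        tokens[token] = {"email": email, "used": False}
    return tokens

def link_for(token):
    """Build the unique survey link to embed in the e-mail invitation."""
    return f"{BASE_URL}?pid={token}"

def redeem(tokens, token):
    """Admit the respondent only if the token exists and is unused."""
    entry = tokens.get(token)
    if entry is None or entry["used"]:
        return False
    entry["used"] = True
    return True
```

Because each token is random rather than sequential, changing a character or two in the link is very unlikely to produce another valid invitation, and a link that has already been used is refused.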
Screen ’em out
Once an invitation has been sent, it is important to screen out cheaters who use a type of technology called automatic field-populating applications. These allow tech-savvy cheaters to capture the clicks used to complete one survey and then simply replicate the same entries using other aliases that they have used to create “multiple personalities” in the panel or list. Some of these are very hard to detect (e.g., Sam in Houston may also be Sally in Hudson, etc.). A field populator helps these folks expedite their cheating by filling in all the answers at lightning speed. So not only are they cheaters, they’re lazy cheaters.
By running real-time data matching algorithms and/or text field matching applications, the duplicate records (including the original) can usually be found and eliminated so that replacement interviews can be completed before field is closed. The matching criteria should be set high enough to detect probable duplicates (90+ percent exact matches), but in cases where there is a high degree of consistency in the market’s opinion, some managerial discretion may need to be applied.
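As a rough illustration of the matching step, the sketch below compares every pair of completed interviews and flags those whose answers agree on at least 90 percent of the questions they share. The data layout (a dictionary of answers per respondent) and the function names are assumptions for the example.

```python
from itertools import combinations

def match_rate(a, b):
    """Share of commonly answered questions with identical answers."""
    shared = [q for q in a if q in b]
    if not shared:
        return 0.0
    same = sum(1 for q in shared if a[q] == b[q])
    return same / len(shared)

def probable_duplicates(responses, threshold=0.9):
    """Return the IDs of all respondents involved in a near-exact match.

    Both members of a matching pair are flagged, so the original
    record is caught along with its copies.
    """
    flagged = set()
    for (id_a, a), (id_b, b) in combinations(responses.items(), 2):
        if match_rate(a, b) >= threshold:
            flagged.update((id_a, id_b))
    return flagged
```

Comparing every pair is practical only for modest sample sizes. In markets where opinion is highly uniform, flagged records would still deserve a manual look before deletion, as noted above.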
Another way to avoid these types of cheating technologies is to create another kind of handshake which includes a code embedded within a complex graphic. This handshake involves information entry or a task in which the respondent enters a code that is invisible to most bots (a type of automated digital scanning software). Sometimes this is accomplished through the use of random numbers and letters that are contained in a dense graphic field, which the respondent then retypes as a password for entrance into the survey. So far most automatic field-populating applications have been unable to read the code through these complex graphic backgrounds (although computer scientists at UC Berkeley working in conjunction with Yahoo! have developed several algorithms that are getting very good at reading through this type of screening handshake).
For example, humans can read distorted text set against such a background, but most current computer programs can’t. If the data received does not match the data sent, then the multiple attempts can be detected and rejected.
Get the cash
The second type of cheater is someone who responds to a survey but who rushes through the questions without thinking in order to gain some incentive reward. Amazingly, some “Get Paid for Survey Taking” sites actually give people pointers as to how they can get through surveys quickly and get the cash. This is where another form of cheat-detecting technology comes into play: pattern recognition.
Pattern recognition looks for several types of bad survey behavior. These detection algorithms look for people who simply “straight line” the survey (e.g., taking the first choice on every answer set or entering 4,4,4,4,4,4 on a matrix) and those who zig-zag or “Christmas tree” their answers (e.g., 1,2,3,4,5,4,3,2,1). In either case, when this behavior is detected, the survey is terminated and the offending respondent is usually tagged for permanent removal from the sample source. Once again, the cheaters generally are not alerted to the fact that they’ve been tagged; they just never receive any other invitations.
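Both patterns are straightforward to test for. The sketch below is one simple heuristic, under our own assumptions: a straight line is a run of identical answers, and a Christmas tree is a run in which every answer sits exactly one scale point away from the one before it. The minimum run length of five is arbitrary.

```python
def is_straight_line(answers, min_length=5):
    """True if every answer in a long enough run is identical."""
    return len(answers) >= min_length and len(set(answers)) == 1

def is_christmas_tree(answers, min_length=5):
    """True if each answer moves exactly one scale point from the last,
    which covers both stair-steps (1,2,3,4,5,4,...) and zig-zags (1,2,1,2,...)."""
    if len(answers) < min_length:
        return False
    return all(abs(b - a) == 1 for a, b in zip(answers, answers[1:]))

def flag_pattern(answers):
    """Return the name of the detected pattern, or None if neither applies."""
    if is_straight_line(answers):
        return "straight_line"
    if is_christmas_tree(answers):
        return "christmas_tree"
    return None
```

An honest respondent can occasionally produce either pattern on a short matrix, so the minimum length should be long enough to make an accidental match unlikely.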
The second way that pattern recognition looks for cheaters is by applying convergent and divergent validity tests within the survey. In short, similar questions should be answered in a similar fashion and polar opposites should receive inverse reactions. For example, if someone strongly agrees that a product concept “is expensive,” they should not also strongly agree that the same item “is inexpensive.” In these cases, it may be acceptable to tip our hand prior to terminating the interview by saying “Please check your answers to this last section of questions. Some of your answers seem to be contradictory.” But if the inconsistencies remain, the interview should be scrapped and the respondent tagged for quiet removal.
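A consistency check of this sort can be sketched as follows, assuming a five-point agreement scale (5 = strongly agree). The item names other than the expensive/inexpensive pair, the pairings and the tolerance of two scale points are all hypothetical.

```python
# Hypothetical item pairs on a 1-5 agreement scale (5 = strongly agree).
OPPOSITE_PAIRS = [("is_expensive", "is_inexpensive")]
SIMILAR_PAIRS = [("would_buy", "would_recommend")]

def contradictions(answers, opposite=OPPOSITE_PAIRS, similar=SIMILAR_PAIRS,
                   scale_max=5, tolerance=2):
    """Return the item pairs whose answers are inconsistent."""
    problems = []
    for a, b in opposite:
        # Reverse-score the second item; opposites should then roughly agree.
        reversed_b = scale_max + 1 - answers[b]
        if abs(answers[a] - reversed_b) > tolerance:
            problems.append((a, b))
    for a, b in similar:
        # Similar items should receive similar ratings as they stand.
        if abs(answers[a] - answers[b]) > tolerance:
            problems.append((a, b))
    return problems
```

If the list comes back non-empty, the survey can show the "please check your answers" prompt and run the check once more before deciding whether to scrap the interview.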
Incentive delivery
Finally, analysis of the delivery of incentives can be a method of identifying cheaters. Although online gift certificates are a very popular way of delivering a per-complete reward, every third or fourth incentive payment should be made by check or printed notice mailed to a physical address. (Even online gift certificate notices can be sent by mail.) This way, if people want their reward, they have to drop any aliases or geographic pretext in order for delivery to be completed, and oftentimes you can catch cheaters prior to distribution.
By scanning for similar names and derivatives within a specific zip code, otherwise hard-to-detect cheaters can be smoked out. But tech-savvy cheaters also know that slight derivatives, including unusual capitalization, small misspellings, use of initials, etc., are hard for some systems to detect and will still be honored by most banks (e.g., Richard Smith, R. Smith, Dick Smith, Richerd Smith, RiCHard SmiTH, etc.).
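The sketch below shows one way such a scan could handle the derivatives listed above, using the SequenceMatcher class from Python's standard difflib module for fuzzy comparison. The nickname table, the 0.8 similarity cutoff and the function names are illustrative assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Illustrative nickname table; a real one would be far longer.
NICKNAMES = {"dick": "richard", "rick": "richard", "bill": "william"}

def split_name(full_name):
    """Lowercase, drop periods, and return (first, last) with nicknames expanded."""
    parts = full_name.lower().replace(".", " ").split()
    if not parts:
        return "", ""
    first, last = parts[0], parts[-1]
    return NICKNAMES.get(first, first), last

def same_person(name_a, name_b, min_ratio=0.8):
    """Heuristic: similar last names plus similar (or initial-matching) first names."""
    first_a, last_a = split_name(name_a)
    first_b, last_b = split_name(name_b)
    if SequenceMatcher(None, last_a, last_b).ratio() < min_ratio:
        return False
    if len(first_a) == 1 or len(first_b) == 1:
        # One side is only an initial: compare first letters.
        return first_a[:1] == first_b[:1]
    return SequenceMatcher(None, first_a, first_b).ratio() >= min_ratio

def suspicious_payees(payees, min_ratio=0.8):
    """payees: list of (name, zip_code). Return similar-name pairs sharing a zip code."""
    by_zip = defaultdict(list)
    for name, zip_code in payees:
        by_zip[zip_code].append(name)
    pairs = []
    for zip_code, names in by_zip.items():
        for a, b in combinations(names, 2):
            if same_person(a, b, min_ratio):
                pairs.append((zip_code, a, b))
    return pairs
```

A match here is only a lead, not proof: two people named Smith in the same zip code may well be relatives, so flagged pairs are best reviewed alongside the mailing address itself.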
Even if you use online gift certificates, your provider may be able to track the IP addresses from which the certificates were redeemed. Because most award certificates are not linked to the e-mail address to which they were originally sent, some cheaters just put them together and cash them in from a single address. Combining both of these pieces of information, you can work with your gift provider to detect cheaters, albeit, unfortunately, after the fact.
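If your provider can supply a redemption log, the cross-check is simple. The sketch below assumes a hypothetical log of (recipient e-mail, redeeming IP address) pairs and lists every address from which certificates sent to two or more different recipients were cashed in.

```python
from collections import defaultdict

def shared_redemption_ips(redemptions, min_accounts=2):
    """redemptions: iterable of (recipient_email, redeeming_ip) pairs.

    Returns {ip: sorted list of recipient e-mails} for every IP address
    that redeemed certificates issued to `min_accounts` or more recipients.
    """
    accounts_by_ip = defaultdict(set)
    for email, ip in redemptions:
        accounts_by_ip[ip].add(email.lower())
    return {
        ip: sorted(emails)
        for ip, emails in accounts_by_ip.items()
        if len(emails) >= min_accounts
    }
```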
Catch the liars
In summary, cheating is still very much alive in all forms of surveying, but in the online world, underlying pattern-matching, logic detection and data mining tools are eliminating the vast majority of behavioral outliers, or outright liars, as the case may be.
(And if anyone is counting down, this article will probably appear on some “How to Cheat at Web Surveys for Fun and Profit” site in 4, 3, 2, 1…)