Enhancing Data Integrity with AI+HI Fraud Detection

Editor’s note: This article is an automated speech-to-text transcription, edited lightly for clarity.

On November 20, 2024, CMB presented during the Quirk’s Virtual Session series on data quality. The company shared the steps it has taken to increase data quality and the importance of others doing the same.

Speakers Amanda McMahan, senior insights consultant; Monika Rogers, VP of growth strategy; and Richard Scionti, VP of product development and innovation, shared the reasons, the process and the possible next steps CMB will take to combat not only fraudulent data but inattentive respondents and programmatic responses as well.

Session transcript:   

Joe Rydholm 

Hi everybody and welcome to our session “Enhancing Data Integrity with AI and HI Fraud Detection.”  

I’m Quirk’s editor, Joe Rydholm, and before we get started let’s quickly go over the ways you can participate in today’s discussion. You can use the chat tab to interact with other attendees during the session, and you can use the Q&A tab to submit questions for the presenters; they will answer as many questions as we have time for during the Q&A portion afterwards. 

Our session today is presented by CMB. Enjoy the presentation!

Monika Rogers

Hello everyone and welcome to our session on enhancing data integrity with AI plus HI fraud detection. We appreciate you all joining today.  

I thought I would start off by just sharing a little bit about CMB.  

For those of you who aren't familiar with us, we are a leading insights and strategy consultancy based in Boston. We help some of the world's largest brands make a business impact through insights. We work in close collaboration with our clients and partner with them not only to drive insights, but to drive storytelling that aligns and inspires action in their organizations.  

CMB has three primary areas of expertise. One is human psychology and analytics, with both qualitative and quantitative capabilities. We also have deep industry expertise, with practice areas including financial services, media and entertainment, and technology and telecom.  

A lot of what we're talking about today is part of our third area of expertise: AI plus human intelligence. We're really focused on finding the right balance and mix of AI- and human-centered work in everything that we do.  

My name is Monika Rogers. I am the VP of growth strategy here at CMB. My background includes experience on the client-side at General Mills, in academia at the University of Wisconsin – Madison, and starting and growing my own companies, including in consulting and technology.  

Joining me today is Amanda McMahan. She is an insights consultant here at CMB with deep expertise in data quality, having experience on both the sample side and the agency side. So, welcome Amanda!

Amanda McMahan

Happy to be here.

Monika Rogers

Also joining me is Richard Scionti. Richard is our VP of product development and innovation. He has really driven all the generative AI solutions that we have been working on over the past few years and also has expertise in managing some of our proprietary tools and methods. Welcome Richard!

Richard Scionti 

Thanks, Monika. 

Monika Rogers 

To set the stage for why we felt this session was so important, and what we believe the impact is for all of you, let's look at how the landscape has changed in the past few years since the pandemic.

What we've seen is that for many of our clients' organizations, spending has declined as a percentage of revenue. That means marketing teams are potentially smaller, and insights budgets are oftentimes not growing relative to the expectations of expansion for the organization. That has put a lot of pressure on all of us, both toward speed and efficiency.  

And AI has huge potential for driving speed and efficiency, but we believe that is not enough; we need to focus on quality. Data quality is integral to our success and our clients' success in driving impact in the marketplace. If you just focus on speed and efficiency, it becomes a race to the bottom.  

In fact, when you look specifically at survey fraud, you can see the impact that it's having on our business.  

On one hand, there's a financial risk to sample providers in the amount of money that they're spending on fraudulent participants, and that has been growing significantly over the past few years.  

But on the other side of it, and the really important thing to understand, are the business losses associated with us using fraudulent data in making decisions.  

In fact, a few years ago at another Quirk’s event, P&G and Pepsi got on stage and presented about a multinational company that had developed a product and an advertising campaign brought to market based on research that included fraudulent participants. That research was conducted with a reputable sample company, and the impact was huge in terms of financial losses related to making a wrong decision about the new product launch and advertising. 

So, this is something really important for us to understand.  

Overall, the research industry estimates that 15% to 30% of survey responses are fraudulent, and that depends a bit on the sample type.  

This is not isolated to specific companies; it's not isolated to specific industries. It does vary a bit, B2B versus consumer, and the amount of incentives you have can drive the amount of fraud that you might see. But across the board, companies are seeing this. 

63% of organizations are saying, ‘We are accepting some level of fraud in conducting research. We know it's happening. We know we can't find and fight all of it.’ 64% have said they have delayed a project because they saw negative impacts from fraud: they see something wrong with the dataset and need to go back and re-field the study. And over half say they believe their decision-making has been impacted by fraud.  

So, they're really seeing that the decisions they're making based on this data might not have been the right ones. These are all really important factors that led CMB to invest significantly in improving data integrity.  

To share a little bit of that story with us, I'm going to turn it over to Amanda. 

Amanda McMahan 

Thank you, Monika 

I joined CMB two years ago, when we had already embarked on evolving our data integrity strategy.  

The product development and innovation initiative that Richard and his team were spearheading was focused on a fraud detection tool pilot. We were assessing these tools to try to layer them into our existing QC processes.  

I joined at a really great time, when we were starting to pilot those tools and gather data. We were asking ourselves a lot of hard questions: 

  • How much fraud are we faced with?
  • How different does it look across project types or audience types?
  • Does it differ at certain points in time for certain audiences?  

We were thinking through all of this and wondering how we could improve our processes in a more holistic way, because as Monika mentioned, it's not entirely about fraud. It's also about the poor-quality respondents that might exist in the dataset.  

So, the challenge at hand was differentiating poor-quality data from a legitimate human respondent versus fraudulent data that mimics human responses. As it turns out, humans aren't that great at determining what is a bot within an online survey sample. It's challenging to decipher because the methods used are so good at obfuscating that a bot is at play.  

It used to be that inattention was the primary quality concern in a dataset. Inattention might manifest as being unengaged in a survey: a respondent puts a lackluster response into an open end, or they speed through the survey, or they straight-line down a matrix, showing us that they're not really paying attention.  
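To make those checks concrete, here is a minimal sketch, not CMB's actual tooling, of how speeding and straight-lining might be flagged in a file of completes. The field names and thresholds are illustrative assumptions.

```python
# Illustrative sketch of two classic inattention checks: speeding and
# straight-lining. Field names and thresholds are hypothetical.

MIN_SECONDS = 180          # assumed minimum plausible completion time
GRID_COLUMNS = ["q5_a", "q5_b", "q5_c", "q5_d", "q5_e"]  # one matrix question

def is_speeder(respondent: dict) -> bool:
    """Flag respondents who finished implausibly fast."""
    return respondent["duration_seconds"] < MIN_SECONDS

def is_straightliner(respondent: dict) -> bool:
    """Flag respondents who gave the identical rating to every grid row."""
    answers = [respondent[col] for col in GRID_COLUMNS]
    return len(set(answers)) == 1

respondents = [
    {"id": 1, "duration_seconds": 95,  "q5_a": 3, "q5_b": 3, "q5_c": 3, "q5_d": 3, "q5_e": 3},
    {"id": 2, "duration_seconds": 420, "q5_a": 4, "q5_b": 2, "q5_c": 5, "q5_d": 3, "q5_e": 4},
]

for r in respondents:
    flags = []
    if is_speeder(r):
        flags.append("speeder")
    if is_straightliner(r):
        flags.append("straightliner")
    print(r["id"], flags or ["clean"])
```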

But with the rise of fraudulent data over the past several years, we see really interesting datasets, really interesting response patterns, that had us puzzled. This might look like programmatic entry en masse into one project, where the same responses are layered over and over again. 

How this happens is that a survey farm will programmatically decipher the sequence needed to qualify for a study and then push a ton of responses, identical or slightly different, into that dataset. And this wreaks havoc for us.  
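As a toy illustration of catching that kind of flood, grouping completes by an exact answer signature already surfaces the pattern; production tools would use fuzzier matching, and the threshold here is an assumption.

```python
# Hypothetical sketch: detect programmatic "floods" where many completes
# share the same answer vector. Exact-hash grouping keeps the idea visible;
# real detection would also catch near-duplicates.
from collections import Counter

FLOOD_THRESHOLD = 3  # assumed: 3+ identical completes is suspicious here

def answer_signature(answers: list) -> tuple:
    """Canonical, hashable form of a respondent's closed-ended answers."""
    return tuple(answers)

completes = [
    [1, 4, 2, 5], [1, 4, 2, 5], [1, 4, 2, 5], [1, 4, 2, 5],  # the flood
    [3, 2, 5, 1], [2, 2, 4, 3],                              # organic variety
]

counts = Counter(answer_signature(a) for a in completes)
for sig, n in counts.items():
    if n >= FLOOD_THRESHOLD:
        print(f"possible survey-farm flood: {n} completes with answers {sig}")
```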

It's really pronounced in B2B surveys, which are very expensive and have small sample sizes, but it's not exclusive to B2B; B2C, health care, even expert networks are showing levels of fraudulent data that we have to combat.  

Bad data of any type cascades into negative client impact. Whether it's poor-quality data from an inattentive respondent or fraudulent data from some type of bot or programmatic response, clients might question the data or be concerned with how it represents an audience they know intimately, which leads them to doubt the validity of the insights we deliver from the projects we support. If they're questioning the data and the validity of the insights, the timeline might be impacted by having to rework the project, a subset or a specific audience. 

This can have great bearing on the feasibility of the project. The pressure to get to those insights as quickly as possible while staying in scope cost-wise can cascade into a client deciding to omit data-driven initiatives entirely, focusing instead on qualitative endeavors or on outdated or anecdotal evidence, which limits the client's ability to adapt and be nimble in the market as they should be.  

But we think the real problem is a lack of client confidence; really, the erosion of client confidence. Having concerns with the data we present for the projects we're called on to support means a client is going to lack confidence in our consultation about how to arrive at the right dataset for that specific business objective. 

It will also call into question the trusted partnerships we leverage in the research we conduct, and the scope of the project: can we execute it in the timely fashion required so that they can go to market with whatever business objective they have at hand?  

The real loss, however, is an erosion of clients' decision-making ability. Their confidence in the decisions we help them reach is really what's at stake here.  

So, as our data integrity strategy has evolved, we have come to approach this problem holistically and to recognize that we have to rethink everything.  

I want to turn it over to Richard Scionti to tell us more.

Richard Scionti 

Thanks, Amanda. 

Part of this story revolves around our use of AI. Back when ChatGPT launched generative AI onto the scene, CMB decided quite early on that AI could have a significant impact in the insights space. We focused first on building up guidelines and documentation on the use of AI, putting together ethical guidelines and practices to ensure we were not putting our clients at risk or impacting research participants in any way.  

But it's also this exercise that originally germinated our foundational philosophy of AI plus HI. That's the key to the solution set we're talking about today.  

The magic comes from not choosing between human intelligence and artificial intelligence but really operating in the sweet spot where the solution is elevated by the combination.  

I'm going to pass it back over to Amanda to sort of take us through our usage and how it is realized in survey fraud detection.

Amanda McMahan 

Thank you, Richard.  

Our data integrity operations are AI-enabled but human-driven, and we work really hard to strike the balance between how we leverage these tool sets and our team's expertise. 

We layer many quality control measures across the project timeline, starting at the design consultation with our clients. Intelligent instrument design allows us to come to an agreed-upon definition of the audience at hand and the objectives we hope to answer, and to derive the right quota size, the right timeline and the right methodology to execute that research.  

Intentional survey programming is another way of layering in automation for quality control, which gives us flexibility on what becomes a flag versus what triggers an automatic removal. 

These pre-survey initiatives help us control some level of fraud at the start. Where the magic of AI comes into play is also pre-survey, but less so in the design. 

Embedding a fraud prevention tool at the front of our survey allows us to evaluate, detect and prevent fraud from entering our dataset. Doing so also enables duplicate detection. 

We know that professional survey takers, those who exist across multiple vendors, are a problem. We also know that programmatic designs will duplicate responses from a single source and flood a survey. 
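For illustration only, a front-door duplicate check might look like the sketch below. The signals, fields and hashing scheme are assumptions; commercial fraud-prevention tools rely on far richer device and behavioral signals.

```python
# Hypothetical sketch of front-of-survey duplicate detection: build a stable
# fingerprint from a few device/session signals and reject repeat entrants
# before they reach the questionnaire.
import hashlib

seen_fingerprints: set[str] = set()

def fingerprint(signals: dict) -> str:
    """Hash a few device/session signals into a stable identifier."""
    raw = "|".join(str(signals.get(k, "")) for k in ("ip", "user_agent", "screen", "timezone"))
    return hashlib.sha256(raw.encode()).hexdigest()

def admit(signals: dict) -> bool:
    """Admit a participant only if this fingerprint hasn't been seen before."""
    fp = fingerprint(signals)
    if fp in seen_fingerprints:
        return False            # duplicate: terminate before the survey starts
    seen_fingerprints.add(fp)
    return True

print(admit({"ip": "203.0.113.7", "user_agent": "UA-1", "screen": "1920x1080", "timezone": "UTC-5"}))  # True
print(admit({"ip": "203.0.113.7", "user_agent": "UA-1", "screen": "1920x1080", "timezone": "UTC-5"}))  # False (duplicate)
```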

So, these two AI-enabled measures, driven by our human expertise, allow us to cut out much of the fraud at the very start of the survey-taking process. The magic of the blend really comes into play with expert syntactic analytics and text analytics.  

Syntactic analytics is where we look at closed-ended responses. We evaluate pattern behavior across matrices, and we look at whether conjoints and MaxDiffs are answered appropriately and with engagement. We also evaluate text entry analytics: open-ended entries. 

Arguably, open ends are the most problematic and the most difficult place to combat quality issues. As it turns out, humans aren't so great at determining what is a bot within an open end. It can be a difficult task.  

So, we blend that AI and HI strategy to allow our teams to focus on the insights, really focus on the objective at hand, as opposed to trying to decipher what is a legitimate response versus what is fraudulent.  
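A minimal sketch of that division of labor for open ends, assuming invented heuristics: cheap automated signals, duplicate verbatims and keyboard-mash gibberish in this example, surface the suspects so human reviewers don't have to hunt for them by eye.

```python
# Illustrative open-end screening: two cheap signals that flag verbatims
# for closer (human) inspection. Data and heuristics are invented.
from collections import Counter
import re

verbatims = {
    101: "Great product, love it, would recommend to anyone.",
    102: "Great product, love it, would recommend to anyone.",
    103: "xkqz vvwp zzzz",
    104: "The checkout flow kept timing out on mobile.",
}

# Signal 1: the same polished verbatim appearing under multiple respondent IDs.
dupes = {text for text, n in Counter(verbatims.values()).items() if n > 1}

# Signal 2: strings with almost no vowels tend to be keyboard mashing.
def looks_like_gibberish(text: str) -> bool:
    letters = re.sub(r"[^a-z]", "", text.lower())
    return bool(letters) and sum(c in "aeiou" for c in letters) / len(letters) < 0.2

for rid, text in verbatims.items():
    flags = []
    if text in dupes:
        flags.append("duplicate verbatim")
    if looks_like_gibberish(text):
        flags.append("gibberish")
    print(rid, flags or ["pass"])
```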

The result is that we significantly reduce our workload by applying a blended AI and HI approach. From this subset of studies, the metrics show that roughly two-thirds of the fraudulent and duplicate data is being caught with an AI-enabled tool, and roughly a third of the quality control removals are happening on a manual basis, with our teams still leveraging some AI tools in this space. 

But what we hope to express is that this AI plus HI combination allows us to focus on the insights, and we can now track progress and improve as we learn by evaluating vendor quality over time, for specific audiences, by project type. We're able to size the vendor composition we suggest for our clients, ensure the trendability of data and the consistency of our quality control processes, and hold vendors more accountable for the data they provide to us.  

By building confidence across the ecosystem, we drive greater impact for our clients. With high-quality data, we can consult more readily about the kinds of research initiatives we hope to bring to clients and the kinds of audiences, sample compositions and methodologies we suggest. A client's confidence is amplified by knowing that our AI plus HI solution is combating fraud to this degree. It also means vendor accountability is more in play, so client confidence in the trusted partners we bring to the table is brought back to life.  

That also means we can set reliable expectations by getting in front of fraud at the very outset, before it even enters our survey. We know we have a cleaner dataset from which to start gleaning those inattentive responses. 

The greatest win is trusted consultation with our clients. Their decision-making power is more confident than ever because our clients know that we have an approach that gives them the best data. 

I'll turn it over to Richard to tell us more about what's to come.  

Richard Scionti 

Thanks, Amanda.  

While we feel a strong sense of accomplishment for the solution that we've put in place, we also recognize that these bad actors are highly motivated to get into surveys, and we expect that they will adjust their approaches and try to go through or around the barriers we have set in place. So, we know we're playing the long game, and we're prepared for that.  

Another area where we are going to turn our attention is efficiencies in open-end evaluation. This is a place where we think generative AI can play a significant role. Most of the current solution is built around narrower AI, and we think we can algorithmically exclude some responses based on detected quality, and also queue up for human evaluation those we have some suspicion about, where we need the expertise of an individual to make the choice about whether to include or exclude the response.  
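As a sketch of that triage idea, responses might be split into three bands: confident exclusions handled algorithmically, confident keeps passed through, and the uncertain middle queued for a human. The scoring function below is a stand-in for wherever a generative model would sit; all names and cutoffs are assumptions, not CMB's implementation.

```python
# Sketch of an exclude / human-review / keep triage for open ends.
# score_response is a placeholder for a narrow-AI or LLM-based scorer.

EXCLUDE_BELOW = 0.2   # assumed: confidently bad, removed algorithmically
KEEP_ABOVE = 0.85     # assumed: confidently fine, passed through

def score_response(text: str) -> float:
    """Placeholder scorer; imagine an LLM-derived quality probability here."""
    return min(len(set(text.lower().split())) / 10, 1.0)

def triage(responses: list[str]) -> dict[str, list[str]]:
    buckets = {"exclude": [], "human_review": [], "keep": []}
    for text in responses:
        s = score_response(text)
        if s < EXCLUDE_BELOW:
            buckets["exclude"].append(text)       # algorithmic removal
        elif s > KEEP_ABOVE:
            buckets["keep"].append(text)          # no human time spent
        else:
            buckets["human_review"].append(text)  # expert judgment needed
    return buckets

print(triage([
    "ok",
    "It was fine I guess nothing special",
    "The onboarding emails arrived late and the app logged me out twice",
]))
```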

Then finally, another area we want to invest in is respondent engagement, to address those inattentive or distracted respondents. We think AI can play a role here as well, with surveys that become more conversational and engaging in nature so that we address respondent fatigue and inattention in a more dynamic way. 

These are areas that we intend to continue to focus on as we move forward.  

With that, I'll pass back to Monika to wrap us up.  

Monika Rogers 

Thanks Richard.  

We are really delighted to have all of you join today.  

If you're interested in learning more on this topic, you're welcome to scan the QR code here on your screen. It will take you to a blog on data quality, and you'll have an opportunity to download our AI plus HI guide with more information about new CMB solutions using generative AI.  

With that, I will open it up for questions.