Editor’s note: Steven Millman is global head of research and data science at market research firm Dynata.

Synthetic data is by no means a new concept in market research. While the term is relatively new in common use, it simply refers to the use of estimated or modeled data as though they had actually been collected or observed. Historically, the most common forms of synthetic data have been created through the imputation of missing data rather than the wholesale construction of survey or household information. Imputation is a process for predicting missing values, typically based on the relationships between other variables or through look-alike models. Imputation of missing values can be as simple as mid-point imputation, which assigns the average value of a variable to its missing data, or as complex as multiple imputation models, which handle missing data by generating multiple plausible data sets through iterative imputation, analyzing each data set separately and then combining the results for the best possible estimates.
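To make the simplest case concrete, here is a minimal sketch of mid-point (mean) imputation in Python using NumPy. The data and variable names are purely illustrative, not drawn from any real survey:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy survey variable with roughly 20% missing values (NaN).
x = rng.normal(50, 10, size=200)
x[rng.random(200) < 0.2] = np.nan

# Mid-point (mean) imputation: replace every missing value
# with the average of the observed values.
mean_of_observed = np.nanmean(x)
x_imputed = np.where(np.isnan(x), mean_of_observed, x)

# Mean imputation preserves the average but shrinks the variance,
# because the filled-in values carry no spread of their own.
print(np.nanvar(x), np.var(x_imputed))
```

The shrinking variance is exactly why more sophisticated approaches such as multiple imputation exist: they restore the uncertainty that a single filled-in value hides.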

Look-alike models such as ascription or fusion have also been used for decades. Generally, these models take two nonoverlapping data sets and assign data from one set (the donor set) to the other (the recipient set) based on similarities between individuals or households on common variables such as age, gender and income. The donor look-alike’s data are then appended to the recipient’s record. These models also include methods for reducing over-donation by specific individuals, which might unreasonably skew results.
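A nearest-neighbor donor match with a donation cap can sketch the idea. This is a simplified, hypothetical illustration (the linking variables, cap and scales are invented), not any vendor’s actual fusion algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)

# Donor set: respondents with both linking variables and the
# survey answer we want to ascribe to recipients.
donor_links = rng.integers(0, 5, size=(100, 2))   # e.g., age bracket, income band
donor_answers = rng.integers(1, 6, size=100)      # 5-point scale response

# Recipient set: only the linking variables are known.
recipient_links = rng.integers(0, 5, size=(30, 2))

max_donations = 3                  # cap to limit over-donation by any one donor
donation_count = np.zeros(100, dtype=int)
ascribed = np.empty(30, dtype=int)

for i, rec in enumerate(recipient_links):
    # Distance to every donor on the common linking variables.
    dist = np.abs(donor_links - rec).sum(axis=1).astype(float)
    dist[donation_count >= max_donations] = np.inf  # exhausted donors excluded
    best = int(np.argmin(dist))
    ascribed[i] = donor_answers[best]
    donation_count[best] += 1

print(ascribed[:10])
```

The `max_donations` cap is the kind of safeguard mentioned above: without it, one unusually "central" donor could supply answers to a large share of recipients and skew the fused data.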

New forms of synthetic data

With artificial intelligence (AI), new forms of synthetic data have become possible, which is why we’re hearing more and more about the term across the industry. Some of these have shown great promise, especially when used for simulation and ideation. In the survey world, new synthetic data vendors are cropping up that are not just producing tools for the imputation of missing data; they are also bringing to market methodologies that generate fully synthetic respondents to fill hard-to-reach quotas or replace surveys altogether.

The drawbacks of AI-generated imputation of missing data

Regarding the first use case – AI-generated imputation of missing data – these tools have not yet been shown to be much more effective than other advanced imputation techniques such as multiple imputation modeling (although they can be much faster), but they do share an important disadvantage.

AI-generated synthetic imputed data cannot, at least at present, produce margins of error around their estimates, which makes stat testing misleading at best. With most forms of non-AI missing-data imputation, the extra variance associated with estimating the new data can be included in the calculation of the variance of the variable as a whole. It’s easy to understand how a variable with 20% missingness imputed by model estimates would have more variance, and therefore a larger margin of error, than a variable that included only observed or self-reported data. AI, by its nature, makes it difficult or impossible to assign that required margin-of-error adjustment, leading to false confidence about results that include AI-imputed data.
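For contrast, this is how traditional multiple imputation quantifies that extra variance, using Rubin's rules to pool estimates across imputed data sets. The numbers below are hypothetical, chosen only to show the mechanics:

```python
import numpy as np

# Hypothetical estimates of the same mean from m = 5 imputed data sets,
# each with its own within-imputation variance (squared standard error).
estimates = np.array([52.1, 51.6, 52.8, 51.9, 52.4])
within_var = np.array([1.10, 1.05, 1.12, 1.08, 1.11])

m = len(estimates)
pooled_estimate = estimates.mean()
W = within_var.mean()           # average within-imputation variance
B = estimates.var(ddof=1)       # between-imputation variance
T = W + (1 + 1 / m) * B         # Rubin's rules: total variance

# The total variance exceeds the within-imputation variance alone, so the
# reported margin of error honestly reflects the uncertainty of imputation.
print(pooled_estimate, np.sqrt(W), np.sqrt(T))
```

It is precisely this between-imputation term `B` that current AI-generated imputation cannot supply, which is why its margins of error come out too narrow.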

The challenges of using fully synthetic respondents

The second use case – the creation of wholly synthetic individuals or households – is unfortunately an idea whose time has yet to come. Consider a typical use case for fully synthetic respondents: a researcher has collected 1,000 real human respondents but lacks a sufficient number of respondents of a certain type – say, Hispanic Gen Z, of whom there are only 100. You may want to compare men and women within this demographic, but a sample of just 100 Hispanic Gen Z respondents is insufficient to do so with any statistical validity.

The way most of these new methodologies work is that their AI uses the distribution of the variables that were collected, and the intercorrelations between those variables, to produce a set of entirely synthetic respondents with the demography you are looking to enhance (Hispanic Gen Z) that reflects the results from the sample you’ve already collected. Several entities in this space have claimed that the new synthetic panelists produce results similar to collected data, with less variance than observed data and therefore smaller margins of error. Unfortunately, there are several unresolvable problems with this approach.

First, the reason we want to collect additional survey respondents (or other observation-based data) in the first place is not because we want to run stat tests (although I do find it fun to run them). We collect more sample because we do not have confidence that the distribution of the sample we’ve already collected accurately describes the phenomenon we’re trying to measure. Any process that seeks to replicate this distribution does not make the data more reflective of the population; it simply recreates what you’ve already collected, much as copying and pasting those 100 Hispanic Gen Z respondents in the data file to create 200 would do. This means that adding these synthetic respondents does not improve your results but rather reinforces whatever bias happened to exist in the data you originally collected.

The claims that this method produces similar results and less variance than actual data collection are true in a sense but highly misleading. The variance among synthetic respondents is lower because the methodology is simply replicating the existing variance in your data, so no additional variation is added as the sample size increases. This will always result in lower variance and margins of error, but it is an egregious misuse of variance. Because the error around the synthetic data cannot be measured, none of the variance created by this imputation enters our analyses. As a result, any stat test applied will vastly understate the actual variance, which in turn means that even if your previously collected data were unbiased, you would be more likely to find statistical significance where none likely exists. In statistics, this is known as a Type I error.

A low sample is highly likely to produce biased results – which is why you want more data in the first place – and replicating the variance you’ve already collected reinforces that bias. Consider a vastly oversimplified example: If you roll two fair six-sided dice a million times, you will essentially always get an average roll of 7.0 and a distribution that looks like a bell curve. If, however, you roll those dice 15 times, you could randomly get just about any average or distribution. Imagine you did that and got an average of 9.3, a large left skew and a very wide margin of error. What would happen if you applied a methodology that replicated the distribution in your observed dice rolls? If you created 1,000 more synthetic dice rolls, you would still end up with an average of about 9.3 and a large left skew, but this time you would have a very small margin of error and a great deal of confidence that this answer was correct. Simply put, adding synthetic sample in this manner creates the risk that you will be far more certain that a biased wrong answer is the right one, resulting in misleading insights and bad, costly decisions.
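The dice example above can be simulated in a few lines. Here "synthetic" rolls are generated by resampling the small observed sample, a deliberately crude stand-in for any method that replicates an observed distribution:

```python
import random
import statistics

random.seed(7)

def roll():
    # Sum of two fair six-sided dice.
    return random.randint(1, 6) + random.randint(1, 6)

# A small sample of 15 real rolls; its mean can land far from 7.0.
small = [roll() for _ in range(15)]

# "Synthetic" rolls that merely replicate the small sample's distribution.
synthetic = small + [random.choice(small) for _ in range(1000)]

def margin_of_error(data):
    # Approximate 95% margin of error for the mean.
    return 1.96 * statistics.stdev(data) / len(data) ** 0.5

# The synthetic sample keeps the small sample's (possibly biased) mean
# but reports a far smaller margin of error: false confidence.
print(statistics.mean(small), margin_of_error(small))
print(statistics.mean(synthetic), margin_of_error(synthetic))
```

Whatever the small sample's quirks, the inflated sample inherits them while its reported margin of error collapses, which is the Type I error risk described above.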

Synthetic data and LLMs

Another method for creating synthetic panelists relies on the use of large language model (LLM) generative AIs (gen AIs) such as ChatGPT, Llama or Claude to create personas that can be interrogated via prompts. Coming back to the Hispanic Gen Z example, one could use the U.S. Census to determine the correct age, income, education and geographical distribution of this group in order to create 100 new personas that combine to form a sample reflective of that population segment. One could then feed those personas into the gen AI system with a prompt such as:

“Let’s play a game. Pretend you are an American Hispanic [gender] who is [age] years old, makes [income], lives in [geography] and has achieved [education] as their highest level of education. Please answer the following questions as this person would.”

It’s worth taking a moment to reflect on what an incredible technological development it is that this is even possible. Such a technology in the hands of the general public, rather than only programmers, would have been largely unthinkable even 10 years ago. That said, gen AI personas also suffer from a set of challenges that cannot currently be resolved. While LLMs do a reasonable job replicating the average values of some simple multiple-choice questions, they suffer badly from a regression-to-the-mean problem. What this means is that they tend to over-select the most common answer to a question and therefore produce a data set lacking the variation, diversity and breadth of real human responses. There is also emerging evidence that gen AI responses handle emotion and affect particularly poorly.

For example, according to a recent Gallup poll, about 15% of Americans say they smoke marijuana. This is probably an underestimate, as survey respondents often under-report illegal or socially objectionable activities. When I fed 100 demographically varied personas into ChatGPT, however, the only “individuals” who reported marijuana use were young men in cities where marijuana is legal. This is an example not only of the regression-to-the-mean problem but also of the second issue with LLM personas: bias. Biases in the underlying data that fuel gen AI are well known and broadly reported. As you might expect, they work their way into your personas’ survey responses in ways that may not be obvious. Gen AI open-end responses suffer challenges similar to other question types, showing little variation in their answers and occasionally making things up entirely (hallucinating). They may also refuse to answer sensitive questions because of the guardrails built into some LLM systems.

If you’d like more tangible evidence of the challenges of using LLMs to construct personas that one can interrogate, just ask your favorite gen AI platform the following: “If I were to create a survey and request responses to it through the use of 1,000 demographically diverse personas using this LLM, how likely would they be to be reflective of the real population and why or why not?” You will find them surprisingly clear on this subject – and also a lot of fun!

It’s also important to note that there are quite a number of very useful ways AI can be used in this context. LLMs can do a great job of evaluating surveys and testing their logic, as well as conducting sophisticated analyses of open-ended responses, although human oversight remains essential. Large-scale simulations of populations for hypothesis testing have also been used effectively by some organizations prior to conducting primary research.

While the technology of today is clearly not up to the task of effectively replacing survey respondents whole cloth, even as subsets, that does not mean that the technology of tomorrow won’t be. As a survey professional, I do hope that a viable version of these technologies will one day become available, especially to supplement hard-to-reach populations or to get answers to questions that are not permitted by regulation. In the meantime, there is simply no replacement for fully permissioned first-party data from real humans.