Editor’s note: Florian Kahlert is co-CEO of market research firm Helixa, New York. 

When you ask people how often they check their phone, their answers are going to be wildly off the mark, and usually too low. This is a classic example of the limits of survey research: there is often a significant difference between what people think they do and what they actually do.

Observed behavior, on the other hand, is relatively easy and straightforward to capture. But if you want to understand why people made a particular choice, that is considerably more complicated and usually not fully answered by observation alone. A survey is an excellent tool for getting to those answers.

Market research has long relied on these two core approaches to collecting data for the development of consumer insights. Researchers have come to understand their strengths as well as their limitations and, most importantly, we now have a clear picture of the gap that exists between them.

While surveys have their strengths, collecting survey responses on tens of thousands of items from tens of thousands of individuals is extremely expensive, often impractical and sometimes inaccurate. 

On the other hand, observed data is now available in abundance, though it often lacks the context needed to interpret it, such as basic demographics or the motivations behind the behavior.

Luckily, the advent of ever more powerful and affordable computing, for both processing and storing data, combined with advances in machine learning, gives us a way to narrow this gap and expand the knowledge the two approaches produce.

However, many of the data sources we easily access and use these days lack one critical ingredient: a common unique identifier that would allow us to link them directly. Historically, this unique identifier had some great benefits, mainly establishing that data1 and data2 belonged to the same “person.” As data sets grow increasingly diverse, such an identifier is less and less often available and, from a privacy perspective, not desirable. Solving the connection problem requires a different approach.

Machine learning can help us overcome this. Machines can sift through data more efficiently than we can and identify connections that we typically could not see. When trained on a high-quality learning data set (something that has its own challenges, which we leave for another day), a model can start making probabilistic predictions about which items in two data sets “belong together” and, ideally, indicate how likely each connection is to be true.
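To make the idea of a probabilistic connection concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any particular vendor's system: the shared attributes (`age_band`, `region`, `kids_at_home`), the tiny training set and the choice of a hand-rolled logistic regression are all hypothetical. The model learns from labeled example pairs how strongly agreement on each attribute suggests that two group profiles from different data sets belong together, then returns a probability for a new pair.

```python
import math

# Hypothetical attributes that both data sets happen to describe.
FIELDS = ["age_band", "region", "kids_at_home"]

def agreement(a, b):
    """Return 1.0 for each field where two profiles agree, else 0.0."""
    return [1.0 if a[f] == b[f] else 0.0 for f in FIELDS]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(weights, x):
    """Probability that a pair with agreement vector x belongs together."""
    return sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))

def train(xs, ys, epochs=2000, lr=0.5):
    """Fit logistic-regression weights (bias first) by batch gradient descent."""
    weights = [0.0] * (len(xs[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(xs, ys):
            err = score(weights, x) - y
            grad[0] += err
            for i, xi in enumerate(x):
                grad[i + 1] += err * xi
        weights = [w - lr * g / len(xs) for w, g in zip(weights, grad)]
    return weights

# Made-up learning data: agreement vectors for pairs already known to be
# true connections (label 1) or not (label 0).
TRAIN_X = [
    [1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1],
    [0, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0],
]
TRAIN_Y = [1, 1, 1, 1, 0, 0, 0, 0]

weights = train(TRAIN_X, TRAIN_Y)

# Group profiles from two separate data sets; no personal identifiers involved.
group_a = {"age_band": "25-34", "region": "East Coast", "kids_at_home": False}
group_b = {"age_band": "25-34", "region": "East Coast", "kids_at_home": False}
group_c = {"age_band": "55-64", "region": "West Coast", "kids_at_home": True}

p_same = score(weights, agreement(group_a, group_b))
p_diff = score(weights, agreement(group_a, group_c))
print(f"a-b: {p_same:.3f}  a-c: {p_diff:.3f}")
```

The point of the sketch is the output type: not a yes or no match, but a score that says how much to trust each proposed connection. In practice the quality of the learning data matters far more than the choice of model.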

This opens up a range of possibilities. We can put into context data that we previously could not correlate, and assign a score to the strength of each connection. For example, if we know from one data set that a particular group of people like the freshness of hard seltzer, and another data set helps us understand the music preferences of various groups of people, we are able to pinpoint what music these hard seltzer people are into.

The systems may also help us understand that it's not just one type of music they are into but that the subgroup that is on the East Coast prefers K-pop and a group that likely has children in the home is more into classic rock. None of these data sets are connected to an individual – they can actually be kept completely separate – and the learning from one set is transferred and applied to another set.
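The hard seltzer example can be sketched as a small group-level calculation in Python. Everything here is hypothetical: the segment definitions, the sizes and the shares are made-up numbers, and the approach rests on a strong simplifying assumption, namely that within a segment, liking hard seltzer and liking a music genre are independent. Under that assumption, the music mix of seltzer fans is the segment-level music mix weighted by how many seltzer fans each segment contributes, and the ratio to the overall population (the "lift") serves as a simple strength score.

```python
# Data set A (hypothetical): segment size and share who like hard seltzer.
SELTZER = {
    ("East Coast", "no kids"): {"size": 400, "seltzer_share": 0.50},
    ("East Coast", "kids"): {"size": 300, "seltzer_share": 0.20},
    ("West Coast", "no kids"): {"size": 200, "seltzer_share": 0.25},
    ("West Coast", "kids"): {"size": 100, "seltzer_share": 0.10},
}

# Data set B (hypothetical): music preferences for the same segments,
# collected separately and never joined to any individual.
MUSIC = {
    ("East Coast", "no kids"): {"k-pop": 0.6, "classic rock": 0.4},
    ("East Coast", "kids"): {"k-pop": 0.3, "classic rock": 0.7},
    ("West Coast", "no kids"): {"k-pop": 0.4, "classic rock": 0.6},
    ("West Coast", "kids"): {"k-pop": 0.2, "classic rock": 0.8},
}

def genre_mix(segment_weights):
    """Average the per-segment genre shares using the given segment weights."""
    total = sum(segment_weights.values())
    mix = {}
    for segment, weight in segment_weights.items():
        for genre, share in MUSIC[segment].items():
            mix[genre] = mix.get(genre, 0.0) + weight * share / total
    return mix

# Weight each segment by its estimated number of seltzer fans ...
fans = {s: v["size"] * v["seltzer_share"] for s, v in SELTZER.items()}
# ... and, for the baseline, by its total size.
everyone = {s: v["size"] for s, v in SELTZER.items()}

fan_mix = genre_mix(fans)
base_mix = genre_mix(everyone)

# Lift above 1 means the genre is over-represented among seltzer fans.
lift = {genre: fan_mix[genre] / base_mix[genre] for genre in fan_mix}

for genre in fan_mix:
    print(f"{genre}: fans {fan_mix[genre]:.2f}, "
          f"everyone {base_mix[genre]:.2f}, lift {lift[genre]:.2f}")
```

Only segment-level aggregates cross from one data set to the other, which is why the two sources can stay completely separate. A learned model would replace the hand-picked segments and the independence assumption, but the transfer of learning from one set to another follows the same logic.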

What’s most important is that we can also discover new connections. The people who consume hard seltzer and like classic rock may in fact also show a propensity for being luxury car enthusiasts. Neither data set on its own would have revealed that or given us any indication of it.

Preserving privacy 

These advancements in AI and ML have made possible something that was nearly impossible even a few years ago. The best part is that there is no need to know any personally identifiable information about any person in either of the data sets. By making these predictions for groups of people, we are able to uncover valuable connections in a safe way, preserving and securing privacy.

Marketers can use this ability to better understand the appeal of their products to subgroups of their buyers, segmenting those buyers into ever more detailed interest groups (and not just demographic clusters) without breaching any privacy, relying too heavily on benchmarks or making assumptions based on stereotypes. With a much more detailed picture of their customers, marketers can identify relevant ways to truly connect with them.

But all this technology does not absolve marketers from making diligent judgement calls. We as researchers should not assume that marketers are statisticians or data scientists; our job is to help them make better decisions and to lay out clearly the limits of the solutions available to them. Ultimately, this also marks a major shift in that it reveals the need for marketers to become a lot more data savvy, not only to understand the possibilities these new capabilities bring but also to truly understand when and how to use them, and when to proceed with caution.