5 Ways to Improve Open-Ended Feedback Analysis in 2025
Editor's note: This article is an automated speech-to-text transcription, edited lightly for clarity.
Caplena sponsored one of the eight sessions in the Quirk’s Virtual Sessions – AI and Innovation series on January 30, 2025. Pascal de Buren, co-founder and co-CEO, Caplena, gave the presentation on improving open-ended feedback analysis.
De Buren gave five tips to ensure high-quality analysis using AI technology. He used a case study to help highlight the tips and why they are useful.
Session transcript
Joe Rydholm
Hi everybody and welcome to our presentation, “5 Expert Tips to Master Open-Ended Feedback Analysis in 2025.”
I’m Quirk’s editor Joe Rydholm – thanks for joining us today. Just a quick reminder that you can use the chat tab if you’d like to interact with other attendees during today’s discussion. And you can use the Q&A tab to submit questions to the presenters and we will get to as many as we have time for at the end.
Our session today is presented by Caplena. Enjoy!
Pascal de Buren:
Hi, welcome everyone to today's webinar, “5 Expert Tips to Master Open-Ended Feedback Analysis in 2025.”
My name is Pascal, today's host for this session, self-proclaimed expert on open-end analysis.
Of course, as part of our everyday work at Caplena, we deal a lot with open-ended feedback. We advise clients on how to use them, how to embed them in their surveys or how to leverage online reviews.
Today I would like to share these tips with you. I think most of them will be relevant for everyone and of course it's going to involve AI, but not only that. So, I hope you're not tired yet of all the AI buzz.
Well open-ends have always been a useful tool in the market research toolbox. We've surveyed our customers and more than 6% are planning to increase the use of open-ends.
Of course there are various reasons for that. Usually people cite reducing bias, giving the respondent more freedom and, hopefully, getting less survey fatigue, just to mention a few.
Then there is the question of why they are not asking more open-end questions. The main reasons all correlate a bit with each other: the high effort of analyzing them, the time constraints researchers have and the turnaround time for the studies. And of course, open-ends are also associated with high cost.
I would argue that all of these can be managed and drastically improved by using effective methods for analysis. There are five tips from my side here to get there.
Of course, the whole topic is complex. I'm not claiming that a half-hour session is all you need to completely master it; that's why this is a whole profession. But there are things you can do, using AI and other methods, to really accelerate this, and 2025 will also bring additional developments that you may be able to leverage.
So, tip one: topic templates are inferior to AI, which in turn is inferior to context-aware AI.
Well, what are topic templates? Some also call them codebook templates.
Essentially, if you're doing similar kinds of studies, or you have an internal hierarchy of how you classify your touchpoints, you come up with some pre-made topics and hope the responses magically fall into them. That can work, but what customers actually write, if the responses are real and not fake, will usually be different things and won't quite fit the template you bring.
So it's natural that it makes sense to at least adapt that template to the data. You can do that manually, of course, but using AI is quicker.
What is context-aware?
Well, I think it's easiest if I just show you an example.
Let's hop into a study about consumer electronics, in this case TVs. Using, for example, a summarization tool, literally some kind of LLM, I can already see the main topics. That's probably also what I would get from a template, or what I would put in if I were the person looking at this data: picture quality, something around price, something around features, of course. But we will quickly see that by using a specialized AI system, we can actually get much more detail and more context-specific topics out of it.
Here we see topics generated from the actual feedback, and we can clearly see that there is much more here than what I would have put in through a fixed template that I apply to any NPS study.
But I would argue that this is still only about 80% good, maybe even a little less. So how can I get to high-quality topics?
Well, it does involve again some AI, but also a bit of my own experience, my own input as a researcher.
For example, in this case, I see only a few brand names listed here, which seems somewhat arbitrary. They are probably the most popular ones, but I would like to have all the brand names that were mentioned, because I want the analysis to be as complete as possible.
I can do that by using an LLM again: together with my prompt and the existing topics, it will do some magic and construct a resulting topic collection that is specific to this data set and covers my wishes for this analysis. Here I can simply add the topics that were suggested.
That allows me to work together with the AI to build a comprehensive codebook that is not just some generic template, or some generic first-iteration AI output, but actually considers what I want as a researcher, plus all the other context we can give: for example, who was surveyed, what other questions were asked (which may be relevant to which topics we want analyzed here), and of course insights into the industry and the product we are testing. Everything I can get as text from somewhere, I can input to the LLM.
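As a rough sketch of how this kind of context-aware prompting might look in practice (this is not Caplena's actual implementation; the call_llm helper, prompt wording and example data are all hypothetical):

```python
# Minimal sketch of context-aware topic refinement with an LLM.
# call_llm() is a placeholder for whichever LLM client you actually use.

def call_llm(prompt: str) -> str:
    """Placeholder: swap in your LLM API call of choice."""
    raise NotImplementedError

existing_topics = ["Picture quality", "Price", "Smart TV features", "Sound quality"]

study_context = (
    "NPS survey about consumer TVs; respondents are recent buyers. "
    "We also asked about purchase channel and brand owned."
)
researcher_wish = "List every brand name that respondents mention as its own topic."

responses = [
    "The Samsung looks great but the remote feels cheap.",
    "Way too expensive for what LG offers.",
    "Smart TV apps keep crashing on my Sony.",
]

def build_prompt() -> str:
    """Combine existing topics, study context and researcher input into one prompt."""
    sample = "\n".join(f"- {r}" for r in responses[:200])  # keep within the context window
    return (
        f"Study context: {study_context}\n"
        f"Existing topics: {', '.join(existing_topics)}\n"
        f"Researcher request: {researcher_wish}\n"
        f"Survey responses:\n{sample}\n\n"
        "Suggest additional topics the existing list misses. "
        "Return one topic per line; do not repeat existing topics."
    )

suggested_topics = call_llm(build_prompt())
print(suggested_topics)
```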
This I think is really powerful and allows us to be much quicker and more precise.
What we of course want to prevent in this process is too much duplication. That's also something these methods can help with: a combination of classical natural language processing and AI to detect which topics are potentially similar and should be merged, so that the analysis is really mutually exclusive and collectively exhaustive.
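One simple way to flag merge candidates, as a sketch: embed the topic labels and compare them pairwise. The model name is a common public sentence-embedding model, and the 0.8 threshold is an assumption you would tune on your own data.

```python
# Sketch: flag potentially duplicate topics via embedding similarity.
from itertools import combinations

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

topics = ["Picture quality", "Image quality", "Price", "Value for money", "Remote control"]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(topics)

# Pairwise cosine similarity; pairs above the threshold are merge candidates.
sim = cosine_similarity(embeddings)
THRESHOLD = 0.8  # assumption: tune on your own data

for i, j in combinations(range(len(topics)), 2):
    if sim[i, j] >= THRESHOLD:
        print(f"Consider merging: '{topics[i]}' and '{topics[j]}' (similarity {sim[i, j]:.2f})")
```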
Next, we have tip two, keep updating; no one likes taking decisions on outdated data.
Data is usually something that lives. Every company collects data on a recurring basis, some yearly, some daily, some every second. No matter what, you will be taking data from your customers, or to phrase it a bit more nicely, you'll be listening to the customers at a regular interval. Your analysis should keep this in mind and be updated from time to time.
What do we need to be updated about? It's three things.
One is changes in what the respondents talk about, what they write about online or in the survey. This means we need to be able to detect new topics in the data. For example, in the TV study, people might mention color accuracy as a new topic, maybe because we made a change to our product.
This is something that is outside of our current codebook, and we really want to be alerted about it. That's something these tools can provide: detecting that there is a change in new responses versus the old ones, comparing against what is already there, detecting gaps and suggesting new topics.
We of course also want to be on top of changes in distribution. For the existing codebook, what happened in terms of topic occurrence and sentiment distribution, and were there any significant changes? This is also a case for statistical testing over time, why not, to understand whether a change is statistically significant or just a fluke. Some topics are very rare, and then even a doubling doesn't necessarily mean the change is statistically significant.
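To make the idea concrete (all counts below are invented), a standard two-proportion z-test is one way to check whether a topic's share really changed between two waves:

```python
# Sketch: did the share of a topic change significantly between two survey waves?
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts: mentions of "color accuracy" out of all responses per wave.
mentions = [12, 25]       # wave 1, wave 2
responses = [1000, 1050]  # total responses per wave

stat, p_value = proportions_ztest(count=mentions, nobs=responses)
print(f"z = {stat:.2f}, p = {p_value:.3f}")

# A doubling of a rare topic may or may not clear the significance threshold,
# which is exactly why eyeballing the change is not enough.
if p_value < 0.05:
    print("Change is statistically significant at the 5% level.")
else:
    print("Change could plausibly be a fluke.")
```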
Last but not least, we want to stay on top of quality: be sure that we are actually still in the green zone, that our AI is performing well and that there's no deterioration in how it's analyzing the feedback. How we do that, how we monitor and validate this, I will come to in just a bit.
In summary, it's really about these three things that we need to be monitoring over time: we need to check for new topics, we need to check for changes in the existing topic distribution and we need to check on quality. Of course, the more real-time our data collection is, the more we need tools to help us with these three things; we're not going to be able to go into a project manually every week to see whether anything new has been mentioned. We want this to be automated, and that's of course what a tool can do for us.
Tip three, explore the data.
I think we now have the luxury of not thinking of open-ends as a chore, where we just need to get the coding done, produce two or three graphs, put them in PowerPoint and be done with it. We can really dig deeper and explore this data in a more interactive way using a large language model, acting like a chatbot partner or a colleague who has read all that feedback and is happy to give me answers.
This is truly something that I believe will transform how we work with open-end data. It's not a replacement for the quantitative analysis that we could do before, either manually or in an automated way, but it allows us to go beyond that, to drill down and also make the data more accessible to non-research people who just want to get their questions answered.
What we are seeing is that we humans are only now beginning to understand what we can ask a generative AI to do for us. There's still so much going on in terms of what we can prompt it about. What is it good at? Where does it perform? What can it really do? The use cases seem so vast.
It's important to see some examples and really understand what precisely we can ask to get some good insights.
For example, we're able here to produce any kind of summary we want about the topics, but we can also combine the quantitative and the qualitative aspects of this kind of analysis by asking specific questions of the data.
So, for example, I could ask, 'What do our respondents say specifically about smart TV features?' I would like to get more detail around this negative 2.3%, for example.
We can see here the share of sentiment on the right-hand side, so really what's positive versus negative, and then of course a qualitative overview of what people are saying about the smart TV features.
Of course I can also correlate this with other variables. Let's, for example, look at correlations with the satisfaction score.
We can see here that smart TV features are actually quite a strong negative driver in this case.
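As a back-of-the-envelope sketch of that kind of driver check, assuming you already have a coded data frame (the column names and values below are invented), you could correlate a binary topic indicator with the satisfaction score:

```python
# Sketch: is mentioning "smart TV features" associated with lower satisfaction?
import pandas as pd
from scipy.stats import pointbiserialr

# Hypothetical coded data: one row per respondent.
df = pd.DataFrame({
    "mentions_smart_tv": [1, 0, 1, 1, 0, 0, 1, 0, 1, 0],
    "satisfaction":      [4, 9, 3, 5, 8, 10, 2, 7, 4, 9],
})

r, p = pointbiserialr(df["mentions_smart_tv"], df["satisfaction"])
print(f"point-biserial r = {r:.2f}, p = {p:.3f}")
# A negative r indicates the topic is a negative driver of satisfaction.
```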
So, what do people want changed with respect to smart TV features? How could I actually improve my satisfaction score?
For example, here it said, okay, we need to focus on the compatibility and of course on addressing the box. And we can of course go deeper and ask more questions. Again, as researchers, we can keep digging and always get more nuggets out of this data.
What I can also do is correlate this with other kinds of variables, not just the satisfaction score, but also understand which segments, for example, are particularly negative or positive towards a certain topic.
Here, for example, I can see that some brands show a really strong difference, and I can also test for significance, all in an interface where, anytime I want to dig deeper, I can ask questions.
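That segment comparison can be sanity-checked with an ordinary chi-square test on a contingency table; as a sketch, with invented brand names and counts:

```python
# Sketch: do brands differ significantly in how often a topic comes up negatively?
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts of negative vs. non-negative mentions of "smart TV features" per brand.
table = pd.DataFrame(
    {"negative": [40, 15, 22], "not_negative": [160, 185, 178]},
    index=["Brand A", "Brand B", "Brand C"],
)

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.4f}")
```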
That's what we think is powerful about LLMs. It's not just that they will produce an answer to everything, but that they can pull up the quantitative data to support their arguments. Because, as we will see later, the raw output of an LLM can be terribly wrong.
Tip four, healthy skepticism about AI output.
Given the recent news with DeepSeek, and all the major AI companies pushing out new models every couple of months, it's difficult to even keep up with the changes happening. It's important not to be overwhelmed and to keep a scientific mind about it. For that, we need to understand a little bit how AI works, be a little skeptical about the output, and be able to verify it and monitor its accuracy.
How do we do that?
Well, as mentioned, we first need to understand these engines as much as possible. They are quite black-box, governed by the math of millions and millions of parameters. But what's really helpful to understand is what they're actually trained on, and that is so-called next-token prediction.
Given all the input text we provide, plus what is in the training data, what this AI has seen on the internet, it will produce the most likely next word. This reinforces certain patterns that we will see.
In this case, we made a small example with just eight pieces of customer feedback in total. Three of them said the product is bad, three said it's good and two said it's bad, but in another way. So, we would have five out of eight that are negative.
We ask it, 'Okay, what's the percentage of people saying it's bad?' The interesting thing is that, at first glance, it gets everything right. It computes the percentage correctly: it takes the bad responses, divides by the total responses and multiplies by 100. Fantastic.
By the way, if you replay this yourself you may get a different answer; I'd be excited to see it. But in this case, with ChatGPT-4, we got a response saying that essentially four out of eight are bad, 50%, which, whichever way we look at it, is just plainly wrong.
If we take 'bad' literally, it would be three out of eight. If we take 'bad' semantically, just meaning a negative response, we get five out of eight. What it produced was something in between, and probably not by chance 50%: it just lands very often on quite common numbers and common responses.
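Just to make the correct arithmetic explicit (this mirrors the toy example above, nothing more):

```python
# The toy sample: 3 responses literally say "bad", 2 more are negative in other words, 3 are positive.
total = 8
literally_bad = 3
negative_overall = literally_bad + 2  # 5

print(f"Literal reading:  {literally_bad / total:.1%}")    # 37.5%
print(f"Semantic reading: {negative_overall / total:.1%}")  # 62.5%
# The LLM's answer of 50% matches neither reading.
```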
We also see this with qualitative aspects. As a follow-up, we asked, 'How many responses specifically said the product was bad?' And again, it said four, which is just somewhere in between.
But what we find really interesting is what happens when you ask about qualitative aspects. Remember, we put in eight very generic responses, and I now ask, 'What are the main drivers of customer satisfaction in this sample?' Clearly, if we're honest, we cannot say anything about that given the data we have.
That would be the response I would hope for from a human. But in this case, given how the AI was trained, and how human-preference fine-tuning pushed it to produce convincing-looking output, we get something else.
We can see that it made up all kinds of things about quality, value for money, performance and user experience. At first glance those look convincing, but they were just made up. And of course we don't really want that. We like that it's always able to say something, which can be good in many cases, but we don't want fabricated output.
How can we address that?
Well, in our case, what we saw is that you can actually verify the output and get a very accurate measure, or at least a feeling, of how the AI is performing just by manually reviewing a small, intelligently selected sample of the data. That gives you both an intuition and a real metric, a number you can put a stamp on and really make decisions upon. And of course there are various methods to do this.
We sample the responses in such a way that the most unbalanced ones, the most difficult ones, are shown to a user. With that review, we can then estimate a quality score that we can also monitor over time without the user or users having to do anything further. The review really just calibrates the system so we understand where we are at with AI accuracy, and from there we monitor it over time.
It's been a powerful tool to make sure that we are not falling victim to made-up output.
So, just take a few smartly selected samples, check them and then do some math on that.
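As a simplified sketch of that math (Caplena's sampling and scoring are more sophisticated; a Wilson confidence interval on a reviewed sample is just one standard way to bound the estimate, and all numbers are invented):

```python
# Sketch: estimate AI coding accuracy from a small manually reviewed sample.
from statsmodels.stats.proportion import proportion_confint

reviewed = 150   # rows a human checked
ai_agreed = 132  # rows where the human agreed with the AI's topic assignments

accuracy = ai_agreed / reviewed
low, high = proportion_confint(ai_agreed, reviewed, alpha=0.05, method="wilson")

print(f"Estimated accuracy: {accuracy:.1%} (95% CI {low:.1%} - {high:.1%})")
# Track this number per wave; a drop outside the interval is a signal to re-check the setup.
```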
And this brings me already to the last tip. Tip five: more granular is generally better.
What should be more granular? Well, in the case here, of course the topics you are working with.
For example, let's look again at the same data, this TV dataset. We've done one analysis where we let the AI fully generate the topics and be really specific about it; we didn't do any prompting to customize it, but we see that we get quite a lot of topics, for example around the quality of the device, and then also specifically about the software. I compared this to what we also often see happening: being more coarse and just saying, well, there's 'quality', there's something around 'performance', where ideally these would reference a specific aspect of quality. Here, for example, 'reliability' and 'performance' can relate to the software, to the device itself or to the remote.
When I only have these coarse topics instead of the specific ones, it manifests itself as more variability in how the topics are assigned, whether a human does the coding or the AI does it. And of course you lose detail.
In my view, there's really no reason to be so coarse anymore. I understand that if you come from a template, for example, it will be coarse, or if you come from a context of 'I need to manage these topics and want as few as possible.' That's true if you have to do it manually, but we don't really have to do that anymore. We can use AI to give us every detail, to be as detailed as possible, and if we then want a more aggregated view, we can still aggregate later on: we can aggregate when we visualize, we can aggregate when we summarize.
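A sketch of that 'aggregate later' idea: keep the granular codes in the data and only roll them up at reporting time (the topic names, counts and mapping below are invented):

```python
# Sketch: code granular topics, aggregate to coarse categories only for reporting.
import pandas as pd

# Hypothetical granular topic counts from the coded study.
granular = pd.Series({
    "Software reliability": 34,
    "Remote responsiveness": 12,
    "Panel uniformity": 21,
    "App crashes": 18,
    "Boot time": 9,
})

# Invented roll-up: map each granular topic to a coarse parent.
parent = {
    "Software reliability": "Software",
    "App crashes": "Software",
    "Boot time": "Performance",
    "Remote responsiveness": "Hardware",
    "Panel uniformity": "Hardware",
}

coarse = granular.groupby(granular.index.map(parent)).sum()
print(coarse)
# You can always go from granular to coarse for a chart; the reverse is impossible.
```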
We see here, for example, our AI quality score dropping by six or seven points when going from the more specific topics to these coarser ones. The coarse version is still very good, comparable to how we would assign the topics ourselves, but the AI is simply more certain with the specific topics. A human will also be more certain in how to assign them; you get a more stable outcome if the topics are more specific.
All right, those were the five tips. Of course there is much more to be said regarding open-end analysis. We encourage you to stay methodical about it and not to blindly trust these now really powerful AI systems just because we have them.
Again, keep a healthy skepticism and check what you get. These systems are adaptable; it's not a one-off thing. We can adapt and improve any kind of AI system, and we should leverage that rather than simply accept the first iteration we get.
With that, I'm at the end of my wisdom and would like to open it up for questions. I'm looking forward to all the questions you have about the problems you encounter when analyzing open-ends. There are so many, from asking positive versus negative questions and what's easier to analyze, to all these kinds of subjects.