Guiding LLMs to Understand Your Customer Feedback
Editor’s note: Automated speech-to-text transcription, edited lightly for clarity.
On September 20, 2023, Caplena gave a webinar on how to use LLMs to understand customer feedback. The issue with this type of analysis is the AI's ability, or lack thereof, to contextualize the information.
Maurice Gonzenbach, co-founder of Caplena, walked attendees through the best ways to gain the insights you want with ChatGPT and with Caplena's own platform. He used examples of clients who have been using AI to gain insights in effective ways.
Watch the full video or read the transcript below from September’s Wisdom Wednesday session with Caplena.
Webinar transcription:
Joe Rydholm:
Hi everybody and welcome to our webinar “Contextual calibration: Guiding LLMs to understand your customer feedback.” I'm Quirk’s Editor, Joe Rydholm and before we get started, let's quickly go over the ways you can participate in today's discussion. You can use a chat tab to interact with other attendees during the session and you can also use the Q&A tab to submit questions to the presenters during the session. And we'll answer as many as we have time for afterwards during the Q&A portion. Our session today is presented by Caplena.
Enjoy the webinar.
Maurice Gonzenbach:
So welcome again to this session on “Contextual calibration: Guiding LLMs to understand your customer feedback.” I have actually spent most of my career analyzing customer feedback or working on solutions that help companies do so. And already in my studies I worked on sentiment analysis, which is still part of these products and these processes today. So I'm excited to share a bit on what has happened here in recent years and what current state of the art approaches are.
Now, I often like to start presentations with a relevant quote, and whereas you used to go to Google back in the day to do that, nowadays obviously you ask ChatGPT, right? So that's what I did.
I asked ChatGPT to give me a quote that involved machines and accuracy, and of course, reliable as always, ChatGPT didn't hesitate and spat out a number of quotes that actually matched my query pretty well.
The fourth one I especially like: “Machines can calculate with relentless precision, but it is the human touch that brings accuracy to life.”
Beautiful. But being a scientist, I also like to verify my sources. So I Googled it quickly to see who this can actually be attributed to. And it turns out there is actually no sign of this being anywhere on the internet.
Now the same applies to 80% of the other quotes; ChatGPT had just invented them on the fly. And I think this illustrates, even better than a quote could, that there is a strong need for tools that actually ensure the accuracy of LLMs when we use them to analyze our customer feedback.
So with that, let's get started into the session and there are three topics, three chapters I want to cover today.
First of all, I want to introduce the problem briefly, then I want to touch upon the potential tools that are out there to solve the problem. And the largest part will then be on this analysis lifecycle for open-ended feedback during which I will focus on this quality assurance part.
So what is the issue we're dealing with?
Hopefully we have a large amount of open-ended feedback that we want to understand, and to be able to understand it, we probably want to do some kind of categorization. So we want to get it into topics or themes or codes, whatever you want to call them. And we want to have that in a hierarchical structure. And we probably also want to have the topic-level sentiment for each of these themes.
So how do we get there? And there are three basic ways we can tackle this.
First of all, the classical solution: fire up Excel. And this is not always the wrong solution. I think many of us have done coding on our own in the past, and if you have only a couple dozen responses, then in my opinion this is actually the correct way to go about the problem: you should read every single one of them and do the analysis yourself.
Obviously, when the number of responses, the number of rows, grows larger, this becomes extremely tedious and it also becomes very error-prone. People are prone to messing up rows at some point, skipping columns or just inadvertently inputting the wrong code ID.
So a second option is to use do-it-yourself tools, for example ChatGPT. An example of how this could work is given on the left-hand side here: we give it a prompt where we provide the topics that we want to have and the response, or the text comment, that we want categorized, and we ask ChatGPT to give us the solution. And this works pretty well if the number of topics is somewhat limited and if the text comments are generally understandable and don't require too much context. But there are of course challenges when it comes to scalability.
If you have tens of thousands of rows, you probably don't want to input them into ChatGPT manually; you'll have to use an API, and then you're quickly going to go down the rabbit hole of building an entire platform yourself.
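As a rough illustration of what such an API-based categorization could look like, here is a minimal Python sketch. It is hypothetical, not Caplena's pipeline: the topic list, the example comment and the model choice are invented, and you would need your own OpenAI API key.

```python
# Minimal illustrative sketch: categorize one comment against a fixed topic
# list via the OpenAI chat completions REST API. Topics, comment and model
# are invented for the example; set OPENAI_API_KEY before running.
import os
import requests

TOPICS = ["Delivery speed", "Product quality", "Customer support", "Pricing"]

def categorize(comment: str) -> str:
    prompt = (
        "Assign the following customer comment to one or more of these topics: "
        + ", ".join(TOPICS)
        + ". Answer with the matching topic names only.\n\nComment: " + comment
    )
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            "model": "gpt-3.5-turbo",
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(categorize("The parcel arrived two weeks late and the box was damaged."))
```

In practice you would loop this over all your rows, batch the requests and handle rate limits and errors, which is exactly where the "building an entire platform yourself" rabbit hole starts.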
The second issue is quality control. If you want to change a topic, for example, or add a new one, or give it this context and your internal lingo, that is very hard to do in this interactive way, and you will spend a lot of time redoing the entire analysis again and again.
So there are of course specialized tools out there, which are the third option, and I'd like to call them augmented intelligence platforms, as many of these platforms actually try to make the human and the AI collaborate. I'm not going to go into these steps in detail yet; we'll touch upon them later.
But the idea is that you use the number-crunching prowess of the AI, but then apply the domain knowledge and the project knowledge of the human, making the two collaborate in a very efficient manner.
So there are a number of tools out there that do that, among them, no secret, us, but also a couple of others that all approach this slightly differently. And of course in my presentation today, I'm going to focus on how we do that and how we see the best solution for these open ends.
And with that, let's get to the meat of today's presentation and that is this analysis lifecycle for open-ended feedback.
Now, once you have collected your data, the first step is to actually identify and define the topics that you want to quantify or track over time.
The second step is to do this quality assurance.
A third step is the dashboarding and reporting.
And finally, you hopefully want to generate some insights out of these open ends.
And this whole thing is actually a cycle if you are doing anything from a CX project to a multi-wave project or a transactional feedback system, because you then want to make sure that you identify new topics again and do the quality assurance again when something new has popped up. So how do I get to this topic list that I want to have quantified and tracked?
And the challenge here is getting from this list of keywords, which is how traditional tools have often worked (you define a number of keywords which lead to a specific topic), to more of a hierarchical, high-level theme organization like on the right-hand side here. And this looks pretty simple if the survey has only a very limited scope.
But oftentimes in reality we see that these topic collections can become quite elaborate.
So take Lufthansa, one example client of ours: they have something like 400 specific topics that they want to track, very in-depth things like the responsiveness of the in-flight entertainment system. That is a relevant thing for them and something they actually want to relate back to an internal KPI.
You also want to make sure that the topics are “MECE”, meaning Mutually Exclusive and Collectively Exhaustive, so well separated and actually covering the entire space of topics that you are interested in.
Now how we get there has changed dramatically in the past two years. Now with these LLMs you can actually make this a highly interactive process. And what we do here is that we first suggest topics that appear in the data, but then we ask the user for feedback on these generated topics. And we do that in a very specific way.
So we don't just ask ‘What do you think about this?’ But we ask, ‘Hey, do you think that maybe these topics, creative canvas and visual thinking could be the same thing? And if yes then please remove one of them.’
So very specific questions guiding the user to get to this MECE topic collection.
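As a purely hypothetical illustration of how such guided checks could be generated (this is not Caplena's actual implementation; the suggested topics are invented), one could ask an LLM whether any two suggested topics overlap and then put only the flagged pairs to the user as specific merge questions:

```python
# Hypothetical sketch: generate "could these two topics be the same thing?"
# checks for every pair of suggested topics. Each question could be sent to
# an LLM first, and only pairs flagged as overlapping would be shown to the
# user as a specific merge suggestion. Topics are invented for this example.
from itertools import combinations

suggested_topics = ["Creative canvas", "Visual thinking", "Template library"]

def overlap_question(topic_a: str, topic_b: str) -> str:
    return (
        f'In a survey codebook, could the topics "{topic_a}" and "{topic_b}" '
        "refer to the same thing? Answer YES or NO with a one-sentence reason."
    )

for a, b in combinations(suggested_topics, 2):
    print(overlap_question(a, b))
```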
Now for the second step, this quality assurance, there is sometimes this sentiment out there where people say, ‘Well, I have AI, all is fine.’ But as the introductory example showed, you actually want to be sure that the AI does the thing you want it to do.
Let me give an example here: eBay. Their VP of Insights, who procured our tool, actually shows the results to the CEO every second month. So if you do that, you want to be sure that these results are really tangible and not some kind of random hallucination of the AI. And the best process we see for doing that is reviewing a couple of rows.
So this means nothing else than looking at a couple of the AI's topic assignments and either confirming their correctness or potentially correcting them. And this will help the AI learn from the changes that you have made and apply them to the rest of the data.
And it'll also enable the AI to give you a quality score. So it will tell you how well it understands your categorization, how well your categorization matches that of the AI. And we also try to be very specific here and even give this score for specific topics, so that you can actually improve those where the AI still has issues understanding them in detail.
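A very simplified sketch of what such a per-topic quality score could be based on is shown below; the real scoring is more involved, and the reviewed rows here are invented:

```python
# Illustrative sketch of a per-topic quality score: compare the AI's topic
# assignments against a sample of human-reviewed rows and report how often
# they agree for each topic. This is a simplification, not the actual metric.

# Each review is (row_id, topic, ai_said_present, human_said_present)
reviews = [
    ("r1", "Delivery speed", True, True),
    ("r1", "Pricing", True, False),
    ("r2", "Delivery speed", False, True),
    ("r2", "Customer support", True, True),
]

def per_topic_agreement(reviews):
    stats = {}
    for _, topic, ai, human in reviews:
        hits, total = stats.get(topic, (0, 0))
        stats[topic] = (hits + (ai == human), total + 1)
    return {topic: hits / total for topic, (hits, total) in stats.items()}

for topic, score in per_topic_agreement(reviews).items():
    print(f"{topic}: {score:.0%} agreement on reviewed rows")
```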
Now this is also a step where things have dramatically changed over the last couple of years.
When we started out in 2016, you would probably have to give somewhere between 500 and 1,000 examples to really get a high-quality topic assignment, because the AI had to relearn a lot of things every time you did a new study and had a new topic collection.
Nowadays, the AI already has a huge knowledge and understanding of language out of the box and you actually mostly use this fine tuning to give it the context of your business, the context of your project, your specific lingo that you want to track.
And so the effort for this has come down by a couple of orders of magnitude, and now oftentimes you will be through such a fine-tuning process within maybe 15 to 30 minutes.
Now I already hear the question, ‘well why do I actually still need that? Is that really relevant?’ And the answer is yes.
In many cases if you do want to have this high quality analysis which is on par with the human categorization, then you probably will want to do this.
And to show you why, we did a couple of experiments where we compared our AI to ChatGPT, and we did that on a couple of sample studies, 17 studies in total, and benchmarked the score. So here you see how well our AI does compared to ChatGPT: if it is in the white area here, it means ChatGPT did better; if it's up here, it means our AI did better.
Now, out of the box the scores are somewhat similar: the Caplena AI is slightly higher at 47, ChatGPT only at 35, and to be fair, both scores are not quite satisfactory yet. We then did this fine-tuning, and here we really gave ChatGPT the benefit of the doubt, working around its limitations by giving it the instructions through few-shot learning techniques.
But here the difference really showed, with Caplena getting a score of 59 and ChatGPT 32. And this really shows that this fine-tuning is still a very relevant technique and is currently still the most effective way of providing this context to the LLM.
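For context, a few-shot setup like the one used for the ChatGPT baseline roughly means prepending human-labeled examples to the prompt. The sketch below is illustrative only; the topics and labeled examples are invented, and this is not the exact benchmark prompt:

```python
# Illustrative sketch of few-shot prompting: human-labeled examples are
# prepended to the categorization request so the model sees project-specific
# lingo before classifying a new comment. Topics and examples are invented.
TOPICS = ["Delivery speed", "Product quality", "Customer support", "Pricing"]

labeled_examples = [
    ("Took ages to arrive but the courier was friendly.",
     "Delivery speed; Customer support"),
    ("Great value for the money.", "Pricing"),
]

def build_few_shot_prompt(new_comment: str) -> str:
    lines = [
        "Assign each comment to one or more of these topics: " + ", ".join(TOPICS) + ".",
        "Here are examples that were labeled by our team:",
    ]
    for text, labels in labeled_examples:
        lines.append(f'Comment: "{text}" -> Topics: {labels}')
    lines.append(f'Comment: "{new_comment}" -> Topics:')
    return "\n".join(lines)

print(build_few_shot_prompt("The package was broken when it arrived."))
```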
Of course, we are experimenting in many ways with how to get to the next step here, and also experimenting with a number of different variants of giving this context, for example by providing a more elaborate description of the topics, or a more elaborate description of your industry and of the project you're doing. But so far it has really shown that, from a quantitative standpoint, this fine-tuning through labeling is still the most effective way of doing this, both time-wise and in terms of the results that you achieve.
Now, it's of course important to recognize that ChatGPT is an amazingly versatile tool like most of these general purpose LLMs out there, and we actually make use of them for specific features already.
So we have introduced this summarization feature where you can look into one of the themes that you have quantified before and ask ChatGPT to give you a prose summary of the rows in that specific section. So we are actively leveraging their technology in our application as well, just not, at the current moment, for the core part, the categorization itself.
Okay, so now I have categorized my feedback, and I have done that in a high-quality manner. The next step is actually bringing these results back to our stakeholders in the organization.
And I would like to show you how Miro does that because I think they exhibit this combination of push and pull in a very nice way.
So Miro has these yearly strategic insights reports: highly customized and detailed reports where they discuss the strategic trends and deliver them in a beautifully designed report to the management team. Then they have quarterly top feature request reports.
These go to the product leadership and are already a bit more standardized. There's still some custom research behind them, but they're more standardized than the strategic report coming out yearly.
On a monthly basis, they have trends and pulse checks, and these are already highly standardized. So this is basically a template which is filled in with the most current data, and it is already accompanied by dashboards as well that the stakeholders can then look into themselves.
And finally, they have these live dashboards that are available at any time, that the product managers and user experience researchers can access at any moment on their own time, and that will always contain the most current data, so those stakeholders can use them to dig into the data themselves.
And I think this combination is a highly effective example of how to make sure that your results are also heard in the organization and don't just sit in some system without being employed.
Now the holy grail is of course generating insights beyond just simple counting of topics. And while there is no guaranteed way to get to these kinds of insights, we do see a sort of process, every now and then or actually quite often, that companies employ to get to these interesting insight nuggets.
It starts off by identifying a relevant topic that you want to dig into. You can do that by either just taking the most common topic, or you can do a bit more advanced analysis and do a driver analysis.
For example, that means mapping the importance of topics based on an overall satisfaction score, and it can then turn out that topics that are mentioned often might not necessarily be the ones that also drive satisfaction.
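As a toy illustration of that kind of driver analysis, the sketch below compares how often each topic is mentioned with how strongly its presence correlates with the overall satisfaction score; the data is invented, and a real analysis would use a proper regression on the full survey export:

```python
# Illustrative driver-analysis sketch: compare how often a topic is mentioned
# with how strongly its presence correlates with the overall satisfaction
# score. Data is invented; frequency and correlation often diverge.
import numpy as np

# rows: (satisfaction 0-10, topics mentioned in the open end)
rows = [
    (9, {"Product quality"}),
    (3, {"Delivery speed", "Customer support"}),
    (7, {"Pricing"}),
    (2, {"Customer support"}),
    (8, {"Product quality", "Pricing"}),
    (4, {"Delivery speed"}),
]
topics = sorted({t for _, ts in rows for t in ts})
scores = np.array([s for s, _ in rows], dtype=float)

for topic in topics:
    present = np.array([topic in ts for _, ts in rows], dtype=float)
    frequency = present.mean()
    # Pearson correlation as a crude proxy for "driver" strength
    driver = np.corrcoef(present, scores)[0, 1]
    print(f"{topic:20s} mentioned in {frequency:.0%} of rows, "
          f"correlation with satisfaction: {driver:+.2f}")
```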
So you have now identified a topic you want to explore. The second step is then digging into it and doing some more in-depth analysis.
For example, you could correlate these topics with others, or you could correlate them with third variables: demographics, closed questions, whatever other data you have in these surveys or in the CX data. This is step two. And this helps you to find interesting segments, interesting bits that behave differently than maybe the average.
And finally, you drill down into this combination of different variables. You filter by these, you dig into them, you let them be summarized by the AI to understand why those segments behave differently.
And again, of course it doesn't always work, but it can be used as a kind of recipe to try to find some of these interesting nuggets that go beyond a simple count or percentage.
Now I'm getting towards the end of this presentation, so just to summarize again what I said: this automation versus quality trade-off is like walking a thin line, and you can get to a very high degree of automation without sacrificing a lot of quality.
But once you go to a hundred percent, you often lose a significant part of the quality. So our recommendation here is really aiming for 90 to 95% automation, but making sure that you keep a grip on it and that you do these qualitative checks as well, to make sure the quality is where you want it to be.
Now, our marketing would of course not let me off the hook unless I show at least two marketing slides. So here we go.
We help companies like eBay and Miro, both of which I mentioned, really produce these tangible, actionable results.
eBay mostly uses us to understand the root cause of issues and make sure that their management is on top of the currently biggest pain points of their customers.
For Miro, the core use case is shaping their roadmap. That is absolutely essential for a product driven, product led company like they are. And it helps them prioritize the roadmap based on customer feedback.
We of course help a number of other clients as well, over 150 clients by now, with a strong foothold in Europe, but also quite a few U.S. companies that we support.
And if you would be interested in having a chat with us and seeing if your use case fits our application as well, then I'd of course be very happy to have that conversation with you and explore if this might be a solution for you as well.
And now if you have any general questions on customer feedback on LLMs or on our solution specifically, I'd be very happy to answer them live now or also by e-mail at a later point in time.
Thank you so much for your attention and have a great day.