Take advantage of train-test methodology
Editor’s note: Sheilah Wagner is the director of data sciences at ENGINE Insights. This is an edited version of a post that originally appeared under the title “Train-Test Methodology to Enhance Brand Building Models.”
Whether modeling a brand’s performance, forecasting key changes to enhance brand revenues or creating clusters of customers with varying needs and wants from a brand, a train-test methodology can help fine-tune brand building models.
While circumstances are not always conducive to applying a train-test methodology, researchers should recognize, plan for and seize the opportunity when it presents itself.
The application detailed below is straightforward, and confirming brand models on held-out data demonstrates their robustness and builds confidence in the future use of a brand measurement tool.
When to use train-test methodology
It is not always possible in survey research to obtain a sample size robust enough to apply a train-test methodology. However, it is an important step to verify clustering algorithms, regression models, classification tree predictions and the like.
It may not always make sense to split into train-test files. For instance, if the data set is small, creating even smaller fractions just makes both sets less robust.
When we do have the luxury of a larger data file that allows us to take advantage of splitting into train and test, it’s important to know the benefits and best practices.
Single source model building drawbacks
A model built using only a single data set might pick up random effects and overfit to a unique pattern in that particular data, which may not generalize to future unseen data. Therefore, when a good estimate of model performance is critical, train-test is most appropriate.
Train-test benefits
A train-test split is beneficial in estimating the performance of machine-learning algorithms. A model is developed based on observations in a train data set, and then that model is applied to the test data to determine how well it predicts on unknown data.
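As a minimal sketch of this workflow, the base R example below fits a linear regression on a train subset and then scores it on held-out test rows. The simulated data, the variable names (satisfaction, awareness, brand_score) and the 70/30 split are illustrative assumptions, not details from the original project.

```r
# Simulated survey data; all variable names are hypothetical
set.seed(22)
n <- 500
survey <- data.frame(
  satisfaction = rnorm(n),
  awareness    = rnorm(n)
)
survey$brand_score <- 2 + 1.5 * survey$satisfaction +
  0.8 * survey$awareness + rnorm(n)

# Randomly assign 70% of rows to train and the rest to test
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- survey[train_idx, ]
test  <- survey[-train_idx, ]

# Build the model on the train data only
fit <- lm(brand_score ~ satisfaction + awareness, data = train)

# Root mean squared error helper
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

train_rmse <- rmse(train$brand_score, predict(fit, newdata = train))
test_rmse  <- rmse(test$brand_score,  predict(fit, newdata = test))

# A test error close to the train error suggests the model generalizes
c(train = train_rmse, test = test_rmse)
```

If the test error were much larger than the train error, that would be a sign the model had fit patterns specific to the train respondents.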
When it is appropriate, perform any general cleaning, such as removing straightliners, rectifying quality issues or dropping respondents, on the overall data file before splitting so it only needs to be done once.
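One possible way to handle the straightlining step before splitting is sketched below in base R. The data file, the grid item names (Q1 to Q5) and the rule of flagging anyone who gives a single identical answer across the grid are assumptions for illustration; actual cleaning rules will depend on the questionnaire.

```r
# Hypothetical survey file: respondent ID plus five grid items rated 1-5
set.seed(22)
grid_items <- paste0("Q", 1:5)
survey <- data.frame(ID = 1:200)
for (q in grid_items) survey[[q]] <- sample(1:5, 200, replace = TRUE)

# Force a few straightliners into the file for illustration
survey[c(3, 17, 42), grid_items] <- 4

# Flag respondents who gave one distinct answer across the whole grid
is_straightliner <- apply(survey[, grid_items], 1,
                          function(x) length(unique(x)) == 1)

# Clean once on the full file, then split the cleaned file into train and test
clean <- survey[!is_straightliner, ]
nrow(clean)
```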
Train vs. test data
Next, determine what percentage of your data should go into the train and test data sets. Most of the data should go to train and the remainder to test. There is no set amount; typically 67-80% is allotted to train and the rest to test. Considerations include computational resources, the number of observations and the representativeness of each data set.
If the project is global and includes more than one country, we recommend splitting into train and test files at the country level and then aggregating all train files and all test files into global train and test data sets.
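The country-level approach could be sketched in base R as follows. The data file, the country codes, the helper name split_country and the 70% train share are hypothetical; a plain random split is used within each country to keep the example short.

```r
set.seed(22)
# Hypothetical multi-country file
survey <- data.frame(
  ID      = 1:600,
  Country = rep(c("US", "UK", "DE"), times = c(300, 200, 100))
)

# Split one country's rows into train and test (70/30 by default)
split_country <- function(df, train_share = 0.7) {
  idx <- sample(nrow(df), size = round(train_share * nrow(df)))
  list(train = df[idx, ], test = df[-idx, ])
}

# Apply the split separately within each country
by_country <- lapply(split(survey, survey$Country), split_country)

# Stack the country-level pieces into global train and test files
train <- do.call(rbind, lapply(by_country, `[[`, "train"))
test  <- do.call(rbind, lapply(by_country, `[[`, "test"))

# Each country contributes 70% of its own respondents to train
table(train$Country)
```

Splitting within each country first keeps every country represented in both files in proportion to its sample size.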
Ensuring balance across key variables
Often, the data is split on a random basis; however, there is no guarantee that key variables such as demographics are balanced in the two resulting data sets. If one of the data sets happens to contain more younger than older respondents, or an imbalance of gender or other key variables, results may be affected.
To keep each train and test file representative of the original sample, I recommend using a stratified sampling technique when applying the split so that the files match across key variables.
Simple R syntax for stratification on key variables
R syntax is readily available to ensure a balanced split. In the following example, the data file is split 70% train, 30% test and key variables of age, gender and region are stratified. Setting a seed guarantees consistent files will be generated if you need to re-run.
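library(splitstackshape)  # stratified() as provided by this package
library(dplyr)            # filter()
set.seed(22)  # fix the random seed so the split is reproducible
Train <- stratified(data, c("Age", "Gender", "Region"), size = 0.7)  # 70% of each stratum
Test <- filter(data, !(ID %in% Train$ID))  # remaining respondents form the test file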
Once files are split into train and test, compare to verify key variables are balanced.
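A simple way to make this comparison is to put the train and test proportions side by side for each key variable. The sketch below uses base R only; the simulated data, the category labels, the helper name compare_balance and the base R stratified split (a stand-in for whichever method produced the files) are all assumptions for illustration.

```r
set.seed(22)
# Hypothetical data file with two key variables
data <- data.frame(
  ID     = 1:1000,
  Age    = sample(c("18-34", "35-54", "55+"), 1000, replace = TRUE),
  Gender = sample(c("Female", "Male"), 1000, replace = TRUE)
)

# Stand-in stratified 70/30 split: sample 70% within each Age x Gender cell
strata <- interaction(data$Age, data$Gender, drop = TRUE)
train_rows <- unlist(
  lapply(split(seq_len(nrow(data)), strata), function(rows) {
    rows[sample.int(length(rows), round(0.7 * length(rows)))]
  }),
  use.names = FALSE
)
Train <- data[train_rows, ]
Test  <- data[-train_rows, ]

# Side-by-side category proportions for one variable
compare_balance <- function(var) {
  lv <- sort(unique(data[[var]]))
  round(rbind(
    Train = prop.table(table(factor(Train[[var]], levels = lv))),
    Test  = prop.table(table(factor(Test[[var]],  levels = lv)))
  ), 3)
}

age_cmp    <- compare_balance("Age")
gender_cmp <- compare_balance("Gender")
age_cmp
gender_cmp
```

Rows that differ by more than a point or two on a key variable would be a prompt to revisit the split.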
Models are built using the train data set; the test data is then run through the same parameters to confirm the model on a new set of respondents who were not used in model building.
Application of train-test methodology
Let’s look at a train-test methodology used for the purpose of segmenting a brand’s customers and potential customers into groups based on similar needs from brands in the category. To ensure the recommended clusters would be highly reproducible, the train-test methodology was applied. Numerous train-test data files were split 70%/30%, with demographic stratification at the country level, then aggregated into total train and total test data.
Due to a large number of total respondents, splitting the data files was also beneficial in reducing the run time of algorithms.
The train set was used to develop initial cluster solutions utilizing various algorithms such as k-means, bicluster and ensemble methods. For key proposed segments, the same algorithms were applied to the remaining test respondents to ensure the same clusters were found. The train-test methodology demonstrated that the clusters found in the train data were consistently reproduced in the test data.
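One way such a replication check might look is sketched below with k-means in base R. This is not the project's actual procedure: the simulated needs ratings, the variable names, the choice of three clusters and the agreement measure are assumptions for illustration. The idea is to cluster train and test separately with the same settings, then see whether test respondents land in matching groups when assigned to the nearest train centroid.

```r
set.seed(22)
# Simulated needs ratings with three well-separated groups (hypothetical)
make_group <- function(n, center) {
  cbind(need_price   = rnorm(n, center[1], 0.5),
        need_quality = rnorm(n, center[2], 0.5))
}
X <- rbind(make_group(300, c(1, 1)),
           make_group(300, c(4, 1)),
           make_group(300, c(2.5, 4)))

# 70/30 split of respondents
train_idx <- sample(nrow(X), round(0.7 * nrow(X)))
X_train <- X[train_idx, ]
X_test  <- X[-train_idx, ]

# Same algorithm and settings applied to each file
km_train <- kmeans(X_train, centers = 3, nstart = 25)
km_test  <- kmeans(X_test,  centers = 3, nstart = 25)

# Assign each test respondent to the nearest train centroid
nearest_train <- apply(X_test, 1, function(x) {
  which.min(colSums((t(km_train$centers) - x)^2))
})

# Cross-tabulate the two assignments for the test respondents;
# a reproducible solution shows one dominant cell per row
agreement <- table(test_solution  = km_test$cluster,
                   train_centroid = nearest_train)
agreement

# Share of test respondents falling in the dominant cell of their test cluster
replication <- sum(apply(agreement, 1, max)) / sum(agreement)
replication
```

Cluster labels are arbitrary, so the comparison relies on the cross-tabulation rather than on matching label numbers between the two solutions.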
Train-test results
Creating a model to build brand strength with train data and applying the results to a test data set can be very powerful in demonstrating to clients the robustness of the model’s application on future unknown data. So, when possible, take advantage of every opportunity to apply a train-test methodology.