Editor’s note: Kevin Gray is president of Cannon Gray LLC, a marketing science and analytics consultancy.
Propensity score analysis is used when experimentation is not feasible or as a recourse when experiments go awry ("broken" experiments). Its basic concepts were hammered out over the span of several decades by Jerzy Neyman, William Cochran, Donald Rubin and several other eminent statisticians, and the thinking of distinguished economist James Heckman has also influenced its development. Propensity score analysis in several variations has seen extensive use in medical research, economics, education, assessment of government programs and, more recently, in marketing research and predictive analytics.
Experiments, quasi-experiments and non-experimental research
First, why do we use experiments? We may wish to test the efficacy of some treatment or intervention such as medication, therapy or counseling or, in the case of marketing, to gauge liking for a new product. Randomized controlled trials (RCTs) are the gold standard in scientific research. A monadic product test where subjects (respondents) are randomly assigned to a "treatment" in which they use and evaluate Product A or Product B will be a familiar and very basic example from marketing research. In this case the main outcome variable is often a rating of overall liking or purchase interest for the product.
We use experiments to rule out as best we can apples-and-oranges comparisons. By randomizing allocation to a treatment group we minimize the chance that one product might have been preferred to another simply because the respondents in the two groups were different before they tried and evaluated the product. Through randomization we eliminate (or greatly reduce) the possibility that confounding variables are affecting the results, and we are able to have greater confidence that our conclusions are unbiased. The cost of a product failure can be enormous, as can the rewards of success, and product tests have been a major part of marketing research for many decades.
However, experiments are also open to criticism on numerous grounds such as cost or impracticality in some circumstances, and in place of experiments, non-experimental or quasi-experimental designs are frequently employed. Quasi-experimental research is similar to experimental research except that subjects are not randomly assigned to treatments, usually because random assignment is not feasible or would be unethical. Subjects may instead self-select into treatments or be assigned non-randomly by a program administrator based on perceived need for intervention or some other criterion.
Applications of propensity scores
Propensity scores are used in quasi-experimental and non-experimental research when the researcher must make causal inferences, for example, that exposure to a chemical increases the risk of cancer. Since randomized group assignment is not possible with these designs, potential confounders must be dealt with through other means. One is multivariate analysis, in which potential confounding variables are included as covariates (independent variables) to reduce group differences post hoc. In many situations, however, the number of covariates is very large or they are highly intercorrelated and, consequently, the results obtained from the model are unreliable or even nonsensical.
Propensity score analysis is often an alternative. It can be conducted in a variety of ways but the essential idea is to create a balance among the groups of interest on important covariates, such as age, gender or product use. We are attempting to mimic an experimental design after the fact. A single variable, the propensity score, is computed that captures how differences in these variables contribute to a subject's statistical probability of being in one group or another. The term propensity score really refers to a kind of index or composite variable that summarizes important group differences. Subjects (e.g., survey respondents, customers in a database) with similar propensity scores resemble each other with respect to these characteristics and those with very different propensity scores are dissimilar.
We cannot, in some Frankenstein-like fashion, re-create experimental data from data that have been collected through a non-experimental or quasi-experimental design. We can, however, adjust afterwards for imbalances among groups that could potentially contaminate our interpretation of the results. In marketing, analysis of possible cause-and-effect relationships tends to be less rigorous than in fields such as medical research, and the settings for which propensity score analysis was originally designed are less common in marketing research.
Even so, the core ideas of propensity score analysis have been successfully adapted and applied in marketing. One example is in direct marketing when results from a test mailing are used to score contacts in a database who have not received the mailing. Propensity scores are computed using demographic information and other characteristics to predict the likelihood of an individual responding and making a purchase. Marketing can then be tailored to individuals based on their estimated propensity to purchase. Customer relationship management (CRM) and shopper targeting are two other examples. Propensity score analysis is also employed in survey research to adjust samples for non-response, or reduce coverage or selection biases. Propensity scores can also be used before the fact, as when cities or stores are matched for tests of advertising or promotional campaigns.
Performing propensity score analysis
Propensity scores are typically computed using logistic regression, with group (treatment) status regressed on observed baseline characteristics such as age, gender and behaviors of relevance to the research. I'll briefly touch on how to select these variables later in the article. The propensity scores are the predicted probabilities of being in one group or another that have been derived from the model. Logistic regression is not the only method available and probit analysis, discriminant analysis, tree-based methods and other techniques are also used.
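To make the idea concrete, here is a minimal sketch of estimating propensity scores with logistic regression, written in plain Python so that nothing is hidden inside a library call. The toy data (a standardized age and a heavy-user flag), the function names and the fitting settings are all assumptions made for illustration; in practice a statistical package would do the fitting.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, t, lr=0.5, epochs=5000):
    """Fit a logistic regression by gradient ascent on the log-likelihood.
    Returns the weights [intercept, coef_1, ..., coef_k]."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, ti in zip(X, t):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            grad[0] += ti - p
            for j, xj in enumerate(xi):
                grad[j + 1] += (ti - p) * xj
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w

def propensity_scores(X, w):
    """Predicted probability of being in the treatment group for each subject."""
    return [sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) for xi in X]

# Hypothetical baseline covariates: [standardized age, heavy-user flag]
X = [[-1.2, 0], [-0.8, 0], [-0.5, 1], [-0.1, 0], [0.0, 1],
     [0.3, 0], [0.6, 1], [0.9, 1], [1.1, 0], [1.4, 1]]
# Hypothetical group status: 1 = treatment, 0 = control
t = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]

weights = fit_logistic(X, t)
scores = propensity_scores(X, weights)

mean_treated = sum(s for s, ti in zip(scores, t) if ti == 1) / sum(t)
mean_control = sum(s for s, ti in zip(scores, t) if ti == 0) / (len(t) - sum(t))
print([round(s, 2) for s in scores])
print(round(mean_treated, 2), round(mean_control, 2))
```

In this made-up sample the treated subjects are older on average, so their estimated propensity scores tend to be higher than those of the controls.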
Broadly speaking, propensity score analysis can be performed in a number of ways: propensity score matching, propensity score stratification, propensity score weighting and covariate adjustment. There are variations of these methods and a full discussion of their intricacies is well beyond the scope of this article but here's a snapshot of each.
Propensity score matching
Propensity scores can be used to create matched samples. Both one-to-one matching and one-to-many matching are used. In the latter, each treatment subject (e.g., a respondent or customer) can be matched to more than one control subject. Several matching algorithms are available, with greedy and optimal matching being two broad ways to classify these approaches. The matched sample can then be employed in subsequent analyses, including multivariate modeling, to estimate what the effect size of the treatment would have been had an experimental design been employed. Statisticians call this a counterfactual inference. A principal drawback of many propensity score matching methods is that sample size may be decreased because data from subjects that cannot be matched must be excluded from the analysis.
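As a rough sketch of how greedy one-to-one matching might work, the snippet below pairs each treated subject with the nearest still-available control, provided the two scores differ by no more than a tolerance (often called a caliper). The subject IDs, the scores, the caliper of 0.1 and the choice to process the highest-scoring treated subjects first are all assumptions made for illustration, not a recommended specification.

```python
def greedy_match(treated, controls, caliper=0.1):
    """Greedy 1:1 nearest-neighbor matching on the propensity score.
    treated, controls: dicts mapping subject id -> propensity score.
    Returns a list of (treated_id, control_id) pairs."""
    available = dict(controls)  # controls not yet used
    pairs = []
    # Process treated subjects from highest to lowest score (an assumed ordering)
    for t_id, t_ps in sorted(treated.items(), key=lambda kv: -kv[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_ps))
        if abs(available[c_id] - t_ps) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # each control is used at most once
    return pairs

# Hypothetical propensity scores
treated = {"T1": 0.80, "T2": 0.55, "T3": 0.30}
controls = {"C1": 0.52, "C2": 0.33, "C3": 0.10, "C4": 0.60}

pairs = greedy_match(treated, controls)
matched_treated = {t_id for t_id, _ in pairs}
unmatched = sorted(set(treated) - matched_treated)
print(pairs)      # T2 pairs with C1, T3 pairs with C2
print(unmatched)  # T1 has no control within the caliper
```

Here T1 is dropped because no control lies within the caliper, which illustrates the loss of sample size noted above.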
Propensity score stratification
An alternative to matching is to stratify subjects on the basis of their propensity scores. As with the other methods, the fundamental purpose is to adjust for prior differences between the treatment and control groups so that we have a more accurate reading of the extent to which they really differ on the outcome variable. Following this approach, subjects are ranked according to their propensity scores and grouped, for example into five equally sized strata (quintiles). Analyses are performed within each stratum and then pooled to estimate an overall treatment effect. Multivariate analysis by stratum is also possible provided sample sizes are sufficiently large, as is multiple-group structural equation modeling (SEM). Propensity score stratification is commonly used but impractical when the number of subjects is small.
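The stratification steps can be sketched in a few lines. The example below ranks subjects by propensity score, cuts them into five equally sized strata, takes the treatment-minus-control difference in mean outcome within each stratum and pools the differences weighted by stratum size. The data are fabricated so that the built-in treatment effect is exactly 1.0, and the function name and pooling rule are assumptions made for illustration.

```python
def stratified_effect(records, n_strata=5):
    """records: list of (propensity_score, treated_flag, outcome).
    Returns the stratum-size-weighted average of within-stratum differences."""
    ranked = sorted(records, key=lambda r: r[0])
    n = len(ranked)
    total, used = 0.0, 0
    for s in range(n_strata):
        stratum = ranked[s * n // n_strata:(s + 1) * n // n_strata]
        y1 = [y for _, t, y in stratum if t == 1]
        y0 = [y for _, t, y in stratum if t == 0]
        if not y1 or not y0:
            continue  # no comparison is possible in this stratum
        diff = sum(y1) / len(y1) - sum(y0) / len(y0)
        total += diff * len(stratum)
        used += len(stratum)
    return total / used

# Toy data: higher-propensity subjects are more often treated AND have
# higher baseline outcomes; the built-in treatment effect is 1.0.
records = []
for s, (n_treated, n_control) in enumerate([(1, 3), (1, 3), (2, 2), (3, 1), (3, 1)]):
    ps = 0.1 + 0.2 * s
    base = 5.0 + s
    records += [(ps, 1, base + 1.0)] * n_treated + [(ps, 0, base)] * n_control

treated_y = [y for _, t, y in records if t == 1]
control_y = [y for _, t, y in records if t == 0]
naive = sum(treated_y) / len(treated_y) - sum(control_y) / len(control_y)
adjusted = stratified_effect(records)
print(round(naive, 2), round(adjusted, 2))
```

In this contrived sample the unadjusted difference in means overstates the effect, while the stratified estimate recovers the value that was built into the data.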
Propensity score weighting
Using propensity scores as weights is also common. Inverse propensity weighting is one easy way and is analogous to using sampling weights to weight survey respondents so that the sample is more representative of a specific population. There are several other weighting methods (e.g., rescaled inverse propensity weighting). With inverse propensity weighting each subject's weight is the inverse of the probability of belonging to the group to which they actually belong, with that probability taken from the propensity score: 1/p for a treated subject whose propensity score is p, and 1/(1 - p) for a control subject. A shortcoming of propensity score weighting is that there are a number of alternative kinds of weights and the choice of which to use is usually not obvious ahead of time.
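A minimal sketch of inverse propensity weighting follows. Each treated subject receives the weight 1/p and each control subject 1/(1 - p), and the effect is estimated as the difference between the two weighted mean outcomes. The data are fabricated with a built-in effect of 1.0, and the normalized (weighted-mean) form of the estimator is one assumed choice among the several kinds of weights mentioned above.

```python
def ipw_effect(records):
    """records: list of (propensity_score, treated_flag, outcome).
    Returns the difference in inverse-propensity-weighted mean outcomes."""
    num1 = den1 = num0 = den0 = 0.0
    for ps, t, y in records:
        if t == 1:
            w = 1.0 / ps           # inverse probability of being treated
            num1 += w * y
            den1 += w
        else:
            w = 1.0 / (1.0 - ps)   # inverse probability of being a control
            num0 += w * y
            den0 += w
    return num1 / den1 - num0 / den0

# Toy data: two kinds of subjects; the built-in treatment effect is 1.0.
records = (
    [(0.25, 1, 6.0)] + [(0.25, 0, 5.0)] * 3    # rarely treated, low baseline
    + [(0.75, 1, 10.0)] * 3 + [(0.75, 0, 9.0)]  # often treated, high baseline
)

treated_y = [y for _, t, y in records if t == 1]
control_y = [y for _, t, y in records if t == 0]
naive = sum(treated_y) / len(treated_y) - sum(control_y) / len(control_y)
weighted = ipw_effect(records)
print(round(naive, 2), round(weighted, 2))
```

The weights give more influence to subjects who ended up in a group they were unlikely to be in, which rebalances the two groups on the characteristics summarized by the score.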
Covariate adjustment
Propensity scores, either in continuous raw form or grouped into strata, can also be used as covariates in models for estimating effect size. The easiest way (though not always the best) is a regression model relating the outcome (dependent variable) to treatment group status – usually a dummy-coded (0/1) variable – after having first included subjects' propensity scores in the equation as a control variable. This general form of statistical model is known as analysis of covariance or ANCOVA. A variant is when propensity scores are first grouped, for example into quintiles, and subsequently used in analysis of variance or ANOVA. In the ANOVA approach, the stratified propensity scores are entered first as a categorical factor followed by treatment group membership.
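The regression form of covariate adjustment can be sketched as an ordinary least squares fit of the outcome on an intercept, the 0/1 treatment dummy and the propensity score, with the coefficient on the dummy read as the adjusted effect. The small solver, the function names and the data (generated as outcome = 4 + 1.0 × treatment + 8 × propensity score, with no noise) are assumptions made for illustration only.

```python
def ols(X, y):
    """Least squares coefficients via the normal equations,
    solved with Gauss-Jordan elimination (fine for tiny problems)."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def adjusted_effect(records):
    """records: list of (propensity_score, treated_flag, outcome).
    Returns the coefficient on the treatment dummy, controlling for the score."""
    X = [[1.0, float(t), ps] for ps, t, _ in records]
    y = [out for _, _, out in records]
    return ols(X, y)[1]

# Toy data with a built-in treatment effect of 1.0
design = [(0.2, 0), (0.3, 0), (0.4, 0), (0.6, 0),
          (0.3, 1), (0.6, 1), (0.7, 1), (0.8, 1)]
records = [(ps, t, 4.0 + 1.0 * t + 8.0 * ps) for ps, t in design]

treated_y = [y for _, t, y in records if t == 1]
control_y = [y for _, t, y in records if t == 0]
naive = sum(treated_y) / len(treated_y) - sum(control_y) / len(control_y)
effect = adjusted_effect(records)
print(round(naive, 2), round(effect, 2))
```

Because the toy outcome is exactly linear in the propensity score, the adjusted coefficient recovers the built-in effect; real data will rarely be so cooperative.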
All of these methods can accommodate multivariate analysis such as structural equation modeling to estimate adjusted effect sizes. Multiple groups – not just a single treatment and a single control – are also possible in propensity score analysis, though this makes the analysis more complicated. There are other methods for achieving similar objectives in addition to the four described above. They include the instrumental variables (IV) approach, Heckman's sample selection model, matching estimators and nonparametric regression.
So which is best? All have pros and cons and sometimes these methods are used in combination with one another. To my knowledge, no one method is ideally suited to all types of data under all circumstances. Sample size and research design need to be considered. Two things that cannot be debated are that propensity score analysis should never be carried out perfunctorily and that the degree to which group imbalance has been diminished must be examined carefully, whatever method or methods have been used. The choices the researcher makes can be very consequential and it is advisable to consult the literature even if this topic is not entirely new to you.
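One way to examine how far group imbalance has been reduced is to compare a covariate's standardized mean difference before and after adjustment. This diagnostic is my own choice of illustration rather than something prescribed above, and the ages below are invented; a value near zero indicates that the groups are similar on that covariate, and values below roughly 0.1 are often treated as acceptable, though that threshold is only a rule of thumb.

```python
import math

def sample_variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def standardized_mean_diff(x_treated, x_control):
    """Difference in group means divided by the pooled standard deviation."""
    m1 = sum(x_treated) / len(x_treated)
    m0 = sum(x_control) / len(x_control)
    pooled_sd = math.sqrt((sample_variance(x_treated) + sample_variance(x_control)) / 2)
    return (m1 - m0) / pooled_sd

# Hypothetical ages
age_treated = [40, 50, 60]
age_control_all = [30, 40, 50]      # controls before matching
age_control_matched = [38, 50, 59]  # controls retained after matching

before = standardized_mean_diff(age_treated, age_control_all)
after = standardized_mean_diff(age_treated, age_control_matched)
print(round(before, 2), round(after, 2))
```

The same check would be repeated for every covariate of interest, whatever method or methods have been used.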
There are no hard-and-fast rules regarding which variables to use in the propensity score estimation. In theory, we can group variables into four broad categories:
- variables that are related to both outcome (dependent variable) and group membership (treatment versus control);
- those associated only with the outcome;
- variables that are only related to group membership; and
- those that are not related to group membership or outcome.
In actual practice it is often hard to place variables into categories as distinct as these, even in fields such as medicine which rest on well-defined scientific principles. At a minimum, variables associated with the outcome that also differ between the groups should be considered, provided their inclusion makes intuitive and theoretical sense. These are potentially true confounders. In most circumstances, the last type listed above can be ignored. The middle two groups may prove to have an impact in multivariate modeling and should not be automatically excluded from consideration. However, propensity score analysis in a marketing context is often different and care must be taken when adapting the method to our own circumstances.
Criticisms
Propensity score analysis is not without controversy and, as is true of any methodology, can be misused. One criticism is that it is increasingly used as a substitute for experiments when the latter would have been both practical and ethical. Experiments should always be preferred whenever possible. Another criticism is that it is quite complicated, with many options and numerous decisions to make, and the way it is undertaken can substantially affect the results. Important confounding variables may be omitted or modeled incorrectly, and crucial data may be difficult or even impossible to obtain. There is always the risk of hidden confounders – variables the researcher is unaware of that are affecting the results. Naturally, the skill and experience of the researcher must also be considered, as propensity score analysis is not as straightforward as this short article might imply.
Further reading
Nearly all the methods we use in marketing research have been developed outside of marketing research, hence I believe it is a good idea to keep our eyes and ears open for developments in other disciplines. The foregoing has been a greatly simplified overview of a relatively new and complex set of methodologies that are rapidly evolving. In addition to the many sources available online, several full-length textbooks on this subject have been published. Two recent ones are Using Propensity Scores in Quasi-Experimental Designs (Holmes) and Propensity Score Analysis: Statistical Methods and Applications (Guo and Fraser). Experimental and Quasi-Experimental Designs for Generalized Causal Inference (Shadish et al.) is a widely cited textbook that offers a comprehensive perspective on research, including propensity score analysis, as well as detailed guidelines on research design and how design affects our interpretation of the results.