Latent Profile Analysis (LPA) is a variant of cluster analysis. Cluster analysis is a statistical technique for finding "clusters" of observations that have similar values on a set of variables.
In this sense it is the “classic” way of segmentation, i.e. finding groups that are homogeneous internally and heterogeneous relative to one another.
LPA is a model-based approach, which means each observation (= customer) obtains a probability of belonging to each class (class membership is the latent part of LPA).
LPA handles continuous variables and allows the researcher to find an optimal solution (number of clusters) based on several fit criteria.
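To make the "probability of belonging to a class" idea concrete, here is a minimal sketch in Python. It assumes a single spending variable and two normally distributed profiles; the weights, means and standard deviations are made up for illustration and are not estimated from the example data.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def posterior_probabilities(x, profiles):
    """Return P(profile | x) for every profile using Bayes' rule.

    profiles: list of (weight, mean, sd) tuples whose weights sum to 1.
    """
    joint = [w * normal_pdf(x, m, s) for w, m, s in profiles]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical profiles for annual spend on one product category:
# (share of customers, mean spend, standard deviation)
profiles = [(0.7, 3000.0, 1500.0), (0.3, 12000.0, 4000.0)]

for spend in (2500.0, 7000.0, 15000.0):
    probs = posterior_probabilities(spend, profiles)
    print(spend, [round(p, 3) for p in probs])
```

A customer is usually assigned to the profile with the highest probability, but the probabilities themselves show how clear-cut that assignment is.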
LPA (or another clustering technique) can be used to:
We use the following example data from a wholesale distributor to illustrate LPA. The data reflect the annual spending of businesses in Portugal. Below you see a sample of the data. Each row is a customer.
More information on the dataset can be found here.
fresh | milk | grocery | frozen | detergents_paper | delicassen |
---|---|---|---|---|---|
12669 | 9656 | 7561 | 214 | 2674 | 1338 |
7057 | 9810 | 9568 | 1762 | 3293 | 1776 |
6353 | 8808 | 7684 | 2405 | 3516 | 7844 |
13265 | 1196 | 4221 | 6404 | 507 | 1788 |
22615 | 5410 | 7198 | 3915 | 1777 | 5185 |
9413 | 8259 | 5126 | 666 | 1795 | 1451 |
With any cluster analysis, how many clusters to use is a critical question. There are various ways to answer it from an analytical standpoint. LPA is an unsupervised technique based on probabilities. The algorithm can therefore determine how many clusters provide an “optimal” fit for the data.
We fit models for a range of cluster counts (1 to n) and model options and let the fit statistics indicate the optimal number of clusters. A number of information criteria can be used to select the number of clusters, e.g. the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC); for both, lower is better. In our example a 3- or 4-cluster solution provides the best fit to our data.
Model 1 and Model 6 refer to different settings of the variances and covariances.
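As a rough sketch of how such a comparison works, the Python snippet below computes AIC and BIC for candidate solutions and picks the lowest value of each. The log-likelihoods and the sample size are hypothetical placeholders, not results from our data, and the parameter count assumes equal variances across clusters and zero covariances.

```python
import math


def n_free_parameters(n_profiles, n_variables):
    """Free parameters, ASSUMING equal variances across profiles and zero covariances."""
    means = n_profiles * n_variables
    variances = n_variables
    weights = n_profiles - 1  # mixing proportions sum to 1
    return means + variances + weights


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


N_OBS = 440       # hypothetical number of customers
N_VARIABLES = 6   # the six spending categories

# Hypothetical log-likelihoods per number of clusters (illustration only)
log_likelihoods = {1: -27000.0, 2: -26500.0, 3: -26300.0, 4: -26290.0}

scores = {}
for k, ll in log_likelihoods.items():
    p = n_free_parameters(k, N_VARIABLES)
    scores[k] = {"aic": aic(ll, p), "bic": bic(ll, p, N_OBS)}

best_by_aic = min(scores, key=lambda k: scores[k]["aic"])
best_by_bic = min(scores, key=lambda k: scores[k]["bic"])
print("best by AIC:", best_by_aic, "| best by BIC:", best_by_bic)
```

With these made-up numbers AIC prefers 4 clusters and BIC prefers 3, because BIC penalizes extra parameters more heavily. This is one way the criteria can disagree and leave the final choice to the researcher.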
We use a more advanced approach defined by Akogul and Erisoglu (2017) to select the optimal number of clusters. In this case 3 clusters are most appropriate. (Akogul, Serkan, and Murat Erisoglu. 2017. “An Approach for Determining the Number of Clusters in a Model-Based Cluster Analysis.” Entropy 19 (9). Multidisciplinary Digital Publishing Institute: 452.)
Once the model / researcher has decided on the number of clusters, meaning must be assigned. Which labels are assigned to each cluster is a matter of interpretation. Generally speaking, the most obvious / interesting / useful assignments are made.
Size and distribution of the variables can be compared between clusters.
Same data as before, only not the relative percentages.
In our example one could assign the following cluster labels:
cluster | fresh | milk | grocery | frozen | detergents_paper | delicassen |
---|---|---|---|---|---|---|
Big spenders | 22198 | 18952 | 22171 | 8687 | 8708 | 5356 |
Freshies | 13209 | 2023 | 2565 | 3602 | 371 | 909 |
Small spenders | 8568 | 7003 | 10638 | 1325 | 4303 | 1366 |
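One way to support such labels is to look at how each cluster splits its spending across categories. The Python sketch below uses the values from the table above (assumed here to be average annual spend per cluster) and computes each category's share of the cluster total.

```python
categories = ["fresh", "milk", "grocery", "frozen", "detergents_paper", "delicassen"]

# Values copied from the cluster table (assumed to be average annual spend)
cluster_profiles = {
    "Big spenders": [22198, 18952, 22171, 8687, 8708, 5356],
    "Freshies": [13209, 2023, 2565, 3602, 371, 909],
    "Small spenders": [8568, 7003, 10638, 1325, 4303, 1366],
}


def category_shares(values):
    """Share of each category in the cluster's total spend."""
    total = sum(values)
    return [v / total for v in values]


def top_category(values):
    """Name of the category with the largest spend in the cluster."""
    return categories[values.index(max(values))]


for label, values in cluster_profiles.items():
    shares = category_shares(values)
    summary = ", ".join(f"{c} {s:.0%}" for c, s in zip(categories, shares))
    print(f"{label} (total {sum(values)}): {summary}")
```

Freshies spend more than half of their total on fresh products, while Small spenders put the largest share into grocery, which is what the labels try to capture.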
Once clusters and meaning have been assigned, the cluster labels can be merged back into the original data set to obtain further insights. In our example we also have information on the type of business (channel) and the region.
Same data as before, only not the relative percentages.
Horeca (hotels, restaurants and cafés) has relatively more Freshies, which makes sense.
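The merge-and-compare step can be sketched in a few lines of Python. The customer records and cluster assignments below are invented for illustration, and the field names (`id`, `channel`) are assumptions rather than the actual column names of the data set.

```python
from collections import Counter, defaultdict

# Hypothetical customers with their channel (illustration only)
customers = [
    {"id": 1, "channel": "Horeca"},
    {"id": 2, "channel": "Horeca"},
    {"id": 3, "channel": "Horeca"},
    {"id": 4, "channel": "Horeca"},
    {"id": 5, "channel": "Retail"},
    {"id": 6, "channel": "Retail"},
    {"id": 7, "channel": "Retail"},
    {"id": 8, "channel": "Retail"},
]

# Hypothetical cluster label per customer id, as produced by the cluster model
assignments = {
    1: "Freshies", 2: "Freshies", 3: "Freshies", 4: "Small spenders",
    5: "Freshies", 6: "Small spenders", 7: "Small spenders", 8: "Big spenders",
}


def cluster_shares_by(customers, assignments, key):
    """Merge labels onto customers and return the cluster shares within each group."""
    counts = defaultdict(Counter)
    for customer in customers:
        label = assignments[customer["id"]]  # the merge step
        counts[customer[key]][label] += 1
    return {
        group: {label: n / sum(counter.values()) for label, n in counter.items()}
        for group, counter in counts.items()
    }


shares = cluster_shares_by(customers, assignments, "channel")
for channel, dist in shares.items():
    print(channel, {label: f"{share:.0%}" for label, share in dist.items()})
```

Comparing shares within each channel, rather than raw counts, keeps the comparison fair when the channels differ in size.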
A different way to check visually whether the clusters make sense, and to get a more in-depth understanding of the variability of the cluster solution, is to use a dimension reduction technique. In this case we use UMAP, which takes the original data and reduces it to two dimensions (without losing too much information).
You can see groupings of customers which (to some degree) match our cluster solution. The map also demonstrates that some clusters have wide internal variety (e.g. Big spenders can vary in their purchase patterns).
This technique provides an individual-level estimate of similarity which can be used for downstream analysis.
Points represent our customers. Customers which have a more similar purchase pattern appear closer together.