Clustering in SAS® Visual Statistics
One of the analytical methods available in SAS® Visual Statistics is clustering. In general, clustering groups observations by their similarity so that each group can be handled as a unit, e.g. by targeting a segment of customers in the same way.
In the following example we will use the clustering technique to perform a transactional segmentation of banking customers. The available data lists the number of transactions for each customer and channel. This means we will segment the customers in terms of their transactional channel usage, distinguishing between transactions initiated in the branch (BRC), via the call centre (CCT), at the point of sale (POS) and at cash machines (ATM). The data also includes one column for the total of these four categories. The following screenshot shows a sample of the data, which has one observation per customer:
Clustering in SAS Visual Statistics uses K-means clustering as the method. As the distribution of the input variables can heavily influence the results of K-means clustering, let’s first look at our data distributions.
This can be easily done by creating a couple of histograms in the Data Explorer as shown in the following screenshot:
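Alongside the histograms, the shape of each distribution can also be summarised numerically. The following Python sketch is not part of SAS Visual Statistics. It simply illustrates the moment-based skewness and excess kurtosis that the histograms let us judge by eye. The function names and the `atm_counts` values are made up for illustration.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (positive means a long right tail)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # constant column: no asymmetry to report
    return m3 / m2 ** 1.5


def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        return 0.0  # constant column: treat as having no excess kurtosis
    return m4 / m2 ** 2 - 3


# Hypothetical transaction counts for one channel (illustrative values only).
atm_counts = [0, 1, 1, 2, 2, 3, 4, 6, 15, 40]
print("skewness:", round(skewness(atm_counts), 2))
print("excess kurtosis:", round(excess_kurtosis(atm_counts), 2))
```

Values far from zero on either measure suggest that the variable may need a transformation before it is used as a clustering input.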
K-means clustering works best with input variables that have low skewness and kurtosis. The histograms show that the distributions of the transaction counts are heavily skewed, with most customers at low counts and a long tail towards high values (that is, right skewed). To be able to get meaningful segments we will need to create some derived (calculated) variables and use those as input for the clustering. We will create variables that reflect the proportion of transactions being processed through each channel. In particular, we will use the logit function, defined as follows in this case (where Ln means the natural logarithm):
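The original expression is not reproduced here, so the following is the standard logit of a proportion, written under the assumption that the proportion for channel c is its transaction count n_c divided by the customer's total count n_total. The symbols p_c, n_c and n_total are introduced for illustration only.

```latex
\[
\mathrm{Logit}_c = \mathrm{Ln}\!\left(\frac{p_c}{1 - p_c}\right),
\qquad
p_c = \frac{n_c}{n_{\mathrm{total}}},
\qquad
c \in \{\mathrm{BRC}, \mathrm{CCT}, \mathrm{POS}, \mathrm{ATM}\}
\]
```

Under that assumption this is equivalent to Ln(n_c / (n_total - n_c)). The logit is undefined when a proportion is exactly 0 or 1, so in practice some adjustment is needed for customers who never use a channel or use only one.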
It is very easy to set up the new calculated items in the Data Explorer using the expression builder facility:
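For readers working outside the Data Explorer, the same kind of calculated item can be sketched in Python. This is not the SAS expression itself. The function name `channel_logits` and the smoothing constant `eps` are assumptions added here so that proportions of exactly 0 or 1 do not produce an infinite logit.

```python
import math

CHANNELS = ("BRC", "CCT", "POS", "ATM")


def channel_logits(counts, eps=0.5):
    """Map per-channel transaction counts to logits of channel proportions.

    eps is an assumed smoothing constant (not taken from the article) that
    keeps each proportion strictly between 0 and 1.
    """
    total = sum(counts[c] for c in CHANNELS)
    logits = {}
    for c in CHANNELS:
        p = (counts[c] + eps) / (total + 2 * eps)  # smoothed proportion
        logits["logit_" + c] = math.log(p / (1 - p))
    return logits


# Hypothetical customer, made up for illustration.
customer = {"BRC": 5, "CCT": 0, "POS": 10, "ATM": 5}
for name, value in channel_logits(customer).items():
    print(name, round(value, 3))
```

A channel that carries exactly half of a customer's transactions maps to a logit of 0. Less-used channels give negative values and dominant channels give positive ones.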
Let’s change our histograms to show the four newly created logit calculated items, one for each of the four channels. We can now see that the distributions look suitable for our K-means clustering:
Now we can move on to the clustering step. We add a Cluster type exploration and select the four Logit variables as the data items used in the analysis, and we will also request five clusters to be created:
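To make the method behind this step concrete, here is a minimal sketch of the basic K-means procedure (Lloyd's algorithm) in plain Python. It is a textbook version and not necessarily how SAS Visual Statistics implements it internally. The function names, the random initialisation and the sample points are all assumptions for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Basic K-means: returns (cluster index per point, list of centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct observations
    assignment = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


# Hypothetical logit values (BRC, CCT, POS, ATM) for six customers.
points = [
    (0.0, 0.1, 0.0, 0.2), (0.1, 0.0, 0.2, 0.1), (0.2, 0.1, 0.1, 0.0),
    (5.0, 5.1, 4.9, 5.2), (5.1, 5.0, 5.2, 4.9), (4.9, 5.2, 5.0, 5.1),
]
labels, centroids = kmeans(points, k=2)
print(labels)
```

In the article's setting, `points` would hold the four logit variables for every customer and `k` would be 5.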
As a result of clustering, we can look at various outputs. In this case we would like to profile and interpret our final clusters, for which the parallel coordinate plot is useful. This shows each cluster with a different colour on the binned scale of each input variable. Looking at each cluster reveals how the cluster members are distributed along each variable and therefore shows the cluster profile.
Cluster 1 shows higher than average POS transactions as the most important difference. Therefore, they can be labelled as “Card Shoppers”.
Cluster 2 shows higher than average ATM usage, so we label them as “Cash Lovers”.
Cluster 3 has roughly average values in all four channels, which means its members use newer channels as well as more traditional methods. We therefore label them as “Transitionals”.
Cluster 4 shows a high usage of the Call Centres, so we could call them “Phone Bankers”.
Cluster 5 shows high usage of the Branch transactions, the most traditional channel. We might label them as “Time-Honoureds”.
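The labelling above is read off the parallel coordinate plot. A comparable numeric profile can be sketched by comparing each cluster's mean with the overall mean for every input variable. The Python below is only an illustration of that idea. The function names, the rows and the cluster labels are hypothetical and do not come from the article's data.

```python
def cluster_profiles(rows, labels):
    """Return {cluster: {variable: cluster mean minus overall mean}}."""
    variables = list(rows[0].keys())
    overall = {v: sum(r[v] for r in rows) / len(rows) for v in variables}
    profiles = {}
    for cluster in sorted(set(labels)):
        members = [r for r, lab in zip(rows, labels) if lab == cluster]
        profiles[cluster] = {
            v: sum(m[v] for m in members) / len(members) - overall[v]
            for v in variables
        }
    return profiles


def dominant_variable(profile):
    """Variable on which the cluster sits furthest above the overall mean."""
    return max(profile, key=profile.get)


# Hypothetical logit values and cluster labels for four customers.
rows = [
    {"logit_BRC": -2.0, "logit_CCT": -3.0, "logit_POS": 1.0, "logit_ATM": -1.5},
    {"logit_BRC": -2.2, "logit_CCT": -3.2, "logit_POS": 1.2, "logit_ATM": -1.7},
    {"logit_BRC": -1.8, "logit_CCT": -3.1, "logit_POS": -1.5, "logit_ATM": 0.9},
    {"logit_BRC": -2.0, "logit_CCT": -2.9, "logit_POS": -1.3, "logit_ATM": 1.1},
]
labels = [1, 1, 2, 2]

profiles = cluster_profiles(rows, labels)
for cluster, profile in profiles.items():
    print(cluster, dominant_variable(profile))
```

A cluster whose deviations are all close to zero, like the “Transitionals”, has no clearly dominant channel, so the largest deviation should be checked for size before it is used as a label.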