In my previous post, I explained how to perform RFM analysis, one of the most widely used customer segmentation models. Today I want to continue with the customer analysis topic and guide you through applying machine learning to customer segmentation.

## K-means algorithm

K-means is an unsupervised machine learning algorithm used for data clustering. It defines k cluster centroids and assigns each data point to the nearest one, while trying to keep the clusters small, that is, keeping points as close as possible to their own centroid.
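To make the assign-and-update idea concrete, here is a minimal NumPy sketch of the classic iterative procedure (often called Lloyd's algorithm). It is only an illustration: the function name `kmeans`, its parameters and the two synthetic blobs are assumptions made for this example, not the code used later in this post.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: alternate an assignment step and a centroid update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: find the nearest centroid for every point.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated synthetic blobs (illustrative data only).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
labels, centroids = kmeans(X, k=2)
print(centroids)
```

On data this cleanly separated, each blob should end up in its own cluster. Real customer data is rarely this tidy, which is why the preprocessing steps matter.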

## Assumptions and data preprocessing needed for k-means clustering

To obtain reliable results, the data needs to satisfy a few assumptions. It should not contain outliers, because they pull the centroids towards themselves and distort the clusters. Moreover, the features should be unskewed and share the same mean and variance.

Let’s take a look at our data and check if it satisfies k-means clustering requirements.

We can observe that the recency distribution is skewed. Moreover, the features have different means and variances. So first, we need to unskew the data by applying a log transformation to the recency column. Then, we standardize the data so that every feature has a mean of zero and a standard deviation of one.
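The two transformations can be sketched in a few lines of NumPy. The data here is synthetic and only stands in for the RFM table: the column names `recency`, `frequency` and `monetary`, their distributions, and the helper `skewness` are all assumptions made for this example. Note that a log transform requires strictly positive values.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the mean of the cubed z-scores."""
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-in for the RFM features (not the data from this post).
recency = rng.lognormal(mean=3.0, sigma=0.8, size=n)   # right-skewed, positive
frequency = rng.normal(loc=10.0, scale=2.0, size=n)
monetary = rng.normal(loc=200.0, scale=40.0, size=n)

# Step 1: log transform the skewed column only.
recency_log = np.log(recency)

# Step 2: standardize every column to mean 0 and standard deviation 1.
X = np.column_stack([recency_log, frequency, monetary])
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print("skewness before:", round(skewness(recency), 2))
print("skewness after: ", round(skewness(recency_log), 2))
print("column means:", X_scaled.mean(axis=0).round(6))
print("column stds: ", X_scaled.std(axis=0).round(6))
```

The same standardization is available as `StandardScaler` in scikit-learn; the manual version is shown only to make the arithmetic visible.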

As you can see on the plots below, after the transformations all the features are unskewed and have the same mean and variance.

Distribution of the data after applying the log transform to the recency column and standardizing all features. Dashed lines represent mean values.

Now, we will identify the outliers using a box plot and the interquartile range score. The box plot gives us information about the data distribution and the presence of outliers based on a five-number summary:

• “minimum” – the lowest data value excluding outliers (shown as the end of the left whisker)
• the first quartile (Q1) – 25th percentile (shown as the left side of the box)
• median (shown as the line inside the box)
• the third quartile (Q3) – 75th percentile (shown as the right side of the box)
• “maximum” – the highest data value excluding outliers (shown as the end of the right whisker)

“Minimum” and “maximum” values are defined in the following way. First, the interquartile range is calculated:

IQR = Q3 – Q1

Next, “minimum” and “maximum” are computed using the equations below:

“minimum” = Q1 – 1.5 × IQR

“maximum” = Q3 + 1.5 × IQR

Any data point below “minimum” or above “maximum” is considered to be an outlier.
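As a small worked example of the rule above, the sketch below computes the bounds for a toy array and flags the values outside them. The helper name `iqr_bounds` and the sample values are assumptions made for illustration. Quartiles are taken with NumPy's default linear interpolation, so other tools may give slightly different bounds.

```python
import numpy as np

def iqr_bounds(x):
    """Return the lower and upper outlier bounds based on 1.5 * IQR."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return float(q1 - 1.5 * iqr), float(q3 + 1.5 * iqr)

# Toy feature with one obviously extreme value.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)

lower, upper = iqr_bounds(x)
is_outlier = (x < lower) | (x > upper)

outliers = x[is_outlier]
cleaned = x[~is_outlier]

print(lower, upper)
print(outliers)
```

For this array Q1 = 3.25 and Q3 = 7.75, so IQR = 4.5 and the bounds are -3.5 and 14.5. Only the value 50 falls outside them.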

Looking at the box plots for our data, we can observe that each feature contains some outliers, which should be removed. After this operation, our data fulfils all requirements and we can move to k-means clustering.

## Applying k-means clustering

We start by finding the optimal number of clusters for the k-means algorithm. We will use the elbow method.

First, we perform k-means clustering for a range of values of k. Then, for each value of k, a single score covering all clusters is calculated. As the scoring metric we use inertia, which is the sum of the squared distances from each data point to its assigned cluster centroid. Next, we plot inertia versus k and look for the ‘elbow’ point (the point at which the decrease starts to slow down).
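The loop described above might look like the following sketch, assuming scikit-learn is installed. The three synthetic blobs stand in for the preprocessed features, and the range of k, `n_init` and `random_state` are illustrative choices rather than the settings used for this post.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic blobs standing in for the preprocessed customer features.
centers = np.array([[0.0, 0.0], [6.0, 0.0], [3.0, 5.0]])
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(100, 2)) for c in centers])

ks = list(range(1, 9))
inertias = []
for k in ks:
    # Fit k-means for this k and record its inertia
    # (sum of squared distances to the assigned centroids).
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against `ks` with any plotting library gives the elbow curve. For data built from three blobs, the sharp drop should end around k = 3, after which adding clusters brings only small improvements.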