K-means

A do-it-yourself two dimensional k-means algorithm

After producing the Radiohead colour palettes in a semi-unsupervised manner using stats::kmeans(), I thought I’d have a go at producing my own simple and rudimentary k-means algorithm in order to test my understanding of what it’s doing and how it works.

The code in this blog post creates a very basic k-means clustering algorithm that operates only in two dimensions. But nonetheless appears to give sensible results!

Begin with the tidyverse. It just makes everything so much easier…

library(tidyverse)

Data

The data I am going to use for this example is the cluster::xclara data set, described in the documentation as an artificial data set consisting of 3000 points in 3 quite well-separated clusters. A quick peak at the data (below) shows that this statement appears to be true!

plot(cluster::xclara, pch=4, asp=1)

My understanding of how k-means works is that it operates in ordinary Euclidian distance - and although the range of the x and y variables in the dataset are similar, I am still going to normalise them such that they both vary between 0 and 1. I also add a row number here for use later.

df <- 
  cluster::xclara %>% 
  mutate(x = (V1 - min(V1))/max(V1 - min(V1)),
         y = (V2 - min(V2))/max(V2 - min(V2))) %>%
  mutate(rn = row_number()) %>%
  as_tibble()

# Plot normalised data to check it worked properly
plot(df$x, df$y, pch=4, asp=1)
# Draw lines at 0 and 1
abline(h=c(0,1), v=c(0,1), col=2)

Finding clusters

Start by defining the number of groups to look for

ngroups <- 3

Make intitial, random centroids by sampling ngroups rows from the data. This defines are random starting positions for each cluster center.

# Set seed for reproducability and make random inital guesses for centroids
set.seed(5)
initial_rows <- sample(df$rn, ngroups)

Now I initialise a dataframe to hold details of the centroids that will be made during each iteration of the algorithm. For now, I can populate it with the first (randomly chosen) centroids. This dataframe, cent keeps the coordinates of the initial guesses in both the original dimensions (V1 and V2) and the normalised dimensions (x and y).

cent <-
  tibble(centroid = 1:ngroups,
         V1 = df$V1[initial_rows],
         V2 = df$V2[initial_rows],
         x = df$x[initial_rows],
         y = df$y[initial_rows],
         iteration = 1)

A smart algorithm would run iterations until nothing is changing with successive iterations. That level of smartness is beyond the scope of this basic example so here I define the number of iterations I would like to run through.

kmeans_iterations <- 15

Now loop through the iterations computing the cluster centers. I have tried to comment each step to explain what is happening

# Initialise a list to store the dataframe with all rows assigned to a centroid at each iteration
dataframe_list <- list()

# Begin the loop 
for(i in 1:kmeans_iterations){
  
  # Initialise an empty matrix with a row for each observation in the data and a column for each cluster
  distances <- matrix(nrow=nrow(df), ncol=ngroups)
  
  # Loop through each centroid computing the distance from each point in the data to the centroid
  for(j in 1:ngroups){
    # Compute distance of each xy (normalised) point to each centroid j
    x <- abs(df$x - cent$x[cent$iteration == i & cent$centroid == j])
    y <- abs(df$y - cent$y[cent$iteration == i & cent$centroid == j])
    dist <- sqrt(x^2 + y^2)
    
    # Assign distances to column j of the matrix
    distances[,j] <- dist
  }
  
  # Compute which point is closest to which centroid
  # Initialise a vector to store the centroid that each observation is closest too
  min_dist <- vector(length = nrow(distances))
  
  # Loop through the matrix rows, selecting the column (centroid) that has the smallest distance
  # and assigning it to the min_dist vector
  for(k in 1:nrow(distances)) min_dist[k] <- which(distances[k,] == min(distances[k,]))
  
  # Append the centroid list to the df and output the dataframe to the dataframe list
  df$iteration <- i
  df$centroid <- min_dist
  dataframe_list[[i]] <- df
  
  # If this is the last iteration, we can exit the algorithm here
  # We dont need to build the next iteration of centroids for cent
  if(i == kmeans_iterations) break
  
  # Write the new centroids back to the centroid dataframe
  # Group by the centroid and compute the mean of both the 
  # original coordinates (V1 and V2) and the normalised coordinates (x and y)
  new_centroids <-
    df %>%
    group_by(centroid) %>%
    summarise(V1 = mean(V1),
              V2 = mean(V2),
              x=mean(x),
              y=mean(y)) %>%
    mutate(iteration = i + 1)
  
  cent <- bind_rows(cent, new_centroids)
}

Bind all of the data set iterations together

df_list <- bind_rows(dataframe_list)

Run stats::kmeans()

Here I will run kmeans from the stats package in R on the same dataset looking for the same number of groups. This will allow me to test the output of my code.

# Run the base R kmeans function for ngroups
km <- kmeans(df[c("x", "y")], ngroups)

# Compute the V1 and V2 group centers from the normalised x-y centers
km_centers <-
  df %>% 
  mutate(kmeans_group = km$cluster) %>% 
  group_by(kmeans_group) %>% 
  summarise(V1 = mean(V1),
            V2 = mean(V2))

Visualise

Visualise all observations (in the original V1, V2 dimensions) at each iteration coloured by the centroid that they belong to in that iteration. This plot overlays the centroids at each iteration and the final results from the stats::kmeans() function.

# PLot all iterations facetted by iteration
df_list %>%
  ggplot()+
  geom_point(aes(V1, V2, col=as.factor(centroid)), alpha=0.03, show.legend = FALSE)+
  geom_point(data = cent,
             aes(V1, V2, fill=as.factor(centroid)),
             size=5, col=1, shape=21, alpha=1/2)+
  geom_point(data = km_centers, aes(V1, V2, shape="stats::kmeans() result"), size=2)+
  labs(title = "Basic k-means algorithm on the cluster::xclara dataset",
       subtitle = "Facetted by iteration number. Centroids for interation 1 are chosen at random",
       caption = "Results from stats::kmeans() shown as black dots",
       x = "V1",
       y = "V2", 
       shape = "",
       fill = "Centroids (this algorithm)")+
  theme_bw()+
  theme(panel.grid = element_blank(),
        strip.background = element_blank(),
        legend.position = "bottom")+
  facet_wrap(~iteration, nrow=3)+
  scale_color_viridis_d(option="plasma", end=0.8)+
  scale_fill_viridis_d(option="plasma", end=0.8)+
  scale_shape_manual(values=c("stats::kmeans() result" = 19, "1" = 21, "2" = 21, "3" = 21))+
  scale_x_continuous(breaks=scales::pretty_breaks(5))+
  scale_y_continuous(breaks=scales::pretty_breaks(5))+
  coord_equal()

In the plot above, my algorithm has converged to the same result as the stats::kmeans() by roughly iteration 13.

Animate

The following code (not run here) plots each iteration with a line showing the movement of the centroids. I have then created an animation of the iterations using imageJ.

# Plot each iteration on its own - tracking the path of the centroids
for(i in 1:kmeans_iterations){
  p <-
    df_list %>%
    filter(iteration == i) %>%
    ggplot(aes(V1, V2, col=as.factor(centroid)))+
    geom_point(alpha=0.1, size=2)+
    geom_segment(data = cent %>%
                   filter(iteration <= i) %>%
                   group_by(centroid) %>%
                   arrange(iteration) %>%
                   mutate(xend = lead(V1), yend = lead(V2)),
                 aes(V1, V2, xend=xend, yend=yend, col=as.factor(centroid)),
                 size=1)+
    geom_point(data = cent %>% filter(iteration == i),
               aes(V1, V2, fill=as.factor(centroid)),
               size=5, shape=21, col=1)+
    labs(subtitle = glue::glue("Iteration number = {i}"),
         x = "V1",
         y="V2")+
    theme_bw()+
    theme(legend.position = "",
          panel.grid = element_blank())+
    scale_x_continuous(breaks=scales::pretty_breaks(10))+
    scale_y_continuous(breaks=scales::pretty_breaks(10))+
    scale_color_viridis_d(option="plasma", end=0.8)+
    scale_fill_viridis_d(option="plasma", end=0.8)+
    coord_fixed()
  
  print(p)
  ggsave(paste0(i, ".png"), width = 6, height = 6)
}

No clusters

This implementation of kmeans allows me to also see what happens when it is run on a dataset that contains no clusters.

I have started by wrapping the code above into a function (cmeans() = chris means!) so I can easily run it many times.

# cmeans = chris means :)
cmeans <- function(df, ngroups, iterations = 10, seed=NULL, alpha=0.1){
  
  colnames(df) <- c("x", "y")
  
  if(!is.null(seed)) set.seed(seed)
  
  initial_rows <- sample(1:nrow(df), ngroups)

  cent <-
    tibble(centroid = 1:ngroups,
           x = df$x[initial_rows],
           y = df$y[initial_rows],
           iteration = 1)
  
  dataframe_list <- list()

  for(i in 1:iterations){
    
    distances <- matrix(nrow=nrow(df), ncol=ngroups)
    
    # Loop through each centroid computing the distance from each point in the data to the centroid
    for(j in 1:ngroups){
      x <- abs(df$x - cent$x[cent$iteration == i & cent$centroid == j])
      y <- abs(df$y - cent$y[cent$iteration == i & cent$centroid == j])
      dist <- sqrt(x^2 + y^2)
      distances[,j] <- dist
      }
    
    min_dist <- vector(length = nrow(distances))
    
    for(k in 1:nrow(distances)) min_dist[k] <- which(distances[k,] == min(distances[k,]))
    
    df$iteration <- i
    df$centroid <- min_dist
    dataframe_list[[i]] <- df
  
    if(i == iterations) break
  
    new_centroids <-
      df %>%
      group_by(centroid) %>%
      summarise(x = mean(x),
                y = mean(y)) %>%
      mutate(iteration = i + 1)
  
    cent <- bind_rows(cent, new_centroids)
    }
  
  df_list <- bind_rows(dataframe_list)
  
  df_list %>%
    ggplot()+
    geom_point(aes(x, y, col=as.factor(centroid)), alpha=alpha, show.legend = FALSE)+
    geom_point(data = cent,
               aes(x, y, fill=as.factor(centroid)),
               size=5, col=1, shape=21, alpha=1/2)+
    theme_bw()+
    theme(panel.grid = element_blank(),
          strip.background = element_blank(),
          legend.position = "")+
    facet_wrap(~iteration)+
    coord_equal()
  }

I now define a random, uniformly distributed x-y dataset and run cmeans() looking for 1 to 5 clusters (iterating 9 times each).

set.seed(1)
t <- tibble(x=runif(5000), y=runif(5000))

lapply(1:5, function(x) cmeans(t, ngroups = x, iterations = 9, seed = 2) + 
         labs(title = paste0("Specified clusters = ", x)))
## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

## 
## [[5]]

Of course, the algorithm converges on the number of clusters specified - yet no real clusters exist! Neat!

Another stable configuration for 5 clusters appears to be

cmeans(t, 5, 12, seed=30)

Avatar
Chris Holmes
Senior Data Scientist

PhD physicist making his way in the world of data science!

comments powered by Disqus