Creating wordclouds in R

An overview of how I wasted my time making wordclouds in R!

Back in 2005, Johnathan Feinberg created the Wordle wordcloud algorithm that eventually went on to be used in the very popular Wordle web application in 2008 . A few months ago, I stumbled upon his answer to a related question on Stack Overflow where he succinctly breaks down what the Wordle algorithm does into a single sentence

Place the word where it wants to be, while it intersects any of the previously placed words, move it one step along an ever-increasing spiral

Naturally, I thought that sounds like something I can try and reproduce in R!

I am using this project as an interesting coding challenge for my own amusement. I’m not sure wordclouds are a particularly useful way to visualise text analysis (Feinberg addresses some of their shortcomings himself at the end of this document) - but I do think they can look rather good.

Making wordclouds in R

There are well known packages in R for creating word clouds and text mining/analysis.

All examples shown here use text from Obama’s 2016 state of the union speech which I downloaded from here.

Also, the rather messy and unkept code repository is shared on github with several branches for each of the different attempts I had at this project.

Background

In an earlier post I wrote about how I had previously gone down a rather deep and dark rabbit hole related to character strings and glyphs in R whilst investigating how to best make wordclouds. Well, to cut cut a long story short, that rabbit hole had three distinct channels…

  1. The base R text plotting functions graphics::text() that allowed me to produce simple wordclouds (not accessing the glyphs themselves, so all collision detection is limited to the rectangles that bound individual words)
    • Simple wordcloud using functions such as graphics::text(), graphics::strwidth and graphics::strheight example below
  2. I then spent a long time (too long) constructing my own letter glyph polygons (by tracing postscript files!!!! using the excellent grImport package) in order to construct words and do polygon collision detection using the simple features sf package. It kind of worked but was horrendously slow and I couldnt come up with an elegant way of spacing the words on the page very well
    • Polygon character glyph wordcloud using the simple features sf package example below
  3. Finally I found Yixuan’s fontr package on github which allows you to extract character glyphs from a specific font, either as bitmaps or outline polygons. This, in turn, allowed me to create code for wordclouds that detected collisions at the individual glyph level, but also that wasn’t unuseably slow, and that is the subject of this blog post!

Adding a buffer

In order to tune the overall ‘crowdedness’(?) of the wordcloud I wrote some code to add a buffer region around each letter in the matrix returned by the fontr package.

It works by first expanding the whole matrix out by the size of the buffer (this makes empty matrix elements available to assign to the buffer value) and then mapping across the rows and columns of the matrix, computing the run length and indices of non-empty elements and adding the buffer to the start and end of each run.

The snippet of code below demonstrates what the resulting buffer region looks like.

p <- function(string, size, buffer, ...){
    # Produce the matrix of the glyph
    word_mat(string, size, 0, ...) %>%
    # Add the buffer
    add_buff(buffer) %>%
    # Plot the result
    plot_wc(buff.col = "orange",
            subtitle = paste0("Buffer = ", buffer)) + 
    theme_bw() + 
    expand_limits(y=size, x=size)
  }

# Run three times and visualise
wrap_plots(map(c(0, 5, 20), ~p("R!", 300, .x)))+
  plot_annotation(title = 'Visualising the letter buffer (shown in orange)')

This is a very simple and inefficient way of computing the buffer region for a letter, and it doesnt work so well when the glyph is off angle (see below). However, it does work well enough for my needs.

# Run three times and vsualise
wrap_plots(map(c(0, 5, 20), ~p("R!", 300, .x, rot=10)))+
  plot_annotation(title = 'Visualising the letter buffer (shown in orange)')

And here I render a small wordcloud with the buffer region highlighted

bmwc(obama[1:20,], show_word_buffer = T, col_buffer = "orange", buffer=4, seed=1)+theme_bw()
## All words placed (no overlaps)

Placing words

The small animation below demonstrates an example of how my implementation of the algorithm positions words on the page. It starts by trying to place the word at its initial position, and if it can’t be placed there (because it collides with an already placed word), it is moved around a spiral until it can be placed.

Collisions are detected by computing the intersect() of the non-empty elements of the word to be placed and the non-empty elements of the region of the page it is trying to be placed at. I suspect that this is a fairly inefficient and expensive way of detecting collisions.

Note that there are two special cases

  • The first (and largest) word can always be placed on its first attempt (as there are no other words for it to collide with)
  • My spirals are not ‘ever-increasing’ (although thet probably should be), so if a word reaches the end of its spiral and still cannot be placed, it is placed at the final coordinate of the spiral (where it will collide with an already placed word - and a message is printed to say that the word has been placed with an overlap)

Example wordclouds

Below are some example wordclouds using different features of the package

bmwc(obama[1:200,], seed = 2,
     title = "Barack Obama 2016 state of the union speech",
     subtitle = "Most frequent 200 words - font size mapped to word count")
## All words placed (no overlaps)

Setting markup = TRUE adds the coordinate axes to the plot, annotates the plot with an orange rectangle (the area within which the initial spiral origins for each word must sit) and the spiral for the first word placed. This view can be helpful when debugging!

bmwc(obama[1:200,], seed = 2, markup = TRUE,
     title = "Barack Obama 2016 state of the union speech",
     subtitle = "Most frequent 200 words - font size mapped to word count")
## All words placed (no overlaps)

Changing font and angle

bmwc(obama[1:200,], seed = 2, angle_range = c(-10, 10),
     title = "Barack Obama 2016 state of the union speech",
     subtitle = "Most frequent 200 words - font size mapped to word count")
## All words placed (no overlaps)

bmwc(obama[1:200,], seed = 2, angle_range = c(-80, 80),
     title = "Barack Obama 2016 state of the union speech",
     subtitle = "Most frequent 200 words - font size mapped to word count")
## All words placed (no overlaps)

And changing font

sysfonts::font_add(family = "old_eng", "C:/Windows/Fonts/OLDENGL.TTF")

bmwc(obama[1:200,], seed = 2, angle_range = c(-10, 10), font_family = "old_eng",
     title = "Barack Obama 2016 state of the union speech",
     subtitle = "Most frequent 200 words - font size mapped to word count")
## All words placed (no overlaps)

I will note that this code is still frustratingly slow to run but I think I’ve had enough of wordclouds for the time being!

Avatar
Chris Holmes
Senior Data Scientist

PhD physicist making his way in the world of data science!

comments powered by Disqus