Character glyphs in R

My journey towards gaining control over character glyphs in R

I recently went down an R-shaped rabbit hole…like, a really deep one, and it all started when I decided that I would have a go at creating my own word-clouds in R for fun.

Now I know what you’re thinking…word-clouds suck!…and I think you’re right. In my opinion, they’re a pretty poor way of analysing text, and they can sometimes be a bit misleading about what the intent or sentiment of the original text actually was. But with that aside, I do think they can sometimes look nice, and it’s also an interesting problem (computationally) to try and generate a well packed, pretty word-cloud (at speed).

I’ll skip the details of how my R code for word-clouds works (that’s probably another post all of its own) and instead, I’d like to document my journey through the rabbit hole of getting the control I wanted over character glyphs in R. Namely, control of the coordinates that make up the outline of individual letter glyphs.

Now, I didn’t mention at the start of this post that I have only sort of got out of said rabbit hole, and that’s important, because my work on word-clouds is not yet finished…but I’m just a little fatigued and plan to come back to it at some later time.

The text I’m using for the word-clouds in these examples is from Obama’s 2016 State of the Union speech, which I downloaded from here.

Using graphics::text()

My initial attempt at producing word-clouds relied on the base text() function, which plots character strings on the active graphics device. In the example below, the words are printed in grey using graphics::text(). I have then drawn a bounding box around each character string in blue, using graphics::strwidth() and graphics::strheight() to compute the string dimensions.

Word-cloud using graphics::text()
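As a rough illustration of the bounding-box step, here is a minimal sketch using only base graphics. The word, position and cex value are made up for the example, and pdf(NULL) is only there so that it runs without a screen device. It is not the code that produced the figure above.

```r
# Open a throwaway device so the example also runs without a display
pdf(NULL)
plot.new()
plot.window(xlim = c(0, 10), ylim = c(0, 10))

# Hypothetical word, position and size
word <- "America"
x <- 5
y <- 5
cex <- 2

# Draw the word centred on (x, y)
text(x, y, word, cex = cex, col = "grey")

# String dimensions in user coordinates
w <- strwidth(word, cex = cex)
h <- strheight(word, cex = cex)

# Bounding rectangle centred on the same point
box <- c(xleft = x - w / 2, ybottom = y - h / 2,
         xright = x + w / 2, ytop = y + h / 2)
rect(box[["xleft"]], box[["ybottom"]], box[["xright"]], box[["ytop"]],
     border = "blue")

dev.off()
```

One thing to keep in mind is that strheight() reports a height based on the font size and the number of lines, so the box does not stretch to cover descenders like the tail of a ‘y’.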

The problem with this approach to word-clouds is that the collision detection and fitting of words on the page is limited to the rectangles that bound each word. Try as I might, I could not find a way of getting access to the coordinates that make up the outline of the individual glyphs, and these are needed for collision detection that works on the letter shapes themselves rather than on their bounding boxes (if you’re reading this and know a way of doing it - please get in touch and let me know how!!). For example, it is not possible to fit smaller words into the empty space that surrounds a word but sits within its bounding rectangle (or into the empty space of an individual letter, like printing a word inside the ‘hole’ of an ‘O’ character) whilst ensuring there are no collisions between words. The result is that the word-cloud looks sparse and sub-optimally packed…in my opinion, because after all, the aesthetic merit of a word-cloud is subjective!
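To make the limitation concrete, here is a small sketch of what rectangle-only collision detection looks like. The function name boxes_overlap() and the box coordinates are invented for illustration. Note how a small box that sits entirely inside a bigger one is always reported as a collision, even if the bigger word has no ink at that spot.

```r
# Each box is a named numeric vector: xleft, ybottom, xright, ytop
boxes_overlap <- function(a, b) {
  a[["xleft"]] < b[["xright"]] && a[["xright"]] > b[["xleft"]] &&
    a[["ybottom"]] < b[["ytop"]] && a[["ytop"]] > b[["ybottom"]]
}

# Hypothetical bounding boxes for a large word and two small ones
big           <- c(xleft = 0,  ybottom = 0, xright = 10, ytop = 4)
small_inside  <- c(xleft = 4,  ybottom = 1, xright = 6,  ytop = 2)
small_outside <- c(xleft = 11, ybottom = 0, xright = 13, ytop = 1)

inside_hit  <- boxes_overlap(big, small_inside)   # TRUE: rejected
outside_hit <- boxes_overlap(big, small_outside)  # FALSE: accepted
```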

Using polygons

So, I started asking around for help on how to get access to the coordinates that make up the perimeter of individual character glyphs (Stack Overflow, Reddit, …) and I got very little traction. But whilst doing this research, I came across this solution on Stack Overflow to a loosely related problem, which got me thinking: could I create my own set of character glyphs as polygons, and then create word-clouds with them?

The short answer is yes…but in doing so, it raised another question…is it worth it!!??

Creating glyph polygons from a PostScript file

The following is some rough code to serve as a demonstration for the approach I took to creating my own set of glyphs as polygons. It’s not complete but it gives a rough idea of the process.

In this example I start by creating a PostScript file for the character glyph ‘S’:

    /Times-Roman findfont
    10 scalefont
    setfont
    newpath 0 0 moveto (S) show
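If you would rather build that file from within R, something like the following sketch would do it. The %!PS header, setfont and showpage lines and the temporary file name are my own assumptions rather than part of the snippet above.

```r
# PostScript program for a single glyph, one command per line
ps_lines <- c(
  "%!PS",                          # assumed header line
  "/Times-Roman findfont",
  "10 scalefont",
  "setfont",                       # assumed: make the scaled font current
  "newpath 0 0 moveto (S) show",
  "showpage"                       # assumed: finish the page
)

# Write to a temporary file (hypothetical location)
ps_file <- tempfile(fileext = ".ps")
writeLines(ps_lines, ps_file)
```

The path in ps_file could then be handed to the tracing step as the input file. Tracing itself needs Ghostscript installed, so it is left out of this sketch.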

I then used the grImport package function PostScriptTrace() to trace the glyph to an XML file of nodes and paths.

# Trace the postscript file to an XML file
grImport::PostScriptTrace("/", "/example.xml")

Finally, I can extract the nodes for the glyph and do some wrangling (to clean up start and end nodes) for plotting:

# Load xml2 for the XML helpers and magrittr for the %>% pipe
library(xml2)
library(magrittr)

# Read the XML
xml <- read_xml("/example.xml")

# Extract the XML node associated with the text paths (the other node is a summary)
letter_paths <- xml_find_all(xml, "text/path")

# Extract the x and y coordinates
x <- letter_paths %>% xml_find_all("move|line") %>% xml_attr("x") %>% as.numeric()
y <- letter_paths %>% xml_find_all("move|line") %>% xml_attr("y") %>% as.numeric()

# Remove last pair of coordinates
x <- x[-length(x)]
y <- y[-length(y)]

# plot
plot(x,y, asp=1, type="l")

Huzzah! The outline coordinates of the Times-Roman glyph ‘S’
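The extraction above lumps every coordinate into one x vector and one y vector, which is fine for ‘S’ because it is a single closed outline. For glyphs made of several paths it can help to keep each path separate. Here is a sketch of that idea using a hand-written XML string as a stand-in. The XML below is invented to mimic the text/path/move/line layout and is not real PostScriptTrace() output.

```r
library(xml2)

# Invented stand-in for a traced glyph with two sub-paths
doc <- read_xml(paste0(
  "<picture><text>",
  "<path><move x='0' y='0'/><line x='10' y='0'/>",
  "<line x='10' y='10'/><line x='0' y='0'/></path>",
  "<path><move x='2' y='2'/><line x='4' y='2'/>",
  "<line x='4' y='4'/><line x='2' y='2'/></path>",
  "</text></picture>"
))

# One node per path
paths <- xml_find_all(doc, "text/path")

# One data frame of coordinates per path, instead of one long vector
path_coords <- lapply(paths, function(p) {
  nodes <- xml_find_all(p, "move|line")
  data.frame(
    x = as.numeric(xml_attr(nodes, "x")),
    y = as.numeric(xml_attr(nodes, "y"))
  )
})
```

Each element of path_coords can then be plotted or cleaned up on its own, which is handy once outlines and holes need to be told apart.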

Note that the coordinates of each glyph are all relative to an origin at x=0, y=0, which became important (and a pain) when I started to construct words from the individual glyphs.
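To give a flavour of why the shared origin matters, here is a simplified sketch of laying glyphs out side by side. The glyph shapes, the place_word() helper and the fixed gap are all made up. A real font has proper advance widths and kerning, which this ignores.

```r
# Two made-up glyph outlines, each starting at the origin
glyphs <- list(
  L = data.frame(x = c(0, 4, 4, 1, 1, 0, 0), y = c(0, 0, 1, 1, 6, 6, 0)),
  I = data.frame(x = c(0, 1, 1, 0, 0),       y = c(0, 0, 6, 6, 0))
)

# Shift each glyph to the right of the previous one
place_word <- function(chars, glyphs, gap = 0.5) {
  cursor <- 0
  placed <- vector("list", length(chars))
  for (i in seq_along(chars)) {
    g <- glyphs[[chars[i]]]
    g$x <- g$x + cursor        # move the glyph to the cursor position
    placed[[i]] <- g
    cursor <- max(g$x) + gap   # advance past this glyph
  }
  placed
}

word <- place_word(c("L", "I", "L"), glyphs)
```

Because every glyph starts at the origin, each one has to be translated before it is drawn, and the same offsets have to be applied to any hole paths that belong to it.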

Create complete polygon glyph set

Obviously I didn’t repeat this process letter by letter for the whole alphabet. Instead, the PostScript file I created contained all of the character glyphs from A-z so that I could extract all of the glyphs individually in one go.

Having all of the individual letters allowed me to convert them into polygons by defining which paths are external and which are internal ‘hole’ paths. Note that some glyphs have multiple external paths, like ‘i’, and some glyphs have multiple internal ‘hole’ paths, like ‘B’. I did this process manually (and it took a little while to do), but I knew I would only need to do it once (per font glyph set!). I think this would be an interesting problem to tackle programmatically - but I haven’t tried it.
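For what it’s worth, one possible programmatic approach is to look at the winding direction of each path using the signed (shoelace) area. This is only a sketch and rests on an assumption I have not checked against the traced data: that hole paths are drawn in the opposite direction to the outer path. The shapes below are invented rectangles, not real glyphs, and nested shapes (an island inside a hole) are not handled.

```r
# Signed area via the shoelace formula: positive for anticlockwise rings
signed_area <- function(x, y) {
  n <- length(x)
  j <- c(2:n, 1)                       # index of the next vertex
  sum(x * y[j] - x[j] * y) / 2
}

# Made-up 'O'-like shape: an outer ring and an oppositely wound inner ring
outer <- data.frame(x = c(0, 6, 6, 0), y = c(0, 0, 8, 8))  # anticlockwise
hole  <- data.frame(x = c(2, 2, 4, 4), y = c(2, 6, 6, 2))  # clockwise
rings <- list(outer, hole)

areas <- vapply(rings, function(r) signed_area(r$x, r$y), numeric(1))

# Assume the largest ring is external; opposite winding means hole
outer_sign <- sign(areas[which.max(abs(areas))])
is_hole <- sign(areas) != outer_sign

# Draw the shape with its hole; NA separates the sub-paths
pdf(NULL)
plot.new()
plot.window(xlim = c(0, 6), ylim = c(0, 8), asp = 1)
polypath(x = c(outer$x, NA, hole$x), y = c(outer$y, NA, hole$y),
         col = "grey", rule = "evenodd")
dev.off()
```

graphics::polypath() fills the region between the rings and leaves the hole empty, so it is a convenient way to eyeball whether the classification came out right.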

The resulting glyph set is shown below, with the polygon sections filled in different colours.

Complete glyph set

So there it is. A very convoluted way of getting fine control over character glyphs and their plotting coordinates in R. To recap, I have essentially created a PostScript file of character glyphs and then traced them all back into R as a ‘lookup’.

Of course, for the glyphs to be useful for the word-cloud project, I then had to create code that would ‘stitch’ the individual glyphs together to form words. That is probably best saved for another blog post on word-cloud generation (as it was tricky and painful to do, with many pitfalls!). But I will just include the two images below, which show the output that I ended up being able to produce. You’ll see that I was finally able to print words within the empty space of other words!

Word-cloud using polygon glyphs

Word-cloud using polygon glyphs - zoomed in
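The reason polygons make this possible is that you can ask whether a point lands on the ink of a glyph or in its empty space. Here is a rough sketch of such a test using even-odd ray casting. The function names and the ‘O’-like rings are invented for illustration, and this is not the collision code from the project.

```r
# Ray casting: does a horizontal ray from (px, py) cross the ring an odd
# number of times?
point_in_ring <- function(px, py, x, y) {
  n <- length(x)
  j <- c(n, 1:(n - 1))                 # index of the previous vertex
  crosses <- ((y > py) != (y[j] > py)) &
    (px < (x[j] - x) * (py - y) / (y[j] - y) + x)
  sum(crosses) %% 2 == 1
}

# Even-odd rule over all rings: inside an odd number of rings means ink
point_on_ink <- function(px, py, rings) {
  hits <- vapply(rings, function(r) point_in_ring(px, py, r$x, r$y),
                 logical(1))
  sum(hits) %% 2 == 1
}

# Made-up 'O'-like shape: outer ring plus one hole
rings <- list(
  data.frame(x = c(0, 6, 6, 0), y = c(0, 0, 8, 8)),
  data.frame(x = c(2, 2, 4, 4), y = c(2, 6, 6, 2))
)

on_stroke <- point_on_ink(1, 1, rings)  # on the ink of the letter
in_hole   <- point_on_ink(3, 4, rings)  # inside the hole: free space
outside   <- point_on_ink(7, 4, rings)  # outside the glyph altogether
```

A point in the hole counts as free space, which is exactly what a bounding-rectangle test cannot tell you.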

All code for these word-clouds can be seen on GitHub here. This project is a work in progress and is not working properly yet. There are separate branches for the different approaches I have taken.

Yixuan’s fontr package

I’ll just also note here that another way to access character glyphs is through Yixuan’s fontr package. I will go into this in more detail in my post on word-clouds, because it is really neat and produces much better results than my nasty polygon solution described here!

Chris Holmes
Senior Data Scientist

PhD physicist making his way in the world of data science!
