Star Trek text exploration

Rudimentary analysis of the text contained in Star Trek TV series scripts


With Jean-Luc Picard back on our screens again in the new Star Trek: Picard TV series, I’ve been re-watching some old episodes of Star Trek: The Next Generation, and it got me wondering if there was some fun text analysis that could be done with the scripts of old Star Trek series.

After only two minutes of googling, I started trying to scrape the web pages that host all of the available scripts. Unfortunately, I have no experience with scraping and was really struggling to make progress before I stumbled upon the raw text scripts and processed lines of all Star Trek series scripts, already shared on Kaggle!

So what follows is a rudimentary analysis of this data.

Stardate distributions

Data in

Start by loading packages and reading in all scripts

library(tidyverse)  # dplyr, purrr, stringr, ggplot2, readr
library(rjson)      # fromJSON(file = ...)
library(patchwork)  # for combining ggplots later

raw <- fromJSON(file='all_scripts_raw.json')

# Take a look at the summary
summary(raw) %>% knitr::kable()
      Length  Class   Mode
DS9      173  -none-  list
TOS       80  -none-  list
TAS       22  -none-  list
TNG      176  -none-  list
VOY      160  -none-  list
ENT       97  -none-  list

Immediately I can see that I am going to want to order any analysis by the original air date of the Star Trek series, so here I create a named vector for convenience. This vector can also be used as a look-up to quickly convert the three letter series abbreviations to their full names.

s_order <- c(TOS = "The original series", 
             TAS = "The animated series",
             TNG = "The next generation",
             DS9 = "Deep space 9",
             VOY = "Voyager",
             ENT = "Enterprise")
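For example, the named vector works both ways: indexing by abbreviation returns the full title, and `names()` gives the air-date ordering used for the factor levels (re-defining `s_order` so this chunk stands alone).

```r
# Re-define the lookup so the chunk is self-contained
s_order <- c(TOS = "The original series", 
             TAS = "The animated series",
             TNG = "The next generation",
             DS9 = "Deep space 9",
             VOY = "Voyager",
             ENT = "Enterprise")

s_order[["TNG"]]   # "The next generation"
names(s_order)     # the three-letter codes in original air-date order
```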

I want to keep the structure of the analysis ‘tidy’, so I am going to start by storing all of the script information in a dataframe.

df <-
  # Create a dataframe containing all series names and all episode names per series
  map_df(names(raw), ~tibble(series = .x, episode = names(raw[[.x]]))) %>% 
  # Add a column that stores the episode number as an integer
  # Add a list column that contains the character vector of the script for each episode
  mutate(episode_number = str_extract(episode, "[0-9]+") %>% as.integer(),
         script = map2(series, episode, ~raw[[.x]][[.y]])) %>% 
  # Convert the series column into an ordered factor
  mutate(series = ordered(series, names(s_order)))

# Take a look at a few rows
df %>% sample_n(10)
## # A tibble: 10 x 4
##    series episode     episode_number script   
##    <ord>  <chr>                <int> <list>   
##  1 TNG    episode 13              13 <chr [1]>
##  2 ENT    episode 89              89 <chr [1]>
##  3 ENT    episode 0                0 <chr [1]>
##  4 VOY    episode 151            151 <chr [1]>
##  5 VOY    episode 20              20 <chr [1]>
##  6 ENT    episode 49              49 <chr [1]>
##  7 TOS    episode 47              47 <chr [1]>
##  8 TAS    episode 3                3 <chr [1]>
##  9 TNG    episode 160            160 <chr [1]>
## 10 VOY    episode 127            127 <chr [1]>

Although not evaluated here, this chunk is really useful as it writes each script to a .txt file in my working directory. This is so I can manually (shock!) read and look at the scripts if necessary (which it almost certainly will be!)

pmap(df, ~write_file(..4, path = paste0("scripts/", ..1, "_", ..3, ".txt"))) %>% invisible()

Stardate extraction

In my initial research into stardates (which only involved looking at this Wikipedia page), the following descriptions were very helpful

  • The Original Series era
    • The following edited excerpt from Star Trek Guide (April 17, 1967, p. 25) instructs writers for the original Star Trek TV series on how to select stardates for their scripts

Pick any combination of four numbers plus a percentage point, use it as your story’s stardate. The progression of stardates in your script should remain constant but don’t worry about whether or not there is a progression from other scripts.

  • The Next Generation era
    • They were described as follows (edited here) in Star Trek: The Next Generation Writer’s/Director’s Guide of March 23, 1987 (p. 13)

A stardate is a five-digit number followed by a decimal point and one more digit. Example: “41254.7.” The first two digits of the stardate are always “41.” The 4 stands for 24th century, the 1 indicates first season. The digit following the decimal point is generally regarded as a day counter.
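As a throwaway illustration of that TNG-era convention, here is a hypothetical helper (my own, not from the writer's guide) that decodes a stardate such as “41254.7” according to the rules quoted above:

```r
library(stringr)

# Decode a TNG-era stardate string into its documented parts
decode_stardate <- function(sd){
  digits <- str_remove(sd, "\\..*$")  # the five digits before the decimal point
  list(century = 20L + as.integer(str_sub(digits, 1, 1)),      # "4" -> 24th century
       season  = as.integer(str_sub(digits, 2, 2)),            # "1" -> first season
       day     = as.integer(str_extract(sd, "(?<=\\.)[0-9]")))  # the day counter
}

decode_stardate("41254.7")  # century 24, season 1, day 7
```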

So quite simply, the code I have implemented here to scrape the stardates from the scripts looks for the string ‘stardate’ followed by either

  • at least 4 digits (to capture both The Original Series era (4 digits) and The Next Generation era (5 digits)), then a decimal point, then one or more further digits as the ‘day counter’ (which should really only ever need to match a single digit), or
  • at least 4 digits on their own

The reason for both searches is because I quickly realised that not all stardates have a decimal point and a following value ‘day counter’. Some are just integers. I’ve also chosen to only scrape numbers that follow the string ‘stardate’ as I found I was unintentionally getting hits on number strings that have nothing to do with stardates.

Define the regular expression to use

rx <- "(?<=stardate )[0-9]{4,}\\.[0-9]+|(?<=stardate )[0-9]{4,}"
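A quick sanity check of what the pattern does and doesn't match. Note that the alternation tries the decimal form first, so a stardate like 41254.7 is captured whole rather than truncated at the decimal point:

```r
library(stringr)

rx <- "(?<=stardate )[0-9]{4,}\\.[0-9]+|(?<=stardate )[0-9]{4,}"

str_extract_all("captain's log, stardate 41254.7", rx)[[1]]  # "41254.7"
str_extract_all("stardate 47550 exactly", rx)[[1]]           # "47550"
str_extract_all("heading 1254.7, mark 3", rx)[[1]]           # character(0): no 'stardate ' prefix
```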

Map across all of the scripts extracting the stardates and also create a stardates_str column for convenience that prints the stardates inline.

df <-
  df %>% 
  mutate(stardates = map(script, ~unlist(str_extract_all(tolower(.x), pattern = rx))),
         stardates_str = map_chr(stardates, ~paste0("[", .x , "]", collapse=" ")))

df %>% sample_n(10)
## # A tibble: 10 x 6
##    series episode   episode_number script stardates stardates_str          
##    <ord>  <chr>              <int> <list> <list>    <chr>                  
##  1 VOY    episode ~            102 <chr ~ <chr [5]> [32611.4] [32623.5] [3~
##  2 TNG    episode ~             54 <chr ~ <chr [1]> [43385.6]              
##  3 TOS    episode ~             75 <chr ~ <chr [3]> [5832.3] [5832.5] [583~
##  4 ENT    episode ~             11 <chr ~ <chr [0]> []                     
##  5 TOS    episode 6              6 <chr ~ <chr [6]> [1329.8] [1329.1] [132~
##  6 DS9    episode 8              8 <chr ~ <chr [0]> []                     
##  7 DS9    episode ~             32 <chr ~ <chr [4]> [47550] [47550] [47552~
##  8 DS9    episode 4              4 <chr ~ <chr [0]> []                     
##  9 TNG    episode ~             53 <chr ~ <chr [1]> [43349.2]              
## 10 VOY    episode ~             79 <chr ~ <chr [2]> [51679.4] [51658.2]

Stardate visualisation

ts <- 
  df %>% 
  select(series, episode_number, stardates) %>% 
  unnest(cols=stardates) %>%
  mutate(stardates = as.numeric(stardates)) %>%
  ggplot(aes(episode_number, stardates, col=series))+
  # A geom was missing here; points are the natural choice for this time-series
  geom_point()+
  scale_color_viridis_d("", option="plasma", end=0.85)+
  scale_y_continuous("Stardates", breaks = scales::pretty_breaks(8))+
  scale_x_continuous("Episode number", breaks = scales::pretty_breaks(10))

bp <-
  df %>% 
  select(series, episode_number, stardates) %>% 
  unnest(cols=stardates) %>%
  mutate(stardates = as.numeric(stardates)) %>%
  distinct() %>% 
  ggplot(aes(series, stardates, fill=series))+
  # geom_violin() +
  geom_boxplot(outlier.shape = 4, show.legend = F) +
  scale_fill_viridis_d(option="plasma", end=0.85)+  
  scale_y_continuous("Stardates", breaks=scales::pretty_breaks(8))

(ts + bp + 
  plot_layout(guides = "collect") & 
  theme(legend.position = 'bottom',
        plot.caption = element_text(colour="darkgrey"))) + 
  plot_annotation(title = "Star Trek stardate distributions",
                  subtitle = "Stardate time-series (left) and overall distribution (right)",
                  caption = paste0("[", names(s_order), ": ", s_order, "]", collapse=" "))

Why the outliers?

Let’s take a look at the DS9 outliers

df %>%
  filter(series == "DS9") %>% 
  unnest(cols = stardates) %>% 
  mutate(stardates = as.numeric(stardates)) %>%
  top_n(-2, stardates) %>% 
  select(-script) %>% 
  knitr::kable()

series  episode       episode_number  stardates  stardates_str
DS9     episode 48                48     1024.7  [1024.7]
DS9     episode 101              101     4523.7  [4523.7]

DS9 episode 48

“BASHIR: I’ve accessed the Trill central database. Let’s see what information there is on Joran Belar. Here we are. Born on stardate 1024.7, died on 8615.2.”

So Bashir is talking about Joran Belar’s birth date, which is not really relevant to the distributions and analysis I am doing here.

This excerpt also highlights another interesting point - I am missing the stardate 8615.2 because it is not directly preceded by the string ‘stardate’. Whilst that’s a happy mistake for this analysis, it highlights the perils of this type of work - I have extracted the birth stardate (a clear outlier for DS9) but have failed to extract the death stardate.
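One hedged way to close that particular gap (a sketch only; it would need checking for false positives across the whole corpus) is to make the decimal part optional and add a second lookbehind for the ‘died on’ phrasing seen in this excerpt:

```r
library(stringr)

# Extended pattern: optional day counter, plus a 'died on ' lookbehind
rx2 <- "(?<=stardate )[0-9]{4,}(\\.[0-9]+)?|(?<=died on )[0-9]{4,}(\\.[0-9]+)?"

line <- "born on stardate 1024.7, died on 8615.2."
str_extract_all(line, rx2)[[1]]  # "1024.7" "8615.2"
```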

DS9 episode 101

“DULMUR: The man was a menace. What was the date of your arrival? SISKO: Stardate 4523.7. DULMUR: A hundred and five years, one month, and twelve days ago. LUCSLY: A Friday. DULMUR: What was the Enterprise doing?”

Sisko is talking about something that occurred a hundred and five years, one month, and twelve days ago!

A TNG mistake?

Let’s look at the massively low stardate outlier in TNG

df %>%
  filter(series == "TNG") %>% 
  unnest(cols = stardates) %>% 
  mutate(stardates = as.numeric(stardates)) %>%
  top_n(-1, stardates) %>% 
  select(-script) %>% 
  knitr::kable()

series  episode      episode_number  stardates  stardates_str
TNG     episode 11               11     4124.5  [41242.4] [4124.5] [41242.45] [41242.5]

Episode 11 appears to be set around stardate 41242, but we see one stardate at 4124.5 (roughly ten times lower).

Examining the script reveals

First Officer’s log, stardate 4124.5. We have found…

And I have just checked on Netflix, and it’s not a mistake in the script file or the text parsing! It is exactly what Riker says! (the first line after the intro sequence)


Wordclouds

The other obvious thing I can do with these scripts is visualise some wordclouds (mainly because I have already developed some of my own code to generate wordclouds - not because I think wordclouds are useful, or good!)

It might be interesting to look at the wordclouds of individual characters, making comparisons across series, so here I read in the second JSON file that contains the lines associated with each character and series.

Data in

raw <- fromJSON(file='all_series_lines.json')

Clean scripts

I’ve noticed that some of these records have undesirable text towards the end of the script (some don’t) and there seem to be a few variations. Here I define these ‘footers’ for easy removal in the next step

footer1 <- "<Backto the episode listing"
footer2 <- "<Back to the episode listing"
footer3 <- "Star Trek ® is copyright of Studios Inc. Copyright © 1966, Present. The Star Trek web pages on this site are for educational and entertainment purposes only. All other copyrights property of their respective holders."
footer4 <- "Star Trek ® is copyright of Studios INC. Copyright © 1966, Present. The Star Trek web pages on this site are for educational and entertainment purposes only. All other copyrights property of their respective holders."
footer5 <- "Star Trek ® and relatedmarks are trademarks of Studios Inc. Copyright © 1966, Present. The Star Trek web pages on this site are foreducational and entertainment purposes only. All other copyrightsproperty of their respective holders."
footer6 <- "Star Trek ® and related marks are trademarks of Studios Inc. Copyright © 1966, Present. The Star Trek web pages on this site are for educational and entertainment purposes only. All other copyrights property of their respective holders."

# Make one vector of known footers
footers <- paste(footer1, footer2, footer3, footer4, footer5, footer6, sep="|")

Now make a tibble containing all characters and their lines (with footers removed) and also count the number of lines (removing those with no lines)

df <-
  map_df(names(raw), ~tibble(series = .x, episode = names(raw[[.x]]))) %>% 
  mutate(character = map2(series, episode, ~names(raw[[.x]][[.y]]))) %>% 
  unnest(character) %>% 
  mutate(lines = pmap(., ~raw[[..1]][[..2]][[..3]])) %>% 
  mutate(clean_lines = map(lines, ~str_replace_all(.x, footers, ""))) %>%  
  mutate(n_lines = map_int(lines, length)) %>% 
  filter(n_lines > 0)

# Take a look
## # A tibble: 6 x 6
##   series episode   character lines       clean_lines n_lines
##   <chr>  <chr>     <chr>     <list>      <list>        <int>
## 1 ENT    episode 0 KLAANG    <chr [22]>  <chr [22]>       22
## 2 ENT    episode 0 DOCK      <chr [6]>   <chr [6]>         6
## 3 ENT    episode 0 ARCHER    <chr [216]> <chr [216]>     216
## 4 ENT    episode 0 FUTURE    <chr [8]>   <chr [8]>         8
## 5 ENT    episode 0 SARIN     <chr [19]>  <chr [19]>       19
## 6 ENT    episode 0 STUDENTS  <chr [2]>   <chr [2]>         2

Clean, extract and count words from lines

Here I define a function that takes a vector of lines and cleans it (removing punctuation), and converts it to a vector of words. The vector of words is then filtered to remove empty strings and oddly encoded words and a grouped count is returned

extract_and_count_words <- function(lines){
  # Remove punctuation and split the lines into a flat vector of words
  word_vector <- 
    str_replace_all(lines, "[:punct:]", "") %>% 
    str_split(" ") %>% 
    unlist()
  # Return a grouped word count, dropping empty strings and oddly encoded words
  tibble(words = tolower(word_vector)) %>%
    filter(words != "") %>% 
    filter(Encoding(words) != "UTF-8") %>%
    count(words)
}

# Example on one character's lines
##  [1] "Looks pretty authentic, Captain, right down to the spittoons."                                                                                                                                                
##  [2] "That sounds about right."                                                                                                                                                                                     
##  [3] "Aye, sir."                                                                                                                                                                                                    
##  [4] "Morning."                                                                                                                                                                                                     
##  [5] "I hope so. We need a horse."                                                                                                                                                                                  
##  [6] "Excuse me?"                                                                                                                                                                                                   
##  [7] "Sounds kind of steep. Don't suppose you'd be interested in atrade?"                                                                                                                                           
##  [8] "Look, I know it's not worth the horse, but we only need him fora couple of hours. We just want to ride out and grab our gear. We'll beback before you know it. Well, how about I leave my gun for collateral."
##  [9] "I've seen every John Ford western."                                                                                                                                                                           
## [10] "How hard can it be?"                                                                                                                                                                                          
## [11] "Better hold on."                                                                                                                                                                                              
## [12] "Where are you going?"                                                                                                                                                                                         
## [13] "He also said to keep a low profile."                                                                                                                                                                          
## [14] "Kinda late at night to be teaching school."                                                                                                                                                                   
## [15] "What's left of it."                                                                                                                                                                                           
## [16] "You're not coming with us?"                                                                                                                                                                                   
## [17] "Go ahead"                                                                                                                                                                                                     
## [18] "Captain. ARCHER"                                                                                                                                                                                              
## [19] "So what do we do? They're humans. We can't just leave them here."
# Run function
extract_and_count_words(df$clean_lines[[800]]) %>% 
  arrange(desc(n)) %>% 
  head(10)
## # A tibble: 10 x 2
##    words       n
##    <chr>   <int>
##  1 we          5
##  2 to          4
##  3 be          3
##  4 i           3
##  5 it          3
##  6 of          3
##  7 a           2
##  8 about       2
##  9 captain     2
## 10 do          2

Now apply this function to all rows of df

df <- df %>% mutate(words = map(clean_lines, extract_and_count_words))

Stop words with TF-IDF

Now, I want to start this section with the disclaimer that I have little to no idea what I’m doing here!

In order to remove common or unimportant words, I have done a small amount of research on term frequency - inverse document frequency (TF-IDF) and tried to apply it here in a very clunky fashion. The following excerpt describes the general idea

The TF-IDF weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
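As a toy numerical illustration of that definition (my own example, using tf = n/total and idf = log(N/df), the same formulas as the real computation in this post):

```r
library(dplyr)

# A toy corpus: three 'documents' (think characters) with per-word counts
counts <- tribble(
  ~doc, ~word,  ~n,
  "A",  "the",  10,
  "A",  "warp",  2,
  "B",  "the",   8,
  "B",  "tea",   1,
  "C",  "the",   5)

n_docs <- n_distinct(counts$doc)  # 3

tfidf_toy <-
  counts %>%
  group_by(doc) %>% mutate(tf = n / sum(n)) %>%                       # term frequency
  group_by(word) %>% mutate(idf = log(n_docs / n_distinct(doc))) %>%  # inverse document frequency
  ungroup() %>%
  mutate(tfidf = tf * idf)

# 'the' appears in all three documents, so idf = log(3/3) = 0 and its
# weight vanishes; 'warp' and 'tea' keep positive weights
```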

So first I compute the term frequency: the count of each word per character (and series), divided by that character’s total word count

term_freq <- 
  df %>% 
  select(series, character, words) %>% 
  unnest(cols = words) %>% 
  group_by(series, character, words) %>% 
  summarise(n = sum(n)) %>% 
  mutate(total_n = sum(n),
         tf  = n/total_n)

I then compute the number of documents (which I think is the number of distinct series/character combinations in my case)…

n_docs <- df %>% distinct(series, character) %>% nrow()

… and the document frequency (how many of all the characters (documents) have each term (word) in their lines)

# Compute DOCUMENT FREQUENCY (how many of the characters (documents) does each term (word) appear in)
doc_freq <-
  term_freq %>% 
  ungroup() %>% 
  distinct(series, character, words) %>% 
  group_by(words) %>% 
  summarise(df = n())

Finally I compute the TF-IDF (using the log of the IDF). I also apply an overall filter which only keeps words that appear in fewer than 25% of characters’ lines. I guess by doing that, I’m saying that if a word is said by a lot of characters (more than 25% of them) then it is not such a meaningful word when trying to find what distinguishes one character’s lines from another’s.

# Define some stop words as words that appear in more than x% of the documents
tf_idf <-
  doc_freq %>% 
  mutate(idf = log(n_docs / df)) %>% 
  inner_join(term_freq) %>% 
  mutate(tfidf = tf * idf) %>% 
  # Filter to words that appear in less than X% of the documents (X% of the characters say that word)
  filter(df <= n_docs*0.25)
## Joining, by = "words"

Visualise wordclouds

Finally I make a wrapper for my wordcloud function that filters to the provided series and character and passes the TF-IDF weight to scale the words by


startrek_wc <- function(my_series, my_character, nwords=250, seed=1, p_title = "char", ...){

  wcdf <-
    tf_idf %>% 
    filter(series == my_series, character == my_character) %>% 
    arrange(desc(tfidf)) %>%
    head(nwords) %>%
    rename(word = words, count = tfidf)

  cloudr::bmwc(data = wcdf,
               max_size = 350,
               seed = seed, 
               subtitle = ifelse(p_title == "char", my_character, my_series), 
               buffer = 10, 
               markup = F, 
               ...) +
    theme(plot.subtitle = element_text(colour="blue"))
}


So let’s first take a look at the main TNG characters

tng <-
  tibble(series = "TNG",
         character = c("PICARD","DATA","RIKER","WORF","CRUSHER","TROI"))

map2(tng$series, tng$character, ~startrek_wc(.x, .y)) %>% 
  patchwork::wrap_plots(ncol=1) +
  patchwork::plot_annotation(title = "Star Trek the next generation")
## All words placed (no overlaps)
## All words placed (no overlaps)
## All words placed (no overlaps)
## All words placed (no overlaps)
## All words placed (no overlaps)
## All words placed (no overlaps)

Not a lot that’s too surprising here. It’s apparent that characters say their own name quite a lot, presumably when they use their communicators (maybe I should remove each character’s name from their own word list to stop this happening). Other problems have also become clear: ‘La Forge’ (a surname) has been split into ‘La’ and ‘Forge’; similarly with ‘Number one’ (how PICARD often refers to RIKER), where the word ‘one’ has presumably been removed in the TF-IDF analysis; and there seems to be a spurious ‘2’ in TROI’s wordcloud. Of course, these problems can be fixed fairly easily.
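Removing each character’s own name from their word list would be a one-line filter. A sketch on a toy stand-in for df (the real per-character word tibbles would be filtered the same way):

```r
library(dplyr)
library(purrr)

# Toy stand-in: one row per character, with their counted words
df_toy <- tibble(
  character = c("PICARD", "DATA"),
  words = list(tibble(words = c("picard", "engage"), n = c(5L, 3L)),
               tibble(words = c("data", "sir"), n = c(4L, 9L))))

# Drop each character's own (lower-cased) name from their word count
df_toy <- 
  df_toy %>%
  mutate(words = map2(words, character, ~filter(.x, words != tolower(.y))))

map_chr(df_toy$words, ~paste(.x$words, collapse=" "))  # "engage" "sir"
```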

Understandably Picard says ‘Q’ a lot and Data’s words are dominated by ‘sir’. It’s also nice to see that Beverly (CRUSHER) says ‘Jeanluc’ a lot (I remember that she had some kind of relationship with Picard and did often address him in a more informal manner!). Finally, Troi’s wordcloud is dominated by ‘mother’ which makes total sense!


A comparison between the wordclouds of the COMPUTERs

comp <-
  tibble(series = c("TOS", "TNG", "DS9", "VOY"), 
         character = c("COMPUTER"))

map2(comp$series, comp$character, ~startrek_wc(.x, .y, p_title="series")) %>% 
  patchwork::wrap_plots(ncol=1) +
  patchwork::plot_annotation(title = "Star Trek COMPUTERs")
## All words placed (no overlaps)
## All words placed (no overlaps)
## All words placed (no overlaps)
## All words placed (no overlaps)


A comparison between the wordclouds of the CAPTAINs

capt <-
  tibble(series = c("TOS", "TNG", "DS9", "VOY"),
         character = c("KIRK", "PICARD", "SISKO", "JANEWAY"))

map2(capt$series, capt$character, ~startrek_wc(.x, .y)) %>% 
  patchwork::wrap_plots(ncol=1) +
  patchwork::plot_annotation(title = "Star Trek CAPTAINs")
## All words placed (no overlaps)
## All words placed (no overlaps)
## All words placed (no overlaps)
## All words placed (no overlaps)

Some spot characters for interest

And some extra ones for fun (of course, SEVEN)

startrek_wc("VOY", "SEVEN")
## All words placed (no overlaps)

startrek_wc("VOY", "NEELIX")
## All words placed (no overlaps)

Chris Holmes
Senior Data Scientist

PhD physicist making his way in the world of data science!
