Image: www.npr.org


Introduction


The State of the Union address was originally intended to be a report from the President to Congress on the condition of the nation and on what the President believes the nation’s priorities should be going forward. It has since shifted towards an address to the American people as well as to Congress. I thought these addresses would be a great way to practice text analytics: given the nature of the speeches, one would expect to be able to uncover the sentiment of each address and surface its important topics.

I gathered addresses from 1981 to 2017 from the American Presidency Project, a non-profit, non-partisan archive of presidential documents on the internet. Each address is fairly long, so reading every one would take a while; instead, I hoped to use text analytics tools to uncover the general sentiment of each address.




Import Data


The data were compiled into a pipe-delimited (|) csv file that you can download here.

# packages used throughout this post
library(tidyverse)  # readr, dplyr, tidyr, ggplot2
library(tm)
library(tidytext)
library(wordcloud)

# Path to data
path <- "~/Documents/github/diving4data/data/StateOfTheUnion/"
# read file
union <- read_delim(paste0(path, "StateOfUnion.csv"), delim = "|")

# make President column an ordered factor
union <- union %>% 
  mutate(
    President = factor(
      President, 
      levels = c("Ronald Reagan", "George H. W. Bush",
                 "Bill Clinton", "George W. Bush",
                 "Barack Obama", "Donald J. Trump")
      )
  )
# structure
str(union, give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame':    37 obs. of  4 variables:
##  $ President: Factor w/ 6 levels "Ronald Reagan",..: 6 5 5 5 5 5 5 5 5 4 ...
##  $ Year     : int  2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 ...
##  $ Party    : chr  "Republican" "Democrat" "Democrat" "Democrat" ...
##  $ Address  : chr  "Thank you very much. Mr. Speaker, Mr. Vice President, Members of Congress, the First Lady of the United States,"| __truncated__ "Mr. Speaker, Mr. Vice President, Members of Congress, my fellow Americans: Tonight marks the eighth year that I"| __truncated__ "Mr. Speaker, Mr. Vice President, Members of Congress, my fellow Americans: We are 15 years into this new centur"| __truncated__ "Mr. Speaker, Mr. Vice President, Members of Congress, my fellow Americans: Today in America, a teacher spent ex"| __truncated__ ...



Create Corpus


Now that we have the data in R, we are going to transform the Address column into a Corpus object using the tm package. A corpus is a collection of documents containing (natural language) text. The VectorSource() function tells Corpus() to treat each element of the Address column as an individual document.

# read in Address column as corpus
corpus <- Corpus(VectorSource(union$Address))

Now that we have our corpus object, let’s remove punctuation and stop words (and, the, or, etc.) from the documents using tm_map() together with removePunctuation() and removeWords().

# remove punctuation
corpus <- corpus %>% 
  tm_map(removePunctuation) %>% 
  tm_map(removeWords, c(stopwords("en")))

The corpus object carries two kinds of metadata: corpus-level metadata, stored as tag-value pairs, and document-level metadata, stored within the corpus as a data frame.
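If you want to see these pieces directly, tm’s meta() function exposes both (a quick sketch; the exact print output depends on your tm version, and the document-level data frame is empty unless you add tags):

# corpus-level metadata (tag-value pairs)
meta(corpus, type = "corpus")
# document-level metadata (a data frame)
head(meta(corpus))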

Now we are going to use DocumentTermMatrix(), which coerces the corpus into a document-term matrix recording how frequently each word appears in each address, and hence how sparse each term is.

frequencies <- DocumentTermMatrix(corpus)

Now that we have our Document Term Matrix (DTM), we can remove its sparsest terms. With sparse = 0.95, any term missing from more than 95% of the addresses is dropped, so we keep only terms that appear in at least roughly 5% of them.

sparse <- removeSparseTerms(frequencies, sparse = 0.95)
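To see how much this shrinks the matrix, you can compare the dimensions before and after (a quick sketch; dim() returns documents by terms):

# documents x terms, before and after dropping sparse terms
dim(frequencies)
dim(sparse)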

Our DTM is now much smaller. We can reformat it into a tibble, which is easier to work with.

union.sparsed <- as_tibble(as.matrix(sparse))

union.sparsed[1:4, 1:6]
## # A tibble: 4 x 6
##   `100` `100th` `116` `2001` `2015` `250`
##   <dbl>   <dbl> <dbl>  <dbl>  <dbl> <dbl>
## 1     1       1     1      1      1     2
## 2     0       0     0      0      1     0
## 3     0       0     0      0      1     0
## 4     1       0     0      0      0     0

Each column refers to a word in our DTM, each row refers to an address, and each cell gives the number of times that word appears in that address.
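If you just want a quick look at the most common terms without reshaping anything, tm’s findFreqTerms() is handy (a sketch; the threshold of 200 is arbitrary):

# terms appearing at least 200 times across all addresses
findFreqTerms(sparse, lowfreq = 200)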



Tidy Data


All our data is there; now we need to reshape it into something that is easier to analyze. We will follow the tidy data philosophy, which makes the data much easier to manipulate and analyze, using the tidyr package.

It appears that there are 6,007 word columns in our sparse DTM, so after binding them onto the original data frame we need to gather those columns into a long format.

union.combined <- union %>% 
  bind_cols(union.sparsed) %>%
  select(-Address) %>% 
  gather(word, count, 4:6010) %>% 
  arrange(desc(Year))

nrow(union.combined)
## [1] 222259

This is the tidy format we are looking for, but around 174,000 of these rows have a count of zero. Therefore, we are going to remove those entries.

union.combined <- union.combined %>% 
  filter(count > 0) %>% 
  arrange(desc(Year), desc(count))

nrow(union.combined)
## [1] 47900

As you can see, the number of entries in the union.combined data frame has greatly decreased.



Analyze Text


Average Word Count per Address

# calculate total words for each address
address.length <- union.combined %>% 
  group_by(President, Year) %>% 
  summarise(Total = sum(count)) %>%
  arrange(Year)

# plot the data and mean
address.length %>% 
  ggplot(aes(Year, Total, fill = President)) +
  geom_bar(stat = "identity", colour = "white") +
  geom_hline(yintercept = mean(address.length$Total)) +
  ggtitle("Democratic Presidents Have Longer Speeches on Average") +
  theme_bw()

It appears that Bill Clinton and Barack Obama have consistently longer addresses than the other presidents. It is perhaps interesting to note that they are the only two Democratic Presidents in the data.
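As a rough check on that claim, we can compare the mean address length by party, since the Party column is carried through to union.combined (a minimal sketch):

# mean words per address, by party
union.combined %>% 
  group_by(Party, Year) %>% 
  summarise(Total = sum(count)) %>%     # words per address
  summarise(MeanWords = mean(Total))    # average within each party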


Term Frequency Inverse Document Frequency (tf-idf)

In this part of the analysis, we are going to use tf-idf to gauge how important a word is to a given address relative to how often it appears in the other addresses. Words that are frequent in one address but rare across the rest receive higher tf-idf scores. According to Wikipedia, tf-idf is defined as:

In information retrieval, tf–idf is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval, text mining, and user modeling. The tf-idf value increases proportionally to the number of times a word appears in the document, but is often offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general. Nowadays, tf-idf is one of the most popular term-weighting schemes. For instance, 83% of text-based recommender systems in the domain of digital libraries use tf-idf.

The tf-idf values can be calculated with the tidytext package. I recommend checking out the free online book that shows you, step by step, how to do text analysis in R using this package; it was a major resource for this analysis.

union.tf_idf <- union.combined %>% 
  bind_tf_idf(word, Year, count)

union.tf_idf[1:4, -3]
## # A tibble: 4 x 7
##         President  Year     word count          tf        idf       tf_idf
##            <fctr> <int>    <chr> <dbl>       <dbl>      <dbl>        <dbl>
## 1 Donald J. Trump  2017     will    56 0.021789883 0.00000000 0.0000000000
## 2 Donald J. Trump  2017 american    33 0.012840467 0.00000000 0.0000000000
## 3 Donald J. Trump  2017  america    27 0.010505837 0.00000000 0.0000000000
## 4 Donald J. Trump  2017  country    21 0.008171206 0.02739897 0.0002238827
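As a sanity check on what bind_tf_idf() computes, the "country" row above can be reproduced by hand. The idf value implies that 36 of the 37 addresses contain "country" (an inference from the output, not something I verified separately):

# term frequency: occurrences of "country" over total words in the 2017 address
tf <- 21 / sum(union.combined$count[union.combined$Year == 2017])
# inverse document frequency: log of (number of addresses / addresses containing the term)
idf <- log(37 / 36)
tf * idf  # matches the tf_idf reported for "country" above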


Top Words by Presidency

Now that we have tf-idf calculated for each word, let’s plot the top words for each President. Granted, some Presidents have more speeches in the data than others, but we can still get a look at which topics were important during each presidency.

# create df and make word a factor variable
president.tf_idf <- union.tf_idf %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word))))

# plot tf-idf for each president
president.tf_idf %>% 
  group_by(President) %>% 
  top_n(9) %>% 
  ungroup %>%
  ggplot(aes(word, tf_idf, fill = President)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~President, ncol = 2, scales = "free") +
  coord_flip() +
  ggtitle("State of the Union Topics Unique to Presidencies") +
  theme_bw()

Some of the words in these plots seem obvious, like Saddam Hussein, which appears in both Bush presidencies, or the large prevalence of ISIL in Obama’s. This is intuitive, because the State of the Union is meant to address the most pressing problems the nation faces. Others were not so easy to understand, notably “100th” from Reagan’s addresses and “Ryan” in President Trump’s. To find out what these words meant, I had to dig into the text.

After doing this, I realized that “100th” referred to the 1988 State of the Union address given by President Ronald Reagan to the 100th United States Congress, and “Ryan” referred to U.S. Navy Special Operator, Senior Chief William “Ryan” Owens, who was killed in action and whose widow attended the 2017 State of the Union address.
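If you want to do that digging programmatically, a quick keyword-in-context search over the raw addresses works well. Here is a sketch using stringr (the five-word window is arbitrary):

# show "Ryan" with up to five words of context on either side in the 2017 address
library(stringr)
str_extract_all(
  union$Address[union$Year == 2017],
  "(?:\\S+\\s+){0,5}Ryan(?:\\s+\\S+){0,5}"
)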



Negative Sentiment

Lastly, I want to show how to investigate words with negative sentiment, which gives an idea of how each President describes the problems the nation faces. We will visualize the output as a word cloud, a visualization method that has become increasingly popular in text analytics.

Let’s take a look at two addresses that we would expect to carry relatively high negative sentiment: President Bush’s 2002 State of the Union, delivered after 9/11, and President Obama’s 2009 address, delivered after the 2008 financial crash.

To do this, we use the wordcloud package and the Bing sentiment lexicon that ships with the tidytext package. The Bing lexicon assigns each word one of only two sentiments, negative or positive.
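Before joining, you can peek at the lexicon itself to confirm this (a quick sketch):

# the Bing lexicon is a two-column tibble of words and their sentiment
get_sentiments("bing") %>% 
  count(sentiment)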

Subset the addresses for the years 2002 and 2009.

# filter for year 2002
negative.911 <- union.combined %>% 
  filter(Year == 2002) %>% 
  left_join(get_sentiments("bing")) %>% 
  filter(sentiment == "negative")

# filter for year 2009
negative.crash <- union.combined %>% 
  filter(Year == 2009) %>% 
  left_join(get_sentiments("bing")) %>% 
  filter(sentiment == "negative")

Create the two wordcloud plots.

2002 State of the Union: Post 9/11
# plot 911 wordcloud
plot.911 <- wordcloud(
  negative.911$word, negative.911$count, 
  min.freq = 1, colors = "red", random.order=FALSE
  )

2009 State of the Union: Financial Crisis
# plot crash wordcloud
plot.crash <- wordcloud(
  negative.crash$word, negative.crash$count, 
  min.freq = 1, colors = "red", random.order=FALSE
  )

Can you tell which plot belongs to which address?



Conclusion


I hope that this post gives you a glimpse into the power of text analytics and how you can explore a large amount of text with relatively few lines of code in order to get an idea of what the general message is about. At the very least, it gives you hints into what is important and where you should look to learn more.



Acknowledgements


Thanks for all the hard work put in to make my life easier! This wouldn’t have been possible without the help of:

