Counting Words with R

Published Nov 12, 2017. Last updated Nov 18, 2017.

The Thought

An integral part of text mining is determining how often words occur in a given set of documents. I have put together some simple R code to demonstrate how to do this.

The word frequency code shown below lets the user specify the minimum and maximum frequency of word occurrence and filter out stop words before counting. The stop-word filter can be turned off if there is a need to examine the frequencies of common words. The list of stop words used can be produced with the following code.

tm::stopwords("SMART")

The text document is read with readr and wrapped as a plain-text document using the text mining package tm. The words are counted with the tau package. The filter function from dplyr then selects the rows of the data frame whose counts fall between the lower and upper frequencies. A user could substitute other selection criteria if needed.
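The steps described above can be sketched on a toy string using only base R. This is a minimal illustration, not the code used in this post: the sample text, the two-word stop list (`toy_stop_words`), and the names `min_freq` and `max_freq` are all made up for the example.

```r
# Toy text standing in for the real document (illustrative only).
text <- "the cat and the dog and the bird saw the cat"

# Hypothetical miniature stop-word list; the post uses tm::stopwords("SMART").
toy_stop_words <- c("the", "and")

# Frequency bounds, playing the role of a and b in the post (both exclusive).
min_freq <- 1
max_freq <- 3

# Split on whitespace and lower-case the tokens.
tokens <- tolower(strsplit(text, "\\s+")[[1]])

# Drop the stop words.
tokens <- tokens[!tokens %in% toy_stop_words]

# Count how often each remaining word occurs.
counts <- table(tokens)
freq <- data.frame(word = names(counts),
                   frequency = as.integer(counts),
                   stringsAsFactors = FALSE)

# Keep the words whose count lies strictly between the two bounds.
results <- freq[freq$frequency > min_freq & freq$frequency < max_freq, ]
print(results)
```

With these toy bounds only words that occur exactly twice survive the filter. Widening `min_freq` and `max_freq` changes the selection in the same way that changing a and b does in the full script.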

The Code


#######################################################################################################
#
# Description: Determine Word Frequency of a Text File
#
# Location: N/A
#
# Program name: N/A
# 
# Source code: v1.0
#
# Author: Jason Watts
#
# Sys.info: SnowWhite and the 88 Dwarfs
#
# Computational Framework: Microsoft R Open version: >=3.4.2
#
# Web Framework: RStudio - N/A
#
# Analytics Dashboard Framework: N/A
#
# Plotting and Graphics: Plotly: ggplot2: >=2.2.1
# 
# License: Private with Open Source components. Open Source components require credits with distribution.  
#######################################################################################################

# Load Required Libraries

library(ggplot2)
library(tm)
library(tau)
library(plyr)
library(dplyr)
library(readr)
library(plotly)

# Set Minimum and Maximum Word Frequency
a <- 90
b <- 100

# Remove Stop Words - Yes/No
stop_words <- TRUE

# Read Text File
data <- tm::PlainTextDocument(readr::read_lines(file = "/home/jhwatts/Documents/Snips/KJB/kjv.txt", 
    progress = interactive()), heading = "KJB", id = basename(tempfile()), 
    language = "en", description = "Report File")

# Tokenize Text, Optionally Remove Stop Words, and Count Words
tokens <- tm::scan_tokenizer(data)

if (stop_words) {
  tokens <- tm::removeWords(tokens, tm::stopwords("SMART"))
}

data <- tau::textcnt(tokens, method = "string", n = 1L, lower = 1L)

# Change List to Data Frame
data <- plyr::ldply(data, data.frame) 

# Keep Words with a Frequency Strictly Between a and b
Results <- dplyr::filter(data, data[, 2] > a & data[, 2] < b)

colnames(Results) <- c("word", "frequency")

# Static Radar-Style Plot (Bar Chart in Polar Coordinates)
ggplot2::ggplot(Results, aes(x = word, y = frequency, fill = word)) +
  geom_bar(width = 0.75, stat = "identity", colour = "black", size = 1) +
  coord_polar(theta = "x") +
  ggtitle("Word Frequency") +
  theme(legend.position = "none") +
  labs(x = NULL, y = NULL)

# Interactive Bar Chart with plotly
plotly::ggplotly(
  ggplot2::ggplot(Results, aes(x = word, y = frequency, fill = word)) +
    geom_bar(width = 0.75, stat = "identity", colour = "black", size = 1) +
    ggtitle("Word Frequency") +
    labs(x = NULL, y = NULL) +
    theme(legend.position = "none",
          plot.subtitle = element_text(vjust = 1),
          plot.caption = element_text(vjust = 1),
          axis.text.x = element_text(angle = 90),
          panel.background = element_rect(fill = "honeydew1"),
          plot.background = element_rect(fill = "antiquewhite"))
) %>%
  config(displaylogo = FALSE) %>%
  config(showLink = FALSE)

The Product

I used ggplot2 to generate a radar plot of each word and its number of occurrences, and added an interactive plotly version to allow zooming in on larger data sets. A radar plot seems to be the simplest to read without interactivity.

WordFrequency.png

The plot shows all of the words that occur more than 90 and fewer than 100 times in the entire King James Bible.

Going further, the word frequency code can help to examine the patterns of specific authors by how often certain words occur. The document used in this example is the Bible. I suspect one could split a document, such as the Bible, into chapters and compare the words that occur in each using something like:

if (all(Book_A %in% Book_B)) {match <- TRUE}

A match could then be used to suggest which authors wrote which material in the books.
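The all-or-nothing test above can be softened into an overlap measure. The following is only a sketch of that idea on made-up data: the word vectors `book_a` and `book_b` and the names `all_shared`, `overlap`, and `only_in_a` are hypothetical, and a high overlap on its own is not evidence of shared authorship.

```r
# Hypothetical word vectors for two books (toy data, not taken from the Bible).
book_a <- c("light", "darkness", "water", "earth")
book_b <- c("light", "water", "earth", "heaven", "sea")

# Strict test, as in the snippet above: is every word of book_a also in book_b?
all_shared <- all(book_a %in% book_b)

# Softer measure: share of book_a's distinct words that also occur in book_b.
vocab_a <- unique(book_a)
overlap <- mean(vocab_a %in% book_b)

# Words that appear in book_a but never in book_b.
only_in_a <- setdiff(vocab_a, book_b)

print(all_shared)
print(overlap)
print(only_in_a)
```

A proportion is more forgiving than `all()`, which fails as soon as a single word is missing from the second book.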
