Jason Watts

Counting Words with R

Published Nov 12, 2017. Last updated May 10, 2018.

The Thought

An integral part of text mining is determining how often each word occurs in a document. I have put together some simple R code to demonstrate how to do this.

The word frequency code shown below lets the user specify the minimum and maximum frequency of word occurrence and filter out stop words before counting. Stop word filtering can be turned off if there is a need to examine the frequencies of common words. The list of stop words used can be produced with the following code.

tm::stopwords("SMART")
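To illustrate what the stop word filter does, here is a minimal sketch in base R. The sample tokens and the three-word stop list are made up for illustration; the post itself relies on the much longer SMART list from tm. Tokens are lower-cased before matching, on the assumption that capitalized forms such as "The" should be filtered as well.

```r
# Hypothetical sample tokens and a tiny stand-in stop list (not the SMART list).
tokens <- c("The", "Lord", "is", "my", "shepherd")
stop_list <- c("the", "is", "my")

# Keep only the tokens whose lower-cased form is not in the stop list.
kept <- tokens[!(tolower(tokens) %in% stop_list)]
print(kept)
```

Swapping `stop_list` for `tm::stopwords("SMART")` should give the same kind of result with the full list, provided tm is installed.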

The text file is read with readr and wrapped as a document with the text mining package tm. The words are counted with the tau package. The filter function from dplyr then selects the rows of the data frame whose counts fall between the lower and upper frequency limits. Other selection criteria could be substituted if needed.
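The count-then-filter step can also be sketched without any packages. This is a base R stand-in, not the tau and dplyr code used in this post; the helper name `count_words`, the sample lines, and the bounds are all hypothetical.

```r
# Hypothetical helper: count the words in a character vector of lines.
count_words <- function(lines) {
  # Lower-case, then split on any run of characters that are not
  # letters or apostrophes.
  tokens <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  tokens <- tokens[nzchar(tokens)]  # drop empty strings left by the split
  counts <- table(tokens)
  data.frame(word = names(counts),
             frequency = as.integer(counts),
             stringsAsFactors = FALSE)
}

sample_lines <- c("In the beginning God created the heaven and the earth.",
                  "And the earth was without form, and void.")
freq <- count_words(sample_lines)

# Keep the rows strictly between the lower and upper bounds.
lower_bound <- 1
upper_bound <- 4
selected <- freq[freq$frequency > lower_bound & freq$frequency < upper_bound, ]
print(selected)
```

Because both comparisons are strict, a word that occurs exactly `upper_bound` times ("the" in this sample) is excluded.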

The Code


#######################################################################################################
#
# Description: Determine Word Frequency of a Text File
#
# Location: N/A
#
# Program name: N/a
# 
# Source code: v1.0
#
# Author: Jason Watts
#
# Sys.info: SnowWhite and the 88 Dwarfs
#
# Computational Framework: Microsoft R Open version: >=3.4.2
#
# Web Framework: RStudio - N/A
#
# Analytics Dashboard Framework: N/A
#
# Plotting and Graphics: Plotly: ggplot2: >=2.2.1
# 
# License: Private with Open Source components. Open Source components require credits with distribution.  
#######################################################################################################

# Load Required Libraries

library(ggplot2)
library(tm)
library(tau)
library(plyr)
library(dplyr)
library(readr)
library(plotly)

# Set Minimum and Maximum Word Frequency
a <- 90
b <- 100

# Remove Stop Words - TRUE/FALSE
stop_words <- TRUE

# Read Text File
data <- tm::PlainTextDocument(readr::read_lines(file = "/home/jhwatts/Documents/Snips/KJB/kjv.txt", 
    progress = interactive()), heading = "KJB", id = basename(tempfile()), 
    language = "en", description = "Report File")

# Tokenize Text, Optionally Remove Stop Words, and Count Words
tokens <- tm::scan_tokenizer(data)
if (stop_words) {
  tokens <- tm::removeWords(tokens, tm::stopwords("SMART"))
}
data <- tau::textcnt(tokens, method = "string", n = 1L, lower = 1L)

# Change List to Data Frame
data <- plyr::ldply(data, data.frame) 

# Keep Words with Frequency Strictly Between a and b
Results <- dplyr::filter(data, data[, 2] > a & data[, 2] < b)

colnames(Results)<-c("word", "frequency")

# Polar Bar Chart of Word Frequencies
ggplot2::ggplot(Results, aes(x = word, y = frequency, fill = word)) +
  geom_bar(width = 0.75, stat = "identity", colour = "black", size = 1) +
  coord_polar(theta = "x") +
  xlab("") + ylab("") + ggtitle("Word Frequency") +
  theme(legend.position = "none") +
  labs(x = NULL, y = NULL)

# Interactive Bar Chart with plotly
plotly::ggplotly(
  ggplot2::ggplot(Results, aes(x = word, y = frequency, fill = word)) +
    geom_bar(width = 0.75, stat = "identity", colour = "black", size = 1) +
    xlab("") + ylab("") + ggtitle("Word Frequency") +
    theme(legend.position = "none") +
    labs(x = NULL, y = NULL) +
    theme(plot.subtitle = element_text(vjust = 1),
          plot.caption = element_text(vjust = 1),
          axis.text.x = element_text(angle = 90)) +
    theme(panel.background = element_rect(fill = "honeydew1"),
          plot.background = element_rect(fill = "antiquewhite"))
) %>% config(displaylogo = FALSE) %>% config(showLink = FALSE)

The Product

I used ggplot2 to generate a radar-style plot (a bar chart in polar coordinates) of each word and its number of occurrences, and added an interactive plotly version that allows zooming in on larger data sets. Without interactivity, the radar-style plot seems to be the easiest to read.

The plot shows all of the words that occur more than 90 and fewer than 100 times in the entire King James Bible.

Going further, the word frequency code can help to examine the patterns of specific authors through how often certain words occur. The document used in this example is the Bible. I suspect one could split a document such as the Bible into books or chapters and compare the words that occur in each, using something like:

if(all(Book_A %in% Book_B)==T) {match<-T}

This would flag a match between two parts of the text, which could hint at which authors wrote which material.
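As a rough sketch of that idea, the vocabularies of two texts can be compared as sets. `Book_A` and `Book_B` are taken here to be character vectors of words, which is an assumption, since the post does not define them. The sample words and the Jaccard overlap score are illustrative additions.

```r
# Hypothetical vocabularies for two texts (character vectors of distinct words).
Book_A <- c("grace", "faith", "law")
Book_B <- c("grace", "faith", "law", "spirit", "flesh")

# Strict test from the post: every word of Book_A also appears in Book_B.
is_subset <- all(Book_A %in% Book_B)

# Softer measure (an addition): share of the combined vocabulary common to both.
jaccard <- length(intersect(Book_A, Book_B)) / length(union(Book_A, Book_B))

print(is_subset)
print(jaccard)
```

The strict subset test will rarely succeed on real texts, so a graded overlap score may be more informative. Either way, shared vocabulary alone is weak evidence of authorship.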

For those who would like to test the code, the text version of the King James Bible is available on my server for download.

KJB text file

