
Ruby: Searching and Replacing using Regular Expressions

Christoph Wagner
Apr 17, 2015
<p>The other day, we were tackling an interesting problem in one of my sessions: a student wanted to learn how to output a block of text into his view, but with certain words <em>highlighted</em> via CSS. Here's how we approached this problem.</p> <p>First of all, the text in question was stored in a Rails model (as a <code>text</code> field). It was plain text, containing no HTML tags of its own. The goal, then, was to output this text in the view while inserting HTML tags <em>around</em> the words that he wanted to highlight. This way, he could simply edit his stylesheet to apply any formatting he desired (such as font family, style, color, underlining, and so on). We decided on using a <code>&lt;span&gt;</code> tag with a custom class name for this purpose.</p> <p>So, now we have formulated a proper problem setting: we have a well-defined starting point (plain text from a model) and a well-defined outcome (plain text containing HTML tags around certain words). Now, how do we get there?</p> <p>Our first approach was using a word array. Since he was using the <a href="http://rubywordcount.com/">WordsCounted gem</a> to analyze the text for the number of words it contained, we already had an array of all the words available, in order of appearance, lowercased, and with punctuation removed. Shouldn't be too hard to add some tags around them, right?</p> <p>Except that's totally not going to get us where we want.</p> <p>The problem with this approach, of course, is that we want to see the <em>original</em> text, including all capitalization and punctuation, in order for it to be human readable. So operating on a word array, as convenient as it may be, is out of the question. What we really need is a <em>string-based</em> algorithm: one that takes the original string as input and returns a modified string with all the tags inserted at the right places. And if we could make that a Rails helper, that would be great. 
</p> <p>So, my suggestion was that we look into <a href="http://ruby-doc.org/core-2.1.1/Regexp.html">Regular Expressions</a>, often referred to as "Regex" or "Regexp" by programmers. Regular expressions are a sort of "mini programming language" for describing patterns in text (the name "regular" comes from formal language theory, where they describe so-called regular languages). You can read more about the theory on the <a href="http://en.wikipedia.org/wiki/Regular_expression">Wikipedia page for Regular Expressions</a>; for the purposes of this tutorial, we're simply concerned with what they can do for us.</p> <p>A regular expression is basically a fancy way to perform <a href="http://en.wikipedia.org/wiki/String_searching_algorithm"><em>string matching</em></a>, i.e. finding certain substrings inside larger strings. But instead of matching just <em>static</em> strings, like "dog", they allow us to be more flexible by supporting wildcards and other constructs. So we can look for "dog", "dogs", "dogfood", and anything else starting with "dog", for instance.</p> <p>Now, regexes are notorious for being a little tricky. In my career, I have met many programmers who basically just stayed away from them because they never really understood them. But that's a shame, because they are really powerful. And with a little practice, they become quite tame. Nevertheless, I never expect to get it right the first time, so I always make sure to have an easy way to test them while I'm developing.</p> <p>In this case, we started by building the highlighting part first, using a very simple regex, and refined it later. So, that's the approach we're going to take in this article. 
So without further ado, let's jump right into the code!</p> <pre><code class="language-ruby">module ArticleHelper
  def highlight(text)
    # Wrap every case-insensitive occurrence of "test" in a span
    text.gsub(/(test)/i, '&lt;span class="highlight"&gt;\1&lt;/span&gt;').html_safe
  end
end</code></pre> <p>Let's pick that apart for a second: first, we are using the <a href="http://ruby-doc.org/core-2.1.1/String.html#method-i-gsub"><code>String#gsub</code></a> method, which takes a regular expression (or a string) as its first argument and another string as its second argument, and returns a copy of the string it is invoked on in which every occurrence of the first argument is replaced with the second.</p> <p>Now, in our regular expression, we just match the string "test", which we're going to modify later. Notice that we've enclosed that in parentheses, which have a special meaning: they create what's called a <em><a href="http://ruby-doc.org/core-2.1.1/Regexp.html#class-Regexp-label-Capturing">capturing group</a></em>, which means that we can refer to the substring that it matched. In the second argument, we have a <code>&lt;span&gt;</code> tag, which I assume needs no further explanation. But what's with the <code>\1</code>? Ah. That refers to the capturing group we set up in our regular expression, in this case the first (and only) one. The <code>i</code> after the last slash is a <em>modifier</em> or <em>option</em>; it means that the regular expression should ignore capitalization, i.e. it will match "test" whether it's capitalized or not.</p> <p>Finally, since we're inserting HTML tags into the string, we have to mark it as <a href="http://api.rubyonrails.org/classes/String.html#method-i-html_safe"><em>HTML safe</em></a>, so that Rails won't escape the tags before inserting the result. Keep in mind that <code>html_safe</code> marks the entire string as trusted, so this is only appropriate when the text itself is known to contain no HTML. 
Now, we can use this helper in our view:</p> <pre><code class="language-html">&lt;%= highlight @article.content %&gt;</code></pre> <p>In our CSS, we set the highlight class to show up in red, so we can immediately see all the matches:</p> <pre><code class="language-css">.highlight { color: red; }</code></pre> <p>The result? Any occurrence of the string "test" inside the article appears in red, whether it is a single word or part of another word. So, both "<em>test</em>ing" and "At<em>test</em>" are partly highlighted. Not quite what we want yet. Can we do better? Yes we can.</p> <p>In regular expressions, we can also use what are called <em><a href="http://ruby-doc.org/core-2.1.1/Regexp.html#class-Regexp-label-Anchors">anchors</a></em>: special symbols that match a position in the string rather than a character. In this case, the anchor we're interested in is <code>\b</code>, which matches word boundaries, in other words the beginning or end of a word. So, we simply update our helper, and replace the first argument to <code>gsub</code> with the following regular expression: <code>/\b(test)\b/i</code>. Reload the page, and BOOM! Only occurrences of "test" as a single word are highlighted now.</p> <p>We're getting pretty close now. The only requirement we're still missing is that we wanted to be able to match a <em>list</em> of words, not just a single word. For this tutorial, we'll assume we have that list readily available, stored in a constant called <code>HIGHLIGHT_WORDS</code>, which is an array of strings.</p> <p>In regular expressions, the <em>pipe symbol</em> (<code>|</code>) means "or". So in order to match two words, for instance, we'd want to write <code>/\b(test|best)\b/i</code>. So, we first need to turn our array of words into a single string, using the pipe symbol as delimiter. Fortunately, that's very easy in Ruby: <code>HIGHLIGHT_WORDS.join("|")</code> does the trick. We're almost there! 
All we need to do is get this into the regular expression.</p> <p>Lucky for us, Ruby actually supports string interpolation inside of regular expressions, which means we can use our trusty <code>#{}</code>, just like we can in double-quoted strings. In other words, our final helper looks like this:</p> <pre><code class="language-ruby">module ArticleHelper
  def highlight(text)
    text.gsub(/\b(#{HIGHLIGHT_WORDS.join("|")})\b/i, '&lt;span class="highlight"&gt;\1&lt;/span&gt;').html_safe
  end
end</code></pre> <p>And it still (almost) fits on a single line of code!</p> <p>Hope you enjoyed this article! If you want to learn more about regular expressions, Ruby, Rails, or JavaScript/CoffeeScript, don't hesitate to send me a message. I'd love to help you out.</p>
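<p>One thing to watch out for with the final helper: a string interpolated into a regular expression is treated as regex source, so a word containing a metacharacter such as <code>.</code> would be read as a pattern rather than as literal text. Here is a minimal sketch of a more defensive variant in plain Ruby (no Rails required), assuming the word list might contain such characters and the text might contain HTML. The contents of <code>HIGHLIGHT_WORDS</code> and the name <code>HIGHLIGHT_PATTERN</code> are made up for illustration; this is one possible approach, not the helper we built in the session.</p>

```ruby
require "erb"

# Hypothetical word list; the article only assumes that such a constant exists.
HIGHLIGHT_WORDS = ["test", "best", "v1.0"].freeze

# Regexp.escape turns each word into a literal pattern, so the "." in "v1.0"
# matches only a real dot instead of any character.
HIGHLIGHT_PATTERN =
  /\b(#{HIGHLIGHT_WORDS.map { |word| Regexp.escape(word) }.join("|")})\b/i

def highlight(text)
  # Escape the original text first, so any markup it contains is displayed
  # rather than interpreted; only the span tags we add ourselves stay raw.
  escaped = ERB::Util.html_escape(text)
  escaped.gsub(HIGHLIGHT_PATTERN, '<span class="highlight">\1</span>')
end

puts highlight("A Test of v1.0, not v1x0 or testing <b>")
```

<p>Inside a Rails helper, the result would still need <code>.html_safe</code>, as in the article. One caveat of escaping before matching is that words which also occur as entity names (such as "amp" or "lt") could match inside the escaped entities, so a list containing those would need extra care.</p>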