
The Regular Expression Rundown
by Crispin Roven 21 Aug 1997

Crispin Roven is a Wired Digital engineer. He won the "most geeked-out attire" award at this year's Webbys.


Q: What the !^.*$! is a "regular expression"?
- Baldemar
A: Regular expressions are programming constructs that look like #!?#@!!# comic-book expletives and can be wondrously powerful tools.
Regular expressions are used to recognize patterns within textual data. Their use has become so widespread that they appear in configuration files, mail filters, text editors, and any number of programming languages. Any application that acts on text may very well harness their power.
Regular expressions evaluate text data and return an answer of true or false. That is, either the expression correctly describes the data, or it doesn't. What data the expression evaluates and what transpires after a successful match depends entirely on the application. We might substitute new text in the place of the text matched by a regular expression. We might save the matched text in a variable for later use. We might execute a new program when we see a correct match. And so on.
There are several variants, but all regular expressions consist of characters to be matched as well as a series of special characters that can be said to further describe the data. In Unix, the grep utility is a simple starting point for understanding the work of regular expressions. The expression can be a simple string, and the input data can be a named list of files. Let's look at some examples, using grep to get a sense of how regular expressions work.
Let's say we want to find all the <title> tags in a directory of HTML files. The code would look like this:
grep -i '<title>' *.html
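grep evaluates whether or not each line in each *.html file matches the description <title>. The -i flag tells grep to ignore case, so <TITLE> counts as well. If the line is a match, then grep's standard behavior when searching more than one file is to print out the file name and the matching line.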
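To see that behavior without touching real files, here is a minimal sketch that builds a scratch directory first. The file names a.html and b.html and their contents are made up for illustration, and mktemp -d is assumed to be available. The second half uses grep's -q flag, which prints nothing and reports the true-or-false answer through the exit status alone.

```shell
# Build a small scratch directory with two hypothetical HTML files.
dir=$(mktemp -d)
printf '<html>\n<TITLE>Worm Farming</TITLE>\n</html>\n' > "$dir/a.html"
printf '<html>\n<h1>No title here</h1>\n</html>\n' > "$dir/b.html"

# With more than one file named, grep prefixes each match with its file name.
matches=$(cd "$dir" && grep -i '<title>' *.html)
echo "$matches"

# grep's exit status carries the true/false answer: 0 for a match, 1 for none.
if grep -qi '<title>' "$dir/b.html"; then
    b_has_title=yes
else
    b_has_title=no
fi
echo "b.html has a title tag: $b_has_title"

# Clean up the scratch directory.
rm -rf "$dir"
```

Only a.html is reported, and the match succeeds even though its tag is written in capitals, because of -i.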
Pretty soon, we'll want to ask more sophisticated questions of our text data. We may want to add further restrictions and qualifications, or we may want to make our expression more general. In short, we'll need to start using regular expressions' set of descriptive "metacharacters." Let's look at a few cases.
Placeholders and repetition:
Let's say our directory of HTML files has 100 files and 100 <title> tags, and we want to narrow our search a little to see only the titles that make reference to "worms."
grep -i '<title>.*worms' *.html
We've introduced two new metacharacters. The "." means "any character." The "*" means 0 or more instances of the previous character. What we've said here is "match any line that contains a 'begin title' tag followed by any number of characters, as long as the word 'worms' appears before the end of the line." The "." is very important. If we'd said:
grep -i '<title>*worms' *.html
then we'd be looking for lines that looked like this:
<title>>>>>>>>>>>>>>>>>worms
(The * character would be looking for 0 or more instances of >, which is not very useful. Since zero instances count too, even a line containing <titleworms would match.)
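The difference between the two expressions is easy to check on a few test lines. This is a small sketch with made-up sample text piped to grep on standard input; the variable names are only for illustration.

```shell
# Three hypothetical sample lines, fed to grep on standard input.
sample='<title>All about worms</title>
<title>>>worms
<titleworms'

# ".*" allows any run of characters between the tag and the word.
# The -c flag prints a count of matching lines instead of the lines.
with_dot=$(printf '%s\n' "$sample" | grep -ci '<title>.*worms')

# Without the ".", the "*" applies only to the ">" just before it.
without_dot=$(printf '%s\n' "$sample" | grep -i '<title>*worms')

echo "lines matched with the dot: $with_dot"
echo "lines matched without the dot:"
echo "$without_dot"
```

With the dot, the first two lines match. Without it, the ordinary title on the first line is missed, while the two odd-looking lines are the ones that match.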
Range:
We frequently find that we want to make our expressions much more general. It would be quite inconvenient to enter 10 regular expressions if we're only interested in matching any of the characters from 0 to 9. The range symbol [] allows us to conveniently group characters together, so [0-9] matches any single digit. We can also use [.*] to match either of those punctuation characters. (NOTE: Inside the brackets, grep treats dots and stars as ordinary characters. Outside the brackets, we put a backslash before a dot or star, as in \. and \*, in order to turn off its behavior as a special character. This is called "escaping" the character. Some implementations, such as Perl's, accept the backslash inside the brackets as well.)
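One especially powerful feature of the range function is the ability to negate it. We can match "anything but" the list of characters. In [^1234], the caret inside this range operator means "match any single character other than 1, 2, 3, or 4." Note that a negated range still has to match a character: it finds lines that contain something other than those digits, not lines that lack them.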
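Here is a short sketch of ranges and negated ranges at work, again on made-up sample lines. The variable names are hypothetical, and -c is used to print a count of matching lines.

```shell
# Hypothetical sample lines for trying out ranges.
sample='room 7
room x
1234
12345'

# [0-9] stands for any single digit, replacing ten separate expressions.
digits=$(printf '%s\n' "$sample" | grep 'room [0-9]')

# [^1234] needs at least one character that is not 1, 2, 3, or 4,
# so "1234" is the only line that fails to match.
not_1234=$(printf '%s\n' "$sample" | grep -c '[^1234]')

# Inside brackets, grep already treats "." and "*" as ordinary characters.
punct=$(printf '%s\n' 'a*b' 'a.b' 'axb' | grep -c 'a[.*]b')

echo "$digits"
echo "lines with a character outside 1-4: $not_1234"
echo "lines with a literal dot or star: $punct"
```

The line 12345 matches the negated range because of its 5, which shows that the negation applies to a single character rather than to the line as a whole.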
Here's a useful example: Find all the hrefs that point to URLs that mistakenly have a space in them. This example uses the enhanced regular expressions of egrep.
egrep -i 'href="[^"]* [^"]*"' *.html
In other words, find the href lines that have a space between the begin quote and end quote. We use the range operator here to signify any character other than a quote.
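To try this expression safely, the sketch below writes a hypothetical links.html into a scratch directory and searches it. It uses grep -E, which is the POSIX spelling of egrep; the file name and its contents are assumptions made for the example.

```shell
# Create a scratch file with one good link, one bad link, and one
# link whose visible text (not its URL) contains a space.
dir=$(mktemp -d)
cat > "$dir/links.html" <<'EOF'
<a href="good-page.html">fine</a>
<a href="bad page.html">broken</a>
<a href="also-fine.html">two words</a>
EOF

# [^"]* cannot cross a quote, so the space must sit inside the quoted URL.
bad=$(grep -Ei 'href="[^"]* [^"]*"' "$dir/links.html")
echo "$bad"

# Clean up the scratch directory.
rm -rf "$dir"
```

Only the second line is reported. The third line has a space in its link text, but that space lies outside the quotes, so the negated range keeps it from matching.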
Position:
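There are two main characters that enable us to restrict our match to a location within the string. We can match either the beginning (^) or the end ($) of our input data, which for grep means the beginning or end of each line. (The caret has this meaning only outside of brackets; inside a range it negates, as we saw above.) This is more useful than it might immediately seem.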
For example, let's say we want to find the HTML tags that are not closed before the line break.
egrep '<[^>]*$' *.html
In other words, we're looking for a "less than" followed by a continuous chain of characters other than "greater thans" all the way to the end of the line.
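Both anchors can be tried on a few sample lines. This is a sketch on made-up input, with grep -E standing in for egrep and hypothetical variable names.

```shell
# Hypothetical sample: an <img> tag is split across two lines.
sample='<p>closed tags only</p>
<img src="worm.gif"
  alt="a worm">
<title>Worms</title>'

# "$" pins the match to the end of the line: a "<" with no ">" after it.
unclosed=$(printf '%s\n' "$sample" | grep -E '<[^>]*$')

# "^" pins the match to the start of the line.
starts=$(printf '%s\n' "$sample" | grep -c '^<title>')

echo "tag left open at end of line: $unclosed"
echo "lines that begin with a title tag: $starts"
```

The line that opens the img tag is the only one reported as unclosed, since every other "<" is followed by a ">" before the line ends.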
That should suffice as an introduction. The set of special descriptive characters will differ across regular-expression implementations, but if you keep in mind that their uses fall into a few basic categories, you'll have no trouble learning them. Position, range, repetition, and placeholders are the foundations of regular expressions.