Matchmaking with regular expressions - JavaWorld July 2001
By Benedict Chng
Use the power of regular expressions to ease text parsing and processing
If you've programmed in Perl or any other language with built-in regular-expression capabilities, then you probably know how much easier regular expressions make text processing and pattern matching. If you're unfamiliar with the term, a regular expression is simply a string of characters that defines a pattern used to search for a matching string.
Many languages, including Perl, PHP, Python, JavaScript, and JScript, now support regular expressions for text processing, and some text editors use regular expressions for powerful search-and-replace functionality. What about Java? At the time of this writing, a Java Specification Request that includes a regular expression library for text processing has been approved; you can expect to see it in a future version of the JDK.
But what if you need a regular expression library now? Luckily, you can download the open source Jakarta ORO library from Apache.org. In this article, I'll first give you a short primer on regular expressions, and then I'll show you how to use regular expressions with the open source Jakarta-ORO API.
Regular expressions 101
Let's start simple. Suppose you want to search for a string with the word "cat" in it; your regular expression would simply be "cat". If your search is case-insensitive, the words "catalog", "Catherine", or "sophisticated" would also match:
Regular expression: cat
Matches: cat, catalog, Catherine, sophisticated
The period notation
Imagine you are playing Scrabble and need a three-letter word starting with the letter "t" and ending with the letter "n". Imagine also that you have an English dictionary and will search through its entire contents for a match using a regular expression. To form such a regular expression, you would use a wildcard notation -- the period (.) character. The regular expression would then be "t.n" and, in a case-insensitive search, would match "tan", "Ten", "tin", and "ton"; it would also match "t#n", "tpn", and even "t n", as well as many other nonsensical words. This is because the period matches any single character, including the space and the tab character. In Perl 5 syntax it also matches a line break, but only when single-line mode is enabled:
Regular expression: t.n
Matches: tan, Ten, tin, ton, t n, t#n, tpn, etc.
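The period's behavior can be tried out with a minimal sketch. It assumes the JDK's own java.util.regex package (available since JDK 1.4) rather than Jakarta-ORO, so that it runs without extra libraries; the class and method names are made up for illustration. Note that in java.util.regex the period does not match a line terminator unless the DOTALL flag is set.

```java
import java.util.regex.Pattern;

public class PeriodDemo {
    // "t.n": a literal t, any single character, then a literal n (case-insensitive).
    static final Pattern T_DOT_N = Pattern.compile("t.n", Pattern.CASE_INSENSITIVE);

    // True when the entire word is described by the pattern.
    static boolean matchesWord(String word) {
        return T_DOT_N.matcher(word).matches();
    }

    public static void main(String[] args) {
        for (String w : new String[] {"tan", "Ten", "t n", "t#n", "toon"}) {
            System.out.println("\"" + w + "\" -> " + matchesWord(w));
        }
    }
}
```

Only "toon" is rejected in this run, because the period stands for exactly one character.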
The bracket notation
To solve the problem of the period's indiscriminate matches, you can specify characters you consider meaningful with the bracket ("[]") expression, so that only those characters would match the regular expression. Thus, "t[aeio]n" would just match "tan", "Ten", "tin", and "ton". "Toon" would not match because you can only match a single character within the bracket notation:
Regular expression: t[aeio]n
Matches: tan, Ten, tin, ton
The OR operator
If you want to match "toon" in addition to all the words matched in the previous section, you can use the "|" notation, which is basically an OR operator. To match "toon", use the regular expression "t(a|e|i|o|oo)n". You cannot use the bracket notation here because it will only match a single character. Instead, use parentheses -- "()". You can also use parentheses for groupings (more on that later):
Regular expression: t(a|e|i|o|oo)n
Matches: tan, Ten, tin, ton, toon
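As a rough illustration of the difference between brackets and the OR operator, the sketch below compiles both expressions from the text. It assumes the JDK's java.util.regex package instead of Jakarta-ORO, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class AlternationDemo {
    // Brackets match exactly one character out of the listed set.
    static final Pattern BRACKET =
            Pattern.compile("t[aeio]n", Pattern.CASE_INSENSITIVE);

    // Alternation inside parentheses can match multi-character options such as "oo".
    static final Pattern ALTERNATION =
            Pattern.compile("t(a|e|i|o|oo)n", Pattern.CASE_INSENSITIVE);

    static boolean bracketMatches(String word) {
        return BRACKET.matcher(word).matches();
    }

    static boolean alternationMatches(String word) {
        return ALTERNATION.matcher(word).matches();
    }

    public static void main(String[] args) {
        System.out.println("bracket/toon:     " + bracketMatches("toon"));
        System.out.println("alternation/toon: " + alternationMatches("toon"));
    }
}
```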
The quantifier notations
Table 1 shows the quantifier notations, which specify how many times the element immediately to the left of the quantifier should repeat:
Table 1. Quantifier notations
Notation    Number of times
*           0 or more times
+           1 or more times
?           0 or 1 time
{n}         Exactly n times
{n,m}       n to m times
Let's say you want to search for a social security number in a text file. The format for US social security numbers is 999-99-9999. The regular expression you would use to match this is shown in Figure 1. Inside brackets, the hyphen ("-") has special meaning: it indicates a range, so "[0-9]" matches any digit from 0 to 9. To match the literal hyphens in a social security number, Figure 1 escapes the "-" character with a backslash ("\").
Figure 1. Matches: All social security numbers of the form 123-12-1234
If, in your search, you wish to make the hyphen optional -- if, say, you consider both 999-99-9999 and 999999999 acceptable formats -- you can use the "?" quantifier notation. Figure 2 shows that regular expression:
Figure 2. Matches: All social security numbers of the forms 123-12-1234 and 123121234
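Because the figures' expressions are not reproduced in this text, the sketch below uses two patterns that are assumptions consistent with the description: digit ranges, the {n} quantifier, escaped hyphens, and "?" to make each hyphen optional. It relies on the JDK's java.util.regex package rather than Jakarta-ORO, and all names are hypothetical.

```java
import java.util.regex.Pattern;

public class SsnDemo {
    // Hyphens required. In Java source, each regex backslash is written twice.
    static final Pattern STRICT =
            Pattern.compile("[0-9]{3}\\-[0-9]{2}\\-[0-9]{4}");

    // "?" after each escaped hyphen makes that hyphen optional.
    static final Pattern OPTIONAL_HYPHENS =
            Pattern.compile("[0-9]{3}\\-?[0-9]{2}\\-?[0-9]{4}");

    static boolean isStrictSsn(String s) {
        return STRICT.matcher(s).matches();
    }

    static boolean isLenientSsn(String s) {
        return OPTIONAL_HYPHENS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isStrictSsn("123-12-1234"));  // hyphenated form
        System.out.println(isLenientSsn("123121234"));   // form without hyphens
    }
}
```

One consequence of making each hyphen optional independently is that a mixed form such as 123-121234 is accepted too.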
Let's take a look at another example. One format for US car plate numbers consists of four numeric characters followed by two letters. The regular expression first comprises the numeric part, "[0-9]{4}", followed by the textual part, "[A-Z]{2}". Figure 3 shows the complete regular expression:
Figure 3. Regular expression: [0-9]{4}[A-Z]{2}
Matches: Typical US car plate numbers, such as 8836KV
The NOT notation
The "^" notation is also called the NOT notation. When it appears as the first character inside brackets, "^" negates the set, so the brackets match any single character except those listed. For example, the expression in Figure 4 matches all words except those starting with the letter X.
Figure 4. Matches: All words except those that start with the letter X
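A small sketch of the NOT notation follows. The figure's expression is not reproduced in this text, so the pattern "[^X][a-z]*" is an assumption chosen to fit the description; it uses the JDK's java.util.regex package rather than Jakarta-ORO, and the names are hypothetical.

```java
import java.util.regex.Pattern;

public class NotDemo {
    // [^X] matches any single character except an uppercase X;
    // [a-z]* then allows zero or more lowercase letters.
    static final Pattern NOT_X = Pattern.compile("[^X][a-z]*");

    static boolean isWordNotStartingWithX(String word) {
        return NOT_X.matcher(word).matches();
    }

    public static void main(String[] args) {
        for (String w : new String[] {"Zebra", "Xylophone", "xenon"}) {
            System.out.println(w + " -> " + isWordNotStartingWithX(w));
        }
    }
}
```

The pattern is case-sensitive as written, so a lowercase "x" at the start is still accepted.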
The parentheses and space notations
Say you're trying to extract the birth month from a person's birthdate. The typical birthdate is in the following format: June 26, 1951. The regular expression to match the string would be like the one in Figure 5:
Figure 5. Matches: All dates with the format of Month DD, YYYY
The new "\s" notation is the space notation and matches all blank spaces, including tabs. If the string matches perfectly, how do you extract the month field? You simply put parentheses around the month field, creating a group, and later retrieve the value using the ORO API (discussed in a following section). The appropriate regular expression is in Figure 6:
Figure 6. Matches: All dates with the format Month DD, YYYY, and extracts Month field as Group 1
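Group extraction can be sketched as follows. The date pattern here is an assumption that fits the Month DD, YYYY description, since the figure itself is not reproduced; the code uses the JDK's java.util.regex package rather than the ORO API, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MonthDemo {
    // The parentheses around [a-z]+ create group 1, which captures the month.
    static final Pattern DATE = Pattern.compile(
            "([a-z]+)\\s+[0-9]{1,2},\\s*[0-9]{4}", Pattern.CASE_INSENSITIVE);

    // Returns the month if the whole string is a date of this form, otherwise null.
    static String extractMonth(String date) {
        Matcher m = DATE.matcher(date);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractMonth("June 26, 1951"));
    }
}
```

Group 0 always refers to the entire match, so user-defined groups are numbered from 1 in the order of their opening parentheses.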
Other miscellaneous notations
To make life easier, some shorthand notations for commonly used regular expressions have been created, as shown in Table 2:
Table 2. Commonly used notations
Notation    Equivalent notation
\d          [0-9]
\D          [^0-9]
\w          [A-Za-z0-9_]
\W          [^A-Za-z0-9_]
\s          [ \t\n\r\f]
\S          [^ \t\n\r\f]
To illustrate, we can use "\d" for all instances of "[0-9]" we used before, as was the case with our social security number expressions. The revised regular expression is in Figure 7:
Figure 7. Matches: All social security numbers of the form 123-12-1234
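The shorthand notations can be checked with a short sketch, again assuming the JDK's java.util.regex package rather than Jakarta-ORO and using hypothetical names. The two social security number patterns are illustrative assumptions, not the figures' exact text. In java.util.regex, as in Perl 5, "\w" also covers lowercase letters and the underscore.

```java
import java.util.regex.Pattern;

public class ShorthandDemo {
    // The same social security number pattern written two ways.
    static final String LONG_FORM = "[0-9]{3}-[0-9]{2}-[0-9]{4}";
    static final String SHORT_FORM = "\\d{3}-\\d{2}-\\d{4}";

    // True when the regular expression describes the entire input.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String ssn = "123-12-1234";
        System.out.println(fullMatch(LONG_FORM, ssn) == fullMatch(SHORT_FORM, ssn));
        System.out.println(fullMatch("\\w", "_"));   // underscore is a word character
        System.out.println(fullMatch("\\s", "\t"));  // tab counts as whitespace
    }
}
```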
Jakarta-ORO library
Many open source regular expression libraries are available for Java programmers, and many support the Perl 5-compatible regular expression syntax. I use the Jakarta-ORO regular expression library because it is one of the most comprehensive APIs available and is fully compatible with Perl 5 regular expressions. It is also one of the most optimized APIs around.
The Jakarta-ORO library was formerly known as OROMatcher and has been kindly donated to the Jakarta Project by Daniel Savarese. You can download the package from a link in the Resources section below.
The Jakarta-ORO objects
I'll start by briefly describing the objects you need to create and access in order to use this library, and then I will show how you use the Jakarta-ORO API.
The PatternCompiler object
First, create an instance of the Perl5Compiler class and assign it to a variable of the PatternCompiler interface type. Perl5Compiler is an implementation of the PatternCompiler interface and lets you compile a regular expression string into a Pattern object used for matching:
PatternCompiler compiler = new Perl5Compiler();
The Pattern object
To compile a regular expression into a Pattern object, call the compile() method of the compiler object, passing in the regular expression. For example, you can compile the regular expression "t[aeio]n" like so:
Pattern pattern = null;
try {
    // compile() throws MalformedPatternException if the expression is invalid
    pattern = compiler.compile("t[aeio]n");
} catch (MalformedPatternException e) {
    e.printStackTrace();
}
By default, the compiler creates a case-sensitive pattern, so the above setup matches only "tin", "tan", "ten", and "ton", but not "Tin" or "taN". To create a case-insensitive pattern, call compile() with an additional mask:
pattern = compiler.compile("t[aeio]n", Perl5Compiler.CASE_INSENSITIVE_MASK);
Once you've created the Pattern object, you can use it for pattern matching through the PatternMatcher interface.
The PatternMatcher object
The PatternMatcher object tests for a match based on the Pattern object and a string. Instantiate the Perl5Matcher class and assign it to a variable of the PatternMatcher interface type. The Perl5Matcher class is an implementation of the PatternMatcher interface and matches patterns based on the Perl 5 regular expression syntax:
PatternMatcher matcher = new Perl5Matcher();
You can obtain a match using the PatternMatcher object in one of several ways, with the string to be matched against the regular expression passed in as the first parameter:
boolean matches(String input, Pattern pattern) : Used if the input string and the regular expression should match exactly; in other words, the regular expression should totally describe the string input
boolean matchesPrefix(String input, Pattern pattern) : Used if the regular expression should match the beginning of the input string
boolean contains(String input, Pattern pattern) : Used if the regular expression should match part of the input string (i.e., should be a substring)
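To see how the three kinds of match differ, here is a minimal sketch. It does not use the ORO API; it assumes the JDK's java.util.regex package, whose Matcher methods matches(), lookingAt(), and find() behave analogously to the three modes described above. The wrapper class and its method names are hypothetical.

```java
import java.util.regex.Pattern;

public class MatchModesDemo {
    static final Pattern PATTERN = Pattern.compile("t[aeio]n");

    // The pattern must describe the entire input.
    static boolean matchesExactly(String input) {
        return PATTERN.matcher(input).matches();
    }

    // The pattern must match at the beginning of the input.
    static boolean matchesPrefix(String input) {
        return PATTERN.matcher(input).lookingAt();
    }

    // The pattern may match anywhere inside the input.
    static boolean contains(String input) {
        return PATTERN.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesExactly("tanned")); // whole-string match fails
        System.out.println(matchesPrefix("tanned"));  // "tan" is a prefix
        System.out.println(contains("satin"));        // "tin" occurs inside
    }
}
```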
You could also pass in a PatternMatcherInput object instead of a String object to the above three method calls; if you did so, you could continue matching from the point at which the last match was found in the string. This is useful when you have many substrings that are likely to be matched by a given regular expression. The method signatures with the PatternMatcherInput object instead of String are as follows:
boolean matches(PatternMatcherInput input, Pattern pattern)
boolean matchesPrefix(Patter
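The idea of resuming a search where the previous match ended can be sketched with the JDK's java.util.regex package, where repeated calls to Matcher.find() play a role comparable to matching against a PatternMatcherInput. This is an analogy under that assumption, not the ORO API, and the names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindAllDemo {
    // Collects every non-overlapping match, scanning left to right.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        // Each find() call continues from the end of the previous match.
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll("t[aeio]n", "tan, ten and tin"));
    }
}
```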