2007/12/18

C# Regular Expressions

Introduction


Regular expressions have been used in various programming languages and tools for many years. The .NET Base Class Libraries include a namespace and a set of classes for utilizing the power of regular expressions. They are designed to be compatible with Perl 5 regular expressions whenever possible.
In addition, the regexp classes implement some additional functionality, such as named capture groups, right-to-left pattern matching, and expression compilation.
In this article, I'll provide a quick overview of the classes and methods of the System.Text.RegularExpressions namespace, some examples of matching and replacing strings, a more detailed walk-through of a grouping structure, and finally, a set of cookbook expressions for use in your own applications.
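As a rough illustration of the three extras mentioned above, here is a minimal sketch. The class name RegexExtras, its method names, and the sample inputs are my own assumptions for the example; only Regex, Match, and RegexOptions come from the library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexExtras
{
    // Named capture group: (?<area>...) lets you read the group by name.
    public static string AreaCode(string phone)
    {
        Match m = Regex.Match(phone, @"\((?<area>\d{3})\)");
        return m.Success ? m.Groups["area"].Value : "";
    }

    // Right-to-left matching: scanning starts at the end of the input,
    // so the first match reported is the last number in the string.
    public static string LastNumber(string text)
    {
        return Regex.Match(text, @"\d+", RegexOptions.RightToLeft).Value;
    }

    // Compiled expression: costs more to construct, cheaper to run repeatedly.
    private static readonly Regex Word = new Regex(@"\w+", RegexOptions.Compiled);

    public static int CountWords(string text)
    {
        return Word.Matches(text).Count;
    }

    public static void Main()
    {
        Console.WriteLine(AreaCode("(425) 555-0100"));                   // 425
        Console.WriteLine(LastNumber("10 apples, 20 pears, 30 plums"));  // 30
        Console.WriteLine(CountWords("the quick red fox"));              // 4
    }
}
```

Whether RegexOptions.Compiled pays off depends on how often the expression is reused, so treat it as something to measure rather than a default.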
Presumed Knowledge Base
Regular expression knowledge seems to be one of those topics that most programmers have learned and forgotten, more than once. For the purposes of this article, I will presume some previous use of regular expressions, and specifically, some experience with their use within Perl 5, as a reference point. The .NET regexp classes are a superset of Perl 5 functionality, so this will serve as a good conceptual starting point.
I'm also presuming a basic knowledge of C# syntax and the .NET Framework environment.
If you are new to regular expressions, I suggest starting with some of the basic Perl 5 introductions. The perl.com site has some great resource materials and introductory tutorials.
The definitive work on regular expressions is Mastering Regular Expressions, by Jeffrey E. F. Friedl. For those who want to get the most out of working with regular expressions, I highly recommend this book.
The RegularExpression Assembly
The regexp classes live in the System.Text.RegularExpressions namespace. In the released .NET Framework they ship in System.dll, which csc references by default, so csc foo.cs is enough to build foo.exe. If your build ships them in a separate System.Text.RegularExpressions.dll assembly, you will have to reference that assembly at compile time. For example: csc /r:System.Text.RegularExpressions.dll foo.cs will build the foo.exe assembly, with a reference to the System.Text.RegularExpressions assembly.
A Brief Overview of the Namespace
The core of the namespace is small: six classes and one delegate definition. These are:
Capture: Contains the result of a single subexpression capture
CaptureCollection: A sequence of Captures
Group: The result of a single group capture, inherits from Capture
Match: The result of a single expression match, inherits from Group
MatchCollection: A sequence of Matches
MatchEvaluator: A delegate for use during replacement operations
Regex: An instance of a compiled regular expression
The Regex class also contains several static methods:
Escape: Escapes regex metacharacters within a string
IsMatch: Returns true if the supplied regular expression matches within the string
Match: Returns the first match as a Match instance
Matches: Returns all matches as a MatchCollection
Replace: Replaces the text matched by the regular expression with a replacement string
Split: Returns an array of strings, splitting the input at each match of the expression
Unescape: Unescapes any escaped characters within a string
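To make the list above a little more concrete, here is a short sketch that calls several of the static methods. The class name StaticMethodsDemo and the sample inputs are assumptions of mine, chosen only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StaticMethodsDemo
{
    public static void Main()
    {
        // Escape makes metacharacters literal; Unescape reverses it.
        string escaped = Regex.Escape("1+1=2?");
        Console.WriteLine(escaped);                   // 1\+1=2\?
        Console.WriteLine(Regex.Unescape(escaped));   // 1+1=2?

        // IsMatch only answers yes or no.
        Console.WriteLine(Regex.IsMatch("abracadabra", "cad"));      // True

        // Matches returns every non-overlapping match.
        Console.WriteLine(Regex.Matches("abracadabra", "a").Count);  // 5

        // Split cuts the input at each match of the expression.
        string[] parts = Regex.Split("a1b22c333d", @"\d+");
        Console.WriteLine(string.Join("|", parts));   // a|b|c|d
    }
}
```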
Simple Matches
Let's start with simple expressions using the Regex and the Match class.
Match m = Regex.Match("abracadabra", "(a|b|r)+");
You now have an instance of a Match that can be tested for success, as in:
if (m.Success)
...
without even looking at the contents of the matched string.
If you want to use the matched string, you can simply convert it to a string:
Console.WriteLine("Match="+m.ToString());
This example gives us the output:
Match=abra
which is the portion of the string that has been successfully matched.
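A Match also carries the position of the matched text, and NextMatch continues the search from where the previous match ended. The sketch below walks every match of the alternation pattern (a|b|r)+ and records its value, index, and length. The class name SimpleMatchDemo and the helper Describe are names I made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SimpleMatchDemo
{
    // Returns one "value@index+length" entry per match.
    public static List<string> Describe(string input, string pattern)
    {
        var lines = new List<string>();
        for (Match m = Regex.Match(input, pattern); m.Success; m = m.NextMatch())
        {
            lines.Add(m.Value + "@" + m.Index + "+" + m.Length);
        }
        return lines;
    }

    public static void Main()
    {
        // Prints abra@0+4, a@5+1, abra@7+4 (one per line).
        foreach (string line in Describe("abracadabra", "(a|b|r)+"))
        {
            Console.WriteLine(line);
        }
    }
}
```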
Replacing Strings
Simple string replacements are very straightforward. For example, the statement:
string s = Regex.Replace("abracadabra", "abra", "zzzz");
returns the string zzzzcadzzzz, in which all occurrences of the matching pattern are replaced by the replacement string zzzz.
Now let's look at a more complex expression:
string s = Regex.Replace(" abra ", @"^\s*(.*?)\s*$", "$1");
This returns the string abra, with preceding and trailing spaces removed.
The above pattern is generally useful for removing leading and trailing whitespace from any string. It also uses the verbatim string literal construct in C#, @"...". Within a verbatim string, the compiler does not process the \ as an escape character. Consequently, @"..." is very useful when working with regular expressions that contain metacharacters escaped with a \. Also of note is the use of $1 as the replacement string. In a replacement string, the only special constructs are substitutions, which are references to capture groups in the regular expression; everything else is copied literally.
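Here is a small sketch of substitutions in replacement strings: a numbered group on its own, a numbered group mixed with literal text, and named groups. The class name ReplaceDemo, its method names, and the sample inputs are assumptions for the sake of the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReplaceDemo
{
    // $1 alone: keep only what the first group captured.
    public static string Trim(string s)
    {
        return Regex.Replace(s, @"^\s*(.*?)\s*$", "$1");
    }

    // Literal text and a substitution can be mixed freely.
    public static string Bracket(string s)
    {
        return Regex.Replace(s, "(abra)", "[$1]");
    }

    // ${name} refers to a named group.
    public static string LastFirst(string s)
    {
        return Regex.Replace(s, @"(?<first>\w+)\s+(?<last>\w+)", "${last}, ${first}");
    }

    public static void Main()
    {
        Console.WriteLine("[" + Trim("  abra  ") + "]");  // [abra]
        Console.WriteLine(Bracket("abracadabra"));        // [abra]cad[abra]
        Console.WriteLine(LastFirst("John Smith"));       // Smith, John
    }
}
```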
Engine Details
Now let's try to understand a slightly more complex sample by doing a walk-through of a grouping structure. Given the following sample:
string text = "abracadabra1abracadabra2abracadabra3";
string pat = @"
( # start the first group
abra # match the literal 'abra'
( # start the second (inner) group
cad # match the literal 'cad'
)? # end the second (optional) group
) # end the first group
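+ # match one or more occurrences
";
// use the IgnorePatternWhitespace option (Perl's 'x' modifier)
// to ignore comments and whitespace in the pattern
Regex r = new Regex(pat, RegexOptions.IgnorePatternWhitespace);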
// get the list of group numbers
int[] gnums = r.GetGroupNumbers();
// get first match
Match m = r.Match(text);
while (m.Success)
{
    // start at group 1
    for (int i = 1; i < gnums.Length; i++)
    {
        Group g = m.Groups[gnums[i]];
        Console.WriteLine("Group" + gnums[i] + "=[" + g.ToString() + "]");
        // get the captures for this group
        CaptureCollection cc = g.Captures;
        for (int j = 0; j < cc.Count; j++)
        {
            Capture c = cc[j];
            Console.WriteLine("    Capture" + j + "=[" + c.ToString() + "] Index=" + c.Index + " Length=" + c.Length);
        }
    }
    // get next match
    m = m.NextMatch();
}
The output of this sample would be:
Group1=[abra]
    Capture0=[abracad] Index=0 Length=7
    Capture1=[abra] Index=7 Length=4
Group2=[cad]
    Capture0=[cad] Index=4 Length=3
Group1=[abra]
    Capture0=[abracad] Index=12 Length=7
    Capture1=[abra] Index=19 Length=4
Group2=[cad]
    Capture0=[cad] Index=16 Length=3
Group1=[abra]
    Capture0=[abracad] Index=24 Length=7
    Capture1=[abra] Index=31 Length=4
Group2=[cad]
    Capture0=[cad] Index=28 Length=3
Let's start by examining the string pat, which contains the regular expression. The first capture group opens at the first parenthesis, and the expression then matches the literal abra. The second capture group opens at the second parenthesis while the first group is still open, so the first group covers abracad and the second group covers just the cad. Because the ? metacharacter makes the second group optional, the first group matches either abra or abracad. Finally, the first group is closed, and the + metacharacter asks the expression to match one or more occurrences of it.
Now let's examine what happens during the matching process. First, create an instance of the expression by calling the Regex constructor, which is also where you specify your options. In this case, I'm using the x option (RegexOptions.IgnorePatternWhitespace), as I have included comments in the regular expression itself, and some whitespace for formatting purposes. With this option turned on, the expression ignores the comments and all whitespace that I have not explicitly escaped.
Next, get the list of group numbers (gnums) defined in this regular expression. You could also have used these numbers explicitly, but this provides you with a programmatic method.
This method is also useful if you have specified named groups, as a way of quickly indexing through the set of groups.
Next, perform the first match. Then enter a loop testing for success of the current match. The next step is to iterate through the list of groups starting at group 1. The reason you do not use group 0 in this sample is that group 0 is the fully captured match string, and what you usually (but not always) want to pick out of a string is a subgroup. You might use group 0 if you wanted to collect the fully matched string as a single string.
Within each group, iterate through the CaptureCollection. There is usually only one capture per match, per group, but in this case, for Group1, two captures show: Capture0 and Capture1. And if you had asked for only the ToString of Group1, you would have received abra, although it also did match the abracad. The group ToString value will be the value of the last Capture in its CaptureCollection. This is the expected behavior. If you want each repetition reported as its own match, with a single capture per group, remove the + from the expression.
Procedural-Based vs. Expression-Based
Generally, the users of regular expressions will tend to fall into one of two groups. The first group tends to use minimal regular expressions that provide matching or grouping behaviors, and then write procedural code to perform some iterative behavior. The second group tries to utilize the maximum power and functionality of the expression-processing engine itself, with as little procedural logic as possible. For most of us, the best answer is somewhere in between, and I hope this article outlines both the capabilities of the .NET regexp classes and the trade-offs in complexity and performance of the solution.
Procedural-Based Patterns
A common processing need is to match certain parts of a string and perform some processing.
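So, here's an example that matches words within a string and capitalizes them:
string text = "the quick red fox jumped over the lazy brown dog.";
Console.WriteLine("text=[" + text + "]");
string result = "";
string pattern = @"\w+|\W+";
foreach (Match m in Regex.Matches(text, pattern))
{
    // get the matched string
    string x = m.ToString();
    // if the first char is lower case, capitalize it
    if (char.IsLower(x[0]))
        x = char.ToUpper(x[0]) + x.Substring(1);
    // collect all text
    result += x;
}
Console.WriteLine("result=[" + result + "]");
The output of this sample would be:
text=[the quick red fox jumped over the lazy brown dog.]
result=[The Quick Red Fox Jumped Over The Lazy Brown Dog.]
Expression-Based Patterns
The same result can be produced by letting Regex.Replace do the iteration and handing each match to a MatchEvaluator delegate:
static string CapText(Match m)
{
    // get the matched string
    string x = m.ToString();
    // if the first char is lower case, capitalize it
    if (char.IsLower(x[0]))
        return char.ToUpper(x[0]) + x.Substring(1);
    return x;
}
string text = "the quick red fox jumped over the lazy brown dog.";
Console.WriteLine("text=[" + text + "]");
string pattern = @"\w+";
string result = Regex.Replace(text, pattern, new MatchEvaluator(CapText));
Console.WriteLine("result=[" + result + "]");
Cookbook Expressions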
To wrap up this overview of how regular expressions are used in the C# environment, I'll leave you with a set of useful expressions that have been used in other environments. I got them from a great book, the Perl Cookbook, by Tom Christiansen and Nathan Torkington, and updated them for C# programmers. I hope you find them useful.

Roman Numbers
string p1 = "^m*(d?c{0,3}|c[dm])"
+ "(l?x{0,3}|x[lc])(v?i{0,3}|i[vx])$";
string t1 = "vii";
Match m1 = Regex.Match(t1, p1);

Swapping First Two Words
string t2 = "the quick brown fox";
string p2 = @"(\S+)(\s+)(\S+)";
Regex x2 = new Regex(p2);
string r2 = x2.Replace(t2, "$3$2$1", 1);
Keyword = Value
string t3 = "myval = 3";
string p3 = @"(\w+)\s*=\s*(.*?)\s*$";
Match m3 = Regex.Match(t3, p3);
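Once the match succeeds, the key and the value are available as groups 1 and 2. The following is a minimal sketch of reading them out; the class name KeyValueDemo and the TryParse helper are hypothetical names of mine, and the pattern uses a lazy (.*?) so trailing whitespace is not captured as part of the value.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeyValueDemo
{
    // Splits "key = value" into its two parts; returns false if the line does not match.
    public static bool TryParse(string line, out string key, out string value)
    {
        Match m = Regex.Match(line, @"(\w+)\s*=\s*(.*?)\s*$");
        key = m.Success ? m.Groups[1].Value : "";
        value = m.Success ? m.Groups[2].Value : "";
        return m.Success;
    }

    public static void Main()
    {
        string key, value;
        if (TryParse("myval = 3", out key, out value))
        {
            Console.WriteLine(key + " -> " + value);  // myval -> 3
        }
    }
}
```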

Line of at Least 80 Characters
string t4 = "********************"
+ "******************************"
+ "******************************";
string p4 = ".{80,}";
Match m4 = Regex.Match(t4, p4);
MM/DD/YY HH:MM:SS
string t5 = "01/01/01 16:10:01";
string p5 =
@"(\d+)/(\d+)/(\d+) (\d+):(\d+):(\d+)";
Match m5 = Regex.Match(t5, p5);
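The six groups come back in the order they appear in the pattern. As a sketch, assuming you just want the fields as integers (the class name DateTimeDemo and the Fields helper are my own inventions):

```csharp
using System;
using System.Text.RegularExpressions;

public static class DateTimeDemo
{
    // Returns month, day, year, hour, minute, second, or null if the input does not match.
    public static int[] Fields(string stamp)
    {
        Match m = Regex.Match(stamp, @"(\d+)/(\d+)/(\d+) (\d+):(\d+):(\d+)");
        if (!m.Success)
        {
            return null;
        }
        int[] fields = new int[6];
        for (int i = 0; i < 6; i++)
        {
            // Groups[0] is the whole match, so the captures start at index 1.
            fields[i] = int.Parse(m.Groups[i + 1].Value);
        }
        return fields;
    }

    public static void Main()
    {
        int[] f = Fields("01/01/01 16:10:01");
        Console.WriteLine(string.Join(",", Array.ConvertAll(f, x => x.ToString())));  // 1,1,1,16,10,1
    }
}
```

Note that the pattern only checks the shape of the string; it does not reject values such as month 13.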

Changing Directories (for Windows)
string t6 =
@"C:\Documents and Settings\user1\Desktop\";
string r6 = Regex.Replace(t6,
@"\\user1\\",
@"\user2\");

Expanding (%nn) Hex Escapes
string t7 = "%41"; // capital A
string p7 = "%([0-9A-Fa-f][0-9A-Fa-f])";
// uses a MatchEvaluator delegate
string r7 = Regex.Replace(t7, p7,
HexConvert);
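The snippet above does not show HexConvert itself. One possible implementation, offered as a sketch rather than the original author's code, converts the two captured hex digits into the corresponding character (the class name HexEscapeDemo and the Expand helper are assumed names):

```csharp
using System;
using System.Text.RegularExpressions;

public static class HexEscapeDemo
{
    // Called once per match; group 1 holds the two hex digits after the '%'.
    public static string HexConvert(Match m)
    {
        int code = Convert.ToInt32(m.Groups[1].Value, 16);
        return ((char)code).ToString();
    }

    public static string Expand(string input)
    {
        return Regex.Replace(input, "%([0-9A-Fa-f][0-9A-Fa-f])",
            new MatchEvaluator(HexConvert));
    }

    public static void Main()
    {
        Console.WriteLine(Expand("%41"));      // A
        Console.WriteLine(Expand("%41%42c"));  // ABc
    }
}
```

This treats each escape as a single character code, so multi-byte encoded sequences would need additional decoding.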

Deleting C Comments (Imperfectly)
string t8 = @"
/*
* this is an old cstyle comment block
*/
";
string p8 = @"
/\* # match the opening delimiter
.*? # match a minimal number of characters
\*/ # match the closing delimiter
";
string r8 = Regex.Replace(t8, p8, "",
RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);

Removing Leading and Trailing Whitespace
string t9a = " leading";
string p9a = @"^\s+";
string r9a = Regex.Replace(t9a, p9a, "");

string t9b = "trailing ";
string p9b = @"\s+$";
string r9b = Regex.Replace(t9b, p9b, "");

Turning '\' Followed by 'n' Into a Real Newline
string t10 = @"\ntest\n";
string r10 = Regex.Replace(t10, @"\\n", "\n");

IP Address
string t11 = "55.54.53.52";
string p11 = "^" +
@"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
@"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
@"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
@"([01]?\d\d?|2[0-4]\d|25[0-5])" +
"$";
Match m11 = Regex.Match(t11, p11);
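For validation you usually only need a yes or no answer, so IsMatch is enough. The sketch below wraps an octet pattern with the three alternatives (0 to 199, 200 to 249, 250 to 255) in a helper; the class name IpDemo and the IsIPv4 helper are assumed names, and the octet pattern here allows single-digit octets.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IpDemo
{
    // One octet: 0-199 (with optional leading digit), 200-249, or 250-255.
    private const string Octet = @"([01]?\d\d?|2[0-4]\d|25[0-5])";

    private static readonly Regex Ip = new Regex(
        "^" + Octet + @"\." + Octet + @"\." + Octet + @"\." + Octet + "$");

    public static bool IsIPv4(string s)
    {
        return Ip.IsMatch(s);
    }

    public static void Main()
    {
        Console.WriteLine(IsIPv4("55.54.53.52"));  // True
        Console.WriteLine(IsIPv4("192.168.0.1"));  // True
        Console.WriteLine(IsIPv4("256.1.1.1"));    // False
    }
}
```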

Removing Leading Path from Filename
string t12 = @"c:\file.txt";
string p12 = @"^.*\\";
string r12 = Regex.Replace(t12, p12, "");
Joining Lines in Multiline Strings
string t13 = @"this is
a split line";
string p13 = @"\s*\r?\n\s*";
string r13 = Regex.Replace(t13, p13, " ");

Extracting All Numbers from a String
string t14 = @"
test 1
test 2.3
test 47
";
string p14 = @"(\d+\.?\d*|\.\d+)";
MatchCollection mc14 = Regex.Matches(t14, p14);
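A MatchCollection can be walked with foreach to pull out each number. Here is a small sketch using a pattern with two alternatives (digits with an optional fraction, or a bare fraction such as .5); the class name NumbersDemo, the Extract helper, and the sample input are my own.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NumbersDemo
{
    // Collects every number found in the text, in order of appearance.
    public static List<string> Extract(string text)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(text, @"(\d+\.?\d*|\.\d+)"))
        {
            found.Add(m.Value);
        }
        return found;
    }

    public static void Main()
    {
        List<string> numbers = Extract("test 1\ntest 2.3\ntest 47\n.5");
        Console.WriteLine(string.Join(" ", numbers.ToArray()));  // 1 2.3 47 .5
    }
}
```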

Finding All Caps Words
string t15 = "This IS a Test OF ALL Caps";
string p15 = @"(\b[^\Wa-z0-9_]+\b)";
MatchCollection mc15 = Regex.Matches(t15, p15);

Finding All Lowercase Words
string t16 = "This is A Test of lowercase";
string p16 = @"(\b[^\WA-Z0-9_]+\b)";
MatchCollection mc16 = Regex.Matches(t16, p16);

Finding All Initial Caps
string t17 = "This is A Test of Initial Caps";
string p17 = @"(\b[^\Wa-z0-9_][^\WA-Z0-9_]*\b)";
MatchCollection mc17 = Regex.Matches(t17, p17);

Finding Links in Simple HTML
string t18 = @"


";
string p18 = @"<A[^>]*?HREF\s*=\s*[""']?"
+ @"([^'"" >]+?)[ '""]?>";
MatchCollection mc18 = Regex.Matches(t18, p18,
RegexOptions.Singleline | RegexOptions.IgnoreCase);
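To see the link pattern in action, here is a sketch that runs an anchor-tag pattern against a made-up HTML fragment and collects group 1 of each match. The class name LinkDemo, the Links helper, and the example.com URLs are all hypothetical. As the heading says, this only handles simple HTML: an anchor with further attributes after HREF will not match.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkDemo
{
    // Returns the HREF value of each simple anchor tag in the input.
    public static List<string> Links(string html)
    {
        string pattern = @"<A[^>]*?HREF\s*=\s*[""']?"
            + @"([^'"" >]+?)[ '""]?>";
        var urls = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern,
            RegexOptions.Singleline | RegexOptions.IgnoreCase))
        {
            urls.Add(m.Groups[1].Value);
        }
        return urls;
    }

    public static void Main()
    {
        // Hypothetical sample input, not taken from the original article.
        string html = "<a href=\"http://example.com/one\">one</a>\n"
            + "<A HREF='http://example.com/two'>two</A>";
        foreach (string url in Links(html))
        {
            Console.WriteLine(url);
        }
    }
}
```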

Finding Middle Initial
string t19 = "Hanley A. Strappman";
string p19 = @"^\S+\s+(\S)\S*\s+\S";
Match m19 = Regex.Match(t19, p19);

Changing Inch Marks to Quotes
string t20 = @"2' 2"" ";
string p20 = "\"([^\"]*)";
string r20 = Regex.Replace(t20, p20, "``$1''");

1 comment:

Chrisranjana.com software said...

Regex RgxUrl = new Regex(("^(www\\.)?.+\\.(com|net|org|in)$"));
bool res = RgxUrl.IsMatch(url);
if(res == true)
{
Response.Write("URL is valid.");
}
else
{
Response.Write("URL is invalid!");
}

The above validates a URL in the format www.xxxx.com (or .net, .org, .in); the www. prefix is optional.

Chrisranjana.com,
C# Programmers