Readable Regular Expressions
My main point of focus at work lately has been promoting maintainable code. One of the key tenets is readable code. The single responsibility principle and a low cyclomatic complexity are important, but if you are still using cryptic, prefixed, acronymed, and highly abbreviated identifiers, it is still going to be a chore for the reader to decipher. My slogan: "let's take the code out of source code".
I was just listening to Roy Osherove talk about regular expressions on .NET Rocks. A recurring theme brought up was how hard regular expressions are to deal with. Not necessarily creating them - you can do a lot by just knowing the basics - but dealing with them after they've been written. As they mentioned on the show, your source code ends up looking like a cartoon character swearing, which is the likely response you'll get from the poor maintenance developer that has to deal with it. Regular expressions are often referred to as a "write-only" language.
It got me thinking that this was a problem worth solving. Regular expressions are too powerful to ignore. For a certain set of problems, a regular expression can eliminate a LOT of potentially error-prone code. I cannot justify advocating avoiding regular expressions, no matter how much I value source readability. So what if we could make regular expressions readable?
Inspired by Ayende's Rhino.Mocks syntax, I created a library that provides a better way to define regular expressions in your source code. The easiest way to describe it is to show it in action. Suppose we want to check for social security numbers. You might write code like this:
Regex socialSecurityNumberCheck = new Regex(@"^\d{3}-?\d{2}-?\d{4}$");
Using ReadableRex (not settled on the name yet...), it would look like:
Regex socialSecurityNumberCheck = new Regex(Pattern.With.AtBeginning
    .Digit.Repeat.Exactly(3)
    .Literal("-").Repeat.Optional
    .Digit.Repeat.Exactly(2)
    .Literal("-").Repeat.Optional
    .Digit.Repeat.Exactly(4)
    .AtEnd);
You could argue that the second example is actually harder to read, because the reader is bogged down with the details of how a social security number check is performed. It may be a bad example, because the algorithm for detecting an SSN is both well-known (in the US, at least) and unlikely to change. Consider a situation where the expected match is not well-known and is very likely to change: screen scraping HTML. In that case, being able to read through the algorithm and easily identify which parts need to change becomes very important. To illustrate, I dug up some old code that was used to scrape basketball scores from espn.com. It's a good example of an ugly pattern that had to be maintainable, since the HTML layout could change at any time.
const string findGamesPattern = @"<div\s*class=""game""\s*id=""(?<gameID>\d+)-game""(?<content>.*?)<!--gameStatus\s*=\s*(?<gameState>\d+)-->";
Using ReadableRex:
Pattern findGamesPattern = Pattern.With.Literal(@"<div")
    .WhiteSpace.Repeat.ZeroOrMore
    .Literal(@"class=""game""").WhiteSpace.Repeat.ZeroOrMore.Literal(@"id=""")
    .NamedGroup("gameID", Pattern.With.Digit.Repeat.OneOrMore)
    .Literal(@"-game""")
    .NamedGroup("content", Pattern.With.Anything.Repeat.Lazy.ZeroOrMore)
    .Literal(@"<!--gameStatus")
    .WhiteSpace.Repeat.ZeroOrMore.Literal("=").WhiteSpace.Repeat.ZeroOrMore
    .NamedGroup("gameState", Pattern.With.Digit.Repeat.OneOrMore)
    .Literal("-->");
I think this would be much easier to maintain. Note that this library doesn't actually perform any regular expression operations - it simply provides another way to define regular expression patterns. You still need to use the System.Text.RegularExpressions.Regex object with the pattern you create. Since the Pattern type has an implicit conversion to System.String, you can easily pass it to the methods/constructors on Regex.
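To make the "implicit conversion to System.String" point concrete, here is a minimal sketch of how a fluent builder of this kind could work. It is not the ReadableRex source: the SketchPattern type and its members are hypothetical names made up for illustration, and the Repeat step is collapsed into plain Exactly/Optional members to keep it short.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical sketch, NOT the ReadableRex implementation: a tiny fluent
// builder that accumulates regex text and converts implicitly to string.
public sealed class SketchPattern
{
    private readonly StringBuilder _text = new StringBuilder();

    // Entry point, loosely mirroring the Pattern.With style shown in the post.
    public static SketchPattern With => new SketchPattern();

    public SketchPattern AtBeginning => Append("^");
    public SketchPattern AtEnd => Append("$");
    public SketchPattern Digit => Append(@"\d");
    public SketchPattern Optional => Append("?");

    public SketchPattern Exactly(int count) => Append("{" + count + "}");

    // Regex.Escape stops metacharacters in the literal from being interpreted;
    // the non-capturing group makes a following quantifier apply to the whole literal.
    public SketchPattern Literal(string value) => Append("(?:" + Regex.Escape(value) + ")");

    private SketchPattern Append(string fragment)
    {
        _text.Append(fragment);
        return this; // returning the same builder is what allows chaining
    }

    public override string ToString() => _text.ToString();

    // This conversion is what lets the builder be handed straight to Regex.
    public static implicit operator string(SketchPattern pattern) => pattern.ToString();
}

public static class SketchPatternDemo
{
    public static Regex BuildSsnCheck()
    {
        return new Regex(SketchPattern.With.AtBeginning
            .Digit.Exactly(3)
            .Literal("-").Optional
            .Digit.Exactly(2)
            .Literal("-").Optional
            .Digit.Exactly(4)
            .AtEnd);
    }

    public static void Main()
    {
        Regex check = BuildSsnCheck();
        Console.WriteLine(check.IsMatch("123-45-6789")); // True
        Console.WriteLine(check.IsMatch("123456789"));   // True
        Console.WriteLine(check.IsMatch("12-345-6789")); // False
    }
}
```

The builder never touches the regex engine itself; it only produces pattern text, which is the same division of labor described above.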
What do you think? Download the code or just the assembly DLL, give it a try, and tell me what you think. None of the method/property names are set in stone, so the syntax may change, but the approach will remain the same.
Comments
I think you're on to something!
Someone needs to hire you away from Hell.com and put the rest of your creativity to effective work on a day-to-day basis.
One suggestion:
Instead of
{Digit|WhiteSpace}.Repeat.ZeroOrMore ==> {Digit|WhiteSpace}.Optional
{Digit|WhiteSpace}.Repeat.OneOrMore ==> {Digit|WhiteSpace}.Required
Or at least, instead of ZeroOrMore or OneOrMore use Optional or Required respectively.
But how about those really hard to craft/understand expressions?
For example, in my project we use the following expressions:
^((?!my string).)*$
\A((?!my string).)*$\Z
Any idea what they do?
*SPOILER*
They're actually NOT operators, matching text that does not contain the "my string" phrase. The first one is for single line search, and the second is for multiline.
*END SPOILER*
I'd sure love to see a fluent expression that describes those expressions.
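For readers who want to check what the first of those expressions does, here is a small sketch using the plain Regex class. Only the pattern string comes from the comment above; the NegationDemo and LacksPhrase names are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NegationDemo
{
    // Pattern quoted from the comment above. Before consuming each character,
    // the negative lookahead (?!my string) checks that the phrase does not
    // start at that position, so the match can never cross the phrase.
    public const string DoesNotContain = @"^((?!my string).)*$";

    // Hypothetical helper: true when the single-line input lacks the phrase.
    public static bool LacksPhrase(string input)
    {
        return Regex.IsMatch(input, DoesNotContain);
    }

    public static void Main()
    {
        Console.WriteLine(LacksPhrase("nothing to see here"));      // True
        Console.WriteLine(LacksPhrase("this has my string in it")); // False
        Console.WriteLine(LacksPhrase(""));                         // True
    }
}
```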
I think people just have to find a nice guide to learning regular expressions. They're very powerful and useful and if you put the effort in, it pays off. You *could* learn to do it this way, but I think in the end it would be just as hard trying to remember what does what as it is to remember what a character does.
Nice idea though!
'Literal("-").Repeat.Optional' does not automatically imply the same in my mind as '-?', but instead something more like '-*' or '-+'.
In my opinion a better system would be a regex compiler whereby you can communicate what you want in a far more English way, something along the lines of <a href="http://blogs.msdn.com/ericgu/archive/2003/07/07/52362.aspx">Regular Expression Workbench</a> but perhaps even simpler.
Omer: Yes, there are some edge cases that my proof-of-concept code does not cover. That doesn't mean they can't be done, it just means I didn't want to spend the time to cover every case before asking for feedback. Thanks for providing a real-world case that I can use for testing when I get to that. Any suggestions on how you think it should look?
John & Chris: I know there are other ways that would help make "creating" regular expressions easier (learning the syntax, or using a workbench tool). My attempt was to make "reading" regular expressions easier. Of course I understand that is completely subjective. If you think the one character symbols and punctuation are easier to understand, who am I to disagree?
I completely expected disagreements about the method names that I chose - that's why I said none of the names are set in stone and was asking for feedback. It sounds like the choice of "Repeat.Optional" for '0 or 1' was not a good choice, as a couple people have mentioned it. Any alternative suggestions? Would you prefer a "wrapping" appearance for optional, like this:
Pattern.With.Digit.Optional(Pattern.With.Literal("-")); // renders as: \d-?
I originally did all the repetition using that wrapping syntax, but felt it didn't read as smoothly.
Thanks for the reminder Chris - I need to fix DasBlog so it doesn't show "Some html is allowed" when I have all HTML disabled. It normally lists the symbols you allow, but since I don't allow anything, it doesn't list anything...
Very cool! You might also be interested in this approach, which is a way to get a similar result with a more concise syntax: http://dotnet.agilekiwi.com/blog/2006/10/shorthand-interfaces.html
I wrote about the need for a better RegEx syntax last year, but didn't have any ideas on how to implement it. This is a really cool solution to the "write only" RegEx syntax.
http://weblogs.asp.net/jgalloway/archive/2005/11/02/429218.aspx
Maybe we could put a collection of common expression together using the lib?
Schneider
The only way I could help this was to create a debug visualizer, you might look into that. You could hover over the variable, and the visualizer could show you the resultant regex. Just a thought...
I think there is a misconception of readability here: The problem is not that you need to know what the syntactical elements of the regexp mean. Rather you need to understand what the whole expression actually does, and the verbose form doesn't help here at all.
Also I think the fluency suffers from being forced into the chained methods. There is no syntactical indication that 'Repeat' modifies the meaning of the previous entity while 'Digit' doesn't.
Jimmy: There is nothing to debug here anyway: The object incantation just produces a regular regexp.
And I completely agree about the quantifiers and grouping being ambiguous in this version of the API. It was something I wanted to improve (and you can see in the comments above that I was playing with different ideas), but never followed up on.
p.s. thanks for this library, I've used it as a tool many times to generate regular expressions, especially in helping out in the CodeProject forums.
Perhaps there should be more focus on building sub-expressions in the Regex (as with regular Expression<Func<T>>), rather than continuous chains which are little more than longhand for the existing syntax? Perhaps also more extensions which encapsulate common patterns, eg .Digit(3); .Digit(3.OrMore). Something highly composable.
Dunno.. difficult one to solve.
I'm imagining something like a List<Expression>, where Expression can be Digit, Letter, Group etc and each can have child expressions and a Multiplicity - which would be a class with properties like Min, Max, Greedy, Lazy etc... Then all you need is a static method to convert a List<Expression> structure to a string. You could even subclass List<Expression> and avoid the ugly static if you felt inclined.
I do use regexes a lot at feelitlive.com, as you might guess, but that's just my 2c
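The List<Expression> idea in the comment above could be sketched roughly as follows. Everything here is an assumption made for illustration: the Multiplicity, Expression, and ExpressionRenderer types are hypothetical, child expressions are left out, and each Atom is assumed to already be a single valid regex atom.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical types illustrating the comment's idea; none come from ReadableRex.
public sealed class Multiplicity
{
    public int Min { get; set; } = 1;
    public int? Max { get; set; } = 1; // null means "no upper bound"
    public bool Lazy { get; set; }

    public string Render()
    {
        string quantifier;
        if (Min == 1 && Max == 1) quantifier = "";
        else if (Min == 0 && Max == 1) quantifier = "?";
        else if (Min == 0 && Max == null) quantifier = "*";
        else if (Min == 1 && Max == null) quantifier = "+";
        else if (Max == Min) quantifier = "{" + Min + "}";
        else quantifier = "{" + Min + "," + Max + "}"; // a null Max renders as "{n,}"

        // A lazy marker only makes sense when some quantifier is present.
        return quantifier.Length > 0 && Lazy ? quantifier + "?" : quantifier;
    }
}

public sealed class Expression
{
    public string Atom { get; }        // single regex atom, e.g. @"\d" or "-"
    public Multiplicity Count { get; }

    public Expression(string atom, Multiplicity count = null)
    {
        Atom = atom;
        Count = count ?? new Multiplicity();
    }
}

public static class ExpressionRenderer
{
    // The "static method to convert a List<Expression> structure to a string".
    public static string ToPattern(IEnumerable<Expression> parts)
    {
        var text = new StringBuilder();
        foreach (Expression part in parts)
        {
            text.Append(part.Atom).Append(part.Count.Render());
        }
        return text.ToString();
    }

    public static void Main()
    {
        var optional = new Multiplicity { Min = 0, Max = 1 };
        var ssn = new List<Expression>
        {
            new Expression(@"\d", new Multiplicity { Min = 3, Max = 3 }),
            new Expression("-", optional),
            new Expression(@"\d", new Multiplicity { Min = 2, Max = 2 }),
            new Expression("-", optional),
            new Expression(@"\d", new Multiplicity { Min = 4, Max = 4 }),
        };
        Console.WriteLine(ToPattern(ssn)); // \d{3}-?\d{2}-?\d{4}
    }
}
```

Keeping the quantifier in its own object is one way to address the ambiguity raised in the earlier comments, since it is explicit which element each repetition belongs to.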
string pattern = @"^\d{3}-?\d{2}-?\d{4}$";
to (YOUR BLOG REMOVES SPACES....TACKY)
string pattern = @"
^ # Start at beginning of Line
\d{3} # First three digits of SSN
-? # optional dash
\d{2} # Middle two digits of SSN
-? # Optional Dash
\d{4} # Last Four digits
$ # End of Line";
If your blog didn't remove spaces you would see the # were aligned on the *FAR* right. Maintenance assured.
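One detail worth spelling out: a commented pattern like the one above only works if the Regex is created with RegexOptions.IgnorePatternWhitespace, which tells the engine to skip unescaped whitespace and treat # as the start of a comment. A minimal sketch, with the CommentedPatternDemo class name and the comment wording chosen for this example:

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentedPatternDemo
{
    // Same SSN pattern as in the comment above, laid out one piece per line.
    public const string SsnPattern = @"
        ^       # start of input
        \d{3}   # first three digits of SSN
        -?      # optional dash
        \d{2}   # middle two digits of SSN
        -?      # optional dash
        \d{4}   # last four digits
        $       # end of input";

    // Without IgnorePatternWhitespace the spaces and # text would be matched literally.
    public static readonly Regex SsnCheck =
        new Regex(SsnPattern, RegexOptions.IgnorePatternWhitespace);

    public static void Main()
    {
        Console.WriteLine(SsnCheck.IsMatch("123-45-6789")); // True
        Console.WriteLine(SsnCheck.IsMatch("123456789"));   // True
        Console.WriteLine(SsnCheck.IsMatch("1234-56-789")); // False
    }
}
```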
1) Using Fluent Interfaces, you also conceived a practical, elegant way to simplify regular expression construction.
2) Using a regular expression simplification utility, you conceived a practical, elegant way to demonstrate Fluent Interfaces.
I applaud you heartily.
- Jason Cook (uihero.wordpress.com)
Thought you might like to know that Anders Hejlsberg, the designer of C#, used your ReadableRegex library during his talk at JAOO 2008. You can watch the video here:
http://blog.jaoo.dk/2008/10/07/the-future-of-programming-languages/
ReadableRegex makes an appearance at about 15 minutes in.