Readable Regular Expressions

My main point of focus at work lately has been promoting maintainable code. One of the key tenets is readable code. The single responsibility principle and a low cyclomatic complexity are important, but if you are still using cryptic, prefixed, acronymed, and highly abbreviated identifiers, it is still going to be a chore for the reader to decipher. My slogan: "let's take the code out of source code".

I was just listening to Roy Osherove talk about regular expressions on .NET Rocks. A recurring theme was how hard regular expressions are to deal with. Not necessarily creating them - you can do a lot by just knowing the basics - but dealing with them after they've been written. As they mentioned on the show, your source code ends up looking like a cartoon character swearing, which is the likely response you'll get from the poor maintenance developer who has to deal with it. Regular expressions are often referred to as a "write-only" language.

It got me thinking that this was a problem worth solving. Regular expressions are too powerful to ignore. For a certain set of problems, a regular expression can eliminate a LOT of potentially error-prone code. I cannot justify advocating avoiding regular expressions, no matter how much I value source readability. So what if we could make regular expressions readable?

Inspired by Ayende's Rhino.Mocks syntax, I created a library that provides a better way to define regular expressions in your source code. The easiest way to describe it is to show it in action. Suppose we want to check for social security numbers. You might write code like this:

    Regex socialSecurityNumberCheck = new Regex(@"^\d{3}-?\d{2}-?\d{4}$");

Using ReadableRex (not settled on the name yet...), it would look like:

    Regex socialSecurityNumberCheck = new Regex(Pattern.With.AtBeginning
        .Digit.Repeat.Exactly(3)
        .Literal("-").Repeat.Optional
        .Digit.Repeat.Exactly(2)
        .Literal("-").Repeat.Optional
        .Digit.Repeat.Exactly(4)
        .AtEnd);

You could argue that the second example is actually harder to read, because the reader is bogged down with the details of how a social security number check is performed. It may be a bad example, because the algorithm for detecting an SSN is both well-known (in the US, at least) and unlikely to change. Consider a situation where the expected match is not well-known and is very likely to change: screen scraping HTML. In that case, being able to read through the algorithm and easily identify which parts need to change becomes very important. To illustrate, I dug up some old code that was used to scrape basketball scores from espn.com. It's a good example of an ugly pattern that had to be maintainable, since the HTML layout could change at any time.

    const string findGamesPattern = @"<div\s*class=""game""\s*id=""(?<gameID>\d+)-game""(?<content>.*?)<!--gameStatus\s*=\s*(?<gameState>\d+)-->";
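For context, a pattern like this is normally consumed through its named groups. The following is only a rough sketch of that usage with the standard Regex class: the sample HTML fragment is made up for illustration (it is not real espn.com markup), and the choice of RegexOptions.Singleline is an assumption, since the original options are not shown here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FindGamesDemo
{
    // The pattern from the post, unchanged.
    public const string FindGamesPattern =
        @"<div\s*class=""game""\s*id=""(?<gameID>\d+)-game""(?<content>.*?)<!--gameStatus\s*=\s*(?<gameState>\d+)-->";

    // Hypothetical input, invented for this example only.
    public const string SampleHtml =
        @"<div class=""game"" id=""12345-game""><span>Box score</span><!--gameStatus = 3-->";

    public static void Main()
    {
        // Singleline (an assumption) lets '.' in the content group span line breaks.
        foreach (Match game in Regex.Matches(SampleHtml, FindGamesPattern, RegexOptions.Singleline))
        {
            // Each named group is read back by the name used in the pattern.
            Console.WriteLine("gameID:    " + game.Groups["gameID"].Value);
            Console.WriteLine("content:   " + game.Groups["content"].Value);
            Console.WriteLine("gameState: " + game.Groups["gameState"].Value);
        }
    }
}
```

Note that group names are case-sensitive, so whatever name the pattern declares is the name the calling code has to use.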

Using ReadableRex:

    Pattern findGamesPattern = Pattern.With.Literal(@"<div")
        .WhiteSpace.Repeat.ZeroOrMore
        .Literal(@"class=""game""").WhiteSpace.Repeat.ZeroOrMore.Literal(@"id=""")
        .NamedGroup("gameID", Pattern.With.Digit.Repeat.OneOrMore)
        .Literal(@"-game""")
        .NamedGroup("content", Pattern.With.Anything.Repeat.Lazy.ZeroOrMore)
        .Literal(@"<!--gameStatus")
        .WhiteSpace.Repeat.ZeroOrMore.Literal("=").WhiteSpace.Repeat.ZeroOrMore
        .NamedGroup("gameState", Pattern.With.Digit.Repeat.OneOrMore)
        .Literal("-->");

I think this would be much easier to maintain. Note that this library doesn't actually perform any regular expression operations; it simply provides another way to define regular expression patterns. You still need to use the System.Text.RegularExpressions.Regex object with the pattern you create. The Pattern type has an implicit conversion to System.String, so you can easily pass it to the methods and constructors on Regex.
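To make the "builder plus implicit conversion" idea concrete, here is a minimal sketch of how such a type could work. It is not the ReadableRex source: SketchPattern and all of its members are hypothetical names invented for this illustration, and the Repeat step is flattened into Exactly/Optional to keep it short.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical stand-in for illustration only; NOT the actual library code.
public sealed class SketchPattern
{
    private readonly string _expression;

    private SketchPattern(string expression) { _expression = expression; }

    // Entry point, imitating the "Pattern.With" style used in the post.
    public static SketchPattern With { get { return new SketchPattern(""); } }

    public SketchPattern AtBeginning { get { return Append("^"); } }
    public SketchPattern AtEnd { get { return Append("$"); } }
    public SketchPattern Digit { get { return Append(@"\d"); } }

    // Escape the text and wrap it in a non-capturing group so that a
    // following quantifier applies to the whole literal.
    public SketchPattern Literal(string text)
    {
        return Append("(?:" + Regex.Escape(text) + ")");
    }

    // Quantifiers simply append to whatever was added last.
    public SketchPattern Exactly(int count) { return Append("{" + count + "}"); }
    public SketchPattern Optional { get { return Append("?"); } }

    // Every step returns a new immutable instance.
    private SketchPattern Append(string fragment)
    {
        return new SketchPattern(_expression + fragment);
    }

    // This conversion is what lets the builder be passed wherever Regex expects a string.
    public static implicit operator string(SketchPattern pattern) { return pattern._expression; }

    public override string ToString() { return _expression; }
}

public static class SketchDemo
{
    public static void Main()
    {
        SketchPattern ssn = SketchPattern.With.AtBeginning
            .Digit.Exactly(3)
            .Literal("-").Optional
            .Digit.Exactly(2)
            .Literal("-").Optional
            .Digit.Exactly(4)
            .AtEnd;

        Regex check = new Regex(ssn);                     // implicit conversion to string
        Console.WriteLine(ssn);                           // ^\d{3}(?:-)?\d{2}(?:-)?\d{4}$
        Console.WriteLine(check.IsMatch("123-45-6789"));  // True
        Console.WriteLine(check.IsMatch("123456789"));    // True
        Console.WriteLine(check.IsMatch("12-345-6789"));  // False
    }
}
```

The builder never touches the regex engine itself; it only accumulates a string, which is why the standard Regex class is still needed to do the matching.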

What do you think? Download the code or just the assembly DLL, give it a try, and tell me what you think. None of the method/property names are set in stone, so the syntax may change, but the approach will remain the same.

Comments

Great work. I'm definitely going to try this out!

Steven - October 07, 2006 03:06pm

Great Job! How about calling the project ReadEx, for READable regular EXpression? I like shorter names :D

Keith Rull - October 13, 2006 02:47pm

Whoa... I never would have thought of trying that. But I'll certainly be trying your library out.

I think you're on to something!

David Mohundro - October 17, 2006 09:30pm

Dammit, I was working on this exact same blog post! :)

Someone needs to hire you away from Hell.com and put the rest of your creativity to effective work on a day-to-day basis.

Scott Bellware - October 23, 2006 09:28am

I found this announcement of a readable regular expressions library...

Judah - October 23, 2006 10:05am

I like it, even if I prefer the more terse regex "language."

One suggestion:

Instead of

{Digit|WhiteSpace}.Repeat.ZeroOrMore ==> {Digit|Whitespace}.Optional
{Digit|WhiteSpace}.Repeat.OneOrMore ==> {Digit|WhiteSpace}.Required

Or at least, instead of ZeroOrMore or OneOrMore use Optional or Required respectively.

ahz - October 23, 2006 11:45am

Great idea...
But how about those really hard to craft/understand expressions?
For example, in my project we use the following expressions:

^((?!my string).)*$
\A((?!my string).)*$\Z

Any idea what they do?

*SPOILER*
They're actually NOT operators, matching text that does not contain the "my string" phrase. The first one is for single line search, and the second is for multiline.
*END SPOILER*

I'd sure love to see a fluent expression that describes those expressions.

Omer Mor - October 23, 2006 12:42pm

I think you're wrong personally. Before I read anything about your article, I tried to read your easy to read expression, it was easier to read, but not easier to understand.

I think people just have to find a nice guide to learning regular expressions. They're very powerful and useful and if you put the effort in, it pays off. You *could* learn to do it this way, but I think in the end it would be just as hard trying to remember what does what as it is to remember what a character does.

Nice idea though!

John Hunt - October 25, 2006 05:04am

A nice idea in principle, but from the example in this post it strikes me that a user of this library needs to learn a fairly complex syntax which is almost as far from "plain english" as regex, when they could simply learn how to do regex.

'Literal("-").Repeat.Optional' does not automatically imply the same in my mind as '-?', but instead something more like '-*' or '-+'.

In my opinion a better system would be a regex compiler whereby you can communicate what you want in a far more English way, something along the lines of <a href="http://blogs.msdn.com/ericgu/archive/2003/07/07/52362.aspx">Regular Expression Workbench</a> but perhaps even simpler.

Chris Hollis - October 25, 2006 05:47am

May I suggest "Some html is allowed" be elaborated on? :( Never mind, I expect those reading the comment can extract the URL anyhow... maybe they can construct a regex to do it for them ;)

Chris Hollis - October 25, 2006 05:49am

Thanks for the feedback, everyone.

Omer: Yes, there are some edge cases that my proof-of-concept code does not cover. That doesn't mean they can't be done, it just means I didn't want to spend the time to cover every case before asking for feedback. Thanks for providing a real-world case that I can use for testing when I get to that. Any suggestions on how you think it should look?

John & Chris: I know there are other ways that would help make "creating" regular expressions easier (learning the syntax, or using a workbench tool). My attempt was to make "reading" regular expressions easier. Of course I understand that is completely subjective. If you think the one character symbols and punctuation are easier to understand, who am I to disagree?
I completely expected disagreements about the method names that I chose - that's why I said none of the names are set in stone and was asking for feedback. It sounds like the choice of "Repeat.Optional" for '0 or 1' was not a good choice, as a couple people have mentioned it. Any alternative suggestions? Would you prefer a "wrapping" appearance for optional, like this:
Pattern.With.Digit.Optional(Pattern.With.Literal("-")); // renders as: \d-?

I originally did all the repetition using that wrapping syntax, but felt it didn't read as smoothly.

Thanks for the reminder Chris - I need to fix DasBlog so it doesn't show "Some html is allowed" when I have all HTML disabled. It normally lists the symbols you allow, but since I don't allow anything, it doesn't list anything...

Joshua Flanagan - October 25, 2006 11:18pm

Hi Joshua,

Very cool! You might also be interested in this approach, which is a way to get a similar result with a more concise syntax: http://dotnet.agilekiwi.com/blog/2006/10/shorthand-interfaces.html

John Rusk - October 30, 2006 12:38am

Very cool!

I wrote about the need for a better RegEx syntax last year, but didn't have any ideas on how to implement it. This is a really cool solution to the "write only" RegEx syntax.

http://weblogs.asp.net/jgalloway/archive/2005/11/02/429218.aspx

Jon Galloway - October 30, 2006 02:41pm

Sweet.

Maybe we could put a collection of common expression together using the lib?

Schneider

Schneider - February 18, 2007 12:00am

Absolutely ridiculous. If you can't read/write regex, stop coding. You turned 1 line of code into 10. Maintainability is also related to the number of lines of code. Working with .NET must have clouded your thinking. Objects are not the be all end all of the programming world. Just sit down and learn Perl, then this stuff won't be so hard for you.

Steve - October 31, 2007 01:46pm

Steve - if you can't read/write your code using 1s and 0s, please stop coding. Working with higher level languages like perl must have clouded your thinking.

Joshua Flanagan - October 31, 2007 06:16pm

The only thing that absolutely stinks about fluent interfaces is their debuggability. The compiler treats this as one line of code, so it's impossible to step into individual calls.

The only way I could help this was to create a debug visualizer, you might look into that. You could hover over the variable, and the visualizer could show you the resultant regex. Just a thought...

Jimmy Bogard - November 02, 2007 08:42am

Josh: I *can* read and write in hex. Well, a workable Z80 subset. However, you ignored the main point: ones and zeroes are in no way shorter than higher-level code.

I think there is a misconception of readability here: The problem is not that you need to know what the syntactical elements of the regexp mean. Rather you need to understand what the whole expression actually does, and the verbose form doesn't help here at all.

Also I think the fluency suffers from being forced into the chained methods. There is no syntactical indication that 'Repeats' modifies the meaning of the previous entity while 'Digits' doesn't.

Jimmy: There is nothing to debug here anyway: The object incantation just produces a regular regexp.

Andreas Krey - November 04, 2007 02:59pm

Andreas: good point on the 1s and 0s - they definitely would involve more typing. My attempt at humor failed miserably.
And I completely agree about the quantifiers and grouping being ambiguous in this version of the API. It was something I wanted to improve (and you can see in the comments above that I was playing with different ideas), but never followed up on.

Joshua Flanagan - November 04, 2007 08:36pm

Josh, what does an email address validator look like using Readable Regex?

p.s. thanks for this library, I've used it as a tool many times to generate regular expressions, especially in helping out in the CodeProject forums.

Judah - November 16, 2007 11:45am

For Java programmers, something similar has been in Hamcrest for a while.

Nat - May 10, 2008 10:17am

To the users who think this is crap - Well, that's why you'll remain in the worker class, or should I say, last layer. Can't you see the creativity? Regex is too good to be less frequently used, which I see in many projects. In today's RAD world, time is money, and if I can get my developers rolling fast, well.....figure it out if you are good at Regex. If Mr. Flanagan was on my team, he would be considered management material. Great work!

Saif Khan - May 10, 2008 11:03am

Great work! and great idea!

Bryan Reynolds - May 10, 2008 12:06pm

I think the idea is sound, but the syntax seems lacking; in particular I agree with the comment regarding ambiguity about whether any given operation modifies the preceding one or not.

Perhaps there should be more focus on building sub-expressions in the Regex (as with regular Expression<Func<T>>), rather than continuous chains which are little more than longhand for the existing syntax? Perhaps also more extensions which encapsulate common patterns, eg .Digit(3); .Digit(3.OrMore). Something highly composable.

Dunno.. difficult one to solve.

Keith J. Farmer - May 17, 2008 06:40am

Thanks for the feedback Keith. C# 3.0 wasn't released when I first wrote this - it may be worth revisiting to try and take advantage of the new syntactical conveniences.

Joshua Flanagan - May 17, 2008 09:22am

I think it's important that things which look like Properties don't actually modify the thing to the left of "." and that methods *do* modify the thing to the left, not the one to left of that. Basically, it looks like what you've created isn't entirely OO, or at least makes the OO view difficult to obtain - more so than regex syntax does... I say take a look at Hibernate Criteria objects, and riff off that instead.

I'm imagining something like a List<Expression>, where Expression can be Digit, Letter, Group etc and each can have child expressions and a Multiplicity - which would be a class with properties like Min, Max, Greedy, Lazy etc... Then all you need is a static method to convert a List<Expression> structure to a string. You could even subclass List<Expression> and avoid the ugly static if you felt inclined.

I do use regexes a lot at feelitlive.com as you might guess, but just my 2c

Simon Gibbs - June 09, 2008 12:07pm

Doesn't the regex option IgnorePatternWhiteSpace make your code irrelevant? The premise you propose is the *after* maintenance. If the developer were to comment the pattern using the option it would go from

string pattern = @"^\d{3}-?\d{2}-?\d{4}$";

to (YOUR BLOG REMOVES SPACES....TACKY)

string pattern = @"
^ # Start at beginning of Line
\d{3} # First three digits of SSN
-? # optional dash
\d{2} # Middle two digits of SSN
-? # Optional Dash
\d{4} # Last Four digits
$ # End of Line";

If your blog didn't remove spaces you would see the # were aligned on the *FAR* right. Maintenance assured.

OmegaMan - August 09, 2008 05:49pm

Beautiful, Joshua. You scored a home run from both sides of the plate.

1) Using Fluent Interfaces, you also conceived a practical, elegant way to simplify regular expression construction.

2) Using a regular expression simplification utility, you conceived a practical, elegant way to demonstrate Fluent Interfaces.

I applaud you heartily.

- Jason Cook (uihero.wordpress.com)

Jason Cook - September 02, 2008 05:41pm

Hey Joshua,

Thought you might like to know that Anders Hejlsberg, the designer of C#, used your ReadableRegex library during his talk at JAOO 2008. You can watch the video here:

http://blog.jaoo.dk/2008/10/07/the-future-of-programming-languages/

ReadableRegex makes an appearance at about 15 minutes in.

Judah Himango - October 12, 2008 04:16pm

Great post! Reading through the comments above, I guess we shouldn't be surprised at people like Steve who hide behind the ubiquitous "web anonymity" and produce nonsense like "If you can't read/write regex, stop coding". Indeed, maintainability is related to the number of lines of code, but it is also intrinsically tied to the readability of the code. Even the best Regex Gurus I've worked with have trouble debugging their own expressions 3 or 4 months later. I think this is a fascinating example of how fluent interfaces can be used to abstract the complexity of things like Regex to make it possible for us "non-Perl-ites" to incorporate readable AND maintainable code into our applications....

Jim - January 08, 2009 10:18am

blog.flimflan.com
