Regex Toolkit, Prayer-Based Parsing, Bad Examples

Posted: August 18, 2011 in PowerShell

There was a recent entry on the Scripting Guy blog showing how to use PowerShell to parse email message headers. I have a number of problems with the script itself in that article, but it was the regex that really caught my eye. You need to read the Scripting Guy article to understand the context, but here’s the regex:

‘Received: from([\s\S]*?)by([\s\S]*?)with([\s\S]*?);([(\s\S)*]{32,36})(?:\s\S*?)’

Unfortunately there are serious issues with the Regex. I explain why and present an alternative later in this post.

What looked immediately strange to me in the regex was the character set ‘[\s\S]’. This matches a single character that is either a white space character (‘\s’) or not a white space character (‘\S’) – in other words it matches *any* single character (which is [almost] the same as the ‘.’ matching character; by default ‘.’ does not match a newline).
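The ‘almost’ matters: in the .NET regex engine that PowerShell uses, ‘.’ does not match a newline unless the single-line option ‘(?s)’ is switched on, whereas ‘[\s\S]’ always does. Here is a minimal sketch of the difference, using a made-up two-line string:

```powershell
# A made-up test string: 'a', a newline, then 'b'
$text = "a`nb"

$dotResult        = $text -match 'a.b'        # '.' will not match the newline
$charClassResult  = $text -match 'a[\s\S]b'   # [\s\S] matches any character at all
$singleLineResult = $text -match '(?s)a.b'    # (?s) lets '.' match the newline too

"{0} {1} {2}" -f $dotResult, $charClassResult, $singleLineResult   # False True True
```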

It’s clear, too, that the regex will likely fail whenever the server names in the email headers contain the substrings ‘by’ or ‘with’, as there are no delimiters around these keywords. It would be better/safer/more correct to test for white space around them using ‘\s+’ (which means ‘match one or more white space characters’), giving ‘\s+by\s+’ and ‘\s+with\s+’.
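To make the failure concrete, here is a small sketch using only the first part of each pattern. The header is invented for illustration; the sending server's name deliberately contains ‘by’:

```powershell
# Hypothetical header: the sending server's name happens to contain 'by'
$header = 'Received: from derby.example.com by mail.example.net with ESMTP; Thu, 18 Aug 2011 10:05:43 +0100'

# First part of the original pattern: no delimiters around 'by' or 'with'
$null = $header -match 'Received: from([\s\S]*?)by([\s\S]*?)with'
$looseFrom = $Matches[1]     # ' der' - the lazy match stopped at the 'by' inside 'derby'

# The same part with white space required around the keywords
$null = $header -match 'Received: from\s+(\S+)\s+by\s+(\S+)\s+with'
$strictFrom = $Matches[1]    # 'derby.example.com'
$strictBy   = $Matches[2]    # 'mail.example.net'
```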

Looking further on I was struggling to see what this part of the regex was supposed to do: ‘([(\s\S)*]{32,36})’, so I broke it down … The surrounding parens in this case mean it captures something – taking that away leaves ‘[(\s\S)*]{32,36}’.

The {32,36} part says ‘match the preceding pattern between 32 and 36 times’; the pattern actually being matched in this case is then ‘[(\s\S)*]’.

Because this pattern is enclosed in square brackets it means that ‘[(\s\S)*]’ is actually any single one of a set of characters – matching any of the single characters in the set. The characters it will match in this case are therefore: an open paren, ‘any space character’, ‘any non-space character’, a close paren or an asterisk. By inspection you can see that this matches any character (repeated 32-36 times).
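A quick way to confirm this reading is to anchor the fragment and try it on strings of different lengths. The test strings below are arbitrary; the anchors ‘^’ and ‘$’ are added here only so that the length limits are visible:

```powershell
# The fragment from the original regex, anchored so the whole string must match
$pattern = '^[(\s\S)*]{32,36}$'

$any32   = ('x' * 32)    -match $pattern   # True: 32 arbitrary characters
$mixed36 = ('a1 ;' * 9)  -match $pattern   # True: 36 characters, no parens or asterisks needed
$short31 = ('x' * 31)    -match $pattern   # False: too short
$long37  = ('x' * 37)    -match $pattern   # False: too long
```

So the parens and asterisk inside the square brackets add nothing: the fragment simply means ‘any 32 to 36 characters’.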

Huh?

At this point I was thoroughly confused. This is a good time to say that (a) I have no affiliation with the RegexBuddy company and (b) I’m not getting any payment for a plug! I can say that I’m a big fan of the RegexBuddy program; if you need to write a regex that is more complex than the average then RegexBuddy is a real help (and the documentation and regex library are great too). So, I started up RegexBuddy. Its decoding of the regular expression is included at the end of this post, but it confirmed what I thought: seriously broken…

So, in the Scripting Guy article, the author is right when he says ‘If you are good at Windows PowerShell and still haven’t used regular expressions, you are missing an important weapon in your Windows PowerShell arsenal’.  But, unfortunately, his solution is misleading and dangerous as an example.

Where’s the QA on the article??  This is posted on a Microsoft site; a site aimed at beginners – unfortunately not good.

A Fixed Regex

Here’s a better way.  Initially this is a simple version that defines a domain name (e.g. outgoing.red.com) as any sequence of any non-blank characters; we’ll refine that in a moment:

Received: from\s+([^\s]+)\s+by\s+([^\s]+)\s+with\s+([^;]+);\s(.+)

I’m not claiming this is perfect; Regex gurus will undoubtedly have refinements. However, I will say that in comparison to the original this is shorter, more robust and, importantly, correct.
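As a rough check, here is the simple version applied to a sample header. The header text is invented for illustration (only ‘outgoing.red.com’ comes from the example above):

```powershell
$pattern = 'Received: from\s+([^\s]+)\s+by\s+([^\s]+)\s+with\s+([^;]+);\s(.+)'

# Hypothetical header line, made up for this sketch
$header = 'Received: from outgoing.red.com by mail.example.net with ESMTP; Thu, 18 Aug 2011 10:05:43 +0100'

if ($header -match $pattern) {
    $from = $Matches[1]   # 'outgoing.red.com'
    $by   = $Matches[2]   # 'mail.example.net'
    $with = $Matches[3]   # 'ESMTP'
    $date = $Matches[4]   # 'Thu, 18 Aug 2011 10:05:43 +0100'
}
```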

Improving Further

We can refine this further by using RegexBuddy’s library (or looking on the web) for a domain name pattern. RegexBuddy suggests this:

\b([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}\b

This matches ‘a-z’ or ‘0-9’ characters one or more times and allows embedded hyphens, all this followed by a dot. This pattern is then repeated (it must occur at least once). It then matches ‘a-z’ characters (at least 2 of them) in order to match the top-level domain name.
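Trying the pattern on a few names shows what it accepts and rejects. The names other than ‘outgoing.red.com’ are made up, and the ‘^’ and ‘$’ anchors are added here only so that the whole string has to be a domain name. Note that PowerShell’s -match is case-insensitive by default, so ‘a-z’ also covers upper case.

```powershell
$domain   = '\b([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}\b'
$anchored = '^' + $domain + '$'    # anchors added for this test only

$plainName    = 'outgoing.red.com'    -match $anchored   # True
$hyphenated   = 'my-host.example.org' -match $anchored   # True: embedded hyphen is allowed
$noDot        = 'localhost'           -match $anchored   # False: needs at least one dot
$trailingDash = 'bad-.example.com'    -match $anchored   # False: a hyphen must be followed by a-z or 0-9
```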

If you haven’t yet got the regex way of things, this looks incomprehensible. Breaking it down into chunks makes it manageable. If you’re trying to learn regex then buy yourself a copy of “Mastering Regular Expressions” by Jeffrey Friedl. This is the regex reference, no question. If you can, get a copy of the second edition, which covers .NET regex. (Or get a copy of RegexBuddy and read the extensive help.)

Non-Capturing Parens

We need to revise this domain name pattern slightly because some of the parenthesised parts of the regex here are used for grouping. Because it isn’t explicitly stated otherwise these grouping parens will, by default, also capture whatever they happen to match. This isn’t a big deal but at the very least it means that referring to the captured groups will need to use different indexes. To avoid this we can modify the grouping parens so that they don’t capture by changing them from ‘(…)’ to ‘(?:…)’, so:

\b(?:[a-z0-9]+(?:-[a-z0-9]+)*\.)+[a-z]{2,}\b

Even more gobbledygook!
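The index shift is easy to see on a cut-down example. This is only a sketch: the input string is invented, and the two patterns are the capturing and non-capturing domain patterns dropped into a shortened ‘from … by …’ regex.

```powershell
# Hypothetical input, shortened to just the two server names
$text = 'from outgoing.red.com by mail.example.net'

# Grouping parens left as capturing: the 'by' server ends up in group 4
$capturing = 'from\s+(\b([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}\b)\s+by\s+(\b([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}\b)'
$null = $text -match $capturing
$byWithCapturing = $Matches[4]       # 'mail.example.net'

# Grouping parens made non-capturing: the 'by' server is group 2, as intended
$nonCapturing = 'from\s+(\b(?:[a-z0-9]+(?:-[a-z0-9]+)*\.)+[a-z]{2,}\b)\s+by\s+(\b(?:[a-z0-9]+(?:-[a-z0-9]+)*\.)+[a-z]{2,}\b)'
$null = $text -match $nonCapturing
$byWithNonCapturing = $Matches[2]    # 'mail.example.net'
```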

Outstanding Issues

This is pretty robust now, but I can spot at least one outstanding risk. If the name of the SMTP system includes a semicolon followed by white space (unlikely, I agree) then this will be taken to be the delimiter between the SMTP system name and the date. This could be fixed by explicitly parsing the date, but because the date is in the rather unfortunate RFC822 format (use RFC 3339/ISO 8601 format, people!) it’s not so easy to tie down. Instead, we can make sure that the semicolon we match as a delimiter is the last semicolon before the end of the string (of course, this fix assumes the date will never contain a semicolon!).

To modify the regex to do this we can change the delimiter and the final capture to:

‘;\s([^;]+)’
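One caveat worth checking: the SMTP system name capture ‘([^;]+)’ cannot itself contain a semicolon, so with no end-of-string anchor the match still splits at the first semicolon. The sketch below shows this on an invented header, alongside one possible variant (my assumption, not part of the pattern above) that lets the name capture be greedy and anchors the date to the end of the string. To keep it short, the domain names are matched with ‘\S+’ here.

```powershell
# Hypothetical header whose SMTP system name contains '; '
$header = 'Received: from outgoing.red.com by mail.example.net with ESMTP; odd name; Thu, 18 Aug 2011 10:05:43 +0100'

# Delimiter and final capture as given above, no end anchor
$asWritten = 'Received: from\s+(\S+)\s+by\s+(\S+)\s+with\s+([^;]+);\s([^;]+)'
$null = $header -match $asWritten
$nameAsWritten = $Matches[3]   # 'ESMTP'
$dateAsWritten = $Matches[4]   # 'odd name' - split at the first semicolon

# Assumed variant: greedy name capture plus '$', so the last '; ' is the delimiter
$lastSemicolon = 'Received: from\s+(\S+)\s+by\s+(\S+)\s+with\s+(.+);\s([^;]+)$'
$null = $header -match $lastSemicolon
$nameVariant = $Matches[3]     # 'ESMTP; odd name'
$dateVariant = $Matches[4]     # 'Thu, 18 Aug 2011 10:05:43 +0100'
```

The variant assumes the header has been unfolded into a single string that ends with the date.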

This gives us the following as the final regex:

Received: from\s+(\b(?:[a-z0-9]+(?:-[a-z0-9]+)*\.)+[a-z]{2,}\b)\s+by\s+(\b(?:[a-z0-9]+(?:-[a-z0-9]+)*\.)+[a-z]{2,}\b)\s+with\s+([^;]+);\s([^;]+)

Oh dear.  We’ve fixed a bunch of things, but this is not very user friendly (even if you have got the regex way of thinking…)

Layout

Things can be made better by splitting the regex over multiple lines. To do this we first have to include the ‘(?x)’ flag at the start of the regex. This turns on ‘Extended mode’, which ignores unescaped white space (including newlines) in the pattern and allows ‘#’ comments; it also means that the literal space in ‘Received: from’ now has to be written as ‘\s’. Here’s the final formatted pattern, enclosed in a here-string. This is more complex than the original working solution but it’s also likely to work more of the time…

$regex = @'
(?x)Received:\sfrom\s+  # Starting delimiter
(
  \b(?:[a-z0-9]+(?:-[a-z0-9]+)*\.)+[a-z]{2,}\b  # Match domain name and capture
)
\s+by\s+    # Delimiter between domain names
(
  \b(?:[a-z0-9]+(?:-[a-z0-9]+)*\.)+[a-z]{2,}\b  # Match domain name and capture
)
\s+with\s+   # Delimiter between domain names and SMTP system name
(
  [^;]+   # Capture SMTP system name
)
;\s     # Delimiter between SMTP system name and date
(
  [^;]+   # Capture the date
)
'@
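For completeness, here is a sketch of the formatted pattern in use. The pattern is repeated in a slightly more compact layout so that the snippet runs on its own; the header and the property names (From, By, With, Date) are made up for illustration.

```powershell
$regex = @'
(?x)Received:\sfrom\s+                            # Starting delimiter
(\b(?:[a-z0-9]+(?:-[a-z0-9]+)*\.)+[a-z]{2,}\b)    # Sending domain name
\s+by\s+
(\b(?:[a-z0-9]+(?:-[a-z0-9]+)*\.)+[a-z]{2,}\b)    # Receiving domain name
\s+with\s+
([^;]+)                                           # SMTP system name
;\s
([^;]+)                                           # Date
'@

# Hypothetical header line
$header = 'Received: from outgoing.red.com by mail.example.net with ESMTP; Thu, 18 Aug 2011 10:05:43 +0100'

if ($header -match $regex) {
    # Turn the captures into an object so later code can use named properties
    $received = New-Object PSObject -Property @{
        From = $Matches[1]
        By   = $Matches[2]
        With = $Matches[3]
        Date = $Matches[4]
    }
    $received
}
```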

Aren’t objects great?! If the SMTP headers were emitted as objects we wouldn’t need to do this ‘Prayer-based Parsing’ that James O’Neill has recently posted on.

Finally

Note that this parsing is still potentially broken. The RFC2822 header format (RFC2822 supersedes RFC822) defines a number of optional components in the header. Here’s an example extracted from the RFC:

Received: from x.y.test
   by example.net
   via TCP
   with ESMTP
   id ABC12345

Oops!  Where did that ‘via TCP’ part come from??!  Obviously Prayer-Based parsing is – errr, pretty shaky.
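A short sketch shows the final regex giving up as soon as the optional component appears. The field values follow the RFC excerpt above, written on one line; the date on the end is an illustrative assumption added so that the rest of the pattern has something to match.

```powershell
$pattern = 'Received: from\s+(\b(?:[a-z0-9]+(?:-[a-z0-9]+)*\.)+[a-z]{2,}\b)\s+by\s+(\b(?:[a-z0-9]+(?:-[a-z0-9]+)*\.)+[a-z]{2,}\b)\s+with\s+([^;]+);\s([^;]+)'

# Same header with and without the optional 'via TCP'; the date is assumed
$plain   = 'Received: from x.y.test by example.net with ESMTP id ABC12345; 21 Nov 1997 10:05:43 -0600'
$withVia = 'Received: from x.y.test by example.net via TCP with ESMTP id ABC12345; 21 Nov 1997 10:05:43 -0600'

$plainMatches = $plain   -match $pattern   # True
$viaMatches   = $withVia -match $pattern   # False: 'via TCP' sits between the domain and 'with'
```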

 

—————

Here’s what RegexBuddy says when decoding the faulty regex:

Received: from([\s\S]*?)by([\s\S]*?)with([\s\S]*?);([(\s\S)*]{32,36})(?:\s\S*?)

Match the characters “Received: from” literally «Received: from»
Match the regular expression below and capture its match into backreference number 1 «([\s\S]*?)»
   Match a single character present in the list below «[\s\S]*?»
      Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
      A whitespace character (spaces, tabs, line breaks, etc.) «\s»
      Any character that is NOT a whitespace character «\S»
Match the characters “by” literally «by»
Match the regular expression below and capture its match into backreference number 2 «([\s\S]*?)»
   Match a single character present in the list below «[\s\S]*?»
      Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
      A whitespace character (spaces, tabs, line breaks, etc.) «\s»
      Any character that is NOT a whitespace character «\S»
Match the characters “with” literally «with»
Match the regular expression below and capture its match into backreference number 3 «([\s\S]*?)»
   Match a single character present in the list below «[\s\S]*?»
      Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
      A whitespace character (spaces, tabs, line breaks, etc.) «\s»
      Any character that is NOT a whitespace character «\S»
Match the character “;” literally «;»
Match the regular expression below and capture its match into backreference number 4 «([(\s\S)*]{32,36})»
   Match a single character present in the list below «[(\s\S)*]{32,36}»
      Between 32 and 36 times, as many times as possible, giving back as needed (greedy) «{32,36}»
      The character “(” «(»
      A whitespace character (spaces, tabs, line breaks, etc.) «\s»
      Any character that is NOT a whitespace character «\S»
      One of the characters “)*” «)*»
Match the regular expression below «(?:\s\S*?)»
   Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.) «\s»
   Match a single character that is a “non-whitespace character” «\S*?»
      Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»

Created with RegexBuddy
