A friend I work with was asking me today how to match HTML comments using regular expressions. It turned out to be an interesting example of some of the pitfalls of regular expressions, and of the design work that needs to go into them.
HTML comments are marked out by <!-- and -->, for those who don’t know. For our examples we’ll use the code below, changing the pattern and subject defined by $pattern and $subject respectively. This is written in PHP, but the same patterns work in any Perl-compatible regular expression engine.
<?php
// $subject and $pattern must be set before this point (see the examples that follow)
// Show subject and pattern
echo "Subject: " . htmlspecialchars($subject) . "<br>Pattern: " . htmlspecialchars($pattern) . "<br>";
// Show how many times the pattern matches in the subject
echo "Matched: " . preg_match_all($pattern, $subject, $matches) . " times<br>";
// Show what the first set of brackets captured in each match
foreach ($matches[1] as $match) {
    echo htmlspecialchars($match) . "<br>";
}
?>
Right, so, patterns and subjects. We’ll start with the following subject:
$subject = "This is a test of <!--HTML COMMENTS--> and where we can see them";
So what pattern can we use to find this? The obvious approach would be to look for the < and >, but this would find normal HTML tags too. Instead, we’ll look for <!-- and -->. Simple, eh?
$pattern = "/<!--(.*)-->/i";
That pattern is ‘Get me the opening tag, then as many characters as possible, then a closing tag’. It produces:
HTML COMMENTS
Great! Well, not quite. If the subject is:
$subject = "This is a test of <!--HTML COMMENTS--> and where we <!--can--> see them";
We get:
HTML COMMENTS--> and where we <!--can
D’oh! The .* in the pattern has matched as many characters as possible before a closing tag, so it runs on to the last --> in the subject. Bugger.
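To see this greediness in isolation, here is a minimal sketch that runs the first pattern against both subjects. The helper name find_comments is my own invention for this example, and I’m assuming you run it from the command line, so it prints with print_r rather than building HTML.

```php
<?php
// Hypothetical helper (not part of the harness above): returns whatever the
// first set of brackets captured, for every match of $pattern in $subject.
function find_comments(string $pattern, string $subject): array
{
    preg_match_all($pattern, $subject, $matches);
    return $matches[1];
}

$pattern = "/<!--(.*)-->/i";

$one = "This is a test of <!--HTML COMMENTS--> and where we can see them";
$two = "This is a test of <!--HTML COMMENTS--> and where we <!--can--> see them";

// One comment: the greedy .* stops at the only closing tag available.
print_r(find_comments($pattern, $one));

// Two comments: .* runs on to the LAST closing tag, so we get a single
// match that swallows everything between the first <!-- and the final -->.
print_r(find_comments($pattern, $two));
```

The second call returns one element rather than two, which is exactly the problem described above.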
So what can we do? Well, how about matching on anything that isn’t the closing tag. The closing tag starts with a hyphen (-), so why not match on any character that isn’t that?
$pattern = "/<!--([^-]*)-->/i";
Gives us:
HTML COMMENTS
can
Great! No, not quite. We could have a hyphen in the subject’s comments that is not the start of the ‘close comment’ tag. E.g.:
$subject = "This is a test of <!--HTML-COMMENTS--> and where we <!--can--> see them";
Gives:
can
Ah. We no longer find the closing tag on the first comment, so it doesn’t match. If we change the pattern (again; there’s a fair bit of this in regular expressions), we could match on non-hyphens, or on hyphens not followed by a ‘>’.
The following pattern means ‘Find a start tag, then one or more chunks, each of which is either a run of non-hyphens or a hyphen followed by a non-angle-bracket, and then an end tag’.
$pattern = "/<!--(([^-]*|-[^>])+)-->/i";
Gives:
HTML-COMMENTS
can
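As a quick side-by-side, the sketch below runs the ‘no hyphens’ pattern and the new pattern over the same hyphenated subject. The variable names are mine, and the output is plain text on the assumption that it is run from the command line.

```php
<?php
$subject = "This is a test of <!--HTML-COMMENTS--> and where we <!--can--> see them";

// Non-hyphens only: gives up on the first comment at the lone hyphen.
$noHyphens = "/<!--([^-]*)-->/i";
// Runs of non-hyphens, OR a hyphen followed by anything except '>'.
$hyphenAware = "/<!--(([^-]*|-[^>])+)-->/i";

preg_match_all($noHyphens, $subject, $before);
preg_match_all($hyphenAware, $subject, $after);

// Index 1 is the outer pair of brackets, i.e. the whole comment body.
echo implode(", ", $before[1]), PHP_EOL;
echo implode(", ", $after[1]), PHP_EOL;
```

Note that the new pattern has two sets of brackets. The harness reads $matches[1], which is the outer set, so it still shows the full comment text.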
Done? Weeelll, no. You see, we could have two hyphens (‘--’) in the comment. Or angle brackets (specifically, a ‘>’). Or a single hyphen and angle bracket (‘->’).
Well, the first two are okay already. The first part of the ORed statement ([^-]*) will capture the angle bracket, as it isn’t a hyphen. And the double hyphen will match the second part of the ORed statement (-[^>]), as it is a hyphen followed by a non-bracket.
However, the single hyphen and angle bracket causes a problem. Neither clause will match it: the first can’t consume a hyphen, and this is not a hyphen followed by a non-bracket. If our subject were:
$subject = "My <!--car-goes--> much <!--much->faster 'cos > it's--> red";
We get:
car-goes
So, we want to match any non-hyphen, then a hyphen and an angle bracket, as well as the other clauses. This would be [^-]->
$pattern = "/<!--(([^-]*|-[^>]|[^-]->)+)-->/i";
Gives:
car-goes
much->faster 'cos > it's
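To recap the journey so far, here is a small sketch that runs all four patterns from this post against that last subject. The labels in the array are just my own shorthand, and it prints plain text on the assumption that you run it from the command line.

```php
<?php
$subject = "My <!--car-goes--> much <!--much->faster 'cos > it's--> red";

// The four patterns from this post, in the order they were tried.
$patterns = [
    "greedy dot"       => "/<!--(.*)-->/i",
    "no hyphens"       => "/<!--([^-]*)-->/i",
    "hyphen, not '>'"  => "/<!--(([^-]*|-[^>])+)-->/i",
    "plus '->' clause" => "/<!--(([^-]*|-[^>]|[^-]->)+)-->/i",
];

$results = [];
foreach ($patterns as $label => $pattern) {
    // preg_match_all returns the number of full matches it found.
    $count = preg_match_all($pattern, $subject, $matches);
    $results[$label] = $matches[1];
    echo $label, ": ", $count, " match(es)", PHP_EOL;
    foreach ($matches[1] as $comment) {
        echo "    ", $comment, PHP_EOL;
    }
}
```

Only the last pattern finds both comments. The greedy one returns a single over-long match, the ‘no hyphens’ one finds nothing at all, and the third finds only the first comment.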
Great, works! There is an interesting point to note, however…
The regular expression engine works by starting with the leftmost character in the string, and trying to match. If it cannot, then it moves along a character, and tries again, and so on.
Once it has started to match, if given an OR statement, it will try the alternatives from left to right, moving on to the next one whenever an alternative fails. Thus, if our subject were:
$subject = "Car <!--much->faster --> red";
the expression wouldn’t start matching until ‘<’. The ‘<!--’ would then match the fixed characters in the pattern. Next, we open some capturing brackets, so that we get the comment, not the tags. Then the [^-]* would match ‘much’. To help me understand, I imagine that there is a stack containing “<!--much”.
Next we try to match the first hyphen. It doesn’t match the [^-] (‘not a hyphen’), so the engine will try to match the next option (-[^>]). The hyphen will match and be added to the stack (now “<!--much-”), but the bracket will not. This option fails, and the hyphen is removed from the stack.
Next, the final option ([^-]->) will be tried. It fails at the first hurdle: we’re trying to match the hyphen, and it starts with a ‘not hyphen’. The end tag can’t match here either, as the hyphen is followed by a ‘>’ rather than a second hyphen.
So what happens? Well, the engine backtracks: it pops another character off the stack, from what it matched in the first option, and tries the other options again. Our stack is now “<!--muc”, and we are matching from the ‘h’ onwards. The second option fails as the ‘h’ isn’t a ‘-’.
Now, however, the third option matches, as we do have a non-hyphen, a hyphen and a bracket (‘h->’). Hurrah! As we can have as many of these 3 ORed options as we want, the remaining text (‘faster ’) is matched by [^-]*. Finally, [^-]* can get no further when it reaches the ‘-->’ that ends the comment, so we end the capturing brackets, and match the end of the comment.
Looking at what is in the brackets, we get (with a trailing space, as the space before the end tag is part of the comment):
much->faster
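If you’d like to check where the engine actually starts and what ends up in the brackets, PHP’s PREG_OFFSET_CAPTURE flag reports the byte offset of each capture. This is only a sketch with variable names of my own choosing. It shows the end result, not the backtracking steps in between, which PHP doesn’t expose.

```php
<?php
$subject = "Car <!--much->faster --> red";
$pattern = "/<!--(([^-]*|-[^>]|[^-]->)+)-->/i";

// PREG_OFFSET_CAPTURE turns every entry into a [text, byte offset] pair.
preg_match($pattern, $subject, $m, PREG_OFFSET_CAPTURE);

[$whole, $start] = $m[0];     // the full match, tags included
[$comment, $inner] = $m[1];   // the outer capturing brackets

echo "Match starts at offset ", $start, ": ", $whole, PHP_EOL;
// Square brackets around the capture make any trailing space visible.
echo "Captured from offset ", $inner, ": [", $comment, "]", PHP_EOL;
```

The match starts at offset 4, which is the ‘<’, and the capture starts four characters later, just after the opening tag.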
Clear as mud? Well, I’m afraid that that is common in regular expressions. These things are potentially very complex, have a nasty habit of matching things that you don’t mean them to, and in some engines you can write expressions that would take forever to actually execute.
If you want (or care) enough to learn more, I recommend ‘Mastering Regular Expressions’ from O’Reilly, by Jeffrey E. F. Friedl. It’s not exactly, um, light, but it gives a good grounding, explains the differences between the types of regular expression engines, offers tips on common tasks, and does it all in a way that isn’t terminally dull. And, I’ve got to say, regular expressions are a very useful tool to have in one’s armoury. I mean, who doesn’t have to process text?