Optimizing regular expressions with PHP


2372872685_5e35d92617_o I was intrigued by the following text in the PHP reference, especially because there is considerable regex use in the wehoneypot project:

S When a pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up the time taken for matching. If this modifier is set, then this extra analysis is performed. At present, studying a pattern is useful only for non-anchored patterns that do not have a single fixed starting character.

My questions were: (a) what exactly does it mean by “using”, since PHP doesn’t have the concept of “compiling” regular expressions like other languages have and (b) in which cases are these optimizations useful?

The first useful information about the issue I found on stackoverlow, where a comment by EBGreen mentioned the book “Mastering Regular Expressions” by Jeffrey Friedl. You can take a peek into the book using books.google.com. Starting with page 478 it discusses efficiency issues in PHP, including the S modifier. A quote from it:

Currently, the situations where study can and can’t help are fairly well defined: it enhances what Chapter 6 calls the initial class discrimination optimization.

This is essentially the same as the explanation from the manual, however it also gives an example which makes the issue clearer:

Let’s say that you have the regex /0x[a-f0-9]+/i. It is pretty clear that a match is possible in the string where we find a zero, and it makes no sense for the regex engine to try matching in other places (and in fact, it doesn’t). However, if we have an expressions like the following /<i>|<b>|<em>/i, it is still clear to the human observer that only places containing “<” can be a starting point for the match, however the regex engine doesn’t know this, unless the S flag is specified and it has the chance to perform some analysis on the regex.

Now lets put some relative numbers to this explanation: I used preg_match_all with an expression similar to the second one in a loop to extract all the matches from from a ~20MB string 10 000 times. However the variation was less than 2% (in absolute terms this would mean less than 2 seconds on my machine – this also falls below the statistically significant threshold, since between different runs was ~50%). Given that most applications do far fewer calls to the PCRE library on much shorter strings, the S modifier for the moment doesn’t seem have a noticeable performance impact.

Finally, here is an interesting presentation about writing a compiler for PHP from the Google Tech-Talks collection:

Picture taken from Geek&Poke’s photostream with permission.

,

Leave a Reply

Your email address will not be published. Required fields are marked *