regex – Grey Panthers Savannah

In praise of Regexp::Assemble

gpanther — Mon, 15 Mar 2010 08:42:00 +0000

…and of the Perl modules in general. I had the following problem:

Given a list of 16 character alphanumeric IDs, find all the lines from a large-ish (~6GB) logfile which contain at least one of the IDs.

The naive approach was to construct a big regular expression like \W(\QID1\E|\QID2\E|\QID3\E...)\W and match it against every line (I needed to capture the actual ID to know in which bucket to place the line). Needless to say, as is the case with most naive approaches, it was slooooow (it was hogging the CPU, not the disk). So, after searching around a little, I found Regexp::Optimizer and Regexp::Assemble. Of the two, the latter seemed the more mature one, so – after quickly installing it with CPAN – I put it into my code and made it run at the “speed of the disk”. W00t! Perl + CPAN + clever modules rock!
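To give a feel for why an assembled expression can be so much faster than a flat alternation, here is a minimal sketch of the prefix-sharing idea, written in Python purely for illustration. It is not the algorithm Regexp::Assemble actually uses; the function names and the sample IDs are made up for this example.

```python
import re

def build_trie(words):
    """Store the literal strings in a nested-dict trie; "" marks the end of a word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return trie

def trie_to_regex(node):
    """Turn the trie into a regex in which common prefixes appear only once."""
    if set(node) == {""}:
        return ""  # nothing follows the end-of-word marker
    optional = "" in node  # a word ends here while longer words continue
    branches = [re.escape(ch) + trie_to_regex(child)
                for ch, child in sorted(node.items()) if ch != ""]
    if len(branches) == 1 and not optional:
        return branches[0]
    body = "(?:" + "|".join(branches) + ")"
    return body + "?" if optional else body

# Hypothetical short IDs standing in for the real 16 character ones.
ids = ["AB12", "AB34", "AC56"]
assembled = trie_to_regex(build_trie(ids))
pattern = re.compile(r"\W(" + assembled + r")\W")
match = pattern.search("GET /item?id=AB34&x=1")
```

With the flat alternation the engine may have to try every ID at every position; with the shared prefixes it can give up on a position as soon as the first character or two fail to match.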

PS. A little benchmark data (take it with a grain of salt, since you should be profiling not benchmarking most of the time):

Unoptimized regex size: 873 427 characters
Optimized regex: 69 536 characters
Unoptimized regex matching time over 380 MB of data: ~1.9 hours (which would mean a throughput of ~58KB / sec – well below disk speed)
Optimized regex matching over the same 380 MB of data: 2 sec (throughput: 190 MB/sec !!!)

How cool is this?

RegEx which matches strings not containing a substring

gpanther — Wed, 10 Mar 2010 12:08:00 +0000

This is an interesting problem which can appear in certain cases (although not very often). A little searching around led me to many posts stating that there is no easy solution and the following easy solution:

^((?!my string).)*$

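It works as follows: the string must consist of zero or more characters, none of which is the start of the given string. (?! is the negative lookahead operator, so before each character is consumed the engine checks that “my string” does not begin at that position.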

It is quite straightforward, uses operators which are widely supported by regular expression engines and works even if “my string” is at the end of the string we are trying to match – for reasons which are not entirely clear to me.

Obviously it is a hack and you shouldn’t use it if you can use a clearer way to indicate your intention, but it is a nifty tool to have in your toolbox for that one moment when you need it.
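A quick way to convince yourself that the expression behaves as described is to try it on a few inputs. The sketch below uses Python's re module for illustration (the syntax of this particular pattern is the same in Perl and PCRE); the helper name lacks_substring is made up for this example.

```python
import re

# The pattern from the post: every character must not start "my string".
NOT_CONTAINING = re.compile(r"^((?!my string).)*$")

def lacks_substring(line):
    """Return True when the line does not contain "my string" anywhere."""
    return NOT_CONTAINING.match(line) is not None

results = {
    "hello world": lacks_substring("hello world"),
    "my string at the start": lacks_substring("my string at the start"),
    "ends with my string": lacks_substring("ends with my string"),
}
```

The last case shows the "at the end" behaviour: the lookahead fails at the position where “my string” begins, so the repetition cannot continue past it and the final $ can never be reached.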

Picture taken from crowt59’s photostream with permission.

Optimizing regular expressions with PHP

gpanther — Tue, 07 Apr 2009 12:42:00 +0000

I was intrigued by the following text in the PHP reference, especially because there is considerable regex use in the webhoneypot project:

S When a pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up the time taken for matching. If this modifier is set, then this extra analysis is performed. At present, studying a pattern is useful only for non-anchored patterns that do not have a single fixed starting character.

My questions were: (a) what exactly does it mean by “using”, since PHP doesn’t have the concept of “compiling” regular expressions like other languages have and (b) in which cases are these optimizations useful?

The first useful information about the issue I found on Stack Overflow, where a comment by EBGreen mentioned the book “Mastering Regular Expressions” by Jeffrey Friedl. You can take a peek into the book using books.google.com. Starting with page 478 it discusses efficiency issues in PHP, including the S modifier. A quote from it:

Currently, the situations where study can and can’t help are fairly well defined: it enhances what Chapter 6 calls the initial class discrimination optimization.

This is essentially the same as the explanation from the manual, however it also gives an example which makes the issue clearer:

Let’s say that you have the regex /0x[a-f0-9]+/i. It is pretty clear that a match can only start where the string contains a zero, and it makes no sense for the regex engine to try matching in other places (and in fact, it doesn’t). However, if we have an expression like /<…>|<…>|<…>/i (an alternation of tags that each begin with “<”), it is still clear to the human observer that only places containing “<” can be a starting point for the match, but the regex engine doesn’t know this unless the S flag is specified and it has the chance to perform some analysis on the regex.
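The idea behind this optimization can be sketched in a few lines. The Python code below is only an illustration of "work out which characters a match can start with, and skip every other position"; it is not how PCRE implements study. The tag list and the function names are placeholders invented for this example.

```python
import re

def first_chars(alternatives):
    """Collect the characters with which any of the literal alternatives can start."""
    return {alt[0] for alt in alternatives if alt}

def find_all(alternatives, text):
    """Find non-overlapping matches, attempting a match only at plausible start positions."""
    starts = first_chars(alternatives)
    pattern = re.compile("|".join(re.escape(alt) for alt in alternatives))
    hits = []
    pos = 0
    while pos < len(text):
        if text[pos] not in starts:
            pos += 1  # cheap check: no alternative can begin here
            continue
        m = pattern.match(text, pos)
        if m:
            hits.append((m.start(), m.group()))
            pos = m.end()
        else:
            pos += 1
    return hits

tags = ["<i>", "<b>", "<em>"]  # hypothetical alternatives, all starting with "<"
sample = "a <b>bold</b> and <em>x</em>"
hits = find_all(tags, sample)
```

For these alternatives first_chars returns just {"<"}, so the full match is attempted at only four positions of the sample string instead of all of them.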

Now let’s put some relative numbers to this explanation: I used preg_match_all with an expression similar to the second one, in a loop, to extract all the matches from a ~20MB string 10 000 times, with and without the S modifier. The difference was less than 2% (in absolute terms this would mean less than 2 seconds on my machine), which is well below the threshold of statistical significance, since the variation between different runs was ~50%. Given that most applications make far fewer calls to the PCRE library on much shorter strings, the S modifier for the moment doesn’t seem to have a noticeable performance impact.

Finally, here is an interesting presentation about writing a compiler for PHP from the Google Tech-Talks collection:

Picture taken from Geek&Poke’s photostream with permission.

Alternative regular expression syntax

gpanther — Fri, 27 Mar 2009 12:29:00 +0000

For a long time I was a believer in the “Perl way” of doing regular expressions and an avid reader of perlre. All other implementations I viewed as a “poor man’s copy” of the one true idea.

However, I found the Lua Patterns Tutorial quite enlightening. Even though it is called “patterns” and not “regular expressions”, it is a very similar concept. The very nice touch is that it uses % as the escape character rather than \ (like in PCRE). For example, to represent a digit you would write %d instead of \d, a syntax which I suppose is familiar to a larger audience of programmers (everybody who has used the printf / scanf family of functions). An excellent idea!

Check out the complete reference (or the wiki) for more details.

Picture taken from Uqbar is back’s photostream with permission.

Javascript regex quirk

gpanther — Sat, 03 Jan 2009 15:43:00 +0000

When I wrote the SMOG analyzer javascript I found a quirk of javascript, and this recent post inspired me to share it:

The javascript regex specification doesn’t have the s modifier. This modifier is necessary when you want to match across multiple lines with a construct like .*, because by default . does not match newlines. The suggested workaround I found was to use the [\s\S] character class, which means “any whitespace or non-whitespace character”.

BTW, I find that the multi-line / single-line name for the /m and /s modifiers is somewhat a misnomer, since it leads you to believe that you can’t use them together (how can something be a single line and multiple lines the same time?), however this is not true, since they refer to different elements (^ and $ versus .), which means that you can use and sometimes you need to use them.
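Both points can be tried out quickly. The sketch below uses Python's re module for illustration, where re.DOTALL plays the role of /s and re.MULTILINE the role of /m; the sample text is made up.

```python
import re

text = "first line\nsecond line"

# Without the "s" behaviour the dot stops at the newline, so this finds nothing.
plain_dot = re.search(r"first.*second", text)

# The [\s\S] workaround matches any character, newlines included.
workaround = re.search(r"first[\s\S]*second", text)

# The "s" modifier (DOTALL in Python) changes what . matches.
dotall = re.search(r"first.*second", text, re.DOTALL)

# The "m" modifier (MULTILINE in Python) changes where ^ and $ match.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

# Both together: ^ may match at any line start, while . runs across lines.
combined = re.findall(r"^\w+.*$", text, re.MULTILINE | re.DOTALL)
```

Because the two modifiers affect different elements of the pattern, combining them is perfectly consistent: in the last call ^ and $ are line-aware while .* still swallows the newline.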

The big java regex shoutout

gpanther — Thu, 11 Dec 2008 13:57:00 +0000

I discovered recently that the built-in java regex library has problems with some expressions, so I set out to find alternatives.

Searching for regex benchmarks, I found the following page: Java Regular expression library benchmarks (it also has an older version). The original IBM article also contains a benchmark. However both of these resources are a little dated, so I thought that I’ll remake the benchmark. Below are the results. I’ve only given relative results, because the exact times are irrelevant:

Packages                                     Failures  Time (relative)
java.util.regex.* 1.6                        0         6
dk.brics.automaton.* 1.7.2                   3         1
gnu.regexp.RE 1.1.4                          0         175
jregex.* 1.2.01                              0         5
com.karneim.util.collection.regex.* 1.1.1    3         2
org.apache.regexp.* 1.5                      0         100
com.stevenrbrandt.ubiq2.v10.pattwo.*         0         176
kmy.regex.util.* 0.1.2                       5         2

How to read the table? The failures column counts the tests where the library either (a) threw an exception or (b) failed to match the strings correctly. These libraries will show shorter times because they effectively skipped some tests.

My conclusion is: the built-in library is very good (and widely available). Try to stick with it. Also, porting regular expressions between engines can be very tricky, even if they use only a few of the more “exotic” features (like backreferences). The more such features you use, the smaller the chance that you can swap out the regex library without running into problems. The best thing is to have unit tests which confirm that you match / reject what you intend.
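Such tests can be as simple as a table of (pattern, input, expected result) rows that is replayed against whichever engine is in use. The sketch below is in Python for illustration only; the cases and the check_cases helper are invented for this example, and the search parameter stands in for the entry point of the engine you want to verify.

```python
import re

# Each row: (pattern, input, should it match?). Made-up examples.
CASES = [
    (r"(\w+) \1", "hello hello", True),    # backreference: repeated word
    (r"(\w+) \1", "hello world", False),
    (r"^\d{4}-\d{2}$", "2008-12", True),
    (r"^\d{4}-\d{2}$", "2008-1", False),
]

def check_cases(cases, search=re.search):
    """Return the rows whose actual outcome differs from the expected one."""
    failures = []
    for pattern, text, expected in cases:
        matched = search(pattern, text) is not None
        if matched != expected:
            failures.append((pattern, text, expected))
    return failures

failures = check_cases(CASES)
```

An empty failures list means the engine agrees with every expectation; after switching engines, any row that comes back points directly at the feature that behaves differently.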

Update: download the source code for the benchmark here (available under the GPL v3 license).

Regular Expressions in Java

gpanther — Thu, 27 Nov 2008 12:51:00 +0000

I was wondering why the gnu.regexp package exists, when Java already includes libraries for it. One thing I can think of is the fact that they’ve been added only in 1.4.

While searching around I found some surprising facts about the built-in regex libraries (the site goes up and down, so here is the Google Cache link in case it’s down again):

regular expressions are not compiled to a finite automaton, the way it’s done in other languages / libraries. This (I feel – I didn’t test it personally) can cause some considerable performance hits.

it can break for some extreme regular expressions. The given example is in Ruby (run under JRuby), but I translated to Java and found the same results (stack overflow exceptions):

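    import java.util.regex.Pattern;

    public class RegexStackOverflow {
        public static void main(String[] args) {
            // Build a long input that contains no line breaks.
            String long_string = "";
            for (int i = 0; i < 76; ++i)
                long_string += "xxxxxxxxxxxxxxxxxx";
            // This is the match that ended in a stack overflow in my test.
            if (Pattern.matches("\\A((?:.|\\n)*?)?([\r\n]{1,2}|--)", long_string))
                System.out.println("foo");
            else
                System.out.println("bar");
        }
    }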

Some conclusions: java.util.regex is still useful if you take care not to use overcomplicated regex's. There are alternate regular expression engines out there. Specifically I found this article which is a little old (from 2002) and feels a little like architecture astronautics (abstracting away the regex layer? really?), but it does include some benchmarks about the alternatives. Most probably all the packages have evolved since, so you should do your own benchmarking, but this is a good start. There is also some useful discussion at two bugs related to this: Bug 4675952 and Bug 5050507.

Update: I just posted a small test comparing alternative implementations for regex's under java.

Regex magic

gpanther — Wed, 04 Jul 2007 18:51:00 +0000

First of all I want to apologize to my readers (both of them :-)) for being AWOL, but real life sometimes interferes pretty badly.

I have always been a big fan of regular expressions, and one of the main reasons I love Perl is that they are so deeply integrated in it and are natural to use. (Of course there are many negative aspects one must be aware of, like speed or the fact that sometimes they can be quite hard to read.) To deal with the latter problem, here is a link to a Perl module which tries to dissect and explain step by step what a regular expression does:

YAPE::Regex::Explain. Be aware that it has a dependency on YAPE::Regex, but this is not declared in the package, so install YAPE::Regex::Explain will fail unless it is preceded by install YAPE::Regex, even though this should happen automatically (and it would if the package had been created properly). Running a regular expression through this module produces output like the following:

    The regular expression:

    (?-imsx:a+?)

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      a+?                      'a' (1 or more times (matching the least
                               amount possible))
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------

Another interesting module I came across thanks to this blog post is Regexp::Assemble, which can be used to combine regular expressions into one big expression that matches anything the starting expressions would have matched (so it is the union of the regular expressions), and the result is also optimized! Wicked cool.

Input validation

gpanther — Thu, 05 Apr 2007 06:43:00 +0000

The month of PHP bugs is over, but you should still watch the PHP-Security blog, since there are good things coming from there, like this article: Holes in most preg_match() filters. Go read it if you are using regular expressions for input validation. Two tips to avoid these pitfalls:

Cast your input to the datatype you expect before validating

Use capture to get the values out which interest you rather than trying to validate the whole string (this also adds usability because it helps users if they included tabs / spaces at the beginning or end of the input – for example because they were copy-pasting it from a Word document)
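The two tips can be combined into one small helper. The sketch below is in Python for illustration (the post itself is about PHP's preg_match); the function name extract_number and the nine-digit limit are assumptions made for this example.

```python
import re

# Optional surrounding whitespace, with the interesting part captured.
NUMBER = re.compile(r"\s*(\d{1,9})\s*")

def extract_number(raw):
    """Return the number contained in the input, or None if the input is not acceptable."""
    text = str(raw)  # tip 1: force the input to the expected type first
    m = NUMBER.fullmatch(text)
    if m is None:
        return None
    return int(m.group(1))  # tip 2: use the captured value, not the raw input

cleaned = extract_number("  42\t")
rejected = extract_number("42; DROP TABLE users")
wrong_type = extract_number(["42"])
```

Only the captured digits ever leave the function, so stray tabs or spaces from a copy-paste are tolerated without being passed on to the rest of the application.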

Moving to Ubuntu – The Regex Coach

gpanther — Thu, 05 Oct 2006 05:52:00 +0000

After reaching 21 posts and catching up with the Security Now! episodes, I thought that it’s time to start a new series. I am what I consider a pro Windows user and lately I started moving to Ubuntu. I toyed with Linux distros before, but this is the first one I feel that I can learn. This series is for other people like me, who come from a Windows background and want to play with Linux.

One of the programs I used over on windows was The Regex Coach. This is a very powerful free (like beer) program written in LispWorks to test regular expressions. There are installation instructions for Linux on the site, however there is one more little thing you must do before you can run it: from a terminal do sudo apt-get install lesstif2 if it complains that it can’t find libXm.so.2. Also the part where in the instruction it says that you should use xrdb -merge, the complete command line would be xrdb -merge regex-coach-resources, where the regex-coach-resources file can be found in the regex coach directory. The installation of lesstif2 probably also solves problems if other LispWorks programs complain when starting up under Ubuntu (or other Debian based distributions). A final quirk is that you can’t (or at least I haven’t discovered how to) copy / paste using the keyboard, but if you right click on the selected text, you get a pop-up menu which you can use to do these things.

Two more thoughts: when you have a problem with Ubuntu, you most probably can solve it by googling for it with the keyword ubuntu, since the ubuntu community is very large. If it so happens that you don’t find your answers, you should try to google for the problem with the keyword debian, because Ubuntu is based on Debian so what works in one usually works in the other. My second closing thought would be: .so files are shared objects. This corresponds to the DLLs from windows. If you don’t know where to get a certain .so from, go to http://packages.ubuntulinux.org/, go down to package content search and put in the file you’re looking for. You will get back the name of the package which you then can install with apt-get or Synaptic.

Update: I was informed by a good friend of mine that you can copy text without the popup menu: select the text you want to copy, go to the place where you want to copy it and middle click (or click both mouse buttons simultaneously). This should work in other graphical applications written for X too.
