Regular Expressions in Java


I was wondering why the gnu.regexp package exists, when Java already includes libraries for it. One thing I can think of is the fact that they’ve been added only in 1.4.

During searching around I found some surprising facts about the built-in regex libraries (the site goes up and and down, so here is the Google Cache link in case it’s down again):

  • regular expressions are not compiled to a finite automaton, the way it’s done in other languages / libraries. This (I feel – I didn’t test it personally) can cause some considerable performance hits.
  • it can break for some extreme regular expressions. The given example is in Ruby (run under JRuby), but I translated to Java and found the same results (stack overflow exceptions):

 public static void main(String[] args) {
  String long_string = "";
  for(int i = 0; i < 76; ++i)
   long_string += "xxxxxxxxxxxxxxxxxx";
  if (Pattern.matches("\A((?:.|\n)*?)?([rn]{1,2}|--)", long_string))
   System.out.println("foo");
  else
   System.out.println("bar");
 }

Some conclusions: java.util.regex is still useful if you take care not to use overcomplicated regex's. There are alternate regular expression engines out there. Specifically I found this article which is a little old (from 2002) and feels a little like architecture astronautics (abstracting away the regex layer? really?), but it does include some benchmarks about the alternatives. Most probably all the packages have evolved since, so you should do your own benchmarking, but this is a good start. There is also some useful discussion at two bugs related to this: Bug 4675952 and Bug 5050507.

Update: I just posted a small test comparing alternative implementations for regex's under java.

,

4 responses to “Regular Expressions in Java”

  1. The way that the Java regex library compiles expressions into an object tree rather (than a state machine in the traditional sense) isn’t such a bad design decision. The finite state machine approach makes sense when you don’t have an object-oriented language (and when you have performance characteristics of machines of the 1970s…). But I would argue that the object tree approach makes sense when the thing that you have at your disposal is an object-oriented language with JIT compilation, as in Java. (Or put another way, if you want to program in Fortran, use Fortran, not Java… 😉

    That said, it *is* disappointing to see some of the practical limits that you’ve highlighted and it would be great to see them worked on. I think I’ve been lucky, because I use the Java regex library quite a lot for various language/feed processing tasks, and I’ve not hit upon these limits.

  2. It’s not entirely clear to me how the “object tree” approach works, however one thing which is clear to me is that with a finite state automaton you need very little resources to keep state, while with this approach you consume large amount of stack space, so for this particular case the FSA method seems better.

    That said, the given regex is “cooked up” and very bad, so it is probably just a cornercase for this particular method, with the FSA having probably other corner cases.

  3. I have to agree that this is problem as it is not always possible to rewrite a regex to avoid alternates.

    For example, if I wish to allow users submit a rich subset of html but wish to prevent event handlers I'm going to have to filter their submissions, however I don't want a stackOverflowError to bring down Tomcat, so I have to catch the stckOverflowError and continue on with the unfiltered html, This is undesirable.

    For example in order to filter the following pattern to disable the event handler

    <img src="http://image.site.com/image.gif&quot; misdirection=">" onload="doEvil();>

    I need a regex similar to:

    <([^>]|(["`']).*?2)*W(onw+)W([^>]|(["`']).*?2)*>

    However this will cause a stack overflow in java.util.regex if there is any length of a string quoted in a tag.

    There is a library that promises not to overflow under any circumstances
    http://javaregex.com by Steven R Brandt, it maintains the same syntax as java.util.regex so replacement is easy. I'm not affiliated to this project in any way but have found it useful 😉

    Or alternatively the http://www.brics.dk/~amoeller/automaton/ library which has more limited regex support but does compile to automata and seems to show it in benchmarks.

Leave a Reply

Your email address will not be published. Required fields are marked *