We discovered that we had a couple of requests that took extreme amounts of time and
cpu power on our servers.
After dumping the thread stacks, we found out that the culprit was this regex:
When encountering strings that contains a large number of spaces, for example:
it uses very large amounts of cpu time to try to match.
After some testing, we determined that the execution time scaled in the order of O(2^n).
As seen in this table.
The problem class is called catastrophic backtracking, and many regex engines are
of the backtracking kind, also the standard one in java.
The problematic regex can be simplified to this in order to better highlight the
What happens here is that the regex engine have multiple paths for matching the spaces
it encounters, it can either add it to a the last space in the previous group or it
can start a new group with it. When it encounters the ending A in the string it’s
obvious for the human eye that the regex will never match, but the engine doesn’t know
that without trying all combinations of groups of spaces, so it will backtrack in
the string and try another way, which doesn’t work either, until all combinations
There is a number of ways to fix this problem, the most radical might be to replace
the regex engine with one that guarantees linear execution, for example RE2J.
Or you could avoid nesting quantifiers (the pluses and stars), but this requires
that you can restructure the problem.
Another way would be to rewrite the regex so that there isn’t a choice between
different ways to match the string, for example by changing the first * into a +.