Home

Speeding Up Slow RegExp-based Tokenizers in ECMAScript

Andy Chu, January 2010

This is a simple proposal to enhance the regular expression API in ECMAScript, based on some problems encountered during real usage.

Motivation

I ported the Narcissus JavaScript parser to engines such as v8 and JScript (IE), by removing Mozilla specific extensions (github repo). In doing so, I found that there is a fundamental limitation of JavaScript's RegExp API which causes O(n^2) blowup in regular-expression-based tokenizers, where n is the number of characters in the source string.

I also plan to do a port of JSON Pattern to JavaScript. This same limitation of JavaScript complicates it as well.

Problem

To describe the problem succinctly, JavaScript's RegExp API lacks the ability to match exactly at a given character position in a string.

The Narcissus tokenizer thus uses the following algorithm:

  1. Determine the token at position 0 in the source string by matching successively against a series of regular expressions (e.g. /^[+\-*/ ... ]/ for operators).

  2. If c is the length of that token, then create a new source string s.substring(c) to match against.

  3. Repeat steps 1 and 2 until the entire string is consumed.

Because the substring operation takes linear time in a nearly all JavaScript engines this algorithm blows up quadratically.

To work around this problem, you can write a tokenizer which avoids regular expressions altogether and instead uses a loop with successive .charAt() calls. Each character is tested individually. This avoids quadratic blowup but has a slow constant factor. You still must create O(#chars) objects, rather than scanning a single object.

Another way to avoid the blowup is to split the source string into lines. There's still a lot of wasteful work being done, but the algorithm does not blow up as the source file becomes longer.

Alternatively, some small API additions to RegExp would solve this problem in a simpler and more efficient way.

API additions

I propose two additions to the RegExp API:

  1. The ability to match exactly, rather than searching for match.
  2. The ability to match at a character position

Note that setting RegExp.lastIndex combined with a left-anchoring ^ does not do this. This is correct, as ^ should only match character position 0 in the string.

But, importantly, these existing features make this proposal more like "exposing something that already exists" rather than "adding something new".

For comparison, Python has this capability via the re.match / re.search distinction, and the pos and argument to both.

Signatures

Update: Please see this addendum for a new API proposal.

There are a few ways to add this to ECMAScript.

1. Positional arguments:

RegExp.exec(str, pos, leftAnchored)

Example call:

operatorRegex.exec("var a = 5+3;", 10, true);

This is not great, because boolean arguments are hard to read at the call site. In addition, both arguments are optional, so writing:

operatorRegex.exec("var a = 5+3;", 0, true);

is awkward when you want to override the default true but not the default 0.

2. Using an object literal for optional arguments:

operatorRegex.exec("var a = 5+3;", {pos: 10, leftAnchored: true});

This style is becoming popular now, but not consistent with too many native APIs.

3. Using two methods:

operatorRegex.exec("var a = 5+3;", 10);
operatorRegex.execLeft("var a = 5+3;", 10);

Downside: You will also need .test() and .testLeft(). For simplicity, I propose that only the RegExp methods and not the String methods be changed.

After some thought, I recommend option 3, for readability and future-proofing of the API. Python also has an endpos parameter to both .match() and .search(). With option 1, adding that argument becomes impossible in the future.

Adding optional, named arguments to ECMAScript, in the style of Python, may produce a nicer API.

Conclusion

This is a very simple addition that will increase the power of JavaScript's regular expression engine, and speed up many real programs, without complicating its implementation. Since anchored matches are already possible with ^, and matching within a string is possible with RegExp.lastIndex, this should be a small change for all implementations.