Home |
Speeding Up Slow RegExp-based Tokenizers in ECMAScriptAndy Chu, January 2010 This is a simple proposal to enhance the regular expression API in ECMAScript, based on some problems encountered during real usage. MotivationI ported the
Narcissus
JavaScript parser to engines such as v8 and
JScript (IE), by removing Mozilla
specific extensions (github repo). In
doing so, I found that there is a fundamental limitation of JavaScript's
I also plan to do a port of JSON Pattern to JavaScript. This same limitation of JavaScript complicates it as well. ProblemTo describe the problem succinctly, JavaScript's The Narcissus tokenizer thus uses the following algorithm:
Because the To work around this problem, you can write a tokenizer which avoids regular
expressions altogether and instead uses a loop with successive Another way to avoid the blowup is to split the source string into lines. There's still a lot of wasteful work being done, but the algorithm does not blow up as the source file becomes longer. Alternatively, some small API additions to API additionsI propose two additions to the
Note that setting But, importantly, these existing features make this proposal more like "exposing something that already exists" rather than "adding something new". For comparison, Python has this capability via the
re.match /
re.search
distinction, and the SignaturesUpdate: Please see this addendum for a new API proposal. There are a few ways to add this to ECMAScript. 1. Positional arguments:
Example call:
This is not great, because boolean arguments are hard to read at the call site. In addition, both arguments are optional, so writing:
is awkward when you want to override the default 2. Using an object literal for optional arguments:
This style is becoming popular now, but not consistent with too many native APIs. 3. Using two methods:
Downside: You will also need After some thought, I recommend option 3, for readability and
future-proofing of the API. Python also has an Adding optional, named arguments to ECMAScript, in the style of Python, may produce a nicer API. ConclusionThis is a very simple addition that will increase the power of JavaScript's
regular expression engine, and speed up many real programs, without complicating
its implementation. Since anchored matches are already possible with |