Optimize hot path in tokenizer loop #3955
Added caching to the edit() function's getRegex() method to avoid recompiling the same regex patterns repeatedly. Also added caching to the dynamic regex functions in the 'other' object that are called during parsing with different parameters (listItemRegex, nextBulletRegex, hrRegex, fencesBeginRegex, headingBeginRegex, htmlBeginRegex, blockquoteBeginRegex).

This reduces regex compilation overhead and improves benchmark performance by ~4.9% (304 -> 318.88 ops/sec).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
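To illustrate the caching idea described in this commit message, here is a minimal sketch of a memoized regex factory. The names cachedRegexFactory and headingLikeRegex, and the pattern itself, are hypothetical and chosen for illustration; this is not the actual code in rules.ts.

```typescript
// Hypothetical helper: wraps a regex builder so each distinct parameter
// value is compiled at most once. Not the actual marked implementation.
function cachedRegexFactory(
  build: (indent: number) => RegExp,
): (indent: number) => RegExp {
  const cache = new Map<number, RegExp>();
  return (indent: number): RegExp => {
    let regex = cache.get(indent);
    if (regex === undefined) {
      regex = build(indent); // compile only on a cache miss
      cache.set(indent, regex);
    }
    return regex;
  };
}

// Count compilations so the effect of the cache is observable.
let compileCount = 0;

// Illustrative pattern only: up to `indent` leading spaces, then '#'.
const headingLikeRegex = cachedRegexFactory((indent) => {
  compileCount++;
  return new RegExp(`^ {0,${indent}}#`);
});

headingLikeRegex(2); // compiles
headingLikeRegex(2); // served from the cache
headingLikeRegex(3); // new parameter, compiles again
console.log(compileCount);
```

Sharing cached RegExp objects this way is only safe when the patterns carry no g or y flag, because those flags make lastIndex stateful across calls.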
…tion-in-rules-ts Thesis #1: Optimize regex compilation in rules.ts
Optimized the blockTokens() and inlineTokens() methods by caching this.tokenizer and this.tokenizer.rules in local variables at the start of the main tokenization loops. This eliminates repeated property lookups on every iteration.

Benchmark results (average of 10 runs):
- Baseline: 319.02 ops/sec (3134.60ms)
- Optimized: 324.53 ops/sec (3081.40ms)
- Improvement: ~1.7%

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
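The hoisting described above might look roughly like the following sketch. SketchLexer, SketchTokenizer, and SketchRules are hypothetical, heavily simplified stand-ins; the real Lexer, Tokenizer, and rules objects in marked are much larger and work differently in detail.

```typescript
// Hypothetical, simplified shapes for illustration only.
interface SketchRules {
  newline: RegExp;
  line: RegExp;
}

interface SketchTokenizer {
  rules: SketchRules;
}

class SketchLexer {
  tokenizer: SketchTokenizer;

  constructor(tokenizer: SketchTokenizer) {
    this.tokenizer = tokenizer;
  }

  blockTokens(src: string): string[] {
    // Read the properties once, before the loop, instead of evaluating
    // this.tokenizer.rules on every iteration.
    const tokenizer = this.tokenizer;
    const rules = tokenizer.rules;
    const tokens: string[] = [];

    while (src) {
      let match = rules.newline.exec(src);
      if (match) {
        src = src.substring(match[0].length);
        tokens.push("space");
        continue;
      }
      match = rules.line.exec(src);
      if (match) {
        src = src.substring(match[0].length);
        tokens.push("text");
        continue;
      }
      throw new Error("No rule matched; aborting to avoid an infinite loop");
    }
    return tokens;
  }
}

const sketchLexer = new SketchLexer({
  rules: { newline: /^\n+/, line: /^[^\n]+/ },
});
const sketchTokens = sketchLexer.blockTokens("a\n\nb");
console.log(sketchTokens);
```

This pattern assumes that this.tokenizer and its rules are not reassigned while the loop runs; if they could be, the local copies would go stale.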
Someone is attempting to deploy a commit to the MarkedJS Team on Vercel. A member of the Team first needs to authorize it.
Code Review
This pull request implements performance optimizations for the marked parser by caching tokenizer and rules references in Lexer.ts to reduce property lookup overhead in hot loops. It also introduces regex caching in rules.ts for the edit utility and several dynamic regex generators. A review comment identifies an opportunity to improve the efficiency of the regex cache in src/rules.ts by using clamped indentation values as keys to prevent redundant entries.
return (indent: number) => {
  let regex = cache.get(indent);
  if (!regex) {
    regex = new RegExp(`^ {0,${Math.min(3, indent - 1)}}(?:[*+-]|\\d{1,9}[.)])((?:[ \t][^\\n]*)?(?:\\n|$))`);
    cache.set(indent, regex);
  }
  return regex;
};
The cache key used here is the raw indent value, but the resulting regex only depends on the clamped value Math.min(3, indent - 1). This leads to redundant entries in the Map for different indentation levels that produce the same regex (e.g., indent 4, 5, 6... all result in the same regex). Using the clamped value as the key would be more efficient. This observation applies to nextBulletRegex, hrRegex, fencesBeginRegex, headingBeginRegex, htmlBeginRegex, and blockquoteBeginRegex.
return (indent: number) => {
  const key = Math.min(3, indent - 1);
  let regex = cache.get(key);
  if (!regex) {
    regex = new RegExp("^ {0," + key + "}(?:[*+-]|\\d{1,9}[.)])((?:[ \\t][^\\n]*)?(?:\\n|$))");
    cache.set(key, regex);
  }
  return regex;
};
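To make the difference between the two keying strategies concrete, the sketch below builds the same list-item pattern under both and compares how many cache entries each accumulates. The makeCachedRegex helper and its size() accessor are hypothetical additions for this illustration and do not exist in rules.ts.

```typescript
// Hypothetical helper for comparing cache-key strategies.
function makeCachedRegex(keyOf: (indent: number) => number) {
  const cache = new Map<number, RegExp>();
  const get = (indent: number): RegExp => {
    const key = keyOf(indent);
    let regex = cache.get(key);
    if (!regex) {
      // The pattern depends only on the clamped value, whatever the key is.
      const clamped = Math.min(3, indent - 1);
      regex = new RegExp(
        "^ {0," + clamped + "}(?:[*+-]|\\d{1,9}[.)])((?:[ \\t][^\\n]*)?(?:\\n|$))",
      );
      cache.set(key, regex);
    }
    return regex;
  };
  return { get, size: () => cache.size };
}

const rawKeyed = makeCachedRegex((indent) => indent);
const clampedKeyed = makeCachedRegex((indent) => Math.min(3, indent - 1));

for (let indent = 1; indent <= 10; indent++) {
  rawKeyed.get(indent);
  clampedKeyed.get(indent);
}

// Raw keys store one entry per indent; clamped keys collapse 4..10 into one.
console.log(rawKeyed.size(), clampedKeyed.size());
```

With raw keys, indents 4 and 10 produce separate RegExp objects that have identical sources, which is the redundancy the review comment points out.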
Summary
Cache this.tokenizer and this.tokenizer.rules in local variables within the blockTokens() and inlineTokens() hot loops
Performance Results
Benchmark results (average of 10 runs):
Test plan
🤖 Generated with Claude Code