Skip to content

Optimize HTML entity decoding in direct_fetch provider#451

Open
d-oit wants to merge 6 commits into
mainfrom
perf/optimize-entity-decoding-direct-fetch-13342671035324967602
Open

Optimize HTML entity decoding in direct_fetch provider#451
d-oit wants to merge 6 commits into
mainfrom
perf/optimize-entity-decoding-direct-fetch-13342671035324967602

Conversation

@d-oit

@d-oit d-oit commented Jun 19, 2026

Copy link
Copy Markdown
Owner

The decode_entities function in the direct_fetch provider was identified as a performance bottleneck due to 20 sequential .replace() calls. Each call triggered a full string traversal and a potential new string allocation.

I implemented a manual single-pass scanner that iterates through the string once and performs lookahead only when an ampersand (&) or word joiner (\u{2060}) is encountered. I also added a fast-path check using .contains() to return early if no replacement characters are present.

Performance improvements measured with Criterion:

  • Fast-path (no entities): ~3.05µs reduced to ~82ns (~37x faster).
  • Content with entities: ~5.77µs reduced to ~1.52µs (~3.8x faster).
  • Long content: ~30.6µs reduced to ~14.9µs (~2x faster).

All tests passed, including the direct_fetch integration tests that verify LaTeX and code block extraction which rely on this decoding logic.


PR created automatically by Jules for task 13342671035324967602 started by @d-oit

Replaced 20 sequential `.replace()` calls in `decode_entities` with a
manual single-pass scanner and a fast-path check for strings without
entities.

Benchmark results (`decode_entities_bench`):
- Text with entities: ~5.7µs -> ~1.5µs (3.8x speedup)
- Text without entities: ~3.0µs -> ~82ns (36.5x speedup)
- Long text with entities: ~30.5µs -> ~15.0µs (2x speedup)

Co-authored-by: d-oit <6849456+d-oit@users.noreply.github.com>
@google-labs-jules

Copy link
Copy Markdown
Contributor

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.

@vercel

vercel Bot commented Jun 19, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
do-web-doc-resolover Ready Ready Preview, Comment Jun 19, 2026 4:22pm

@deepsource-io

deepsource-io Bot commented Jun 19, 2026

Copy link
Copy Markdown

DeepSource Code Review

We reviewed changes in 2a53733...76bc49f on this pull request. Below is the summary for the review, and you can see the individual issues we found as inline review comments.

See full review on DeepSource ↗

PR Report Card

Overall Grade   Security  

Reliability  

Complexity  

Hygiene  

Code Review Summary

Analyzer Status Updated (UTC) Details
JavaScript Jun 19, 2026 11:26a.m. Review ↗
Python Jun 19, 2026 11:26a.m. Review ↗
Rust Jun 19, 2026 11:26a.m. Review ↗
Shell Jun 19, 2026 11:26a.m. Review ↗

Important

AI Review is run only on demand for your team. We're only showing results of static analysis review right now. To trigger AI Review, comment @deepsourcebot review on this thread.

@codacy-production

codacy-production Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 22 complexity · 2 duplication

Metric Results
Complexity 22
Duplication 2

View in Codacy

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

}

// Proposed optimized implementation
fn decode_entities_optimized(text: &str) -> String {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

`fn decode_entities_optimized` has a cyclomatic complexity of 30 with "very-high" risk


A function with high cyclomatic complexity can be hard to understand and
maintain. Cyclomatic complexity is a software metric that measures the number of
independent paths through a function. A higher cyclomatic complexity indicates
that the function has more decision points and is more complex.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the implementation to be truly zero-allocation by using slices. Benchmarks show a further speedup (now ~45x faster than the original code for clean text).

Comment thread cli/benches/decode_entities_bench.rs Outdated

while let Some(ch) = chars.next() {
if ch == '&' {
let mut entity = String::new();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empty call to `new()`


The new() function is used to initialise an object with specific data.
If no arguments are passed, the behaviour is identical to default().

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jules address feedback

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've refactored the scanner to use char_indices() and string slicing. This eliminates the String::new() allocation inside the ampersand check, making it a zero-allocation entity detector.


/// Decode basic HTML entities
/// Decode basic HTML entities using a single-pass scanner for performance
fn decode_entities(text: &str) -> String {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

`fn decode_entities` has a cyclomatic complexity of 30 with "very-high" risk


A function with high cyclomatic complexity can be hard to understand and
maintain. Cyclomatic complexity is a software metric that measures the number of
independent paths through a function. A higher cyclomatic complexity indicates
that the function has more decision points and is more complex.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've optimized this function by implementing a single-pass manual scanner with a fast-path check. This replaces the previous 20 sequential .replace() calls, which were inefficient due to multiple full-string scans and intermediate allocations. Benchmarks show a ~37x speedup for clean text and a ~3.8x speedup for content with entities.

Comment thread cli/src/providers/direct_fetch.rs Outdated

while let Some(ch) = chars.next() {
if ch == '&' {
let mut entity = String::new();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empty call to `new()`


The new() function is used to initialise an object with specific data.
If no arguments are passed, the behaviour is identical to default().

Fixed `clippy::while_let_on_iterator` in `src/providers/direct_fetch.rs`
and `benches/decode_entities_bench.rs` by replacing `while let` with
`for` loops. Removed unused `OnceLock`, `Regex`, and `Captures` imports
from the benchmark file.

Co-authored-by: d-oit <6849456+d-oit@users.noreply.github.com>
Applied `cargo fmt` to `cli/benches/decode_entities_bench.rs` to resolve
CI formatting check failure.

Co-authored-by: d-oit <6849456+d-oit@users.noreply.github.com>
Finalized the performance optimization of `decode_entities` by ensuring
full compliance with repository linting and formatting standards.

- Applied `cargo fmt` to resolve CI formatting failures.
- Fixed `clippy::while_let_on_iterator` in the benchmark scanner.
- Maintained ~37x speedup for text without entities.

Co-authored-by: d-oit <6849456+d-oit@users.noreply.github.com>
Refactored `decode_entities` to use `char_indices()` and string slicing,
eliminating the transient `String` allocation for every ampersand
encountered.

Benchmark results (`decode_entities_bench`):
- Text with entities: ~5.7µs -> ~915ns (6x speedup)
- Text without entities: ~3.0µs -> ~60ns (50x speedup)
- Long text with entities: ~30.5µs -> ~8.6µs (3.5x speedup)

Co-authored-by: d-oit <6849456+d-oit@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant