feat: add fastCRW search connector #1490

Open
us wants to merge 2 commits into MODSetter:main from us:feat/add-fastcrw

@us us commented Jun 13, 2026

What

Adds fastCRW as a web-search connector, alongside the existing ones (SearXNG, Tavily, …).

Why

fastCRW is a fully open-source (AGPL) web-scraping and search engine — a single ~8 MB Rust binary, ~6 MB RAM at idle — that ships the complete stack in its open core, no cloud dependency required.

100% local / open core — stealth, proxies, JS rendering all included

Firecrawl's self-hosted OSS build cannot reach Cloudflare-protected or JS-heavy sites in practice: its real anti-bot / stealth path (fire-engine) lives behind a cloud-only flag, so the self-hosted build falls back to plain fetch. fastCRW ships Cloudflare JS-challenge handling, UA rotation, SPA rendering, and BYO-proxy + rotation unconditionally, with no flags and no cloud account required. Combining SurfSense with fastCRW gives users a complete, fully self-hosted stack.

Faster + higher quality (measured on Firecrawl's own benchmark dataset)

On Firecrawl's public benchmark dataset, fastCRW scores 63.74% truth-recall vs 56.04% for Firecrawl, with faster median latency (p50 ~1.9 s vs ~2.3 s). Single binary deployment means no Redis, no workers, no Playwright sidecar.

Search — built on SearXNG, with a quality layer on top

crw is not an alternative to SearXNG — it is built on top of it. SearXNG is the metasearch aggregator underneath; crw adds a quality layer: query expansion (multi-variant rewrite), content-aware reranking (re-scoring by fetched content rather than SearXNG's content-blind ordering), a calibrated direct-answer mode, and category routing (research queries fan out to arxiv / Semantic Scholar / Google Scholar, code queries to GitHub). The /v1/search endpoint also uses multi-round retrieval for deeper research flows. The result is SearXNG's breadth plus a measurable accuracy layer — all open-source (AGPL) and fully self-hostable with configurable engines.

Firecrawl-API compatibility — why the diff is tiny

fastCRW implements the Firecrawl REST API, which is why this connector is a small additive diff that mirrors the existing pattern exactly.

Changes (additive only)

  • app/services/connector_service.py: crw search provider mirroring the existing connector.
  • Alembic migration 160_add_crw_api_enum + connector_searchable_types.py, db.py, validators.py: registers the CRW_API connector enum.
  • Wired into the research/chat web_search tools.
  • tests/unit/services/test_connector_service_crw.py.
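
As a rough illustration of what registering a new connector enum value involves on PostgreSQL, the helper below builds the kind of `ALTER TYPE ... ADD VALUE` statement an enum migration typically executes. The helper name and the enum type name `searchsourceconnectortype` are assumptions made for this sketch; the actual contents of `160_add_crw_api_enum` are not shown in this thread.

```python
import re

# Hypothetical helper (not from the PR): builds the DDL that a PostgreSQL
# enum migration would run to register a new label.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def add_enum_value_sql(enum_type: str, value: str) -> str:
    """Return an idempotent ALTER TYPE statement for a new enum label."""
    # Reject anything that is not a plain identifier, since the names are
    # interpolated directly into the SQL text.
    if not _IDENTIFIER.match(enum_type) or not _IDENTIFIER.match(value):
        raise ValueError("enum type and value must be plain identifiers")
    # IF NOT EXISTS keeps the statement safe to run more than once.
    return f"ALTER TYPE {enum_type} ADD VALUE IF NOT EXISTS '{value}'"


# Inside an Alembic upgrade() this string would be passed to op.execute(...).
# "searchsourceconnectortype" is an assumed type name.
print(add_enum_value_sql("searchsourceconnectortype", "CRW_API"))
```

PostgreSQL cannot drop a single enum label, so a downgrade for this kind of migration is usually a no-op.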

Get a CRW_API_KEY from https://fastcrw.com/dashboard (free tier); a self-hosted base URL is also supported. Happy to adjust anything here: I maintain fastCRW and can provide free credits.

High-level PR Summary

This PR adds fastCRW as a new web-search connector option for SurfSense. The implementation follows the existing pattern used by other search providers (Tavily, Linkup, Baidu) by registering a new CRW_API connector type via database migration, implementing the search_crw method in the connector service that calls fastCRW's /v1/search endpoint, and wiring the new connector into the research and chat agent web search tools. The connector supports both managed cloud and self-hosted deployments with optional API key authentication, and includes comprehensive unit tests covering success cases, error handling, and configuration variants.

⏱️ Estimated Review Time: 15-30 minutes

💡 Review Order Suggestion

| Order | File Path |
|---|---|
| 1 | surfsense_backend/app/db.py |
| 2 | surfsense_backend/alembic/versions/160_add_crw_api_enum.py |
| 3 | surfsense_backend/app/utils/validators.py |
| 4 | surfsense_backend/app/services/connector_service.py |
| 5 | surfsense_backend/app/agents/chat/multi_agent_chat/main_agent/runtime/connector_searchable_types.py |
| 6 | surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/deliverables/tools/knowledge_base.py |
| 7 | surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/research/tools/web_search.py |
| 8 | surfsense_backend/app/agents/chat/shared/tools/web_search.py |
| 9 | surfsense_backend/app/services/ai_file_sort_service.py |
| 10 | surfsense_backend/tests/unit/services/test_connector_service_crw.py |

Summary by CodeRabbit

  • New Features
    • Added fastCRW (CRW_API) as a live search connector with configurable endpoint and optional API key; integrated into search UI/flows and file-labeling.
  • Chores
    • Database migration to register the new connector type.
  • Behavior Changes
    • Web search now returns raw engine/chunk results (no URL-based deduplication); knowledge-base formatting treats fastCRW results as live-search items.
  • Tests
    • Expanded tests cover fastCRW mapping, error cases, and result formatting.

vercel Bot commented Jun 13, 2026

@us is attempting to deploy a commit to the Rohan Verma's projects Team on Vercel.

A member of the Team first needs to authorize it.

coderabbitai Bot commented Jun 13, 2026

Walkthrough

This PR adds fastCRW (CRW_API) as a new live-search connector across the backend: a migration and enum update, validator entries, a new async search_crw connector method, agent/tool wiring (web search, research subagent, knowledge-base, labels), removal of URL deduplication in aggregated web results, and unit tests covering many edge cases.

Changes

CRW API Connector Feature

| Layer | File(s) | Summary |
|---|---|---|
| Schema and Validation Foundation | surfsense_backend/alembic/versions/160_add_crw_api_enum.py, surfsense_backend/app/db.py, surfsense_backend/app/utils/validators.py | PostgreSQL enum migration adds CRW_API; SearchSourceConnectorType gains CRW_API; validator rules allow CRW_API_KEY and CRW_BASE_URL with URL validation. |
| Core Connector Service Implementation | surfsense_backend/app/services/connector_service.py | New async search_crw reads connector config, builds /v1/search endpoint (optional Bearer auth), POSTs {query, limit} with 90s timeout, handles HTTP/JSON/envelope/data-shape failures, and maps successful items into sources and documents (markdown preferred). |
| Agent and Tool Integration | surfsense_backend/app/agents/chat/multi_agent_chat/main_agent/runtime/connector_searchable_types.py, surfsense_backend/app/agents/chat/shared/tools/web_search.py, surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/research/tools/web_search.py, surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/deliverables/tools/knowledge_base.py, surfsense_backend/app/services/ai_file_sort_service.py | Maps CRW_API into live-search connector sets and dispatcher specs (search_crw), labels it fastCRW, excludes it from KB selection (URL-based chunk IDs for CRW), and removes URL deduplication so all returned chunks are formatted. |
| Unit Tests | surfsense_backend/tests/unit/services/test_connector_service_crw.py | Test module with helpers validates search_crw across scenarios: missing connector, successful mapping and markdown preference, self-hosted base URL handling, API error envelopes, HTTP exceptions, malformed envelopes, non-list data, skipping invalid items, and preserving duplicate-URL entries. |

Sequence Diagram(s)

sequenceDiagram
  participant Agent
  participant ConnectorService
  participant DB
  participant HTTPClient
  participant fastCRW_API

  Agent->>ConnectorService: search_crw(query, search_space_id, top_k)
  ConnectorService->>DB: get_connector_by_type('CRW_API')
  ConnectorService->>HTTPClient: POST /v1/search {"query", "limit"} (optional Bearer)
  HTTPClient->>fastCRW_API: request
  fastCRW_API-->>HTTPClient: response JSON {success, data: [...]}
  HTTPClient-->>ConnectorService: response
  ConnectorService->>ConnectorService: validate envelope, map items -> sources/documents
  ConnectorService-->>Agent: (result_object, documents)
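
The request step in the diagram can be sketched as a small pure function. This is only an illustration under stated assumptions: `build_crw_request` and `DEFAULT_BASE_URL` are hypothetical names, and the placeholder default URL stands in for the managed-cloud endpoint, which is not given in this thread. The `/v1/search` path, the `{query, limit}` payload, the optional Bearer header, and the 90 s timeout come from the summary above.

```python
from typing import Any

# Placeholder only: the real managed-cloud default is not shown in this thread.
DEFAULT_BASE_URL = "https://crw.example.invalid"


def build_crw_request(config: dict[str, Any], query: str, top_k: int) -> dict[str, Any]:
    """Assemble URL, headers, payload and timeout for a /v1/search call."""
    # A self-hosted CRW_BASE_URL overrides the default; trailing slashes are
    # trimmed so the path is joined exactly once.
    base_url = (config.get("CRW_BASE_URL") or DEFAULT_BASE_URL).rstrip("/")

    headers = {"Content-Type": "application/json"}
    api_key = config.get("CRW_API_KEY")
    if api_key:
        # The API key is optional, so the Bearer header is added only when set.
        headers["Authorization"] = f"Bearer {api_key}"

    return {
        "url": f"{base_url}/v1/search",
        "headers": headers,
        "json": {"query": query, "limit": top_k},
        "timeout": 90.0,
    }


request = build_crw_request({"CRW_BASE_URL": "http://localhost:3000/"}, "rust web crawler", 5)
print(request["url"])
```

Keeping request assembly separate from the network call makes the endpoint, header, and payload checks testable without patching an HTTP client.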

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • MODSetter

Poem

🐰 A tiny rabbit hops and sings,
fastCRW brings live-search wings,
Enum, service, tools aligned,
Tests ensure no edge's unkind,
Hooray — search results now take wing!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 44.12% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding fastCRW as a new web-search connector to the SurfSense backend. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
surfsense_backend/tests/unit/services/test_connector_service_crw.py (1)

89-127: ⚡ Quick win

Add a duplicate-result preservation assertion to lock citation behavior.

Please extend this test (or add a sibling test) with two CRW results that share the same URL/title and assert both entries are preserved in result_object["sources"] and documents with distinct IDs/chunk IDs. This guards against accidental source dedup regressions in the connector mapping path.

As per coding guidelines, "Do not deduplicate sources when processing search results; preserve every chunk's unique source entry to maintain accurate citation tracking."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@surfsense_backend/tests/unit/services/test_connector_service_crw.py` around
lines 89 - 127, Update the test that calls svc.search_crw to include two CRW
result entries that share the same "url" and "title" (e.g., duplicate URL/title
pair) in the patched JSON response (the json_data passed to _patch_post), then
assert that both entries are preserved: check result_object["sources"] contains
two separate entries for that URL/title and documents contains two entries as
well, and verify their IDs/chunk identifiers differ (e.g., compare
documents[0]["document"]["id"] != documents[1]["document"]["id"] or
documents[0]["chunk_id"] != documents[1]["chunk_id"]); keep the existing
assertions about content/markdown and metadata and the captured request checks
intact so the test still verifies endpoint/headers/json payload.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@surfsense_backend/app/agents/chat/shared/tools/web_search.py`:
- Around line 25-33: The CRW connector enablement exposes a URL-dedup bug:
update the dedup behavior so merged search outputs preserve every chunk/source
entry (no URL-level collapse). In
surfsense_backend/app/agents/chat/shared/tools/web_search.py (lines 25-33) where
_LIVE_CONNECTOR_SPECS includes "CRW_API", modify _web_search_impl to stop
collapsing results by URL and instead return/passthrough merged chunk entries
unchanged (preserve per-chunk source metadata and citation fields); ensure any
URL-keyed aggregation logic is removed or bypassed. In
surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/research/tools/web_search.py
(lines 19-27) apply the same change: remove URL-level dedupe/aggregation and
keep each chunk/source entry intact so citation granularity is preserved (no
other changes required at that site).

In `@surfsense_backend/app/services/connector_service.py`:
- Around line 951-983: The code assumes `data` and each `result` in
`crw_results` are dicts before calling `.get(...)`; add explicit type guards so
unexpected shapes return the safe empty envelope instead of raising.
Specifically, in the branch that checks `data.get("success")` and when
extracting `crw_results = data.get("data", [])`, first verify `isinstance(data,
dict)` and if not, log/warn and return the empty fastCRW envelope (the same
object returned currently). Then ensure `crw_results` is a list (else return the
empty envelope), and inside the async with self.counter_lock loop verify
`isinstance(result, dict)` before using `result.get(...)`—if an item is not a
dict skip it. These checks should be applied around the existing variables
`data`, `crw_results`, and the loop that builds `sources_list`/`documents` (no
other changes needed).

In `@surfsense_backend/app/utils/validators.py`:
- Around line 516-524: Add an optional URL validator for the CRW_BASE_URL config
so malformed URLs fail at validation time: inside the CRW_API entry in
validators.py (the dict shown with key "CRW_API"), add "CRW_BASE_URL" to the
"validators" mapping and point it to the module's existing URL validation
function (e.g., the url validator utility in this file, such as validate_url or
UrlValidator), ensuring it runs only when CRW_BASE_URL is provided (since it's
listed under "optional"). This will enforce proper URL format for CRW_BASE_URL
during config validation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7a7b0856-496c-4437-8b79-993fcadd6488

📥 Commits

Reviewing files that changed from the base of the PR and between 3e53931 and 5375bf4.

📒 Files selected for processing (10)
  • surfsense_backend/alembic/versions/160_add_crw_api_enum.py
  • surfsense_backend/app/agents/chat/multi_agent_chat/main_agent/runtime/connector_searchable_types.py
  • surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/deliverables/tools/knowledge_base.py
  • surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/research/tools/web_search.py
  • surfsense_backend/app/agents/chat/shared/tools/web_search.py
  • surfsense_backend/app/db.py
  • surfsense_backend/app/services/ai_file_sort_service.py
  • surfsense_backend/app/services/connector_service.py
  • surfsense_backend/app/utils/validators.py
  • surfsense_backend/tests/unit/services/test_connector_service_crw.py

- web_search: stop URL-deduping merged results so each chunk keeps its own source entry for accurate citation tracking (shared + research tools)
- connector_service: guard fastCRW envelope/item types before .get() so unexpected JSON shapes degrade to empty results instead of raising
- validators: validate optional CRW_BASE_URL format at config time
- tests: cover non-dict envelope/data, non-dict item skipping, and duplicate-URL source preservation
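
The type guards described in the connector_service item can be sketched roughly as follows. `extract_crw_items` is a hypothetical helper, not the PR's code; it only assumes the `{success, data: [...]}` envelope shape mentioned in the review.

```python
from typing import Any


def extract_crw_items(data: Any) -> list[dict[str, Any]]:
    """Return the well-formed result items; any unexpected shape yields []."""
    # Guard the envelope itself: a JSON body may decode to a list, str or None.
    if not isinstance(data, dict) or not data.get("success"):
        return []

    results = data.get("data", [])
    # Guard the payload: only a list of results can be iterated safely.
    if not isinstance(results, list):
        return []

    # Skip malformed items instead of failing the whole search. Duplicates
    # are kept on purpose so each chunk retains its own source entry.
    return [item for item in results if isinstance(item, dict)]


payload = {"success": True, "data": [{"url": "https://a"}, "junk", {"url": "https://a"}]}
print(len(extract_crw_items(payload)))
```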

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@surfsense_backend/app/agents/chat/shared/tools/web_search.py`:
- Around line 222-225: Both sites must keep the no-dedupe merge but change how
formatter generates the <chunk id> to avoid collisions: in
surfsense_backend/app/agents/chat/shared/tools/web_search.py (lines 222-225)
update the formatter (around original Line 88) to use doc["chunk_id"] as the
primary id and, if missing or not unique, append a short deterministic fallback
suffix (e.g., an incremental counter or hash of the chunk content/URL) to ensure
uniqueness; in
surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/research/tools/web_search.py
(lines 216-219) make the same change to its formatter (around original Line 82)
so both formatters emit doc["chunk_id"] with the same fallback-unique suffix
strategy when URLs repeat. Ensure no other deduplication logic is reintroduced.

In `@surfsense_backend/app/utils/validators.py`:
- Around line 523-529: The validator for "CRW_BASE_URL" currently only checks
truthiness and may call validate_url_field on non-string truthy values; change
the lambda to first guard that config.get("CRW_BASE_URL") is a string (and
non-empty) before invoking validate_url_field so it never calls .strip() on
non-strings—i.e., in the validators dict update the "CRW_BASE_URL" entry to only
call validate_url_field("CRW_BASE_URL", "fastCRW") when
isinstance(config.get("CRW_BASE_URL"), str) and the string is non-empty,
otherwise return None.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: afd7fc83-b17c-4e2c-bbc1-0a331122a306

📥 Commits

Reviewing files that changed from the base of the PR and between 5375bf4 and 8e0a510.

📒 Files selected for processing (5)
  • surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/research/tools/web_search.py
  • surfsense_backend/app/agents/chat/shared/tools/web_search.py
  • surfsense_backend/app/services/connector_service.py
  • surfsense_backend/app/utils/validators.py
  • surfsense_backend/tests/unit/services/test_connector_service_crw.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • surfsense_backend/app/services/connector_service.py

Comment on lines +222 to +225
# Do not deduplicate by URL: each chunk is a distinct source entry that
# citation tracking relies on. Collapsing by URL would drop evidence
# returned by different live-search engines for the same page.
formatted = _format_web_results(all_documents)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

No-dedupe merge now exposes duplicate chunk-id collisions because chunk IDs are URL-derived. Both tools correctly preserve duplicate URL entries, but each formatter still uses URL for <chunk id>, which can produce duplicate IDs and ambiguous citations.

  • surfsense_backend/app/agents/chat/shared/tools/web_search.py#L222-L225: keep no-dedupe behavior, but update formatter chunk id generation (Line 88) to use doc["chunk_id"] with a fallback unique suffix.
  • surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/research/tools/web_search.py#L216-L219: apply the same formatter change (Line 82) so chunk IDs remain unique when URLs repeat.
📍 Affects 2 files
  • surfsense_backend/app/agents/chat/shared/tools/web_search.py#L222-L225 (this comment)
  • surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/research/tools/web_search.py#L216-L219
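
One possible shape for the fallback-suffix idea is sketched below. `assign_chunk_ids` is a hypothetical helper, and reading the ID from `doc["chunk_id"]` with a `url` fallback is an assumption about the document layout rather than the formatter's actual code.

```python
from typing import Any


def assign_chunk_ids(documents: list[dict[str, Any]]) -> list[str]:
    """Give every document a unique, deterministic chunk id without dropping any."""
    used: set[str] = set()
    ids: list[str] = []
    for doc in documents:
        # Prefer the connector-assigned chunk_id; fall back to the URL.
        base = str(doc.get("chunk_id") or doc.get("url") or "chunk")
        candidate, n = base, 1
        # On collision, append an incrementing suffix until the id is free.
        while candidate in used:
            n += 1
            candidate = f"{base}#{n}"
        used.add(candidate)
        ids.append(candidate)
    return ids


docs = [{"chunk_id": "https://a"}, {"chunk_id": "https://a"}, {"url": "https://b"}]
print(assign_chunk_ids(docs))
```

Because the suffix depends only on input order, the same result list always produces the same IDs, and no entry is merged or discarded.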

Comment on lines +523 to +529
            "validators": {
                "CRW_BASE_URL": lambda: (
                    validate_url_field("CRW_BASE_URL", "fastCRW")
                    if config.get("CRW_BASE_URL")
                    else None
                ),
            },

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard CRW_BASE_URL type before invoking URL validation.

Line 526 only checks truthiness. If CRW_BASE_URL is a truthy non-string, validate_url_field(...) calls .strip() on it and raises AttributeError instead of returning a clean validation error.

Suggested patch
-            "validators": {
-                "CRW_BASE_URL": lambda: (
-                    validate_url_field("CRW_BASE_URL", "fastCRW")
-                    if config.get("CRW_BASE_URL")
-                    else None
-                ),
-            },
+            "validators": {
+                "CRW_BASE_URL": lambda: (
+                    validate_url_field("CRW_BASE_URL", "fastCRW")
+                    if isinstance(config.get("CRW_BASE_URL"), str)
+                    and config.get("CRW_BASE_URL").strip()
+                    else (
+                        (_ for _ in ()).throw(
+                            ValueError(
+                                "Invalid base URL format for fastCRW connector"
+                            )
+                        )
+                        if config.get("CRW_BASE_URL") is not None
+                        else None
+                    )
+                ),
+            },
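
The same guard may be easier to read as a named function than as a lambda that raises through a generator's `throw`. The sketch below is a hypothetical alternative, not the suggested patch itself: `validate_optional_base_url` is an invented name, and the `urlparse` check is a stand-in for `validate_url_field`, whose internals are not shown here.

```python
from typing import Any, Optional
from urllib.parse import urlparse


def validate_optional_base_url(value: Any, connector_name: str) -> Optional[str]:
    """Validate an optional base URL; None means the option was not provided."""
    if value is None:
        return None

    error = ValueError(f"Invalid base URL format for {connector_name} connector")
    # Type guard first, so .strip() is never called on a non-string value.
    if not isinstance(value, str) or not value.strip():
        raise error

    cleaned = value.strip()
    parsed = urlparse(cleaned)
    # Stand-in check: require an http(s) scheme and a host.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise error
    return cleaned


print(validate_optional_base_url(" http://localhost:3000 ", "fastCRW"))
```

As in the suggested patch, a missing value passes, while an empty string or a non-string value fails with a validation error instead of an AttributeError.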
