feat: add fastCRW search connector #1490
📝 Walkthrough
This PR adds fastCRW (CRW_API) as a new live-search connector across the backend: a migration and enum update, validator entries, a new async …
Changes
CRW API Connector Feature
Sequence Diagram(s)
sequenceDiagram
participant Agent
participant ConnectorService
participant DB
participant HTTPClient
participant fastCRW_API
Agent->>ConnectorService: search_crw(query, search_space_id, top_k)
ConnectorService->>DB: get_connector_by_type('CRW_API')
ConnectorService->>HTTPClient: POST /v1/search {"query", "limit"} (optional Bearer)
HTTPClient->>fastCRW_API: request
fastCRW_API-->>HTTPClient: response JSON {success, data: [...]}
HTTPClient-->>ConnectorService: response
ConnectorService->>ConnectorService: validate envelope, map items -> sources/documents
ConnectorService-->>Agent: (result_object, documents)
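The request step in the diagram above can be sketched in a few lines. This is only an illustration, not the connector's actual code: the helper name `build_crw_search_request` and the `http://localhost:3000` base URL are hypothetical placeholders, while the `/v1/search` path, the `query` and `limit` fields, and the optional Bearer header come from the diagram.

```python
from typing import Any, Optional


def build_crw_search_request(
    base_url: str, query: str, top_k: int, api_key: Optional[str] = None
) -> dict[str, Any]:
    """Assemble URL, headers, and JSON body for POST /v1/search (hypothetical helper)."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # The Bearer header is optional, e.g. a self-hosted instance may run without auth.
        headers["Authorization"] = f"Bearer {api_key}"
    return {
        # Normalise a trailing slash so the path is not doubled.
        "url": base_url.rstrip("/") + "/v1/search",
        "headers": headers,
        "json": {"query": query, "limit": top_k},
    }


# Placeholder base URL; a real deployment would read it from the connector config.
request = build_crw_search_request("http://localhost:3000/", "rust web crawler", 5)
print(request["url"])
```

Keeping request assembly in a pure function like this makes the endpoint, headers, and payload easy to assert on in unit tests without patching an HTTP client.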
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 3
🧹 Nitpick comments (1)
surfsense_backend/tests/unit/services/test_connector_service_crw.py (1)
89-127: ⚡ Quick win: Add a duplicate-result preservation assertion to lock citation behavior.
Please extend this test (or add a sibling test) with two CRW results that share the same URL/title and assert both entries are preserved in `result_object["sources"]` and `documents` with distinct IDs/chunk IDs. This guards against accidental source dedup regressions in the connector mapping path.
As per coding guidelines, "Do not deduplicate sources when processing search results; preserve every chunk's unique source entry to maintain accurate citation tracking."
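A test of this shape might look like the sketch below. It does not call the real `svc.search_crw`; `map_crw_results` is a hypothetical stand-in for the connector's mapping step, and the field names (`url`, `title`, `markdown`, `chunk_id`) are assumptions taken from the review text rather than from the code.

```python
from itertools import count


def map_crw_results(results, id_counter):
    """Map raw result dicts to (sources, documents) without deduplicating (stand-in)."""
    sources, documents = [], []
    for result in results:
        source_id = next(id_counter)  # a fresh ID per entry, even when URLs repeat
        url = result.get("url", "")
        title = result.get("title", "")
        sources.append({"id": source_id, "title": title, "url": url})
        documents.append(
            {
                "chunk_id": source_id,
                "content": result.get("markdown", ""),
                "document": {"id": source_id, "title": title},
            }
        )
    return sources, documents


def test_duplicate_url_results_are_preserved():
    duplicate = {"url": "https://example.com/a", "title": "A", "markdown": "body"}
    sources, documents = map_crw_results([duplicate, dict(duplicate)], count(1))
    # Both entries survive, and their identifiers differ.
    assert len(sources) == 2 and len(documents) == 2
    assert documents[0]["chunk_id"] != documents[1]["chunk_id"]
    assert documents[0]["document"]["id"] != documents[1]["document"]["id"]


test_duplicate_url_results_are_preserved()
```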
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@surfsense_backend/app/agents/chat/shared/tools/web_search.py`:
- Around line 25-33: The CRW connector enablement exposes a URL-dedup bug:
update the dedup behavior so merged search outputs preserve every chunk/source
entry (no URL-level collapse). In
surfsense_backend/app/agents/chat/shared/tools/web_search.py (lines 25-33) where
_LIVE_CONNECTOR_SPECS includes "CRW_API", modify _web_search_impl to stop
collapsing results by URL and instead return/passthrough merged chunk entries
unchanged (preserve per-chunk source metadata and citation fields); ensure any
URL-keyed aggregation logic is removed or bypassed. In
surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/research/tools/web_search.py
(lines 19-27) apply the same change: remove URL-level dedupe/aggregation and
keep each chunk/source entry intact so citation granularity is preserved (no
other changes required at that site).
In `@surfsense_backend/app/services/connector_service.py`:
- Around line 951-983: The code assumes `data` and each `result` in
`crw_results` are dicts before calling `.get(...)`; add explicit type guards so
unexpected shapes return the safe empty envelope instead of raising.
Specifically, in the branch that checks `data.get("success")` and when
extracting `crw_results = data.get("data", [])`, first verify `isinstance(data,
dict)` and if not, log/warn and return the empty fastCRW envelope (the same
object returned currently). Then ensure `crw_results` is a list (else return the
empty envelope), and inside the async with self.counter_lock loop verify
`isinstance(result, dict)` before using `result.get(...)`—if an item is not a
dict skip it. These checks should be applied around the existing variables
`data`, `crw_results`, and the loop that builds `sources_list`/`documents` (no
other changes needed).
In `@surfsense_backend/app/utils/validators.py`:
- Around line 516-524: Add an optional URL validator for the CRW_BASE_URL config
so malformed URLs fail at validation time: inside the CRW_API entry in
validators.py (the dict shown with key "CRW_API"), add "CRW_BASE_URL" to the
"validators" mapping and point it to the module's existing URL validation
function (e.g., the url validator utility in this file, such as validate_url or
UrlValidator), ensuring it runs only when CRW_BASE_URL is provided (since it's
listed under "optional"). This will enforce proper URL format for CRW_BASE_URL
during config validation.
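The type guards requested for `connector_service.py` in the inline comments above could be sketched roughly as follows. The function name `extract_crw_items` is hypothetical, and returning an empty list is a simplification: in the connector, the caller would turn that into the empty fastCRW envelope it already returns.

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)


def extract_crw_items(data: Any) -> list[dict]:
    """Return only well-formed result dicts; degrade to [] on unexpected shapes."""
    if not isinstance(data, dict):
        logger.warning("fastCRW response is not a JSON object: %r", type(data))
        return []
    if not data.get("success"):
        return []
    items = data.get("data", [])
    if not isinstance(items, list):
        logger.warning("fastCRW 'data' field is not a list: %r", type(items))
        return []
    # Skip individual items that are not dicts instead of failing the whole search.
    return [item for item in items if isinstance(item, dict)]


print(extract_crw_items({"success": True, "data": [{"url": "https://example.com"}, "junk"]}))
```

Filtering before the mapping loop means the loop body can keep calling `result.get(...)` without repeating the checks.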
---
Nitpick comments:
In `@surfsense_backend/tests/unit/services/test_connector_service_crw.py`:
- Around line 89-127: Update the test that calls svc.search_crw to include two
CRW result entries that share the same "url" and "title" (e.g., duplicate
URL/title pair) in the patched JSON response (the json_data passed to
_patch_post), then assert that both entries are preserved: check
result_object["sources"] contains two separate entries for that URL/title and
documents contains two entries as well, and verify their IDs/chunk identifiers
differ (e.g., compare documents[0]["document"]["id"] !=
documents[1]["document"]["id"] or documents[0]["chunk_id"] !=
documents[1]["chunk_id"]); keep the existing assertions about content/markdown
and metadata and the captured request checks intact so the test still verifies
endpoint/headers/json payload.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7a7b0856-496c-4437-8b79-993fcadd6488
📒 Files selected for processing (10)
- surfsense_backend/alembic/versions/160_add_crw_api_enum.py
- surfsense_backend/app/agents/chat/multi_agent_chat/main_agent/runtime/connector_searchable_types.py
- surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/deliverables/tools/knowledge_base.py
- surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/research/tools/web_search.py
- surfsense_backend/app/agents/chat/shared/tools/web_search.py
- surfsense_backend/app/db.py
- surfsense_backend/app/services/ai_file_sort_service.py
- surfsense_backend/app/services/connector_service.py
- surfsense_backend/app/utils/validators.py
- surfsense_backend/tests/unit/services/test_connector_service_crw.py
- web_search: stop URL-deduping merged results so each chunk keeps its own source entry for accurate citation tracking (shared + research tools)
- connector_service: guard fastCRW envelope/item types before .get() so unexpected JSON shapes degrade to empty results instead of raising
- validators: validate optional CRW_BASE_URL format at config time
- tests: cover non-dict envelope/data, non-dict item skipping, and duplicate-URL source preservation
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@surfsense_backend/app/agents/chat/shared/tools/web_search.py`:
- Around line 222-225: Both sites must keep the no-dedupe merge but change how
formatter generates the <chunk id> to avoid collisions: in
surfsense_backend/app/agents/chat/shared/tools/web_search.py (lines 222-225)
update the formatter (around original Line 88) to use doc["chunk_id"] as the
primary id and, if missing or not unique, append a short deterministic fallback
suffix (e.g., an incremental counter or hash of the chunk content/URL) to ensure
uniqueness; in
surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/research/tools/web_search.py
(lines 216-219) make the same change to its formatter (around original Line 82)
so both formatters emit doc["chunk_id"] with the same fallback-unique suffix
strategy when URLs repeat. Ensure no other deduplication logic is reintroduced.
In `@surfsense_backend/app/utils/validators.py`:
- Around line 523-529: The validator for "CRW_BASE_URL" currently only checks
truthiness and may call validate_url_field on non-string truthy values; change
the lambda to first guard that config.get("CRW_BASE_URL") is a string (and
non-empty) before invoking validate_url_field so it never calls .strip() on
non-strings—i.e., in the validators dict update the "CRW_BASE_URL" entry to only
call validate_url_field("CRW_BASE_URL", "fastCRW") when
isinstance(config.get("CRW_BASE_URL"), str) and the string is non-empty,
otherwise return None.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: afd7fc83-b17c-4e2c-bbc1-0a331122a306
📒 Files selected for processing (5)
- surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/research/tools/web_search.py
- surfsense_backend/app/agents/chat/shared/tools/web_search.py
- surfsense_backend/app/services/connector_service.py
- surfsense_backend/app/utils/validators.py
- surfsense_backend/tests/unit/services/test_connector_service_crw.py
🚧 Files skipped from review as they are similar to previous changes (1)
- surfsense_backend/app/services/connector_service.py
    # Do not deduplicate by URL: each chunk is a distinct source entry that
    # citation tracking relies on. Collapsing by URL would drop evidence
    # returned by different live-search engines for the same page.
    formatted = _format_web_results(all_documents)
No-dedupe merge now exposes duplicate chunk-id collisions because chunk IDs are URL-derived. Both tools correctly preserve duplicate URL entries, but each formatter still uses URL for <chunk id>, which can produce duplicate IDs and ambiguous citations.
- surfsense_backend/app/agents/chat/shared/tools/web_search.py#L222-L225: keep no-dedupe behavior, but update formatter chunk id generation (Line 88) to use `doc["chunk_id"]` with a fallback unique suffix.
- surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/research/tools/web_search.py#L216-L219: apply the same formatter change (Line 82) so chunk IDs remain unique when URLs repeat.
📍 Affects 2 files
- surfsense_backend/app/agents/chat/shared/tools/web_search.py#L222-L225 (this comment)
- surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/research/tools/web_search.py#L216-L219
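One way to read the "fallback unique suffix" suggestion is sketched below. This is an assumption about the intent, not the formatter's actual code: `unique_chunk_ids` is a hypothetical helper, the `#N` suffix format is arbitrary, and falling back to `url` when `chunk_id` is missing is a guess at the current behaviour.

```python
def unique_chunk_ids(documents):
    """Return one collision-free ID per document, in order, without dropping any."""
    used = set()
    ids = []
    for doc in documents:
        raw = doc.get("chunk_id")
        # Prefer chunk_id; fall back to the URL (assumed current behaviour), then a constant.
        base = str(raw) if raw not in (None, "") else str(doc.get("url") or "chunk")
        candidate, suffix = base, 1
        while candidate in used:
            # Deterministic counter suffix: the second repeat becomes "<base>#2", and so on.
            suffix += 1
            candidate = f"{base}#{suffix}"
        used.add(candidate)
        ids.append(candidate)
    return ids


docs = [
    {"chunk_id": 7, "url": "https://example.com/a"},
    {"url": "https://example.com/a"},
    {"url": "https://example.com/a"},
]
print(unique_chunk_ids(docs))
```

A counter keeps the IDs stable for a given input order, which a content hash would also do; the counter is simply shorter to read in citations.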
    "validators": {
        "CRW_BASE_URL": lambda: (
            validate_url_field("CRW_BASE_URL", "fastCRW")
            if config.get("CRW_BASE_URL")
            else None
        ),
    },
Guard CRW_BASE_URL type before invoking URL validation.
Line 526 only checks truthiness. If CRW_BASE_URL is a truthy non-string, validate_url_field(...) calls .strip() on it and raises AttributeError instead of returning a clean validation error.
Suggested patch
- "validators": {
- "CRW_BASE_URL": lambda: (
- validate_url_field("CRW_BASE_URL", "fastCRW")
- if config.get("CRW_BASE_URL")
- else None
- ),
- },
+ "validators": {
+ "CRW_BASE_URL": lambda: (
+ validate_url_field("CRW_BASE_URL", "fastCRW")
+ if isinstance(config.get("CRW_BASE_URL"), str)
+ and config.get("CRW_BASE_URL").strip()
+ else (
+ (_ for _ in ()).throw(
+ ValueError(
+ "Invalid base URL format for fastCRW connector"
+ )
+ )
+ if config.get("CRW_BASE_URL") is not None
+ else None
+ )
+ ),
+ },
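The same guard can be written without the generator-throw expression if it lives in a small named function. The sketch below is a standalone illustration: `check_optional_base_url` is hypothetical, and it uses `urllib.parse` as a stand-in because the behaviour of the project's `validate_url_field` is not shown here. Like the suggested patch, it rejects values that are present but are not non-empty strings.

```python
from urllib.parse import urlparse


def check_optional_base_url(config: dict, key: str = "CRW_BASE_URL") -> None:
    """Validate an optional base URL; raise ValueError with a clean message if invalid."""
    value = config.get(key)
    if value is None:
        return  # the field is optional, so absence is fine
    if not isinstance(value, str) or not value.strip():
        # Guard the type first so .strip() is never called on a non-string.
        raise ValueError(f"{key} must be a non-empty string URL")
    parsed = urlparse(value.strip())
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"{key} must be an http(s) URL")


check_optional_base_url({})  # no URL configured: passes silently
check_optional_base_url({"CRW_BASE_URL": "http://localhost:3000"})
```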
What
Adds fastCRW as a web-search connector, alongside the existing ones (SearXNG, Tavily, …).
Why
fastCRW is a fully open-source (AGPL) web-scraping and search engine — a single ~8 MB Rust binary, ~6 MB RAM at idle — that ships the complete stack in its open core, no cloud dependency required.
100% local / open core: stealth, proxies, JS rendering all included
Self-hosted Firecrawl's OSS cannot reach Cloudflare-protected or JS-heavy sites in practice: its real anti-bot / stealth path (`fire-engine`) lives behind a cloud-only flag, so the self-hosted build falls back to plain fetch. fastCRW ships Cloudflare JS-challenge handling, UA rotation, SPA rendering, and BYO-proxy + rotation unconditionally: no flags, no cloud account, no asterisks. The combination of SurfSense + fastCRW gives users an actually-complete, fully self-hosted stack.
Faster + higher quality (measured on Firecrawl's own benchmark dataset)
On Firecrawl's public benchmark dataset, fastCRW scores 63.74% truth-recall vs 56.04% for Firecrawl, with faster median latency (p50 ~1.9 s vs ~2.3 s). Single binary deployment means no Redis, no workers, no Playwright sidecar.
Search: built on SearXNG, with a quality layer on top
crw is not an alternative to SearXNG; it is built on top of it. SearXNG is the metasearch aggregator underneath; crw adds a quality layer: query expansion (multi-variant rewrite), content-aware reranking (re-scoring by fetched content rather than SearXNG's content-blind ordering), a calibrated direct-answer mode, and category routing (research queries fan out to arxiv / Semantic Scholar / Google Scholar, code queries to GitHub). The `/v1/search` endpoint also uses multi-round retrieval for deeper research flows. The result is SearXNG's breadth plus a measurable accuracy layer, all open-source (AGPL) and fully self-hostable with configurable engines.
Firecrawl-API compatibility: why the diff is tiny
fastCRW implements the Firecrawl REST API, which is why this connector is a small additive diff that mirrors the existing pattern exactly.
Changes (additive only)
- `app/services/connector_service.py`: `crw` search provider mirroring the existing connector.
- `160_add_crw_api_enum` + `connector_searchable_types.py`, `db.py`, `validators.py`: registers the `CRW_API` connector enum.
- `web_search` tools.
- `tests/unit/services/test_connector_service_crw.py`.
- `CRW_API_KEY` from https://fastcrw.com/dashboard (free tier); self-host base URL supported.
Happy to adjust; I maintain it and can provide free credits.
High-level PR Summary
This PR adds fastCRW as a new web-search connector option for SurfSense. The implementation follows the existing pattern used by other search providers (Tavily, Linkup, Baidu) by registering a new `CRW_API` connector type via database migration, implementing the `search_crw` method in the connector service that calls fastCRW's `/v1/search` endpoint, and wiring the new connector into the research and chat agent web search tools. The connector supports both managed cloud and self-hosted deployments with optional API key authentication, and includes comprehensive unit tests covering success cases, error handling, and configuration variants.
⏱️ Estimated Review Time: 15-30 minutes
💡 Review Order Suggestion
1. surfsense_backend/app/db.py
2. surfsense_backend/alembic/versions/160_add_crw_api_enum.py
3. surfsense_backend/app/utils/validators.py
4. surfsense_backend/app/services/connector_service.py
5. surfsense_backend/app/agents/chat/multi_agent_chat/main_agent/runtime/connector_searchable_types.py
6. surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/deliverables/tools/knowledge_base.py
7. surfsense_backend/app/agents/chat/multi_agent_chat/subagents/builtins/research/tools/web_search.py
8. surfsense_backend/app/agents/chat/shared/tools/web_search.py
9. surfsense_backend/app/services/ai_file_sort_service.py
10. surfsense_backend/tests/unit/services/test_connector_service_crw.py
Summary by CodeRabbit