feat: add recursive web crawler MVP with SPA support and fine-grained limits for add-resource by sponge225 · Pull Request #2424 · volcengine/OpenViking

sponge225 · 2026-06-03T12:16:03Z

Description

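This PR implements a recursive web crawling MVP so that add_resource / add-resource can start from a single entry page and go on to discover and import its child pages. The feature mainly targets documentation sites, knowledge-base sites, and single-page application (SPA/CSR) pages. It renders dynamic pages through Playwright and provides basic crawl-scope control, URL deduplication, redirect deduplication, network target validation, and resource organization.

The goal of this change is not a complete general-purpose crawler but a safe, controllable, reviewable MVP: defaults are kept conservative to avoid unbounded crawling, duplicate fetches, a disorganized resource tree, or excessive load on the target site.

Core capabilities overview

Recursively discovers and imports child pages starting from an entry URL.
Renders SPA/CSR pages, for cases where static HTML alone does not yield usable body content.
Controls crawl behavior through depth, max_pages, include_paths, exclude_paths, allow_external_links, and use_playwright.
depth=-1 means unlimited recursion depth, still bounded by max_pages, URL deduplication, and filter rules.
Deduplicates on the post-redirect final_url, so an entry URL or child page that is redirected with 301/302 is not imported twice.
Validates network targets during recursive fetching, so child links cannot bypass the public-target validation applied to the entry URL.
Writes child pages alongside the entry page under the same parent directory, so crawl results are not nested inside the entry page's directory.
By default imports URLs under a parent directory named after the domain, so multiple sites are not laid out flat under the viking://resources root.
Adds a unified --args option to the CLI for passing parser-specific / import-specific parameters through to add_resource.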

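Default resource organization strategy

When a user imports an HTTP(S) URL without explicitly specifying to / parent, the system uses the URL's domain as the parent directory and creates that directory automatically.

For example: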

ov add-resource "https://www.volcengine.com/docs/84313" \
  --args='depth:1,max_pages:30'

By default this writes to:

viking://resources/www.volcengine.com/
  84313
  1860732
  ...

This keeps different sites or multiple URLs from being mixed directly under:

viking://resources/

If the user explicitly specifies --parent, --parent-auto-create, or to, the system respects that location and does not override it with the domain directory.
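The default-parent rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `default_parent_uri` and `RESOURCES_ROOT` are hypothetical names, not identifiers from this PR, and the sketch assumes the directory name is the URL hostname without the port.

```python
from typing import Optional
from urllib.parse import urlsplit

# Root URI taken from the examples in this description.
RESOURCES_ROOT = "viking://resources"


def default_parent_uri(url: str, explicit_parent: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: choose the parent directory for an import."""
    if explicit_parent:
        # An explicitly requested location always wins over the domain default.
        return explicit_parent
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        # Not an HTTP(S) URL, so the domain-based default does not apply.
        return None
    # Assumption: the directory is named after the hostname, port excluded.
    return f"{RESOURCES_ROOT}/{parts.hostname}/"
```

With this sketch, `https://www.volcengine.com/docs/84313` maps to `viking://resources/www.volcengine.com/`, while a local file path yields no default.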

`add-resource` parameters

All CLI options for recursive web crawling are currently passed through --args:

--args='depth:1,max_pages:30,use_playwright:true'

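The main supported parameters are:

depth (int): crawl depth. 0 imports only the current page; 1 imports the current page and its direct child links; and so on. The special value -1 removes the depth limit, and crawling continues until max_pages is reached or no crawlable pages remain.
max_pages (int): global page limit for this recursive crawl, default 100. It caps how many page requests the whole crawl task may schedule, preventing unbounded recursion or large sites from exhausting resources.
include_paths (str): path allowlist with glob support, for example /docs/84313/*. When set, only pages whose path matches the allowlist are crawled.
exclude_paths (str): path denylist with glob support, for example *.pdf or */login*. Pages matching the denylist are skipped.
allow_external_links (bool): whether links to external domains may be crawled. Default False, meaning only pages on the same domain as the entry URL are crawled, which keeps the crawler from spreading to unrelated sites.
use_playwright (bool): whether to render pages with the Playwright headless browser. Enabled by default to support SPA/CSR pages; for purely static HTML sites, set it to False for faster crawling.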
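To make the interaction of include_paths, exclude_paths, and allow_external_links concrete, here is a minimal sketch using the standard-library `fnmatch` module. The function names are hypothetical, and two behaviors are assumptions rather than facts from this PR: that "same domain" means an exact hostname match, and that the denylist is checked before the allowlist.

```python
from fnmatch import fnmatchcase
from typing import List, Optional
from urllib.parse import urlsplit


def split_patterns(raw: Optional[str]) -> List[str]:
    """Split a comma-separated pattern string into individual glob patterns."""
    return [p.strip() for p in (raw or "").split(",") if p.strip()]


def should_crawl(
    url: str,
    root_url: str,
    include_paths: Optional[str] = None,
    exclude_paths: Optional[str] = None,
    allow_external_links: bool = False,
) -> bool:
    """Hypothetical filter: decide whether a discovered link may be crawled."""
    target, root = urlsplit(url), urlsplit(root_url)
    # Assumption: "same domain" is an exact hostname comparison.
    if not allow_external_links and target.hostname != root.hostname:
        return False
    path = target.path or "/"
    # Assumption: a denylist match rejects the page before the allowlist is consulted.
    if any(fnmatchcase(path, p) for p in split_patterns(exclude_paths)):
        return False
    includes = split_patterns(include_paths)
    if includes and not any(fnmatchcase(path, p) for p in includes):
        return False
    return True
```

Note that in `fnmatch` semantics `*` also matches `/`, so a pattern such as `*.pdf` matches a PDF at any directory depth.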

CLI `--args` design notes

The CLI no longer provides dedicated top-level web crawling flags such as --depth, --max-pages, --include-paths, --exclude-paths, or --allow-external-links. They are replaced by a single option:

--args <key:value,...>

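Rationale:

add_resource will gradually support more parser-specific / import-specific parameters; exposing each one as a top-level CLI flag would bloat the CLI.
--args lets the CLI pass extension parameters through to the backend in a generic way, so adding parser parameters later does not require frequent changes to the top-level CLI interface.
The top-level CLI stays concise: the core add-resource options remain focused on import location, upload, waiting, and general processing.

--args parses basic types:

true / false are parsed as booleans.
null is parsed as null.
Integers and floating-point numbers are parsed as numbers.
Any other value is kept as a string.
Simple JSON object/array values are parsed as JSON when their quotes and brackets are balanced.

--args separator rules:

Multiple top-level parameters are separated by commas, for example: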

--args='depth:-1,max_pages:25,use_playwright:true'

If a value itself contains commas, wrap that value in double quotes, for example:

--args='include_paths:"/docs/84313/*,/docs/84314/*",exclude_paths:"*.pdf,*/login*"'

The CLI parses the example above as:
- include_paths = "/docs/84313/*,/docs/84314/*"
- exclude_paths = "*.pdf,*/login*"
The backend then splits include_paths / exclude_paths on commas into individual glob patterns.
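The parsing rules above can be modeled in Python as follows. This is an illustrative sketch only: the real parser lives in the Rust CLI (crates/ov_cli) and may differ in detail, and `split_top_level`, `coerce`, and `parse_args` are hypothetical names. The sketch assumes a key is separated from its value by the first colon.

```python
import json
import re
from typing import Any, Dict, List

_INT = re.compile(r"-?\d+")
_FLOAT = re.compile(r"-?\d+\.\d+")


def split_top_level(raw: str) -> List[str]:
    """Split on commas that sit outside double quotes and outside brackets."""
    parts, buf, depth, in_quotes = [], [], 0, False
    for ch in raw:
        if ch == '"':
            in_quotes = not in_quotes
        elif not in_quotes and ch in "[{":
            depth += 1
        elif not in_quotes and ch in "]}":
            depth -= 1
        if ch == "," and not in_quotes and depth == 0:
            parts.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
    if in_quotes or depth != 0:
        raise ValueError("unbalanced quotes or brackets in --args")
    parts.append("".join(buf))
    return [p for p in parts if p.strip()]


def coerce(value: str) -> Any:
    """Map a raw value to bool, None, number, JSON, or string."""
    text = value.strip()
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        return text[1:-1]  # a quoted value is always a string
    if text in ("true", "false"):
        return text == "true"
    if text == "null":
        return None
    if _INT.fullmatch(text):
        return int(text)
    if _FLOAT.fullmatch(text):
        return float(text)
    if text[:1] in ("[", "{"):
        return json.loads(text)
    return text


def parse_args(raw: str) -> Dict[str, Any]:
    """Parse 'key:value,key:value' into a dict of typed values."""
    result: Dict[str, Any] = {}
    for item in split_top_level(raw):
        key, sep, value = item.partition(":")  # assumption: first colon separates key and value
        if not sep or not key.strip():
            raise ValueError(f"expected key:value, got {item!r}")
        result[key.strip()] = coerce(value)
    return result
```

Under these assumptions, `parse_args('depth:-1,max_pages:25,use_playwright:true')` yields an integer, an integer, and a boolean, and a double-quoted include_paths value stays a single comma-containing string for the backend to split.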

Usage Examples

CLI examples

# Import only the current page
# Without parent/to, the import is written to viking://resources/<domain>/ by default
ov add-resource "https://www.volcengine.com/docs/84313/1860732"

# Crawl the current page and its direct child pages
ov add-resource "https://www.volcengine.com/docs/84313" \
  --args='depth:1,max_pages:30,use_playwright:true'

# No depth limit, but schedule at most 25 pages
ov add-resource "https://www.volcengine.com/docs/84313" \
  --args='depth:-1,max_pages:25,include_paths:/docs/84313/*,allow_external_links:false,use_playwright:true'

# Restrict to several directories and exclude several noisy paths
ov add-resource "https://www.volcengine.com/docs/84313" \
  --args='depth:-1,max_pages:25,include_paths:"/docs/84313/*,/docs/84314/*",exclude_paths:"*.pdf,*/login*",allow_external_links:false,use_playwright:true'

# To customize the import directory, specify --parent-auto-create explicitly
ov add-resource "https://www.volcengine.com/docs/84313" \
  --parent-auto-create "viking://resources/volcengine-docs" \
  --args='depth:1,max_pages:30,use_playwright:true'

# If the parent directory already exists, --parent also works
ov add-resource "https://www.volcengine.com/docs/84313" \
  --parent "viking://resources/volcengine-docs" \
  --args='depth:1,max_pages:30,use_playwright:true'

# Allow following external links across domains; not recommended by default
# Even with cross-domain enabled, public network target validation still runs during recursive fetching
ov add-resource "https://example.com/docs" \
  --args='depth:2,max_pages:20,allow_external_links:true,use_playwright:true'
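The last example relies on public network target validation. The PR does not spell out what that check contains, so the following is only a guess at its general shape: a hypothetical `is_public_target` helper that rejects non-HTTP(S) schemes and any hostname resolving to a non-global address (loopback, private ranges, link-local). The injectable `resolve` parameter is an assumption added to keep the sketch testable without network access.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_target(url: str, resolve=socket.getaddrinfo) -> bool:
    """Hypothetical check: True only if every resolved address is globally routable."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    default_port = 443 if parts.scheme == "https" else 80
    try:
        infos = resolve(parts.hostname, parts.port or default_port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Unresolvable hosts are treated as not allowed.
        return False
    # Each getaddrinfo entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the textual IP address.
    addresses = {ipaddress.ip_address(info[4][0]) for info in infos}
    return bool(addresses) and all(addr.is_global for addr in addresses)
```

A check performed before connecting can be outrun by a DNS change between the check and the request, which is one reason to validate at request time as well, as the httpx hooks and Playwright route interception described in this PR do.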

MCP / JSON parameter examples

MCP / API calls can pass structured parameters directly:

{
  "path": "https://www.volcengine.com/docs/84313",
  "depth": -1,
  "max_pages": 25,
  "include_paths": "/docs/84313/*,/docs/84314/*",
  "exclude_paths": "*.pdf,*/login*",
  "allow_external_links": false,
  "use_playwright": true
}

To customize the parent directory, pass it explicitly:

{
  "path": "https://www.volcengine.com/docs/84313",
  "parent": "viking://resources/volcengine-docs",
  "depth": 1,
  "max_pages": 30,
  "use_playwright": true
}

Resource structure examples

Without an explicit parent directory (default):

viking://resources/www.volcengine.com/
  84313
  1860732
  ...

With an explicit parent directory:

viking://resources/volcengine-docs/
  84313
  1860732
  ...

Different sites go to different domain directories by default:

viking://resources/
  www.volcengine.com/
  docs.openviking.ai/
  example.com/

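Core design notes

Recursive scheduling: pages are traversed with an asynchronous BFS queue, and concurrency control limits the number of fetch tasks running at once.
Dynamic page support: Playwright provides the JS-rendered HTML, improving compatibility with SPA/CSR documentation sites.
Global request limit: _pages_scheduled records the number of pages already scheduled, and max_pages is checked before each request is issued, which prevents over-scheduling under high concurrency.
Unbounded recursion protection: depth=-1 only removes the level limit; it does not mean unlimited requests. The crawl is still bounded by max_pages, URL deduplication, and filter rules.
URL deduplication: a global visited set deduplicates normalized URLs, so self-references, mutual references, and repeated links do not cause crawl loops.
Redirect deduplication: after a fetch, the final redirected URL is used as the canonical URL, so an entry URL or child page that 301/302-redirects to the real page is not imported twice.
Network target validation: request_validator continues to be passed through during recursive fetching. HTTP fetches validate requests with httpx hooks; Playwright fetches validate page-load requests through route interception and validate the final URL again, so child links cannot bypass the public-target restriction.
Path filtering: include_paths, exclude_paths, and allow_external_links define the crawl boundary and keep unrelated pages from being imported.
Parallel resource writes: recursively discovered child URLs are no longer written inside the root URL's resource directory but under the root URL's parent directory. url1 and url2 are therefore sibling resources, matching the usage model in which every URL may have its own document structure.
Default parent directory: when an HTTP(S) URL is imported without to / parent, the system uses the URL's domain as the parent directory and creates it automatically, so multiple sites are not laid out flat under the viking://resources root.
CLI parameter passthrough: the new generic --args option parses parser-specific import options into JSON fields and merges them into the add_resource request body, reducing future growth of top-level CLI flags.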
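The scheduling rules in this list (BFS traversal, a concurrency cap, a scheduled-page counter checked before requests go out, depth=-1, and a global visited set) can be combined in a compact sketch. It is a simplified level-by-level model, not the code in openviking/utils/web_crawler.py: `crawl` and `fetch_links` are hypothetical names, the local `scheduled` counter stands in for _pages_scheduled, and filtering, redirects, and error handling are left out.

```python
import asyncio
from typing import Awaitable, Callable, List


async def crawl(
    root: str,
    fetch_links: Callable[[str], Awaitable[List[str]]],
    depth: int = 0,
    max_pages: int = 100,
    concurrency: int = 5,
) -> List[str]:
    """Hypothetical BFS crawl: returns the URLs that were scheduled, in order."""
    visited = {root}          # global dedup set; blocks self and mutual references
    scheduled = 0             # pages scheduled so far (stand-in for _pages_scheduled)
    fetched: List[str] = []
    sem = asyncio.Semaphore(concurrency)
    frontier = [root]
    level = 0

    async def visit(url: str) -> List[str]:
        async with sem:       # cap the number of simultaneous fetches
            return await fetch_links(url)

    # depth == -1 removes the level limit; max_pages still bounds the crawl.
    while frontier and (depth == -1 or level <= depth):
        # Reserve budget before issuing any request, so the cap cannot be exceeded.
        batch = frontier[: max_pages - scheduled]
        if not batch:
            break
        scheduled += len(batch)
        results = await asyncio.gather(*(visit(u) for u in batch))
        fetched.extend(batch)
        frontier = []
        for links in results:
            for link in links:
                if link not in visited:
                    visited.add(link)
                    frontier.append(link)
        level += 1
    return fetched
```

Because the budget is reserved before the batch is dispatched, concurrent fetches cannot push the total past max_pages, which is the property the global request limit item describes.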

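MVP hidden limits and known boundaries

To keep the MVP safe and controllable, the current implementation includes some conservative built-in limits and heuristics. Because they can affect what users see in practice, they are stated explicitly here:

Per-page link extraction limit: CrawlConfig.max_links_per_page is currently fixed at a default of 50. Even if a page contains many valid internal links, at most the first 50 are appended to the queue per page. This keeps sitemaps, navigation pages, and footer link clusters from instantly filling the queue and exhausting max_pages. This PR ensures that only URLs that actually enter the queue are written to visited, so links beyond the limit are not wrongly deduplicated. Later work should consider exposing this limit as a configurable parameter or introducing link prioritization / main-content detection.
Fixed Playwright wait: after domcontentloaded, the Playwright fetch currently waits roughly 2000 ms more to give front-end frameworks time to render. This raises the success rate on dynamic pages but significantly slows crawling. It could later become a configurable wait, or a smart wait based on network idle or a selector.
SSR data first: if the page contains parseable SSR / framework-injected data, the crawler prefers that structured data for extracting content and child links. This works well for documentation sites, but if the SSR data is incomplete, extra <a> links present only in the DOM may be missed. Merging SSR and DOM links is a possible follow-up.
Hash anchors are not separate pages: URL normalization ignores #fragment. For example, /docs/a#section1 and /docs/a#section2 are treated as the same page. This matches the semantics of most documentation pages, but sites that use hash routing may need further work.
Concurrency is not exposed yet: the underlying CrawlConfig.concurrency defaults to 5 and is not currently exposed through add-resource parameters. This is a safer default, but advanced users cannot tune it to their machine resources or the target site's capacity.
Empty-shell SPA pages are skipped: when no usable rendered content is obtained, the page contains only a prompt such as "JavaScript must be enabled", and the body text is too short, a heuristic classifies it as an empty shell and skips it, so meaningless content is not written to the knowledge base.
Child resource submission still reuses the add_resource path: after fetching a child page, the MVP imports child pages one by one through the existing add_resource flow rather than a dedicated batch persist. This maximizes reuse of existing parsing, writing, and indexing, but it is not the final architecture. The suggested evolution is crawl -> batch persist -> batch index.
Child resource writes keep limited concurrency: child submissions remain concurrent, and the PR relies first on parallel resource writes and the default domain parent directory to keep child pages from landing inside the root URL directory or flat under the root. If many resource-lock warnings are still observed, child submission could be serialized, or the lower layer could distinguish "resource name already exists" from "path temporarily locked".
CLI --args is still lightweight in expressiveness: --args supports top-level comma separation and double-quoted values containing commas, which covers common multi-parameter and multi-path cases; compared with a full JSON file or repeated flags, however, complex nested configuration remains hard to read, and this can be improved later.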
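Two of the limits above, fragment-insensitive normalization and the per-page link cap with its visited ordering, are easy to illustrate together. The sketch below uses hypothetical names (`normalize`, `enqueue_links`) and assumes the cap counts only links that are actually queued; the real normalization may also handle details not shown here.

```python
from typing import List, Set
from urllib.parse import urldefrag

MAX_LINKS_PER_PAGE = 50  # mirrors the default of CrawlConfig.max_links_per_page


def normalize(url: str) -> str:
    """Drop the #fragment so anchors on one page collapse to a single URL."""
    return urldefrag(url).url


def enqueue_links(
    links: List[str],
    visited: Set[str],
    queue: List[str],
    max_links: int = MAX_LINKS_PER_PAGE,
) -> int:
    """Hypothetical helper: queue up to max_links new URLs from one page."""
    added = 0
    for link in links:
        if added >= max_links:
            # Links beyond the cap are NOT marked visited, so another page
            # can still queue them later.
            break
        canonical = normalize(link)
        if canonical in visited:
            continue
        visited.add(canonical)  # mark visited only at the moment of queueing
        queue.append(canonical)
        added += 1
    return added
```

Marking a URL as visited before checking the cap would silently drop every link past the limit for the rest of the crawl, which is the ordering problem this PR says it fixes.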

Related Issue

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Refactoring (no functional changes)
Performance improvement
Test update

Changes Made

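Added recursive web crawling: child pages are discovered and imported starting from an entry URL.
Added crawl control parameters including depth, max_pages, include_paths, exclude_paths, allow_external_links, and use_playwright.
Added support for depth=-1 (unlimited recursion depth), with max_pages as the global safety limit.
Added the generic CLI --args option for passing parser-specific / import-specific parameters through to add_resource.
Removed the dedicated top-level CLI crawling flags; add-resource web crawling options are now passed through --args only.
Extended CLI --args parsing to support values containing commas for multi-path patterns, e.g. include_paths:"/a/*,/b/*".
Integrated Playwright to fetch dynamically rendered pages, improving content extraction from SPA/CSR pages.
Added URL filtering, same-domain restriction, path allowlist/denylist, and global deduplication.
Added detection of the post-redirect final_url, used as the canonical URL for child-page deduplication and writes.
Passed the request validator through the recursive fetch stage so child-link requests cannot bypass public-target validation.
Changed the write level of recursive child pages so child URLs and the root URL share the same parent directory instead of nesting child URLs inside the root URL's resource directory.
Added a default parent directory policy for URLs: without an explicit to / parent, HTTP(S) URLs are imported under viking://resources/<domain>/.
Improved Playwright lifecycle management to avoid concurrent-initialization races and page resource leaks.
Improved max_pages scheduling accounting so the user-set page limit is not exceeded under high concurrency.
Reordered the max_links_per_page check and the visited update so links beyond the per-page limit are not wrongly marked as visited.
When importing child pages, the already-fetched content is written to a temporary file and reused, avoiding a second low-level download of the same page.
Added parsing tests for CLI --args covering normal parsing, values containing commas, errors on invalid formats, and rejection of the old top-level crawl flags.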

Testing

I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I have tested this on the following platforms:
- Linux
- macOS
- Windows

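Verified scenarios:

depth and max_pages limit recursion depth and the global page count.
depth=-1 can be passed through the CLI with --args='depth:-1,...'.
include_paths restricts crawling to pages within the specified paths.
exclude_paths excludes unwanted paths such as PDFs and login pages.
With allow_external_links=false, external sites are not crawled.
When the root URL or a child page is redirected with 301/302, the final URL enters deduplication, avoiding duplicate imports.
The recursive fetch stage inherits public-target validation, so child-link requests cannot bypass the entry URL's security restrictions.
SPA/CSR pages yield rendered content when use_playwright=true.
Self-referencing pages do not recurse forever under depth=-1, because visited-URL deduplication blocks repeated URLs.
Recursive child pages are no longer nested inside the root URL's resource directory; they are written alongside the root URL under the same parent directory.
HTTP(S) URLs without an explicit to / parent are written to viking://resources/<domain>/ by default.
ov add-resource --help shows --args <key:value,...> and no longer shows the old dedicated crawl flags.
The old top-level CLI crawl flags, such as --depth and --max-pages, are rejected.
CLI --args unit tests cover parsing of depth, max_pages, allow_external_links, include_paths, and other fields.
python3 -m compileall openviking/service/resource_service.py passes.
python3 -m compileall openviking/utils/crawl_filter.py openviking/utils/page_fetcher.py openviking/utils/web_crawler.py openviking/service/resource_service.py passes.
cargo check -p ov_cli passes.
cargo test -p ov_cli add_resource passes.
cargo test -p ov_cli legacy_web_crawl_flags passes.
After rebuilding and updating the CLI in the local uv environment, ov add-resource --help shows the new --args usage.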

Checklist

My code follows the project's coding style
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
Any dependent changes have been merged and published

Screenshots (if applicable)

Additional Notes

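This PR is currently positioned as a Web Crawler MVP. Suggested follow-up directions:

Expose built-in parameters such as max_links_per_page, concurrency, and the Playwright wait time as advanced configuration.
Improve the expressiveness of CLI --args, for example with JSON file support, repeated flags, or more complete escaping rules, to make complex configuration easier to pass.
Introduce link prioritization, for example preferring main-content links and same-level documentation links while lowering the weight of navigation and footer links.
Merge SSR data and DOM link extraction results to reduce missed pages caused by incomplete structured data.
Add more systematic unit and integration tests covering filter rules, redirect deduplication, depth=-1, cyclic links, cross-domain control, and the default parent directory policy.
Add more detailed run logs or a crawl manifest to help users understand which pages were fetched, skipped, or filtered.
In the long term, evolve recursive web import from "reuse add_resource one page at a time" into a dedicated crawl -> batch persist -> batch index flow, with stable URL-to-URI mapping and batch indexing.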

# Conflicts: # openviking/server/mcp_endpoint.py

github-actions · 2026-06-03T12:18:22Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 70
🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add Web Crawler Core Utilities
Relevant files:
- openviking/utils/crawl_filter.py
- openviking/utils/link_extractor.py
- openviking/utils/page_fetcher.py
- openviking/utils/ssr_extractor.py
- openviking/utils/web_crawler.py

Sub-PR theme: Integrate Crawler with Resource Processing Pipeline
Relevant files:
- openviking/parse/parsers/html.py
- openviking/utils/media_processor.py
- openviking/utils/resource_processor.py

Sub-PR theme: Expose Crawler Parameters via API & CLI
Relevant files:
- openviking/client/local.py
- openviking/server/mcp_endpoint.py
- openviking/server/routers/resources.py
- openviking/service/resource_service.py
- crates/ov_cli/src/client.rs
- crates/ov_cli/src/commands/resources.rs
- crates/ov_cli/src/handlers.rs
- crates/ov_cli/src/help_ui.rs
- crates/ov_cli/src/main.rs
⚡ Recommended focus areas for review

Race Condition in Concurrent Counter Updates
added_count and failed_count are modified concurrently in async tasks without synchronization, leading to potential race conditions and incorrect counts.

added_count = 0
failed_count = 0

async def _add_child(page):
    nonlocal added_count, failed_count
    async with sem:
        try:
            import tempfile
            import os
            from openviking.parse.parsers.html import HTMLParser

            if page.content:
                # 1. Whether the content came directly from SSR or from the HTML fallback, clean it if present
                content_to_save = page.content
                if page.source == "ssr" and page.content_type == "text/markdown":
                    content_to_save = HTMLParser._clean_inline_images(content_to_save)

                # 2. Write the already-fetched clean text to a temp file, skipping the blind low-level httpx download
                fd, temp_path = tempfile.mkstemp(suffix=".md" if page.content_type == "text/markdown" else ".html")
                try:
                    with os.fdopen(fd, "w", encoding="utf-8") as f:
                        f.write(content_to_save)
                    await self.add_resource(
                        path=temp_path,
                        source_name=page.title or page.url,
                        original_source=page.url,
                        ctx=ctx,
                        parent=parent_uri,
                        instruction=instruction,
                        reason=reason,
                        build_index=build_index,
                        summarize=summarize,
                        depth=0,
                        **kwargs,
                    )
                finally:
                    if os.path.exists(temp_path):
                        os.remove(temp_path)
            else:
                # fallback (should not happen in theory)
                await self.add_resource(
                    path=page.url,
                    ctx=ctx,
                    parent=parent_uri,
                    instruction=instruction,
                    reason=reason,
                    build_index=build_index,
                    summarize=summarize,
                    depth=0,
                    **kwargs,
                )
            added_count += 1
            if added_count % 10 == 0:
                logger.info(
                    f"[Crawl] Progress: {added_count}/{len(crawl_result.pages)} added"
                )
        except Exception as e:
            failed_count += 1
            if failed_count <= 5:
                logger.warning(f"[Crawl] Failed to add {page.url}: {e}")

tasks = [_add_child(page) for page in crawl_result.pages if page.status == "success"]
if tasks:
    await asyncio.gather(*tasks, return_exceptions=True)

Hardcoded Concurrency Overrides CrawlConfig
The concurrency limit is hardcoded to 3, ignoring the CrawlConfig.concurrency setting and adding redundant concurrency control on top of the WebCrawler's existing semaphore.

logger.info(
    f"[Crawl] Adding {len(crawl_result.pages)} child resources "
    f"(concurrency=3, parent={parent_uri})"
)
sem = asyncio.Semaphore(3)

Missing use_playwright Parameter in Local Client
The local client's add_resource method doesn't expose the use_playwright parameter, which is present in the server's API and other client interfaces, leading to inconsistent API usage.

    timeout: Optional[float] = None,
    build_index: bool = True,
    summarize: bool = False,
    telemetry: TelemetryRequest = False,
    watch_interval: float = 0,
    depth: int = 0,
    max_pages: int = 100,
    include_paths: Optional[str] = None,
    exclude_paths: Optional[str] = None,
    allow_external_links: bool = False,
    **kwargs,
) -> Dict[str, Any]:
    """Add resource to OpenViking."""
    if to and parent:
        raise ValueError("Cannot specify both 'to' and 'parent' at the same time.")
    execution = await run_with_telemetry(
        operation="resources.add_resource",
        telemetry=telemetry,
        fn=lambda: self._service.resources.add_resource(
            path=path,
            ctx=self._ctx,
            to=to,
            parent=parent,
            reason=reason,
            instruction=instruction,
            wait=wait,
            timeout=timeout,
            build_index=build_index,
            summarize=summarize,
            watch_interval=watch_interval,
            depth=depth,
            max_pages=max_pages,
            include_paths=include_paths,
            exclude_paths=exclude_paths,
            allow_external_links=allow_external_links,
            **kwargs,

CrawlConfig Timeout Not Used in Fetch Calls
The fetcher's timeout isn't passed from the CrawlConfig, ignoring the configured timeout value.

fetch_result: FetchResult = await self._fetcher.fetch(url)

Dynamic Attributes on ParseResult (Type Safety)
_html_content and _html_final_url are dynamically added to ParseResult, which may not be defined in the dataclass, leading to potential type errors.

    result._html_content = local_resource.path.read_text(
        encoding="utf-8", errors="replace"
    )
    result._html_final_url = local_resource.meta.get("final_url", "")
except Exception:
    pass
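One way to sidestep the shared-counter concern in the first focus area, without claiming the current code actually miscounts, is to derive the totals from the results of `asyncio.gather` instead of mutating `nonlocal` variables inside the tasks. The sketch below is an assumption-laden illustration: `add_all` and `add_one` are hypothetical names, and the concurrency value is passed in rather than hardcoded.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable, List, Tuple


async def add_all(
    pages: Iterable[Any],
    add_one: Callable[[Any], Awaitable[None]],
    concurrency: int = 3,
) -> Tuple[int, List[BaseException]]:
    """Hypothetical helper: import pages concurrently, then count outcomes once."""
    sem = asyncio.Semaphore(concurrency)

    async def guarded(page: Any) -> None:
        async with sem:
            await add_one(page)

    # With return_exceptions=True, failures come back as exception objects
    # in the result list instead of being raised.
    results = await asyncio.gather(*(guarded(p) for p in pages), return_exceptions=True)
    failures = [r for r in results if isinstance(r, BaseException)]
    return len(results) - len(failures), failures
```

Counting after the gather completes keeps all bookkeeping in one place, at the cost of losing the incremental progress log that the original snippet emits every 10 pages.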

github-actions · 2026-06-03T12:21:07Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: Possible issue

Use async file read to avoid blocking event loop

Replace synchronous file read with async execution using asyncio.to_thread to avoid
blocking the event loop in this async function.

openviking/utils/media_processor.py [202-207]

+import asyncio
 suffix = str(local_resource.path).lower()
 if suffix.endswith((".html", ".htm")) or not os.path.splitext(suffix)[1]:
-    result._html_content = local_resource.path.read_text(
-        encoding="utf-8", errors="replace"
+    result._html_content = await asyncio.to_thread(
+        local_resource.path.read_text, encoding="utf-8", errors="replace"
     )
     result._html_final_url = local_resource.meta.get("final_url", "")

Suggestion importance [1-10]: 5

Why: Replaces synchronous file reading with asyncio.to_thread to avoid blocking the event loop in an async function, improving concurrency and adherence to async best practices.

Impact: Low

Use async wrapper for temp file creation/writing

Use asyncio.to_thread for synchronous file operations (tempfile.mkstemp,
os.fdopen.write) in this async function to avoid blocking the event loop.

openviking/service/resource_service.py [472-475]

-fd, temp_path = tempfile.mkstemp(suffix=".md" if page.content_type == "text/markdown" else ".html")
-try:
-    with os.fdopen(fd, "w", encoding="utf-8") as f:
-        f.write(content_to_save)
+import asyncio
+def _create_temp_file(content: str, suffix: str) -> str:
+    fd, temp_path = tempfile.mkstemp(suffix=suffix)
+    try:
+        with os.fdopen(fd, "w", encoding="utf-8") as f:
+            f.write(content)
+        return temp_path
+    except Exception:
+        if os.path.exists(temp_path):
+            os.remove(temp_path)
+        raise
 
+temp_path = await asyncio.to_thread(
+    _create_temp_file, content_to_save, ".md" if page.content_type == "text/markdown" else ".html"
+)
+

Suggestion importance [1-10]: 5

Why: Wraps synchronous temp file operations in asyncio.to_thread to prevent blocking the event loop in the async _crawl_and_add_resources method, enhancing async performance.

Impact: Low

sponge225 added 2 commits June 3, 2026 14:43

feat: add recursive web crawler MVP

2d77f5d

Merge remote-tracking branch 'origin/main' into feature/web-crawler-mvp

5e8df8e

# Conflicts: # openviking/server/mcp_endpoint.py

github-project-automation Bot added this to OpenViking project Jun 3, 2026

github-project-automation Bot moved this to Backlog in OpenViking project Jun 3, 2026

github-actions Bot added the Review effort 4/5 label Jun 3, 2026

sponge225 added 3 commits June 4, 2026 16:09

feat(cli): route add-resource options through args

9a2c1cb

fix(crawler): harden recursive fetch controls

47a3001

feat(crawler): default URL imports under host parent

b5d44c9

sponge225 wants to merge 5 commits into volcengine:main from sponge225:feature/web-crawler-mvp
