Skip to content

Test fixes for many plugins#2

Open
pirate wants to merge 13 commits intomainfrom
refactors
Open

Test fixes for many plugins#2
pirate wants to merge 13 commits intomainfrom
refactors

Conversation

@pirate
Copy link
Member

@pirate pirate commented Feb 26, 2026


Open with Devin

Summary by cubic

Stabilizes Chrome-driven plugins by consolidating session helpers and smoothing tab/navigate/screenshot/cookie flows. Test suite now opts into Chrome prerequisites, adds clear hook checks, and fixes flaky cases (including DNS tests using real URLs).

  • Bug Fixes

    • Tests: ensure_chrome_test_prereqs is opt-in; modules explicitly use pytestmark/usefixtures, add small require_chrome_runtime fixtures, and raise FileNotFoundError when hooks are missing. TwoCaptcha tests now target the Recaptcha v2 demo, read API_KEY_2CAPTCHA as an alias, and warn when API keys are absent. DNS tests use a real URL.
    • Chrome: tab/navigate/screenshot/cookies share helpers (readCdpUrl/waitForChromeSession/connectToPage/getCookiesViaCdp), with better waits/cleanup and extension target detection.
    • Optional deps: requests/feedparser/sonic loaded via import_module with clearer errors; favicon imports requests directly. Papers-dl honors PAPERSDL_TIMEOUT, defaults to "fetch", and extracts arXiv IDs from DOIs. Misc trims ripgrep error text and stabilizes wget/sqlite/gallery-dl/ytdlp outputs and parse_rss expectations.
  • Refactors

    • Chrome utils: helpers for session files, opening/closing tabs, extension ID parsing; screenshot waits on navigation state.
    • SingleFile: extension save supports outputPath, snapshot hook uses cwd, and session dir resolution is fixed.
    • Puppeteer install: reuses CHROME_BINARY when present and caches downloads via PUPPETEER_CACHE_DIR.
    • ForumDL: remove wrapper and Pydantic monkeypatch; rely on install hook to pin deps; tests assert binary presence with clearer errors.
    • Discovery/deps: README clarifies install hooks and removes binaries.jsonl; get_plugins_dir uses file; pyproject adds feedparser/requests and test tooling (pytest, pytest-httpserver, ruff, pyright).

Written for commit 2f09cbf. Summary will update on new commits.

cubic-dev-ai[bot]

This comment was marked as resolved.

Copy link

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 9 additional findings.

Open in Devin Review

cubic-dev-ai[bot]

This comment was marked as resolved.

cubic-dev-ai[bot]

This comment was marked as resolved.

Copy link

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 new potential issue.

View 23 additional findings in Devin Review.

Open in Devin Review

chromeSessionDir: CHROME_SESSION_DIR,
timeoutMs: 60000,
requireTargetId: false,
puppeteer,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 SingleFile save can target the wrong Chrome tab when target_id marker is missing

The SingleFile extension helper explicitly disables target-id enforcement when attaching to the shared browser session, so it can proceed even if the snapshot’s target_id.txt is absent or stale.

Root Cause

singlefile_extension_save.js now calls connectToPage(..., requireTargetId: false) (abx_plugins/plugins/singlefile/singlefile_extension_save.js:106). In connectToPage, if no target ID is available, it falls back to pages[pages.length - 1] (abx_plugins/plugins/chrome/chrome_utils.js:2008-2011).

That means a snapshot can attach to an arbitrary last-opened tab in the shared crawl browser instead of its own tab. The helper then may navigate that page and trigger SingleFile save on the wrong context.

Actual: missing/stale target marker can still produce a “successful” save from another tab.

Expected: fail fast unless the snapshot’s target id is present and matched.

Impact: incorrect/cross-snapshot capture (data integrity issue) in concurrent/shared Chrome runs.

Suggested change
puppeteer,
requireTargetId: true,
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 2 new potential issues.

View 26 additional findings in Devin Review.

Open in Devin Review

Comment on lines +123 to 147
if forumdl_python:
# Inline compatibility shim so this hook stays self-contained.
inline_entrypoint = textwrap.dedent(
"""
import sys
try:
from forum_dl.writers.jsonl import JsonlWriter
from pydantic import BaseModel
if hasattr(BaseModel, "model_dump_json"):
def _patched_serialize_entry(self, entry):
return entry.model_dump_json()
JsonlWriter._serialize_entry = _patched_serialize_entry
except Exception:
pass
from forum_dl import main
raise SystemExit(main())
"""
).strip()
cmd = [forumdl_python, '-c', inline_entrypoint, *forumdl_args, '-f', output_format, '-o', str(output_file)]
else:
cmd = [resolved_binary, *forumdl_args, '-f', output_format, '-o', str(output_file)]

if not check_ssl:
cmd.append('--no-check-certificate')

if forumdl_args_extra:
cmd.extend(forumdl_args_extra)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Pydantic v2 compatibility shim dropped when binary shebang is unreadable

When get_binary_shebang(resolved_binary) returns None, the forum-dl Pydantic v2 compatibility patch is no longer applied. The old code fell back to sys.executable and always applied the wrapper shim:

Root Cause

The old code was:

forumdl_python = get_binary_shebang(resolved_binary) or sys.executable
cmd = [forumdl_python, str(wrapper_path), *forumdl_args, ...]

The or sys.executable fallback ensured the Pydantic v2 monkey-patch was always applied (via the wrapper script) even when the shebang couldn't be extracted.

The new code at on_Snapshot__04_forumdl.bg.py:123-149 is:

forumdl_python = get_binary_shebang(resolved_binary)
if forumdl_python:
    # ... inline shim with Pydantic v2 patch
else:
    cmd = [resolved_binary, *forumdl_args, ...]

When forumdl_python is None (e.g., binary lacks read permissions, is a compiled executable, or has no shebang line), the else branch runs the binary directly without the Pydantic v2 compatibility shim. This means forum_dl.writers.jsonl.JsonlWriter._serialize_entry is never patched, and forum-dl 0.3.0 will crash with TypeError when serializing entries with Pydantic v2.

Impact: forum-dl extraction fails silently or with cryptic Pydantic errors when the binary shebang is unreadable, whereas before it always worked via the sys.executable fallback.

(Refers to lines 123-149)

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment on lines +46 to 47
OUTPUT_DIR = Path.cwd().resolve()
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 singlefile OUTPUT_DIR no longer self-computes from SNAP_DIR, silently writes to wrong directory if cwd is wrong

The singlefile snapshot hook changed from computing its output directory from SNAP_DIR to trusting cwd(). If the caller doesn't set cwd to SNAP_DIR/singlefile/, output files go to the wrong place and the Chrome session directory resolution breaks.

Detailed Explanation

The old code at on_Snapshot__50_singlefile.py was self-contained:

SNAP_DIR = Path(os.environ.get('SNAP_DIR', '.')).resolve()
OUTPUT_DIR = SNAP_DIR / PLUGIN_DIR
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
os.chdir(OUTPUT_DIR)

This always computed the correct output directory from the SNAP_DIR environment variable and changed into it, regardless of the invocation cwd.

The new code is:

OUTPUT_DIR = Path.cwd().resolve()
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

This trusts the caller to set cwd to the plugin output directory. If the caller invokes from a different directory (e.g., SNAP_DIR instead of SNAP_DIR/singlefile/), then:

  1. OUTPUT_FILE = 'singlefile.html' is written to the wrong location
  2. STATICFILE_DIR = '../staticfile' resolves to the wrong sibling directory
  3. CHROME_SESSION_DIR = '../chrome' resolves to the wrong directory, causing Chrome session lookup to fail
  4. The singlefile_extension_save.js subprocess (invoked with cwd=str(OUTPUT_DIR)) will also have the wrong Chrome session path

Impact: If any caller doesn't set cwd to SNAP_DIR/<plugin>/ before invoking this hook, output silently goes to the wrong location and Chrome session connection fails.

Suggested change
OUTPUT_DIR = Path.cwd().resolve()
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
SNAP_DIR = Path(os.environ.get('SNAP_DIR', '.')).resolve()
OUTPUT_DIR = SNAP_DIR / PLUGIN_DIR
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
os.chdir(OUTPUT_DIR)
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant