Agent instructions — Site Audit (WebsiteProfiling)

Developer reference for agents and contributors. User-facing overview: README.md. Full doc index: docs/README.md.

What it is: python -m src from repo root (src/__main__.py -> package website_profiling). Config: stored in PostgreSQL (pipeline_config table, key/value/is_unknown/updated_at). A shadow pipeline-config.txt is auto-written to DATA_DIR on every Save/Run. CLI loads DB first (DATABASE_URL), then shadow file; --config overrides with a file. Reference keys: input.txt.example and pipeline-config.example.txt (not auto-loaded).
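
The load order above can be sketched as a small resolver. This is a minimal illustration under stated assumptions, not the code in config.py: `resolve_config_source`, its parameters, and its return shape are hypothetical names invented for this example.

```python
from typing import Mapping, Optional, Tuple


def resolve_config_source(
    cli_config_path: Optional[str],
    db_rows: Mapping[str, str],
    shadow_file_exists: bool,
) -> Tuple[str, Optional[str]]:
    """Return (source, path) following the precedence described above.

    Hypothetical helper: names and return shape are assumptions.
    """
    # --config overrides everything else.
    if cli_config_path:
        return ("file", cli_config_path)
    # Otherwise prefer rows read from the pipeline_config table.
    if db_rows:
        return ("db", None)
    # Fall back to the shadow file written to DATA_DIR on Save/Run.
    if shadow_file_exists:
        return ("shadow", "pipeline-config.txt")
    return ("none", None)
```

Keeping the precedence in one pure function makes it easy to unit-test without a database or a DATA_DIR on disk.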

LLM / AI: Settings live in llm_config table in PostgreSQL. Providers: OpenAI, Google Gemini, Anthropic, Groq, Ollama (web/src/lib/llmConfigSchema.ts). Configure only via web UI AI tab (GET/PUT /api/llm-config, localhost). Never in pipeline-config.txt or --config.

Frontend: web/ (Next.js) -- server reads PostgreSQL via /api/report/*.

Key paths

src/website_profiling/ -- cli.py, config.py, crawl/, db/storage.py, lighthouse/, reporting/, analysis/, llm/, tools/
web/app/ -- routes; web/src/ -- React; pipeline: PipelineRunnerFab, server/pipelineJobs.ts, server/pipelineConfig.ts, server/llmConfig.ts, server/db.ts
alembic/ -- schema migrations

Local dev: ./local-run (Postgres in Docker wp-pg, Next.js on host; default DATABASE_URL: postgres://postgres:dev@127.0.0.1:5432/website_profiling). See scripts/local-run.sh. Local tests: ./local-test runs three Python coverage gates (core 100%, reporting 100%, tools 100%) plus web checks, mirroring the CI Python and web jobs; Docker CI is separate (see .github/workflows/ci.yml). ./local-test browser runs the @pytest.mark.browser integration tests (see scripts/local-test.sh). Mocked browser unit tests: tests/test_browser_fetcher_unit.py.

JavaScript crawl (optional): Config keys crawl_render_mode (static | javascript | auto) and crawl_js_* in pipeline config / pipelineConfigSchema.ts. JS/auto crawls can capture browser console errors and uncaught exceptions (crawl_js_capture_console, stored under page_analysis.browser). Auto mode fetches statically first, applies pre-parse SPA heuristics (needs_js_render), then falls back on a post-parse low-outlink check (needs_js_render_after_parse) in crawler.py. Preflight: GET /api/crawl/browser-status (localhost) spawns Python browser_status(); Run audit settings and run validation call it when render mode is javascript or auto. Browser deps: Playwright from requirements.txt (installed by ./local-run setup and ./local-test). Runtime needs Chromium on PATH or CHROME_PATH (Docker sets CHROME_PATH=/usr/bin/chromium). Integration tests: marked @pytest.mark.browser and excluded by default in pytest.ini; Docker CI runs tests/test_crawl_fetchers.py and tests/test_crawler_browser_e2e.py with -m browser; locally, use ./local-test browser.
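
The auto-mode decision described above can be pictured roughly as follows. This is a sketch under loud assumptions: the marker list, the thresholds, and the function names `looks_like_spa_shell`, `too_few_outlinks`, and `choose_fetcher` are all hypothetical and do not reproduce the heuristics in crawler.py.

```python
import re

# Hypothetical thresholds and markers, chosen only for illustration.
MIN_OUTLINKS = 3
SPA_MARKERS = ('id="root"', 'id="app"', "__NEXT_DATA__")


def looks_like_spa_shell(html: str) -> bool:
    """Pre-parse check: almost no visible text plus a known SPA mount marker."""
    text = re.sub(r"<script\b.*?</script>|<[^>]+>", " ", html, flags=re.S | re.I)
    has_marker = any(marker in html for marker in SPA_MARKERS)
    return has_marker and len(text.split()) < 20


def too_few_outlinks(html: str) -> bool:
    """Post-parse check: the static HTML exposes almost no links."""
    return len(re.findall(r"<a\s[^>]*href=", html, flags=re.I)) < MIN_OUTLINKS


def choose_fetcher(render_mode: str, static_html: str) -> str:
    """Return "static" or "javascript" for one page."""
    if render_mode in ("static", "javascript"):
        return render_mode
    # auto: static fetch first, then the two escalation checks in order.
    if looks_like_spa_shell(static_html) or too_few_outlinks(static_html):
        return "javascript"
    return "static"
```

The point of the ordering is cost: the cheap static fetch always runs, and a browser render is only paid for when the static result looks empty.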

Run / APIs

Run audit (CLI): python -m src — reads config from PostgreSQL (pipeline_config); shadow DATA_DIR/pipeline-config.txt if table empty. CLI override: python -m src --config path
Optional step: crawl | report | plot | lighthouse | keywords | warnings | enrich | google | chat
preserve_crawl_history (default true): append crawls; false truncates crawl tables but restores report_payload, Lighthouse, google_data, keyword_data, keyword_history, keyword_suggest_cache, and crawl_runs
DATABASE_URL env: PostgreSQL connection string (required). DATA_DIR: secrets + shadow config (Docker: /data).
Pipeline storage (crawl, edges, nodes, report payload, Lighthouse, keywords, warnings) lives in PostgreSQL only. Deliverables use the Export view, GET /api/report/export, or MCP export_* tools — not files written by the main pipeline step.
Pool tuning: DB_POOL_MIN / DB_POOL_MAX (Python), PGPOOL_MAX (Node). Bulk crawl writes via executemany; optional crawl_stream_to_db streams rows during fetch. Per-URL raw HTML: crawl_page_html table (migration 015); API GET/POST /api/crawl/page-html (localhost).
web/ APIs: /api/report/* read routes (payload, meta, history — not localhost-guarded; protect with AUTH_* when exposed); /api/run spawns Python (localhost); /api/jobs, /api/jobs/[id], /api/jobs/[id]/cancel (localhost); /api/crawl/browser-status, /api/crawl/page-html (localhost); /api/pipeline-config GET/PUT; /api/llm-config GET/PUT; /api/chat POST (SSE); /api/chat/sessions GET/POST; /api/ollama/status (localhost); /api/properties/{id}/google/links/import POST; PipelineRunnerFab saves pipeline + LLM state before each run. Full route list: web/app/api/**/route.ts.
MCP: python -m website_profiling.mcp (stdio) or python -m website_profiling.mcp.http (remote Streamable HTTP). Configure at /mcp in the web UI. See docs/MCP.md.
AI Chat UI: /chat — property-scoped chat with saved sessions (chat_sessions, chat_messages; migration 012_chat_sessions).
Job store: PostgreSQL pipeline_jobs when DATABASE_URL is set (pipelineJobsDb.ts — status, timestamps, truncated logs). In-memory map in pipelineJobs.ts holds live log tail and child process handles; stale rows reconciled via PIPELINE_JOB_STALE_HOURS.
Schema head: 015_crawl_page_html (recent: 013 link_edges/discovery, 014 job log truncation, 015 per-URL HTML storage).
Docker: Dockerfile + docker-compose.yml (postgres + web); docker-compose.prod.yml (production + remote MCP on :8000); docker-compose.pull.yml for pre-built images (WEB_IMAGE); LIGHTHOUSE_CHROME_FLAGS
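
The stale-row reconciliation mentioned under Job store can be illustrated with a small Python sketch. The real logic lives in the TypeScript job store, so everything here is an assumption made for illustration: the row fields (`id`, `status`, `started_at`), the `"running"` status value, the 24-hour default, and the helper names.

```python
import os
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List


def stale_hours_from_env(default: float = 24.0) -> float:
    """Read PIPELINE_JOB_STALE_HOURS; the 24-hour default is an assumption."""
    raw = os.environ.get("PIPELINE_JOB_STALE_HOURS", "")
    try:
        return float(raw) if raw else default
    except ValueError:
        return default


def stale_job_ids(
    jobs: List[Dict[str, Any]], now: datetime, stale_hours: float
) -> List[str]:
    """Return ids of rows still marked running that started before the cutoff."""
    cutoff = now - timedelta(hours=stale_hours)
    return [
        job["id"]
        for job in jobs
        if job["status"] == "running" and job["started_at"] < cutoff
    ]


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
jobs = [
    {"id": "a", "status": "running", "started_at": now - timedelta(hours=30)},
    {"id": "b", "status": "running", "started_at": now - timedelta(hours=1)},
    {"id": "c", "status": "done", "started_at": now - timedelta(hours=30)},
]
stale = stale_job_ids(jobs, now, stale_hours=24.0)
```

Rows flagged this way are the ones whose child process handle no longer exists in the in-memory map, for example after a server restart.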

Where to edit

| Task | Where |
| --- | --- |
| Crawl | `crawl/crawler.py`, `crawl/fetchers/` |
| Report | `reporting/builder.py`, `reporting/categories.py` |
| DB schema | `alembic/versions/` |
| Local analysis | `analysis/local.py`, `requirements.txt` |
| AI insights (LLM) | `llm/enrich.py`, `llm/agent.py`, `llm_config.py`, `requirements.txt` |
| Audit query tools (MCP + chat) | `tools/audit_tools/`, `mcp/server.py`, `mcp/http_server.py`, `commands/chat_cmd.py` |
| Config / CLI | `config.py` (`load_config`, `load_config_from_db`), `cli.py`, `input.txt.example` |
| UI pipeline schema | `web/src/lib/pipelineConfigSchema.ts` |
| UI LLM schema | `web/src/lib/llmConfigSchema.ts` |
| UI config I/O | `web/src/server/pipelineConfig.ts`, `web/src/server/llmConfig.ts` |

Schema changes: add Alembic migration (alembic revision).

Company standards: UI copy in web/src/strings.json (Site Audit, Properties, Run audit). Data provenance on report_meta in report payload. Docs: docs/COMPANY_STANDARDS.md, docs/GLOSSARY.md. Migration 003_company_standards (properties, pipeline_jobs, audit_log). Durable jobs in web/src/server/pipelineJobsDb.ts. Export: GET /api/report/export, src/website_profiling/tools/export_audit.py.

Common footguns (check before finishing web or DB work)

These recur when adding features. Verify explicitly — do not assume tests caught them.

React context — useReport / ReportProvider
- Report views call useReport(). That only works inside ReportAppClient → ReportProvider.
- Do: Render report views via ReportShell (wraps ReportAppClient internally).
- Don't: Import a view directly in app/*/page.tsx without ReportShell.
- Standalone routes under web/app/ (e.g. log-analyzer, indexation) are not auto-wrapped by (reports)/layout.
```tsx
// ✅ Render the view through ReportShell so useReport() has a ReportProvider above it.
import ReportShell from '@/ReportShell';

export default function Page() {
  return <ReportShell slug="log-analyzer" />;
}
```
Python — local imports shadow module imports
- from ..config import get_int anywhere inside a function makes that name local for the entire function. Using it earlier → UnboundLocalError.
- Do: Use the module-level import (see top of reporting/builder.py).
- Don't: Re-import inside a function if the same name is used above that line in the same function.
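
The failure mode is easy to reproduce in isolation. The sketch below uses the standard-library `math` module as a stand-in for `from ..config import get_int` (a relative import cannot run outside the package); the function names `broken` and `fixed` are illustrative only.

```python
import math  # module-level import, visible in every function below


def broken(x: float) -> float:
    # Fails here: the import two lines down makes `math` a local name
    # for the whole function, and it is not bound yet at this point.
    y = math.floor(x)
    import math
    return y + math.ceil(x)


def fixed(x: float) -> float:
    # Rely on the module-level import only.
    return math.floor(x) + math.ceil(x)


try:
    broken(1.5)
except UnboundLocalError as exc:
    caught = type(exc).__name__
```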
PostgreSQL rows — never row[0]
- Connections may use psycopg dict_row. row[0] → KeyError: 0 on dict rows; tuple-only unit tests still pass.
- Do: _row_field(row, "id", index=0) from website_profiling.db._common (pattern in property_store.py).
- Don't: fetchone()[0] on INSERT … RETURNING without _row_field.
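```python
from ._common import _row_field  # relative import from inside website_profiling.db

# cur: a cursor that has just executed INSERT ... RETURNING id
row = cur.fetchone()
rid = _row_field(row, "id", index=0)  # avoids row[0], which breaks on dict rows
report_id = int(rid) if rid is not None else None
```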
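For readers who have not opened `_common.py`, a helper of this kind might look roughly like the following. This is one plausible shape written for illustration, not the project's implementation; `row_field_sketch` is a hypothetical name, and the real `_row_field` may treat missing keys or `None` rows differently.

```python
from collections.abc import Mapping, Sequence
from typing import Any, Optional, Union

Row = Union[Mapping, Sequence, None]


def row_field_sketch(row: Row, key: str, index: int = 0) -> Optional[Any]:
    """Read one column from a dict-style or tuple-style row; None if absent."""
    if row is None:
        return None
    # dict_row connections return mappings keyed by column name.
    if isinstance(row, Mapping):
        return row.get(key)
    # Default tuple rows are addressed by position.
    return row[index] if index < len(row) else None
```

Accepting both a key and an index at the call site is what lets the same code pass tuple-only unit tests and still work against dict-row connections.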

Python — local vs CI coverage gates (three jobs, not one)

CI runs three separate pytest coverage jobs (see .github/workflows/ci.yml and scripts/local-test.sh):

| Gate | Config | Source | Threshold | Test scope |
| --- | --- | --- | --- | --- |
| Core | `.coveragerc` | all packages except `tools/` and `reporting/` | 100% | `pytest tests/ -m "not browser"` |
| Reporting | `.coveragerc.reporting` | `website_profiling.reporting` | 100% | `pytest tests/reporting/` |
| Tools | `.coveragerc.tools` | `website_profiling.tools` | 100% | `pytest tests/tools/` |

Symptom: ./local-test or core pytest passes at 100%, but CI fails on tools/reporting (e.g. 84% tools).
Causes: (a) only ran core pytest, not reporting/tools gates; (b) added reporting/tools tests outside tests/reporting/ or tests/tools/; (c) changed code under website_profiling/tools/ without tests that hit those lines in the tools gate subset.
Do: Run full ./local-test before push. Put reporting coverage tests in tests/reporting/ and tools coverage tests in tests/tools/ (one module per file, e.g. test_<module>_coverage.py). Keep bash and PowerShell local-test scripts in sync.
Don't: Assume pytest tests/ alone matches CI. Don't maintain long per-file lists in CI — use the directory gates above.
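
Cause (b) is easiest to see by asking which gate invocations actually collect a given test file. The sketch below derives that from the test scopes in the table; the helper names and the example file names are hypothetical, and it ignores the `-m "not browser"` marker filter.

```python
from pathlib import PurePosixPath
from typing import List

# Directory prefixes taken from the "Test scope" column above.
GATE_DIRS = {"reporting": ("tests", "reporting"), "tools": ("tests", "tools")}


def gates_collecting(test_path: str) -> List[str]:
    """List the gates whose pytest invocation would collect this file."""
    parts = PurePosixPath(test_path).parts
    if parts[:1] != ("tests",):
        return []
    gates = ["core"]  # `pytest tests/` collects every file under tests/
    gates += [name for name, prefix in GATE_DIRS.items() if parts[:2] == prefix]
    return gates


def counts_toward(test_path: str, gate: str) -> bool:
    """True if lines hit by this test file can count in the given gate."""
    return gate in gates_collecting(test_path)
```

A test placed at `tests/test_export_audit.py` runs only in the core job, where `tools/` is excluded from measurement, so it cannot raise the tools gate's percentage no matter how many tools lines it exercises.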

Python — runpy.run_module / __main__ guard tests
- Tests that execute a module as __main__ via runpy.run_module(..., run_name="__main__") emit: RuntimeWarning: '<module>' found in sys.modules after import of package ... when the same module was already imported at the top of the test file (or by another import).
- Do: Before runpy.run_module, remove the target from sys.modules so Python re-executes __main__ cleanly. Name tests test_module_main_guard (see tests/test_schedule_runner.py).
- Don't: Call runpy.run_module on a module already imported in that test file without popping it first.
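```python
import runpy
import sys

# Drop any already-imported copy so runpy re-executes the module
# as __main__ without the RuntimeWarning.
sys.modules.pop("website_profiling.tools.schedule_runner", None)
runpy.run_module(
    "website_profiling.tools.schedule_runner",
    run_name="__main__",
    alter_sys=False,
)
```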

Checklist: new report page uses ReportShell · no duplicate local imports in long functions · new fetchone() uses _row_field · ./local-test passes all three coverage gates · new reporting/tools coverage tests live under tests/reporting/ or tests/tools/, with both local-test scripts kept in sync · runpy main-guard tests pop sys.modules first
