Developer reference for agents and contributors. User-facing overview: README.md. Full doc index: docs/README.md.
What it is: python -m src from repo root (src/__main__.py -> package website_profiling). Config: stored in PostgreSQL (pipeline_config table, key/value/is_unknown/updated_at). A shadow pipeline-config.txt is auto-written to DATA_DIR on every Save/Run. CLI loads DB first (DATABASE_URL), then shadow file; --config overrides with a file. Reference keys: input.txt.example and pipeline-config.example.txt (not auto-loaded).
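The load order described above (an explicit `--config` file wins, otherwise the database, otherwise the shadow file) can be sketched as a small decision function. This is only an illustration of the precedence: `pick_config_source` and its parameters are hypothetical names, not the functions in `config.py`.

```python
from typing import Optional


def pick_config_source(
    cli_config_path: Optional[str],
    db_row_count: int,
    shadow_file_exists: bool,
) -> str:
    """Return which source a loader following the documented order would use.

    Hypothetical helper: the real logic lives in config.py and may differ.
    """
    if cli_config_path:
        # --config overrides everything else.
        return "file"
    if db_row_count > 0:
        # pipeline_config has rows, so the database is authoritative.
        return "database"
    if shadow_file_exists:
        # Empty table: fall back to DATA_DIR/pipeline-config.txt.
        return "shadow"
    return "none"
```

Keeping the decision separate from the I/O makes the precedence easy to unit-test without a database.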
LLM / AI: Settings live in llm_config table in PostgreSQL. Providers: OpenAI, Google Gemini, Anthropic, Groq, Ollama (web/src/lib/llmConfigSchema.ts). Configure only via web UI AI tab (GET/PUT /api/llm-config, localhost). Never in pipeline-config.txt or --config.
Frontend: web/ (Next.js) -- server reads PostgreSQL via /api/report/*.
Key paths
- `src/website_profiling/` -- `cli.py`, `config.py`, `crawl/`, `db/storage.py`, `lighthouse/`, `reporting/`, `analysis/`, `llm/`, `tools/`
- `web/app/` -- routes; `web/src/` -- React; pipeline: `PipelineRunnerFab`, `server/pipelineJobs.ts`, `server/pipelineConfig.ts`, `server/llmConfig.ts`, `server/db.ts`
- `alembic/` -- schema migrations
Local dev: ./local-run (Postgres in Docker wp-pg, Next.js on host; default DATABASE_URL: postgres://postgres:dev@127.0.0.1:5432/website_profiling). See scripts/local-run.sh. Local tests: ./local-test runs three Python coverage gates (core 100%, reporting 100%, tools 100%) plus web checks — mirrors CI python and web jobs; Docker CI is separate (see .github/workflows/ci.yml). ./local-test browser for @pytest.mark.browser integration tests — see scripts/local-test.sh. Mocked browser unit tests: tests/test_browser_fetcher_unit.py.
JavaScript crawl (optional): Config keys crawl_render_mode (static | javascript | auto) and crawl_js_* in pipeline config / pipelineConfigSchema.ts. JS/auto crawls can capture browser console errors and uncaught exceptions (crawl_js_capture_console, stored under page_analysis.browser). Auto mode uses static-first fetch, pre-parse SPA heuristics (needs_js_render), then post-parse low-outlink fallback (needs_js_render_after_parse) in crawler.py. Preflight: GET /api/crawl/browser-status (localhost) spawns Python browser_status(); Run audit settings/run validation calls it when render mode is javascript or auto. Browser deps: Playwright from requirements.txt (installed by ./local-run setup and ./local-test). Runtime needs Chromium on PATH or CHROME_PATH (Docker sets CHROME_PATH=/usr/bin/chromium). Integration tests: @pytest.mark.browser — excluded by default in pytest.ini; Docker CI runs tests/test_crawl_fetchers.py and tests/test_crawler_browser_e2e.py -m browser; locally ./local-test browser.
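The auto-mode flow described above (static fetch first, a pre-parse SPA check, then a post-parse low-outlink fallback) can be sketched roughly as follows. The function bodies, regexes, and thresholds are assumptions made for illustration; the actual heuristics in `crawler.py` are not shown in this document, so the `_sketch` suffix marks these as stand-ins.

```python
import re

# Hypothetical thresholds, chosen only for this example.
MIN_VISIBLE_TEXT_CHARS = 200
MIN_OUTLINKS = 3

_MOUNT_RE = re.compile(r'<div[^>]+id=["\'](?:root|app|__next)["\']', re.I)
_STRIP_RE = re.compile(
    r"<script\b.*?</script>|<style\b.*?</style>|<[^>]+>", re.S | re.I
)
_HREF_RE = re.compile(r"<a\b[^>]*\bhref=", re.I)


def needs_js_render_sketch(html: str) -> bool:
    """Pre-parse check: an SPA mount point with almost no visible text."""
    visible = " ".join(_STRIP_RE.sub(" ", html).split())
    has_mount = _MOUNT_RE.search(html) is not None
    return has_mount and len(visible) < MIN_VISIBLE_TEXT_CHARS


def needs_js_render_after_parse_sketch(html: str) -> bool:
    """Post-parse check: too few outlinks were found in the static HTML."""
    return len(_HREF_RE.findall(html)) < MIN_OUTLINKS


def choose_renderer(mode: str, static_html: str) -> str:
    """Map crawl_render_mode plus the static response to a fetch strategy."""
    if mode in ("static", "javascript"):
        return mode
    if mode != "auto":
        raise ValueError(f"unknown crawl_render_mode: {mode!r}")
    if needs_js_render_sketch(static_html):
        return "javascript"
    if needs_js_render_after_parse_sketch(static_html):
        return "javascript"
    return "static"
```

Running the cheap text check before the outlink check means most pages are settled without a second pass over the HTML.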
Run / APIs
- Run audit (CLI): `python -m src` reads config from PostgreSQL (`pipeline_config`); shadow `DATA_DIR/pipeline-config.txt` if table empty. CLI override: `python -m src --config path`
- Optional step: `crawl|report|plot|lighthouse|keywords|warnings|enrich|google|chat`
- `preserve_crawl_history` (default true): append crawls; `false` truncates crawl tables but restores `report_payload`, Lighthouse, `google_data`, `keyword_data`, `keyword_history`, `keyword_suggest_cache`, and `crawl_runs`
- `DATABASE_URL` env: PostgreSQL connection string (required). `DATA_DIR`: secrets + shadow config (Docker: `/data`).
- Pipeline storage (crawl, edges, nodes, report payload, Lighthouse, keywords, warnings) lives in PostgreSQL only. Deliverables use the Export view, `GET /api/report/export`, or MCP `export_*` tools, not files written by the main pipeline step.
- Pool tuning: `DB_POOL_MIN` / `DB_POOL_MAX` (Python), `PGPOOL_MAX` (Node). Bulk crawl writes via `executemany`; optional `crawl_stream_to_db` streams rows during fetch. Per-URL raw HTML: `crawl_page_html` table (migration `015`); API `GET/POST /api/crawl/page-html` (localhost).
- `web/` APIs: `/api/report/*` read routes (payload, meta, history; not localhost-guarded, so protect with `AUTH_*` when exposed); `/api/run` spawns Python (localhost); `/api/jobs`, `/api/jobs/[id]`, `/api/jobs/[id]/cancel` (localhost); `/api/crawl/browser-status`, `/api/crawl/page-html` (localhost); `/api/pipeline-config` GET/PUT; `/api/llm-config` GET/PUT; `/api/chat` POST (SSE); `/api/chat/sessions` GET/POST; `/api/ollama/status` (localhost); `/api/properties/{id}/google/links/import` POST; `PipelineRunnerFab` saves pipeline + LLM state before each run. Full route list: `web/app/api/**/route.ts`.
- MCP: `python -m website_profiling.mcp` (stdio) or `python -m website_profiling.mcp.http` (remote Streamable HTTP). Configure at `/mcp` in the web UI. See `docs/MCP.md`.
- AI Chat UI: `/chat`, a property-scoped chat with saved sessions (`chat_sessions`, `chat_messages`; migration `012_chat_sessions`).
- Job store: PostgreSQL `pipeline_jobs` when `DATABASE_URL` is set (`pipelineJobsDb.ts`: status, timestamps, truncated logs). In-memory map in `pipelineJobs.ts` holds live log tail and child process handles; stale rows reconciled via `PIPELINE_JOB_STALE_HOURS`.
- Schema head: `015_crawl_page_html` (recent: `013` link_edges/discovery, `014` job log truncation, `015` per-URL HTML storage).
- Docker: `Dockerfile` + `docker-compose.yml` (postgres + web); `docker-compose.prod.yml` (production + remote MCP on `:8000`); `docker-compose.pull.yml` for pre-built images (`WEB_IMAGE`); `LIGHTHOUSE_CHROME_FLAGS`
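As a rough illustration of the Python-side pool tuning mentioned in the list above, the sketch below reads `DB_POOL_MIN` and `DB_POOL_MAX` from the environment and validates them before a pool is built. The helper name `read_pool_bounds` and the default values (1 and 10) are assumptions for this example; the project's real defaults are not stated here.

```python
import os
from typing import Mapping, Tuple


def read_pool_bounds(
    env: Mapping[str, str] = os.environ,
    default_min: int = 1,   # assumed default, not taken from the project
    default_max: int = 10,  # assumed default, not taken from the project
) -> Tuple[int, int]:
    """Parse and sanity-check DB_POOL_MIN / DB_POOL_MAX."""
    pool_min = int(env.get("DB_POOL_MIN", default_min))
    pool_max = int(env.get("DB_POOL_MAX", default_max))
    if pool_min < 0 or pool_max < 1:
        raise ValueError("pool sizes must be non-negative, with max >= 1")
    if pool_min > pool_max:
        raise ValueError("DB_POOL_MIN must not exceed DB_POOL_MAX")
    return pool_min, pool_max
```

Failing fast on an inverted range surfaces a misconfigured environment at startup rather than on the first query.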
Where to edit
| Task | Where |
|---|---|
| Crawl | crawl/crawler.py, crawl/fetchers/ |
| Report | reporting/builder.py, reporting/categories.py |
| DB schema | alembic/versions/ |
| Local analysis | analysis/local.py, requirements.txt |
| AI insights (LLM) | llm/enrich.py, llm/agent.py, llm_config.py, requirements.txt |
| Audit query tools (MCP + chat) | tools/audit_tools/, mcp/server.py, mcp/http_server.py, commands/chat_cmd.py |
| Config / CLI | config.py (load_config, load_config_from_db), cli.py, input.txt.example |
| UI pipeline schema | web/src/lib/pipelineConfigSchema.ts |
| UI LLM schema | web/src/lib/llmConfigSchema.ts |
| UI config I/O | web/src/server/pipelineConfig.ts, web/src/server/llmConfig.ts |
Schema changes: add Alembic migration (alembic revision).
Company standards: UI copy in web/src/strings.json (Site Audit, Properties, Run audit). Data provenance on report_meta in report payload. Docs: docs/COMPANY_STANDARDS.md, docs/GLOSSARY.md. Migration 003_company_standards (properties, pipeline_jobs, audit_log). Durable jobs in web/src/server/pipelineJobsDb.ts. Export: GET /api/report/export, src/website_profiling/tools/export_audit.py.
Common footguns (check before finishing web or DB work)
These recur when adding features. Verify explicitly — do not assume tests caught them.
- React context: `useReport` / `ReportProvider`
  - Report views call `useReport()`. That only works inside `ReportAppClient` → `ReportProvider`.
  - Do: Render report views via `ReportShell` (wraps `ReportAppClient` internally).
  - Don't: Import a view directly in `app/*/page.tsx` without `ReportShell`.
  - Standalone routes under `web/app/` (e.g. `log-analyzer`, `indexation`) are not auto-wrapped by `(reports)/layout`.

        // ✅
        import ReportShell from '@/ReportShell';

        export default function Page() {
          return <ReportShell slug="log-analyzer" />;
        }
- Python: local imports shadow module imports
  - `from ..config import get_int` anywhere inside a function makes that name local for the entire function. Using it earlier → `UnboundLocalError`.
  - Do: Use the module-level import (see top of `reporting/builder.py`).
  - Don't: Re-import inside a function if the same name is used above that line in the same function.
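The shadowing behaviour is easy to reproduce in isolation. In the sketch below, `get_int` is a stand-in with a made-up body (not the helper from `config.py`), and `from math import floor as get_int` plays the role of the in-function `from ..config import get_int`; any import that binds the same name has the same effect.

```python
def get_int(value, default=0):
    """Stand-in for a module-level helper; the body is hypothetical."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default


def broken(raw):
    # Raises UnboundLocalError: the import further down makes `get_int`
    # a local name for the whole function, so it is unbound at this point.
    limit = get_int(raw)
    from math import floor as get_int  # late re-import of the same name
    return limit


def fixed(raw):
    # Relies on the module-level name only; nothing rebinds it locally.
    return get_int(raw)
```

Calling `broken("5")` fails even though the module-level `get_int` exists, because Python decides which names are local when the function is compiled, not when each line runs.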
- PostgreSQL rows: never `row[0]`
  - Connections may use psycopg `dict_row`. `row[0]` → `KeyError: 0` on dict rows; tuple-only unit tests still pass.
  - Do: `_row_field(row, "id", index=0)` from `website_profiling.db._common` (pattern in `property_store.py`).
  - Don't: `fetchone()[0]` on `INSERT … RETURNING` without `_row_field`.

        from ._common import _row_field

        row = cur.fetchone()
        rid = _row_field(row, "id", index=0)
        report_id = int(rid) if rid is not None else None
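To show why a helper with this call shape is safe for both row types, here is a minimal sketch. It is not the project's `_row_field`; `row_field_sketch` is a hypothetical reimplementation that assumes dict rows behave like a `Mapping` and tuple rows like a `Sequence`.

```python
from typing import Any, Mapping, Optional, Sequence, Union

Row = Union[Mapping[str, Any], Sequence[Any]]


def row_field_sketch(row: Optional[Row], key: str, index: int = 0) -> Any:
    """Read one column from a dict-style or tuple-style row.

    Hypothetical stand-in for website_profiling.db._common._row_field.
    """
    if row is None:
        # fetchone() returned no row.
        return None
    if isinstance(row, Mapping):
        # dict_row-style result: look up by column name.
        return row.get(key)
    # Tuple-style result: fall back to the positional index.
    return row[index]
```

With this shape the same call site works whether or not the connection was created with a dict row factory, which is the case tuple-only unit tests miss.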
- Python: local vs CI coverage gates (three jobs, not one)
  - CI runs three separate pytest coverage jobs (see `.github/workflows/ci.yml` and `scripts/local-test.sh`):

    | Gate | Config | Source | Threshold | Test scope |
    |---|---|---|---|---|
    | Core | `.coveragerc` | all packages except `tools/` and `reporting/` | 100% | `pytest tests/ -m "not browser"` |
    | Reporting | `.coveragerc.reporting` | `website_profiling.reporting` | 100% | `pytest tests/reporting/` |
    | Tools | `.coveragerc.tools` | `website_profiling.tools` | 100% | `pytest tests/tools/` |

  - Symptom: `./local-test` or core pytest passes at 100%, but CI fails on tools/reporting (e.g. 84% tools).
  - Causes: (a) only ran core pytest, not reporting/tools gates; (b) added reporting/tools tests outside `tests/reporting/` or `tests/tools/`; (c) changed code under `website_profiling/tools/` without tests that hit those lines in the tools gate subset.
  - Do: Run full `./local-test` before push. Put reporting coverage tests in `tests/reporting/` and tools coverage tests in `tests/tools/` (one module per file, e.g. `test_<module>_coverage.py`). Keep bash and PowerShell local-test scripts in sync.
  - Don't: Assume `pytest tests/` alone matches CI. Don't maintain long per-file lists in CI; use the directory gates above.
- Python: `runpy.run_module` / `__main__` guard tests
  - Tests that execute a module as `__main__` via `runpy.run_module(..., run_name="__main__")` emit `RuntimeWarning: '<module>' found in sys.modules after import of package ...` when the same module was already imported at the top of the test file (or by another import).
  - Do: Before `runpy.run_module`, remove the target from `sys.modules` so Python re-executes `__main__` cleanly. Name tests `test_module_main_guard` (see `tests/test_schedule_runner.py`).
  - Don't: Call `runpy.run_module` on a module already imported in that test file without popping it first.

        import runpy
        import sys

        sys.modules.pop("website_profiling.tools.schedule_runner", None)
        runpy.run_module(
            "website_profiling.tools.schedule_runner",
            run_name="__main__",
            alter_sys=False,
        )
Checklist: new report page uses ReportShell · no duplicate local imports in long functions · new fetchone() uses _row_field · ./local-test passes all three coverage gates · new reporting/tools coverage tests live under tests/reporting/ or tests/tools/, with bash and PowerShell local-test scripts kept in sync · runpy main-guard tests pop sys.modules first