CLAUDE.md - Development Guide

Project Overview

Discogsography is a Python 3.13+ / Rust microservices platform that transforms Discogs music database exports into Neo4j knowledge graphs and PostgreSQL analytics.

CRITICAL: Use uv exclusively for all Python operations. Never use pip, python, pytest, or mypy directly; always prefix with uv run. See uv Commands below.

AI Development Rules

  • ALWAYS use uv run for any Python command (pytest, mypy, ruff, python scripts)
  • Use git worktrees for all feature work: create them in the .worktrees/ directory and branch from origin/main. Each worktree = one branch = one PR. Use the superpowers:using-git-worktrees skill.
  • Open a PR for every change; never push directly to main
  • Mermaid diagrams for all diagrams in Markdown files
  • Lowercase filenames with hyphens for new markdown files (except README). Do not rename existing markdown files.
  • Emojis in GitHub Actions step names; single quotes in ${{ }} expressions, double quotes for YAML strings
  • Composite actions preferred for reusable workflow steps (see .github/actions/)
  • Add perf tests for new API endpoints: update tests/perftest/config.yaml and tests/perftest/run_perftest.py
  • All log messages must use emojis from docs/emoji-guide.md; no ad-hoc emojis
  • pyproject.toml ordering: [build-system] → [project] → [project.scripts] → [tool.hatch...] → tool configs (ruff, mypy, coverage, pytest) → [dependency-groups]. Sort dependencies alphabetically. Service-specific files extend from root config.

Directory Structure

api/                  API service - all user-facing HTTP endpoints (auth, search, graph, OAuth, insights proxy, MusicBrainz)
brainzgraphinator/    Brainzgraphinator service - MusicBrainz data → Neo4j enrichment
brainztableinator/    Brainztableinator service - MusicBrainz data → PostgreSQL
common/               Shared library - config, models, utilities used by all Python services
dashboard/            Dashboard service - real-time monitoring UI
explore/              Explore service - static file serving for graph exploration frontend (Vitest for JS tests)
extractor/            Rust-based extractor - high-performance Discogs XML and MusicBrainz JSONL ingestion
graphinator/          Graphinator service - consumes messages, builds Neo4j graph
insights/             Insights service - precomputed analytics (proxied via API at /api/insights/*)
mcp-server/           MCP server - exposes knowledge graph to AI assistants via API (no direct DB access)
schema-init/          Schema initialization - one-time Neo4j and PostgreSQL schema setup
tableinator/          Tableinator service - consumes messages, builds PostgreSQL tables
utilities/            Monitoring utilities - queue monitor, error checker, system monitor
tests/                All tests organized by service (tests/api/, tests/common/, etc.)
scripts/              Build and update scripts
docs/                 Documentation
backups/              Database backups

Architecture Notes

  • Extractor supports two modes: --source discogs (XML → 4 fanout exchanges) and --source musicbrainz (JSONL → 4 fanout exchanges). It has zero knowledge of consumers.
  • Exchange naming: {project}-{source}-{type} pattern, configured via the DISCOGS_EXCHANGE_PREFIX and MUSICBRAINZ_EXCHANGE_PREFIX env vars
  • Discogs exchanges: discogsography-discogs-{artists,labels,masters,releases} (4 fanout exchanges)
  • MusicBrainz exchanges: discogsography-musicbrainz-{artists,labels,release-groups,releases} (4 fanout exchanges)
  • Each consumer (graphinator, tableinator, brainzgraphinator, brainztableinator) independently declares its own queues, DLQs, and DLXs.
  • Brainzgraphinator enriches existing Neo4j nodes with MusicBrainz metadata (properties, relationships, cross-references). Skips entities without Discogs matches.
  • Brainztableinator stores all MusicBrainz data in the musicbrainz PostgreSQL schema, including entities without Discogs matches, along with relationships and external links.
  • Insights fetches data from API internal endpoints (/api/internal/insights/*) over HTTP; it does NOT connect to Neo4j directly. Uses Redis for caching.
  • Explore serves static files only: no external HTTP endpoints, no Neo4j env vars.
  • State markers: The extractor uses version-specific state markers (.extraction_status_<version>.json) to track progress. See docs/state-marker-system.md.

uv Commands

uv sync --all-extras             # Install/sync all dependencies
uv add package-name              # Add dependency (updates pyproject.toml + uv.lock)
uv add --dev package-name        # Add dev dependency
uv run pytest                    # Run tests
uv run mypy .                    # Type checking
uv run ruff check .              # Linting
uv run ruff format .             # Formatting
uv run python script.py          # Run any Python script
uv lock --upgrade-package name   # Update specific package

just Commands (preferred)

The justfile is the single source of truth for all commands. Run just --list for the full list.

Setup

just install              # uv sync --all-extras
just install-all          # Sync + editable install all services (CI)
just install-e2e          # Frozen sync + E2E subset (CI)
just install-js           # cd explore && npm ci
just init                 # Install pre-commit hooks
just update-deps          # Comprehensive update (Python, Rust, pre-commit, Docker)
just update-uv            # Update uv itself
just lock-upgrade         # Lock with upgrades
just sync                 # Sync all deps (dev + extras)
just sync-upgrade         # Sync with upgrades
just update-npm           # Update Explore frontend npm deps
just update-cargo         # Update Rust deps
just update-hooks         # Update pre-commit hooks

Testing

just test                 # Python tests (excluding E2E)
just test-js              # JavaScript tests (Vitest)
just test-cov             # Python tests with coverage
just test-js-cov          # JavaScript tests with coverage
just test-e2e             # End-to-end browser tests
just test-all             # Everything including E2E
just test-parallel        # All service tests in parallel (fastest)
just test-api             # API tests with coverage
just test-common          # Common library tests with coverage
just test-dashboard       # Dashboard tests with coverage
just test-explore         # Explore tests with coverage
just test-extractor       # Rust extractor tests
just test-extractor-cov   # Rust tests with coverage (cargo-llvm-cov)
just test-insights        # Insights tests with coverage
just test-graphinator     # Graphinator tests with coverage
just test-schema-init     # Schema-init tests with coverage
just test-tableinator     # Tableinator tests with coverage
just test-brainzgraphinator   # Brainzgraphinator tests with coverage
just test-brainztableinator   # Brainztableinator tests with coverage
just test-mcp-server      # MCP server tests with coverage

Code Quality

just lint                 # All pre-commit hooks
just lint-python          # Ruff + mypy only
just format               # Auto-format Python (ruff format)
just security             # Security checks (bandit)
just pip-audit            # Python dependency vulnerability scan

Rust

just extractor-build      # Build release
just extractor-test       # Run tests
just extractor-bench      # Run benchmarks
just extractor-lint       # Clippy (warnings = errors)
just extractor-fmt        # Format code
just extractor-fmt-check  # Check formatting (CI)
just extractor-audit      # Rust advisory database scan
just extractor-deny       # Rust license and policy check
just extractor-clean      # Clean build artifacts

Docker

just up                   # Start all services
just down                 # Stop all services
just logs                 # Follow service logs
just rebuild              # Down + build + up
just build                # Build all service images
just build-prod           # Build production images
just deploy-prod          # Deploy in production mode

Monitoring & Cleanup

just monitor              # Monitor RabbitMQ queues
just check-errors         # Check service logs for errors
just system-monitor       # System resource monitoring
just clean                # Remove temp files and caches
just deep-clean           # Clean + Docker volumes (destructive)

Service Ports

| Service | Port | Health Port |
| --- | --- | --- |
| API | 8004 | 8005 |
| Dashboard | 8003 | 8003 |
| Explore | 8006 | 8007 |
| Insights | 8008 | 8009 |
| Extractor | - | 8000 |
| Graphinator | - | 8001 |
| Tableinator | - | 8002 |
| Brainztableinator | - | 8010 |
| Brainzgraphinator | - | 8011 |
| Neo4j | 7474 (browser), 7687 (bolt) | - |
| PostgreSQL | 5433 (mapped from 5432) | - |
| RabbitMQ | 5672 (AMQP), 15672 (management) | - |

Environment Variables

Core variables used across services (individual components, not composite URLs):

  • NEO4J_HOST, NEO4J_USERNAME, NEO4J_PASSWORD - Neo4j connection
  • POSTGRES_HOST, POSTGRES_USERNAME, POSTGRES_PASSWORD, POSTGRES_DATABASE - PostgreSQL connection
  • RABBITMQ_HOST, RABBITMQ_USERNAME, RABBITMQ_PASSWORD - RabbitMQ connection
  • REDIS_HOST - Redis hostname
  • JWT_SECRET_KEY - HMAC-SHA256 signing secret (API only)
  • ENCRYPTION_MASTER_KEY - HKDF master key for OAuth + TOTP encryption (API only)
  • API_BASE_URL - API service URL (used by Explore, Insights, MCP Server)
  • LOG_LEVEL - Logging level (defaults to INFO)

All variables support _FILE variants for Docker Compose runtime secrets in production. See docs/configuration.md for the full reference.
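
As a rough illustration (not the actual common-library helper), resolving a variable with its _FILE counterpart might look like the sketch below; the function name read_env is hypothetical:

```python
import os
from pathlib import Path


def read_env(name: str, default: str | None = None) -> str | None:
    """Resolve NAME, preferring a NAME_FILE secret path over the plain variable.

    Hypothetical helper; the real lookup lives in the common config module.
    """
    file_path = os.environ.get(f"{name}_FILE")
    if file_path:
        return Path(file_path).read_text(encoding="utf-8").strip()
    return os.environ.get(name, default)


postgres_password = read_env("POSTGRES_PASSWORD")
log_level = read_env("LOG_LEVEL", default="INFO")
```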

Code Style

  • Type hints on all function parameters and returns
  • PEP 8 with 150-character line length (Ruff formatter)
  • Descriptive variable names, docstrings on public APIs
  • Logging format: %(asctime)s - {service_name} - %(name)s - %(levelname)s - %(message)s (see the sketch after this list)
  • Services log to /logs/{service_name}.log
  • Each service displays ASCII art on startup (pure text, no emojis)
  • Never log sensitive data (passwords, tokens, PII)
  • Run containers as non-root users
  • Maintain >80% code coverage
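
A minimal sketch of a logging setup that follows the format and file-path rules above; the service name, handler choices, and example emoji are illustrative, and the real setup lives in the shared common library:

```python
import logging

SERVICE_NAME = "graphinator"  # illustrative; each service substitutes its own name

logging.basicConfig(
    level=logging.INFO,
    format=f"%(asctime)s - {SERVICE_NAME} - %(name)s - %(levelname)s - %(message)s",
    handlers=[
        logging.StreamHandler(),
        logging.FileHandler(f"/logs/{SERVICE_NAME}.log"),  # services log to /logs/{service_name}.log
    ],
)

# Emoji choice must come from docs/emoji-guide.md; this one is only an example.
logging.getLogger(__name__).info("🚀 Service starting")
```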

Common Bug Prevention

PostgreSQL autocommit contract

  • AsyncPostgreSQLPool sets autocommit=True on all connections and resets it on return to pool
  • Before conn.transaction(): always call await conn.set_autocommit(False) first
  • After transaction exits: autocommit is restored by the pool automatically; do not rely on manual restoration
  • Single-statement writes (upserts, inserts): use autocommit directly, no conn.transaction() needed
  • Never call await conn.commit() on autocommit connections; it's a no-op
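
A minimal sketch of this contract, assuming the pool exposes an acquire() context manager and the connection API named in these bullets (set_autocommit, transaction); SQL and table names are illustrative:

```python
# `pool` is an AsyncPostgreSQLPool; acquire(), the SQL, and the table names are
# illustrative assumptions, not the exact project API.

async def upsert_artist(pool, artist_id: int, name: str) -> None:
    # Single-statement write: rely on autocommit, no transaction block needed.
    async with pool.acquire() as conn:
        await conn.execute(
            "INSERT INTO artists (id, name) VALUES (%s, %s) "
            "ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name",
            (artist_id, name),
        )


async def replace_artist_aliases(pool, artist_id: int, aliases: list[str]) -> None:
    # Multi-statement write: disable autocommit first, then use conn.transaction().
    async with pool.acquire() as conn:
        await conn.set_autocommit(False)
        async with conn.transaction():
            await conn.execute("DELETE FROM aliases WHERE artist_id = %s", (artist_id,))
            for alias in aliases:
                await conn.execute(
                    "INSERT INTO aliases (artist_id, alias) VALUES (%s, %s)",
                    (artist_id, alias),
                )
        # No manual conn.commit() and no manual autocommit restore: the pool
        # resets autocommit=True when the connection is returned.
```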

asyncio primitives

  • Never create asyncio.Lock, asyncio.Queue, asyncio.Event, or asyncio.Semaphore in __init__ or at module level; they can end up bound to a different event loop than the one the service actually runs on
  • This applies to all scopes: __init__, class bodies, module-level globals, and dataclass field(default_factory=...)
  • Initialize as None in __init__/module scope, create lazily in the first async method or in initialize()
  • Pattern: self._lock: asyncio.Lock | None = None in __init__, then if self._lock is None: self._lock = asyncio.Lock() in the first async method
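
A minimal sketch of the lazy-initialization pattern described above; the class and method names are illustrative:

```python
import asyncio


class MessageBuffer:
    """Illustrative class showing lazy creation of asyncio primitives."""

    def __init__(self) -> None:
        # Do NOT create the lock here: __init__ may run before (or outside)
        # the event loop the service will actually use.
        self._lock: asyncio.Lock | None = None
        self._items: list[dict] = []

    async def add(self, item: dict) -> None:
        if self._lock is None:  # created lazily inside a running event loop
            self._lock = asyncio.Lock()
        async with self._lock:
            self._items.append(item)
```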

Batch processor ack/nack contract

  • _flush_queue is the single authority for ack/nack; individual _process_*_batch methods must NEVER call ack or nack directly
  • If a message is invalid (e.g., missing id), mark it in a set (e.g., nack_indices) and let _flush_queue handle it
  • This prevents double-ack-after-nack when _flush_queue acks all messages after a "successful" batch that internally nacked some
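
A hypothetical sketch of this contract; the method signatures and delivery API are illustrative, not the actual batch processor code:

```python
# Hypothetical batch-processor methods: the batch method only records bad
# indices; _flush_queue alone settles every delivery exactly once.

async def _process_artists_batch(self, messages: list[dict], nack_indices: set[int]) -> None:
    for i, message in enumerate(messages):
        if "id" not in message:
            nack_indices.add(i)  # mark as invalid; do NOT nack here
            continue
        await self._write_artist(message)


async def _flush_queue(self, messages, deliveries) -> None:
    nack_indices: set[int] = set()
    await self._process_artists_batch(messages, nack_indices)
    # Single authority for ack/nack: no delivery can be acked after it was nacked.
    for i, delivery in enumerate(deliveries):
        if i in nack_indices:
            await delivery.nack(requeue=False)
        else:
            await delivery.ack()
```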

Lock scope discipline

  • If a lock protects shared state, ALL reads AND writes of that state must acquire the lock, not just save()/flush()
  • A lock that only protects serialization while mutations are unprotected is security theater
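
A brief sketch of the rule with hypothetical class and field names: both the read path and the write path take the lock, not only the persistence step:

```python
import asyncio


class TokenStore:
    """Hypothetical store: every read AND write of _tokens goes through the lock."""

    def __init__(self) -> None:
        self._lock: asyncio.Lock | None = None  # created lazily (see asyncio rules above)
        self._tokens: dict[str, str] = {}

    async def _get_lock(self) -> asyncio.Lock:
        if self._lock is None:
            self._lock = asyncio.Lock()
        return self._lock

    async def put(self, user: str, token: str) -> None:
        async with await self._get_lock():
            self._tokens[user] = token  # write: locked

    async def get(self, user: str) -> str | None:
        async with await self._get_lock():
            return self._tokens.get(user)  # read: locked too, not just save()/flush()
```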

Timestamp comparison convention

  • When checking if a token/session was invalidated by an event (password change, revocation), use <= (inclusive): a token issued at the exact same second as the event MUST be invalidated
  • Pattern: if issued_at <= int(changed_at): (never use <)

Field name consistency

  • MusicBrainz messages from the extractor use mb_type (not type) for entity type, service (not type) for external link service names
  • When reading message fields, always match the exact field name the extractor outputs; check extractor/src/jsonl_parser.rs for the source of truth
  • When adding new fields, use consistent naming: extractor defines the schema, consumers must match
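
An illustrative consumer-side read of those fields; the payload shape and values are made up, and only the mb_type and service field names follow the extractor schema:

```python
# Illustrative message: values and overall shape are invented for the example.
message = {
    "mb_type": "artist",  # entity type field is mb_type, NOT type
    "id": "some-musicbrainz-id",
    "external_links": [
        {"service": "discogs", "url": "https://www.discogs.com/artist/123"},
    ],
}

entity_type = message["mb_type"]
for link in message.get("external_links", []):
    service_name = link["service"]  # external link service field is service, NOT type
```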

Message acknowledgment

  • Never ack a message before processing completes; ack only after a successful DB write
  • Always nack (do not silently skip) messages with missing required fields; silent skips cause data loss
  • After a nack in an exception handler, always return; do not fall through to subsequent processing code
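
An illustrative consumer callback that applies all three rules; an aio-pika-style message API and a stubbed DB write are assumed:

```python
import json


async def write_to_database(payload: dict) -> None:
    """Stub standing in for the real DB write."""


async def on_message(message) -> None:
    try:
        payload = json.loads(message.body)
        if "id" not in payload:
            await message.nack(requeue=False)  # nack, never silently skip
            return                             # and return: no fall-through
        await write_to_database(payload)
        await message.ack()                    # ack ONLY after the write succeeds
    except Exception:
        await message.nack(requeue=False)
        return
```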

Exchange and queue naming

  • Exchange pattern: {project}-{source}-{type} (e.g., discogsography-discogs-artists)
  • Queue pattern: {exchange-prefix}-{consumer}-{type} (e.g., discogsography-discogs-graphinator-artists)
  • Use the DISCOGS_EXCHANGE_PREFIX and MUSICBRAINZ_EXCHANGE_PREFIX env vars; never hardcode exchange names
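
A sketch of building names from the prefix; the helper functions and the fallback prefix value are hypothetical, the patterns are the ones documented above:

```python
import os

# Fallback value shown for illustration only; the real default comes from config.
DISCOGS_PREFIX = os.environ.get("DISCOGS_EXCHANGE_PREFIX", "discogsography-discogs")


def exchange_name(prefix: str, data_type: str) -> str:
    return f"{prefix}-{data_type}"             # {project}-{source}-{type}


def queue_name(prefix: str, consumer: str, data_type: str) -> str:
    return f"{prefix}-{consumer}-{data_type}"  # {exchange-prefix}-{consumer}-{type}


assert exchange_name(DISCOGS_PREFIX, "artists") == "discogsography-discogs-artists"
assert queue_name(DISCOGS_PREFIX, "graphinator", "artists") == "discogsography-discogs-graphinator-artists"
```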

Rarity tier boundaries

  • Use >= (inclusive) for threshold comparisons (if score >= threshold:) so that a score exactly at the boundary belongs to the higher tier
  • Same principle applies to any tiered classification with ordered thresholds
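
A small sketch of inclusive threshold classification; the tier names and thresholds are made up, only the comparison direction follows the rule above:

```python
# Ordered from highest threshold down; names and numbers are illustrative.
TIERS = [
    ("legendary", 95),
    ("rare", 75),
    ("uncommon", 50),
]


def classify(score: float) -> str:
    for tier, threshold in TIERS:
        if score >= threshold:  # inclusive: a score of exactly 75 is "rare"
            return tier
    return "common"


assert classify(75) == "rare"
```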

Docker service names

  • Extractor runs as two services, extractor-discogs and extractor-musicbrainz; there is no monolithic extractor service
  • When referencing services in utility scripts, always use the actual docker-compose.yml service names

Subprocess timeout discipline

  • Every subprocess.run() call must include timeout=; omitting it risks indefinite hangs
  • Always catch subprocess.TimeoutExpired alongside subprocess.CalledProcessError
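
A sketch applying both rules; the command itself is illustrative:

```python
import subprocess

try:
    result = subprocess.run(
        ["docker", "compose", "ps", "--format", "json"],
        capture_output=True,
        text=True,
        check=True,
        timeout=30,  # every subprocess.run() call must set a timeout
    )
except subprocess.TimeoutExpired:
    print("command timed out")
except subprocess.CalledProcessError as exc:
    print(f"command failed with exit code {exc.returncode}")
else:
    print(result.stdout)
```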

Fix-one-fix-all discipline

  • When fixing a bug pattern, grep -rn for ALL instances of the same pattern across the entire codebase before marking the fix complete
  • Common patterns that repeat across files: timestamp comparisons, asyncio primitive creation, ack/nack guards, subprocess calls
  • Token validation code exists in api/dependencies.py, api/api.py, api/routers/sync.py, and api/routers/snapshot.py; changes to one must be applied to all
  • Batch processors exist in graphinator/batch_processor.py and tableinator/batch_processor.py; changes to one should be mirrored in the other