Discogsography is a Python 3.13+ / Rust microservices platform that transforms Discogs music database exports into Neo4j knowledge graphs and PostgreSQL analytics.
**CRITICAL:** Use `uv` exclusively for all Python operations. Never invoke `pip`, `python`, `pytest`, or `mypy` directly; always prefix with `uv run`. See uv Commands below.
- ALWAYS use `uv run` for any Python command (pytest, mypy, ruff, python scripts)
- Use git worktrees for all feature work: create them in the `.worktrees/` directory, branching from `origin/main`. Each worktree = one branch = one PR. Use the `superpowers:using-git-worktrees` skill.
- Open a PR for every change; never push directly to `main`
- Use Mermaid diagrams for all diagrams in Markdown files
- Use lowercase filenames with hyphens for new Markdown files (except README). Do not rename existing Markdown files.
- Use emojis in GitHub Actions step names; single quotes inside `${{ }}` expressions, double quotes for YAML strings
- Prefer composite actions for reusable workflow steps (see `.github/actions/`)
- Add perf tests for new API endpoints: update `tests/perftest/config.yaml` and `tests/perftest/run_perftest.py`
- All log messages must use emojis from `docs/emoji-guide.md`; no ad-hoc emojis
- `pyproject.toml` ordering: `[build-system]` → `[project]` → `[project.scripts]` → `[tool.hatch...]` → tool configs (`ruff`, `mypy`, `coverage`, `pytest`) → `[dependency-groups]`. Sort dependencies alphabetically. Service-specific files extend from the root config.
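A minimal `pyproject.toml` skeleton illustrating the required section order; the concrete names and values below are placeholders, not the repository's actual configuration:

```toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "example-service"          # placeholder
version = "0.1.0"
dependencies = [
    "aio-pika",                    # sorted alphabetically
    "neo4j",
]

[project.scripts]
example-service = "example_service.main:main"

[tool.hatch.build.targets.wheel]
packages = ["example_service"]

[tool.ruff]
line-length = 150

[tool.mypy]
strict = true

[dependency-groups]
dev = ["pytest", "ruff"]
```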
| Directory | Purpose |
|---|---|
| `api/` | API service: all user-facing HTTP endpoints (auth, search, graph, OAuth, insights proxy, MusicBrainz) |
| `brainzgraphinator/` | Brainzgraphinator service: MusicBrainz data → Neo4j enrichment |
| `brainztableinator/` | Brainztableinator service: MusicBrainz data → PostgreSQL |
| `common/` | Shared library: config, models, utilities used by all Python services |
| `dashboard/` | Dashboard service: real-time monitoring UI |
| `explore/` | Explore service: static file serving for the graph exploration frontend (Vitest for JS tests) |
| `extractor/` | Rust-based extractor: high-performance Discogs XML and MusicBrainz JSONL ingestion |
| `graphinator/` | Graphinator service: consumes messages, builds the Neo4j graph |
| `insights/` | Insights service: precomputed analytics (proxied via the API at `/api/insights/*`) |
| `mcp-server/` | MCP server: exposes the knowledge graph to AI assistants via the API (no direct DB access) |
| `schema-init/` | Schema initialization: one-time Neo4j and PostgreSQL schema setup |
| `tableinator/` | Tableinator service: consumes messages, builds PostgreSQL tables |
| `utilities/` | Monitoring utilities: queue monitor, error checker, system monitor |
| `tests/` | All tests, organized by service (`tests/api/`, `tests/common/`, etc.) |
| `scripts/` | Build and update scripts |
| `docs/` | Documentation |
| `backups/` | Database backups |
- The extractor supports two modes: `--source discogs` (XML → 4 fanout exchanges) and `--source musicbrainz` (JSONL → 4 fanout exchanges). It has zero knowledge of consumers.
- Exchange naming follows the `{project}-{source}-{type}` pattern, configured via the env vars `DISCOGS_EXCHANGE_PREFIX` and `MUSICBRAINZ_EXCHANGE_PREFIX`.
- Discogs exchanges: `discogsography-discogs-{artists,labels,masters,releases}` (4 fanout exchanges)
- MusicBrainz exchanges: `discogsography-musicbrainz-{artists,labels,release-groups,releases}` (4 fanout exchanges)
- Each consumer (graphinator, tableinator, brainzgraphinator, brainztableinator) independently declares its own queues, DLQs, and DLXs.
- Brainzgraphinator enriches existing Neo4j nodes with MusicBrainz metadata (properties, relationships, cross-references). It skips entities without Discogs matches.
- Brainztableinator stores all MusicBrainz data in the `musicbrainz` PostgreSQL schema, including entities without Discogs matches, with relationships and external links.
- Insights fetches data from the API's internal endpoints (`/api/internal/insights/*`) over HTTP; it does NOT connect to Neo4j directly. It uses Redis for caching.
- Explore serves static files only: no external HTTP endpoints, no Neo4j env vars.
- State markers: the extractor uses version-specific state markers (`.extraction_status_<version>.json`) to track progress. See `docs/state-marker-system.md`.
```bash
uv sync --all-extras           # Install/sync all dependencies
uv add package-name            # Add dependency (updates pyproject.toml + uv.lock)
uv add --dev package-name      # Add dev dependency
uv run pytest                  # Run tests
uv run mypy .                  # Type checking
uv run ruff check .            # Linting
uv run ruff format .           # Formatting
uv run python script.py        # Run any Python script
uv lock --upgrade-package name # Update a specific package
```

The justfile is the single source of truth for all commands. Run `just --list` for the full list.
```bash
just install        # uv sync --all-extras
just install-all    # Sync + editable install all services (CI)
just install-e2e    # Frozen sync + E2E subset (CI)
just install-js     # cd explore && npm ci
just init           # Install pre-commit hooks
just update-deps    # Comprehensive update (Python, Rust, pre-commit, Docker)
just update-uv      # Update uv itself
just lock-upgrade   # Lock with upgrades
just sync           # Sync all deps (dev + extras)
just sync-upgrade   # Sync with upgrades
just update-npm     # Update Explore frontend npm deps
just update-cargo   # Update Rust deps
just update-hooks   # Update pre-commit hooks
```

```bash
just test                   # Python tests (excluding E2E)
just test-js                # JavaScript tests (Vitest)
just test-cov               # Python tests with coverage
just test-js-cov            # JavaScript tests with coverage
just test-e2e               # End-to-end browser tests
just test-all               # Everything including E2E
just test-parallel          # All service tests in parallel (fastest)
just test-api               # API tests with coverage
just test-common            # Common library tests with coverage
just test-dashboard         # Dashboard tests with coverage
just test-explore           # Explore tests with coverage
just test-extractor         # Rust extractor tests
just test-extractor-cov     # Rust tests with coverage (cargo-llvm-cov)
just test-insights          # Insights tests with coverage
just test-graphinator       # Graphinator tests with coverage
just test-schema-init       # Schema-init tests with coverage
just test-tableinator       # Tableinator tests with coverage
just test-brainzgraphinator # Brainzgraphinator tests with coverage
just test-brainztableinator # Brainztableinator tests with coverage
just test-mcp-server        # MCP server tests with coverage
```

```bash
just lint        # All pre-commit hooks
just lint-python # Ruff + mypy only
just format      # Auto-format Python (ruff format)
just security    # Security checks (bandit)
just pip-audit   # Python dependency vulnerability scan
```

```bash
just extractor-build     # Build release
just extractor-test      # Run tests
just extractor-bench     # Run benchmarks
just extractor-lint      # Clippy (warnings = errors)
just extractor-fmt       # Format code
just extractor-fmt-check # Check formatting (CI)
just extractor-audit     # Rust advisory database scan
just extractor-deny      # Rust license and policy check
just extractor-clean     # Clean build artifacts
```

```bash
just up          # Start all services
just down        # Stop all services
just logs        # Follow service logs
just rebuild     # Down + build + up
just build       # Build all service images
just build-prod  # Build production images
just deploy-prod # Deploy in production mode
```

```bash
just monitor        # Monitor RabbitMQ queues
just check-errors   # Check service logs for errors
just system-monitor # System resource monitoring
just clean          # Remove temp files and caches
just deep-clean     # Clean + Docker volumes (destructive)
```

| Service | Port | Health |
|---|---|---|
| API | 8004 | 8005 |
| Dashboard | 8003 | 8003 |
| Explore | 8006 | 8007 |
| Insights | 8008 | 8009 |
| Extractor | – | 8000 |
| Graphinator | – | 8001 |
| Tableinator | – | 8002 |
| Brainztableinator | – | 8010 |
| Brainzgraphinator | – | 8011 |
| Neo4j | 7474 (browser), 7687 (bolt) | – |
| PostgreSQL | 5433 (mapped from 5432) | – |
| RabbitMQ | 5672 (AMQP), 15672 (management) | – |
Core variables used across services (individual components, not composite URLs):

- `NEO4J_HOST`, `NEO4J_USERNAME`, `NEO4J_PASSWORD`: Neo4j connection
- `POSTGRES_HOST`, `POSTGRES_USERNAME`, `POSTGRES_PASSWORD`, `POSTGRES_DATABASE`: PostgreSQL connection
- `RABBITMQ_HOST`, `RABBITMQ_USERNAME`, `RABBITMQ_PASSWORD`: RabbitMQ connection
- `REDIS_HOST`: Redis hostname
- `JWT_SECRET_KEY`: HMAC-SHA256 signing secret (API only)
- `ENCRYPTION_MASTER_KEY`: HKDF master key for OAuth + TOTP encryption (API only)
- `API_BASE_URL`: API service URL (used by Explore, Insights, MCP Server)
- `LOG_LEVEL`: Logging level (defaults to INFO)

All variables support `_FILE` variants for Docker Compose runtime secrets in production. See `docs/configuration.md` for the full reference.
- Type hints on all function parameters and returns
- PEP 8 with 150-character line length (Ruff formatter)
- Descriptive variable names, docstrings on public APIs
- Logging format: `%(asctime)s - {service_name} - %(name)s - %(levelname)s - %(message)s`
- Services log to `/logs/{service_name}.log`
- Each service displays ASCII art on startup (pure text, no emojis)
- Never log sensitive data (passwords, tokens, PII)
- Run containers as non-root users
- Maintain >80% code coverage
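A minimal sketch of configuring a logger with the documented format string; `setup_logging` is a hypothetical helper and the service name is a placeholder:

```python
import logging


def setup_logging(service_name: str) -> logging.Logger:
    """Attach a handler whose format matches the documented pattern:
    %(asctime)s - {service_name} - %(name)s - %(levelname)s - %(message)s
    """
    fmt = f"%(asctime)s - {service_name} - %(name)s - %(levelname)s - %(message)s"
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(fmt))
    logger = logging.getLogger(service_name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```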
- `AsyncPostgreSQLPool` sets `autocommit=True` on all connections and resets it on return to the pool
- Before `conn.transaction()`: always call `await conn.set_autocommit(False)` first
- After the transaction exits: autocommit is restored by the pool automatically; do not rely on manual restoration
- Single-statement writes (upserts, inserts): use autocommit directly, no `conn.transaction()` needed
- Never call `await conn.commit()` on autocommit connections; it is a no-op
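The ordering above can be demonstrated with a stub connection; `StubConnection` is a hypothetical stand-in for a pooled async PostgreSQL connection, used here only so the call sequence is checkable without a live database:

```python
import asyncio


class StubConnection:
    """Hypothetical stand-in recording the autocommit/transaction call order."""

    def __init__(self) -> None:
        self.autocommit = True  # the pool hands out autocommit connections
        self.calls: list[str] = []

    async def set_autocommit(self, value: bool) -> None:
        self.autocommit = value
        self.calls.append(f"set_autocommit({value})")

    def transaction(self):
        conn = self

        class _Tx:
            async def __aenter__(self):
                conn.calls.append("BEGIN")
                return self

            async def __aexit__(self, *exc):
                conn.calls.append("COMMIT")
                return False

        return _Tx()

    async def execute(self, sql: str) -> None:
        self.calls.append(sql)


async def multi_statement_write(conn: StubConnection) -> None:
    # Rule: disable autocommit BEFORE entering conn.transaction()
    await conn.set_autocommit(False)
    async with conn.transaction():
        await conn.execute("INSERT INTO a ...")
        await conn.execute("INSERT INTO b ...")
    # Rule: the pool restores autocommit on return; no manual restore here
```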
- Never create `asyncio.Lock`, `asyncio.Queue`, `asyncio.Event`, or `asyncio.Semaphore` in `__init__` or at module level; they bind to the current event loop at creation time
- This applies to all scopes: `__init__`, class bodies, module-level globals, and dataclass `field(default_factory=...)`
- Initialize as `None` in `__init__`/module scope, create lazily in the first async method or in `initialize()`
- Pattern: `self._lock: asyncio.Lock | None = None` in `__init__`, then `if self._lock is None: self._lock = asyncio.Lock()` in the first async method
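The lazy-creation pattern looks like this in practice; `MessageBuffer` is an illustrative class, not code from the repository:

```python
import asyncio


class MessageBuffer:
    """Illustrative buffer using the lazy asyncio-primitive pattern."""

    def __init__(self) -> None:
        # Do NOT create asyncio primitives here: no event loop is running yet
        self._lock: asyncio.Lock | None = None
        self._items: list[str] = []

    async def add(self, item: str) -> None:
        if self._lock is None:  # first async call: a loop is running now
            self._lock = asyncio.Lock()
        async with self._lock:
            self._items.append(item)
```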
- `_flush_queue` is the single authority for ack/nack; individual `_process_*_batch` methods must NEVER call ack or nack directly
- If a message is invalid (e.g., missing `id`), mark it in a set (e.g., `nack_indices`) and let `_flush_queue` handle it
- This prevents double-ack-after-nack when `_flush_queue` acks all messages after a "successful" batch that internally nacked some
- If a lock protects shared state, ALL reads AND writes of that state must acquire the lock, not just `save()`/`flush()`
- A lock that only protects serialization while mutations are unprotected is security theater
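A short sketch of the rule, with an illustrative class (not repository code); note that the read path takes the lock exactly like the write path:

```python
import asyncio


class Counter:
    """Every read AND write of _value goes through the lock."""

    def __init__(self) -> None:
        self._lock: asyncio.Lock | None = None  # created lazily, per the asyncio rules
        self._value = 0

    async def _get_lock(self) -> asyncio.Lock:
        if self._lock is None:
            self._lock = asyncio.Lock()
        return self._lock

    async def increment(self) -> None:
        async with await self._get_lock():
            self._value += 1

    async def read(self) -> int:
        # Reads are locked too: an unlocked read can observe state
        # mid-mutation while a writer holds the lock elsewhere
        async with await self._get_lock():
            return self._value
```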
- When checking whether a token/session was invalidated by an event (password change, revocation), use `<=` (inclusive): a token issued at the exact same second as the event MUST be invalidated
- Pattern: `if issued_at <= int(changed_at):`, never `<`
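As a minimal sketch (function and parameter names are illustrative, not the actual API code):

```python
def token_invalidated(issued_at: int, changed_at: float) -> bool:
    """True if the token was issued at or before the invalidating event.

    Inclusive <= per the rule above: a token issued in the same second
    as the password change/revocation must be treated as invalidated.
    """
    return issued_at <= int(changed_at)
```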
- MusicBrainz messages from the extractor use `mb_type` (not `type`) for the entity type, and `service` (not `type`) for external link service names
- When reading message fields, always match the exact field name the extractor outputs; check `extractor/src/jsonl_parser.rs` for the source of truth
- When adding new fields, use consistent naming: the extractor defines the schema, consumers must match
- Never ack a message before processing completes; ack only after a successful DB write
- Always nack (not silently skip) messages with missing required fields; silent skips cause data loss
- After a nack in an exception handler, always `return`; do not fall through to subsequent processing code
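A sketch of this discipline for a single-message handler; `StubMessage` and `write_to_db` are hypothetical stand-ins for the real AMQP client objects, included so the control flow is self-contained:

```python
import asyncio


class StubMessage:
    """Test double recording ack/nack calls (illustrative only)."""

    def __init__(self, body: dict) -> None:
        self.body = body
        self.events: list[str] = []

    async def ack(self) -> None:
        self.events.append("ack")

    async def nack(self, requeue: bool) -> None:
        self.events.append(f"nack(requeue={requeue})")


async def handle_message(message: StubMessage, write_to_db) -> None:
    if "id" not in message.body:
        await message.nack(requeue=False)  # nack, never silently skip
        return                             # return immediately after nack
    try:
        await write_to_db(message.body)
    except Exception:
        await message.nack(requeue=True)
        return                             # never fall through after a nack
    await message.ack()                    # ack only after the write succeeded
```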
- Exchange pattern: `{project}-{source}-{type}` (e.g., `discogsography-discogs-artists`)
- Queue pattern: `{exchange-prefix}-{consumer}-{type}` (e.g., `discogsography-discogs-graphinator-artists`)
- Use the `DISCOGS_EXCHANGE_PREFIX` and `MUSICBRAINZ_EXCHANGE_PREFIX` env vars; never hardcode exchange names
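The two patterns can be sketched as helpers; the function names and the fallback prefix are illustrative assumptions, and real code should take the prefix from the env vars above:

```python
import os


def exchange_name(prefix: str, entity_type: str) -> str:
    """{project}-{source}-{type}: the prefix already encodes project-source."""
    return f"{prefix}-{entity_type}"


def queue_name(prefix: str, consumer: str, entity_type: str) -> str:
    """{exchange-prefix}-{consumer}-{type}."""
    return f"{prefix}-{consumer}-{entity_type}"


# The prefix comes from the environment, never hardcoded (fallback shown
# only for illustration):
discogs_prefix = os.environ.get("DISCOGS_EXCHANGE_PREFIX", "discogsography-discogs")
```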
- Use `>=` (inclusive) for threshold comparisons: `if score >= threshold:`; a score exactly at the boundary belongs to the higher tier
- The same principle applies to any tiered classification with ordered thresholds
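A minimal sketch of tiered classification with inclusive thresholds; the tier names and cutoffs are illustrative:

```python
# Descending cutoffs: first matching tier wins
TIERS = [(90, "high"), (50, "medium"), (0, "low")]


def classify(score: int) -> str:
    for threshold, tier in TIERS:
        if score >= threshold:  # inclusive: a boundary score takes the higher tier
            return tier
    return "low"
```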
- The extractor runs as two services, `extractor-discogs` and `extractor-musicbrainz`; there is no monolithic `extractor` service
- When referencing services in utility scripts, always use the actual `docker-compose.yml` service names
- Every `subprocess.run()` call must include `timeout=`; omitting it risks indefinite hangs
- Always catch `subprocess.TimeoutExpired` alongside `subprocess.CalledProcessError`
- When fixing a bug pattern, `grep -rn` for ALL instances of the same pattern across the entire codebase before marking the fix complete
- Common patterns that repeat across files: timestamp comparisons, asyncio primitive creation, ack/nack guards, subprocess calls
- Token validation code exists in `api/dependencies.py`, `api/api.py`, `api/routers/sync.py`, and `api/routers/snapshot.py`; changes to one must be applied to all
- Batch processors exist in `graphinator/batch_processor.py` and `tableinator/batch_processor.py`; changes to one should be mirrored in the other