LLM_CODING_AGENT.md

This file provides guidance to Claude Code (claude.ai/code) and other LLM-based coding agents when working with code in this repository.

Project Overview

The NTP Pool Monitor is a distributed monitoring system for the NTP Pool project. It consists of three main components:

ntppool-agent - Monitoring client that runs on distributed nodes
monitor-api - Central API server for coordination and configuration
monitor-scorer - Processes monitoring results and calculates server scores

Task Management

Use TodoWrite/TodoRead tools for complex multi-step tasks (3+ steps) or when user explicitly requests todo management.

Best Practices:

Break down complex tasks into specific, actionable items
Mark tasks in_progress before starting work (only ONE at a time)
Complete tasks immediately after finishing
Only mark completed when fully accomplished

Pre-Commit Checklist

MANDATORY before any git commit:

Run gofumpt -w on all changed .go files
Run make test to ensure all tests pass
Verify compilation with go build for affected packages
Run lint tools if available (golangci-lint run, go vet ./...)
Check for race conditions and proper error handling

Never commit changes unless explicitly asked by the user. NEVER USE git add -A - always use explicit, targeted git staging.

Development Commands

make tools          # Install required development tools
make generate       # Generate all code (runs sqlc then go generate ./...)
make build          # Build all components
make test           # Run comprehensive test suite

Code Generation

Never edit generated files directly - changes will be lost on regeneration.

Generated file patterns:

*.pb.go - Protocol buffer generated files
*.sql.go - sqlc generated database code
otel.go - OpenTelemetry instrumentation wrappers
*_string.go - Enum string methods
Files in api/pb/ and gen/ directories

Run make generate after:

Modifying query.sql or .proto files
Adding/modifying //go:generate directives
Use ./scripts/test-db.sh start for SQL query testing

Problem Analysis Framework

Systematic debugging approach:

Understand exact symptoms and failure conditions
Trace code flow step by step
Identify state, caching, and persistence points
Consider simple explanations first (connection pooling, timing, config)
Verify assumptions before implementing solutions
Check for race conditions and concurrent operations
Prefer targeted fixes over architectural changes

Common constraint system bugs:

Self-reference bugs: Check if entities compare against lists including themselves
Order-dependent logic: Verify processing order matches business priorities
State consistency: Ensure constraint checks use correct state (target vs current)

SQL analysis tips:

Examine ORDER BY clauses for data prioritization
Check JOIN patterns for constraint relationships
Use query ordering to understand conflict resolution

Task Completion Criteria

Before marking any coding task as complete:

Code compiles successfully (go build)
Tests pass (make test)
Code is formatted (gofumpt -w)
Basic functionality verified

Never mark completed if: tests fail, implementation is partial, or compilation errors exist.

Constraint Checking Architecture Patterns

Common patterns in monitor selection systems:

Iterative constraint checking: Check constraints after each promotion/demotion
Emergency override logic: Safety overrides for critical system states
Self-exclusion: Always exclude entity being evaluated from conflict detection
Target vs current state: Check constraints against appropriate state
Grandfathering: Allow existing violations to persist temporarily

Implementation best practices:

Lazy evaluation, consistent error handling, audit logging

Concurrency and Thread Safety

Race condition detection:

Identify shared state accessed by multiple goroutines
Ensure write operations use Lock(), not RLock()
Check atomic operations and channel usage

Mutex best practices:

Use write locks for modifications, read locks for reads
Minimize lock scope and avoid nested locks
Create methodUnsafe() variants that assume lock is held

Common patterns:

Configuration hot reloading with mutex protection
Background goroutines with proper context cancellation
Atomic file operations with proper locking

Go Code Standards

Logging:

Use *slog.Logger type for logger fields
Use contextual logging: log.InfoContext(ctx, ...)
Get logger from context: log := logger.FromContext(ctx)

Error Handling:

Always include context in error messages
Use structured logging for errors with relevant fields

CLI Framework (Kong):

CLI commands defined as structs with Kong struct tags
Common tags: name:"flag-name", short:"x", help:"Description", default:"value"
Follow existing patterns in client/cmd/cmd.go

Testing:

Use table-driven tests
Avoid testify/assert or similar tools
SQL queries tested through integration tests
Always use make targets for running tests (see Testing Framework section below)

Test Data Validation:

Validate constraint mathematics ensure expected outcomes are possible
Cross-check test logic before writing assertions
Document mathematical relationships in comments
Use CI tools (./scripts/test-ci-local.sh) for debugging

Testing Framework

ALWAYS use make targets for running tests. Never run test commands directly unless specifically documented.

Primary Test Targets

make test - Default comprehensive test suite

Runs all unit tests across the entire codebase
Use this for general development and pre-commit validation
Equivalent to go test ./... but preferred for consistency

make test-unit - Fast unit tests only

Runs tests with -short flag, skipping slower integration tests
Use for rapid development feedback cycles
No database dependencies required

make test-integration - Integration tests with database

Automatically starts test database if TEST_DATABASE_URL not set
Runs all tests tagged with integration build tag
Includes database-dependent tests and scorer integration tests
Use for validating database interactions and full system behavior

make test-all - Complete test suite

Runs both unit and integration tests sequentially
Equivalent to make test-unit test-integration
Use for comprehensive validation before major changes

Database Management Targets

make test-db-start - Start test database

Starts MySQL 8.0 container on port 3308
Loads schema automatically
Database persists until explicitly stopped

make test-db-stop - Stop test database

Cleanly shuts down and removes test database container
Use when finished with integration testing

make test-db-restart - Restart test database

Equivalent to make test-db-stop test-db-start
Use to reset database state between test runs

make test-db-reset - Reset database schema

Drops and recreates database, reloads schema
Use when schema changes require fresh database state

Specialized Test Targets

make test-load - Performance and load tests

Runs tests tagged with load build tag
Requires test database, automatically started if needed
Extended timeout (30 minutes) for long-running tests

make test-ci-local - CI environment emulation

Replicates CI testing environment locally
Use for debugging CI-specific test failures
Includes additional validation and cleanup steps

Direct Script Usage (Special Cases Only)

./scripts/test-scorer-integration.sh - Scorer-specific debugging

When to use: Debugging scorer-specific issues in isolation
Port conflict detection: Checks for existing database on port 3308
Guidance: Use make test-db-stop first if port conflicts occur
Cleanup: Automatically cleans up its own database container
Alternative: Use make test-db-start && go test ./scorer -tags=integration -v

Testing Workflow Examples

Development workflow:

make test-unit          # Quick feedback during development
make test-integration   # Validate database interactions
make test               # Final validation before commit

Database testing workflow:

make test-db-start      # Start persistent test database
go test ./ntpdb -v      # Test specific package with database
make test-db-stop       # Clean up when done

CI debugging workflow:

make test-ci-local      # Replicate CI environment
# If failures occur, use individual targets to isolate issues

Scorer debugging workflow:

# Option 1: Using make targets
make test-db-start
go test ./scorer -tags=integration -v

# Option 2: Using specialized script (handles its own database)
./scripts/test-scorer-integration.sh

Best Practices

Always prefer make targets over direct go test commands
Use test-unit for rapid iteration during development
Run test-integration before commits that touch database code
Clean up test databases with make test-db-stop when done
Use CI tools for debugging complex test failures
Isolate component testing using specialized scripts only when needed

Key Architecture Components

Core Packages

client/ - Client-side monitoring agent implementation
client/monitor/ - NTP monitoring logic using beevik/ntp library
client/config/ - Configuration management with TLS certificates and hot reloading
server/ - API server with JWT auth and Connect RPC endpoints
api/ - Protocol definitions using Protocol Buffers and Connect RPC
scorer/ - Server performance scoring algorithms
ntpdb/ - Database layer using MySQL with sqlc for type-safe queries

Monitor Types

The system supports two distinct types of monitors with different purposes and lifecycles:

1. Regular Monitors (`type = 'monitor'`)

Purpose: Distributed NTP monitoring clients that test server performance

Implementation: ntppool-agent clients running on user systems
Data Flow: Submit test results via monitor-api gRPC/ConnectRPC endpoints
Status Management: Managed by selector through proper constraint checking
Status Progression: candidate → testing → active
Assignment: Automatically assigned to compatible servers via GetServers API
Location: client/ package and related monitoring code

Lifecycle:

Monitor submits test results → Creates log_scores entries
Scorer processes results → Creates server_scores with candidate status
Selector evaluates server → Promotes based on constraints and health
Constraint violations → Gradual demotion through status hierarchy

2. Scorer Monitors (`type = 'score'`)

Purpose: Meta-monitors that calculate aggregate server performance scores

Implementation: Backend processes that analyze monitoring data
Data Flow: Process log_scores from regular monitors to compute server scores
Status Management: Automatically set to active status when processing scores
Assignment: Manually configured, not subject to selector constraint checking
Location: scorer/ package

Lifecycle:

Scorer processes log_scores entries from regular monitors
Creates/updates server_scores entries with calculated performance metrics
Status automatically forced to active during score calculation (this is correct behavior)
Not subject to selector's constraint checking or promotion logic

Key Differences

Aspect	Regular Monitors	Scorer Monitors
Purpose	Test individual servers	Calculate aggregate scores
Data Source	Direct NTP measurements	Processed monitoring data
Status Flow	`candidate` → `testing` → `active`	Always `active` when processing
Constraint Checking	Full selector constraint validation	Not subject to constraints
Assignment	Automatic via selector logic	Manual configuration
Count Limits	Subject to account/network limits	No limits (system-managed)

Database Identification

Monitors are identified by the type field in the monitors table:

SELECT * FROM monitors WHERE type = 'monitor';  -- Regular monitoring clients
SELECT * FROM monitors WHERE type = 'score';    -- Scorer/meta-monitors

The server_scores table contains entries from both types, but they serve different purposes:

Regular monitors: Status managed by selector for constraint compliance
Scorer monitors: Status automatically managed for operational needs

Configuration Management Architecture

Two separate configuration endpoints with different purposes and frequencies:

HTTP Config Endpoint (/monitor/api/config)
- Frequency: Every 5 minutes + immediate fsnotify triggers on state.json changes
- Purpose: Basic monitor setup, IP assignments, and TLS certificate management
- Location: client/config/appconfig.go:LoadAPIAppConfig()
gRPC Config Endpoint (api.GetConfig)
- Frequency: Every 60 minutes + immediate triggers when HTTP config changes
- Purpose: Monitor-specific operational configuration per IP version
- Location: client/cmd/monitor.go:fetchConfig()

Hot Reloading System:

fsnotify watches state.json for immediate response to setup command changes
HTTP config changes trigger immediate gRPC config refresh for all monitor goroutines
Context-based notification system with proper cleanup to prevent memory leaks
Broadcast mechanism supports multiple concurrent monitor goroutines (one per IP version)

Certificate Management

Certificate Lifecycle and Timing:

Initial Setup: ntppool-agent setup obtains API key but NOT certificates
Activation Required: Monitors must be marked "active" or "testing" in the API before certificates can be requested
First Certificate Request: Happens on first LoadAPIAppConfig() call after monitor activation
Certificate Storage: Stored alongside state.json in the state directory
Hot Reloading: Certificate changes are immediately detected and loaded

Wait Method Usage:

WaitUntilAPIKey(): Use when only API key is needed (e.g., initial setup verification)
WaitUntilConfigured(): Use when both API key AND certificates are required (e.g., API operations)
WaitUntilCertificatesLoaded(): Internal method for waiting specifically for certificates
WaitUntilLive(): Use when monitor must be in active/testing state with valid IP assignment

State Directory Configuration

systemd StateDirectory vs RuntimeDirectory:

StateDirectory (/var/lib/ntppool-agent): Persistent storage that survives reboots
RuntimeDirectory (/var/run/ntppool-agent): Temporary storage cleared on reboot
Migration: Automatic migration from RuntimeDirectory to StateDirectory on startup
Priority Order: $MONITOR_STATE_DIR > $STATE_DIRECTORY > user config directory

State Migration Best Practices:

Check for RUNTIME_DIRECTORY environment variable on startup
Migrate state.json and certificate files if found
Log migration operations for debugging
Handle partial migrations gracefully (e.g., state.json exists but certificates don't)

Configuration Sources and Hierarchy

AppConfig (Local State):

Stored in state.json
Contains: API key, monitor name, TLS name, IP assignments, status per protocol
Updated via HTTP endpoint every 5 minutes
Triggers immediate notifications on changes via WaitForConfigChange()

gRPC Config (Operational Config):

Fetched via Connect RPC from monitor-api
Contains: NTP test parameters, server lists, MQTT settings
Updated every 60 minutes or when AppConfig changes
Requires valid certificates for authentication

Configuration Flow:

Setup command → API key stored in state.json
Monitor activation in web UI → Status changes to "testing" or "active"
First LoadAPIAppConfig() → Receives certificates
Subsequent API calls → Can fetch gRPC config

Monitor Lifecycle and Status Checking

Status Values:

active: Monitor is fully operational
testing: Monitor is in test mode (still operational)
pending: Monitor should gradually phase out (allows clean transitions)
paused: Monitor should stop all work immediately

Status Checking Best Practices:

Check in outer loop: Before spawning monitor goroutines
Use fresh config: Call IPv4()/IPv6() to get current status, not stale captures
Wait for activation: Use WaitForConfigChange() when paused
Avoid inner loop checks: Don't check status inside monitoring loops

Example Pattern:

// Outer loop - check status before starting monitors
ipc := cli.Config.IPv4()
if !ipc.IsLive() {
    // Wait for activation using WaitForConfigChange
    for {
        configChangeCtx := cli.Config.WaitForConfigChange(ctx)
        select {
        case <-configChangeCtx.Done():
            ipc = cli.Config.IPv4() // Get fresh status
            if ipc.IsLive() {
                break
            }
        case <-ctx.Done():
            return nil
        }
    }
}
// Now safe to start monitoring

Communication

Connect RPC (replacing legacy Twirp) for client-server communication
MQTT for real-time messaging and live monitoring updates
TLS certificates for mutual authentication via Vault or API

Database

MySQL backend with sqlc for compile-time verified SQL
ClickHouse support for analytics and traceroute data

Database Schema Management

Schema Changes: Database schema changes are handled automatically by the deployment system
Schema File: schema.sql contains the current database schema
Local Development: Use MySQL 8 in Docker (available via make test-db-start or ./scripts/test-db.sh start)
No Manual Migrations: The codebase handles schema updates automatically during deployment
Version Tracking: Schema versions always increment forward and are managed separately from the code

Concurrent Operations and Race Conditions

When implementing database operations that might be called concurrently:

Check for Duplicate Key Constraints: Review table schemas for unique constraints
Use Idempotent Operations: Prefer INSERT ... ON DUPLICATE KEY UPDATE over plain INSERT
Test Concurrent Scenarios: Integration tests should include concurrent operation tests
Regenerate After SQL Changes: Always run make sqlc after modifying query.sql

Example pattern for safe inserts:

INSERT INTO table (col1, col2) VALUES (?, ?)
ON DUPLICATE KEY UPDATE col2 = VALUES(col2);

Environment Configuration

Key environment variables:

DEPLOYMENT_MODE - Environment (devel/test/prod)
DATABASE_DSN - MySQL connection string
JWT_KEY - JWT signing key for MQTT auth
VAULT_ADDR - Vault server URL for secrets
OTEL_EXPORTER_OTLP_ENDPOINT - OpenTelemetry collector

Database credentials can be provided via database.yaml:

mysql:
  user: some-db-user
  pass: password

Incremental Development Methodology

Phase-Based Development

For complex changes, break work into distinct phases:

Phase 1: Foundation/Bug Fixes

Fix critical bugs, race conditions, and safety issues first
Establish proper synchronization and error handling
Ensure existing functionality remains intact
Complete testing and validation before proceeding

Phase 2: Core Implementation

Add new features and functionality
Implement hot reloading, configuration changes, or new APIs
Maintain backward compatibility throughout
Test each increment independently

Phase 3: Future Considerations

Document potential improvements and optimizations
Plan for scalability and maintenance
Consider deprecation paths for legacy components
Defer non-essential changes to future iterations

Implementation Best Practices

Test each phase independently: Don't proceed until current phase is stable
Maintain rollback capability: Each phase should be independently revertible
Use plan mode for complex changes: Present architectural decisions for approval
Document assumptions and dependencies: Make implicit requirements explicit
Prefer targeted fixes over architectural overhauls: Simple solutions first

Change Management

Incremental commits: Each phase gets its own commit with clear description
Interface stability: Don't break existing APIs without migration paths
Configuration compatibility: Ensure new config works with existing deployments
Monitoring continuity: Verify metrics and logging remain functional

Development Workflow

After schema changes: Run make sqlc to regenerate database code
After protobuf changes: Run make generate to regenerate RPC code
Before commits: Run gofumpt -w on all modified Go files
Testing: Run appropriate make test targets to validate changes
Phase validation: Complete and test each development phase before proceeding
Incremental commits: Commit each stable phase independently

Monitoring Features

Dual-stack IPv4/IPv6 monitoring with automatic IP detection
High-precision NTP accuracy testing with configurable sampling
Network traceroute integration for path analysis
Real-time scoring algorithms for server performance evaluation
OpenTelemetry integration for distributed tracing and metrics
Certificate-based mutual authentication

Additional Guidelines

Documentation:

Maintain cmd/[command]/README.md files for each command
Use terse, standard Go doc comments
Update README.md when user-facing options change

Backwards Compatibility:

If a change affects both agent and monitoring server, update both together
Maintain backwards compatibility with older monitoring server versions
Use version.CheckVersion function and follow existing versioning patterns

Security:

Never commit secrets, credentials, or sensitive data to git
Secrets may exist in working directory but must be excluded from version control

Code Quality:

Flag any existing or new global variables and panics
When introducing third-party packages, ask for confirmation
Discourage global variables and disallow panics unless absolutely necessary

APIs:

Legacy Twirp API is frozen and will be removed after legacy clients are upgraded
Flag any new code using Twirp or modifying the Twirp-based API
Prefer Connect RPC API for all new development

General Guidance:

If context is missing or unclear, ask for clarification before proceeding
When creating new files, place them inside /Users/ask/src/go/ntp/monitor
When editing files, use ...existing code... comments to indicate unchanged regions

API Endpoint Configuration:

Deployment environment set via --env flag or DEPLOYMENT_MODE environment variable
Environment represented by depenv.DeploymentEnvironment type from go.ntppool.org/common/config/depenv
API endpoint resolved by calling depenv.DeploymentEnvironment.APIHost() or MonitorAPIHost()
Override with DEVEL_API_SERVER environment variable
Default endpoints defined in depenv package: prod (https://api.mon.ntppool.dev), test (https://api.test.mon.ntppool.dev), devel (https://api.devel.mon.ntppool.dev)

CI Tools and Testing Infrastructure

Key scripts in scripts/ directory:

test-db.sh - Primary test database management (MySQL 8.0 on port 3308)
test-ci-local.sh - Full CI environment emulation
test-scorer-integration.sh - Component-specific testing
diagnose-ci.sh - CI failure diagnostics

Database port:

3308: All test databases (unified port for consistency)

Local workflow:

make test-db-start (for integration tests)
make test (comprehensive test suite)
make test-db-stop (cleanup when done)

Recent Architecture Changes

Key recent changes (2025):

JWT Authentication: Complete JWT authentication with JWKS support replacing planned API key auth (commits: 10e2a70, deb9a16, 304cc1c)
OpenTelemetry Migration: Client metrics fully migrated to OpenTelemetry from Prometheus (commit: 9aa4d39)
Database Consolidation: Migrated to shared common/database package reducing code duplication (commits: 650aeb9, 393a251, c86adf2)
"New" Status Elimination: Schema updated to remove "new" status entirely (commit: 64416d0)
Performance-Based Rule 5: Candidates can replace worse-performing testing monitors (commit: de5e03a)
Emergency Override Consistency: Fixed candidate→testing promotion gap (commit: b6515b8)
Selector Package Refactoring: Moved to dedicated selector/ package with new constraint validation algorithm
Per-Status-Group Change Limits: Separate limits for each status transition type in selector/process.go
Dynamic Testing Pool Sizing: Testing pool adjusts based on active monitor gap
Monitor Limit Enforcement: Fixed monitor count tracking and rule execution order
Configuration Management: Transitioned to systemd StateDirectory for persistent storage

Monitor Selection Rules (in selector/process.go):

Rule 1: Immediate blocking
Rule 2: Gradual constraint removal
Rule 1.5: Active excess demotion
Rule 3: Testing to active promotion
Rule 5: Candidate to testing promotion
Rule 2.5: Testing pool management
Rule 6: Bootstrap promotion

FilesExpand file tree

CLAUDE.md

Latest commit

History