This file provides guidance to Claude Code (claude.ai/code) and other LLM-based coding agents when working with code in this repository.
The NTP Pool Monitor is a distributed monitoring system for the NTP Pool project. It consists of three main components:
- `ntppool-agent` - Monitoring client that runs on distributed nodes
- `monitor-api` - Central API server for coordination and configuration
- `monitor-scorer` - Processes monitoring results and calculates server scores
Use TodoWrite/TodoRead tools for complex multi-step tasks (3+ steps) or when user explicitly requests todo management.
Best Practices:
- Break down complex tasks into specific, actionable items
- Mark tasks `in_progress` before starting work (only ONE at a time)
- Complete tasks immediately after finishing
- Only mark `completed` when fully accomplished
MANDATORY before any git commit:
- Run `gofumpt -w` on all changed `.go` files
- Run `make test` to ensure all tests pass
- Verify compilation with `go build` for affected packages
- Run lint tools if available (`golangci-lint run`, `go vet ./...`)
- Check for race conditions and proper error handling
Never commit changes unless explicitly asked by the user.
NEVER USE git add -A - always use explicit, targeted git staging.
```
make tools     # Install required development tools
make generate  # Generate all code (runs sqlc then go generate ./...)
make build     # Build all components
make test      # Run comprehensive test suite
```
Never edit generated files directly - changes will be lost on regeneration.
Generated file patterns:
- `*.pb.go` - Protocol buffer generated files
- `*.sql.go` - sqlc generated database code
- `otel.go` - OpenTelemetry instrumentation wrappers
- `*_string.go` - Enum string methods
- Files in `api/pb/` and `gen/` directories
Run `make generate` after:
- Modifying `query.sql` or `.proto` files
- Adding/modifying `//go:generate` directives
- Use `./scripts/test-db.sh start` for SQL query testing
Systematic debugging approach:
- Understand exact symptoms and failure conditions
- Trace code flow step by step
- Identify state, caching, and persistence points
- Consider simple explanations first (connection pooling, timing, config)
- Verify assumptions before implementing solutions
- Check for race conditions and concurrent operations
- Prefer targeted fixes over architectural changes
Common constraint system bugs:
- Self-reference bugs: Check if entities compare against lists including themselves
- Order-dependent logic: Verify processing order matches business priorities
- State consistency: Ensure constraint checks use correct state (target vs current)
SQL analysis tips:
- Examine ORDER BY clauses for data prioritization
- Check JOIN patterns for constraint relationships
- Use query ordering to understand conflict resolution
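The tips above can be made concrete with a sketch. This is a hypothetical selector-style query (table and column names are illustrative, not the repository's actual schema): the `ORDER BY` clause is where the prioritization policy lives, and the `JOIN` expresses the constraint relationship between monitors and their per-server scores.

```sql
-- Hypothetical example: ordering determines which monitor wins ties
-- during promotion. Healthier (higher score) and older monitors sort first,
-- so the ORDER BY clause *is* the conflict-resolution policy.
SELECT m.id, ss.status, ss.score_raw
FROM server_scores ss
JOIN monitors m ON m.id = ss.monitor_id
WHERE ss.server_id = ?
ORDER BY ss.status DESC, ss.score_raw DESC, m.created_on ASC;
```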
Before marking any coding task as complete:
- Code compiles successfully (`go build`)
- Tests pass (`make test`)
- Code is formatted (`gofumpt -w`)
- Basic functionality verified
Never mark completed if: tests fail, implementation is partial, or compilation errors exist.
Common patterns in monitor selection systems:
- Iterative constraint checking: Check constraints after each promotion/demotion
- Emergency override logic: Safety overrides for critical system states
- Self-exclusion: Always exclude entity being evaluated from conflict detection
- Target vs current state: Check constraints against appropriate state
- Grandfathering: Allow existing violations to persist temporarily
Implementation best practices:
- Lazy evaluation, consistent error handling, audit logging
Race condition detection:
- Identify shared state accessed by multiple goroutines
- Ensure write operations use `Lock()`, not `RLock()`
- Check atomic operations and channel usage
Mutex best practices:
- Use write locks for modifications, read locks for reads
- Minimize lock scope and avoid nested locks
- Create `methodUnsafe()` variants that assume the lock is held
Common patterns:
- Configuration hot reloading with mutex protection
- Background goroutines with proper context cancellation
- Atomic file operations with proper locking
Logging:
- Use `*slog.Logger` type for logger fields
- Use contextual logging: `log.InfoContext(ctx, ...)`
- Get logger from context: `log := logger.FromContext(ctx)`
Error Handling:
- Always include context in error messages
- Use structured logging for errors with relevant fields
CLI Framework (Kong):
- CLI commands defined as structs with Kong struct tags
- Common tags: `name:"flag-name"`, `short:"x"`, `help:"Description"`, `default:"value"`
- Follow existing patterns in `client/cmd/cmd.go`
Testing:
- Use table-driven tests
- Avoid `testify/assert` or similar tools
- SQL queries tested through integration tests
- Always use make targets for running tests (see Testing Framework section below)
Test Data Validation:
- Validate constraint mathematics ensure expected outcomes are possible
- Cross-check test logic before writing assertions
- Document mathematical relationships in comments
- Use CI tools (`./scripts/test-ci-local.sh`) for debugging
ALWAYS use make targets for running tests. Never run test commands directly unless specifically documented.
make test - Default comprehensive test suite
- Runs all unit tests across the entire codebase
- Use this for general development and pre-commit validation
- Equivalent to `go test ./...` but preferred for consistency
make test-unit - Fast unit tests only
- Runs tests with the `-short` flag, skipping slower integration tests
- Use for rapid development feedback cycles
- No database dependencies required
make test-integration - Integration tests with database
- Automatically starts test database if `TEST_DATABASE_URL` not set
- Runs all tests tagged with the `integration` build tag
- Includes database-dependent tests and scorer integration tests
- Use for validating database interactions and full system behavior
make test-all - Complete test suite
- Runs both unit and integration tests sequentially
- Equivalent to `make test-unit test-integration`
- Use for comprehensive validation before major changes
make test-db-start - Start test database
- Starts MySQL 8.0 container on port 3308
- Loads schema automatically
- Database persists until explicitly stopped
make test-db-stop - Stop test database
- Cleanly shuts down and removes test database container
- Use when finished with integration testing
make test-db-restart - Restart test database
- Equivalent to `make test-db-stop test-db-start`
- Use to reset database state between test runs
make test-db-reset - Reset database schema
- Drops and recreates database, reloads schema
- Use when schema changes require fresh database state
make test-load - Performance and load tests
- Runs tests tagged with the `load` build tag
- Requires test database, automatically started if needed
- Extended timeout (30 minutes) for long-running tests
make test-ci-local - CI environment emulation
- Replicates CI testing environment locally
- Use for debugging CI-specific test failures
- Includes additional validation and cleanup steps
./scripts/test-scorer-integration.sh - Scorer-specific debugging
- When to use: Debugging scorer-specific issues in isolation
- Port conflict detection: Checks for existing database on port 3308
- Guidance: Use `make test-db-stop` first if port conflicts occur
- Cleanup: Automatically cleans up its own database container
- Alternative: Use `make test-db-start && go test ./scorer -tags=integration -v`
Development workflow:
```
make test-unit        # Quick feedback during development
make test-integration # Validate database interactions
make test             # Final validation before commit
```
Database testing workflow:
```
make test-db-start    # Start persistent test database
go test ./ntpdb -v    # Test specific package with database
make test-db-stop     # Clean up when done
```
CI debugging workflow:
```
make test-ci-local    # Replicate CI environment
# If failures occur, use individual targets to isolate issues
```
Scorer debugging workflow:
```
# Option 1: Using make targets
make test-db-start
go test ./scorer -tags=integration -v

# Option 2: Using specialized script (handles its own database)
./scripts/test-scorer-integration.sh
```
- Always prefer make targets over direct go test commands
- Use test-unit for rapid iteration during development
- Run test-integration before commits that touch database code
- Clean up test databases with `make test-db-stop` when done
- Use CI tools for debugging complex test failures
- Isolate component testing using specialized scripts only when needed
- `client/` - Client-side monitoring agent implementation
- `client/monitor/` - NTP monitoring logic using beevik/ntp library
- `client/config/` - Configuration management with TLS certificates and hot reloading
- `server/` - API server with JWT auth and Connect RPC endpoints
- `api/` - Protocol definitions using Protocol Buffers and Connect RPC
- `scorer/` - Server performance scoring algorithms
- `ntpdb/` - Database layer using MySQL with sqlc for type-safe queries
The system supports two distinct types of monitors with different purposes and lifecycles:
Purpose: Distributed NTP monitoring clients that test server performance
- Implementation: `ntppool-agent` clients running on user systems
- Data Flow: Submit test results via monitor-api gRPC/ConnectRPC endpoints
- Status Management: Managed by selector through proper constraint checking
- Status Progression: `candidate` → `testing` → `active`
- Assignment: Automatically assigned to compatible servers via `GetServers` API
- Location: `client/` package and related monitoring code
Lifecycle:
- Monitor submits test results → Creates `log_scores` entries
- Scorer processes results → Creates `server_scores` with `candidate` status
- Selector evaluates server → Promotes based on constraints and health
- Constraint violations → Gradual demotion through status hierarchy
Purpose: Meta-monitors that calculate aggregate server performance scores
- Implementation: Backend processes that analyze monitoring data
- Data Flow: Process `log_scores` from regular monitors to compute server scores
- Status Management: Automatically set to `active` status when processing scores
- Assignment: Manually configured, not subject to selector constraint checking
- Location: `scorer/` package
Lifecycle:
- Scorer processes `log_scores` entries from regular monitors
- Creates/updates `server_scores` entries with calculated performance metrics
- Status automatically forced to `active` during score calculation (this is correct behavior)
- Not subject to selector's constraint checking or promotion logic
| Aspect | Regular Monitors | Scorer Monitors |
|---|---|---|
| Purpose | Test individual servers | Calculate aggregate scores |
| Data Source | Direct NTP measurements | Processed monitoring data |
| Status Flow | candidate → testing → active | Always active when processing |
| Constraint Checking | Full selector constraint validation | Not subject to constraints |
| Assignment | Automatic via selector logic | Manual configuration |
| Count Limits | Subject to account/network limits | No limits (system-managed) |
Monitors are identified by the type field in the monitors table:
```sql
SELECT * FROM monitors WHERE type = 'monitor'; -- Regular monitoring clients
SELECT * FROM monitors WHERE type = 'score';   -- Scorer/meta-monitors
```
The `server_scores` table contains entries from both types, but they serve different purposes:
- Regular monitors: Status managed by selector for constraint compliance
- Scorer monitors: Status automatically managed for operational needs
Two separate configuration endpoints with different purposes and frequencies:
- HTTP Config Endpoint (`/monitor/api/config`)
  - Frequency: Every 5 minutes + immediate fsnotify triggers on state.json changes
  - Purpose: Basic monitor setup, IP assignments, and TLS certificate management
  - Location: `client/config/appconfig.go:LoadAPIAppConfig()`
- gRPC Config Endpoint (`api.GetConfig`)
  - Frequency: Every 60 minutes + immediate triggers when HTTP config changes
  - Purpose: Monitor-specific operational configuration per IP version
  - Location: `client/cmd/monitor.go:fetchConfig()`
Hot Reloading System:
- `fsnotify` watches `state.json` for immediate response to setup command changes
- HTTP config changes trigger immediate gRPC config refresh for all monitor goroutines
- Context-based notification system with proper cleanup to prevent memory leaks
- Broadcast mechanism supports multiple concurrent monitor goroutines (one per IP version)
Certificate Lifecycle and Timing:
- Initial Setup: `ntppool-agent setup` obtains API key but NOT certificates
- Activation Required: Monitors must be marked "active" or "testing" in the API before certificates can be requested
- First Certificate Request: Happens on first `LoadAPIAppConfig()` call after monitor activation
- Certificate Storage: Stored alongside state.json in the state directory
- Hot Reloading: Certificate changes are immediately detected and loaded
Wait Method Usage:
- `WaitUntilAPIKey()`: Use when only API key is needed (e.g., initial setup verification)
- `WaitUntilConfigured()`: Use when both API key AND certificates are required (e.g., API operations)
- `WaitUntilCertificatesLoaded()`: Internal method for waiting specifically for certificates
- `WaitUntilLive()`: Use when monitor must be in active/testing state with valid IP assignment
systemd StateDirectory vs RuntimeDirectory:
- StateDirectory (`/var/lib/ntppool-agent`): Persistent storage that survives reboots
- RuntimeDirectory (`/var/run/ntppool-agent`): Temporary storage cleared on reboot
- Migration: Automatic migration from RuntimeDirectory to StateDirectory on startup
- Priority Order: `$MONITOR_STATE_DIR` > `$STATE_DIRECTORY` > user config directory
State Migration Best Practices:
- Check for `RUNTIME_DIRECTORY` environment variable on startup
- Migrate state.json and certificate files if found
- Log migration operations for debugging
- Handle partial migrations gracefully (e.g., state.json exists but certificates don't)
AppConfig (Local State):
- Stored in state.json
- Contains: API key, monitor name, TLS name, IP assignments, status per protocol
- Updated via HTTP endpoint every 5 minutes
- Triggers immediate notifications on changes via `WaitForConfigChange()`
gRPC Config (Operational Config):
- Fetched via Connect RPC from monitor-api
- Contains: NTP test parameters, server lists, MQTT settings
- Updated every 60 minutes or when AppConfig changes
- Requires valid certificates for authentication
Configuration Flow:
- Setup command → API key stored in state.json
- Monitor activation in web UI → Status changes to "testing" or "active"
- First LoadAPIAppConfig() → Receives certificates
- Subsequent API calls → Can fetch gRPC config
Status Values:
- active: Monitor is fully operational
- testing: Monitor is in test mode (still operational)
- pending: Monitor should gradually phase out (allows clean transitions)
- paused: Monitor should stop all work immediately
Status Checking Best Practices:
- Check in outer loop: Before spawning monitor goroutines
- Use fresh config: Call `IPv4()`/`IPv6()` to get current status, not stale captures
- Wait for activation: Use `WaitForConfigChange()` when paused
- Avoid inner loop checks: Don't check status inside monitoring loops
Example Pattern:
```go
// Outer loop - check status before starting monitors
ipc := cli.Config.IPv4()
if !ipc.IsLive() {
	// Wait for activation using WaitForConfigChange
waitLoop:
	for {
		configChangeCtx := cli.Config.WaitForConfigChange(ctx)
		select {
		case <-configChangeCtx.Done():
			ipc = cli.Config.IPv4() // Get fresh status
			if ipc.IsLive() {
				break waitLoop // a plain break would only exit the select
			}
		case <-ctx.Done():
			return nil
		}
	}
}
// Now safe to start monitoring
```
- Connect RPC (replacing legacy Twirp) for client-server communication
- MQTT for real-time messaging and live monitoring updates
- TLS certificates for mutual authentication via Vault or API
- MySQL backend with sqlc for compile-time verified SQL
- ClickHouse support for analytics and traceroute data
- Schema Changes: Database schema changes are handled automatically by the deployment system
- Schema File: `schema.sql` contains the current database schema
- Local Development: Use MySQL 8 in Docker (available via `make test-db-start` or `./scripts/test-db.sh start`)
- No Manual Migrations: The codebase handles schema updates automatically during deployment
- Version Tracking: Schema versions always increment forward and are managed separately from the code
When implementing database operations that might be called concurrently:
- Check for Duplicate Key Constraints: Review table schemas for unique constraints
- Use Idempotent Operations: Prefer `INSERT ... ON DUPLICATE KEY UPDATE` over plain `INSERT`
- Test Concurrent Scenarios: Integration tests should include concurrent operation tests
- Regenerate After SQL Changes: Always run `make sqlc` after modifying `query.sql`
Example pattern for safe inserts:
```sql
INSERT INTO table (col1, col2) VALUES (?, ?)
ON DUPLICATE KEY UPDATE col2 = VALUES(col2);
```
Key environment variables:
- `DEPLOYMENT_MODE` - Environment (devel/test/prod)
- `DATABASE_DSN` - MySQL connection string
- `JWT_KEY` - JWT signing key for MQTT auth
- `VAULT_ADDR` - Vault server URL for secrets
- `OTEL_EXPORTER_OTLP_ENDPOINT` - OpenTelemetry collector
Database credentials can be provided via database.yaml:
```yaml
mysql:
  user: some-db-user
  pass: password
```
For complex changes, break work into distinct phases:
Phase 1: Foundation/Bug Fixes
- Fix critical bugs, race conditions, and safety issues first
- Establish proper synchronization and error handling
- Ensure existing functionality remains intact
- Complete testing and validation before proceeding
Phase 2: Core Implementation
- Add new features and functionality
- Implement hot reloading, configuration changes, or new APIs
- Maintain backward compatibility throughout
- Test each increment independently
Phase 3: Future Considerations
- Document potential improvements and optimizations
- Plan for scalability and maintenance
- Consider deprecation paths for legacy components
- Defer non-essential changes to future iterations
- Test each phase independently: Don't proceed until current phase is stable
- Maintain rollback capability: Each phase should be independently revertible
- Use plan mode for complex changes: Present architectural decisions for approval
- Document assumptions and dependencies: Make implicit requirements explicit
- Prefer targeted fixes over architectural overhauls: Simple solutions first
- Incremental commits: Each phase gets its own commit with clear description
- Interface stability: Don't break existing APIs without migration paths
- Configuration compatibility: Ensure new config works with existing deployments
- Monitoring continuity: Verify metrics and logging remain functional
- After schema changes: Run `make sqlc` to regenerate database code
- After protobuf changes: Run `make generate` to regenerate RPC code
- Before commits: Run `gofumpt -w` on all modified Go files
- Testing: Run appropriate make test targets to validate changes
- Phase validation: Complete and test each development phase before proceeding
- Incremental commits: Commit each stable phase independently
- Dual-stack IPv4/IPv6 monitoring with automatic IP detection
- High-precision NTP accuracy testing with configurable sampling
- Network traceroute integration for path analysis
- Real-time scoring algorithms for server performance evaluation
- OpenTelemetry integration for distributed tracing and metrics
- Certificate-based mutual authentication
Documentation:
- Maintain `cmd/[command]/README.md` files for each command
- Use terse, standard Go doc comments
- Update README.md when user-facing options change
Backwards Compatibility:
- If a change affects both agent and monitoring server, update both together
- Maintain backwards compatibility with older monitoring server versions
- Use the `version.CheckVersion` function and follow existing versioning patterns
Security:
- Never commit secrets, credentials, or sensitive data to git
- Secrets may exist in working directory but must be excluded from version control
Code Quality:
- Flag any existing or new global variables and panics
- When introducing third-party packages, ask for confirmation
- Discourage global variables and disallow panics unless absolutely necessary
APIs:
- Legacy Twirp API is frozen and will be removed after legacy clients are upgraded
- Flag any new code using Twirp or modifying the Twirp-based API
- Prefer Connect RPC API for all new development
General Guidance:
- If context is missing or unclear, ask for clarification before proceeding
- When creating new files, place them inside `/Users/ask/src/go/ntp/monitor`
- When editing files, use `...existing code...` comments to indicate unchanged regions
API Endpoint Configuration:
- Deployment environment set via `--env` flag or `DEPLOYMENT_MODE` environment variable
- Environment represented by `depenv.DeploymentEnvironment` type from `go.ntppool.org/common/config/depenv`
- API endpoint resolved by calling `depenv.DeploymentEnvironment.APIHost()` or `MonitorAPIHost()`
- Override with `DEVEL_API_SERVER` environment variable
- Default endpoints defined in the `depenv` package: prod (https://api.mon.ntppool.dev), test (https://api.test.mon.ntppool.dev), devel (https://api.devel.mon.ntppool.dev)
Key scripts in scripts/ directory:
- `test-db.sh` - Primary test database management (MySQL 8.0 on port 3308)
- `test-ci-local.sh` - Full CI environment emulation
- `test-scorer-integration.sh` - Component-specific testing
- `diagnose-ci.sh` - CI failure diagnostics
Database port:
- 3308: All test databases (unified port for consistency)
Local workflow:
- `make test-db-start` (for integration tests)
- `make test` (comprehensive test suite)
- `make test-db-stop` (cleanup when done)
Key recent changes (2025):
- JWT Authentication: Complete JWT authentication with JWKS support replacing planned API key auth (commits: 10e2a70, deb9a16, 304cc1c)
- OpenTelemetry Migration: Client metrics fully migrated to OpenTelemetry from Prometheus (commit: 9aa4d39)
- Database Consolidation: Migrated to shared common/database package reducing code duplication (commits: 650aeb9, 393a251, c86adf2)
- "New" Status Elimination: Schema updated to remove "new" status entirely (commit: 64416d0)
- Performance-Based Rule 5: Candidates can replace worse-performing testing monitors (commit: de5e03a)
- Emergency Override Consistency: Fixed candidate→testing promotion gap (commit: b6515b8)
- Selector Package Refactoring: Moved to dedicated `selector/` package with new constraint validation algorithm
- Per-Status-Group Change Limits: Separate limits for each status transition type in `selector/process.go`
- Dynamic Testing Pool Sizing: Testing pool adjusts based on active monitor gap
- Monitor Limit Enforcement: Fixed monitor count tracking and rule execution order
- Configuration Management: Transitioned to systemd StateDirectory for persistent storage
Monitor Selection Rules (in selector/process.go):
- Rule 1: Immediate blocking
- Rule 2: Gradual constraint removal
- Rule 1.5: Active excess demotion
- Rule 3: Testing to active promotion
- Rule 5: Candidate to testing promotion
- Rule 2.5: Testing pool management
- Rule 6: Bootstrap promotion