
Primer Roadmap

Primer is the harness intelligence layer for agentic engineering. Research and industry evidence converge on a single insight: outcome quality is determined more by the agent harness — tool design, context management, caching, orchestration, and permission boundaries — than by model capability alone. Primer captures session telemetry across agents, decomposes it into harness dimensions, and measures which configurations actually improve outcomes. It turns that data into harness attribution, coaching, enablement, and operational decisions.

This roadmap is organized in two layers:

  1. Strategy and priorities at the top.
  2. Detailed shipped and planned capabilities underneath.

Plain bullets are shipped capabilities. Planned items carry rough priority tags:

  • P0 - foundational and near-term
  • P1 - important follow-on work
  • P2 - valuable expansion work

What Primer Should Help Teams Answer

  • Harness effectiveness: Which harness configurations (tool designs, caching strategies, context management, orchestration patterns, permission boundaries) correlate with better outcomes?
  • Harness attribution: When a session succeeds or fails, which harness components contributed? What's the per-step compound reliability?
  • Harness evolution: How have harness configurations changed over time, and did those changes improve outcomes?
  • Harnessability: Are codebases and teams structurally ready for effective agent harnesses (documentation quality, typing, module boundaries, data governance)?
  • Dead weight: Which harness configurations are outdated compensations for older model limitations that now bottleneck performance?
  • Environment effectiveness: Where is the project context failing the engineer, and what harness or repository changes will unblock them?

Product Goals

  • Measure harness effectiveness, not just usage — decompose outcomes to the harness component level.
  • Build the "code coverage for harnesses" that the industry is asking for (per-component reliability, compound failure math).
  • Track longitudinal harness evolution so teams can see how configuration changes correlate with outcome changes over time.
  • Make harnessability scoring a first-class product surface (documentation quality, context freshness, guide/sensor coverage).
  • Position Primer as the engineer's ally — proving when a codebase is not "AI-ready" rather than focusing on engineer inefficiency or surveillance.
  • Close the loop from harness insight to auto-remediation — where Primer acts as an agent itself to fix broken repository context.
  • Bring harness intelligence into the engineer workflow via MCP sidecar, not only after the fact.

Strategic Themes

P0: Enterprise Security & Trust

Telemetry capture cannot mean IP leakage. Primer must implement robust secret, PII, and IP scrubbing at the capture layer before any data is persisted, alongside transcript cold storage strategies for scale. Security teams must trust Primer by default.
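As a concrete illustration of capture-layer redaction, the sketch below replaces matches of a few illustrative patterns with typed placeholders before anything is persisted. A production pipeline would rely on a dedicated detector (the detailed roadmap mentions Presidio as one option); the patterns and names here are assumptions, not Primer's implementation.

```python
import re

# Illustrative patterns only — a real pipeline would use a dedicated
# PII/secret detector rather than hand-rolled regexes.
REDACTION_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "bearer_token": re.compile(r"Bearer\s+[A-Za-z0-9._~+/-]+=*"),
}

def redact(text: str) -> str:
    """Replace sensitive matches with typed placeholders before persistence."""
    for label, pattern in REDACTION_PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text

print(redact("contact alice@example.com with key AKIA1234567890ABCDEF"))
# → contact [REDACTED:email] with key [REDACTED:aws_access_key]
```

Typed placeholders (rather than blanket deletion) keep downstream facet extraction useful: the analytics layer can still see that a secret was present without ever storing it.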

P0: Harness Attribution and Compound Reliability

Primer's moat is decomposing outcomes to the harness component level: per-tool success rates, compound reliability math (10 steps at 99% ≈ 90.4% end-to-end), and harness configuration fingerprinting from session telemetry. This is the "code coverage for harnesses" that the industry is asking for.
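The compound reliability math is just the product of per-step success rates, assuming step failures are independent — which is why small per-tool unreliability compounds sharply over long agent workflows:

```python
def compound_reliability(step_success_rates):
    """End-to-end success probability of a sequential agent workflow,
    assuming independent step failures."""
    result = 1.0
    for rate in step_success_rates:
        result *= rate
    return result

# Ten sequential tool calls, each 99% reliable:
print(round(compound_reliability([0.99] * 10), 3))  # → 0.904, i.e. ~90.4% end-to-end
```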

P0: Measurement Integrity

Trustworthy semantics remain foundational. Clean taxonomy for outcomes, goals, friction, and success, plus reprocessing and coverage tooling so every downstream metric is credible.

P0: Harness Evolution Tracking

Longitudinal correlation of harness configuration changes with outcome changes over time. LangChain rewrote their harness 4x in one year; Vercel removed 80% of its tools and improved outcomes. No existing tool tracks this, and Primer's team-level time-series data makes it uniquely positioned to.

P1: Harnessability Scoring

Measure whether codebases and teams have the structural properties (documentation quality, context freshness, module boundaries, guide/sensor coverage) that make agent harnesses effective. Extends existing project readiness into a full harnessability assessment.

P1: Closed-Loop Enablement and Auto-Remediation

Recommendations should become measurable interventions, shifting away from "engineer coaching" and toward "environment fixing." Primer should begin acting as an agent itself, automatically opening PRs to update outdated AGENTS.md or remove dead-weight MCP tools.

P1: In-Workflow Guidance

The most valuable insights should show up during the session via MCP sidecar: harness health scores, context quality warnings, dead weight alerts, and configuration recommendations.

P2: Harness Simulation & Backtesting

If an organization changes permission boundaries or MCP tools, it should know whether the change works. Primer will enable "backtesting": running past failed sessions through new harness configurations to quantitatively demonstrate that the change improves outcomes.

P2: Operational Scale and Enterprise Readiness

Derived data pipelines, performance optimization, durable background jobs, enterprise identity, and observability.

Near-Term Priorities

  • P0 Local secret, PII, and IP redaction pipeline at the capture layer before database insertion.
  • P0 Per-tool success rate tracking with compound reliability computation — decompose session outcomes to the tool/step level.
  • P0 Harness configuration fingerprinting — extract and catalog the actual harness configuration (tools, context files, permissions, customizations) from session telemetry.
  • P0 Context quality scoring — measure AGENTS.md freshness, token efficiency, and guide/sensor coverage per project.
  • P1 Harness evolution timeline — before/after correlation of configuration changes with outcome changes.
  • P1 Harnessability scoring per project — documentation quality, typing strength, module boundaries, data governance readiness.
  • P1 Issue tracker integration (Linear/Jira) to connect session success to ticket-to-merge cycle time.
  • P1 Paragon's 4-dimension evaluation — tool correctness, tool usage accuracy, task completion, task efficiency.
  • P1 Semantic search over sessions via pgvector — exemplar discovery and cross-engineer pattern matching.
  • P2 Primer-as-Agent auto-remediation — Primer automatically generates PRs for environment fixes.
  • P2 Harness backtesting — simulate past sessions against new configurations.

Detailed Roadmap

Measurement Integrity & Data Foundation

  • Facet taxonomy alignment across extraction, schemas, analytics, and UI
  • Outcome normalization and historical backfill for previously ingested sessions
  • Coverage dashboard for facet extraction, transcript completeness, GitHub sync, and repository metadata
  • Confidence scoring for extracted facets and downstream recommendations
  • Cross-agent schema parity matrix so Primer knows which session fields are required, optional, or unavailable per source
  • Partial-telemetry handling for IDE-native agents like Cursor so missing transcript, tool, or model fields do not distort org-wide metrics
  • [P1] Execution evidence capture: lint, test, build, and verification signals per session
  • [P1] Change-shape capture: files touched, diff size, churn, and rewrite/revert indicators
  • [P1] Recovery-path tracking: detect whether engineers recover after friction or abandon the attempt
  • [P1] Derived analytics tables and materialized rollups for heavy longitudinal queries
  • Source-quality dashboard by agent type, including capture coverage and telemetry completeness for Cursor
  • [P0] Secret, PII, and IP redaction pipeline at the capture layer (e.g., Presidio integration) before persistence
  • [P2] Transcript cold storage and blob offloading (S3/GCS) to keep the operational database fast
  • [P2] Data-quality anomaly detection for broken ingestion, sparse transcripts, or stale integrations

Session Intelligence

  • Session search with full-text, outcome, type, model, branch filters
  • Transcript viewer with message-level detail
  • Session health scoring (outcome + friction + duration + satisfaction composite)
  • LLM-powered facet extraction (goals, friction types, satisfaction signals)
  • End reason breakdown with success rate per reason
  • Goal analytics (session type and goal category breakdown)
  • Permission mode analysis (success rate by permission level)
  • Satisfaction trend tracking (satisfied / neutral / dissatisfied over time)
  • Similar sessions panel with 3-tier relevance matching
  • [P0] Cursor session ingestion and discovery pipeline
  • [P0] Cursor transcript and tool-call extraction mapped onto the normalized session model
  • [P1] Cursor native telemetry enrichment for approvals, change shape, and context-usage signals
  • [P1] Cursor reliable token and model-usage extraction once source telemetry is trustworthy
  • [P1] Workflow fingerprinting: infer common sequences like search -> read -> edit -> test -> fix
  • [P1] Cursor-specific workflow fingerprinting and session archetype mapping
  • [P1] Session archetype detection: debugging, feature delivery, refactor, migration, docs, investigation
  • [P1] Delegation graph capture for multi-agent and subagent workflows
  • [P1] Issue Tracker Integration (Jira/Linear) to correlate AI session efficiency with actual business sprint velocity
  • [P1] Copilot Integration Strategy: explore telemetry extraction for Copilot Chat alongside current CLI agents
  • [P2] Exemplar session library for high-value workflows and onboarding examples
  • [P2] Skill, command, and template reuse analytics by workflow and outcome
  • [P2] Prompt reuse analytics by workflow and outcome
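The shipped session health score combines outcome, friction, duration, and satisfaction into a single composite. A minimal sketch of how such a composite might be computed — the weights and signal names below are illustrative assumptions, not the shipped formula:

```python
def session_health(outcome: float, friction: float,
                   duration: float, satisfaction: float) -> float:
    """Weighted composite of normalized (0-1) session signals.
    Friction and duration count against health. Weights are illustrative."""
    weights = {"outcome": 0.4, "friction": 0.2, "duration": 0.15, "satisfaction": 0.25}
    score = (weights["outcome"] * outcome
             + weights["friction"] * (1.0 - friction)
             + weights["duration"] * (1.0 - duration)
             + weights["satisfaction"] * satisfaction)
    return round(100 * score, 1)

# Successful, low-friction, quick, satisfied session:
print(session_health(outcome=1.0, friction=0.1, duration=0.2, satisfaction=1.0))  # → 95.0
```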

Friction & Bottleneck Analysis

  • Friction type classification (permission denied, timeout, context limit, edit conflict, tool error, exec error)
  • Friction impact scoring (occurrence count x success rate penalty)
  • Friction trend chart (count + rate over time)
  • Project-level friction breakdown
  • Friction cluster analysis with sample details
  • Anomaly detection for friction spikes
  • [P0] Root-cause clustering from transcripts, tool traces, and repeated failure motifs
  • [P1] Time-lost estimation per friction type, engineer, and project
  • [P1] Toolchain reliability analytics for MCP servers, built-in tools, and external services
  • [P1] Friction recovery analysis: what engineers tried after failure and which recoveries worked
  • [P2] Real-time friction detection for in-session intervention
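The shipped friction impact score (occurrence count x success rate penalty) could be computed along these lines; the exact penalty definition here is an illustrative assumption, not the shipped formula:

```python
def friction_impact(occurrences: int,
                    success_rate_with: float,
                    success_rate_without: float) -> float:
    """Impact = how often a friction type occurs, weighted by how much it
    depresses session success. Illustrative formula; penalty clamps at 0
    so friction types that coincide with higher success score zero."""
    penalty = max(0.0, success_rate_without - success_rate_with)
    return occurrences * penalty

# Permission denials: 40 occurrences; sessions with them succeed 55% of
# the time vs. 80% without.
print(round(friction_impact(40, 0.55, 0.80), 2))  # → 10.0
```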

Developer Enablement & Analytics

Note: We are explicitly shifting our product vernacular and UI away from "Engineer Coaching/Surveillance" toward "Harness & Project Enablement" to ensure Primer acts as a developer ally. Legacy views will be reframed to highlight repo enablement.

  • Engineer leaderboard with multi-dimensional ranking
  • Personal trajectory dashboard with weekly sparklines
  • Strengths and friction breakdown per engineer
  • Peer benchmarking (percentile ranking, vs-team-average deltas)
  • AI-generated narrative insights per engineer
  • Personalized tips based on friction patterns and tool gaps
  • Config optimization suggestions from team benchmark comparison
  • Skill inventory with proficiency levels per tool
  • Learning paths generated from high-performer patterns
  • [P0] Effectiveness score: success rate, cost efficiency, quality outcomes, and follow-through
  • [P0] Workflow playbooks derived from high-performing peer patterns
  • [P1] Plugin and tool recommendation engine based on task type, project context, and similar successful sessions
  • [P1] Model selection coach for cost-appropriate model choice by task
  • [P1] Personal impact review that combines trajectory, quality, cost, and workflow maturity
  • [P2] Longitudinal growth view across quarters, role changes, and team moves

Growth & Onboarding

  • Cohort comparison (new hire / ramping / experienced)
  • Time-to-team-average tracking for new hires
  • Onboarding velocity scoring
  • Onboarding recommendations
  • Shared behavior pattern discovery with approach comparison
  • [P1] Bright spot detection: explicitly surface high performers and cross-pollinate their patterns
  • [P1] Exemplar-session-to-learning-path pipeline
  • [P1] Team skill gap mapping by workflow, tool category, and project context
  • [P2] Coaching program measurement: which onboarding or training changes improved outcomes

Project Intelligence

  • Dedicated project workspace with readiness, friction, quality, cost, and enablement views
  • Project AI-readiness scoring (CLAUDE.md, AGENTS.md, .claude/ detection)
  • Project scorecard that combines adoption, effectiveness, quality, and cost efficiency
  • [P0] Project-level workflow fingerprints and friction hotspots
  • [P1] Project-level agent mix comparison, including Cursor sessions alongside CLI agents
  • [P1] Repository context model: language mix, test maturity, repo size, and AI-enablement signals
  • [P1] Project enablement recommendations tied to observed bottlenecks
  • [P1] Cross-project comparison: which repos make effective AI use easiest or hardest
  • [P2] Project playbook templates for greenfield, legacy, high-compliance, and test-poor repos

Harness Intelligence

  • Tool leverage scoring (0-100 composite per engineer)
  • Tool category classification (core, search, orchestration, skill, MCP)
  • Orchestration adoption rate tracking
  • Agent and skill usage analytics (invocation patterns, delegation depth)
  • Tool adoption rates and trend charts
  • Engineer tool proficiency table
  • Daily leverage trend tracking
  • [P0] 5-factor harness maturity score: tool design, orchestration, caching, context hygiene, boundary design
  • [P0] Dead weight detection: flag zero-invocation and no-outcome-lift customizations
  • [P0] Subtractive coaching: "what you can stop doing" section in coaching briefs
  • [P0] GET /api/v1/harness/deadweight endpoint with auth-scoped access
  • [P1] Model diversity factor in leverage scoring
  • [P1] Agent team detection for coordinated multi-agent orchestration
  • [P1] Session customization snapshot: capture enabled MCP servers, subagents, skills, commands, and templates alongside what was actually invoked
  • [P1] Tool source classification: built-in vs marketplace vs custom
  • [P1] Skill provenance + baseline filtering so recommendations and reuse analytics suppress built-in/default skills and focus on explicit user or repo-configured choices
  • [P1] Cross-agent customization normalization so Claude, Cursor, Codex, and Gemini plugin surfaces map into one shared model
  • [P1] Customization state model: available vs enabled vs invoked for MCPs, subagents, skills, commands, and templates
  • [P1] Outcome attribution for customizations: which MCPs, skills, commands, and subagents improve workflow, quality, cost, and friction outcomes
  • [P1] Cross-team tooling landscape: overlap, reuse, and local best-of-breed tools
  • [P1] High-performer agent stack analysis: which combinations of MCPs, skills, commands, and subagents differentiate top performers
  • [P0] Per-tool success rate tracking with compound reliability computation (10 steps at 99% = 90.4% end-to-end)
  • [P0] Harness configuration fingerprinting from session telemetry (tools, context files, permissions, customizations)
  • [P1] Context quality scoring: AGENTS.md freshness, token efficiency, guide/sensor coverage
  • [P1] Harness evolution timeline: before/after correlation of configuration changes with outcome changes
  • [P1] Harnessability scoring per project: documentation quality, typing strength, module boundaries
  • [P1] Paragon's 4-dimension evaluation: tool correctness, tool usage accuracy, task completion, task efficiency
  • [P2] Prompt, skill, and template maturity scoring
  • [P2] Harness simulation and backtesting: run past failed sessions through proposed harness configs to validate improvements before deployment
  • [P2] Automated harness optimization suggestions
  • [P2] Dead weight dashboard tab with per-customization detail and removal actions
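Harness configuration fingerprinting can be as simple as hashing a canonicalized snapshot of the session's configuration, so any change to tools, context files, permissions, or MCP servers yields a new fingerprint that evolution tracking can group sessions by. A minimal sketch — the field names are illustrative, not Primer's actual schema:

```python
import hashlib
import json

def harness_fingerprint(config: dict) -> str:
    """Stable short hash of a harness configuration snapshot. Sorting keys
    and list values makes the fingerprint order-independent."""
    normalized = {
        "tools": sorted(config.get("tools", [])),
        "context_files": sorted(config.get("context_files", [])),
        "permission_mode": config.get("permission_mode", "default"),
        "mcp_servers": sorted(config.get("mcp_servers", [])),
    }
    canonical = json.dumps(normalized, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

a = harness_fingerprint({"tools": ["read", "edit"], "mcp_servers": ["github"]})
b = harness_fingerprint({"tools": ["edit", "read"], "mcp_servers": ["github"]})
assert a == b  # tool ordering doesn't change the fingerprint
```

Grouping sessions by fingerprint is what makes before/after correlation possible: every configuration change starts a new cohort whose outcomes can be compared with the previous one.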

Code Quality

  • GitHub OAuth SSO
  • Pull request sync via GitHub App
  • Commit correlation with sessions
  • Claude-assisted vs non-Claude PR comparison (merge rate, review comments, time to merge)
  • Quality by session type (debugging, feature, refactoring)
  • Code volume tracking (daily lines added/deleted)
  • Engineer quality ranking table
  • Repository AI-readiness scoring
  • Automated review findings tracker (BugBot parser, severity breakdown, fix rate)
  • Review findings overview in quality dashboard and engineer profile
  • GET /api/v1/analytics/review-findings endpoint with source/severity/status filters
  • Quality attribution layer linking session behavior to PR outcomes and review findings
  • [P1] Additional review bot parsers: CodeRabbit, SonarQube, and other automated review tools
  • [P1] Post-merge outcome tracking: reverts, hotfixes, and follow-up bug volume
  • [P1] Change-quality analysis by workflow fingerprint and session archetype
  • [P2] Review remediation tracking from finding creation to fix completion

FinOps & Cost Management

  • Per-model spend tracking with daily cost chart
  • Cost breakdown by model
  • Cache efficiency analytics (hit rates, savings, per-engineer potential)
  • Billing mode detection (API vs subscription)
  • Subscription vs API cost modeling with optimal plan recommendations
  • 30-day cost forecasting (linear regression with confidence bands)
  • Budget tracking with burn-rate alerts and projected overrun warnings
  • Cost per successful outcome metric
  • [P1] Break-even analysis for API vs seat-based pricing with per-engineer recommendations
  • [P1] Cost per workflow archetype and cost per engineering outcome
  • [P1] Workflow compare mode for archetype and fingerprint performance
  • [P1] Model-choice opportunity scoring for overspend reduction
  • [P2] Budget policy simulation by team, project, and billing model
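The break-even analysis for API vs seat-based pricing reduces to comparing an engineer's usage-based spend against a flat seat price. A sketch with illustrative numbers (the prices are assumptions, not real plan pricing):

```python
def recommend_billing(monthly_api_cost: float, seat_price: float) -> str:
    """Recommend the cheaper billing mode for one engineer.
    Prices are illustrative, not real plan pricing."""
    if monthly_api_cost > seat_price:
        return f"subscription (saves ${monthly_api_cost - seat_price:.2f}/mo)"
    return f"API (saves ${seat_price - monthly_api_cost:.2f}/mo)"

print(recommend_billing(monthly_api_cost=312.40, seat_price=200.00))
# → subscription (saves $112.40/mo)
```

In practice the comparison would run per engineer over a trailing window, since light users and heavy users typically land on opposite sides of the break-even point.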

AI Synthesis & Explorer

  • AI-generated narrative reports (engineer, team, org scope)
  • Narrative caching with TTL-based expiry
  • Auto-refresh via lifespan task
  • Conversational data explorer (SSE-streamed tool-use chat)
  • AI-powered recommendations panel
  • [P1] Saved explorer prompts and reusable report cards
  • [P1] Compare mode for engineer, team, project, and time-period analysis
  • [P2] Weekly manager review packs that combine quality, friction, growth, and cost
  • [P2] Recommendation narratives that explain why a workflow is likely to help

Website & Positioning

  • [P1] Reposition the website around harness intelligence for agentic engineering
  • [P1] Showcase harness effectiveness, cost attribution, quality, and exemplar sessions as the core proof points

Interventions & Experimentation

  • [P0] Recommendation-to-intervention workflow with owner, status, due date, and linked evidence
  • [P0] Before-and-after measurement for coaching, tooling, or repo changes
  • [P1] Experimentation layer for training rollouts, tool changes, and enablement playbooks
  • [P1] Intervention effectiveness reporting by team, project, and engineer cohort
  • [P1] Primer-as-Agent Auto-Remediation: Automatically generate pull requests to fix outdated context (e.g., AGENTS.md) or dead-weight tools
  • [P2] Auto-generated next-step plans from alerts, narratives, and project findings

Real-Time Engineer Experience

  • MCP sidecar with on-demand stats, friction reports, and recommendations
  • [P0] Proactive coaching skill that activates at session start with contextual suggestions
  • [P0] Live session signals that stream friction, satisfaction, and risk as work happens
  • [P1] In-session workflow nudges based on project playbooks and prior failures
  • [P1] Daily and weekly personal recaps inside the sidecar
  • [P2] Lightweight session planning prompts before complex work begins

Organization & Administration

  • Hub-and-spoke dashboard with KPI strip, activity section, attention alerts, deep-dive cards
  • Custom date range picker (7d / 30d / 90d / 1y presets + custom)
  • Team management with member stats
  • Role-based access control (engineer, team lead, admin)
  • Admin panel (engineer/team management, audit log, system stats)
  • Alert system with configurable thresholds, acknowledge/dismiss workflow
  • Slack notification integration
  • CSV and PDF export
  • API rate limiting
  • Dark mode with system preference detection
  • [P1] Activation and setup hub for GitHub, budgets, alerts, narrative readiness, and data freshness
  • [P1] Performance measurement views for leadership across productivity, quality, cost, and adoption
  • [P1] Threshold resolution and policy management that matches actual alerting behavior
  • [P1] Device-scoped ingest tokens for hooks and sidecar, backed by authenticated engineer identity instead of long-lived engineer API keys
  • [P1] One-time setup codes that exchange browser-authenticated engineers into local device tokens
  • [P2] Multi-tenant workspace isolation for multiple organizations on a shared Primer instance
  • [P2] Enterprise IdP support with SAML and OIDC for provisioning and SSO

Platform & Infrastructure

  • Multi-agent support (Claude Code, Codex CLI, Gemini CLI, Cursor)
  • SessionEnd hook system with agent-specific installers
  • primer sync --watch for agents without hook systems
  • Docker Compose and Kubernetes Helm deployment
  • PostgreSQL and SQLite support
  • Alembic migration bundling in pip package
  • [P0] Cursor agent_type support across capture, sync, ingest, and analytics filters
  • [P0] Durable background job system for sync, facet extraction, narratives, and alerts
  • [P0] Scalable API key lookup and verification strategy
  • [P1] Source-capability registry so Primer can safely gate analytics by what each agent source actually provides
  • [P1] OpenTelemetry integration for metrics, traces, and logs
  • [P1] Redis-backed caching for analytics query results and high-read metadata
  • [P1] Analytics performance work for large orgs and concurrent dashboard usage
  • [P2] Pluggable warehouse export for long-horizon analysis in external BI tools