Skip to content

RLCR loop methodology: add convergence controls for long-running sessions #174

@dzwduan

Description

@dzwduan

RLCR Methodology Analysis Report

Session Overview

  • Total rounds: 43 (Round 0 through Round 42)
  • Duration: ~8 hours (single session)
  • Final verdict: Every round received "ADVANCED" — no round was ever marked "COMPLETE" or "STALLED"
  • Test progression: 43 tests (Round 0) → 128 tests (Round 42)
  • Acceptance criteria addressed: 8/8 partially, 0/8 fully complete at session end
  • Major unaddressed work items: 4+ acceptance criteria were listed as "NOT MET" from the first review through the final round with zero progress

Key Findings

1. Severe Granularity Collapse Over Time

Pattern: The amount of meaningful work per round decreased dramatically as the session progressed.

Phase Rounds Typical Output Per Round
Foundation 0–1 Entire system skeleton, build system, core pipeline, dozens of tests
Feature Implementation 2–10 One complete subsystem feature with tests
Incremental Credit Work 10–20 One mechanism (counter + consume/return + 2-3 tests)
Reviewer Fix Loops 20–30 1–3 tests fixing issues from previous review
Micro-Corrections 30–42 Often a single test assertion change or one field addition

By the final phase, entire rounds were consumed by adding a single comparison field to a test utility or converting a default function parameter to an explicit one. The cost-per-round (reviewer invocation, contract creation, summary writing, tracker updates) remained constant while productive output shrank to near zero.

2. Perpetual Incompleteness — No Convergence Mechanism

Pattern: Multiple major work items were listed as "NOT MET" or "blockers" in every single review from Round 0 through Round 42, yet were never worked on.

At least four significant acceptance criteria were mentioned as incomplete in every review without any implementation attempt across 43 rounds. The reviews faithfully recorded them as "blocking" and "not accepted as deferrals" — but the methodology provided no mechanism to force the implementer to prioritize them over local refinements.

The result: the session ended with the same major gaps it started with, while investing 20+ rounds polishing narrow slices of already-working features.

3. Reviewer as Scope Expander (Asymmetric Feedback Loop)

Pattern: Reviews consistently raised the bar rather than converging toward closure.

Each review:

  • Accepted the round's work as "ADVANCED" (never "COMPLETE")
  • Found 1–3 new issues in the implementation just delivered
  • Maintained or expanded a long "Mainline Gaps" list (typically 4–10 items)
  • Appended a "Directive Next Implementation Plan" that mixed small fixes with large untouched features

The reviewer never acknowledged that a slice was "done enough" or recommended moving on. This created an infinite refinement loop where each fix exposed new review findings. Example: a credit mechanism took 5 rounds to implement, 3 additional rounds to fix reviewer-identified timing issues, then 2 more rounds to add reviewer-requested edge tests — totaling 10 rounds for what was architecturally one feature.

4. Local Optimization Trap

Pattern: The implementer gravitated toward small, safe tasks rather than large, uncertain ones.

When given a directive plan listing both "fix this test assertion" and "implement a major unstarted subsystem," the implementer consistently chose the small fix. This is rational from a round-success perspective (small fixes are easy to validate) but catastrophic for overall project progress.

Rounds 26–42 (17 rounds) produced incrementally better test coverage for already-working memory stubs and a trace comparator, while four major acceptance criteria received zero implementation effort. The methodology's per-round contract structure inadvertently rewarded "safely advanced" verdicts over risky feature work.

5. Vacuous "ADVANCED" Verdicts Masked Stagnation

Pattern: The progress tracking system could not distinguish between meaningful and trivial advancement.

Every round received "ADVANCED" because any non-zero change satisfied the criterion. A round that built an entire subsystem and a round that renamed a function parameter both earned the same verdict. The mainline_stall_count: 0 in the session state confirms no stall was ever detected, despite the session clearly stagnating in its final third.

The methodology lacked a velocity metric or diminishing-returns detector.

6. Repetitive Review Boilerplate Dominated Content

Pattern: Reviews contained large amounts of repeated structural content that consumed reviewer effort without providing new signal.

Each review included:

  • A full "Goal Alignment Summary" repeating AC status (nearly identical across 40+ reviews)
  • A full "Blocking Side Issues" section listing the same 5–7 items every round
  • A full "Queued Side Issues" section listing the same 3–4 items every round
  • An "Unjustified deferrals" count that was the same every round

This repetition was ~60% of each review's word count. The delta-only information (what actually changed this round) was a small fraction of the output.

7. Contract-to-Implementation Gap (Partial Completion Accepted)

Pattern: Rounds frequently claimed success on criteria that were later found incomplete.

In at least 8 rounds, the implementer's summary claimed all success criteria were met, but the subsequent review found 1–3 criteria were not actually satisfied. Examples include:

  • Claiming "exact count enforcement" while still using >= assertions
  • Claiming "trace classification complete" while invalid encodings still hit generic fallback
  • Claiming "non-vacuous test" while conditional guards could skip all assertions

This suggests contracts need pre-defined, machine-verifiable acceptance conditions rather than prose descriptions that allow subjective interpretation.


Efficiency Metrics

Metric Value Assessment
Rounds to complete all ACs 42+ (never completed) Poor — no convergence
Rounds spent on credit mechanisms ~15 (Rounds 10–25) Moderate — reasonable for 5 credit points
Rounds spent on trace comparator ~8 (Rounds 35–42) Poor — 8 rounds for a test utility
Rounds with purely reviewer-fix work ~15 Poor — 35% of session spent on rework
Major features never attempted 4+ Critical — scope items ignored entirely
Average tests added per round (Rounds 30–42) ~2.3 Very low productivity

Methodology Improvement Suggestions

Suggestion 1: Add a Velocity Gate with Mandatory Topic Rotation

Observed Pattern: The loop got stuck in diminishing-returns refinement of a single area for 10+ consecutive rounds while ignoring major unstarted work.

Improvement: After N consecutive rounds (e.g., 3–4) on the same acceptance criterion or feature area, force a mandatory rotation to the most critical unstarted/blocked area. The velocity gate should trigger when the ratio of "new feature code" to "test fix / assertion polish" drops below a threshold.

Suggestion 2: Implement a "Good Enough" Completion Tier

Observed Pattern: No slice was ever called "complete." Reviews always found further refinement opportunities, creating an infinite loop.

Improvement: Define three tiers per acceptance criterion: (1) "Skeleton" — interface exists, basic happy path works; (2) "Functional" — positive and negative tests pass, edge cases documented; (3) "Production" — all edge cases covered, trace-visible, fully integrated. Allow the loop to mark an AC as "Functional" and move on, with "Production" deferred to a later pass. The current binary "NOT MET / PARTIALLY MET" provides no intermediate graduation.

Suggestion 3: Weight Reviews by Impact, Not Exhaustiveness

Observed Pattern: Reviews spent equal effort on "a function parameter has a default value" and "an entire subsystem is unimplemented." Both generated directive plans, but only one materially affected project outcomes.

Improvement: Reviews should explicitly rank findings by impact (Critical / Moderate / Low) and direct that only Critical items block the next round. Moderate/Low items should be batched and addressed in periodic "hardening rounds" rather than consuming one round each.

Suggestion 4: Add Diminishing-Returns Detection

Observed Pattern: The session produced 128 tests over 42 rounds. The first 5 rounds produced 61 tests (~12/round). The last 12 rounds produced 17 tests (~1.4/round). No mechanism detected this 8x productivity collapse.

Improvement: Track a rolling average of "new meaningful artifacts" (tests, modules, interfaces) per round. When the average drops below 50% of the session mean, trigger an escalation: either expand scope to a new area, batch remaining small fixes, or end the session with a documented remainder list.

Suggestion 5: Separate "Build" Rounds from "Polish" Rounds

Observed Pattern: The methodology treated every round identically regardless of whether it was creating new functionality or fixing reviewer nits. This made it impossible to distinguish productive exploration from rework drag.

Improvement: Explicitly label rounds as "Build" (new feature/subsystem) or "Polish" (fixing existing issues). Enforce a ratio (e.g., no more than 2 Polish rounds per Build round). If Polish rounds accumulate, the methodology should ask whether the underlying design is flawed rather than continuing incremental fixes.

Suggestion 6: Limit Review Scope to Delta-Only After Stabilization

Observed Pattern: Reviews repeated the same "Goal Alignment Summary," "Blocking Side Issues," and "Queued Side Issues" sections verbatim across 40+ rounds, consuming reviewer tokens without new information.

Improvement: After a configurable stabilization point (e.g., 10 rounds), switch to delta-only reviews that cover: (a) what changed this round, (b) whether it satisfies the contract, (c) what the next round should do. Full-scope reviews should happen only at milestone boundaries (e.g., every 5th round or when a major AC transitions state).

Suggestion 7: Make Contracts Machine-Verifiable

Observed Pattern: Success criteria were written in prose and frequently subject to disagreement between implementer and reviewer. The implementer claimed "done" while the reviewer found gaps, wasting both a submission round and a review round.

Improvement: Require contracts to include at least one verifiable check per criterion (e.g., "test X must assert EXPECT_EQ(a, b) not EXPECT_GE" or "output of grep -c 'pattern' file must equal N"). This eliminates ambiguity and enables pre-submission self-verification.

Suggestion 8: Add an Exit/Pause Criterion

Observed Pattern: The session ran 43 rounds without any mechanism to pause, re-plan, or declare "diminishing returns — stop here."

Improvement: Define explicit session exit conditions: (a) all ACs reach "Functional" tier, (b) velocity drops below threshold for N consecutive rounds, or (c) round count exceeds a configured maximum (e.g., 25) at which point the session pauses for human re-prioritization. The current configuration (max_iterations: 42) appears to have been a hard stop rather than a planned exit.


Summary Assessment

The RLCR methodology demonstrated clear value in the first ~20 rounds: the reviewer caught real bugs (vacuous tests, incorrect timing, missing coverage), maintained goal alignment, and drove the implementer toward correct behavior. The structured contract/review cycle prevented goal drift and ensured each round had clear success criteria.

However, the methodology lacks mechanisms for:

  1. Convergence — no way to graduate from a topic
  2. Velocity tracking — no detection of diminishing returns
  3. Scope prioritization — small fixes and large features get equal treatment
  4. Session termination — no exit condition beyond a hard round cap

The net result was a session that did excellent work in its first half but spent its second half in an increasingly tight refinement spiral, ending with the same major gaps it started with while achieving micro-perfection on a narrow slice. The 43-round investment would likely have produced better overall outcomes if capped at ~20 rounds with mandatory topic rotation.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions