Skip to content

Add dedicated outcome evaluators for create_pull_request and push_to_pull_request_branch#35100

Open
Copilot wants to merge 13 commits into
mainfrom
copilot/implement-outcome-evaluators
Open

Add dedicated outcome evaluators for create_pull_request and push_to_pull_request_branch#35100
Copilot wants to merge 13 commits into
mainfrom
copilot/implement-outcome-evaluators

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented May 27, 2026

This change replaces generic/fallback outcome checks for the two highest-signal safe-output types with type-specific evaluators. The new logic classifies PR creation and PR-branch pushes using post-action GitHub state, review signals, staleness, and commit-retention evidence instead of existence-style acceptance.

  • create_pull_request: typed outcome classification

    • Added a dedicated evaluator path for create_pull_request.
    • Classifies:
      • accepted (strong): merged PR
      • accepted (medium): approved PR with no requested changes
      • rejected (strong/medium): closed without merge, with stronger signal when closing labels/comments indicate explicit rejection
      • pending: open PR with no reviews, or open with only CHANGES_REQUESTED
      • ignored: stale open PR (configurable stale threshold)
      • unknown: mixed/ambiguous states or API degradation (including review API failure)
  • push_to_pull_request_branch: commit-retention based evaluation

    • Added a dedicated evaluator path for push_to_pull_request_branch.
    • Uses persisted commit metadata (commit_sha, before-head variants) plus branch history comparison to distinguish:
      • retained pushed commits
      • force-pushed/reset-away commits
      • closed-without-merge outcomes
      • pending review states for open PRs
      • ignored for stale open PRs with no review on pushed commits
      • unknown fallback states (including review API failure)
  • Evaluator architecture and maintainability

    • evaluateItem now dispatches by safe-output type to dedicated evaluator functions.
    • Introduced API-function injection in evaluation calls for deterministic unit tests (no mutable global test hook).
    • Tightened closing-comment keyword matching to reduce false-positive strong rejections.
    • Improved SHA matching to correctly handle short-vs-full SHA forms in retention/head comparisons.
    • Excluded ambiguous head_sha from pushed-commit extraction to avoid misclassifying retained pushes.
    • Added inline semantics comments for compare-status interpretation.
  • Focused test coverage

    • Added actions/setup/js/evaluate_outcomes.test.cjs with branch-complete coverage for both evaluators:
      • accepted/rejected/pending/ignored/unknown classification paths
      • merge/close/review/stale cases
      • pushed-commit retained vs reverted/force-pushed cases
      • mixed review-state handling, review API failure handling, no-commit-SHA fallback, short/full SHA matching, and head_sha ambiguity guard
// Type-dispatch in evaluator entrypoint
if (type === "create_pull_request") {
  return evaluateCreatePullRequestOutcome(item, itemRepo, out, ghAPIFn);
}
if (type === "push_to_pull_request_branch") {
  return evaluatePushToPullRequestBranchOutcome(item, itemRepo, out, ghAPIFn);
}

Copilot AI and others added 4 commits May 27, 2026 02:46
Co-authored-by: gh-aw-bot <259018956+gh-aw-bot@users.noreply.github.com>
Co-authored-by: gh-aw-bot <259018956+gh-aw-bot@users.noreply.github.com>
Co-authored-by: gh-aw-bot <259018956+gh-aw-bot@users.noreply.github.com>
Co-authored-by: gh-aw-bot <259018956+gh-aw-bot@users.noreply.github.com>
Copilot AI changed the title [WIP] Implement dedicated outcome evaluators for create_pull_request and push_to_pull_request_branch Add dedicated outcome evaluators for create_pull_request and push_to_pull_request_branch May 27, 2026
Copilot AI requested a review from gh-aw-bot May 27, 2026 03:00
@pelikhan
Copy link
Copy Markdown
Collaborator

@copilot resolve the merge conflicts in this pull request

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copy link
Copy Markdown
Contributor Author

Copilot AI commented May 27, 2026

@copilot resolve the merge conflicts in this pull request

Conflicts are resolved in merge commit cc28ec3.

Copilot AI requested a review from pelikhan May 27, 2026 11:56
@pelikhan pelikhan marked this pull request as ready for review May 27, 2026 12:04
Copilot AI review requested due to automatic review settings May 27, 2026 12:04
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 27, 2026

🧪 Test Quality Sentinel completed test quality analysis.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 27, 2026

PR Code Quality Reviewer completed the code quality review.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 27, 2026

Design Decision Gate 🏗️ completed the design decision gate check.

No ADR enforcement needed: PR #35100 does not have the 'implementation' label and has 0 new lines of code in default business logic directories (well below the 100-line threshold).

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 27, 2026

🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces type-specific outcome evaluators for the two highest-signal safe-output item types (create_pull_request and push_to_pull_request_branch) so outcome classification is based on post-action GitHub state (merge/close/reviews/staleness/commit retention) rather than generic existence-style checks.

Changes:

  • Added evaluator dispatch in evaluateItem() to route create_pull_request and push_to_pull_request_branch items to dedicated evaluators.
  • Implemented PR-state/review-based outcome logic for create_pull_request and commit-retention-based outcome logic for push_to_pull_request_branch.
  • Added focused Vitest coverage for the new evaluator branches and introduced API-function injection for deterministic tests.
Show a summary per file
File Description
actions/setup/js/evaluate_outcomes.cjs Adds dedicated evaluators + dispatch and helper utilities for PR-number/commit retention/closing-signal evaluation.
actions/setup/js/evaluate_outcomes.test.cjs Adds unit tests covering the new evaluator classification paths with a mockable GitHub API function.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

  • Files reviewed: 2/2 changed files
  • Comments generated: 6


if (data.merged === true) {
out.result = "accepted";
out.detail = "merged (strong)";

if (data.merged === true) {
out.result = "accepted";
out.detail = pushedIncluded ? "merged with pushed commit retained (strong)" : "merged";
Comment thread actions/setup/js/evaluate_outcomes.cjs Outdated
Comment on lines +476 to +483
const pushedStillHead = currentHead ? pushedShas.some(sha => sha === currentHead) : false;
const pushedIncluded =
currentHead && pushedShas.length > 0
? pushedShas.some(sha => {
const inHistory = isCommitInBranchHistory(itemRepo, sha, currentHead, ghAPIFn);
return inHistory === true;
})
: false;
Comment thread actions/setup/js/evaluate_outcomes.cjs Outdated

// A strong rejection requires before-head metadata from execution time so we
// can distinguish "commit not retained" from "insufficient history context".
if (pushedShas.length > 0 && !pushedIncluded && beforeHead) {
Comment thread actions/setup/js/evaluate_outcomes.cjs Outdated
*/
function resolvePRNumber(item) {
if (typeof item.number === "number" && item.number > 0) return item.number;
const candidates = [item.pull_request_number, item.pr_number, item.item_number];

const result = evaluateItem({ type: "create_pull_request", repo: "owner/repo", url: "https://github.com/owner/repo/pull/12", timestamp: "2026-05-20T00:00:00Z" }, "owner/repo", { ghAPI });
expect(result.result).toBe("accepted");
expect(result.detail).toContain("strong");
@pelikhan
Copy link
Copy Markdown
Collaborator

@github-actions
Copy link
Copy Markdown
Contributor

🧪 Test Quality Sentinel Report

Test Quality Score: 100/100 — Excellent

Analyzed 13 test(s): 13 design, 0 implementation, 0 guideline violation(s).

📊 Metrics & Test Classification (13 tests analyzed)
Metric Value
New/modified tests analyzed 13
✅ Design tests (behavioral contracts) 13 (100%)
⚠️ Implementation tests (low value) 0 (0%)
Tests with error/edge cases 13 (100%)
Duplicate test clusters 0
Test inflation detected No (test +251 lines vs production +333 lines, ratio 0.75)
🚨 Coding-guideline violations 0

Test Classification Details

Test File Classification Issues Detected
classifies merged PR as accepted (strong) evaluate_outcomes.test.cjs ✅ Design None
classifies approved open PR as accepted (medium) evaluate_outcomes.test.cjs ✅ Design None
classifies closed-with-label PR as rejected (strong) evaluate_outcomes.test.cjs ✅ Design None
classifies closed-without-merge PR as rejected evaluate_outcomes.test.cjs ✅ Design None
classifies open PR with no reviews as pending evaluate_outcomes.test.cjs ✅ Design None
classifies stale open PR with no reviews as ignored evaluate_outcomes.test.cjs ✅ Design None
classifies open PR with mixed review state as unknown evaluate_outcomes.test.cjs ✅ Design None
classifies merged PR with pushed commit included as accepted (strong) evaluate_outcomes.test.cjs ✅ Design None
classifies open PR with pushed commit still at HEAD as accepted (medium) evaluate_outcomes.test.cjs ✅ Design None
classifies force-pushed-away commits as rejected (strong) evaluate_outcomes.test.cjs ✅ Design None
classifies closed-without-merge PR as rejected (medium) evaluate_outcomes.test.cjs ✅ Design None
classifies open PR with no reviews on pushed commits as pending evaluate_outcomes.test.cjs ✅ Design None
falls back to unknown when pushed commits are reviewed but not head/merged evaluate_outcomes.test.cjs ✅ Design None

Language Support

Tests analyzed:

  • 🐹 Go (*_test.go): 0 tests
  • 🟨 JavaScript (*.test.cjs): 13 tests (vitest)

Verdict

Check passed. 0% of new tests are implementation tests (threshold: 30%). All 13 tests verify observable behavioral contracts of evaluateItem() across the full outcome state machine (accepted/rejected/pending/ignored/unknown). The mockAPI helper correctly targets only external GitHub API calls — a legitimate mock boundary. Stale-detection tests appropriately use vi.useFakeTimers(). Test-to-production line ratio is 0.75 (well under the 2:1 inflation threshold).

📖 Understanding Test Classifications

Design Tests (High Value) verify what the system does:

  • Assert on observable outputs, return values, or state changes
  • Cover error paths and boundary conditions
  • Would catch a behavioral regression if deleted
  • Remain valid even after internal refactoring

Implementation Tests (Low Value) verify how the system does it:

  • Assert on internal function calls (mocking internals)
  • Only test the happy path with typical inputs
  • Break during legitimate refactoring even when behavior is correct
  • Give false assurance: they pass even when the system is wrong

Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators.

🧪 Test quality analysis by Test Quality Sentinel · sonnet46 1.2M ·

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Test Quality Sentinel: 100/100. Test quality is excellent — 0% of new tests are implementation tests (threshold: 30%).

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skills-Based Review 🧠

Applied /tdd, /zoom-out, and /grill-with-docs — no blocking correctness bugs found, but several targeted improvements worth addressing.

📋 Key Themes & Highlights

Key Themes

  • CLOSING_COMMENT_KEYWORDS over-breadth: "won't" and "closing" as bare substrings will produce false-positive rejected classifications on perfectly ordinary review comments. Worth tightening before this evaluator is in production.
  • Unnecessary reviews API call for closed PRs: evaluateCreatePullRequestOutcome fetches reviews unconditionally for all non-merged PRs, but only uses the result in the open branch. Easy fix: move the fetch inside the open guard.
  • Test precision: The "mixed review state" test actually only tests CHANGES_REQUESTED alone — the true APPROVED+CHANGES_REQUESTED mixed case is untested and has different semantics worth pinning.
  • Architectural asymmetry: push_to_pull_request_branch has no ignored/stale path while create_pull_request does. Either add it or document why pushes are intentionally exempt from stale degradation.
  • Missing degraded-metadata test: Push items without any commit SHA field are an important real-world failure mode (the agent failed to record metadata) and the code path is uncovered.

Positive Highlights

  • ✅ API-injection pattern (options.ghAPI) is clean and makes the new evaluators deterministically testable without mutable global hooks — great design decision.
  • resolvePRNumber covers multiple field-name conventions (number, pull_request_number, pr_number, URL fallback) — robust against different safe-output shapes.
  • isCommitInBranchHistory semantics are correct and the inline comments explaining base...head compare status are helpful.
  • ✅ Both evaluators are exported, making them independently callable in integration tests if needed.

🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · sonnet46 1.9M

@@ -41,6 +41,8 @@ const SUMMARY_PATH = "/tmp/gh-aw/outcome-summary.json";
// Noop types that are tracked but not counted as actionable
// ---------------------------------------------------------------------------
const NOOP_TYPES = new Set(["noop", "missing_tool", "missing_data", "report_incomplete"]);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/grill-with-docs] CLOSING_COMMENT_KEYWORDS contains two over-broad entries — "won't" and "closing" — that will produce false-positive rejections on ordinary review comments.

💡 Why this matters and suggested tightening

Substring matching means:

  • "won't" matches "This won't break anything", "I won't be available to review", etc.
  • "closing" matches "closing price", "This is a closing brace", and even common polite endings like "We're closing in on a solution".

Both entries are much looser than the corresponding entries in CLOSING_LABEL_KEYWORDS, which at least requires full label-name matches.

Consider replacing them with more specific patterns:

const CLOSING_COMMENT_KEYWORDS = [
  "not planned",
  "won't fix",
  "wontfix",
  "duplicate",
  "invalid",
  "declined",
  "rejected",
  "closing as",     // "closing as won't fix", "closing as duplicate"
  "closing this",   // "closing this PR"
];

Or switch to whole-phrase matching rather than substring inclusion.

if (out.review_comments === 0 && out.comments === 0) {
out.zero_touch = true;
}
return out;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] The reviews API call is made unconditionally for all non-merged PRs, but the result is only consumed in the open branch — so every closed (non-merged) PR evaluation incurs a wasted network round-trip.

💡 Suggested fix

Move the reviews fetch inside the open state block:

if (data.state === "closed") {
  const closingSignal = hasClosingSignal(itemRepo, num, data, ghAPIFn);
  out.result = "rejected";
  out.detail = closingSignal ? "closed without merge (strong)" : "closed without merge";
  if (data.created_at && data.closed_at) {
    out.resolution_sec = secondsBetween(data.created_at, data.closed_at);
  }
  return out;
}

// Only reached for open PRs — fetch reviews here
const reviewsRaw = ghAPIFn(`repos/${itemRepo}/pulls/${num}/reviews`);
...

This also makes the data-dependency structure clearer to future readers.


if (data.merged === true) {
out.result = "accepted";
out.detail = pushedIncluded ? "merged with pushed commit retained (strong)" : "merged";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/zoom-out] push_to_pull_request_branch never classifies as ignored/stale, while create_pull_request does. This asymmetry means a push whose PR sits untouched for months is classified pending indefinitely instead of eventually degrading to ignored.

💡 Suggestion

If the asymmetry is intentional (e.g. pushes are considered actionable regardless of staleness), document it with a brief comment here so future maintainers don't wonder if it was accidentally omitted.

If it should mirror create_pull_request, add the same stale-threshold check before returning pending:

if (!hasReviewOnPushedCommit) {
  setPendingAge(out, timestamp);
  if (isStalePending(out.pending_age_sec)) {
    out.result = "ignored";
    out.detail = "open and stale with no review on pushed commits";
  } else {
    out.result = "pending";
    out.detail = "open with no review on pushed commits";
  }
  return out;
}

it("classifies closed-without-merge PR as rejected (medium)", () => {
const ghAPI = mockAPI({
"repos/owner/repo/pulls/24": { state: "closed", merged: false, head: { sha: "bbbbbbb" } },
"repos/owner/repo/compare/aaaaaaa...bbbbbbb": { status: "ahead" },
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] The test is named "classifies open PR with mixed review state as unknown" but the mock only has a single CHANGES_REQUESTED review — that's not a truly mixed state. A real mixed state (APPROVED + CHANGES_REQUESTED simultaneously) is untested.

💡 Suggested additional test case
it("classifies open PR with both APPROVED and CHANGES_REQUESTED reviews as unknown", () => {
  const ghAPI = mockAPI({
    "repos/owner/repo/pulls/19": { state: "open", merged: false },
    "repos/owner/repo/pulls/19/reviews": [
      { state: "APPROVED" },
      { state: "CHANGES_REQUESTED" },
    ],
  });

  const result = evaluateItem(
    { type: "create_pull_request", repo: "owner/repo", url: "https://github.com/owner/repo/pull/19", timestamp: "2026-05-20T00:00:00Z" },
    "owner/repo",
    { ghAPI }
  );
  expect(result.result).toBe("unknown");
  expect(result.detail).toContain("mixed");
});

Also consider renaming the existing test to "classifies open PR with only CHANGES_REQUESTED as unknown" to reflect what it actually tests.

);
expect(result.result).toBe("unknown");
});
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] Missing test: push_to_pull_request_branch item with no commit SHA fields at all. In that case extractPushedCommitSHAs returns [], pushedStillHead is false, pushedIncluded is false, and the force-push rejection guard (pushedShas.length > 0) skips, landing in the reviews path. This whole degraded-metadata code path has no coverage.

💡 Suggested test
it("falls back to pending when push item has no commit SHA", () => {
  const ghAPI = mockAPI({
    "repos/owner/repo/pulls/27": { state: "open", merged: false, head: { sha: "bbbbbbb" } },
    "repos/owner/repo/pulls/27/reviews": [],
  });

  const result = evaluateItem(
    {
      type: "push_to_pull_request_branch",
      repo: "owner/repo",
      pull_request_number: 27,
      // no commit_sha, no before_head_sha
      timestamp: "2026-05-20T00:00:00Z",
    },
    "owner/repo",
    { ghAPI }
  );
  expect(result.result).toBe("pending");
});

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three blocking issues found

🔎 Findings summary

1. SHA length mismatch silently degrades pushedStillHead in production (high)

normalizeCommitSHA does not pad short SHAs to 40 chars. GitHub API always returns full 40-char SHAs for head.sha; workflow items often store 7-char short SHAs. The === comparison in pushedStillHead is always false in production, silently bypassing the fast path. Tests don't catch this because all test SHAs are uniformly 7 chars.

2. "closing" keyword in CLOSING_COMMENT_KEYWORDS causes false-positive strong rejections (medium)

Any comment body containing the word "closing" (e.g. auto-generated summaries, PR descriptions with "closing remarks", words like "disclosing") will trigger a "strong" rejection classification. Replace with more specific phrases like "closing as" or "closed as".

3. Open PR with only CHANGES_REQUESTED classified as unknown instead of pending (medium)

The open branch in evaluateCreatePullRequestOutcome only handles approved+no-CR and no-reviews. A PR with only CHANGES_REQUESTED falls to unknown, discarding the reviewer's explicit signal.

4. isCommitInBranchHistory behind semantics and misleading comment (low)

The inline comment conflates behind and diverged. For behind, HEAD is behind the pushed commit (branch was reset), not the same as "commit not retained" from a rebase. The false return is a safe-default but the comment needs clarification.

🔎 Code quality review by PR Code Quality Reviewer · sonnet46 3.1M

// Noop types that are tracked but not counted as actionable
// ---------------------------------------------------------------------------
const NOOP_TYPES = new Set(["noop", "missing_tool", "missing_data", "report_incomplete"]);
const CLOSING_LABEL_KEYWORDS = ["not planned", "not_planned", "wontfix", "won't fix", "duplicate", "invalid", "declined", "rejected"];
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overly broad "closing" keyword causes false positives in rejection classification: any comment body containing the word "closing" (e.g., "closing remarks", "disclosing", "I am closing this PR for now to revisit later") will count as a closing signal and upgrade a rejection to "strong".

💡 Suggested fix

Replace the vague "closing" keyword with a more intentional phrase:

const CLOSING_COMMENT_KEYWORDS = [
  "not planned", "wontfix", "duplicate", "invalid",
  "declin", "reject",
  "closing as", "closed as",  // ← instead of bare "closing"
];

A standalone "closing" substring is far too permissive. Any English sentence that uses the word will trip it, including auto-generated bot summaries or release notes that happen to contain the word.

if (data.state === "closed") {
out.result = "rejected";
out.detail = "closed without merge";
if (data.created_at && data.closed_at) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SHA length mismatch silently breaks pushedStillHead in production: normalizeCommitSHA keeps the original length (7–40 hex chars) — it does not pad short SHAs to 40 chars. In production, data.head.sha from the GitHub API is always the full 40-char SHA, while the workflow item's commit_sha is often a 7-char short SHA stored at push time. The === comparison therefore always returns false and pushedStillHead is always false, silently falling through to the more expensive compare-API path (which may still resolve correctly via GitHub's server-side short-SHA expansion, but only by accident).

💡 Suggested fix

Normalize all SHAs to a common prefix length before comparison, or use startsWith / includes to handle short-vs-full mismatches:

function shaMatches(a, b) {
  if (!a || !b) return false;
  const short = a.length < b.length ? a : b;
  const long  = a.length < b.length ? b : a;
  // short SHAs are at least 7 chars; treat as matching if long starts with short
  return short.length >= 7 && long.startsWith(short);
}

Then replace:

const pushedStillHead = currentHead ? pushedShas.some(sha => sha === currentHead) : false;

with:

const pushedStillHead = currentHead ? pushedShas.some(sha => shaMatches(sha, currentHead)) : false;

The same fix applies to the commitSHA === branchHeadSHA early return in isCommitInBranchHistory.

Note: the test suite avoids this entirely because all test SHAs are uniformly 7 chars ("aaaaaaa", "bbbbbbb"), so the regression is invisible in tests.

const compareData = ghAPIFn(`repos/${repo}/compare/${commitSHA}...${branchHeadSHA}`);
if (!compareData || typeof compareData.status !== "string") return null;
const status = compareData.status.toLowerCase();
// compare base...head semantics:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isCommitInBranchHistory compare-endpoint direction comment is misleading and the semantics are subtly inverted for behind: the comment says "behind/diverged => base commit is not retained" which is accurate for diverged but for behind it means the branch HEAD is behind the pushed commit — i.e., the pushed commit is actually ahead of (newer than) the current HEAD, which is a different situation from "not retained". A rebased branch where the pushed commit was dropped could appear as either diverged or behind depending on the rebase shape, making false an under-precise return for behind.

💡 Details

GitHub compare base...head (where base = pushed commit, head = current branch HEAD):

  • identical — pushed commit is the current head ✓
  • ahead — HEAD is ahead of pushed commit; pushed commit is an ancestor of HEAD ✓
  • behind — HEAD is behind the pushed commit; the pushed commit is newer/further than HEAD — this doesn't mean the commit was "dropped", it means the branch hasn't caught up yet. Could happen if HEAD was reset to before the push.
  • diverged — branches have diverged; pushed commit is definitively not in HEAD ancestry ✓

For the behind case, returning false ("not in history") is probably the correct safe-default, but the inline comment says "not retained on the evaluated branch tip" which skips this nuance. At minimum, treat behind and diverged distinctly in the comment, or consider returning null for behind to avoid a false-negative rejection classification.

if (data.state === "open") {
if (hasApproved && !hasChangesRequested) {
out.result = "accepted";
out.detail = "approved without requested changes";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An open PR with only CHANGES_REQUESTED reviews falls through to unknown, losing signal: the evaluation for open PRs checks hasApproved && !hasChangesRequested (accepted) and reviews.length === 0 (pending/ignored). Any other review state — including a PR that has only CHANGES_REQUESTED — hits the final unknown branch. A PR where the reviewer explicitly asked for changes is actionable signal (more like rejected-pending-fix than unknown), and classifying it as unknown discards that.

💡 Suggested fix
if (hasChangesRequested && !hasApproved) {
  out.result = "pending";
  out.detail = "open with changes requested";
  setPendingAge(out, timestamp);
  return out;
}
// existing unknown fallback below
out.result = "unknown";
out.detail = "open with mixed review state";

This keeps the semantic accurate: the output was not accepted, but it is still alive and actionable, just waiting on a fix cycle.

Comment thread actions/setup/js/evaluate_outcomes.cjs Outdated
}

const reviewsRaw = ghAPIFn(`repos/${itemRepo}/pulls/${num}/reviews`);
const reviews = Array.isArray(reviewsRaw) ? reviewsRaw : [];
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviews API null return silently coerces to [], causing wrong pending/ignored classification on API failure: Array.isArray(null) is false, so reviewsRaw null → reviews = []reviews.length === 0 → the PR is classified as pending or ignored instead of surfacing the API error. A transient reviews API failure permanently misfires the outcome.

💡 Suggested fix
const reviewsRaw = ghAPIFn(`repos/${itemRepo}/pulls/${num}/reviews`);
if (reviewsRaw === null) {
  out.result = "unknown";
  out.detail = "reviews api error";
  setPendingAge(out, timestamp);
  return out;
}
const reviews = Array.isArray(reviewsRaw) ? reviewsRaw : [];

Same pattern applies in evaluatePushToPullRequestBranchOutcome (the reviews fetch near the end of that function has the same null-coercion bug).

Comment thread actions/setup/js/evaluate_outcomes.cjs Outdated

// A strong rejection requires before-head metadata from execution time so we
// can distinguish "commit not retained" from "insufficient history context".
if (pushedShas.length > 0 && !pushedIncluded && beforeHead) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compare API failure (null return) produces a false rejected / force-pushed outcome: isCommitInBranchHistory returns null when the API call fails or the status is unrecognised. In pushedIncluded, null === true is false, so pushedIncluded = false. When beforeHead is also set, the condition pushedShas.length > 0 && !pushedIncluded && beforeHead evaluates to true and the item is classified as rejected (force-pushed away) — a hard negative outcome caused entirely by a transient API error.

💡 Suggested fix

Track whether history lookup was indeterminate, and gate the rejection on an explicit false rather than "not true":

let pushedIncluded = false;
let historyCheckIndeterminate = false;
if (currentHead && pushedShas.length > 0) {
  for (const sha of pushedShas) {
    const inHistory = isCommitInBranchHistory(itemRepo, sha, currentHead, ghAPIFn);
    if (inHistory === true) { pushedIncluded = true; break; }
    if (inHistory === null) { historyCheckIndeterminate = true; }
  }
}

// later:
if (pushedShas.length > 0 && !pushedIncluded && beforeHead) {
  if (historyCheckIndeterminate) {
    out.result = "unknown";
    out.detail = "commit history check indeterminate (api error)";
    setPendingAge(out, timestamp);
    return out;
  }
  out.result = "rejected";
  out.detail = "pushed commits were force-pushed away or branch reset";
  return out;
}

Comment thread actions/setup/js/evaluate_outcomes.cjs Outdated
function extractPushedCommitSHAs(item) {
/** @type {string[]} */
const shas = [];
const candidates = [item.commit_sha, item.pushed_commit_sha, item.head_sha, item?.metadata?.commit_sha, item?.metadata?.pushed_commit_sha];
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

item.head_sha is ambiguous and should not be used as a pushed-commit SHA: head_sha typically denotes the current branch tip at observation time, not a commit that was actually pushed in this operation. Including it in extractPushedCommitSHAs means pushedStillHead will match trivially whenever item.head_sha equals the current data.head.sha — even if the actual pushed commits were completely replaced by a force-push that left the branch tip at the same SHA by coincidence.

💡 Suggested fix

Remove item.head_sha from the scalar candidates:

const candidates = [
  item.commit_sha,
  item.pushed_commit_sha,
  // item.head_sha  ← removed: ambiguous, not a pushed-commit identifier
  item?.metadata?.commit_sha,
  item?.metadata?.pushed_commit_sha,
];

If a caller genuinely stores the pushed commit SHA in head_sha, they should rename the field to commit_sha or pushed_commit_sha to be unambiguous.

Copilot AI and others added 2 commits May 27, 2026 12:26
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copy link
Copy Markdown
Contributor Author

Copilot AI commented May 27, 2026

@copilot fix https://github.com/github/gh-aw/actions/runs/26509978987/job/78071970079?pr=35100
review all comments

Addressed and pushed in commits 3422c05 and fe07a0b.

I reviewed and fixed the evaluator review threads:

  • merged detail normalization now preserves strong accepted evidence
  • commit-retention unknown vs definitive-false is handled distinctly before force-push rejection
  • force-push rejection now requires definitive non-retention evidence
  • PR number aliases now include pr and pull_number
  • tests expanded for normalized strength, unknown-retention behavior, and alias resolution

I also reproduced the linked CI failure set locally; the previously timed-out safe_outputs_handlers tests pass in focused runs.

@pelikhan
Copy link
Copy Markdown
Collaborator

@copilot review all pull request review comments from the github-actions user

Copilot AI and others added 3 commits May 27, 2026 12:40
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copy link
Copy Markdown
Contributor Author

Copilot AI commented May 27, 2026

@copilot review all pull request review comments from the github-actions user

Reviewed all github-actions review comments and addressed the actionable ones in commits f4c432e, 9db9231, and 62ee25f.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[plan] Implement dedicated outcome evaluators for create_pull_request and push_to_pull_request_branch

4 participants