Add dedicated outcome evaluators for create_pull_request and push_to_pull_request_branch#35100
Add dedicated outcome evaluators for create_pull_request and push_to_pull_request_branch#35100Copilot wants to merge 13 commits into
create_pull_request and push_to_pull_request_branch#35100Conversation
Co-authored-by: gh-aw-bot <259018956+gh-aw-bot@users.noreply.github.com>
Co-authored-by: gh-aw-bot <259018956+gh-aw-bot@users.noreply.github.com>
Co-authored-by: gh-aw-bot <259018956+gh-aw-bot@users.noreply.github.com>
Co-authored-by: gh-aw-bot <259018956+gh-aw-bot@users.noreply.github.com>
create_pull_request and push_to_pull_request_branch
|
@copilot resolve the merge conflicts in this pull request |
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Conflicts are resolved in merge commit |
|
🧪 Test Quality Sentinel completed test quality analysis. |
|
✅ PR Code Quality Reviewer completed the code quality review. |
|
✅ Design Decision Gate 🏗️ completed the design decision gate check. No ADR enforcement needed: PR #35100 does not have the 'implementation' label and has 0 new lines of code in default business logic directories (well below the 100-line threshold). |
|
🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅ |
There was a problem hiding this comment.
Pull request overview
This PR introduces type-specific outcome evaluators for the two highest-signal safe-output item types (create_pull_request and push_to_pull_request_branch) so outcome classification is based on post-action GitHub state (merge/close/reviews/staleness/commit retention) rather than generic existence-style checks.
Changes:
- Added evaluator dispatch in
evaluateItem()to routecreate_pull_requestandpush_to_pull_request_branchitems to dedicated evaluators. - Implemented PR-state/review-based outcome logic for
create_pull_requestand commit-retention-based outcome logic forpush_to_pull_request_branch. - Added focused Vitest coverage for the new evaluator branches and introduced API-function injection for deterministic tests.
Show a summary per file
| File | Description |
|---|---|
| actions/setup/js/evaluate_outcomes.cjs | Adds dedicated evaluators + dispatch and helper utilities for PR-number/commit retention/closing-signal evaluation. |
| actions/setup/js/evaluate_outcomes.test.cjs | Adds unit tests covering the new evaluator classification paths with a mockable GitHub API function. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 2/2 changed files
- Comments generated: 6
|
|
||
| if (data.merged === true) { | ||
| out.result = "accepted"; | ||
| out.detail = "merged (strong)"; |
|
|
||
| if (data.merged === true) { | ||
| out.result = "accepted"; | ||
| out.detail = pushedIncluded ? "merged with pushed commit retained (strong)" : "merged"; |
| const pushedStillHead = currentHead ? pushedShas.some(sha => sha === currentHead) : false; | ||
| const pushedIncluded = | ||
| currentHead && pushedShas.length > 0 | ||
| ? pushedShas.some(sha => { | ||
| const inHistory = isCommitInBranchHistory(itemRepo, sha, currentHead, ghAPIFn); | ||
| return inHistory === true; | ||
| }) | ||
| : false; |
|
|
||
| // A strong rejection requires before-head metadata from execution time so we | ||
| // can distinguish "commit not retained" from "insufficient history context". | ||
| if (pushedShas.length > 0 && !pushedIncluded && beforeHead) { |
| */ | ||
| function resolvePRNumber(item) { | ||
| if (typeof item.number === "number" && item.number > 0) return item.number; | ||
| const candidates = [item.pull_request_number, item.pr_number, item.item_number]; |
|
|
||
| const result = evaluateItem({ type: "create_pull_request", repo: "owner/repo", url: "https://github.com/owner/repo/pull/12", timestamp: "2026-05-20T00:00:00Z" }, "owner/repo", { ghAPI }); | ||
| expect(result.result).toBe("accepted"); | ||
| expect(result.detail).toContain("strong"); |
|
@copilot fix https://github.com/github/gh-aw/actions/runs/26509978987/job/78071970079?pr=35100 |
🧪 Test Quality Sentinel Report✅ Test Quality Score: 100/100 — Excellent
📊 Metrics & Test Classification (13 tests analyzed)
Test Classification Details
Language SupportTests analyzed:
Verdict
📖 Understanding Test ClassificationsDesign Tests (High Value) verify what the system does:
Implementation Tests (Low Value) verify how the system does it:
Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators.
|
There was a problem hiding this comment.
Skills-Based Review 🧠
Applied /tdd, /zoom-out, and /grill-with-docs — no blocking correctness bugs found, but several targeted improvements worth addressing.
📋 Key Themes & Highlights
Key Themes
CLOSING_COMMENT_KEYWORDSover-breadth:"won't"and"closing"as bare substrings will produce false-positiverejectedclassifications on perfectly ordinary review comments. Worth tightening before this evaluator is in production.- Unnecessary
reviewsAPI call for closed PRs:evaluateCreatePullRequestOutcomefetches reviews unconditionally for all non-merged PRs, but only uses the result in theopenbranch. Easy fix: move the fetch inside theopenguard. - Test precision: The
"mixed review state"test actually only testsCHANGES_REQUESTEDalone — the true APPROVED+CHANGES_REQUESTED mixed case is untested and has different semantics worth pinning. - Architectural asymmetry:
push_to_pull_request_branchhas noignored/stale path whilecreate_pull_requestdoes. Either add it or document why pushes are intentionally exempt from stale degradation. - Missing degraded-metadata test: Push items without any commit SHA field are an important real-world failure mode (the agent failed to record metadata) and the code path is uncovered.
Positive Highlights
- ✅ API-injection pattern (
options.ghAPI) is clean and makes the new evaluators deterministically testable without mutable global hooks — great design decision. - ✅
resolvePRNumbercovers multiple field-name conventions (number,pull_request_number,pr_number, URL fallback) — robust against different safe-output shapes. - ✅
isCommitInBranchHistorysemantics are correct and the inline comments explaining base...head compare status are helpful. - ✅ Both evaluators are exported, making them independently callable in integration tests if needed.
🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · sonnet46 1.9M
| @@ -41,6 +41,8 @@ const SUMMARY_PATH = "/tmp/gh-aw/outcome-summary.json"; | |||
| // Noop types that are tracked but not counted as actionable | |||
| // --------------------------------------------------------------------------- | |||
| const NOOP_TYPES = new Set(["noop", "missing_tool", "missing_data", "report_incomplete"]); | |||
There was a problem hiding this comment.
[/grill-with-docs] CLOSING_COMMENT_KEYWORDS contains two over-broad entries — "won't" and "closing" — that will produce false-positive rejections on ordinary review comments.
💡 Why this matters and suggested tightening
Substring matching means:
"won't"matches "This won't break anything", "I won't be available to review", etc."closing"matches "closing price", "This is a closing brace", and even common polite endings like "We're closing in on a solution".
Both entries are much looser than the corresponding entries in CLOSING_LABEL_KEYWORDS, which at least requires full label-name matches.
Consider replacing them with more specific patterns:
const CLOSING_COMMENT_KEYWORDS = [
"not planned",
"won't fix",
"wontfix",
"duplicate",
"invalid",
"declined",
"rejected",
"closing as", // "closing as won't fix", "closing as duplicate"
"closing this", // "closing this PR"
];Or switch to whole-phrase matching rather than substring inclusion.
| if (out.review_comments === 0 && out.comments === 0) { | ||
| out.zero_touch = true; | ||
| } | ||
| return out; |
There was a problem hiding this comment.
[/tdd] The reviews API call is made unconditionally for all non-merged PRs, but the result is only consumed in the open branch — so every closed (non-merged) PR evaluation incurs a wasted network round-trip.
💡 Suggested fix
Move the reviews fetch inside the open state block:
if (data.state === "closed") {
const closingSignal = hasClosingSignal(itemRepo, num, data, ghAPIFn);
out.result = "rejected";
out.detail = closingSignal ? "closed without merge (strong)" : "closed without merge";
if (data.created_at && data.closed_at) {
out.resolution_sec = secondsBetween(data.created_at, data.closed_at);
}
return out;
}
// Only reached for open PRs — fetch reviews here
const reviewsRaw = ghAPIFn(`repos/${itemRepo}/pulls/${num}/reviews`);
...This also makes the data-dependency structure clearer to future readers.
|
|
||
| if (data.merged === true) { | ||
| out.result = "accepted"; | ||
| out.detail = pushedIncluded ? "merged with pushed commit retained (strong)" : "merged"; |
There was a problem hiding this comment.
[/zoom-out] push_to_pull_request_branch never classifies as ignored/stale, while create_pull_request does. This asymmetry means a push whose PR sits untouched for months is classified pending indefinitely instead of eventually degrading to ignored.
💡 Suggestion
If the asymmetry is intentional (e.g. pushes are considered actionable regardless of staleness), document it with a brief comment here so future maintainers don't wonder if it was accidentally omitted.
If it should mirror create_pull_request, add the same stale-threshold check before returning pending:
if (!hasReviewOnPushedCommit) {
setPendingAge(out, timestamp);
if (isStalePending(out.pending_age_sec)) {
out.result = "ignored";
out.detail = "open and stale with no review on pushed commits";
} else {
out.result = "pending";
out.detail = "open with no review on pushed commits";
}
return out;
}| it("classifies closed-without-merge PR as rejected (medium)", () => { | ||
| const ghAPI = mockAPI({ | ||
| "repos/owner/repo/pulls/24": { state: "closed", merged: false, head: { sha: "bbbbbbb" } }, | ||
| "repos/owner/repo/compare/aaaaaaa...bbbbbbb": { status: "ahead" }, |
There was a problem hiding this comment.
[/tdd] The test is named "classifies open PR with mixed review state as unknown" but the mock only has a single CHANGES_REQUESTED review — that's not a truly mixed state. A real mixed state (APPROVED + CHANGES_REQUESTED simultaneously) is untested.
💡 Suggested additional test case
it("classifies open PR with both APPROVED and CHANGES_REQUESTED reviews as unknown", () => {
const ghAPI = mockAPI({
"repos/owner/repo/pulls/19": { state: "open", merged: false },
"repos/owner/repo/pulls/19/reviews": [
{ state: "APPROVED" },
{ state: "CHANGES_REQUESTED" },
],
});
const result = evaluateItem(
{ type: "create_pull_request", repo: "owner/repo", url: "https://github.com/owner/repo/pull/19", timestamp: "2026-05-20T00:00:00Z" },
"owner/repo",
{ ghAPI }
);
expect(result.result).toBe("unknown");
expect(result.detail).toContain("mixed");
});Also consider renaming the existing test to "classifies open PR with only CHANGES_REQUESTED as unknown" to reflect what it actually tests.
| ); | ||
| expect(result.result).toBe("unknown"); | ||
| }); | ||
| }); |
There was a problem hiding this comment.
[/tdd] Missing test: push_to_pull_request_branch item with no commit SHA fields at all. In that case extractPushedCommitSHAs returns [], pushedStillHead is false, pushedIncluded is false, and the force-push rejection guard (pushedShas.length > 0) skips, landing in the reviews path. This whole degraded-metadata code path has no coverage.
💡 Suggested test
it("falls back to pending when push item has no commit SHA", () => {
const ghAPI = mockAPI({
"repos/owner/repo/pulls/27": { state: "open", merged: false, head: { sha: "bbbbbbb" } },
"repos/owner/repo/pulls/27/reviews": [],
});
const result = evaluateItem(
{
type: "push_to_pull_request_branch",
repo: "owner/repo",
pull_request_number: 27,
// no commit_sha, no before_head_sha
timestamp: "2026-05-20T00:00:00Z",
},
"owner/repo",
{ ghAPI }
);
expect(result.result).toBe("pending");
});There was a problem hiding this comment.
Three blocking issues found
🔎 Findings summary
1. SHA length mismatch silently degrades pushedStillHead in production (high)
normalizeCommitSHA does not pad short SHAs to 40 chars. GitHub API always returns full 40-char SHAs for head.sha; workflow items often store 7-char short SHAs. The === comparison in pushedStillHead is always false in production, silently bypassing the fast path. Tests don't catch this because all test SHAs are uniformly 7 chars.
2. "closing" keyword in CLOSING_COMMENT_KEYWORDS causes false-positive strong rejections (medium)
Any comment body containing the word "closing" (e.g. auto-generated summaries, PR descriptions with "closing remarks", words like "disclosing") will trigger a "strong" rejection classification. Replace with more specific phrases like "closing as" or "closed as".
3. Open PR with only CHANGES_REQUESTED classified as unknown instead of pending (medium)
The open branch in evaluateCreatePullRequestOutcome only handles approved+no-CR and no-reviews. A PR with only CHANGES_REQUESTED falls to unknown, discarding the reviewer's explicit signal.
4. isCommitInBranchHistory behind semantics and misleading comment (low)
The inline comment conflates behind and diverged. For behind, HEAD is behind the pushed commit (branch was reset), not the same as "commit not retained" from a rebase. The false return is a safe-default but the comment needs clarification.
🔎 Code quality review by PR Code Quality Reviewer · sonnet46 3.1M
| // Noop types that are tracked but not counted as actionable | ||
| // --------------------------------------------------------------------------- | ||
| const NOOP_TYPES = new Set(["noop", "missing_tool", "missing_data", "report_incomplete"]); | ||
| const CLOSING_LABEL_KEYWORDS = ["not planned", "not_planned", "wontfix", "won't fix", "duplicate", "invalid", "declined", "rejected"]; |
There was a problem hiding this comment.
Overly broad "closing" keyword causes false positives in rejection classification: any comment body containing the word "closing" (e.g., "closing remarks", "disclosing", "I am closing this PR for now to revisit later") will count as a closing signal and upgrade a rejection to "strong".
💡 Suggested fix
Replace the vague "closing" keyword with a more intentional phrase:
const CLOSING_COMMENT_KEYWORDS = [
"not planned", "wontfix", "duplicate", "invalid",
"declin", "reject",
"closing as", "closed as", // ← instead of bare "closing"
];A standalone "closing" substring is far too permissive. Any English sentence that uses the word will trip it, including auto-generated bot summaries or release notes that happen to contain the word.
| if (data.state === "closed") { | ||
| out.result = "rejected"; | ||
| out.detail = "closed without merge"; | ||
| if (data.created_at && data.closed_at) { |
There was a problem hiding this comment.
SHA length mismatch silently breaks pushedStillHead in production: normalizeCommitSHA keeps the original length (7–40 hex chars) — it does not pad short SHAs to 40 chars. In production, data.head.sha from the GitHub API is always the full 40-char SHA, while the workflow item's commit_sha is often a 7-char short SHA stored at push time. The === comparison therefore always returns false and pushedStillHead is always false, silently falling through to the more expensive compare-API path (which may still resolve correctly via GitHub's server-side short-SHA expansion, but only by accident).
💡 Suggested fix
Normalize all SHAs to a common prefix length before comparison, or use startsWith / includes to handle short-vs-full mismatches:
function shaMatches(a, b) {
if (!a || !b) return false;
const short = a.length < b.length ? a : b;
const long = a.length < b.length ? b : a;
// short SHAs are at least 7 chars; treat as matching if long starts with short
return short.length >= 7 && long.startsWith(short);
}Then replace:
const pushedStillHead = currentHead ? pushedShas.some(sha => sha === currentHead) : false;with:
const pushedStillHead = currentHead ? pushedShas.some(sha => shaMatches(sha, currentHead)) : false;The same fix applies to the commitSHA === branchHeadSHA early return in isCommitInBranchHistory.
Note: the test suite avoids this entirely because all test SHAs are uniformly 7 chars ("aaaaaaa", "bbbbbbb"), so the regression is invisible in tests.
| const compareData = ghAPIFn(`repos/${repo}/compare/${commitSHA}...${branchHeadSHA}`); | ||
| if (!compareData || typeof compareData.status !== "string") return null; | ||
| const status = compareData.status.toLowerCase(); | ||
| // compare base...head semantics: |
There was a problem hiding this comment.
isCommitInBranchHistory compare-endpoint direction comment is misleading and the semantics are subtly inverted for behind: the comment says "behind/diverged => base commit is not retained" which is accurate for diverged but for behind it means the branch HEAD is behind the pushed commit — i.e., the pushed commit is actually ahead of (newer than) the current HEAD, which is a different situation from "not retained". A rebased branch where the pushed commit was dropped could appear as either diverged or behind depending on the rebase shape, making false an under-precise return for behind.
💡 Details
GitHub compare base...head (where base = pushed commit, head = current branch HEAD):
identical— pushed commit is the current head ✓ahead— HEAD is ahead of pushed commit; pushed commit is an ancestor of HEAD ✓behind— HEAD is behind the pushed commit; the pushed commit is newer/further than HEAD — this doesn't mean the commit was "dropped", it means the branch hasn't caught up yet. Could happen if HEAD was reset to before the push.diverged— branches have diverged; pushed commit is definitively not in HEAD ancestry ✓
For the behind case, returning false ("not in history") is probably the correct safe-default, but the inline comment says "not retained on the evaluated branch tip" which skips this nuance. At minimum, treat behind and diverged distinctly in the comment, or consider returning null for behind to avoid a false-negative rejection classification.
| if (data.state === "open") { | ||
| if (hasApproved && !hasChangesRequested) { | ||
| out.result = "accepted"; | ||
| out.detail = "approved without requested changes"; |
There was a problem hiding this comment.
An open PR with only CHANGES_REQUESTED reviews falls through to unknown, losing signal: the evaluation for open PRs checks hasApproved && !hasChangesRequested (accepted) and reviews.length === 0 (pending/ignored). Any other review state — including a PR that has only CHANGES_REQUESTED — hits the final unknown branch. A PR where the reviewer explicitly asked for changes is actionable signal (more like rejected-pending-fix than unknown), and classifying it as unknown discards that.
💡 Suggested fix
if (hasChangesRequested && !hasApproved) {
out.result = "pending";
out.detail = "open with changes requested";
setPendingAge(out, timestamp);
return out;
}
// existing unknown fallback below
out.result = "unknown";
out.detail = "open with mixed review state";This keeps the semantic accurate: the output was not accepted, but it is still alive and actionable, just waiting on a fix cycle.
| } | ||
|
|
||
| const reviewsRaw = ghAPIFn(`repos/${itemRepo}/pulls/${num}/reviews`); | ||
| const reviews = Array.isArray(reviewsRaw) ? reviewsRaw : []; |
There was a problem hiding this comment.
Reviews API null return silently coerces to [], causing wrong pending/ignored classification on API failure: Array.isArray(null) is false, so reviewsRaw null → reviews = [] → reviews.length === 0 → the PR is classified as pending or ignored instead of surfacing the API error. A transient reviews API failure permanently misfires the outcome.
💡 Suggested fix
const reviewsRaw = ghAPIFn(`repos/${itemRepo}/pulls/${num}/reviews`);
if (reviewsRaw === null) {
out.result = "unknown";
out.detail = "reviews api error";
setPendingAge(out, timestamp);
return out;
}
const reviews = Array.isArray(reviewsRaw) ? reviewsRaw : [];Same pattern applies in evaluatePushToPullRequestBranchOutcome (the reviews fetch near the end of that function has the same null-coercion bug).
|
|
||
| // A strong rejection requires before-head metadata from execution time so we | ||
| // can distinguish "commit not retained" from "insufficient history context". | ||
| if (pushedShas.length > 0 && !pushedIncluded && beforeHead) { |
There was a problem hiding this comment.
Compare API failure (null return) produces a false rejected / force-pushed outcome: isCommitInBranchHistory returns null when the API call fails or the status is unrecognised. In pushedIncluded, null === true is false, so pushedIncluded = false. When beforeHead is also set, the condition pushedShas.length > 0 && !pushedIncluded && beforeHead evaluates to true and the item is classified as rejected (force-pushed away) — a hard negative outcome caused entirely by a transient API error.
💡 Suggested fix
Track whether history lookup was indeterminate, and gate the rejection on an explicit false rather than "not true":
let pushedIncluded = false;
let historyCheckIndeterminate = false;
if (currentHead && pushedShas.length > 0) {
for (const sha of pushedShas) {
const inHistory = isCommitInBranchHistory(itemRepo, sha, currentHead, ghAPIFn);
if (inHistory === true) { pushedIncluded = true; break; }
if (inHistory === null) { historyCheckIndeterminate = true; }
}
}
// later:
if (pushedShas.length > 0 && !pushedIncluded && beforeHead) {
if (historyCheckIndeterminate) {
out.result = "unknown";
out.detail = "commit history check indeterminate (api error)";
setPendingAge(out, timestamp);
return out;
}
out.result = "rejected";
out.detail = "pushed commits were force-pushed away or branch reset";
return out;
}| function extractPushedCommitSHAs(item) { | ||
| /** @type {string[]} */ | ||
| const shas = []; | ||
| const candidates = [item.commit_sha, item.pushed_commit_sha, item.head_sha, item?.metadata?.commit_sha, item?.metadata?.pushed_commit_sha]; |
There was a problem hiding this comment.
item.head_sha is ambiguous and should not be used as a pushed-commit SHA: head_sha typically denotes the current branch tip at observation time, not a commit that was actually pushed in this operation. Including it in extractPushedCommitSHAs means pushedStillHead will match trivially whenever item.head_sha equals the current data.head.sha — even if the actual pushed commits were completely replaced by a force-push that left the branch tip at the same SHA by coincidence.
💡 Suggested fix
Remove item.head_sha from the scalar candidates:
const candidates = [
item.commit_sha,
item.pushed_commit_sha,
// item.head_sha ← removed: ambiguous, not a pushed-commit identifier
item?.metadata?.commit_sha,
item?.metadata?.pushed_commit_sha,
];If a caller genuinely stores the pushed commit SHA in head_sha, they should rename the field to commit_sha or pushed_commit_sha to be unambiguous.
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Addressed and pushed in commits I reviewed and fixed the evaluator review threads:
I also reproduced the linked CI failure set locally; the previously timed-out |
|
@copilot review all pull request review comments from the github-actions user |
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Reviewed all github-actions review comments and addressed the actionable ones in commits |
This change replaces generic/fallback outcome checks for the two highest-signal safe-output types with type-specific evaluators. The new logic classifies PR creation and PR-branch pushes using post-action GitHub state, review signals, staleness, and commit-retention evidence instead of existence-style acceptance.
create_pull_request: typed outcome classificationcreate_pull_request.accepted(strong): merged PRaccepted(medium): approved PR with no requested changesrejected(strong/medium): closed without merge, with stronger signal when closing labels/comments indicate explicit rejectionpending: open PR with no reviews, or open with onlyCHANGES_REQUESTEDignored: stale open PR (configurable stale threshold)unknown: mixed/ambiguous states or API degradation (including review API failure)push_to_pull_request_branch: commit-retention based evaluationpush_to_pull_request_branch.commit_sha, before-head variants) plus branch history comparison to distinguish:ignoredfor stale open PRs with no review on pushed commitsEvaluator architecture and maintainability
evaluateItemnow dispatches by safe-output type to dedicated evaluator functions.head_shafrom pushed-commit extraction to avoid misclassifying retained pushes.Focused test coverage
actions/setup/js/evaluate_outcomes.test.cjswith branch-complete coverage for both evaluators:head_shaambiguity guard