# The Coding Harness Behind GitHub Copilot in VS Code

May 7, 2026 by [Julia Kasper](https://github.com/jukasper)

Every few months, a new model drops and the conversation resets. Which one is smartest? Which one is fastest? Which one should we ship? Those are useful questions, but for a product like Visual Studio Code they are incomplete. A model is only one part of the experience. What developers actually feel is the coding harness: the layer that assembles context, exposes tools, runs the agent loop, interprets tool calls, and turns a model's output into something useful inside the editor. In this post, we explain what the coding harness is, how it adapts to each model, and how we evaluate models and harness changes before and after they ship.

## What is the coding harness?

Language models do not edit files, execute commands, or run tests by themselves. They can only produce text. The coding harness is the system that acts as a bridge between the code editor and the language model. It turns that text into action and feeds the results back so the model can decide what to do next.

Each pass through this loop is called a round. A turn, by contrast, is the full exchange that starts with a user message and ends when the agent finishes responding. A single turn might span dozens of rounds as the model reads and searches files, edits code, runs tests, reads the output, and iterates on failures.

The loop is not unbounded. The harness enforces a tool-call limit, checks for cancellation between rounds, and runs stop hooks: extension points that can inspect the model's state and either allow it to finish or push it to keep working ("you were about to stop, but the tests still fail").

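To make the shape of that loop concrete, here is a minimal sketch in TypeScript. It is an illustration under assumptions, not the actual VS Code implementation: the round limit, the `runStopHooks` contract, and helper names like `invokeModel` and `executeTool` are all hypothetical.

```typescript
// A minimal sketch of an agent loop with a round limit, cancellation
// checks, and stop hooks. All names here are illustrative assumptions,
// not the real VS Code API.
interface ToolCall { name: string; args: Record<string, unknown>; }
interface ModelResponse { text: string; toolCalls: ToolCall[]; }

const MAX_ROUNDS = 50; // hypothetical tool-call budget

async function runTurn(
  userMessage: string,
  invokeModel: (prompt: string) => Promise<ModelResponse>,
  executeTool: (call: ToolCall) => Promise<string>,
  isCancelled: () => boolean,
  // Returns a nudge to keep working, or undefined to allow stopping.
  runStopHooks: () => Promise<string | undefined>,
): Promise<string> {
  const history: string[] = [userMessage];

  for (let round = 0; round < MAX_ROUNDS; round++) {
    if (isCancelled()) return "Cancelled by the user.";

    // The prompt is rebuilt every round, so the model always sees
    // the latest conversation history and workspace state.
    const prompt = history.join("\n");
    const response = await invokeModel(prompt);
    history.push(response.text);

    if (response.toolCalls.length === 0) {
      // The model wants to stop; a stop hook may push it to continue.
      const nudge = await runStopHooks();
      if (nudge === undefined) return response.text;
      history.push(nudge); // e.g. "the tests still fail"
      continue;
    }

    // Execute each tool call and feed the results back into the loop.
    for (const call of response.toolCalls) {
      history.push(await executeTool(call));
    }
  }
  return "Stopped: tool-call limit reached.";
}
```

The important property is that every round feeds tool output back into the history, so the next model request always sees what actually happened.
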
The harness also rebuilds the prompt on every iteration of this loop. That means the model always sees the latest state of the workspace: if it edited a file three rounds ago, the current prompt reflects that edit. Conversation summarization is part of the same job. When the accumulated history grows too large, the harness compresses earlier rounds into a summary so the model can keep working without hitting the context window ceiling.

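Summarization can be pictured as a compaction step that runs between rounds. The sketch below is a simplified assumption of how such a step might look; the token budget, the number of rounds kept verbatim, and the `summarize` helper are illustrative, not the real implementation.

```typescript
// A sketch of history compaction, assuming a token budget and a
// summarize() helper that asks the model to compress older rounds.
const CONTEXT_BUDGET = 100_000; // hypothetical token limit
const KEEP_RECENT = 10;         // keep the most recent rounds verbatim

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4); // rough heuristic: ~4 chars per token
}

async function compactHistory(
  history: string[],
  summarize: (rounds: string[]) => Promise<string>,
): Promise<string[]> {
  const total = history.reduce((sum, h) => sum + estimateTokens(h), 0);
  if (total <= CONTEXT_BUDGET || history.length <= KEEP_RECENT) {
    return history; // still fits; nothing to do
  }
  // Compress everything except the most recent rounds into one summary.
  const older = history.slice(0, history.length - KEEP_RECENT);
  const recent = history.slice(history.length - KEEP_RECENT);
  const summary = await summarize(older);
  return [summary, ...recent];
}
```

Keeping the most recent rounds verbatim preserves the details the model is actively working with, while older context survives only in compressed form.
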
When a new model ships, it needs to fit into an existing harness. The system prompt, the tool definitions, the loop logic, the context assembly: all of it was built and tuned over many months of real-world use. The model gets better at filling in the blanks, but the harness defines what the blanks are.

This matters even more because GitHub Copilot spans multiple model providers. GitHub Copilot in VS Code supports a growing model ecosystem. Developers can switch between models, use auto-selection, bring their own keys, or install extra providers via extensions. This means that VS Code has to deal with a broad and continuously evolving ecosystem, not a single stable API.

Different models need different harness behavior. Claude models use `replace_string_in_file` for edits; GPT models use `apply_patch`. Gemini needs reminders to use tool-calling instead of narrating it, and breaks on orphaned tool calls in history. Some models support extended thinking and need reasoning-effort controls. Some work best with a concise system prompt; others need verbose, structured instructions to stay on track. The harness selects different system prompts per model: Claude Sonnet 4 gets a different prompt than Claude 4.5, which gets a different one than Opus.

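One way to picture this is a per-model configuration that the harness consults before each session. The sketch below reuses the tool names mentioned above, but the config shape, the prompt file paths, and which families get reasoning-effort controls are assumptions for illustration, not the actual implementation.

```typescript
// A sketch of per-model harness configuration. Which family gets which
// setting is illustrative; only the edit-tool names and the Gemini
// reminder come from the behaviors described above.
interface HarnessConfig {
  editTool: "replace_string_in_file" | "apply_patch";
  systemPrompt: string;       // hypothetical per-model prompt file
  reasoningEffort?: "low" | "medium" | "high";
  extraReminders: string[];
}

function configFor(modelFamily: string): HarnessConfig {
  switch (modelFamily) {
    case "claude":
      return {
        editTool: "replace_string_in_file",
        systemPrompt: "prompts/claude.md",
        extraReminders: [],
      };
    case "gpt":
      return {
        editTool: "apply_patch",
        systemPrompt: "prompts/gpt.md",
        reasoningEffort: "medium",
        extraReminders: [],
      };
    case "gemini":
      return {
        editTool: "replace_string_in_file",
        systemPrompt: "prompts/gemini.md",
        // Gemini needs nudges to call tools rather than narrate them.
        extraReminders: ["Use tool calls directly; do not describe them in prose."],
      };
    default:
      return {
        editTool: "apply_patch",
        systemPrompt: "prompts/default.md",
        extraReminders: [],
      };
  }
}
```
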
All these per-model differences aren't trivial. They translate into per-model system prompts, per-model tool sets, and per-model conversation management. This means that when a new model ships, we don't just flip a switch; we need to validate its behavior. We validate tool schemas, retune defaults, and re-run full agent sessions before anything ships. The harder question is how we know those changes actually made things better.

## Evaluation keeps the harness honest

Just like you need to test a new feature before you ship it, models also need to be tested. That's where model evaluation comes in. Before a model ships in VS Code, we evaluate it from multiple angles. We run offline benchmarks, test it internally, and compare it against the models already available in the product. After the model is live, we keep measuring: A/B tests, aggregate usage signals, and weekly reporting help us understand how the model behaves in real developer workflows.

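In its simplest form, an offline comparison boils down to running the same task suite against a candidate and a baseline and comparing pass rates. The sketch below is a deliberately reduced assumption of that idea; the real pipeline is far more involved, and the types and logic here are hypothetical.

```typescript
// A sketch of an offline evaluation run: execute the same task suite
// against a candidate model and a baseline, then compare pass rates.
interface EvalTask { id: string; run: (model: string) => Promise<boolean>; }

async function passRate(model: string, tasks: EvalTask[]): Promise<number> {
  let passed = 0;
  for (const task of tasks) {
    if (await task.run(model)) passed++;
  }
  return passed / tasks.length;
}

async function compareModels(
  candidate: string,
  baseline: string,
  tasks: EvalTask[],
): Promise<void> {
  const [cand, base] = await Promise.all([
    passRate(candidate, tasks),
    passRate(baseline, tasks),
  ]);
  console.log(`${candidate}: ${(cand * 100).toFixed(1)}% pass`);
  console.log(`${baseline}: ${(base * 100).toFixed(1)}% pass`);
  if (cand < base) {
    console.warn("Candidate regresses against the baseline; investigate before shipping.");
  }
}
```
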
There are multiple public model benchmarks, which are useful as shared reference points. We use these benchmarks to compare against the broader model ecosystem and to catch obvious regressions. But at frontier levels, they are no longer enough on their own.

One of the issues with public benchmarks is coverage. SWE-bench is valuable, but it is still centered on public bug-fixing tasks. Terminal-Bench is useful for measuring command-line competence, but many tasks look more like isolated terminal puzzles than the kinds of workflows developers actually bring to an editor. Real-world coding agents need to do more than patch a known bug or solve a shell challenge. They need to scaffold projects, migrate codebases, refactor across files, follow instructions, and handle terminals and browsers.