# The Coding Harness Behind GitHub Copilot in VS Code

May 7, 2026 by [Julia Kasper](https://github.com/jukasper)

Every few months, a new model drops and the conversation resets. Which one is smartest? Which one is fastest? Which one should we ship? Those are useful questions, but for a product like Visual Studio Code they are incomplete. A model is only one part of the experience. What developers actually feel is the coding harness: the layer that assembles context, exposes tools, runs the agent loop, interprets tool calls, and turns a model's output into something useful inside the editor. In this post, we explain what the coding harness is, how it adapts to each model, and how we evaluate models and harness changes before and after they ship.

## What is the coding harness?

Language models do not edit files, execute commands, or run tests by themselves. They can only produce text. The coding harness is the system that acts as a bridge between the code editor and the language model. It turns that text into action and feeds the results back so the model can decide what to do next.

Each pass through this loop is called a round. A turn, by contrast, is the full exchange that starts with a user message and ends when the agent finishes responding. A single turn might span dozens of rounds as the model reads and searches files, edits code, runs tests, reads the output, and iterates on failures.

The loop is not unbounded. The harness enforces a tool-call limit, checks for cancellation between rounds, and runs stop hooks: extension points that can inspect the model's state and either allow it to finish or push it to keep working ("you were about to stop, but the tests still fail").

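To make the shape of that loop concrete, here is a minimal sketch in TypeScript. It is an illustration under assumptions, not the actual VS Code implementation: the round limit, the `runStopHooks` contract, and helper names like `invokeModel` and `executeTool` are all hypothetical.

```typescript
// A minimal sketch of an agent loop with a round limit, cancellation
// checks, and stop hooks. All names here are illustrative assumptions,
// not the real VS Code API.
interface ToolCall { name: string; args: Record<string, unknown>; }
interface ModelResponse { text: string; toolCalls: ToolCall[]; }

const MAX_ROUNDS = 50; // hypothetical tool-call budget

async function runTurn(
  userMessage: string,
  invokeModel: (prompt: string) => Promise<ModelResponse>,
  executeTool: (call: ToolCall) => Promise<string>,
  isCancelled: () => boolean,
  // Returns a nudge to keep working, or undefined to allow stopping.
  runStopHooks: () => Promise<string | undefined>,
): Promise<string> {
  const history: string[] = [userMessage];

  for (let round = 0; round < MAX_ROUNDS; round++) {
    if (isCancelled()) return "Cancelled by the user.";

    // The prompt is rebuilt every round, so the model always sees
    // the latest conversation history and workspace state.
    const prompt = history.join("\n");
    const response = await invokeModel(prompt);
    history.push(response.text);

    if (response.toolCalls.length === 0) {
      // The model wants to stop; a stop hook may push it to continue.
      const nudge = await runStopHooks();
      if (nudge === undefined) return response.text;
      history.push(nudge); // e.g. "the tests still fail"
      continue;
    }

    // Execute each tool call and feed the results back into the loop.
    for (const call of response.toolCalls) {
      history.push(await executeTool(call));
    }
  }
  return "Stopped: tool-call limit reached.";
}
```

The important property is that every round feeds tool output back into the history, so the next model request always sees what actually happened.
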
The harness also rebuilds the prompt on every iteration of this loop. That means the model always sees the latest state of the workspace: if it edited a file three rounds ago, the current prompt reflects that edit. Conversation summarization is part of the same job. When the accumulated history grows too large, the harness compresses earlier rounds into a summary so the model can keep working without hitting the context window ceiling.

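Summarization can be pictured as a compaction step that runs between rounds. The sketch below is a simplified assumption of how such a step might look; the token budget, the number of rounds kept verbatim, and the `summarize` helper are illustrative, not the real implementation.

```typescript
// A sketch of history compaction, assuming a token budget and a
// summarize() helper that asks the model to compress older rounds.
const CONTEXT_BUDGET = 100_000; // hypothetical token limit
const KEEP_RECENT = 10;         // keep the most recent rounds verbatim

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4); // rough heuristic: ~4 chars per token
}

async function compactHistory(
  history: string[],
  summarize: (rounds: string[]) => Promise<string>,
): Promise<string[]> {
  const total = history.reduce((sum, h) => sum + estimateTokens(h), 0);
  if (total <= CONTEXT_BUDGET || history.length <= KEEP_RECENT) {
    return history; // still fits; nothing to do
  }
  // Compress everything except the most recent rounds into one summary.
  const older = history.slice(0, history.length - KEEP_RECENT);
  const recent = history.slice(history.length - KEEP_RECENT);
  const summary = await summarize(older);
  return [summary, ...recent];
}
```

Keeping the most recent rounds verbatim preserves the details the model is actively working with, while older context survives only in compressed form.
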
When a new model ships, it needs to fit into an existing harness. The system prompt, the tool definitions, the loop logic, the context assembly: all of it was built and tuned over many months of real-world use. The model gets better at filling in the blanks, but the harness defines what the blanks are.

This matters even more because GitHub Copilot spans multiple model providers. GitHub Copilot in VS Code supports a growing model ecosystem. Developers can switch between models, use auto-selection, bring their own keys, or install extra providers via extensions. This means that VS Code has to deal with a broad and continuously evolving ecosystem, not a single stable API.

Different models need different harness behavior. Claude models use `replace_string_in_file` for edits; GPT models use `apply_patch`. Gemini needs reminders to use tool-calling instead of narrating it, and breaks on orphaned tool calls in history. Some models support extended thinking and need reasoning-effort controls. Some work best with a concise system prompt; others need verbose, structured instructions to stay on track. The harness selects different system prompts per model: Claude Sonnet 4 gets a different prompt than Claude 4.5, which gets a different one than Opus.

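One way to picture this is a per-model configuration that the harness consults before each session. The sketch below reuses the tool names mentioned above, but the config shape, the prompt file paths, and which families get reasoning-effort controls are assumptions for illustration, not the actual implementation.

```typescript
// A sketch of per-model harness configuration. Which family gets which
// setting is illustrative; only the edit-tool names and the Gemini
// reminder come from the behaviors described above.
interface HarnessConfig {
  editTool: "replace_string_in_file" | "apply_patch";
  systemPrompt: string;       // hypothetical per-model prompt file
  reasoningEffort?: "low" | "medium" | "high";
  extraReminders: string[];
}

function configFor(modelFamily: string): HarnessConfig {
  switch (modelFamily) {
    case "claude":
      return {
        editTool: "replace_string_in_file",
        systemPrompt: "prompts/claude.md",
        extraReminders: [],
      };
    case "gpt":
      return {
        editTool: "apply_patch",
        systemPrompt: "prompts/gpt.md",
        reasoningEffort: "medium",
        extraReminders: [],
      };
    case "gemini":
      return {
        editTool: "replace_string_in_file",
        systemPrompt: "prompts/gemini.md",
        // Gemini needs nudges to call tools rather than narrate them.
        extraReminders: ["Use tool calls directly; do not describe them in prose."],
      };
    default:
      return {
        editTool: "apply_patch",
        systemPrompt: "prompts/default.md",
        extraReminders: [],
      };
  }
}
```
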
All these per-model differences aren't trivial. They translate into per-model system prompts, per-model tool sets, and per-model conversation management. This means that when a new model ships, we don't just flip a switch; we need to validate its behavior. We validate tool schemas, retune defaults, and re-run full agent sessions before anything ships. The harder question is how we know those changes actually made things better.

## Evaluation keeps the harness honest

Just like you need to test a new feature before you ship it, models also need to be tested. That's where model evaluation comes in. Before a model ships in VS Code, we evaluate it from multiple angles. We run offline benchmarks, test it internally, and compare it against the models already available in the product. After the model is live, we keep measuring: A/B tests, aggregate usage signals, and weekly reporting help us understand how the model behaves in real developer workflows.

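In its simplest form, an offline comparison boils down to running the same task suite against a candidate and a baseline and comparing pass rates. The sketch below is a deliberately reduced assumption of that idea; the real pipeline is far more involved, and the types and logic here are hypothetical.

```typescript
// A sketch of an offline evaluation run: execute the same task suite
// against a candidate model and a baseline, then compare pass rates.
interface EvalTask { id: string; run: (model: string) => Promise<boolean>; }

async function passRate(model: string, tasks: EvalTask[]): Promise<number> {
  let passed = 0;
  for (const task of tasks) {
    if (await task.run(model)) passed++;
  }
  return passed / tasks.length;
}

async function compareModels(
  candidate: string,
  baseline: string,
  tasks: EvalTask[],
): Promise<void> {
  const [cand, base] = await Promise.all([
    passRate(candidate, tasks),
    passRate(baseline, tasks),
  ]);
  console.log(`${candidate}: ${(cand * 100).toFixed(1)}% pass`);
  console.log(`${baseline}: ${(base * 100).toFixed(1)}% pass`);
  if (cand < base) {
    console.warn("Candidate regresses against the baseline; investigate before shipping.");
  }
}
```
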
There are multiple public model benchmarks, which are useful as shared reference points. We use these benchmarks to compare against the broader model ecosystem and to catch obvious regressions. But at frontier levels, they are no longer enough on their own.

One of the issues with public benchmarks is coverage. SWE-bench is valuable, but it is still centered on public bug-fixing tasks. Terminal-Bench is useful for measuring command-line competence, but many tasks look more like isolated terminal puzzles than the kinds of workflows developers actually bring to an editor. Real-world coding agents need to do more than patch a known bug or solve a shell challenge. They need to scaffold projects, migrate codebases, refactor across files, follow instructions, and handle terminals and browsers.