# Sampling

Sampling is a powerful MCP feature that allows servers to request LLM completions from the client. Instead of the server needing its own LLM access, it can "borrow" the client's language model to generate text, analyze content, or perform any LLM task.

## How It Works

In a typical MCP interaction, the client calls tools on the server. With sampling, the flow is reversed for part of the interaction:

```
Client                              Server
  │                                   │
  │  call_tool("summarize")           │
  │──────────────────────────────────>│
  │                                   │
  │      sampling/createMessage       │
  │<──────────────────────────────────│
  │                                   │
  │  (client calls LLM)               │
  │                                   │
  │      CreateMessageResult          │
  │──────────────────────────────────>│
  │                                   │
  │      tool result                  │
  │<──────────────────────────────────│
```

1. The client calls a tool on the server.
2. The server's tool handler sends a `sampling/createMessage` request back to the client.
3. The client's sampling callback processes the request (typically by calling an LLM).
4. The client returns the LLM response to the server.
5. The server uses the response to complete the tool execution.
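
On the wire, the step-2 request carries the conversation and generation parameters as JSON with camelCase field names (`systemPrompt`, `maxTokens`), per the MCP specification. As an illustration only (plain dicts, not the SDK types), a client-side conversion into a generic chat-completion request might look like:

```python
def to_chat_request(params: dict) -> dict:
    """Flatten an MCP sampling/createMessage payload (plain dicts, for
    illustration) into a generic chat-completion request."""
    messages = []
    if params.get("systemPrompt"):
        messages.append({"role": "system", "content": params["systemPrompt"]})
    for msg in params.get("messages", []):
        # Only text content is forwarded in this sketch
        if msg["content"]["type"] == "text":
            messages.append({"role": msg["role"], "content": msg["content"]["text"]})
    return {"messages": messages, "max_tokens": params["maxTokens"]}
```

The client sections below show how the SDK hands you these parameters as typed objects instead of raw JSON.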

## Server Side

On the server side, use `ctx.session.create_message()` inside a tool handler to request a completion:

--8<-- "examples/snippets/servers/sampling.py"

The `create_message` method accepts these parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| `messages` | `list[SamplingMessage]` | The conversation messages to send |
| `max_tokens` | `int` | Maximum number of tokens in the response |
| `system_prompt` | `str \| None` | Optional system prompt |
| `temperature` | `float \| None` | Sampling temperature (0.0 = deterministic) |
| `stop_sequences` | `list[str] \| None` | Sequences that stop generation |
| `model_preferences` | `ModelPreferences \| None` | Hints about which model to use |

## Client Side

On the client side, provide a `sampling_callback` when creating the session. This callback handles `sampling/createMessage` requests from the server:

```python
from mcp import ClientSession, StdioServerParameters, types
from mcp.client.context import ClientRequestContext
from mcp.client.stdio import stdio_client


async def handle_sampling(
    context: ClientRequestContext,
    params: types.CreateMessageRequestParams,
) -> types.CreateMessageResult:
    # Forward the request to your LLM
    # ... call OpenAI, Anthropic, Azure OpenAI, etc.
    return types.CreateMessageResult(
        role="assistant",
        content=types.TextContent(type="text", text="LLM response here"),
        model="your-model-name",
        stop_reason="endTurn",
    )


async def main():
    server_params = StdioServerParameters(command="your-server-command")

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(
            read,
            write,
            sampling_callback=handle_sampling,
        ) as session:
            await session.initialize()

            # Now when you call a tool that uses sampling,
            # your callback will be invoked automatically
            result = await session.call_tool("summarize", {"text": "..."})
```

### Using the High-Level Client

The `Client` class also supports sampling callbacks:

```python
from mcp import Client

async with Client(server, sampling_callback=handle_sampling) as client:
    result = await client.call_tool("summarize", {"text": "..."})
```

## Integrating with LLM Providers

Here is how to connect the sampling callback to popular LLM providers:

### OpenAI

```python
from mcp import types
from mcp.client.context import ClientRequestContext
from openai import AsyncOpenAI

openai_client = AsyncOpenAI()


async def handle_sampling(
    context: ClientRequestContext,
    params: types.CreateMessageRequestParams,
) -> types.CreateMessageResult:
    messages = []
    if params.system_prompt:
        messages.append({"role": "system", "content": params.system_prompt})
    for msg in params.messages:
        # Only text content is forwarded here
        if isinstance(msg.content, types.TextContent):
            messages.append({"role": msg.role, "content": msg.content.text})

    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        max_tokens=params.max_tokens,
        temperature=params.temperature,
    )
    return types.CreateMessageResult(
        role="assistant",
        content=types.TextContent(
            type="text", text=response.choices[0].message.content or ""
        ),
        model=response.model,
        stop_reason="endTurn",
    )
```

### Anthropic

```python
from anthropic import AsyncAnthropic

from mcp import types
from mcp.client.context import ClientRequestContext

anthropic_client = AsyncAnthropic()


async def handle_sampling(
    context: ClientRequestContext,
    params: types.CreateMessageRequestParams,
) -> types.CreateMessageResult:
    messages = [
        {"role": msg.role, "content": msg.content.text}
        for msg in params.messages
        if isinstance(msg.content, types.TextContent)
    ]

    response = await anthropic_client.messages.create(
        model="claude-sonnet-4-20250514",
        messages=messages,
        max_tokens=params.max_tokens or 1024,
        system=params.system_prompt or "",
    )
    return types.CreateMessageResult(
        role="assistant",
        content=types.TextContent(type="text", text=response.content[0].text),
        model=response.model,
        stop_reason="endTurn",
    )
```

## Complete Example

For a complete working example with both server and client, see:

- **Server**: [`examples/servers/simple-sampling`](https://github.com/modelcontextprotocol/python-sdk/tree/main/examples/servers/simple-sampling)
- **Client**: [`examples/clients/simple-sampling-client`](https://github.com/modelcontextprotocol/python-sdk/tree/main/examples/clients/simple-sampling-client)

## Model Preferences

Servers can provide hints about which model to use via `model_preferences`:

```python
from mcp.types import ModelHint, ModelPreferences

result = await ctx.session.create_message(
    messages=[...],
    max_tokens=100,
    model_preferences=ModelPreferences(
        hints=[ModelHint(name="claude-sonnet-4-20250514")],
        cost_priority=0.5,
        speed_priority=0.8,
        intelligence_priority=0.7,
    ),
)
```

The client can use these hints to select an appropriate model, but is not required to follow them.
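
Hint names are treated as substrings or family names rather than exact identifiers, so a client typically matches them loosely against the models it can actually serve. A hypothetical matching routine (plain Python over hint-name strings, not part of the SDK):

```python
def select_model(hints: list[str], available: list[str], default: str) -> str:
    """Pick the first available model whose name contains a hint.

    Hints are tried in order, so earlier hints take priority; if nothing
    matches, fall back to the client's default model.
    """
    for hint in hints:
        for model in available:
            if hint in model:
                return model
    return default
```

A fuller implementation might also weigh `cost_priority`, `speed_priority`, and `intelligence_priority` when several models match.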