braintrustdata · Abhijeet Prasad (AbhiPrasad) · Mar 27, 2026 · Mar 27, 2026
diff --git a/.agents/skills/sdk-integrations/SKILL.md b/.agents/skills/sdk-integrations/SKILL.md
@@ -13,12 +13,19 @@ If the provider already has a real implementation under `py/src/braintrust/wrapp
 
 Start from one structural reference and one patching reference instead of designing from scratch:
 
-- ADK (`py/src/braintrust/integrations/adk/`) for direct method patching, `target_module`, `CompositeFunctionWrapperPatcher`, manual `wrap_*()` helpers, and priority-based context propagation.
+- ADK (`py/src/braintrust/integrations/adk/`) for direct method patching, `target_module`, `CompositeFunctionWrapperPatcher`, manual `wrap_*()` helpers, priority-based context propagation, and input-side `inline_data` to `Attachment` conversion.
 - Agno (`py/src/braintrust/integrations/agno/`) for multi-target patching, version-conditional fallbacks with `superseded_by`, and providers that need several related patchers.
 - Anthropic (`py/src/braintrust/integrations/anthropic/`) for constructor patching and a compact provider package with a small public surface.
+- Google GenAI (`py/src/braintrust/integrations/google_genai/`) for multimodal serialization, generated media outputs, and output-side `Attachment` handling.
 
 Match an existing repo pattern unless the target provider forces a different shape.
 
+Choose the example based on the hardest part of the task, not just provider similarity:
+
+- If the task is mostly about patcher topology, copy the closest patcher layout first.
+- If the task is mostly about traced payload shaping, copy the closest tracing implementation first.
+- If the task involves generated media or multimodal payloads, start from ADK or Google GenAI before looking at simpler text-only integrations.
+
 ## Read First
 
 Always read:
@@ -42,6 +49,25 @@ Read when relevant:
 - `py/src/braintrust/conftest.py` for VCR behavior
 - `py/src/braintrust/integrations/auto_test_scripts/` for subprocess auto-instrument tests
 - `py/src/braintrust/integrations/adk/test_adk.py` and `py/src/braintrust/integrations/anthropic/test_anthropic.py` for test layout patterns
+- `py/src/braintrust/integrations/adk/tracing.py` and `py/src/braintrust/integrations/google_genai/tracing.py` when the provider accepts binary inputs, emits generated files, or otherwise needs `Attachment` objects in traced input/output
+
+## Working Sequence
+
+Use this order unless the task is obviously narrower:
+
+1. Read the nearest provider package and the shared integration primitives.
+2. Decide which public surface is being patched: constructor, top-level function, client method, stream method, or manual `wrap_*()` helper.
+3. Decide what the span should look like before writing patchers:
+   - what belongs in `input`
+   - what belongs in `output`
+   - what belongs in `metadata`
+   - what belongs in `metrics`
+4. Implement or update patchers.
+5. Implement or update tracing helpers.
+6. Add or update focused tests in the provider package.
+7. Run the narrowest nox session first, then expand only if shared code changed.
+
+Do not start by wiring patchers and only later asking what the logged span should contain. The traced shape should drive the tracing helper design from the start.
 
 ## Route The Task
 
@@ -66,6 +92,7 @@ Read when relevant:
 2. Change only the affected patchers, tracing helpers, exports, tests, and cassettes.
 3. Preserve the provider's public setup and `wrap_*()` surface unless the task explicitly changes it.
 4. Keep repo-level changes narrow; do not touch `auto.py`, `integrations/__init__.py`, or `py/noxfile.py` unless the task actually requires it.
+5. Preserve existing span shape conventions unless the task is intentionally improving or correcting them.
 
 ### `auto_instrument()` only
 
@@ -102,6 +129,77 @@ Keep span creation, metadata extraction, stream aggregation, error logging, and
 
 Preserve provider behavior. Do not let tracing-only code change provider return values, control flow, or error behavior except where the task explicitly requires it.
 
+Generate structured spans. Do not pass raw `args` and `kwargs` straight into traced spans unless the provider API already exposes the exact stable schema you want to log. Instead:
+
+- Build a provider-shaped `input` object that names the important request fields explicitly, for example model, messages/contents, prompt, config, tools, or options.
+- Build an `output` object that captures the useful response payload in normalized form instead of logging opaque SDK objects.
+- Put secondary facts in `metadata`, such as provider ids, finish reasons, model versions, safety attributes, or normalized request/response annotations that are useful but not the primary payload.
+- Put timings and token/accounting values in `metrics`, such as `start`, `end`, `duration`, `time_to_first_token`, `prompt_tokens`, `completion_tokens`, and `tokens`.
+- Drop noisy transport-level or duplicate fields rather than mirroring the full raw call surface.
+- Add small provider-local helpers in `tracing.py` to extract `input`, `output`, `metadata`, and `metrics` before opening or closing spans.
+
+Aim for spans that are readable in the UI without requiring someone to reverse-engineer the provider SDK's calling convention from positional arguments.
+
+Shape spans by semantics, not by the provider SDK object model:
+
+- `input` is the meaningful request a human would describe, not the raw Python call signature.
+- `output` is the meaningful result, not a provider response class dumped wholesale.
+- `metadata` is for supporting context that helps interpretation but is not the main payload.
+- `metrics` is for timings, token counts, and similar numeric accounting.
+
+Good span shaping usually means:
+
+- flattening positional arguments into named fields
+- omitting duplicate values that appear in both request and response objects
+- normalizing provider-specific classes into dicts/lists/scalars
+- aggregating streaming chunks into one final `output` plus stream-specific `metrics`
+- preserving useful provider identifiers without leaking transport noise
+
+Use provider-local helper functions instead of building spans inline inside wrappers. A good pattern is:
+
+```python
+def _prepare_traced_call(args: list[Any], kwargs: dict[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:
+    ...
+
+
+def _process_result(result: Any, start: float) -> tuple[dict[str, Any], dict[str, Any], dict[str, Any]]:
+    ...
+```
+
+Keep wrapper bodies thin: prepare input, open the span, call the provider, normalize the result, then log `output`, `metadata`, and `metrics`.
+
+When deciding where a field belongs:
+
+- Put it in `input` if the caller intentionally supplied it.
+- Put it in `output` if it is the core result a user would care about.
+- Put it in `metadata` if it explains the result but is not the result itself.
+- Put it in `metrics` if it is numeric operational accounting or timing.
+- Put it in `error` if the call failed and you want the span to record the exception or failure message instead of pretending the failure is ordinary output.
+
+Distinguish span payload fields from span setup fields:
+
+- Treat `input`, `output`, `metadata`, `metrics`, and `error` as the main logged payload fields.
+- Treat `name` plus `type` or `span_attributes` as span identity/classification, not as payload.
+- Use `parent` only when you need to attach the span to an explicit exported parent instead of relying on current-span context.
+- Use `start_time` when the true start happened before the wrapper got control and you need accurate duration or time-to-first-token accounting.
+
+Examples:
+
+- prompt/model/tools/config belong in `input`
+- generated text, tool calls, embeddings summary, generated images summary, or normalized message content belong in `output`
+- provider request ids, finish reasons, safety annotations, cached-hit indicators, or model revision identifiers belong in `metadata`
+- token counts, elapsed time, time-to-first-token, retry counts, or billable character counts belong in `metrics`
+- exceptions, provider errors, and wrapper failures belong in `error`
+
+Across the current integrations, `input`/`output`/`metadata`/`metrics` are the common structured logging fields, and `error` is the main additional event field used during failures. Other values should usually live inside one of those containers unless they are truly span-level controls like `name`, `type`, `span_attributes`, `parent`, or `start_time`.
+
+Treat provider-owned binary payloads as attachments, not raw logged bytes. When traced input or output contains inline media, generated files, or other uploadable content:
+
+- Convert raw `bytes` into `braintrust.logger.Attachment` objects in provider-local tracing helpers instead of logging raw bytes or large base64 blobs.
+- Use the repo's existing message/content shapes when embedding attachments in traced payloads. For multimodal content this is often `{"image_url": {"url": attachment}}`, even when the MIME type is not literally an image.
+- Preserve ordinary remote URLs as strings. Only convert provider-owned binary content or data-URL style payloads that Braintrust should upload to object storage.
+- Keep structured metadata alongside the attachment, such as MIME type, size, safety attributes, or provider ids, so spans stay inspectable without reading the blob.
+
 Prefer feature detection first and version checks second. Use:
 
 - `detect_module_version(...)`
@@ -110,6 +208,8 @@ Prefer feature detection first and version checks second. Use:
 
 Let `BaseIntegration.resolve_patchers()` reject duplicate patcher ids; do not silently paper over duplicates.
 
+If a provider surface has both sync and async variants, try to keep the traced schema aligned across both paths. Differences in implementation are fine; differences in logged shape should be intentional.
+
 ## Patcher Rules
 
 Create one patcher per coherent patch target. Split unrelated targets into separate patchers.
@@ -154,6 +254,8 @@ Use `@pytest.mark.vcr` for real provider network behavior. Prefer recorded provi
 - local version-routing logic
 - patcher existence checks
 
+Write tests against the emitted span shape, not just the provider return value. A tracing change is incomplete if the provider call still works but the logged span becomes noisy, incomplete, or inconsistent.
+
 Cover the surfaces that changed:
 
 - direct `wrap_*()` behavior
@@ -164,9 +266,26 @@ Cover the surfaces that changed:
 - idempotence
 - failure and error logging
 - patcher resolution and duplicate detection
+- attachment conversion for binary inputs or generated media, including assertions that traced payloads contain `Attachment` objects rather than raw bytes
+- span structure, including assertions that the traced span exposes meaningful `input`, `output`, `metadata`, and `metrics` rather than opaque raw call arguments
+
+For span assertions, prefer checking the specific normalized fields that matter:
+
+- the `input` contains the expected model/messages/prompt/config fields
+- the `output` contains normalized provider results rather than opaque SDK instances
+- the `metadata` contains finish reasons, ids, or annotations in the expected place
+- the `metrics` contain the expected timing or token fields when the provider returns them
+- binary payloads are represented as `Attachment` objects where applicable
+
+If a change affects streaming, verify both:
+
+- intermediate behavior still returns the provider's expected iterator or async iterator
+- the final logged span contains the aggregated `output` and stream-specific `metrics`
 
 Keep VCR cassettes in `py/src/braintrust/integrations/<provider>/cassettes/`. Re-record only when the behavior change is intentional.
 
+When the provider returns binary HTTP responses or generated media, make cassette sanitization part of the change if needed so recorded fixtures do not store raw file bytes.
+
 When choosing commands, confirm the real session name in `py/noxfile.py` instead of assuming it matches the provider folder. Examples in this repo include `test_agno`, `test_anthropic`, and `test_google_adk`.
 
 ## Commands

diff --git a/py/src/braintrust/integrations/google_genai/cassettes/test_generate_images.yaml b/py/src/braintrust/integrations/google_genai/cassettes/test_generate_images.yaml
@@ -0,0 +1,57 @@
+interactions:
+- request:
+    body: '{"instances": [{"prompt": "A watercolor fox in a forest"}], "parameters":
+      {"sampleCount": 1, "aspectRatio": "1:1", "safetySetting": "BLOCK_LOW_AND_ABOVE",
+      "includeRaiReason": true}}'
+    headers:
+      Accept:
+      - '*/*'
+      Accept-Encoding:
+      - gzip, deflate, zstd
+      Connection:
+      - keep-alive
+      Content-Length:
+      - '181'
+      Content-Type:
+      - application/json
+      Host:
+      - generativelanguage.googleapis.com
+      user-agent:
+      - google-genai-sdk/1.66.0 gl-python/3.13.3
+      x-goog-api-client:
+      - google-genai-sdk/1.66.0 gl-python/3.13.3
+    method: POST
+    uri: https://generativelanguage.googleapis.com/v1beta/models/imagen-4.0-fast-generate-001:predict
+  response:
+    body:
+      string: '{"predictions": [{"bytesBase64Encoded": "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8z8DwHwAFBQIAX8jx0gAAAABJRU5ErkJggg==",
+        "mimeType": "image/png"}]}'
+    headers:
+      Alt-Svc:
+      - h3=":443"; ma=2592000,h3-29=":443"; ma=2592000
+      Content-Type:
+      - application/json; charset=UTF-8
+      Date:
+      - Fri, 27 Mar 2026 01:18:45 GMT
+      Server:
+      - scaffolding on HTTPServer2
+      Server-Timing:
+      - gfet4t7; dur=4159
+      Transfer-Encoding:
+      - chunked
+      Vary:
+      - Origin
+      - X-Origin
+      - Referer
+      X-Content-Type-Options:
+      - nosniff
+      X-Frame-Options:
+      - SAMEORIGIN
+      X-XSS-Protection:
+      - '0'
+      content-length:
+      - '2622509'
+    status:
+      code: 200
+      message: OK
+version: 1
diff --git a/py/src/braintrust/integrations/google_genai/cassettes/test_generate_images_async.yaml b/py/src/braintrust/integrations/google_genai/cassettes/test_generate_images_async.yaml
@@ -0,0 +1,47 @@
+interactions:
+- request:
+    body: '{"instances": [{"prompt": "A watercolor fox in a forest"}], "parameters":
+      {"sampleCount": 1, "aspectRatio": "1:1", "safetySetting": "BLOCK_LOW_AND_ABOVE",
+      "includeRaiReason": true}}'
+    headers:
+      Content-Type:
+      - application/json
+      user-agent:
+      - google-genai-sdk/1.66.0 gl-python/3.13.3
+      x-goog-api-client:
+      - google-genai-sdk/1.66.0 gl-python/3.13.3
+    method: POST
+    uri: https://generativelanguage.googleapis.com/v1beta/models/imagen-4.0-fast-generate-001:predict
+  response:
+    body:
+      string: '{"predictions": [{"bytesBase64Encoded": "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8z8DwHwAFBQIAX8jx0gAAAABJRU5ErkJggg==",
+        "mimeType": "image/png"}]}'
+    headers:
+      Alt-Svc:
+      - h3=":443"; ma=2592000,h3-29=":443"; ma=2592000
+      Content-Type:
+      - application/json; charset=UTF-8
+      Date:
+      - Fri, 27 Mar 2026 01:18:49 GMT
+      Server:
+      - scaffolding on HTTPServer2
+      Server-Timing:
+      - gfet4t7; dur=3224
+      Transfer-Encoding:
+      - chunked
+      Vary:
+      - Origin
+      - X-Origin
+      - Referer
+      X-Content-Type-Options:
+      - nosniff
+      X-Frame-Options:
+      - SAMEORIGIN
+      X-XSS-Protection:
+      - '0'
+      content-length:
+      - '2750553'
+    status:
+      code: 200
+      message: OK
+version: 1
diff --git a/py/src/braintrust/integrations/google_genai/integration.py b/py/src/braintrust/integrations/google_genai/integration.py
@@ -8,9 +8,11 @@
     AsyncModelsEmbedContentPatcher,
     AsyncModelsGenerateContentPatcher,
     AsyncModelsGenerateContentStreamPatcher,
+    AsyncModelsGenerateImagesPatcher,
     ModelsEmbedContentPatcher,
     ModelsGenerateContentPatcher,
     ModelsGenerateContentStreamPatcher,
+    ModelsGenerateImagesPatcher,
 )
 
 
@@ -26,7 +28,9 @@ class GoogleGenAIIntegration(BaseIntegration):
         ModelsGenerateContentPatcher,
         ModelsGenerateContentStreamPatcher,
         ModelsEmbedContentPatcher,
+        ModelsGenerateImagesPatcher,
         AsyncModelsGenerateContentPatcher,
         AsyncModelsGenerateContentStreamPatcher,
         AsyncModelsEmbedContentPatcher,
+        AsyncModelsGenerateImagesPatcher,
     )
diff --git a/py/src/braintrust/integrations/google_genai/patchers.py b/py/src/braintrust/integrations/google_genai/patchers.py
@@ -6,9 +6,11 @@
     _async_embed_content_wrapper,
     _async_generate_content_stream_wrapper,
     _async_generate_content_wrapper,
+    _async_generate_images_wrapper,
     _embed_content_wrapper,
     _generate_content_stream_wrapper,
     _generate_content_wrapper,
+    _generate_images_wrapper,
 )
 
 
@@ -44,6 +46,15 @@ class ModelsEmbedContentPatcher(FunctionWrapperPatcher):
     wrapper = _embed_content_wrapper
 
 
+class ModelsGenerateImagesPatcher(FunctionWrapperPatcher):
+    """Patch ``Models.generate_images`` for tracing."""
+
+    name = "google_genai.models.generate_images"
+    target_module = "google.genai.models"
+    target_path = "Models.generate_images"
+    wrapper = _generate_images_wrapper
+
+
 # ---------------------------------------------------------------------------
 # Async Models patchers
 # ---------------------------------------------------------------------------
@@ -74,3 +85,12 @@ class AsyncModelsEmbedContentPatcher(FunctionWrapperPatcher):
     target_module = "google.genai.models"
     target_path = "AsyncModels.embed_content"
     wrapper = _async_embed_content_wrapper
+
+
+class AsyncModelsGenerateImagesPatcher(FunctionWrapperPatcher):
+    """Patch ``AsyncModels.generate_images`` for tracing."""
+
+    name = "google_genai.async_models.generate_images"
+    target_module = "google.genai.models"
+    target_path = "AsyncModels.generate_images"
+    wrapper = _async_generate_images_wrapper