OpenTelemetry metrics and span events for federation health #619

@dahlia

Summary

Fedify already instruments various operations as OpenTelemetry spans via the tracerProvider option. This issue proposes extending that instrumentation with OpenTelemetry metrics and span events, focusing on federation health data that is difficult or impossible to express as spans alone.

Problem

Spans are well-suited for capturing the lifecycle of individual operations, but they are not the right primitive for answering questions like:

  • How many activities have failed to deliver to a given remote server over the past hour?
  • What is the p99 delivery latency to a specific instance?
  • How often are incoming activities rejected due to signature verification failures, and for what reasons?

Currently, none of this data is surfaced in a way that integrates with standard observability stacks. Users who want to monitor federation health have to instrument their own application code, and even then they cannot easily observe what Fedify is doing internally.

Proposed solution

meterProvider option

Add a meterProvider option to createFederation(), following the same pattern as the existing tracerProvider option. When omitted, Fedify falls back to the global default MeterProvider.

import { createFederation } from "@fedify/fedify";
import { metrics } from "@opentelemetry/api";

const federation = createFederation<void>({
  kv: ...,
  queue: ...,
  tracerProvider: ...,  // existing
  meterProvider: metrics.getMeterProvider(),  // new
});
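
The fallback behavior could be resolved roughly as follows. This is a sketch, not Fedify's implementation: MeterLike, MeterProviderLike, FederationTelemetryOptions, and resolveMeterProvider are hypothetical names, and the global provider is passed in explicitly (real code would call metrics.getMeterProvider() from @opentelemetry/api) so the snippet runs on its own.

```typescript
// Minimal local stand-ins for the OpenTelemetry API types (assumptions for
// illustration; real code would import them from "@opentelemetry/api").
interface MeterLike {
  readonly name: string;
}
interface MeterProviderLike {
  getMeter(name: string, version?: string): MeterLike;
}

// Hypothetical subset of the createFederation() options.
interface FederationTelemetryOptions {
  meterProvider?: MeterProviderLike;
}

// Prefer the explicitly configured provider; otherwise fall back to the
// global default that the caller passes in.
function resolveMeterProvider(
  options: FederationTelemetryOptions,
  globalProvider: MeterProviderLike,
): MeterProviderLike {
  return options.meterProvider ?? globalProvider;
}

// Two fake providers that tag the meters they hand out.
const globalProvider: MeterProviderLike = {
  getMeter: (name) => ({ name: `global:${name}` }),
};
const customProvider: MeterProviderLike = {
  getMeter: (name) => ({ name: `custom:${name}` }),
};

const fromDefault = resolveMeterProvider({}, globalProvider)
  .getMeter("@fedify/fedify");
const fromOption = resolveMeterProvider(
  { meterProvider: customProvider },
  globalProvider,
).getMeter("@fedify/fedify");
```

With no option set, the meter comes from the global provider; with the option set, the configured provider wins.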

Metrics to instrument

Counters

  • activitypub.delivery.sent — incremented on each outbound delivery attempt, with attributes:
    • activitypub.remote.host — hostname of the recipient inbox
    • activitypub.activity.type — e.g. Create, Follow, Undo
    • activitypub.delivery.success — boolean
  • activitypub.delivery.permanent_failure — incremented when delivery is abandoned as a permanent failure, with attributes:
    • activitypub.remote.host
    • http.response.status_code
  • activitypub.signature.verification_failure — incremented when an incoming activity fails signature verification, with attributes:
    • activitypub.remote.host
    • activitypub.verification.failure_reason — maps to UnverifiedActivityReason variants
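
As a rough illustration of how the counters above might be recorded, the sketch below uses a local CounterLike stand-in whose add(value, attributes) call mirrors the OpenTelemetry Counter interface. RecordingCounter, recordDeliveryAttempt, and recordPermanentFailure are hypothetical helpers, not part of Fedify.

```typescript
// Local stand-in for the OpenTelemetry Counter interface (assumption; real
// code would obtain counters from meter.createCounter()).
type Attributes = Record<string, string | number | boolean>;
interface CounterLike {
  add(value: number, attributes?: Attributes): void;
}

// In-memory counter used only to make this sketch observable.
class RecordingCounter implements CounterLike {
  readonly points: { value: number; attributes: Attributes }[] = [];
  add(value: number, attributes: Attributes = {}): void {
    this.points.push({ value, attributes });
  }
}

// Hypothetical helper: count one outbound delivery attempt.
function recordDeliveryAttempt(
  counter: CounterLike,
  inbox: URL,
  activityType: string,
  success: boolean,
): void {
  counter.add(1, {
    "activitypub.remote.host": inbox.hostname,
    "activitypub.activity.type": activityType,
    "activitypub.delivery.success": success,
  });
}

// Hypothetical helper: count a delivery abandoned as a permanent failure.
function recordPermanentFailure(
  counter: CounterLike,
  inbox: URL,
  statusCode: number,
): void {
  counter.add(1, {
    "activitypub.remote.host": inbox.hostname,
    "http.response.status_code": statusCode,
  });
}

const sent = new RecordingCounter();
const abandoned = new RecordingCounter();
const inbox = new URL("https://remote.example/users/alice/inbox");

recordDeliveryAttempt(sent, inbox, "Create", true);
recordDeliveryAttempt(sent, inbox, "Follow", false);
recordPermanentFailure(abandoned, inbox, 410);
```

Each call adds 1 to the counter with the attribute set listed above, so a backend can aggregate by host, activity type, or outcome.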

Histograms

  • activitypub.delivery.duration — time elapsed from delivery attempt to response, in milliseconds, with attribute activitypub.remote.host
  • activitypub.inbox.processing_duration — time spent dispatching an incoming activity to its handler, in milliseconds, with attribute activitypub.activity.type
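
One way the delivery duration could be measured is sketched below. HistogramLike is a local stand-in whose record(value, attributes) call mirrors the OpenTelemetry Histogram interface; RecordingHistogram and startDeliveryTimer are hypothetical, and the injectable clock exists only so the example is deterministic.

```typescript
// Local stand-in for the OpenTelemetry Histogram interface (assumption; real
// code would obtain histograms from meter.createHistogram()).
type Attributes = Record<string, string | number | boolean>;
interface HistogramLike {
  record(value: number, attributes?: Attributes): void;
}

// In-memory histogram used only to make this sketch observable.
class RecordingHistogram implements HistogramLike {
  readonly points: { value: number; attributes: Attributes }[] = [];
  record(value: number, attributes: Attributes = {}): void {
    this.points.push({ value, attributes });
  }
}

// Hypothetical helper: start timing a delivery attempt and return a function
// that records the elapsed milliseconds once the response has arrived.
function startDeliveryTimer(
  histogram: HistogramLike,
  inbox: URL,
  now: () => number = () => performance.now(),
): () => void {
  const startedAt = now();
  return () => {
    histogram.record(now() - startedAt, {
      "activitypub.remote.host": inbox.hostname,
    });
  };
}

// Usage with a fake clock so the recorded value is predictable.
let fakeTime = 1000;
const durations = new RecordingHistogram();
const stopTimer = startDeliveryTimer(
  durations,
  new URL("https://remote.example/inbox"),
  () => fakeTime,
);
fakeTime += 250; // pretend the remote server took 250 ms to respond
stopTimer();
```

The same pattern would apply to activitypub.inbox.processing_duration, with the activity type as the attribute instead of the remote host.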

Span events

Circuit breaker state changes (tracked in a separate issue) will emit span events when implemented. However, even before that, delivery failures that occur within an existing outbox span should emit a span event with structured attributes (remote host, status code, attempt number) rather than just setting span status to error. This makes the failure information queryable without requiring a separate metrics backend.
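
A delivery failure event with structured attributes might be attached to the active span along these lines. SpanLike is a simplified local stand-in for the addEvent(name, attributes) method of the OpenTelemetry Span interface. The event name activitypub.delivery.failure and the attribute activitypub.delivery.attempt are assumptions made for illustration; this issue does not fix those names.

```typescript
// Simplified local stand-in for the part of the OpenTelemetry Span interface
// used here (assumption; real code would receive a Span from the tracer).
type Attributes = Record<string, string | number | boolean>;
interface SpanLike {
  addEvent(name: string, attributes?: Attributes): void;
}

// In-memory span used only to make this sketch observable.
class RecordingSpan implements SpanLike {
  readonly events: { name: string; attributes: Attributes }[] = [];
  addEvent(name: string, attributes: Attributes = {}): void {
    this.events.push({ name, attributes });
  }
}

// Hypothetical description of a failed delivery attempt.
interface DeliveryFailure {
  inbox: URL;
  statusCode: number;
  attempt: number;
}

// Hypothetical helper: attach the failure to the span as a structured event.
// The event name and the attempt attribute name are illustrative only.
function recordDeliveryFailureEvent(
  span: SpanLike,
  failure: DeliveryFailure,
): void {
  span.addEvent("activitypub.delivery.failure", {
    "activitypub.remote.host": failure.inbox.hostname,
    "http.response.status_code": failure.statusCode,
    "activitypub.delivery.attempt": failure.attempt,
  });
}

const outboxSpan = new RecordingSpan();
recordDeliveryFailureEvent(outboxSpan, {
  inbox: new URL("https://remote.example/inbox"),
  statusCode: 503,
  attempt: 2,
});
```

Because the attributes are structured, a tracing backend can filter failed deliveries by host or status code without parsing an error message.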

Cardinality considerations

The activitypub.remote.host attribute is restricted to the hostname only (no path, no port), so it takes at most one value per remote server rather than one per inbox URL. This keeps cardinality manageable even for servers with large follower bases spread across many instances.
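
To make the effect concrete, here is a minimal sketch of deriving the attribute value with the standard URL API. remoteHostAttribute is a hypothetical helper name and the inbox URLs are made up.

```typescript
// Hypothetical helper: reduce an inbox URL to the value used for the
// activitypub.remote.host attribute (hostname only, no port or path).
function remoteHostAttribute(inbox: string | URL): string {
  return new URL(inbox).hostname;
}

// Several inboxes on the same server, including one on a non-default port.
const inboxes = [
  "https://remote.example/users/alice/inbox",
  "https://remote.example/users/bob/inbox",
  "https://remote.example:8443/inbox",
  "https://other.example/inbox",
];

// All inboxes on one server collapse to a single attribute value.
const distinctHosts = new Set(inboxes.map(remoteHostAttribute));
```

Four inbox URLs yield two attribute values here, one per remote server.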

Why not callbacks?

An alternative would be to add onDeliveryFailure, onVerificationFailure, etc. callbacks to createFederation(). This would work, but it creates a parallel observability mechanism that sits outside the OpenTelemetry pipeline, making it harder to correlate with existing span data. The OpenTelemetry metrics approach means users who already have a Prometheus or OTLP backend get federation health data for free, without wiring up additional callbacks.

Relation to other issues

A circuit breaker feature (to be filed separately) would build on these metrics—both for its own internal state tracking and to give users visibility into when and why circuit breakers are opening. The metrics introduced here are a prerequisite for that work.

Scope

Changes are limited to @fedify/fedify. No changes to @fedify/cli or other packages are required.
