From 1454e78eccaef21979c6c554b2df9f2aaf1a1afd Mon Sep 17 00:00:00 2001 From: Gregory Cooke Date: Mon, 18 May 2026 15:40:08 +0000 Subject: [PATCH 01/13] initial commit --- A118-authentication-telemetry.md | 48 ++++++++++++++++++++++++++++++++ 1 file changed, 48 insertions(+) create mode 100644 A118-authentication-telemetry.md diff --git a/A118-authentication-telemetry.md b/A118-authentication-telemetry.md new file mode 100644 index 000000000..24c3edfed --- /dev/null +++ b/A118-authentication-telemetry.md @@ -0,0 +1,48 @@ +Title +---- +* Author(s): [Author Name, Co-Author Name ...] +* Approver: markdroth +* Implemented in: +* Last updated: [YYYY-MM-DD] +* Discussion at: (filled after thread exists) + +## Abstract + +[A short summary of the proposal.] + +## Background + +[An introduction of the necessary background and the problem being solved by the +proposed change.] + +### Related Proposals: +* A list of proposals this proposal builds on or supersedes. + +## Proposal + +[A precise statement of the proposed change.] + +### Temporary environment variable protection + +[Name the environment variable(s) used to enable/disable the feature(s) this +proposal introduces and their default(s). Generally, features that are enabled +by I/O should include this type of control until they have passed some testing +criteria, which should also be detailed here. This section may be omitted if +there are none.] + +## Rationale + +[A discussion of alternate approaches and the trade offs, advantages, and +disadvantages of the specified approach.] + + +## Implementation + +[A description of the steps in the implementation, who will do them, and when. +If a particular language is going to get the implementation first, this section +should list the proposed order.] + +## Open issues (if applicable) + +[A discussion of issues relating to this proposal for which the author does not +know the solution. This section may be omitted if there are none.] 
From c134e60d08b809f19e6a8df930f8dd38b2b1bf99 Mon Sep 17 00:00:00 2001 From: Gregory Cooke Date: Mon, 18 May 2026 16:01:55 +0000 Subject: [PATCH 02/13] Add content to template. --- A118-authentication-telemetry.md | 252 +++++++++++++++++++++++++++---- 1 file changed, 226 insertions(+), 26 deletions(-) diff --git a/A118-authentication-telemetry.md b/A118-authentication-telemetry.md index 24c3edfed..f4f7517cd 100644 --- a/A118-authentication-telemetry.md +++ b/A118-authentication-telemetry.md @@ -1,48 +1,248 @@ -Title +A118: Authentication Telemetry ---- -* Author(s): [Author Name, Co-Author Name ...] -* Approver: markdroth -* Implemented in: -* Last updated: [YYYY-MM-DD] +* Author(s): @gtcooke94 +* Approver: @dfawley, @easwars, @ejona86, @mattstev, @markdroth +* Status: In Review +* Implemented in: +* Last updated: 2026-05-18 * Discussion at: (filled after thread exists) ## Abstract -[A short summary of the proposal.] +gRPC's authentication stack has no telemetry. This document details adding +non-per-call metrics to the TLS authentication stack. ## Background -[An introduction of the necessary background and the problem being solved by the -proposed change.] +gRPC's authentication stack currently lacks telemetry. This document outlines +the addition of authentication-related non-per-call metrics, leveraging the +existing telemetry infrastructure defined in +[A66](https://github.com/grpc/proposal/blob/master/A66-otel-stats.md) and the +specific architecture for non-per-call metrics established in +[A79](https://github.com/grpc/proposal/blob/master/A79-non-per-call-metrics-architecture.md). +Because authentication and handshakers are connection-level abstractions, they +inherently require non-per-call instrumentation. Currently, application owners +face challenges in diagnosing handshake failures, relying on verbose logging +without a structured mechanism for aggregation and analysis. 
### Related Proposals: -* A list of proposals this proposal builds on or supersedes. +* [A66: OpenTelemetry Metrics/Stats](https://github.com/grpc/proposal/blob/master/A66-otel-stats.md) +* [A79: OpenTelemetry Non-Per-Call Metrics Architecture](https://github.com/grpc/proposal/blob/master/A79-non-per-call-metrics-architecture.md) ## Proposal -[A precise statement of the proposed change.] +All metrics will be scoped to TLS exclusively. + +### TLS Telemetry Status Enum + +The handshaker status will be represented by an enum that indicates success or +provides information on why the handshake failed. This value must manage a +balance of low-cardinality while being fine-grained enough to be useful; +therefore, an enum containing subdomains of authentication errors will be +created. This is presented as a C++ enum below, but will be identical in all +languages. In cases where we cannot categorize an error or cannot get enough +granularity in a given implementation and/or language, `UNKNOWN_FAILURE` will be +the catch-all error code. + +```c++ +enum class TlsTelemetryStatus { + UNKNOWN_FAILURE, + SUCCESS, + // Peer certificate verification failures. + CERTIFICATE_VERIFICATION_FAILED, + CERTIFICATE_REVOKED, + CERTIFICATE_EXPIRED, + CERTIFICATE_NOT_YET_VALID, + CERTIFICATE_AUTHORITY_INVALID, + // TLS negotiation mismatch failures + CERTIFICATE_HOSTNAME_MISMATCH, + CERTIFICATE_MALFORMED, + CIPHER_SUITE_MISMATCH, + PROTOCOL_VERSION_UNSUPPORTED, + INAPPROPRIATE_FALLBACK, + NO_APPLICATION_PROTOCOL, + // Cryptographic failures + SIGNATURE_VERIFICATION_FAILED, + DECRYPTION_FAILED, + KEY_EXCHANGE_FAILURE, + // Other failures + UNEXPECTED_MESSAGE, + HANDSHAKE_TIMEOUT, + PEER_CONNECTION_CLOSED +}; +``` + +### TLS Resumption Type Enum + +Further, whether the handshake is resumed or not is also critical for +understanding authentication behavior. 
An enum describing the type of resumption +used (or none) will be created - it can be extended in the future to make +finer-grained distinctions between the type of resumption that is used (e.g., +ticket-based resumption vs. session-based resumption). This is presented as a +C++ enum below, but will be identical in all languages. + +```c++ +enum class TlsResumptionType { + FULL_HANDSHAKE, + RESUMED_HANDSHAKE, +}; +``` + +### Metrics Definitions + +The following metrics are count metrics, with the primary information coming +from the labels. + +* `grpc.client.tls.handshakes` + +| Label Name | Required/Optional | Description | +| :--- | :--- | :--- | +| `grpc.tls.handshake.status` | Required | The `TlsTelemetryStatus` enum indicating success or the reason for handshake failure | +| `grpc.target` | Required | The target string (as defined in A66) passed to the channel. | +| `grpc.tls.handshake.resumed` | Optional | The `TlsResumptionType` enum | + +* `grpc.server.tls.handshakes` + +| Label Name | Required/Optional | Description | +| :--- | :--- | :--- | +| `grpc.tls.handshake.status` | Required | The `TlsTelemetryStatus` enum indicating success or the reason for handshake failure | +| `grpc.tls.handshake.resumed` | Optional | The `TlsResumptionType` enum | + +### TLS Offload Specific Metrics + +The following metrics are non-per-call bucketed latency metrics that report the duration of offloaded cryptographic operations. + +* `grpc.client.tls.offload_certificate_selection_duration` (unit: float64, type: histogram - latency buckets defined in A66) + +| Label Name | Required/Optional | Description | +| :--- | :--- | :--- | +| `grpc.status` | Required | Result of the certificate selection offloading, in the format of a gRPC status code (as defined in A66). | +| `grpc.target` | Required | The target string (as defined in A66) passed to the channel. 
| + +* `grpc.server.tls.offload_certificate_selection_duration` (unit: float64, type: histogram - latency buckets defined in A66) + +| Label Name | Required/Optional | Description | +| :--- | :--- | :--- | +| `grpc.status` | Required | Result of the certificate selection offloading, in the format of a gRPC status code (as defined in A66). | + +* `grpc.client.tls.offload_private_key_operation_duration` (unit: float64, type: histogram - latency buckets defined in A66) + +| Label Name | Required/Optional | Description | +| :--- | :--- | :--- | +| `grpc.status` | Required | Result of the offloaded private key operation, in the format of a gRPC status code (as defined in A66). | +| `grpc.target` | Required | The target string (as defined in A66) passed to the channel. | +| `grpc.tls.private_key_algorithm` | Optional | An algorithm enum indicating how the offloaded private key operation was done, e.g. “RsaPkcs1Sha256”. | + +* `grpc.server.tls.offload_private_key_operation_duration` (unit: float64, type: histogram - latency buckets defined in A66) + +| Label Name | Required/Optional | Description | +| :--- | :--- | :--- | +| `grpc.status` | Required | Result of the offloaded private key operation, in the format of a gRPC status code (as defined in A66). | +| `grpc.tls.private_key_algorithm` | Optional | An algorithm enum indicating how the offloaded private key operation was done, e.g. “RsaPkcs1Sha256”. | ### Temporary environment variable protection -[Name the environment variable(s) used to enable/disable the feature(s) this -proposal introduces and their default(s). Generally, features that are enabled -by I/O should include this type of control until they have passed some testing -criteria, which should also be detailed here. This section may be omitted if -there are none.] +This feature will be explicitly configured by users, thus no environment +variable protection is needed. 
If a user does not configure TLS metrics or +offloading, telemetry won't be collected under this mechanism unless general +telemetry/stats plugins are active. ## Rationale -[A discussion of alternate approaches and the trade offs, advantages, and -disadvantages of the specified approach.] - +The alternative of splitting generic handshaker metrics and TLS-specific metrics +was considered, but opened many rabbit holes as to what would qualify as a +handshaker (e.g. are HTTP Connect, TCP, RPC Switch, etc. all in scope, or just +alternative protocols to TLS such as ALTS). Thus, the decision was made to scope +the metrics to TLS. ## Implementation -[A description of the steps in the implementation, who will do them, and when. -If a particular language is going to get the implementation first, this section -should list the proposed order.] - -## Open issues (if applicable) - -[A discussion of issues relating to this proposal for which the author does not -know the solution. This section may be omitted if there are none.] +### C++ + +In the C++ implementation, the transport security interface (TSI) has +historically been decoupled from gRPC. However, this design choice has been +broken over time, and the two have been coupled for years now. The reasons +behind decoupling TSI and gRPC are no longer relevant, therefore we will fully +accept this coupling. Thus, the TSI code can contain gRPC monitoring specifics. +In the few use-cases where TSI is not called via gRPC, we will ensure that +metric incrementation is not performed. + +We will add the Channel's `StatsPluginGroup` as an optional argument to TSI +handshaker creation functions. This ensures that we don't break any existing +users of TSI and that we never increment metrics when TSI is used outside of +gRPC. When TSI is called from gRPC, we will pass this argument. This will be +stored on the handshaker, and in `ssl_transport_security.cc` we will access this +from the handshaker to increment metrics. 
+ +```diff + tsi_result tsi_ssl_client_handshaker_factory_create_handshaker( + tsi_ssl_client_handshaker_factory* factory, + const char* server_name_indication, size_t network_bio_buf_size, + size_t ssl_bio_buf_size, + std::optional alpn_preferred_protocol_list, +- tsi_handshaker** handshaker); ++ tsi_handshaker** handshaker, ++ std::shared_ptr stats_plugin_group = nullptr); + +``` + +```diff + tsi_result tsi_ssl_server_handshaker_factory_create_handshaker( + tsi_ssl_server_handshaker_factory* factory, size_t network_bio_buf_size, +- size_t ssl_bio_buf_size, tsi_handshaker** handshaker); ++ size_t ssl_bio_buf_size, tsi_handshaker** handshaker, ++ std::shared_ptr stats_plugin_group = nullptr); +``` +### Golang + +For the implementation of the general handshake metric, we leverage gRPC-Go's existing transport-level architecture. + +The abstraction layer for general handshake information is the Transport +Connection Layer (`internal/transport`). Client Handshakes are performed in a +single place: `internal/transport/http2_client.go` inside `NewHTTP2Client` via +`transportCreds.ClientHandshake`. Server Handshakes are also performed in a +single place: `internal/transport/http2_server.go` inside `NewServerTransport` +via `config.Credentials.ServerHandshake`. We can increment the metrics here +only in the case where TLS is the protocol being used. 
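One way the Go transport could derive the `grpc.tls.handshake.status` label from the error returned by `ClientHandshake` or `ServerHandshake` is sketched below. This is a minimal illustration rather than part of the proposal: `classifyHandshakeError` and `sampleStatuses` are hypothetical names, the mapping covers only a handful of `TlsTelemetryStatus` values, and it assumes the handshake error wraps the standard `crypto/x509` error types so that `errors.As` can find them.

```go
package main

import (
	"context"
	"crypto/x509"
	"errors"
	"fmt"
	"io"
	"time"
)

// classifyHandshakeError maps a handshake error to the string form of a
// TlsTelemetryStatus value. Hypothetical helper: neither the name nor the
// exact mapping comes from the proposal.
func classifyHandshakeError(err error, now time.Time) string {
	if err == nil {
		return "SUCCESS"
	}
	var invalid x509.CertificateInvalidError
	if errors.As(err, &invalid) {
		switch invalid.Reason {
		case x509.Expired:
			// crypto/x509 reports both "expired" and "not yet valid" as
			// x509.Expired, so the validity window decides which one it was.
			if invalid.Cert != nil && now.Before(invalid.Cert.NotBefore) {
				return "CERTIFICATE_NOT_YET_VALID"
			}
			return "CERTIFICATE_EXPIRED"
		default:
			return "CERTIFICATE_VERIFICATION_FAILED"
		}
	}
	var hostname x509.HostnameError
	if errors.As(err, &hostname) {
		return "CERTIFICATE_HOSTNAME_MISMATCH"
	}
	var unknownCA x509.UnknownAuthorityError
	if errors.As(err, &unknownCA) {
		return "CERTIFICATE_AUTHORITY_INVALID"
	}
	if errors.Is(err, context.DeadlineExceeded) {
		return "HANDSHAKE_TIMEOUT"
	}
	if errors.Is(err, io.EOF) {
		return "PEER_CONNECTION_CLOSED"
	}
	// Anything that cannot be categorized uses the catch-all value.
	return "UNKNOWN_FAILURE"
}

// sampleStatuses classifies a few representative errors for demonstration.
func sampleStatuses() []string {
	now := time.Now()
	errs := []error{
		nil,
		fmt.Errorf("handshake: %w", x509.CertificateInvalidError{Reason: x509.Expired}),
		fmt.Errorf("handshake: %w", x509.UnknownAuthorityError{}),
		fmt.Errorf("handshake: %w", context.DeadlineExceeded),
		io.EOF,
		errors.New("unrecognized failure"),
	}
	out := make([]string, 0, len(errs))
	for _, err := range errs {
		out = append(out, classifyHandshakeError(err, now))
	}
	return out
}

func main() {
	for _, status := range sampleStatuses() {
		fmt.Println(status)
	}
}
```

Errors the classifier does not recognize fall through to `UNKNOWN_FAILURE`, which matches the catch-all behavior described for the status enum. A real implementation would likely need additional cases (for example TLS alerts) to reach the other enum values.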
+ +```diff +--- a/internal/transport/http2_client.go ++++ b/internal/transport/http2_client.go +@@ -294,7 +294,23 @@ + if transportCreds != nil { ++ isTLS := transportCreds.Info().SecurityProtocol == "tls" ++ var startTime time.Time ++ if isTLS { ++ startTime = time.Now() ++ } + conn, authInfo, err = transportCreds.ClientHandshake(connectCtx, addr.ServerName, conn) ++ if isTLS { ++ duration := time.Since(startTime).Seconds() ++ ++ } ++ + if err != nil { + return nil, connectionErrorf(isTemporary(err), err, "transport: authentication handshake failed: %v", err) + } + +``` + +To pass out resumption information, we will need to augment the [`TLSInfo`](https://github.com/grpc/grpc-go/blob/660208049b96ff6232e8c7212905b3e357b5bf42/credentials/tls.go#L41) to include [`ConnectionState.DidResume`](https://github.com/golang/go/blob/e62d3e6e897175a07aa44a7b2c7f99700072f22f/src/crypto/tls/common.go#L257). + +The TLS specific offload metrics will go with their implementation. This feature is not yet written in Go, so we cannot discuss specific metric implementation details. + +### Java + +In gRPC-Java, to support shading where core classes are relocated per-consumer, all metric instruments used in core/ or transport modules (like netty/) must be defined within the api/ module. + +Following the pattern established by `MIN_RTT_INSTRUMENT` in `InternalTcpMetrics.java`, we will: + +1. Define a new `InternalSecurityMetrics.java` class in the api/ module under the `io.grpc` package. +2. Expose `MetricRecorder` from `GrpcHttp2ConnectionHandler` in the netty/ module. +3. Instrument the Netty `ClientTlsHandler` and `ServerTlsHandler` in `ProtocolNegotiators.java` to track handshake duration and record it. + +The TLS specific offload metrics will go with their implementation. This feature is not yet written in Java, so we cannot discuss specific metric implementation details. 
+ +### Wrapped Languages + +Wrapped languages that support non-per-call metrics should be able to get this "for free" from the Core implementation. However, Python for example, does not currently support non-per-call metrics. From da198cbd299bf0d5d1721cd2883bc64274b888fe Mon Sep 17 00:00:00 2001 From: Gregory Cooke Date: Wed, 20 May 2026 14:27:49 +0000 Subject: [PATCH 03/13] change filename --- A118-tls-telemetry.md | 248 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 248 insertions(+) create mode 100644 A118-tls-telemetry.md diff --git a/A118-tls-telemetry.md b/A118-tls-telemetry.md new file mode 100644 index 000000000..f4f7517cd --- /dev/null +++ b/A118-tls-telemetry.md @@ -0,0 +1,248 @@ +A118: Authentication Telemetry +---- +* Author(s): @gtcooke94 +* Approver: @dfawley, @easwars, @ejona86, @mattstev, @markdroth +* Status: In Review +* Implemented in: +* Last updated: 2026-05-18 +* Discussion at: (filled after thread exists) + +## Abstract + +gRPC's authentication stack has no telemetry. This document details adding +non-per-call metrics to the TLS authentication stack. + +## Background + +gRPC's authentication stack currently lacks telemetry. This document outlines +the addition of authentication-related non-per-call metrics, leveraging the +existing telemetry infrastructure defined in +[A66](https://github.com/grpc/proposal/blob/master/A66-otel-stats.md) and the +specific architecture for non-per-call metrics established in +[A79](https://github.com/grpc/proposal/blob/master/A79-non-per-call-metrics-architecture.md). +Because authentication and handshakers are connection-level abstractions, they +inherently require non-per-call instrumentation. Currently, application owners +face challenges in diagnosing handshake failures, relying on verbose logging +without a structured mechanism for aggregation and analysis. 
+ +### Related Proposals: +* [A66: OpenTelemetry Metrics/Stats](https://github.com/grpc/proposal/blob/master/A66-otel-stats.md) +* [A79: OpenTelemetry Non-Per-Call Metrics Architecture](https://github.com/grpc/proposal/blob/master/A79-non-per-call-metrics-architecture.md) + +## Proposal + +All metrics will be scoped to TLS exclusively. + +### TLS Telemetry Status Enum + +The handshaker status will be represented by an enum that indicates success or +provides information on why the handshake failed. This value must manage a +balance of low-cardinality while being fine-grained enough to be useful; +therefore, an enum containing subdomains of authentication errors will be +created. This is presented as a C++ enum below, but will be identical in all +languages. In cases where we cannot categorize an error or cannot get enough +granularity in a given implementation and/or language, `UNKNOWN_FAILURE` will be +the catch-all error code. + +```c++ +enum class TlsTelemetryStatus { + UNKNOWN_FAILURE, + SUCCESS, + // Peer certificate verification failures. + CERTIFICATE_VERIFICATION_FAILED, + CERTIFICATE_REVOKED, + CERTIFICATE_EXPIRED, + CERTIFICATE_NOT_YET_VALID, + CERTIFICATE_AUTHORITY_INVALID, + // TLS negotiation mismatch failures + CERTIFICATE_HOSTNAME_MISMATCH, + CERTIFICATE_MALFORMED, + CIPHER_SUITE_MISMATCH, + PROTOCOL_VERSION_UNSUPPORTED, + INAPPROPRIATE_FALLBACK, + NO_APPLICATION_PROTOCOL, + // Cryptographic failures + SIGNATURE_VERIFICATION_FAILED, + DECRYPTION_FAILED, + KEY_EXCHANGE_FAILURE, + // Other failures + UNEXPECTED_MESSAGE, + HANDSHAKE_TIMEOUT, + PEER_CONNECTION_CLOSED +}; +``` + +### TLS Resumption Type Enum + +Further, whether the handshake is resumed or not is also critical for +understanding authentication behavior. An enum describing the type of resumption +used (or none) will be created - it can be extended in the future to make +finer-grained distinctions between the type of resumption that is used (e.g., +ticket-based resumption vs. 
session-based resumption). This is presented as a +C++ enum below, but will be identical in all languages. + +```c++ +enum class TlsResumptionType { + FULL_HANDSHAKE, + RESUMED_HANDSHAKE, +}; +``` + +### Metrics Definitions + +The following metrics are count metrics, with the primary information coming +from the labels. + +* `grpc.client.tls.handshakes` + +| Label Name | Required/Optional | Description | +| :--- | :--- | :--- | +| `grpc.tls.handshake.status` | Required | The `TlsTelemetryStatus` enum indicating success or the reason for handshake failure | +| `grpc.target` | Required | The target string (as defined in A66) passed to the channel. | +| `grpc.tls.handshake.resumed` | Optional | The `TlsResumptionType` enum | + +* `grpc.server.tls.handshakes` + +| Label Name | Required/Optional | Description | +| :--- | :--- | :--- | +| `grpc.tls.handshake.status` | Required | The `TlsTelemetryStatus` enum indicating success or the reason for handshake failure | +| `grpc.tls.handshake.resumed` | Optional | The `TlsResumptionType` enum | + +### TLS Offload Specific Metrics + +The following metrics are non-per-call bucketed latency metrics that report the duration of offloaded cryptographic operations. + +* `grpc.client.tls.offload_certificate_selection_duration` (unit: float64, type: histogram - latency buckets defined in A66) + +| Label Name | Required/Optional | Description | +| :--- | :--- | :--- | +| `grpc.status` | Required | Result of the certificate selection offloading, in the format of a gRPC status code (as defined in A66). | +| `grpc.target` | Required | The target string (as defined in A66) passed to the channel. | + +* `grpc.server.tls.offload_certificate_selection_duration` (unit: float64, type: histogram - latency buckets defined in A66) + +| Label Name | Required/Optional | Description | +| :--- | :--- | :--- | +| `grpc.status` | Required | Result of the certificate selection offloading, in the format of a gRPC status code (as defined in A66). 
| + +* `grpc.client.tls.offload_private_key_operation_duration` (unit: float64, type: histogram - latency buckets defined in A66) + +| Label Name | Required/Optional | Description | +| :--- | :--- | :--- | +| `grpc.status` | Required | Result of the offloaded private key operation, in the format of a gRPC status code (as defined in A66). | +| `grpc.target` | Required | The target string (as defined in A66) passed to the channel. | +| `grpc.tls.private_key_algorithm` | Optional | An algorithm enum indicating how the offloaded private key operation was done, e.g. “RsaPkcs1Sha256”. | + +* `grpc.server.tls.offload_private_key_operation_duration` (unit: float64, type: histogram - latency buckets defined in A66) + +| Label Name | Required/Optional | Description | +| :--- | :--- | :--- | +| `grpc.status` | Required | Result of the offloaded private key operation, in the format of a gRPC status code (as defined in A66). | +| `grpc.tls.private_key_algorithm` | Optional | An algorithm enum indicating how the offloaded private key operation was done, e.g. “RsaPkcs1Sha256”. | + +### Temporary environment variable protection + +This feature will be explicitly configured by users, thus no environment +variable protection is needed. If a user does not configure TLS metrics or +offloading, telemetry won't be collected under this mechanism unless general +telemetry/stats plugins are active. + +## Rationale + +The alternative of splitting generic handshaker metrics and TLS-specific metrics +was considered, but opened many rabbit holes as to what would qualify as a +handshaker (e.g. are HTTP Connect, TCP, RPC Switch, etc. all in scope, or just +alternative protocols to TLS such as ALTS). Thus, the decision was made to scope +the metrics to TLS. + +## Implementation + +### C++ + +In the C++ implementation, the transport security interface (TSI) has +historically been decoupled from gRPC. 
However, this design choice has been +broken over time, and the two have been coupled for years now. The reasons +behind decoupling TSI and gRPC are no longer relevant, therefore we will fully +accept this coupling. Thus, the TSI code can contain gRPC monitoring specifics. +In the few use-cases where TSI is not called via gRPC, we will ensure that +metric incrementation is not performed. + +We will add the Channel's `StatsPluginGroup` as an optional argument to TSI +handshaker creation functions. This ensures that we don't break any existing +users of TSI and that we never increment metrics when TSI is used outside of +gRPC. When TSI is called from gRPC, we will pass this argument. This will be +stored on the handshaker, and in `ssl_transport_security.cc` we will access this +from the handshaker to increment metrics. + +```diff + tsi_result tsi_ssl_client_handshaker_factory_create_handshaker( + tsi_ssl_client_handshaker_factory* factory, + const char* server_name_indication, size_t network_bio_buf_size, + size_t ssl_bio_buf_size, + std::optional alpn_preferred_protocol_list, +- tsi_handshaker** handshaker); ++ tsi_handshaker** handshaker, ++ std::shared_ptr stats_plugin_group = nullptr); + +``` + +```diff + tsi_result tsi_ssl_server_handshaker_factory_create_handshaker( + tsi_ssl_server_handshaker_factory* factory, size_t network_bio_buf_size, +- size_t ssl_bio_buf_size, tsi_handshaker** handshaker); ++ size_t ssl_bio_buf_size, tsi_handshaker** handshaker, ++ std::shared_ptr stats_plugin_group = nullptr); +``` +### Golang + +For the implementation of the general handshake metric, we leverage gRPC-Go's existing transport-level architecture. + +The abstraction layer for general handshake information is the Transport +Connection Layer (`internal/transport`). Client Handshakes are performed in a +single place: `internal/transport/http2_client.go` inside `NewHTTP2Client` via +`transportCreds.ClientHandshake`. 
Server Handshakes are also performed in a +single place: `internal/transport/http2_server.go` inside `NewServerTransport` +via `config.Credentials.ServerHandshake`. We can increment the metrics here +only in the case where TLS is the protocol being used. + +```diff +--- a/internal/transport/http2_client.go ++++ b/internal/transport/http2_client.go +@@ -294,7 +294,23 @@ + if transportCreds != nil { ++ isTLS := transportCreds.Info().SecurityProtocol == "tls" ++ var startTime time.Time ++ if isTLS { ++ startTime = time.Now() ++ } + conn, authInfo, err = transportCreds.ClientHandshake(connectCtx, addr.ServerName, conn) ++ if isTLS { ++ duration := time.Since(startTime).Seconds() ++ ++ } ++ + if err != nil { + return nil, connectionErrorf(isTemporary(err), err, "transport: authentication handshake failed: %v", err) + } + +``` + +To pass out resumption information, we will need to augment the [`TLSInfo`](https://github.com/grpc/grpc-go/blob/660208049b96ff6232e8c7212905b3e357b5bf42/credentials/tls.go#L41) to include [`ConnectionState.DidResume`](https://github.com/golang/go/blob/e62d3e6e897175a07aa44a7b2c7f99700072f22f/src/crypto/tls/common.go#L257). + +The TLS-specific offload metrics will go with their implementation. This feature is not yet written in Go, so we cannot discuss specific metric implementation details. + +### Java + +In gRPC-Java, to support shading where core classes are relocated per-consumer, all metric instruments used in core/ or transport modules (like netty/) must be defined within the api/ module. + +Following the pattern established by `MIN_RTT_INSTRUMENT` in `InternalTcpMetrics.java`, we will: + +1. Define a new `InternalSecurityMetrics.java` class in the api/ module under the `io.grpc` package. +2. Expose `MetricRecorder` from `GrpcHttp2ConnectionHandler` in the netty/ module. +3. Instrument the Netty `ClientTlsHandler` and `ServerTlsHandler` in `ProtocolNegotiators.java` to track handshake duration and record it. 
+ +The TLS specific offload metrics will go with their implementation. This feature is not yet written in Java, so we cannot discuss specific metric implementation details. + +### Wrapped Languages + +Wrapped languages that support non-per-call metrics should be able to get this "for free" from the Core implementation. However, Python for example, does not currently support non-per-call metrics. From 26aab2a1a57dade21ed2738fe537ecb68cc9b1e3 Mon Sep 17 00:00:00 2001 From: Gregory Cooke Date: Wed, 20 May 2026 14:30:42 +0000 Subject: [PATCH 04/13] rename to result --- A118-authentication-telemetry.md | 248 ------------------------------- A118-tls-telemetry.md | 18 +-- 2 files changed, 7 insertions(+), 259 deletions(-) delete mode 100644 A118-authentication-telemetry.md diff --git a/A118-authentication-telemetry.md b/A118-authentication-telemetry.md deleted file mode 100644 index f4f7517cd..000000000 --- a/A118-authentication-telemetry.md +++ /dev/null @@ -1,248 +0,0 @@ -A118: Authentication Telemetry ----- -* Author(s): @gtcooke94 -* Approver: @dfawley, @easwars, @ejona86, @mattstev, @markdroth -* Status: In Review -* Implemented in: -* Last updated: 2026-05-18 -* Discussion at: (filled after thread exists) - -## Abstract - -gRPC's authentication stack has no telemetry. This document details adding -non-per-call metrics to the TLS authentication stack. - -## Background - -gRPC's authentication stack currently lacks telemetry. This document outlines -the addition of authentication-related non-per-call metrics, leveraging the -existing telemetry infrastructure defined in -[A66](https://github.com/grpc/proposal/blob/master/A66-otel-stats.md) and the -specific architecture for non-per-call metrics established in -[A79](https://github.com/grpc/proposal/blob/master/A79-non-per-call-metrics-architecture.md). -Because authentication and handshakers are connection-level abstractions, they -inherently require non-per-call instrumentation. 
Currently, application owners -face challenges in diagnosing handshake failures, relying on verbose logging -without a structured mechanism for aggregation and analysis. - -### Related Proposals: -* [A66: OpenTelemetry Metrics/Stats](https://github.com/grpc/proposal/blob/master/A66-otel-stats.md) -* [A79: OpenTelemetry Non-Per-Call Metrics Architecture](https://github.com/grpc/proposal/blob/master/A79-non-per-call-metrics-architecture.md) - -## Proposal - -All metrics will be scoped to TLS exclusively. - -### TLS Telemetry Status Enum - -The handshaker status will be represented by an enum that indicates success or -provides information on why the handshake failed. This value must manage a -balance of low-cardinality while being fine-grained enough to be useful; -therefore, an enum containing subdomains of authentication errors will be -created. This is presented as a C++ enum below, but will be identical in all -languages. In cases where we cannot categorize an error or cannot get enough -granularity in a given implementation and/or language, `UNKNOWN_FAILURE` will be -the catch-all error code. - -```c++ -enum class TlsTelemetryStatus { - UNKNOWN_FAILURE, - SUCCESS, - // Peer certificate verification failures. - CERTIFICATE_VERIFICATION_FAILED, - CERTIFICATE_REVOKED, - CERTIFICATE_EXPIRED, - CERTIFICATE_NOT_YET_VALID, - CERTIFICATE_AUTHORITY_INVALID, - // TLS negotiation mismatch failures - CERTIFICATE_HOSTNAME_MISMATCH, - CERTIFICATE_MALFORMED, - CIPHER_SUITE_MISMATCH, - PROTOCOL_VERSION_UNSUPPORTED, - INAPPROPRIATE_FALLBACK, - NO_APPLICATION_PROTOCOL, - // Cryptographic failures - SIGNATURE_VERIFICATION_FAILED, - DECRYPTION_FAILED, - KEY_EXCHANGE_FAILURE, - // Other failures - UNEXPECTED_MESSAGE, - HANDSHAKE_TIMEOUT, - PEER_CONNECTION_CLOSED -}; -``` - -### TLS Resumption Type Enum - -Further, whether the handshake is resumed or not is also critical for -understanding authentication behavior. 
An enum describing the type of resumption -used (or none) will be created - it can be extended in the future to make -finer-grained distinctions between the type of resumption that is used (e.g., -ticket-based resumption vs. session-based resumption). This is presented as a -C++ enum below, but will be identical in all languages. - -```c++ -enum class TlsResumptionType { - FULL_HANDSHAKE, - RESUMED_HANDSHAKE, -}; -``` - -### Metrics Definitions - -The following metrics are count metrics, with the primary information coming -from the labels. - -* `grpc.client.tls.handshakes` - -| Label Name | Required/Optional | Description | -| :--- | :--- | :--- | -| `grpc.tls.handshake.status` | Required | The `TlsTelemetryStatus` enum indicating success or the reason for handshake failure | -| `grpc.target` | Required | The target string (as defined in A66) passed to the channel. | -| `grpc.tls.handshake.resumed` | Optional | The `TlsResumptionType` enum | - -* `grpc.server.tls.handshakes` - -| Label Name | Required/Optional | Description | -| :--- | :--- | :--- | -| `grpc.tls.handshake.status` | Required | The `TlsTelemetryStatus` enum indicating success or the reason for handshake failure | -| `grpc.tls.handshake.resumed` | Optional | The `TlsResumptionType` enum | - -### TLS Offload Specific Metrics - -The following metrics are non-per-call bucketed latency metrics that report the duration of offloaded cryptographic operations. - -* `grpc.client.tls.offload_certificate_selection_duration` (unit: float64, type: histogram - latency buckets defined in A66) - -| Label Name | Required/Optional | Description | -| :--- | :--- | :--- | -| `grpc.status` | Required | Result of the certificate selection offloading, in the format of a gRPC status code (as defined in A66). | -| `grpc.target` | Required | The target string (as defined in A66) passed to the channel. 
| - -* `grpc.server.tls.offload_certificate_selection_duration` (unit: float64, type: histogram - latency buckets defined in A66) - -| Label Name | Required/Optional | Description | -| :--- | :--- | :--- | -| `grpc.status` | Required | Result of the certificate selection offloading, in the format of a gRPC status code (as defined in A66). | - -* `grpc.client.tls.offload_private_key_operation_duration` (unit: float64, type: histogram - latency buckets defined in A66) - -| Label Name | Required/Optional | Description | -| :--- | :--- | :--- | -| `grpc.status` | Required | Result of the certificate selection offloading, in the format of a gRPC status code (as defined in A66). | -| `grpc.target` | Required | The target string (as defined in A66) passed to the channel. | -| `grpc.tls.private_key_algorithm` | Optional | An algorithm enum indicating how the offloaded private key operation was done, e.g. “RsaPkcs1Sha256”. | - -* `grpc.server.tls.offload_private_key_operation_duration` (unit: float64, type: histogram - latency buckets defined in A66) - -| Label Name | Required/Optional | Description | -| :--- | :--- | :--- | -| `grpc.status` | Required | Result of the certificate selection offloading, in the format of a gRPC status code (as defined in A66). | -| `grpc.tls.private_key_algorithm` | Optional | An algorithm enum indicating how the offloaded private key operation was done, e.g. “RsaPkcs1Sha256”. | - -### Temporary environment variable protection - -This feature will be explicitly configured by users, thus no environment -variable protection is needed. If a user does not configure TLS metrics or -offloading, telemetry won't be collected under this mechanism unless general -telemetry/stats plugins are active. - -## Rationale - -The alternative of splitting generic handshaker metrics and TLS-specific metrics -was considered, but opened many rabbit holes as to what would qualify as a -handshaker (e.g. are HTTP Connect, TCP, RPC Switch, etc. 
all in scope, or just -alternative protocols to TLS such as ALTS). Thus, the decision was made to scope -the metrics to TLS. - -## Implementation - -### C++ - -In the C++ implementation, the transport security interface (TSI) has -historically been decoupled from gRPC. However, this design choice has been -broken over time, and the two have been coupled for years now. The reasons -behind decoupling TSI and gRPC are no longer relevant, therefore we will fully -accept this coupling. Thus, the TSI code can contain gRPC monitoring specifics. -In the few use-cases where TSI is not called via gRPC, we will ensure that -metric incrementation is not performed. - -We will add the Channel's `StatsPluginGroup` as an optional argument to TSI -handshaker creation functions. This ensures that we don't break any existing -users of TSI and that we never increment metrics when TSI is used outside of -gRPC. When TSI is called from gRPC, we will pass this argument. This will be -stored on the handshaker, and in `ssl_transport_security.cc` we will access this -from the handshaker to increment metrics. - -```diff - tsi_result tsi_ssl_client_handshaker_factory_create_handshaker( - tsi_ssl_client_handshaker_factory* factory, - const char* server_name_indication, size_t network_bio_buf_size, - size_t ssl_bio_buf_size, - std::optional alpn_preferred_protocol_list, -- tsi_handshaker** handshaker); -+ tsi_handshaker** handshaker, -+ std::shared_ptr stats_plugin_group = nullptr); - -``` - -```diff - tsi_result tsi_ssl_server_handshaker_factory_create_handshaker( - tsi_ssl_server_handshaker_factory* factory, size_t network_bio_buf_size, -- size_t ssl_bio_buf_size, tsi_handshaker** handshaker); -+ size_t ssl_bio_buf_size, tsi_handshaker** handshaker, -+ std::shared_ptr stats_plugin_group = nullptr); -``` -### Golang - -For the implementation of the general handshake metric, we leverage gRPC-Go's existing transport-level architecture. 
- -The abstraction layer for general handshake information is the Transport -Connection Layer (`internal/transport`). Client Handshakes are performed in a -single place: `internal/transport/http2_client.go` inside `NewHTTP2Client` via -`transportCreds.ClientHandshake`. Server Handshakes are also performed in a -single place: `internal/transport/http2_server.go` inside `NewServerTransport` -via `config.Credentials.ServerHandshake`. We can incremement the metrics here -only in the case where TLS is the protocol being used. - -```diff ---- a/internal/transport/http2_client.go -+++ b/internal/transport/http2_client.go -@@ -294,7 +294,23 @@ - if transportCreds != nil { -+ isTLS := transportCreds.Info().SecurityProtocol == "tls" -+ var startTime time.Time -+ if isTLS { -+ startTime = time.Now() -+ } - conn, authInfo, err = transportCreds.ClientHandshake(connectCtx, addr.ServerName, conn) -+ if isTLS { -+ duration := time.Since(startTime).Seconds() -+ -+ } -+ - if err != nil { - return nil, connectionErrorf(isTemporary(err), err, "transport: authentication handshake failed: %v", err) - } - -``` - -To pass out resumption information, we will need to augment the [`TLSInfo`](https://github.com/grpc/grpc-go/blob/660208049b96ff6232e8c7212905b3e357b5bf42/credentials/tls.go#L41) to include [`ConnectionState.DidResume`](https://github.com/golang/go/blob/e62d3e6e897175a07aa44a7b2c7f99700072f22f/src/crypto/tls/common.go#L257). - -The TLS specific offload metrics will go with their implementation. This feature is not yet written in Go, so we cannot discuss specific metric implementation details. - -### Java - -In gRPC-Java, to support shading where core classes are relocated per-consumer, all metric instruments used in core/ or transport modules (like netty/) must be defined within the api/ module. - -Following the pattern established by `MIN_RTT_INSTRUMENT` in `InternalTcpMetrics.java`, we will: - -1. 
Define a new `InternalSecurityMetrics.java` class in the api/ module under the `io.grpc` package. -2. Expose `MetricRecorder` from `GrpcHttp2ConnectionHandler` in the netty/ module. -3. Instrument the Netty `ClientTlsHandler` and `ServerTlsHandler` in `ProtocolNegotiators.java` to track handshake duration and record it. - -The TLS specific offload metrics will go with their implementation. This feature is not yet written in Java, so we cannot discuss specific metric implementation details. - -### Wrapped Languages - -Wrapped languages that support non-per-call metrics should be able to get this "for free" from the Core implementation. However, Python for example, does not currently support non-per-call metrics. diff --git a/A118-tls-telemetry.md b/A118-tls-telemetry.md index f4f7517cd..a5cc73a66 100644 --- a/A118-tls-telemetry.md +++ b/A118-tls-telemetry.md @@ -33,7 +33,7 @@ without a structured mechanism for aggregation and analysis. All metrics will be scoped to TLS exclusively. -### TLS Telemetry Status Enum +### TLS Telemetry Result Enum The handshaker status will be represented by an enum that indicates success or provides information on why the handshake failed. This value must manage a @@ -45,7 +45,7 @@ granularity in a given implementation and/or language, `UNKNOWN_FAILURE` will be the catch-all error code. ```c++ -enum class TlsTelemetryStatus { +enum class TlsTelemetryResult { UNKNOWN_FAILURE, SUCCESS, // Peer certificate verification failures. @@ -97,7 +97,7 @@ from the labels. | Label Name | Required/Optional | Description | | :--- | :--- | :--- | -| `grpc.tls.handshake.status` | Required | The `TlsTelemetryStatus` enum indicating success or the reason for handshake failure | +| `grpc.tls.handshake.result` | Required | The `TlsTelemetryResult` enum indicating success or the reason for handshake failure | | `grpc.target` | Required | The target string (as defined in A66) passed to the channel. 
| | `grpc.tls.handshake.resumed` | Optional | The `TlsResumptionType` enum | @@ -105,26 +105,22 @@ from the labels. | Label Name | Required/Optional | Description | | :--- | :--- | :--- | -| `grpc.tls.handshake.status` | Required | The `TlsTelemetryStatus` enum indicating success or the reason for handshake failure | +| `grpc.tls.handshake.result` | Required | The `TlsTelemetryResult` enum indicating success or the reason for handshake failure | | `grpc.tls.handshake.resumed` | Optional | The `TlsResumptionType` enum | ### TLS Offload Specific Metrics The following metrics are non-per-call bucketed latency metrics that report the duration of offloaded cryptographic operations. -* `grpc.client.tls.offload_certificate_selection_duration` (unit: float64, type: histogram - latency buckets defined in A66) - -| Label Name | Required/Optional | Description | -| :--- | :--- | :--- | -| `grpc.status` | Required | Result of the certificate selection offloading, in the format of a gRPC status code (as defined in A66). | -| `grpc.target` | Required | The target string (as defined in A66) passed to the channel. | - * `grpc.server.tls.offload_certificate_selection_duration` (unit: float64, type: histogram - latency buckets defined in A66) | Label Name | Required/Optional | Description | | :--- | :--- | :--- | | `grpc.status` | Required | Result of the certificate selection offloading, in the format of a gRPC status code (as defined in A66). | +Note - there is no associated client certificate selection metric. This is a +server specific feature. 
+ * `grpc.client.tls.offload_private_key_operation_duration` (unit: float64, type: histogram - latency buckets defined in A66) | Label Name | Required/Optional | Description | From df34d09ac2b58207349629b11ceef72e45c14ff0 Mon Sep 17 00:00:00 2001 From: Gregory Cooke Date: Wed, 20 May 2026 14:35:51 +0000 Subject: [PATCH 05/13] update more --- A118-tls-telemetry.md | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/A118-tls-telemetry.md b/A118-tls-telemetry.md index a5cc73a66..4957ea045 100644 --- a/A118-tls-telemetry.md +++ b/A118-tls-telemetry.md @@ -28,6 +28,7 @@ without a structured mechanism for aggregation and analysis. ### Related Proposals: * [A66: OpenTelemetry Metrics/Stats](https://github.com/grpc/proposal/blob/master/A66-otel-stats.md) * [A79: OpenTelemetry Non-Per-Call Metrics Architecture](https://github.com/grpc/proposal/blob/master/A79-non-per-call-metrics-architecture.md) +* [A107: TLS Private Key Offloading ](https://github.com/grpc/proposal/blob/master/A107-tls-private-key-offloading.md) ## Proposal @@ -121,20 +122,27 @@ The following metrics are non-per-call bucketed latency metrics that report the Note - there is no associated client certificate selection metric. This is a server specific feature. -* `grpc.client.tls.offload_private_key_operation_duration` (unit: float64, type: histogram - latency buckets defined in A66) +For the offloaded private key metrics, we specify signing because that is the +only offloaded private key operation supported by gRPC. Older TLS versions have +the concept of other private key operations that gRPC does not support. See +[A107] for more detail on private key signers. 
+ +* `grpc.client.tls.offload_private_key_signing_duration` (unit: float64, type: histogram - latency buckets defined in A66) | Label Name | Required/Optional | Description | | :--- | :--- | :--- | | `grpc.status` | Required | Result of the certificate selection offloading, in the format of a gRPC status code (as defined in A66). | | `grpc.target` | Required | The target string (as defined in A66) passed to the channel. | -| `grpc.tls.private_key_algorithm` | Optional | An algorithm enum indicating how the offloaded private key operation was done, e.g. “RsaPkcs1Sha256”. | +| `grpc.tls.private_key_algorithm` | Optional | An algorithm enum indicating how the offloaded private key signing was done, e.g. “RsaPkcs1Sha256”. | +| `grpc.tls.private_key.implementation` | Required | A string identifying the private key signer implementation. | -* `grpc.server.tls.offload_private_key_operation_duration` (unit: float64, type: histogram - latency buckets defined in A66) +* `grpc.server.tls.offload_private_key_signing_duration` (unit: float64, type: histogram - latency buckets defined in A66) | Label Name | Required/Optional | Description | | :--- | :--- | :--- | | `grpc.status` | Required | Result of the certificate selection offloading, in the format of a gRPC status code (as defined in A66). | -| `grpc.tls.private_key_algorithm` | Optional | An algorithm enum indicating how the offloaded private key operation was done, e.g. “RsaPkcs1Sha256”. | +| `grpc.tls.private_key_algorithm` | Optional | An algorithm enum indicating how the offloaded private key signing was done, e.g. “RsaPkcs1Sha256”. | +| `grpc.tls.private_key.implementation` | Required | A string identifying the private key signer implementation. 
| ### Temporary environment variable protection From 7290804ed579d63024dec11aacdd4409661bdce6 Mon Sep 17 00:00:00 2001 From: Gregory Cooke Date: Wed, 20 May 2026 14:39:31 +0000 Subject: [PATCH 06/13] added extra labels --- A118-tls-telemetry.md | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/A118-tls-telemetry.md b/A118-tls-telemetry.md index 4957ea045..2dc9b43d1 100644 --- a/A118-tls-telemetry.md +++ b/A118-tls-telemetry.md @@ -27,8 +27,10 @@ without a structured mechanism for aggregation and analysis. ### Related Proposals: * [A66: OpenTelemetry Metrics/Stats](https://github.com/grpc/proposal/blob/master/A66-otel-stats.md) +* [A78: gRPC OTel Metrics for WRR, Pick First, and XdsClient](https://github.com/grpc/proposal/blob/master/A78-grpc-metrics-wrr-pf-xds.md) * [A79: OpenTelemetry Non-Per-Call Metrics Architecture](https://github.com/grpc/proposal/blob/master/A79-non-per-call-metrics-architecture.md) -* [A107: TLS Private Key Offloading ](https://github.com/grpc/proposal/blob/master/A107-tls-private-key-offloading.md) +* [A89: Backend Service Metric Label](https://github.com/grpc/proposal/blob/master/A89-backend-service-metric-label.md) +* [A107: TLS Private Key Offloading](https://github.com/grpc/proposal/blob/master/A107-tls-private-key-offloading.md) ## Proposal @@ -101,6 +103,8 @@ from the labels. | `grpc.tls.handshake.result` | Required | The `TlsTelemetryResult` enum indicating success or the reason for handshake failure | | `grpc.target` | Required | The target string (as defined in A66) passed to the channel. | | `grpc.tls.handshake.resumed` | Optional | The `TlsResumptionType` enum | +| `grpc.lb.locality` | Optional | The locality to which the traffic is being sent (as defined in A78). | +| `grpc.lb.backend_service` | Optional | The backend service to which the traffic is being sent (as defined in A89). | * `grpc.server.tls.handshakes` @@ -135,6 +139,8 @@ the concept of other private key operations that gRPC does not support. 
See | `grpc.target` | Required | The target string (as defined in A66) passed to the channel. | | `grpc.tls.private_key_algorithm` | Optional | An algorithm enum indicating how the offloaded private key signing was done, e.g. “RsaPkcs1Sha256”. | | `grpc.tls.private_key.implementation` | Required | A string identifying the private key signer implementation. | +| `grpc.lb.locality` | Optional | The locality to which the traffic is being sent (as defined in A78). | +| `grpc.lb.backend_service` | Optional | The backend service to which the traffic is being sent (as defined in A89). | * `grpc.server.tls.offload_private_key_signing_duration` (unit: float64, type: histogram - latency buckets defined in A66) From 02216f52e7dfd67dbad67abd9b6a335255bfa084 Mon Sep 17 00:00:00 2001 From: Gregory Cooke Date: Wed, 20 May 2026 14:40:26 +0000 Subject: [PATCH 07/13] add TOODs for new labels --- A118-tls-telemetry.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/A118-tls-telemetry.md b/A118-tls-telemetry.md index 2dc9b43d1..9d063f2e6 100644 --- a/A118-tls-telemetry.md +++ b/A118-tls-telemetry.md @@ -103,8 +103,8 @@ from the labels. | `grpc.tls.handshake.result` | Required | The `TlsTelemetryResult` enum indicating success or the reason for handshake failure | | `grpc.target` | Required | The target string (as defined in A66) passed to the channel. | | `grpc.tls.handshake.resumed` | Optional | The `TlsResumptionType` enum | -| `grpc.lb.locality` | Optional | The locality to which the traffic is being sent (as defined in A78). | -| `grpc.lb.backend_service` | Optional | The backend service to which the traffic is being sent (as defined in A89). | +| `grpc.lb.locality` | Optional | The locality to which the traffic is being sent (as defined in A78). TODO - is this actually possible to get in every language | +| `grpc.lb.backend_service` | Optional | The backend service to which the traffic is being sent (as defined in A89). 
TODO - is this actually possible to get in every language | * `grpc.server.tls.handshakes` @@ -139,8 +139,8 @@ the concept of other private key operations that gRPC does not support. See | `grpc.target` | Required | The target string (as defined in A66) passed to the channel. | | `grpc.tls.private_key_algorithm` | Optional | An algorithm enum indicating how the offloaded private key signing was done, e.g. “RsaPkcs1Sha256”. | | `grpc.tls.private_key.implementation` | Required | A string identifying the private key signer implementation. | -| `grpc.lb.locality` | Optional | The locality to which the traffic is being sent (as defined in A78). | -| `grpc.lb.backend_service` | Optional | The backend service to which the traffic is being sent (as defined in A89). | +| `grpc.lb.locality` | Optional | The locality to which the traffic is being sent (as defined in A78). TODO - is this actually possible to get in every language | +| `grpc.lb.backend_service` | Optional | The backend service to which the traffic is being sent (as defined in A89). TODO - is this actually possible to get in every language | * `grpc.server.tls.offload_private_key_signing_duration` (unit: float64, type: histogram - latency buckets defined in A66) From beb8f75c40c88a18e173058a1ff0047cfc57f464 Mon Sep 17 00:00:00 2001 From: Gregory Cooke Date: Wed, 20 May 2026 16:58:21 +0000 Subject: [PATCH 08/13] rename to TlsTelemetryHandshakeResult --- A118-tls-telemetry.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/A118-tls-telemetry.md b/A118-tls-telemetry.md index 9d063f2e6..98b207240 100644 --- a/A118-tls-telemetry.md +++ b/A118-tls-telemetry.md @@ -48,7 +48,7 @@ granularity in a given implementation and/or language, `UNKNOWN_FAILURE` will be the catch-all error code. ```c++ -enum class TlsTelemetryResult { +enum class TlsTelemetryHandshakeResult { UNKNOWN_FAILURE, SUCCESS, // Peer certificate verification failures. @@ -100,7 +100,7 @@ from the labels. 
| Label Name | Required/Optional | Description | | :--- | :--- | :--- | -| `grpc.tls.handshake.result` | Required | The `TlsTelemetryResult` enum indicating success or the reason for handshake failure | +| `grpc.tls.handshake.result` | Required | The `TlsTelemetryHandshakeResult` enum indicating success or the reason for handshake failure | | `grpc.target` | Required | The target string (as defined in A66) passed to the channel. | | `grpc.tls.handshake.resumed` | Optional | The `TlsResumptionType` enum | | `grpc.lb.locality` | Optional | The locality to which the traffic is being sent (as defined in A78). TODO - is this actually possible to get in every language | @@ -110,7 +110,7 @@ from the labels. | Label Name | Required/Optional | Description | | :--- | :--- | :--- | -| `grpc.tls.handshake.result` | Required | The `TlsTelemetryResult` enum indicating success or the reason for handshake failure | +| `grpc.tls.handshake.result` | Required | The `TlsTelemetryHandshakeResult` enum indicating success or the reason for handshake failure | | `grpc.tls.handshake.resumed` | Optional | The `TlsResumptionType` enum | ### TLS Offload Specific Metrics From 3cc7cbd3b5d3b397aad24d36c63c0f00877631bc Mon Sep 17 00:00:00 2001 From: Gregory Cooke Date: Wed, 20 May 2026 19:19:48 +0000 Subject: [PATCH 09/13] address PR comments --- A118-tls-telemetry.md | 57 +++++++++++++++++++++++-------------------- 1 file changed, 31 insertions(+), 26 deletions(-) diff --git a/A118-tls-telemetry.md b/A118-tls-telemetry.md index 98b207240..71b702f91 100644 --- a/A118-tls-telemetry.md +++ b/A118-tls-telemetry.md @@ -1,7 +1,7 @@ A118: Authentication Telemetry ---- * Author(s): @gtcooke94 -* Approver: @dfawley, @easwars, @ejona86, @mattstev, @markdroth +* Approver: @dfawley, @easwars, @ejona86, @matthewstevenson88, @markdroth * Status: In Review * Implemented in: * Last updated: 2026-05-18 @@ -10,7 +10,7 @@ A118: Authentication Telemetry ## Abstract gRPC's authentication stack has no telemetry. 
This document details adding -non-per-call metrics to the TLS authentication stack. +non-per-call metrics to gRPC's SSL/TLS authentication stack. ## Background @@ -38,7 +38,7 @@ All metrics will be scoped to TLS exclusively. ### TLS Telemetry Result Enum -The handshaker status will be represented by an enum that indicates success or +The TLS handshake result will be represented by an enum that indicates success or provides information on why the handshake failed. This value must manage a balance of low-cardinality while being fine-grained enough to be useful; therefore, an enum containing subdomains of authentication errors will be @@ -100,9 +100,9 @@ from the labels. | Label Name | Required/Optional | Description | | :--- | :--- | :--- | -| `grpc.tls.handshake.result` | Required | The `TlsTelemetryHandshakeResult` enum indicating success or the reason for handshake failure | +| `grpc.tls.handshake.result` | Required | The `TlsTelemetryHandshakeResult` enum indicating success or the reason for handshake failure. | | `grpc.target` | Required | The target string (as defined in A66) passed to the channel. | -| `grpc.tls.handshake.resumed` | Optional | The `TlsResumptionType` enum | +| `grpc.tls.handshake.resumed` | Optional | The `TlsResumptionType` enum indicating if and how the handshake was resumed. | | `grpc.lb.locality` | Optional | The locality to which the traffic is being sent (as defined in A78). TODO - is this actually possible to get in every language | | `grpc.lb.backend_service` | Optional | The backend service to which the traffic is being sent (as defined in A89). TODO - is this actually possible to get in every language | @@ -110,8 +110,8 @@ from the labels. 
| Label Name | Required/Optional | Description | | :--- | :--- | :--- | -| `grpc.tls.handshake.result` | Required | The `TlsTelemetryHandshakeResult` enum indicating success or the reason for handshake failure | -| `grpc.tls.handshake.resumed` | Optional | The `TlsResumptionType` enum | +| `grpc.tls.handshake.result` | Required | The `TlsTelemetryHandshakeResult` enum indicating success or the reason for handshake failure. | +| `grpc.tls.handshake.resumed` | Optional | The `TlsResumptionType` enum indicating if and how the handshake was resumed. | ### TLS Offload Specific Metrics @@ -127,9 +127,10 @@ Note - there is no associated client certificate selection metric. This is a server specific feature. For the offloaded private key metrics, we specify signing because that is the -only offloaded private key operation supported by gRPC. Older TLS versions have -the concept of other private key operations that gRPC does not support. See -[A107] for more detail on private key signers. +only offloaded private key operation supported by gRPC (for example, signature +offload using an EC or RSA key). Older TLS versions have the concept of other +private key operations that gRPC does not support (for example, decryption +offload using an RSA key). See [A107] for more detail on private key signers. * `grpc.client.tls.offload_private_key_signing_duration` (unit: float64, type: histogram - latency buckets defined in A66) @@ -161,15 +162,15 @@ telemetry/stats plugins are active. The alternative of splitting generic handshaker metrics and TLS-specific metrics was considered, but opened many rabbit holes as to what would qualify as a -handshaker (e.g. are HTTP Connect, TCP, RPC Switch, etc. all in scope, or just +handshaker (e.g. are HTTP Connect, TCP, etc. all in scope, or just alternative protocols to TLS such as ALTS). Thus, the decision was made to scope the metrics to TLS. 
## Implementation -### C++ +### C/C++ -In the C++ implementation, the transport security interface (TSI) has +In the C/C++ implementation, the transport security interface (TSI) has historically been decoupled from gRPC. However, this design choice has been broken over time, and the two have been coupled for years now. The reasons behind decoupling TSI and gRPC are no longer relevant, therefore we will fully @@ -177,8 +178,8 @@ accept this coupling. Thus, the TSI code can contain gRPC monitoring specifics. In the few use-cases where TSI is not called via gRPC, we will ensure that metric incrementation is not performed. -We will add the Channel's `StatsPluginGroup` as an optional argument to TSI -handshaker creation functions. This ensures that we don't break any existing +We will add the Channel's `StatsPluginGroup` as an optional argument to the SSL +TSI handshaker creation functions. This ensures that we don't break any existing users of TSI and that we never increment metrics when TSI is used outside of gRPC. When TSI is called from gRPC, we will pass this argument. This will be stored on the handshaker, and in `ssl_transport_security.cc` we will access this @@ -190,9 +191,8 @@ from the handshaker to increment metrics. const char* server_name_indication, size_t network_bio_buf_size, size_t ssl_bio_buf_size, std::optional alpn_preferred_protocol_list, -- tsi_handshaker** handshaker); -+ tsi_handshaker** handshaker, -+ std::shared_ptr stats_plugin_group = nullptr); ++ std::shared_ptr stats_plugin_group, + tsi_handshaker** handshaker); ``` @@ -200,17 +200,18 @@ from the handshaker to increment metrics. 
tsi_result tsi_ssl_server_handshaker_factory_create_handshaker( tsi_ssl_server_handshaker_factory* factory, size_t network_bio_buf_size, - size_t ssl_bio_buf_size, tsi_handshaker** handshaker); -+ size_t ssl_bio_buf_size, tsi_handshaker** handshaker, -+ std::shared_ptr stats_plugin_group = nullptr); ++ size_t ssl_bio_buf_size, ++ std::shared_ptr stats_plugin_group, ++ tsi_handshaker** handshaker); ``` ### Golang For the implementation of the general handshake metric, we leverage gRPC-Go's existing transport-level architecture. -The abstraction layer for general handshake information is the Transport -Connection Layer (`internal/transport`). Client Handshakes are performed in a +The abstraction layer for general handshake information is the transport +connection layer (`internal/transport`). Client handshakes are performed in a single place: `internal/transport/http2_client.go` inside `NewHTTP2Client` via -`transportCreds.ClientHandshake`. Server Handshakes are also performed in a +`transportCreds.ClientHandshake`. Server handshakes are also performed in a single place: `internal/transport/http2_server.go` inside `NewServerTransport` via `config.Credentials.ServerHandshake`. We can incremement the metrics here only in the case where TLS is the protocol being used. @@ -237,7 +238,7 @@ only in the case where TLS is the protocol being used. ``` -To pass out resumption information, we will need to augment the [`TLSInfo`](https://github.com/grpc/grpc-go/blob/660208049b96ff6232e8c7212905b3e357b5bf42/credentials/tls.go#L41) to include [`ConnectionState.DidResume`](https://github.com/golang/go/blob/e62d3e6e897175a07aa44a7b2c7f99700072f22f/src/crypto/tls/common.go#L257). 
+To extract resumption information, we will need to augment the [`TLSInfo`](https://github.com/grpc/grpc-go/blob/660208049b96ff6232e8c7212905b3e357b5bf42/credentials/tls.go#L41) to include [`ConnectionState.DidResume`](https://github.com/golang/go/blob/e62d3e6e897175a07aa44a7b2c7f99700072f22f/src/crypto/tls/common.go#L257). The TLS specific offload metrics will go with their implementation. This feature is not yet written in Go, so we cannot discuss specific metric implementation details. @@ -251,8 +252,12 @@ Following the pattern established by `MIN_RTT_INSTRUMENT` in `InternalTcpMetrics 2. Expose `MetricRecorder` from `GrpcHttp2ConnectionHandler` in the netty/ module. 3. Instrument the Netty `ClientTlsHandler` and `ServerTlsHandler` in `ProtocolNegotiators.java` to track handshake duration and record it. -The TLS specific offload metrics will go with their implementation. This feature is not yet written in Java, so we cannot discuss specific metric implementation details. +The TLS-specific offload metrics will go with their implementation. This feature +is not yet written in Java, so we cannot discuss specific metric implementation +details. ### Wrapped Languages -Wrapped languages that support non-per-call metrics should be able to get this "for free" from the Core implementation. However, Python for example, does not currently support non-per-call metrics. +Wrapped languages that support non-per-call metrics will get the authentication +telemetry features "for free" from the Core implementation. However, Python, for +example, does not currently support non-per-call metrics.
From b40d876480903e6bfee6995d99e83f791e48cf8a Mon Sep 17 00:00:00 2001
From: Gregory Cooke
Date: Thu, 21 May 2026 16:19:23 +0000
Subject: [PATCH 10/13] update header for handshake result enum

---
 A118-tls-telemetry.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/A118-tls-telemetry.md b/A118-tls-telemetry.md
index 71b702f91..fac15f629 100644
--- a/A118-tls-telemetry.md
+++ b/A118-tls-telemetry.md
@@ -36,7 +36,7 @@ without a structured mechanism for aggregation and analysis.

 All metrics will be scoped to TLS exclusively.

-### TLS Telemetry Result Enum
+### TLS Telemetry Handshake Result Enum

 The TLS handshake result will be represented by an enum that indicates success
 or provides information on why the handshake failed. This value must manage a

From 10157d0ad5aa5687787b6623cfb41392415ee46a Mon Sep 17 00:00:00 2001
From: Gregory Cooke
Date: Fri, 22 May 2026 18:13:48 +0000
Subject: [PATCH 11/13] updates

---
 A118-tls-telemetry.md | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/A118-tls-telemetry.md b/A118-tls-telemetry.md
index fac15f629..0fdef46f8 100644
--- a/A118-tls-telemetry.md
+++ b/A118-tls-telemetry.md
@@ -4,7 +4,7 @@ A118: Authentication Telemetry
 * Approver: @dfawley, @easwars, @ejona86, @matthewstevenson88, @markdroth
 * Status: In Review
 * Implemented in:
-* Last updated: 2026-05-18
+* Last updated: 2026-05-22
 * Discussion at: (filled after thread exists)

 ## Abstract
@@ -32,6 +32,12 @@ without a structured mechanism for aggregation and analysis.
 * [A89: Backend Service Metric Label](https://github.com/grpc/proposal/blob/master/A89-backend-service-metric-label.md)
 * [A107: TLS Private Key Offloading](https://github.com/grpc/proposal/blob/master/A107-tls-private-key-offloading.md)

+[A66]: A66-otel-stats.md
+[A78]: A78-grpc-metrics-wrr-pf-xds.md
+[A79]: A79-non-per-call-metrics-architecture.md
+[A89]: A89-backend-service-metric-label.md
+[A107]: A107-tls-private-key-offloading.md
+
 ## Proposal

 All metrics will be scoped to TLS exclusively.
@@ -101,10 +107,10 @@ from the labels.
 | Label Name | Required/Optional | Description |
 | :--- | :--- | :--- |
 | `grpc.tls.handshake.result` | Required | The `TlsTelemetryHandshakeResult` enum indicating success or the reason for handshake failure. |
-| `grpc.target` | Required | The target string (as defined in A66) passed to the channel. |
+| `grpc.target` | Required | The target string (as defined in [A66]) passed to the channel. |
 | `grpc.tls.handshake.resumed` | Optional | The `TlsResumptionType` enum indicating if and how the handshake was resumed. |
-| `grpc.lb.locality` | Optional | The locality to which the traffic is being sent (as defined in A78). TODO - is this actually possible to get in every language |
-| `grpc.lb.backend_service` | Optional | The backend service to which the traffic is being sent (as defined in A89). TODO - is this actually possible to get in every language |
+| `grpc.lb.locality` | Optional | The locality to which the traffic is being sent (as defined in [A78]). |
+| `grpc.lb.backend_service` | Optional | The backend service to which the traffic is being sent (as defined in [A89]). |

 * `grpc.server.tls.handshakes`

@@ -117,11 +123,11 @@ from the labels.
 The following metrics are non-per-call bucketed latency metrics that report the
 duration of offloaded cryptographic operations.

-* `grpc.server.tls.offload_certificate_selection_duration` (unit: float64, type: histogram - latency buckets defined in A66)
+* `grpc.server.tls.offload_certificate_selection_duration` (unit: float64, type: histogram - latency buckets defined in [A66])

 | Label Name | Required/Optional | Description |
 | :--- | :--- | :--- |
-| `grpc.status` | Required | Result of the certificate selection offloading, in the format of a gRPC status code (as defined in A66). |
+| `grpc.status` | Required | Result of the certificate selection offloading, in the format of a gRPC status code (as defined in [A66]). |

 Note - there is no associated client certificate selection metric. This is a
 server specific feature.
@@ -139,9 +145,9 @@ offload using an RSA key). See [A107] for more detail on private key signers.
 | `grpc.status` | Required | Result of the certificate selection offloading, in the format of a gRPC status code (as defined in A66). |
 | `grpc.target` | Required | The target string (as defined in A66) passed to the channel. |
 | `grpc.tls.private_key_algorithm` | Optional | An algorithm enum indicating how the offloaded private key signing was done, e.g. “RsaPkcs1Sha256”. |
-| `grpc.tls.private_key.implementation` | Required | A string identifying the private key signer implementation. |
-| `grpc.lb.locality` | Optional | The locality to which the traffic is being sent (as defined in A78). TODO - is this actually possible to get in every language |
-| `grpc.lb.backend_service` | Optional | The backend service to which the traffic is being sent (as defined in A89). TODO - is this actually possible to get in every language |
+| `grpc.tls.private_key.offloader_name` | Required | A string identifying the private key signer implementation, e.g. "HSM" or "private_key_signer_service". This must be low-cardinality |
+| `grpc.lb.locality` | Optional | The locality to which the traffic is being sent (as defined in [A78]). |
+| `grpc.lb.backend_service` | Optional | The backend service to which the traffic is being sent (as defined in [A89]). |

 * `grpc.server.tls.offload_private_key_signing_duration` (unit: float64, type: histogram - latency buckets defined in A66)

@@ -149,7 +155,7 @@ offload using an RSA key). See [A107] for more detail on private key signers.
 | :--- | :--- | :--- |
 | `grpc.status` | Required | Result of the certificate selection offloading, in the format of a gRPC status code (as defined in A66). |
 | `grpc.tls.private_key_algorithm` | Optional | An algorithm enum indicating how the offloaded private key signing was done, e.g. “RsaPkcs1Sha256”. |
-| `grpc.tls.private_key.implementation` | Required | A string identifying the private key signer implementation. |
+| `grpc.tls.private_key.offloader_name` | Required | A string identifying the private key signer implementation e.g. "HSM" or "private_key_signer_service". This must be low-cardinality. |

 ### Temporary environment variable protection

From b02596a1ddfed1a415450b2cc8e8fce99a06d2d5 Mon Sep 17 00:00:00 2001
From: Gregory Cooke
Date: Fri, 22 May 2026 18:22:12 +0000
Subject: [PATCH 12/13] swap order

---
 A118-tls-telemetry.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/A118-tls-telemetry.md b/A118-tls-telemetry.md
index 0fdef46f8..7fb54049b 100644
--- a/A118-tls-telemetry.md
+++ b/A118-tls-telemetry.md
@@ -144,8 +144,8 @@ offload using an RSA key). See [A107] for more detail on private key signers.
 | :--- | :--- | :--- |
 | `grpc.status` | Required | Result of the certificate selection offloading, in the format of a gRPC status code (as defined in A66). |
 | `grpc.target` | Required | The target string (as defined in A66) passed to the channel. |
-| `grpc.tls.private_key_algorithm` | Optional | An algorithm enum indicating how the offloaded private key signing was done, e.g. “RsaPkcs1Sha256”. |
 | `grpc.tls.private_key.offloader_name` | Required | A string identifying the private key signer implementation, e.g. "HSM" or "private_key_signer_service". This must be low-cardinality |
+| `grpc.tls.private_key_algorithm` | Optional | An algorithm enum indicating how the offloaded private key signing was done, e.g. “RsaPkcs1Sha256”. |
 | `grpc.lb.locality` | Optional | The locality to which the traffic is being sent (as defined in [A78]). |
 | `grpc.lb.backend_service` | Optional | The backend service to which the traffic is being sent (as defined in [A89]). |

@@ -154,8 +154,8 @@ offload using an RSA key). See [A107] for more detail on private key signers.
 | Label Name | Required/Optional | Description |
 | :--- | :--- | :--- |
 | `grpc.status` | Required | Result of the certificate selection offloading, in the format of a gRPC status code (as defined in A66). |
-| `grpc.tls.private_key_algorithm` | Optional | An algorithm enum indicating how the offloaded private key signing was done, e.g. “RsaPkcs1Sha256”. |
 | `grpc.tls.private_key.offloader_name` | Required | A string identifying the private key signer implementation e.g. "HSM" or "private_key_signer_service". This must be low-cardinality. |
+| `grpc.tls.private_key_algorithm` | Optional | An algorithm enum indicating how the offloaded private key signing was done, e.g. “RsaPkcs1Sha256”. |

 ### Temporary environment variable protection

From 461f846258aced6e98462877b1e8b7a00bbcebb3 Mon Sep 17 00:00:00 2001
From: Gregory Cooke
Date: Fri, 22 May 2026 18:25:47 +0000
Subject: [PATCH 13/13] a few links

---
 A118-tls-telemetry.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/A118-tls-telemetry.md b/A118-tls-telemetry.md
index 7fb54049b..b8120194a 100644
--- a/A118-tls-telemetry.md
+++ b/A118-tls-telemetry.md
@@ -138,22 +138,22 @@ offload using an EC or RSA key). Older TLS versions have the concept of other
 private key operations that gRPC does not support (for example, decryption
 offload using an RSA key). See [A107] for more detail on private key signers.

-* `grpc.client.tls.offload_private_key_signing_duration` (unit: float64, type: histogram - latency buckets defined in A66)
+* `grpc.client.tls.offload_private_key_signing_duration` (unit: float64, type: histogram - latency buckets defined in [A66])

 | Label Name | Required/Optional | Description |
 | :--- | :--- | :--- |
-| `grpc.status` | Required | Result of the certificate selection offloading, in the format of a gRPC status code (as defined in A66). |
-| `grpc.target` | Required | The target string (as defined in A66) passed to the channel. |
+| `grpc.status` | Required | Result of the certificate selection offloading, in the format of a gRPC status code (as defined in [A66]). |
+| `grpc.target` | Required | The target string (as defined in [A66]) passed to the channel. |
 | `grpc.tls.private_key.offloader_name` | Required | A string identifying the private key signer implementation, e.g. "HSM" or "private_key_signer_service". This must be low-cardinality |
 | `grpc.tls.private_key_algorithm` | Optional | An algorithm enum indicating how the offloaded private key signing was done, e.g. “RsaPkcs1Sha256”. |
 | `grpc.lb.locality` | Optional | The locality to which the traffic is being sent (as defined in [A78]). |
 | `grpc.lb.backend_service` | Optional | The backend service to which the traffic is being sent (as defined in [A89]). |

-* `grpc.server.tls.offload_private_key_signing_duration` (unit: float64, type: histogram - latency buckets defined in A66)
+* `grpc.server.tls.offload_private_key_signing_duration` (unit: float64, type: histogram - latency buckets defined in [A66])

 | Label Name | Required/Optional | Description |
 | :--- | :--- | :--- |
-| `grpc.status` | Required | Result of the certificate selection offloading, in the format of a gRPC status code (as defined in A66). |
+| `grpc.status` | Required | Result of the certificate selection offloading, in the format of a gRPC status code (as defined in [A66]). |
 | `grpc.tls.private_key.offloader_name` | Required | A string identifying the private key signer implementation e.g. "HSM" or "private_key_signer_service". This must be low-cardinality. |
 | `grpc.tls.private_key_algorithm` | Optional | An algorithm enum indicating how the offloaded private key signing was done, e.g. “RsaPkcs1Sha256”. |
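
Reviewer note: the label tables for the offload metrics, as they stand after patch 13/13, can be summarized as a small lookup that a test might use to check recorded label sets. This is only an illustrative sketch in Python; `LabelSpec`, `OFFLOAD_METRIC_LABELS`, and `check_labels` are hypothetical names that appear in neither the proposal nor any gRPC API. Only the metric and label names are taken from the tables above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelSpec:
    """Required and optional label names for one metric (hypothetical helper)."""
    required: frozenset
    optional: frozenset = frozenset()


# Label sets copied from the tables in the series; the dict itself is an assumption.
OFFLOAD_METRIC_LABELS = {
    "grpc.server.tls.offload_certificate_selection_duration": LabelSpec(
        required=frozenset({"grpc.status"}),
    ),
    "grpc.client.tls.offload_private_key_signing_duration": LabelSpec(
        required=frozenset({
            "grpc.status",
            "grpc.target",
            "grpc.tls.private_key.offloader_name",
        }),
        optional=frozenset({
            "grpc.tls.private_key_algorithm",
            "grpc.lb.locality",
            "grpc.lb.backend_service",
        }),
    ),
    "grpc.server.tls.offload_private_key_signing_duration": LabelSpec(
        required=frozenset({
            "grpc.status",
            "grpc.tls.private_key.offloader_name",
        }),
        optional=frozenset({"grpc.tls.private_key_algorithm"}),
    ),
}


def check_labels(metric, labels):
    """Return (missing required labels, unknown labels), each sorted by name."""
    spec = OFFLOAD_METRIC_LABELS[metric]
    present = set(labels)
    missing = sorted(spec.required - present)
    unknown = sorted(present - spec.required - spec.optional)
    return missing, unknown
```

A label set that still uses the pre-rename `grpc.tls.private_key.implementation` name would show up in the second list, with `grpc.tls.private_key.offloader_name` reported as missing in the first.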