
WIP: integrate OpenTelemetry SDK for metrics/logs export#40

Open
JoshDreamland wants to merge 7 commits into ClickHouse:main from JoshDreamland:WIP-OTel

Conversation

@JoshDreamland

No description provided.

Copilot AI review requested due to automatic review settings February 24, 2026 16:59
@CLAassistant

CLAassistant commented Feb 24, 2026

CLA assistant check
All committers have signed the CLA.

char* psch_clickhouse_database = nullptr;
bool psch_clickhouse_use_tls = false;
bool psch_clickhouse_skip_tls_verify = false;
extern char* psch_otel_endpoint = nullptr;
Collaborator

why extern?

Copilot AI left a comment

Pull request overview

This PR introduces OpenTelemetry (OTel) exporter support alongside the existing ClickHouse exporter, refactoring the export architecture to use a common interface. However, the PR appears to be work-in-progress based on the title "Stash commit" and contains several critical bugs that would prevent it from compiling or functioning correctly.

Changes:

  • Introduces a StatsExporter interface to abstract export backends
  • Adds OpenTelemetry exporter implementation with metrics and logging support
  • Refactors ClickHouse exporter to implement the new interface
  • Adds GUC configuration variables for OTel endpoint and hostname selection

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 24 comments.

Show a summary per file
File Description
src/worker/bgworker.cc Updates include paths from clickhouse_exporter.h to stats_exporter.h and updates comment references
src/export/stats_exporter.h New header defining C API for stats exporters
src/export/stats_exporter.cc New implementation coordinating between ClickHouse and OTel exporters based on configuration
src/export/otel_exporter.h New header declaring OpenTelemetry exporter factory function
src/export/otel_exporter.cc New OpenTelemetry exporter implementation with metrics histograms, counters, and log records
src/export/exporter_interface.h New abstract interface defining the exporter contract
src/export/clickhouse_exporter.h Refactored to declare factory function instead of C API
src/export/clickhouse_exporter.cc Refactored into class implementing StatsExporter interface
src/config/guc.h Adds declarations for use_otel, otel_endpoint, and hostname GUC variables
src/config/guc.cc Implements new GUC variables for OTel configuration
.gitignore Adds CMake build artifacts to ignore list


Comment on lines 76 to 259
void ExportEventStats(const std::vector<PschEvent>& events, StatsExporter *exporter) {
  elog(DEBUG1, "pg_stat_ch: BuildClickHouseBlock() called with %zu events", events.size());

  elog(DEBUG2, "pg_stat_ch: creating column objects");
  clickhouse::Block block;

  // Basic columns
  elog(DEBUG3, "pg_stat_ch: creating col_ts_start");
  auto col_ts_start = exporter->RecordDateTime("ts_start");
  elog(DEBUG3, "pg_stat_ch: col_ts_start created");
  auto col_duration_us = exporter->MetricUInt64("duration_us");
  // Use pre-resolved names from event (resolved at capture time in hooks)
  auto col_db = exporter->TagString("db");
  auto col_username = exporter->TagString("username");
  elog(DEBUG3, "pg_stat_ch: basic columns created");
  auto col_pid = exporter->RecordInt32("pid");
  auto col_query_id = exporter->RecordInt64("query_id");
  auto col_cmd_type = exporter->RecordString("cmd_type");
  auto col_rows = exporter->MetricUInt64("rows");
  auto col_query = exporter->RecordString("query");
  elog(DEBUG3, "pg_stat_ch: all basic columns created");

  // Buffer usage columns
  auto col_shared_blks_hit = exporter->MetricInt64("shared_blks_hit");
  auto col_shared_blks_read = exporter->MetricInt64("shared_blks_read");
  auto col_shared_blks_dirtied = exporter->MetricInt64("shared_blks_dirtied");
  auto col_shared_blks_written = exporter->MetricInt64("shared_blks_written");
  auto col_local_blks_hit = exporter->MetricInt64("local_blks_hit");
  auto col_local_blks_read = exporter->MetricInt64("local_blks_read");
  auto col_local_blks_dirtied = exporter->MetricInt64("local_blks_dirtied");
  auto col_local_blks_written = exporter->MetricInt64("local_blks_written");
  auto col_temp_blks_read = exporter->MetricInt64("temp_blks_read");
  auto col_temp_blks_written = exporter->MetricInt64("temp_blks_written");

  // I/O timing columns
  auto col_shared_blk_read_time_us = exporter->MetricInt64("shared_blk_read_time_us");
  auto col_shared_blk_write_time_us = exporter->MetricInt64("shared_blk_write_time_us");
  auto col_local_blk_read_time_us = exporter->MetricInt64("local_blk_read_time_us");
  auto col_local_blk_write_time_us = exporter->MetricInt64("local_blk_write_time_us");
  auto col_temp_blk_read_time_us = exporter->MetricInt64("temp_blk_read_time_us");
  auto col_temp_blk_write_time_us = exporter->MetricInt64("temp_blk_write_time_us");

  // WAL usage columns
  auto col_wal_records = exporter->MetricInt64("wal_records");
  auto col_wal_fpi = exporter->MetricInt64("wal_fpi");
  auto col_wal_bytes = exporter->MetricUInt64("wal_bytes");

  // CPU time columns
  auto col_cpu_user_time_us = exporter->MetricInt64("cpu_user_time_us");
  auto col_cpu_sys_time_us = exporter->MetricInt64("cpu_sys_time_us");

  // JIT columns
  auto col_jit_functions = exporter->MetricInt32("jit_functions");
  auto col_jit_generation_time_us = exporter->MetricInt32("jit_generation_time_us");
  auto col_jit_deform_time_us = exporter->MetricInt32("jit_deform_time_us");
  auto col_jit_inlining_time_us = exporter->MetricInt32("jit_inlining_time_us");
  auto col_jit_optimization_time_us = exporter->MetricInt32("jit_optimization_time_us");
  auto col_jit_emission_time_us = exporter->MetricInt32("jit_emission_time_us");

  // Parallel worker columns
  auto col_parallel_workers_planned = exporter->MetricInt16("parallel_workers_planned");
  auto col_parallel_workers_launched = exporter->MetricInt16("parallel_workers_launched");

  elog(DEBUG3, "pg_stat_ch: creating error columns");
  // Error columns
  auto col_err_sqlstate = exporter->MetricFixedString(5, "err_sqlstate");
  auto col_err_elevel = exporter->MetricUInt8("err_elevel");
  auto col_err_message = exporter->RecordString("err_message");
  elog(DEBUG3, "pg_stat_ch: error columns created");

  // Client context columns; records rather than tags (no histogram in OTel)
  auto col_app = exporter->RecordString("app");
  auto col_client_addr = exporter->RecordString("client_addr");

  elog(DEBUG2, "pg_stat_ch: all columns created, starting event loop");
  size_t event_idx = 0;
  for (const auto& ev : events) {
    elog(DEBUG2, "pg_stat_ch: processing event %zu: pid=%d, query_len=%u", event_idx, ev.pid,
         ev.query_len);

    int64_t unix_us = ev.ts_start + kPostgresEpochOffsetUs;
    col_ts_start->Append(unix_us);
    col_duration_us->Append(ev.duration_us);

    // Use pre-resolved names from event (resolved at capture time in hooks)
    col_db->Append(std::string(ev.datname, ev.datname_len));
    col_username->Append(std::string(ev.username, ev.username_len));

    col_pid->Append(ev.pid);
    col_query_id->Append(static_cast<int64_t>(ev.queryid));
    col_cmd_type->Append(CmdTypeToString(ev.cmd_type));
    col_rows->Append(ev.rows);

    // Validate query_len before using it
    uint16 safe_query_len = ev.query_len;
    if (safe_query_len > PSCH_MAX_QUERY_LEN) {
      elog(WARNING, "pg_stat_ch: event %zu has invalid query_len %u, clamping", event_idx,
           safe_query_len);
      safe_query_len = PSCH_MAX_QUERY_LEN;
    }
    col_query->Append(std::string(ev.query, safe_query_len));

    elog(DEBUG3, "pg_stat_ch: event %zu - buffer usage", event_idx);
    // Buffer usage
    col_shared_blks_hit->Append(ev.shared_blks_hit);
    col_shared_blks_read->Append(ev.shared_blks_read);
    col_shared_blks_dirtied->Append(ev.shared_blks_dirtied);
    col_shared_blks_written->Append(ev.shared_blks_written);
    col_local_blks_hit->Append(ev.local_blks_hit);
    col_local_blks_read->Append(ev.local_blks_read);
    col_local_blks_dirtied->Append(ev.local_blks_dirtied);
    col_local_blks_written->Append(ev.local_blks_written);
    col_temp_blks_read->Append(ev.temp_blks_read);
    col_temp_blks_written->Append(ev.temp_blks_written);

    elog(DEBUG3, "pg_stat_ch: event %zu - I/O timing", event_idx);
    // I/O timing
    col_shared_blk_read_time_us->Append(ev.shared_blk_read_time_us);
    col_shared_blk_write_time_us->Append(ev.shared_blk_write_time_us);
    col_local_blk_read_time_us->Append(ev.local_blk_read_time_us);
    col_local_blk_write_time_us->Append(ev.local_blk_write_time_us);
    col_temp_blk_read_time_us->Append(ev.temp_blk_read_time_us);
    col_temp_blk_write_time_us->Append(ev.temp_blk_write_time_us);

    elog(DEBUG3, "pg_stat_ch: event %zu - WAL usage", event_idx);
    // WAL usage
    col_wal_records->Append(ev.wal_records);
    col_wal_fpi->Append(ev.wal_fpi);
    col_wal_bytes->Append(ev.wal_bytes);

    elog(DEBUG3, "pg_stat_ch: event %zu - CPU time", event_idx);
    // CPU time
    col_cpu_user_time_us->Append(ev.cpu_user_time_us);
    col_cpu_sys_time_us->Append(ev.cpu_sys_time_us);

    elog(DEBUG3, "pg_stat_ch: event %zu - JIT", event_idx);
    // JIT
    col_jit_functions->Append(ev.jit_functions);
    col_jit_generation_time_us->Append(ev.jit_generation_time_us);
    col_jit_deform_time_us->Append(ev.jit_deform_time_us);
    col_jit_inlining_time_us->Append(ev.jit_inlining_time_us);
    col_jit_optimization_time_us->Append(ev.jit_optimization_time_us);
    col_jit_emission_time_us->Append(ev.jit_emission_time_us);

    elog(DEBUG3, "pg_stat_ch: event %zu - parallel workers", event_idx);
    // Parallel workers
    col_parallel_workers_planned->Append(ev.parallel_workers_planned);
    col_parallel_workers_launched->Append(ev.parallel_workers_launched);

    elog(DEBUG3, "pg_stat_ch: event %zu - error info", event_idx);
    // Error info (5-char SQLSTATE, trimmed)
    col_err_sqlstate->Append(std::string_view(ev.err_sqlstate, 5));
    col_err_elevel->Append(ev.err_elevel);
    // Error message (validate length)
    uint16 safe_err_msg_len = ev.err_message_len;
    if (safe_err_msg_len > PSCH_MAX_ERR_MSG_LEN) {
      elog(WARNING, "pg_stat_ch: event %zu has invalid err_message_len %u, clamping", event_idx,
           safe_err_msg_len);
      safe_err_msg_len = PSCH_MAX_ERR_MSG_LEN;
    }
    col_err_message->Append(std::string(ev.err_message, safe_err_msg_len));

    elog(DEBUG3, "pg_stat_ch: event %zu - client context (app_len=%u, addr_len=%u)", event_idx,
         ev.application_name_len, ev.client_addr_len);
    // Client context - validate lengths
    uint8 safe_app_len = ev.application_name_len;
    if (safe_app_len > 63) {
      elog(WARNING, "pg_stat_ch: event %zu has invalid app_name_len %u, clamping", event_idx,
           safe_app_len);
      safe_app_len = 63;
    }
    uint8 safe_addr_len = ev.client_addr_len;
    if (safe_addr_len > 45) {
      elog(WARNING, "pg_stat_ch: event %zu has invalid client_addr_len %u, clamping", event_idx,
           safe_addr_len);
      safe_addr_len = 45;
    }
    col_app->Append(std::string(ev.application_name, safe_app_len));
    col_client_addr->Append(std::string(ev.client_addr, safe_addr_len));

    event_idx++;
  }
  elog(DEBUG1, "pg_stat_ch: finished processing %zu events", event_idx);
}
Copilot AI Feb 24, 2026

Missing BeginBatch and BeginRow calls in export flow. The ExportEventStats function creates columns but never calls exporter->BeginBatch() before creating columns or exporter->BeginRow() for each event. The OTelExporter implementation expects BeginRow to be called to initialize row state (line 42-47 in otel_exporter.cc). Without these calls, the OTel exporter will not function correctly.

if (!exporter->IsConnected()) {
  elog(DEBUG1, "pg_stat_ch: client is null, initializing");
  if (!exporter->EstablishNewConnection()) {
    PschRecordExportFailure("Failed to connect to ClickHouse");
Copilot AI Feb 24, 2026

Misleading error message. The error message says "Failed to connect to ClickHouse" but this code now supports both ClickHouse and OpenTelemetry exporters. The message should be generic, such as "Failed to connect to exporter backend" or dynamically include the backend type.


void PschExporterShutdown(void) {
g_exporter.exporter.reset();
elog(LOG, "pg_stat_ch: ClickHouse exporter shutdown");
Copilot AI Feb 24, 2026

Misleading log message. The shutdown message says "ClickHouse exporter shutdown" but the exporter could be either ClickHouse or OpenTelemetry. The message should be generic, such as "Statistics exporter shutdown".

Suggested change
elog(LOG, "pg_stat_ch: ClickHouse exporter shutdown");
elog(LOG, "pg_stat_ch: statistics exporter shutdown");

// exporter->CommitBatch(); // Inserts or collects stats
// }

#endif // PG_STAT_CH_SRC_EXPORT_BASIC_EXPORTER_H_
Copilot AI Feb 24, 2026

Incorrect header guard at end. The closing header guard comment references BASIC_EXPORTER_H_ but should match the opening guard. It should be PG_STAT_CH_SRC_EXPORT_EXPORTER_INTERFACE_H_ to be consistent.


virtual ~StatsExporter() = 0;
};

Copilot AI Feb 24, 2026

Missing implementation for pure virtual destructors. The interface declares pure virtual destructors for BasicColumn (line 17), Column (line 22), and StatsExporter (line 53), but these destructors must have an implementation even though they're pure virtual. Without implementations, the code will fail to link. These should be defined in a source file, typically as StatsExporter::~StatsExporter() = default;

Suggested change
inline StatsExporter::BasicColumn::~BasicColumn() = default;
template<typename T>
inline StatsExporter::Column<T>::~Column() = default;
inline StatsExporter::~StatsExporter() = default;

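
The linker requirement the comment describes can be shown in a few lines. This is a minimal sketch with hypothetical names (`ExporterBase`, `NoopExporter`), not the PR's actual classes: a pure virtual destructor keeps the class abstract, yet the derived destructor still implicitly calls it, so it must have a definition somewhere.

```cpp
#include <cassert>
#include <memory>

struct ExporterBase {
  virtual ~ExporterBase() = 0;  // pure: class stays abstract
};
ExporterBase::~ExporterBase() = default;  // definition is still required to link

struct NoopExporter : ExporterBase {};

// Deleting through the base pointer links only because the base
// destructor has a body.
bool DestroyViaBase() {
  std::unique_ptr<ExporterBase> e = std::make_unique<NoopExporter>();
  e.reset();
  return true;
}
```

Removing the `= default` definition makes `DestroyViaBase()` fail at link time, which is the failure mode the review comment predicts.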
Comment on lines 75 to 77
// Build and export stats (records, metrics, ClickHouse rows) from events
void ExportEventStats(const std::vector<PschEvent>& events, StatsExporter *exporter) {
elog(DEBUG1, "pg_stat_ch: BuildClickHouseBlock() called with %zu events", events.size());
Copilot AI Feb 24, 2026

Misleading debug log message and comment. The log message says "BuildClickHouseBlock() called" and the comment on line 75 mentions "ClickHouse rows", but this function now works with both ClickHouse and OpenTelemetry exporters. The message and comment should be updated to be generic, such as "ExportEventStats() called" and "Build and export stats (records, metrics, backend rows) from events".

Suggested change
// Build and export stats (records, metrics, ClickHouse rows) from events
void ExportEventStats(const std::vector<PschEvent>& events, StatsExporter *exporter) {
elog(DEBUG1, "pg_stat_ch: BuildClickHouseBlock() called with %zu events", events.size());
// Build and export stats (records, metrics, backend rows) from events
void ExportEventStats(const std::vector<PschEvent>& events, StatsExporter *exporter) {
elog(DEBUG1, "pg_stat_ch: ExportEventStats() called with %zu events", events.size());

Comment on lines +144 to +149
if (val < 0) {
  LogNegativeValue(name, static_cast<int64_t>(val));
  stash_val = 0;
} else {
  stash_val = static_cast<uint64_t>(val);
}
Copilot AI Feb 24, 2026

Potential issue with signed/unsigned comparison. The code compares val < 0 where val is of template type T. This works for signed integers, but if T is an unsigned type (like uint8_t or uint64_t), the comparison will always be false and the cast to uint64_t at line 148 will be redundant. Consider using std::is_signed<T>::value to conditionally handle negative values only for signed types, or ensure this template is only instantiated for signed types.

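
One way to address this is `if constexpr` on the signedness of `T`, so the negative-value branch is compiled only for signed instantiations. A sketch under assumed names (`ClampToUnsigned` is illustrative, not from the PR):

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Clamp negative values to zero only when T is signed, so unsigned
// instantiations never contain a tautological `val < 0` comparison.
template <typename T>
uint64_t ClampToUnsigned(T val) {
  static_assert(std::is_integral_v<T>, "integral types only");
  if constexpr (std::is_signed_v<T>) {
    if (val < 0) return 0;  // the real exporter would also log here
  }
  return static_cast<uint64_t>(val);
}
```

With `if constexpr`, compilers also stop warning about signed/unsigned comparisons in the unsigned instantiations, since that branch is discarded entirely.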
Comment on lines 117 to 119
exp->current_row_tags[name] = to_string(val);
// Also put it in the log record so it's searchable!
exp->current_log_record->SetAttribute(name, val);
Copilot AI Feb 24, 2026

Potential dangling string_view in map. At line 117, when T is string or an integer type that gets converted via to_string(), the value is safely stored as a std::string in the map. However, the SetAttribute call at line 119 passes val directly. If T is string_view, this could lead to a dangling reference if the string_view's underlying data is temporary. Consider always converting to string before storing or using.

Suggested change
exp->current_row_tags[name] = to_string(val);
// Also put it in the log record so it's searchable!
exp->current_log_record->SetAttribute(name, val);
std::string &stored_value = exp->current_row_tags[name];
stored_value = to_string(val);
// Also put it in the log record so it's searchable!
exp->current_log_record->SetAttribute(name, stored_value);

"pg_stat_ch.hostname",
"The hostname of the current machine.",
nullptr,
&psch_otel_endpoint,
Copilot AI Feb 24, 2026

Incorrect variable used in GUC definition. The definition for pg_stat_ch.hostname is assigning to &psch_otel_endpoint instead of &psch_hostname. This will cause the hostname configuration to overwrite the otel_endpoint configuration.

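
For reference, the intended wiring would look roughly like the sketch below: each GUC binds to its own variable. The boot values, contexts, and flags shown are assumptions, not taken from the PR; only the variable names come from the diff.

```cpp
/* Hedged sketch: each DefineCustomStringVariable must point at its own
 * backing variable. Boot values and GucContext are illustrative. */
DefineCustomStringVariable("pg_stat_ch.otel_endpoint",
                           "OTLP gRPC endpoint for the OTel exporter.",
                           NULL,
                           &psch_otel_endpoint,  /* endpoint GUC -> endpoint var */
                           "localhost:4317",
                           PGC_SIGHUP, 0,
                           NULL, NULL, NULL);

DefineCustomStringVariable("pg_stat_ch.hostname",
                           "The hostname of the current machine.",
                           NULL,
                           &psch_hostname,       /* hostname GUC -> hostname var,
                                                    not &psch_otel_endpoint */
                           NULL,
                           PGC_SIGHUP, 0,
                           NULL, NULL, NULL);
```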
- Add opentelemetry-cpp as submodule and wire into CMake build
  (built before clickhouse-cpp to avoid duplicate abseil targets)
- Add Findabsl.cmake shim so clickhouse-cpp reuses gRPC's abseil
- Add GUCs: pg_stat_ch.use_otel, otel_endpoint, hostname
- Fix OTel exporter: correct hostname GUC binding, string type fixes,
  rename factory to MakeOpenTelemetryExporter()
- Decouple stats_exporter from clickhouse-cpp (no direct #include)
- Update regression test expected output for new GUCs
@iskakaushik iskakaushik changed the title Stash commit WIP: integrate OpenTelemetry SDK for metrics/logs export Feb 24, 2026
@iskakaushik (Collaborator) left a comment

Review Comments

Thanks for the WIP! The exporter abstraction is a nice design — having a single ExportEventStats() that works with both backends is clean. Here are some things I noticed:


1. psch_use_otel initializer vs boot value mismatch

src/config/guc.cc

The variable initializer (bool psch_use_otel = true) doesn't match the DefineCustomBoolVariable boot value (false). The boot value wins at runtime, but the inconsistency is confusing. Let's pick one and make them match — false seems right since ClickHouse is the default backend.

Also: the two extern keywords on the psch_otel_endpoint and psch_hostname definitions (lines 23-24) should be removed — they're definitions, not declarations.


2. Align OTel attribute names with semantic conventions

src/export/stats_exporter.cc

Let's align the OTel attribute names with the OpenTelemetry Database Semantic Conventions. For example:

  • "db""db.name"
  • "username""db.user"
  • "query""db.statement"
  • "cmd_type""db.operation.name"
  • "duration_us""db.client.operation.duration" (consider converting to seconds as the convention expects)

For the ClickHouse backend these can stay as-is (they map to the events_raw table schema), so this mapping should happen in the OTel exporter's column implementations rather than in the shared ExportEventStats(). Maybe a name_override or a name-mapping hook in the exporter interface?
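
A name-mapping hook could be as small as a lookup table inside the OTel exporter, falling through to the internal name when no convention applies. This is a sketch of the idea, not the PR's code; the function name is hypothetical, and the mapped keys follow the OTel database semantic conventions cited above:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Remap internal column names to OTel semantic-convention attribute
// names; unknown names pass through unchanged, so the ClickHouse
// backend (which never calls this) keeps its events_raw schema.
const std::string& OTelAttributeName(const std::string& internal_name) {
  static const std::unordered_map<std::string, std::string> kMap = {
      {"db", "db.name"},
      {"username", "db.user"},
      {"query", "db.statement"},
      {"cmd_type", "db.operation.name"},
  };
  auto it = kMap.find(internal_name);
  return it != kMap.end() ? it->second : internal_name;
}
```

Keeping the table in the OTel exporter (rather than a `name_override` parameter threaded through `ExportEventStats()`) keeps the shared export loop backend-agnostic.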


3. Signed metrics silently clamped to 0 in OTel histograms

src/export/otel_exporter.cc, HistogramColumn::Append

Many of the fields routed through MetricInt64 (e.g., shared_blks_hit, cpu_user_time_us, WAL counters) are legitimately signed in PostgreSQL — they just happen to be non-negative in practice. Silently clamping to 0 and logging a warning is lossy.

A cleaner approach: split the interface into MetricUInt64 (for genuinely unsigned values like rows, wal_bytes, duration_us) and MetricInt64 (for signed counters). The OTel exporter can map signed metrics to int64 histograms (OTel supports signed histogram values as of the latest spec), or record them as gauges instead of histograms. This avoids the cast-and-clamp entirely.


4. Crunch() contract is underspecified

src/export/exporter_interface.h

The Crunch() contract is underspecified — in the ClickHouse exporter it's called once per batch (in CommitBatch), but in the OTel exporter it's called once per row (in EndRow). This means the same interface method has fundamentally different call timing depending on the backend.

Suggestion: make the contract explicit. Either:

  • Rename to something like FlushRow() / FinalizeBatch() with clear per-row vs per-batch semantics, or
  • Document in the interface that Crunch() is called by the framework at a backend-defined time and implementers must handle both patterns

As it stands, a new exporter implementer would have to study both existing backends to understand when their Crunch() gets called.
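
The renaming option could look like the sketch below. The names `FlushRow`/`FinalizeBatch` and the interface shape are suggestions, not the PR's actual API; a trivial counting implementation shows the intended call cadence (once per row vs. once per batch):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical revised interface: per-row and per-batch work get
// separate, explicitly named hooks so they can't be conflated.
class BatchedExporter {
 public:
  virtual ~BatchedExporter() = default;
  // Called once after each row's Append() calls; OTel would emit here.
  virtual void FlushRow() = 0;
  // Called once per batch after all rows; ClickHouse would insert here.
  virtual bool FinalizeBatch() = 0;
};

class CountingExporter : public BatchedExporter {
 public:
  void FlushRow() override { ++rows_; }
  bool FinalizeBatch() override { ++batches_; return true; }
  size_t rows() const { return rows_; }
  size_t batches() const { return batches_; }
 private:
  size_t rows_ = 0;
  size_t batches_ = 0;
};
```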


5. cmd_type should be a TagString

src/export/stats_exporter.cc

cmd_type should be a TagString rather than RecordString — it's a natural grouping dimension. You'd want to see duration_us histograms broken down by SELECT vs INSERT vs UPDATE, and in OTel that means it needs to be in the tag set.


6. Column lifecycle vs BeginBatch() ordering

src/export/clickhouse_exporter.cc

BeginBatch() calls columns.clear(), which drops the exporter's shared_ptr references to all columns. If BeginBatch() is called after column creation (e.g., between batches), the caller's shared_ptrs keep the objects alive but CommitBatch() won't iterate them since it loops over columns.

The interface comment shows the correct call order (BeginBatch → create columns → loop events → CommitBatch), but the contract that columns must be created after BeginBatch() should be documented explicitly, since getting it wrong silently produces empty inserts.


7. Dequeued events lost on export failure — add a comment

src/export/stats_exporter.cc, PschExportBatch

When CommitBatch() fails, the dequeued events are silently lost — same as the old code. This is fine to keep the same semantics for now, but worth noting with a comment so it's a conscious decision:

// NOTE: Events are dequeued before export. On CommitBatch failure,
// these events are lost. This matches the pre-refactor behavior.

8. OTel instrument lifecycle on collector restart

src/export/otel_exporter.cc

The OTel instruments (histogram_cache, counter_cache) and providers live for the bgworker process lifetime. If the OTel collector restarts or becomes temporarily unavailable:

  • Do the cached Histogram/Counter objects reconnect automatically via the gRPC channel?
  • Does ForceFlush return false on transient errors, or does it throw?

Consider adding a ResetConnection() path (similar to how the ClickHouse exporter does client.reset() on failure) that tears down and recreates the providers. Without this, a collector restart could leave the bgworker in a permanently broken state.
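
The shape of that escape hatch, backend-agnostically: drop the broken state on failure and lazily rebuild it on the next export cycle. In this sketch `Backend` is a stand-in for the OTel providers (the real objects come from opentelemetry-cpp); all names are illustrative:

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

struct Backend {           // stand-in for meter/logger providers
  bool healthy = true;
  bool Flush() { return healthy; }
};

class ReconnectingExporter {
 public:
  explicit ReconnectingExporter(std::function<std::unique_ptr<Backend>()> factory)
      : factory_(std::move(factory)) {}

  bool Export() {
    if (!backend_) backend_ = factory_();  // lazy (re)connect
    if (backend_ && backend_->Flush()) return true;
    ResetConnection();                     // drop broken state; retry next cycle
    return false;
  }

  void ResetConnection() { backend_.reset(); }

 private:
  std::function<std::unique_ptr<Backend>()> factory_;
  std::unique_ptr<Backend> backend_;
};
```

This mirrors the ClickHouse exporter's `client.reset()` pattern: a collector restart costs one failed batch, not a permanently wedged bgworker.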


9. Dual logs + metrics — cardinality and volume

src/export/otel_exporter.cc

Emitting both logs and metrics for every event is a valid design — logs give you full-fidelity per-query records for debugging, metrics give you pre-aggregated dashboards. Two things to watch:

  1. Cardinality: Each unique tag combination creates a new time series. With db × username × cmd_type as tags, the series count scales with the product of unique values. Consider whether all current tags should remain tags or whether some should be log-only attributes.
  2. Volume: At high QPS, emitting ~30 histogram observations + a full log record per query could overwhelm the collector. A pg_stat_ch.otel_enable_logs GUC to independently toggle the log pipeline would be a nice escape hatch.

10. OTel path has no test coverage

The OTel exporter path currently has no test coverage — all existing tests exercise the ClickHouse backend (psch_use_otel defaults to false). We should add OTel-specific tests before merging.

Suggested strategy:

  • Unit/integration tests with an in-process mock: Use OpenTelemetry's InMemoryMetricExporter and InMemoryLogRecordExporter as backends instead of gRPC. This lets you assert on exact metric values and log attributes without any external dependencies.
  • TAP test with a real collector: Add a Docker-based test (similar to how we test with ClickHouse) that spins up an OTel Collector, runs queries, and verifies data arrives. This can be optional / CI-only.
  • Smoke test: At minimum, a TAP test that starts PG with pg_stat_ch.use_otel=on (pointing at a non-existent endpoint) and verifies the bgworker handles connection failures gracefully without crashing.

iskakaushik and others added 2 commits February 24, 2026 11:55
When system gRPC is installed, OTel's grpc.cmake finds it via
find_package(gRPC CONFIG QUIET) and skips FetchContent. This means
abseil is never built as a module, so the absl::int128 target
doesn't exist when clickhouse-cpp needs it.

Set CMAKE_DISABLE_FIND_PACKAGE_gRPC=TRUE so OTel always uses
FetchContent for gRPC + abseil, guaranteeing the targets exist.
Copilot AI review requested due to automatic review settings February 24, 2026 18:43
Copilot AI left a comment

Pull request overview

Copilot reviewed 16 out of 17 changed files in this pull request and generated 9 comments.



Comment on lines 54 to 55
virtual ~StatsExporter() = 0;
};
Copilot AI Feb 24, 2026

A pure virtual destructor must still have a definition, otherwise this will cause a link error when deleting through a StatsExporter*. Add an out-of-line definition in a .cc file or define it inline (e.g., inline StatsExporter::~StatsExporter() = default;).

Comment on lines +309 to +315
if (psch_shared_state != nullptr) {
  pg_atomic_fetch_add_u64(&psch_shared_state->exported, exporter->NumExported());
}
PschRecordExportSuccess();
}

return exporter->NumExported();
Copilot AI Feb 24, 2026

When CommitBatch() fails, PschExportBatch() still returns NumExported(), which can incorrectly report successful exports and interfere with retry/backoff behavior. Return 0 (or the batch count only on success), and also ensure failures are recorded consistently (e.g., call PschRecordExportFailure(...) on commit failure for exporters that don’t do it internally).

Suggested change
if (psch_shared_state != nullptr) {
pg_atomic_fetch_add_u64(&psch_shared_state->exported, exporter->NumExported());
}
PschRecordExportSuccess();
}
return exporter->NumExported();
int num_exported = exporter->NumExported();
if (psch_shared_state != nullptr) {
pg_atomic_fetch_add_u64(&psch_shared_state->exported, num_exported);
}
PschRecordExportSuccess();
return num_exported;
} else {
PschRecordExportFailure("Failed to commit exporter batch");
return 0;
}

Comment on lines 38 to 39
columns.clear();
if (row_active) EndRow(); // Just in case
Copilot AI Feb 24, 2026

BeginBatch() clears columns before calling EndRow(), but EndRow() iterates columns to emit metrics. If a row is active, this would silently drop metric emission for that row. Call EndRow() before clearing state.

Suggested change
columns.clear();
if (row_active) EndRow(); // Just in case
if (row_active) EndRow(); // Just in case
columns.clear();

Comment on lines +101 to +104
logger->EmitLogRecord(std::move(current_log_record));
row_active = false;
++exported_count;
}
Copilot AI Feb 24, 2026

exported_count appears to be cumulative across batches (it is never reset in BeginBatch()), which will over-report per-batch exports (and cause PschExportBatch() to return increasing counts). Reset exported_count at the start of each batch (and ensure the semantics match ClickHouse’s per-batch behavior).

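
The per-batch semantics the comment asks for amount to resetting the counter at batch start. A stand-in sketch (the class here is illustrative; only the `exported_count` idea comes from the PR):

```cpp
#include <cassert>
#include <cstddef>

// NumExported() should report rows for the current batch only, so the
// counter resets in BeginBatch() rather than accumulating for the
// lifetime of the exporter.
class BatchCounter {
 public:
  void BeginBatch() { exported_count_ = 0; }  // per-batch reset
  void EndRow() { ++exported_count_; }
  size_t NumExported() const { return exported_count_; }
 private:
  size_t exported_count_ = 0;
};
```
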
Comment on lines 165 to +167
// Note on blocking: If we're blocked in ClickHouse network I/O when a barrier
// signal arrives, we can't process it until the I/O completes. The socket
// timeouts configured in clickhouse_exporter.cc (30 seconds) bound this delay.
// timeouts configured in stats_exporter.cc (30 seconds) bound this delay.
Copilot AI Feb 24, 2026

The comment still refers specifically to 'ClickHouse network I/O', but the exporter can now be OpenTelemetry as well. Consider rewording this to 'exporter network I/O' (or mention both backends) so the comment matches the new behavior.

JoshDreamland and others added 2 commits February 24, 2026 14:24
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings February 24, 2026 19:31
Copilot AI left a comment

Pull request overview

Copilot reviewed 16 out of 17 changed files in this pull request and generated 2 comments.



try {
const std::string hostname = GetAHostname("postgres-primary");
const std::string endpoint = def(psch_otel_endpoint, "localhost:4317");
const std::string pgch_version = PG_STAT_CH_VERSION;
Copilot AI Feb 24, 2026

The variable name pgch_version uses an abbreviation that doesn't match the project naming convention. Consider renaming to pg_stat_ch_version for consistency with other identifiers in the codebase.


// Configure Reader (Manual Flush Mode)
metrics_sdk::PeriodicExportingMetricReaderOptions reader_opts;
reader_opts.export_interval_millis = std::chrono::milliseconds::max();
Copilot AI Feb 24, 2026

Setting export_interval_millis to max() creates a very large delay that may be surprising. Consider using a named constant like kManualFlushMode with a comment explaining that this effectively disables periodic exports in favor of manual flushing.

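
The named-constant suggestion is straightforward; the constant itself is plain `<chrono>`. The reader-options usage shown in comments is assumed from the PR's snippet, not verified against a particular opentelemetry-cpp version:

```cpp
#include <cassert>
#include <chrono>

// Sentinel interval that effectively disables the periodic export
// timer; we flush manually via ForceFlush() at the end of each batch.
constexpr std::chrono::milliseconds kManualFlushInterval =
    std::chrono::milliseconds::max();

// Assumed usage at the call site, per the PR:
//   metrics_sdk::PeriodicExportingMetricReaderOptions reader_opts;
//   reader_opts.export_interval_millis = kManualFlushInterval;
```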
Add integration tests that verify the full OTel export pipeline:
hooks → queue → bgworker → OTLP gRPC → OTel Collector → JSONL file.

- 024_otel_export.pl: 6 subtests (basic export, batch sizing, flush,
  field validation, stats accuracy, connection failure handling)
- 025_otel_reconnect.pl: 4 subtests (initial connection, failure
  recovery, failure tracking, error tracking)
- OTel helpers in psch.pm using content-based matching with unique
  query markers for robust cross-subtest isolation
- Docker Compose config for OTel Collector with file exporter
- CI integration in ci-tap.yml
- `otel` test type in run-tests.sh
4 participants