Commit 79217cd

BP602, claude, and ftk789 authored
feat(telemetry): comprehensive Grafana dashboard improvements and accurate metrics (#44)
* feat(telemetry): Integrate OpenTelemetry for comprehensive observability

  This commit introduces a full-featured OpenTelemetry implementation to provide deep, actionable insights into the application's performance, reliability, and resource usage. It establishes a robust foundation for monitoring, debugging, and future performance tuning.

  Core Implementation:
  - End-to-End Tracing: Implemented distributed tracing for critical user flows, including WebSocket connections, message sending, and API requests, providing a clear view of how operations flow through the system.
  - Application Metrics: Introduced key metrics for monitoring chat activity, such as active connections, message send/receive rates, and API performance.
  - Structured Telemetry Module: Created a new src/telemetry module to encapsulate all OpenTelemetry logic, including instrumentation, custom metrics, and tracing helpers.
  - Main & Renderer Process Integration: Correctly initialized the OpenTelemetry SDK in the main process and securely exposed telemetry functions to the renderer process via the preload script.
  - Dependencies: Added all necessary @opentelemetry packages to package.json.

  Debugging Enhancements:
  To address existing stability issues, this implementation also includes targeted metrics for diagnosing memory leaks and process shutdown problems:
  - Garbage Collection Monitoring: Tracks the duration and frequency of GC events to identify memory pressure.
  - Renderer Memory & DOM Metrics: Periodically captures renderer memory usage and the total DOM node count to pinpoint UI-related leaks.
  - Process Shutdown Diagnostics: Added tracking for all open windows to ensure they are properly closed on application exit.

  This observability stack will be instrumental in improving the stability and performance of the KickTalk application.

* Fix cross-platform icon compatibility

  - Add cross-platform icon handling for Linux/macOS support
  - Convert Windows .ico to .png for non-Windows platforms
  - Update tray and thumbar icons to use platform-appropriate format

  Fixes application crashes on Linux due to unsupported .ico format

  🤖 Generated with [Claude Code](https://claude.ai/code)

  Co-Authored-By: Claude <noreply@anthropic.com>

* feat(telemetry): overhaul observability with manual instrumentation

  This commit completely refactors the OpenTelemetry implementation to improve stability and significantly expand observability capabilities. The previous automatic instrumentation using NodeSDK is replaced with a manual approach, which is more reliable within the Electron environment.

  Key changes include:
  - Replaced automatic instrumentation with a manual OpenTelemetry setup for tracing and metrics, preventing context-related issues in Electron.
  - Introduced a Prometheus scrape endpoint to expose detailed application metrics for direct collection.
  - Added a user setting in the General tab, allowing users to enable or disable telemetry collection.
  - Massively expanded the Grafana dashboard with new panels for CPU, memory, API latency, connection health, and message throughput.
  - Implemented robust error handling and fallbacks for the telemetry module to prevent application crashes.
  - Updated the development OTEL stack script to support Podman, increasing developer flexibility.
  - Added new IPC-based telemetry events for granular tracking of WebSocket connections, API calls, and errors from the renderer.

* feat(testing): add WebSocket connection testing infrastructure

  Add comprehensive scripts for testing WebSocket connection resilience and recovery:
  • scripts/block-websockets-firewalld.sh - nftables-based IPv4/IPv6 blocking with real AWS/Cloudflare ranges
  • scripts/block-websockets-hosts.sh - DNS-level blocking via /etc/hosts manipulation

  These scripts enable testing of WebSocket reconnection behavior during network failures, validating the built-in auto-reconnect mechanisms in KickPusher and 7TV services.

  Created while investigating issue #41.

  🤖 Generated with [Claude Code](https://claude.ai/code)

  Co-Authored-By: Claude <noreply@anthropic.com>

* feat(telemetry): comprehensive Grafana dashboard improvements

  - Fix DOM node count metrics showing 0 by implementing proper OpenTelemetry integration
  - Add 6 new monitoring panels: GC Performance, DOM Node Count, Error Rates, Memory Efficiency, Handle Efficiency, Message Success Rate
  - Implement streamer name support across all metrics for cleaner legend displays
  - Convert Message Throughput and WebSocket Connections panels to stacked charts
  - Fix WebSocket connection counter using ObservableGauge instead of UpDownCounter for accurate real-time counts
  - Install and integrate @opentelemetry/exporter-prometheus for proper metrics export
  - Remove debug logging for cleaner console output
  - Enhance message and WebSocket telemetry with streamer name attributes

  Dashboard now provides comprehensive application monitoring with accurate metrics and clean visualizations.

  🤖 Generated with [Claude Code](https://claude.ai/code)

  Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: ftk789 <57598907+ftk789@users.noreply.github.com>
1 parent 806403a commit 79217cd

24 files changed

Lines changed: 3229 additions & 15 deletions

docker-compose.otel.yml

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
+services:
+  # OpenTelemetry Collector
+  otel-collector:
+    image: otel/opentelemetry-collector-contrib:latest
+    container_name: kicktalk-otel-collector
+    command: ["--config=/etc/otel-collector-config.yml"]
+    volumes:
+      - ./otel/collector-config.yml:/etc/otel-collector-config.yml:Z
+      - ./otel/logs:/var/log/otel:Z
+    user: "0:0" # Run as root to avoid permission issues
+    ports:
+      - "4317:4317" # OTLP gRPC receiver
+      - "4318:4318" # OTLP HTTP receiver
+      - "8888:8888" # Prometheus metrics
+      - "8889:8889" # Prometheus exporter metrics
+      - "13133:13133" # Health check endpoint
+    depends_on:
+      - jaeger
+      - prometheus
+
+  # Jaeger for distributed tracing
+  jaeger:
+    image: jaegertracing/all-in-one:latest
+    container_name: kicktalk-jaeger
+    ports:
+      - "16686:16686" # Jaeger UI
+      - "14250:14250" # Jaeger gRPC
+    environment:
+      - COLLECTOR_OTLP_ENABLED=true
+      - LOG_LEVEL=debug
+
+  # Prometheus for metrics storage
+  prometheus:
+    image: prom/prometheus:latest
+    container_name: kicktalk-prometheus
+    ports:
+      - "9090:9090"
+    volumes:
+      - ./otel/prometheus.yml:/etc/prometheus/prometheus.yml:Z
+      - prometheus_data:/prometheus
+    user: "nobody" # Use nobody user for security
+    command:
+      - '--config.file=/etc/prometheus/prometheus.yml'
+      - '--storage.tsdb.path=/prometheus'
+      - '--web.console.libraries=/etc/prometheus/console_libraries'
+      - '--web.console.templates=/etc/prometheus/consoles'
+      - '--storage.tsdb.retention.time=200h'
+      - '--web.enable-lifecycle'
+
+  # Grafana for visualization
+  grafana:
+    image: grafana/grafana:latest
+    container_name: kicktalk-grafana
+    ports:
+      - "3000:3000"
+    environment:
+      - GF_SECURITY_ADMIN_PASSWORD=admin
+    volumes:
+      - grafana_data:/var/lib/grafana
+      - ./otel/grafana/provisioning:/etc/grafana/provisioning:Z
+      - ./otel/grafana/dashboards:/var/lib/grafana/dashboards:Z
+
+  # Redis for caching telemetry data (optional)
+  redis:
+    image: redis:alpine
+    container_name: kicktalk-redis
+    ports:
+      - "6379:6379"
+    volumes:
+      - redis_data:/data
+
+volumes:
+  prometheus_data:
+  grafana_data:
+  redis_data:
+
+networks:
+  default:
+    name: kicktalk-otel

electron.vite.config.mjs

Lines changed: 35 additions & 1 deletion
@@ -1,10 +1,44 @@
 import { resolve } from "path";
 import { defineConfig, externalizeDepsPlugin } from "electron-vite";
 import react from "@vitejs/plugin-react";
+import { copyFileSync, mkdirSync, existsSync } from "fs";
+import { join, dirname } from "path";
+
+// Custom plugin to copy telemetry files
+const copyTelemetryPlugin = () => ({
+  name: 'copy-telemetry',
+  writeBundle() {
+    const srcTelemetry = resolve('src/telemetry');
+    const outTelemetry = resolve('out/telemetry');
+
+    // Create telemetry directory in output
+    if (!existsSync(outTelemetry)) {
+      mkdirSync(outTelemetry, { recursive: true });
+    }
+
+    // Copy telemetry files
+    const files = ['index.js', 'metrics.js', 'tracing.js', 'instrumentation.js', 'prometheus-server.js'];
+    files.forEach(file => {
+      const src = join(srcTelemetry, file);
+      const dest = join(outTelemetry, file);
+      if (existsSync(src)) {
+        try {
+          copyFileSync(src, dest);
+          console.log(`[Telemetry]: Copied ${file} to build output`);
+        } catch (error) {
+          console.warn(`[Telemetry]: Failed to copy ${file}:`, error.message);
+        }
+      }
+    });
+  }
+});
 
 export default defineConfig({
   main: {
-    plugins: [externalizeDepsPlugin({ exclude: ["electron-store", "electron-util"] })],
+    plugins: [
+      externalizeDepsPlugin({ exclude: ["electron-store", "electron-util"] }),
+      copyTelemetryPlugin()
+    ],
   },
   preload: {
     plugins: [externalizeDepsPlugin({ exclude: ["electron-store", "electron-util"] })],

otel/.gitignore

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+# OTEL logs and temporary data
+logs/
+*.log
+*.log.*
+
+# Grafana runtime data (keep provisioning configs)
+grafana/data/
+grafana/runtime/
+
+# Prometheus data
+prometheus/data/
+
+# Temporary collector files
+collector-temp/
+*.tmp
+
+# Docker volumes (if using local bind mounts)
+data/
+storage/

otel/README.md

Lines changed: 159 additions & 0 deletions
@@ -0,0 +1,159 @@
+# KickTalk OpenTelemetry Setup
+
+This directory contains the OpenTelemetry observability stack for KickTalk application monitoring, including distributed tracing, metrics collection, and log aggregation.
+
+## Architecture
+
+- **OpenTelemetry Collector**: Receives, processes, and exports telemetry data
+- **Jaeger**: Distributed tracing backend and UI
+- **Prometheus**: Metrics storage and querying
+- **Grafana**: Visualization and dashboards
+- **Redis**: Optional caching for telemetry data
+
+## Quick Start
+
+1. **Start the observability stack:**
+   ```bash
+   docker-compose -f docker-compose.otel.yml up -d
+   ```
+
+2. **Access the services:**
+   - **Grafana Dashboard**: http://localhost:3000 (admin/admin)
+   - **Jaeger UI**: http://localhost:16686
+   - **Prometheus**: http://localhost:9090
+   - **OTEL Collector Health**: http://localhost:13133
+
+3. **Configure KickTalk** to send telemetry to:
+   - **OTLP gRPC**: `http://localhost:4317`
+   - **OTLP HTTP**: `http://localhost:4318`
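
Editor's sketch (not part of the commit): after starting the stack, the endpoints listed in the Quick Start can be probed from Node 18+ before pointing KickTalk at them. The ports come from the README; the `/-/ready` and `/api/health` paths are the standard Prometheus and Grafana health routes, and the names `STACK_ENDPOINTS`, `probe`, and `checkStack` are hypothetical.

```javascript
// Sketch only: ports are taken from the README above; helper names are made up.
const STACK_ENDPOINTS = [
  { name: "otel-collector", url: "http://localhost:13133/" },
  { name: "prometheus", url: "http://localhost:9090/-/ready" },
  { name: "grafana", url: "http://localhost:3000/api/health" },
  { name: "jaeger-ui", url: "http://localhost:16686/" },
];

// Probe one endpoint; resolves to { name, ok, detail } and never throws.
async function probe({ name, url }, timeoutMs = 2000) {
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    return { name, ok: res.ok, detail: `HTTP ${res.status}` };
  } catch (err) {
    return { name, ok: false, detail: err.message };
  }
}

// Probe every service in parallel and return the results in list order.
async function checkStack(endpoints = STACK_ENDPOINTS) {
  return Promise.all(endpoints.map((endpoint) => probe(endpoint)));
}

// Usage (not run automatically):
// checkStack().then((results) => console.table(results));
```

A service that is still starting simply reports `ok: false`, so the check can be repeated until every row is healthy.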
+
+## Configuration
+
+### OTEL Collector (`collector-config.yml`)
+
+The collector is configured to:
+- **Receive** telemetry via OTLP (gRPC/HTTP)
+- **Process** data with batching, memory limiting, and attribute filtering
+- **Export** traces to Jaeger, metrics to Prometheus, and logs to files
+
+Key features:
+- **Privacy-focused**: Automatically filters sensitive data (tokens, auth info)
+- **Resource attribution**: Adds service.name, version, environment tags
+- **Performance optimized**: Batching and memory limits configured
+
+### Prometheus (`prometheus.yml`)
+
+Scrapes metrics from:
+- OTEL Collector internal metrics
+- KickTalk application metrics (port 9464)
+- Jaeger metrics for tracing health
+
+### Grafana Dashboards
+
+Pre-configured dashboards for:
+- **KickTalk Overview**: Application health, connections, message throughput
+- **Memory & Performance**: Resource usage, API response times
+- **Connection Health**: WebSocket stability, reconnection rates
+
+## Application Integration
+
+To integrate KickTalk with this observability stack, the application needs to:
+
+1. **Install OTEL SDK** packages for Node.js/Electron
+2. **Configure exporters** to send data to `localhost:4317`
+3. **Implement metrics** for key application events
+4. **Add tracing** to critical code paths
+
+## Metrics to Implement
+
+### Connection Metrics
+- `kicktalk_websocket_connections_active` - Active WebSocket connections
+- `kicktalk_websocket_reconnections_total` - Connection reconnection events
+- `kicktalk_connection_errors_total` - Connection failure events
+
+### Message Metrics
+- `kicktalk_messages_sent_total` - Messages sent by user
+- `kicktalk_messages_received_total` - Messages received from chat
+- `kicktalk_message_send_duration_seconds` - Message send latency
+
+### Resource Metrics
+- `kicktalk_memory_usage_bytes` - Application memory consumption
+- `kicktalk_cpu_usage_percent` - CPU utilization
+- `kicktalk_open_handles_total` - File/socket handles
+
+### API Metrics
+- `kicktalk_api_request_duration_seconds` - API response times
+- `kicktalk_api_requests_total` - API request counts by endpoint/status
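
Editor's sketch (not part of the commit): the metric names above are what Prometheus would see when it scrapes the application on port 9464. The commit message says the real endpoint uses `@opentelemetry/exporter-prometheus`; the stdlib-only version below is just an illustration of the text exposition format behind such an endpoint. The `streamer` label, the sample values, and all function names are assumptions.

```javascript
const http = require("node:http");

// Escape backslash, double quote, and newline as the text format requires.
function escapeLabel(value) {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Render metrics in the Prometheus text exposition format.
// Each metric: { name, help, type, samples: [{ labels, value }] }.
function renderPrometheusText(metrics) {
  const lines = [];
  for (const metric of metrics) {
    lines.push(`# HELP ${metric.name} ${metric.help}`);
    lines.push(`# TYPE ${metric.name} ${metric.type}`);
    for (const sample of metric.samples) {
      const labels = Object.entries(sample.labels || {})
        .map(([key, value]) => `${key}="${escapeLabel(String(value))}"`)
        .join(",");
      const series = labels ? `${metric.name}{${labels}}` : metric.name;
      lines.push(`${series} ${sample.value}`);
    }
  }
  return lines.join("\n") + "\n";
}

// Hypothetical in-memory values; a real app would read live state.
const exampleMetrics = [
  { name: "kicktalk_websocket_connections_active", help: "Active WebSocket connections",
    type: "gauge", samples: [{ labels: { streamer: "example" }, value: 3 }] },
  { name: "kicktalk_messages_sent_total", help: "Messages sent by user",
    type: "counter", samples: [{ labels: {}, value: 42 }] },
];

// Create (but do not start) a scrape server; call .listen(9464) to serve /metrics.
function createMetricsServer(getMetrics = () => exampleMetrics) {
  return http.createServer((req, res) => {
    if (req.url !== "/metrics") {
      res.writeHead(404);
      res.end();
      return;
    }
    res.writeHead(200, { "Content-Type": "text/plain; version=0.0.4" });
    res.end(renderPrometheusText(getMetrics()));
  });
}
```

Counters carry the `_total` suffix and gauges do not, which matches the naming in the lists above.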
+
+## Traces to Implement
+
+### User Actions
+- Message sending flow (input → validation → API → confirmation)
+- Chatroom joining/leaving
+- Settings changes
+
+### System Operations
+- WebSocket connection establishment
+- API calls (Kick, 7TV)
+- Emote loading and caching
+
+### Error Scenarios
+- Connection failures and recovery
+- API timeouts and retries
+- Memory leak detection points
+
+## Privacy & Security
+
+The collector configuration includes privacy protections:
+- **Automatic filtering** of authentication tokens
+- **Local-only operation** by default
+- **Configurable data retention** periods
+- **No PII collection** in standard metrics
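
Editor's sketch (not part of the commit): the filtering described above happens in the collector, whose `collector-config.yml` is not shown on this page. The same idea can also be applied in the application before attributes leave the process. The key pattern and the `redactAttributes` name below are assumptions for illustration, not the collector's actual rule set.

```javascript
// Attribute keys treated as sensitive. This pattern is an assumption,
// not the rule set used in collector-config.yml.
const SENSITIVE_KEY = /(token|authorization|cookie|password|secret)/i;

// Return a copy of the attributes with sensitive values replaced.
function redactAttributes(attributes) {
  const clean = {};
  for (const [key, value] of Object.entries(attributes)) {
    clean[key] = SENSITIVE_KEY.test(key) ? "[REDACTED]" : value;
  }
  return clean;
}

// Usage: pass span or event attributes through the filter before recording.
// redactAttributes({ session_token: "abc", "chatroom.id": 123 });
```

Redacting at both layers means a misconfigured collector does not become the only safeguard.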
+
+## Development Usage
+
+### View Real-time Metrics
+```bash
+# Watch collector logs
+docker-compose -f docker-compose.otel.yml logs -f otel-collector
+
+# Query Prometheus directly
+curl http://localhost:9090/api/v1/query?query=up
+
+# Check collector health
+curl http://localhost:13133
+```
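
Editor's sketch (not part of the commit): the Prometheus query above can also be issued from Node 18+, which avoids shell quoting problems once the PromQL expression contains brackets or parentheses. The `/api/v1/query` path and the `status`/`data.result` response fields are part of the Prometheus HTTP API; the function names are hypothetical.

```javascript
// Build an instant-query URL; URLSearchParams handles the percent-encoding.
function buildQueryUrl(baseUrl, promql) {
  const url = new URL("/api/v1/query", baseUrl);
  url.searchParams.set("query", promql);
  return url.toString();
}

// Run an instant query and return the result vector.
async function queryPrometheus(promql, baseUrl = "http://localhost:9090") {
  const res = await fetch(buildQueryUrl(baseUrl, promql));
  if (!res.ok) throw new Error(`Prometheus returned HTTP ${res.status}`);
  const body = await res.json();
  if (body.status !== "success") throw new Error(`Query failed: ${body.error}`);
  return body.data.result;
}

// Usage (not run automatically):
// queryPrometheus("rate(kicktalk_messages_sent_total[5m])").then(console.log);
```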
+
+### Custom Dashboards
+
+Add custom dashboard JSON files to `otel/grafana/dashboards/` and they'll be automatically loaded into Grafana.
+
+### Testing Telemetry
+
+Send test traces/metrics to the collector:
+```bash
+# Test OTLP HTTP endpoint
+curl -X POST http://localhost:4318/v1/traces \
+  -H "Content-Type: application/json" \
+  -d '{"resourceSpans":[...]}'
+```
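
Editor's sketch (not part of the commit): the README leaves the request body elided. One possible minimal body, following the OTLP/HTTP JSON encoding (hex-encoded IDs, nanosecond timestamps as strings), can be generated as below. The service name, scope name, span name, and 50 ms duration are placeholders, and both function names are hypothetical.

```javascript
const { randomBytes } = require("node:crypto");

// Build a single-span OTLP/HTTP JSON payload with random trace and span IDs.
function buildTestTracePayload(serviceName = "kicktalk-test", spanName = "manual-test-span") {
  const endNs = BigInt(Date.now()) * 1000000n;
  const startNs = endNs - 50000000n; // pretend the span lasted 50 ms
  return {
    resourceSpans: [{
      resource: {
        attributes: [{ key: "service.name", value: { stringValue: serviceName } }],
      },
      scopeSpans: [{
        scope: { name: "manual-test" },
        spans: [{
          traceId: randomBytes(16).toString("hex"), // 32 hex characters
          spanId: randomBytes(8).toString("hex"), // 16 hex characters
          name: spanName,
          kind: 1, // SPAN_KIND_INTERNAL
          startTimeUnixNano: startNs.toString(),
          endTimeUnixNano: endNs.toString(),
        }],
      }],
    }],
  };
}

// POST the payload to the collector's OTLP HTTP receiver; resolves to the HTTP status.
async function sendTestTrace(endpoint = "http://localhost:4318/v1/traces") {
  const res = await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildTestTracePayload()),
  });
  return res.status;
}

// Usage (not run automatically):
// sendTestTrace().then((status) => console.log("collector responded", status));
```

If the collector accepts the span, it should show up in the Jaeger UI under the placeholder service name.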
+
+## Production Considerations
+
+For production deployment:
+- Use external Prometheus/Jaeger instances
+- Configure authentication for Grafana
+- Set up alerting rules in Prometheus
+- Implement log rotation and retention policies
+- Consider using OTEL Collector in agent/gateway mode
+
+## Stopping the Stack
+
+```bash
+docker-compose -f docker-compose.otel.yml down
+```
+
+To remove all data:
+```bash
+docker-compose -f docker-compose.otel.yml down -v
+```
