Troubleshooting Guide

This document provides solutions to common issues when using the Gardener Fluent Bit OTLP Output Plugin.

Common Issues

Logs Not Being Sent

Symptoms

- Logs are not appearing in the backend
- Queue size keeps growing
- No export errors in metrics

Troubleshooting Steps

Check client type configuration:
```
SeedType OTLPGRPC  # or OTLPHTTP, STDOUT
ShootType OTLPGRPC
```
Verify the client type is set correctly. If empty, the plugin won't send logs.
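
A quick way to confirm that both keys carry a value is to scan the configuration file. The following is a minimal sketch, not part of the plugin: `check_client_types` is a hypothetical helper, and it assumes the classic Fluent Bit format with one `Key Value` pair per line.

```shell
#!/usr/bin/env bash
# Sketch: report whether SeedType and ShootType have a non-empty value.
# Assumes one "Key Value" pair per line; a value starting with "#" is
# treated as a comment, i.e. as an empty setting.
check_client_types() {
  local config_file="$1" key value status=0
  for key in SeedType ShootType; do
    # Keep the last assignment of the key (compared case-insensitively).
    value="$(awk -v k="$key" '
      tolower($1) == tolower(k) { v = ($2 ~ /^#/) ? "" : $2 }
      END { print v }
    ' "$config_file")"
    if [ -z "$value" ]; then
      echo "$key: not set"
      status=1
    else
      echo "$key: $value"
    fi
  done
  return "$status"
}
```

Call it with the path to your Fluent Bit configuration file. A non-zero exit status means at least one of the two keys is missing or empty.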

Verify endpoint connectivity:

# For gRPC
grpcurl -plaintext victorialogs.logging.svc:4317 list

# For HTTP
curl -v https://victorialogs.logging.svc/insert/opentelemetry/v1/logs

Check TLS configuration:
- Ensure certificate paths are correct
- Verify certificate validity: openssl x509 -in /path/to/cert.crt -text -noout
- Check server name matches certificate CN/SAN
- Verify CA certificate is correct

Review plugin logs:

# Increase log level in config
LogLevel debug

# Check Fluent Bit logs
kubectl logs -n logging daemonset/fluent-bit

Check backend availability:

# Test network connectivity
nc -zv victorialogs.logging.svc 4317

# Check DNS resolution
nslookup victorialogs.logging.svc

Queue Growing Continuously

Symptoms

- dque_queue_size metric keeps increasing
- Disk usage growing
- Logs being dropped (if queue full)
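
To tell a steadily growing queue from a temporary spike, compare several readings of dque_queue_size taken some time apart. The sketch below uses a simple heuristic (every sample larger than the previous one). `queue_trend` is a hypothetical helper and the sampling interval is up to you.

```shell
#!/usr/bin/env bash
# Sketch: read one dque_queue_size sample per line on stdin and report
# "growing" only if every sample is strictly larger than the one before.
queue_trend() {
  awk '
    NR == 1 { growing = 1 }
    NR > 1 && $1 <= prev { growing = 0 }
    { prev = $1 }
    END { print ((NR > 1 && growing) ? "growing" : "not growing") }
  '
}

printf '10\n20\n35\n' | queue_trend   # prints: growing
printf '10\n20\n15\n' | queue_trend   # prints: not growing
```

A strictly increasing series is a deliberately strict criterion. For noisy queues, comparing only the first and last sample may be more practical.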

Root Causes and Solutions

Backend performance issues:
- Backend may be too slow or unavailable
- Check backend metrics and logs
- Scale backend if needed

Batch size too small:

# Increase batch size
DQueBatchProcessorMaxBatchSize 512

Export interval too high:

# Decrease export interval
DQueBatchProcessorExportInterval 500ms

Network latency:
- Check network connectivity to backend
- Enable compression to reduce bandwidth:
```
Compression 1
```

Backend throttling:
- Check for rate limiting errors
- Adjust throttle configuration:
```
ThrottleEnabled true
ThrottleRequestsPerSec 100
```

High Memory Usage

Symptoms

- Pod OOMKilled
- High memory metrics
- System slowdown

Solutions

Queue size too large:

# Reduce in-memory queue
DQueBatchProcessorMaxQueueSize 256

Batch size too large:

# Reduce batch size
DQueBatchProcessorMaxBatchSize 128

Too many clients:
- Check for client leaks with dynamic routing
- Review client cleanup:
```
DeletedClientTimeExpiration 30m
```

Memory leak:

Enable profiling (Pprof true) and analyze the heap:

go tool pprof http://localhost:2021/debug/pprof/heap

TLS/mTLS Errors

Symptoms

- "certificate verify failed" errors
- "tls: bad certificate" errors
- Connection refused

Solutions

Certificate not found:

# Check file exists and is readable
ls -la /etc/ssl/fluent-bit/tls.crt

# Check permissions
kubectl exec -it -n logging fluent-bit-xxx -- ls -la /etc/ssl/fluent-bit/

Certificate expired:

# Check certificate validity
openssl x509 -in /path/to/cert.crt -noout -dates
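
Reading the dates by eye works for a single certificate. For a scripted check, `openssl x509` also offers `-checkend`, which tests whether a certificate expires within a given number of seconds. The sketch below wraps it. `cert_expires_within` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: succeed (exit 0) if the certificate expires within the given
# number of days. Note: an unreadable file also makes openssl fail, so it
# is reported the same way as an expiring certificate.
cert_expires_within() {
  local cert_file="$1" days="$2"
  # -checkend exits 0 when the certificate is still valid after N seconds,
  # so the result is inverted here.
  ! openssl x509 -in "$cert_file" -noout -checkend "$(( days * 86400 ))" >/dev/null
}
```

For example, `cert_expires_within /path/to/cert.crt 30 && echo "renew soon"` flags certificates with less than 30 days of validity left.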

CA certificate mismatch:

# Verify certificate chain
openssl verify -CAfile /path/to/ca.crt /path/to/cert.crt

Server name mismatch:

# Set correct server name for SNI
TLSServerName victorialogs.logging.svc

TLS version incompatibility:

# Adjust TLS version
TLSMinVersion 1.2
TLSMaxVersion 1.3

Dynamic Routing Not Working

Symptoms

- Logs always go to default client
- Expected client not created
- "client not found" errors

Solutions

Check JSONPath configuration:

# Verify path matches log structure
DynamicHostPath {"kubernetes": {"namespace_name": "namespace"}}

Verify regex pattern:

# Test regex against namespace names
DynamicHostRegex ^shoot--
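
The pattern can be tried against a few namespace names before it is rolled out. This is a minimal sketch: `match_namespaces` is a hypothetical helper and the namespace names are examples. Bash evaluates the pattern as a POSIX extended regular expression, while the plugin is written in Go, so results should agree for simple patterns like this one but may differ for more advanced syntax.

```shell
#!/usr/bin/env bash
# Sketch: print which namespace names match a DynamicHostRegex pattern.
match_namespaces() {
  local pattern="$1" name
  shift
  for name in "$@"; do
    if [[ "$name" =~ $pattern ]]; then
      echo "match: $name"
    else
      echo "no match: $name"
    fi
  done
}

# Example namespace names (illustrative only).
match_namespaces '^shoot--' shoot--dev--cluster1 kube-system garden
# prints:
#   match: shoot--dev--cluster1
#   no match: kube-system
#   no match: garden
```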

Check controller sync:

# Increase sync timeout
ControllerSyncTimeout 120s

Review cluster state routing:

# Ensure state-based routing is configured
SendLogsToMainClusterWhenIsInReadyState true

Check namespace metadata:

# Verify logs contain expected metadata
LogLevel debug
# Look for "extracted dynamic host" messages

Export Errors

Symptoms

- dque_export_errors_total metric increasing
- "context deadline exceeded" errors
- "connection reset" errors

Solutions

Timeout too short:

# Increase export timeout
DQueBatchProcessorExportTimeout 60s
Timeout 60s

Backend overloaded:

Reduce export rate:

ThrottleEnabled true
ThrottleRequestsPerSec 50

Network issues:
- Check network connectivity
- Test with smaller batches:
```
DQueBatchProcessorMaxBatchSize 128
```

Backend errors:
- Check backend logs for error details
- Verify backend configuration

Kubernetes Metadata Missing

Symptoms

- Logs missing pod_name, namespace, container_name
- Logs dropped when DropLogEntryWithoutK8sMetadata is true

Solutions

Enable fallback to tag:
```
FallbackToTagWhenMetadataIsMissing true
```

Check tag configuration:

TagKey tag
TagPrefix kubernetes\\.var\\.log\\.containers
TagExpression \\.([^_]+)_([^_]+)_(.+)-([a-z0-9]{64})\\.log$
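
To see what the tag settings would extract, the expression can be tried against a sample tag. The sketch below rests on two assumptions: that TagPrefix and TagExpression are concatenated into one pattern, and that the doubled backslashes in the configuration collapse to single ones. `parse_container_tag` is a hypothetical helper, and the sample tag uses a placeholder container ID.

```shell
#!/usr/bin/env bash
# Sketch: extract pod, namespace and container name from a Fluent Bit tag.
# Assumption: the effective pattern is TagPrefix followed by TagExpression.
parse_container_tag() {
  local tag="$1"
  local re='kubernetes\.var\.log\.containers\.([^_]+)_([^_]+)_(.+)-([a-z0-9]{64})\.log$'
  if [[ "$tag" =~ $re ]]; then
    echo "pod=${BASH_REMATCH[1]} namespace=${BASH_REMATCH[2]} container=${BASH_REMATCH[3]}"
  else
    echo "no match"
    return 1
  fi
}

sample_id="$(printf '%064d' 0)"   # placeholder 64-character container ID
parse_container_tag "kubernetes.var.log.containers.nginx-7d9c_default_nginx-${sample_id}.log"
# prints: pod=nginx-7d9c namespace=default container=nginx
```

If a real tag from your cluster yields "no match", the fallback cannot recover the metadata and the tag settings need adjusting.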

Verify Fluent Bit Kubernetes filter:

[Filter]
    Name kubernetes
    Match kubernetes.*
    Kube_URL https://kubernetes.default.svc:443
    Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token

Check service account permissions:

# Verify RBAC allows reading pods/namespaces
kubectl auth can-i get pods --as=system:serviceaccount:logging:fluent-bit

Debug Mode

Enable comprehensive debugging:

[Output]
    Name gardener
    Match *
    LogLevel debug
    Pprof true

Accessing Debug Information

Metrics:
```
curl http://localhost:2021/metrics
```

Health Check:
```
curl http://localhost:2021/healthz
```

CPU Profile:

go tool pprof http://localhost:2021/debug/pprof/profile

Heap Profile:

go tool pprof http://localhost:2021/debug/pprof/heap

Goroutines:

curl 'http://localhost:2021/debug/pprof/goroutine?debug=2'
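
A goroutine dump can be long. Summarizing it by goroutine state helps spot leaks, for example many goroutines stuck in the same wait state. The sketch below relies on the header line format of Go's `debug=2` dump (`goroutine N [state, wait time]:`). `goroutine_states` is a hypothetical helper and the sample dump is made up.

```shell
#!/usr/bin/env bash
# Sketch: count goroutines per state from a "debug=2" dump read on stdin.
goroutine_states() {
  awk '
    /^goroutine [0-9]+ \[/ {
      s = $0
      sub(/^goroutine [0-9]+ \[/, "", s)   # drop "goroutine N ["
      s = substr(s, 1, index(s, "]") - 1)  # keep the text inside the brackets
      sub(/,.*/, "", s)                    # drop wait time such as ", 5 minutes"
      count[s]++
    }
    END { for (s in count) print count[s], s }
  ' | sort -rn
}

# Made-up sample dump for illustration.
goroutine_states <<'EOF'
goroutine 1 [running]:
main.main()

goroutine 18 [chan receive, 5 minutes]:
example.worker()

goroutine 19 [chan receive, 7 minutes]:
example.worker()
EOF
# prints:
#   2 chan receive
#   1 running
```

Against a live instance, pipe the output of the curl command above into `goroutine_states`.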

Performance Issues

High CPU Usage

Compression overhead:

# Disable compression if CPU-constrained
Compression 0

Too many regex operations:
- Simplify tag expressions
- Use Kubernetes filter instead of tag parsing

Too many clients:
- Review dynamic routing configuration
- Reduce client count if possible

High Disk I/O

Frequent syncs:

# Disable fsync for better performance (recently queued logs may be lost on a crash)
DQueSync false

Small segments:

# Increase segment size
DQueSegmentSize 1000

Disk queue location:
- Use faster storage for queue directory
- Consider tmpfs for ephemeral environments:
```
DQueDir /dev/shm/fluent-bit-buffers
```

Metrics Analysis

Key Metrics to Monitor

# Queue health
curl -s http://localhost:2021/metrics | grep dque_queue_size

# Export performance
curl -s http://localhost:2021/metrics | grep dque_export_duration_seconds

# Error rate
curl -s http://localhost:2021/metrics | grep dque_export_errors_total

# Drop rate
curl -s http://localhost:2021/metrics | grep dque_dropped_total
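
`grep` prints one line per label set, which gets hard to read when many clients exist. The sketch below sums all series of one metric from the Prometheus text format. `metric_sum` is a hypothetical helper, the label names in the sample are invented, and it assumes that samples carry no trailing timestamp.

```shell
#!/usr/bin/env bash
# Sketch: sum the values of all series of one metric, reading the
# Prometheus text exposition format on stdin.
# Assumes the sample value is the last field on each line (no timestamp).
metric_sum() {
  local name="$1"
  awk -v name="$name" '
    /^#/ { next }                 # skip HELP and TYPE comment lines
    {
      series = $1
      sub(/\{.*/, "", series)     # drop the label set, keep the metric name
      if (series == name) { sum += $NF; found = 1 }
    }
    END { if (found) print sum; else exit 1 }
  '
}

# Invented sample input for illustration.
metric_sum dque_queue_size <<'EOF'
# TYPE dque_queue_size gauge
dque_queue_size{name="a"} 12
dque_queue_size{name="b"} 30
dque_export_errors_total{name="a"} 3
EOF
# prints: 42
```

Against a live instance: `curl -s http://localhost:2021/metrics | metric_sum dque_queue_size`. The function exits non-zero when the metric is absent.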

Alert Recommendations

Queue Growing:
- Alert: dque_queue_size > 400 for 5 minutes
- Action: Investigate backend performance or increase export rate

High Error Rate:
- Alert: rate(dque_export_errors_total[5m]) > 0.1
- Action: Check backend connectivity and logs

Logs Dropped:
- Alert: rate(dque_dropped_total[5m]) > 0
- Action: Increase queue size or export rate

High Export Latency:
- Alert: histogram_quantile(0.95, sum by (le) (rate(dque_export_duration_seconds_bucket[5m]))) > 10
- Action: Check network latency or backend performance
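
Without a Prometheus server at hand, the error-rate condition can be approximated from two readings of the counter. The sketch below computes the per-second increase between two samples and compares it with a threshold. `counter_rate_exceeds` is a hypothetical helper, and unlike PromQL's rate() it does not handle counter resets.

```shell
#!/usr/bin/env bash
# Sketch: per-second rate between two counter samples taken
# <interval> seconds apart; exits 0 if the rate exceeds the threshold.
counter_rate_exceeds() {
  local first="$1" second="$2" interval="$3" threshold="$4"
  awk -v a="$first" -v b="$second" -v t="$interval" -v max="$threshold" 'BEGIN {
    rate = (b - a) / t
    printf "rate=%.3f/s\n", rate
    exit ((rate > max) ? 0 : 1)
  }'
}

# 60 new export errors within 300 seconds, threshold 0.1 per second.
counter_rate_exceeds 120 180 300 0.1 && echo "error rate above threshold"
# prints:
#   rate=0.200/s
#   error rate above threshold
```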

Getting Help

If you've tried the solutions above and still have issues:

Collect debug information:

# Plugin logs
kubectl logs -n logging daemonset/fluent-bit --tail=1000 > fluent-bit.log

# Metrics
curl http://localhost:2021/metrics > metrics.txt

# Configuration
kubectl get configmap -n logging fluent-bit-config -o yaml > config.yaml

Check known issues:
- GitHub Issues: https://github.com/gardener/logging/issues

Ask for help:
- Create a GitHub issue with debug information
- Join Gardener Slack: #gardener
