This document provides solutions to common issues when using the Gardener Fluent Bit OTLP Output Plugin.
## Logs Are Not Being Sent

Symptoms:

- Logs are not appearing in the backend
- Queue size keeps growing
- No export errors in metrics
Solutions:

- Check the client type configuration:

  ```
  SeedType  OTLPGRPC   # or OTLPHTTP, stdout
  ShootType otlp_grpc
  ```

  Verify the client type is set correctly. If empty, the plugin won't send logs.
- Verify endpoint connectivity:

  ```
  # For gRPC
  grpcurl -plaintext victorialogs.logging.svc:4317 list

  # For HTTP
  curl -v https://victorialogs.logging.svc/insert/opentelemetry/v1/logs
  ```
- Check the TLS configuration:
  - Ensure certificate paths are correct
  - Verify certificate validity:

    ```
    openssl x509 -in /path/to/cert.crt -text -noout
    ```

  - Check that the server name matches the certificate CN/SAN
  - Verify the CA certificate is correct
- Review the plugin logs:

  ```
  # Increase log level in config
  LogLevel debug

  # Check Fluent Bit logs
  kubectl logs -n logging daemonset/fluent-bit
  ```
- Check backend availability:

  ```
  # Test network connectivity
  nc -zv victorialogs.logging.svc 4317

  # Check DNS resolution
  nslookup victorialogs.logging.svc
  ```
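If the endpoint is reachable but logs still do not show up, you can take the plugin out of the picture by posting a minimal OTLP/JSON record directly to the HTTP ingestion path shown above. This is a sketch; adjust the URL and TLS flags to your setup.

```
# Hypothetical smoke test: send one log record straight to the backend
curl -sS -X POST \
  --cacert /path/to/ca.crt \
  -H 'Content-Type: application/json' \
  -d '{"resourceLogs":[{"scopeLogs":[{"logRecords":[{"body":{"stringValue":"otlp smoke test"}}]}]}]}' \
  https://victorialogs.logging.svc/insert/opentelemetry/v1/logs
```

If the test record shows up in the backend, the problem is on the plugin side; if not, check the backend's ingestion configuration.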
## Queue Size Keeps Growing

Symptoms:

- The `dque_queue_size` metric keeps increasing
- Disk usage growing
- Logs being dropped (if queue full)
Solutions:

- Backend performance issues:
  - The backend may be too slow or unavailable
  - Check the backend metrics and logs
  - Scale the backend if needed
- Batch size too small:

  ```
  # Increase batch size
  DQueBatchProcessorMaxBatchSize 512
  ```

- Export interval too long:

  ```
  # Decrease export interval
  DQueBatchProcessorExportInterval 500ms
  ```
- Network latency:
  - Check network connectivity to the backend
  - Enable compression to reduce bandwidth:

    ```
    Compression 1
    ```
- Backend throttling:
  - Check for rate-limiting errors
  - Adjust the throttle configuration:

    ```
    ThrottleEnabled        true
    ThrottleRequestsPerSec 100
    ```
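After tuning, confirm that the queue is actually draining by watching the queue metric over time (the metrics endpoint is described in the Debugging section below):

```
# Should trend downwards once exports keep up with ingestion
watch -n 5 'curl -s http://localhost:2021/metrics | grep dque_queue_size'
```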
## High Memory Usage

Symptoms:

- Pod OOMKilled
- High memory metrics
- System slowdown
Solutions:

- Queue size too large:

  ```
  # Reduce in-memory queue
  DQueBatchProcessorMaxQueueSize 256
  ```

- Batch size too large:

  ```
  # Reduce batch size
  DQueBatchProcessorMaxBatchSize 128
  ```
- Too many clients:
  - Check for client leaks with dynamic routing
  - Review the client cleanup settings:

    ```
    DeletedClientTimeExpiration 30m
    ```
- Memory leak:
  - Enable profiling and analyze the heap:

    ```
    go tool pprof http://localhost:2021/debug/pprof/heap
    ```
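To tell a genuine leak apart from normal buffering, it helps to compare two heap profiles taken some time apart. A sketch using pprof's base-profile diffing; the interval and file names are illustrative:

```
# Capture a baseline, wait, capture again, then diff the two profiles
curl -s -o heap-before.pb.gz http://localhost:2021/debug/pprof/heap
sleep 600
curl -s -o heap-after.pb.gz http://localhost:2021/debug/pprof/heap

# Allocations that keep growing between the snapshots appear at the top
go tool pprof -top -base heap-before.pb.gz heap-after.pb.gz
```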
## TLS / Certificate Errors

Symptoms:

- "certificate verify failed" errors
- "tls: bad certificate" errors
- Connection refused
Solutions:

- Certificate not found:

  ```
  # Check file exists and is readable
  ls -la /etc/ssl/fluent-bit/tls.crt

  # Check permissions
  kubectl exec -it fluent-bit-xxx -- ls -la /etc/ssl/fluent-bit/
  ```
- Certificate expired:

  ```
  # Check certificate validity
  openssl x509 -in /path/to/cert.crt -noout -dates
  ```

- CA certificate mismatch:

  ```
  # Verify certificate chain
  openssl verify -CAfile /path/to/ca.crt /path/to/cert.crt
  ```

- Server name mismatch:

  ```
  # Set correct server name for SNI
  TLSServerName victorialogs.logging.svc
  ```

- TLS version incompatibility:

  ```
  # Adjust TLS version
  TLSMinVersion 1.2
  TLSMaxVersion 1.3
  ```
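To exercise the full handshake end to end, an `openssl s_client` check can be run from a pod with network access to the backend. A sketch, assuming the backend terminates TLS on the gRPC port used above and reusing the placeholder CA path:

```
openssl s_client -connect victorialogs.logging.svc:4317 \
  -servername victorialogs.logging.svc \
  -CAfile /path/to/ca.crt </dev/null 2>/dev/null | grep -E 'Verify return code|subject='
```

A "Verify return code: 0 (ok)" line indicates the chain and server name check out; anything else points at the certificate or CA configuration.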
## Dynamic Routing Not Working

Symptoms:

- Logs always go to the default client
- Expected client not created
- "client not found" errors
Solutions:

- Check the JSONPath configuration (see the example record after this list):

  ```
  # Verify path matches log structure
  DynamicHostPath {"kubernetes": {"namespace_name": "namespace"}}
  ```
- Verify the regex pattern:

  ```
  # Test regex against namespace names
  DynamicHostRegex ^shoot--
  ```

- Check controller sync:

  ```
  # Increase sync timeout
  ControllerSyncTimeout 120s
  ```

- Review cluster state routing:

  ```
  # Ensure state-based routing is configured
  SendLogsToMainClusterWhenIsInReadyState true
  ```

- Check namespace metadata:

  ```
  # Verify logs contain expected metadata
  LogLevel debug
  # Look for "extracted dynamic host" messages
  ```
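As an illustration (this record is hypothetical, not actual plugin output), the metadata attached to each log must contain the nested field that `DynamicHostPath` points at, and its value must match `DynamicHostRegex`. Here `shoot--project--cluster` matches `^shoot--` and would be extracted as the dynamic host:

```
{
  "log": "example application log line",
  "kubernetes": {
    "namespace_name": "shoot--project--cluster",
    "pod_name": "app-5d9c7b6f4-abcde",
    "container_name": "app"
  }
}
```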
## Export Errors

Symptoms:

- The `dque_export_errors_total` metric is increasing
- "context deadline exceeded" errors
- "connection reset" errors
Solutions:

- Timeout too short:

  ```
  # Increase export timeout
  DQueBatchProcessorExportTimeout 60s
  Timeout 60s
  ```
- Backend overloaded:
  - Reduce the export rate:

    ```
    ThrottleEnabled        true
    ThrottleRequestsPerSec 50
    ```
- Network issues:
  - Check network connectivity
  - Test with smaller batches:

    ```
    DQueBatchProcessorMaxBatchSize 128
    ```
- Backend errors:
  - Check the backend logs for error details
  - Verify the backend configuration
## Missing Kubernetes Metadata

Symptoms:

- Logs are missing `pod_name`, `namespace`, or `container_name`
- Logs are dropped when `DropLogEntryWithoutK8sMetadata` is true
Solutions:

- Enable fallback to the tag:

  ```
  FallbackToTagWhenMetadataIsMissing true
  ```
- Check the tag configuration:

  ```
  TagKey        tag
  TagPrefix     kubernetes\\.var\\.log\\.containers
  TagExpression \\.([^_]+)_([^_]+)_(.+)-([a-z0-9]{64})\\.log$
  ```
- Verify the Fluent Bit Kubernetes filter:

  ```
  [Filter]
      Name            kubernetes
      Match           kubernetes.*
      Kube_URL        https://kubernetes.default.svc:443
      Kube_CA_File    /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
  ```
- Check the service account permissions:

  ```
  # Verify RBAC allows reading pods/namespaces
  kubectl auth can-i get pods --as=system:serviceaccount:logging:fluent-bit
  ```
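If that check fails, a ClusterRole along these lines grants the access the Kubernetes filter needs. This is a sketch: the role name is illustrative, and the subject must match the service account actually used by your Fluent Bit DaemonSet (here `fluent-bit` in the `logging` namespace, as in the check above):

```
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluent-bit-read        # illustrative name
rules:
  - apiGroups: [""]
    resources: ["pods", "namespaces"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluent-bit-read        # illustrative name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluent-bit-read
subjects:
  - kind: ServiceAccount
    name: fluent-bit
    namespace: logging
```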
## Debugging

Enable comprehensive debugging:

```
[Output]
    Name     gardener
    Match    *
    LogLevel debug
    Pprof    true
```

Useful endpoints:
- Metrics:

  ```
  curl http://localhost:2021/metrics
  ```

- Health check:

  ```
  curl http://localhost:2021/healthz
  ```

- CPU profile:

  ```
  go tool pprof http://localhost:2021/debug/pprof/profile
  ```

- Heap profile:

  ```
  go tool pprof http://localhost:2021/debug/pprof/heap
  ```

- Goroutines:

  ```
  curl http://localhost:2021/debug/pprof/goroutine?debug=2
  ```
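If you want Kubernetes to restart an unhealthy Fluent Bit pod automatically, the health endpoint can back a liveness probe. A sketch, assuming `/healthz` on port 2021 is reachable from the kubelet and that restarting on plugin failure is acceptable in your setup; the timings are illustrative:

```
livenessProbe:
  httpGet:
    path: /healthz
    port: 2021
  initialDelaySeconds: 10
  periodSeconds: 30
  failureThreshold: 3
```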
## Performance Tuning

- Compression overhead:

  ```
  # Disable compression if CPU-constrained
  Compression 0
  ```
- Too many regex operations:
  - Simplify tag expressions
  - Use the Kubernetes filter instead of tag parsing

- Too many clients:
  - Review the dynamic routing configuration
  - Reduce the client count if possible
- Frequent syncs:

  ```
  # Disable fsync for better performance
  DQueSync false
  ```

- Small segments:

  ```
  # Increase segment size
  DQueSegmentSize 1000
  ```
- Disk queue location:
  - Use faster storage for the queue directory
  - Consider tmpfs for ephemeral environments (see the volume sketch after this list):

    ```
    DQueDir /dev/shm/fluent-bit-buffers
    ```
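One way to provide a memory-backed directory for the buffers is a `medium: Memory` emptyDir mounted at the configured path. A sketch with illustrative names and size; note that queued logs are lost when the pod restarts:

```
volumes:
  - name: fluent-bit-buffers
    emptyDir:
      medium: Memory
      sizeLimit: 256Mi
containers:
  - name: fluent-bit
    volumeMounts:
      - name: fluent-bit-buffers
        mountPath: /dev/shm/fluent-bit-buffers
```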
## Monitoring

```
# Queue health
curl -s http://localhost:2021/metrics | grep dque_queue_size

# Export performance
curl -s http://localhost:2021/metrics | grep dque_export_duration_seconds

# Error rate
curl -s http://localhost:2021/metrics | grep dque_export_errors_total

# Drop rate
curl -s http://localhost:2021/metrics | grep dque_dropped_total
```
Recommended alerts:

- Queue Growing:
  - Alert: `dque_queue_size > 400` for 5 minutes
  - Action: Investigate backend performance or increase the export rate

- High Error Rate:
  - Alert: `rate(dque_export_errors_total[5m]) > 0.1`
  - Action: Check backend connectivity and logs

- Logs Dropped:
  - Alert: `rate(dque_dropped_total[5m]) > 0`
  - Action: Increase the queue size or export rate

- High Export Latency:
  - Alert: `histogram_quantile(0.95, rate(dque_export_duration_seconds_bucket[5m])) > 10`
  - Action: Check network latency or backend performance
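If the Prometheus Operator is used for monitoring, these thresholds can be expressed as a `PrometheusRule`. A sketch covering two of the alerts above; the resource name, labels, and severities are illustrative:

```
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: fluent-bit-gardener-output   # illustrative name
  namespace: logging
spec:
  groups:
    - name: fluent-bit-otlp-output
      rules:
        - alert: FluentBitQueueGrowing
          expr: dque_queue_size > 400
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Disk queue is growing; investigate backend performance or export rate.
        - alert: FluentBitLogsDropped
          expr: rate(dque_dropped_total[5m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Logs are being dropped; increase queue size or export rate.
```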
## Getting Help

If you've tried the solutions above and still have issues:
- Collect debug information:

  ```
  # Plugin logs
  kubectl logs -n logging daemonset/fluent-bit --tail=1000 > fluent-bit.log

  # Metrics
  curl http://localhost:2021/metrics > metrics.txt

  # Configuration
  kubectl get configmap fluent-bit-config -o yaml > config.yaml
  ```
- Check known issues:
  - GitHub issues: https://github.com/gardener/logging/issues
- Ask for help:
  - Create a GitHub issue with the debug information
  - Join Gardener Slack: #gardener