Collector High Availability
Learn how to set up a highly available OpenTelemetry Collector deployment for production environments.
What Is High Availability (HA)?
High availability (HA) ensures that your telemetry collection and processing infrastructure keeps working even if individual Collector instances fail.
Why High Availability for the Collector?
- Avoid data loss when agent-mode collectors export to an unavailable backend.
- Ensure telemetry continuity during rolling updates or infrastructure failures.
- Enable horizontal scalability for load-balancing traces, logs, and metrics.
Note: Using the Agent-Gateway Architecture is the recommended deployment pattern for high availability.
Agent-Gateway Architecture
- Agent Collectors run on every host, container, or node.
- Gateway Collectors are centralized, scalable backend services receiving telemetry from agents.
- Each layer can be scaled independently and horizontally.
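For example, a minimal agent configuration might forward everything it receives to the gateway tier over OTLP. This is a sketch only; the gateway hostname is a placeholder for your load-balanced gateway address.

```yaml
# Agent collector (sketch): receive local OTLP traffic and forward it to the gateway tier.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  otlp:
    endpoint: otel-gateway.example.com:4317  # hypothetical load-balanced gateway address
    tls:
      insecure: true  # illustration only; enable TLS toward the gateway in production

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```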
Architecture
A typical high-availability OpenTelemetry Collector deployment consists of the following components (a minimal gateway configuration sketch follows the list):
- Multiple Collector Instances
  - Deployed across different availability zones/regions
  - Each instance capable of handling the full workload
  - Redundant storage for temporary data buffering
- Load Balancer
  - Distributes incoming telemetry data
  - Health checks to detect collector availability
  - Session affinity for consistent routing
- Automatic Failover
  - Occurs if a collector becomes unavailable
- Shared Storage Backend
  - Persistent storage for collector state
  - Shared configuration management
  - Metrics and traces storage
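A gateway instance, by contrast, listens on the standard OTLP ports behind the load balancer and exports to the backend. The sketch below is illustrative: the backend endpoint is a placeholder, and metrics and logs pipelines would be configured the same way as the traces pipeline shown.

```yaml
# Gateway collector (sketch): receive OTLP from agents via the load balancer
# and export to the observability backend.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}

exporters:
  otlphttp:
    endpoint: https://backend.example.com  # hypothetical backend endpoint

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```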
Sizing and Resource Requirements
View the Sizing and Scaling page for a more in-depth guide.
Gateway Collector Requirements
Minimum Configuration:
- 2 collectors behind a load balancer
- 2 CPU cores per collector
- 8GB memory per collector
- 60GB usable space for persistent queue per collector
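If the gateway tier runs on Kubernetes, the minimum configuration above might translate into a Deployment along these lines. This is a sketch only: the names, image, and replica count are assumptions, and the 60GB of persistent-queue disk would additionally require a mounted volume (omitted here).

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-gateway          # hypothetical name
spec:
  replicas: 2                 # at least 2 collectors behind the load balancer
  selector:
    matchLabels:
      app: otel-gateway
  template:
    metadata:
      labels:
        app: otel-gateway
    spec:
      containers:
        - name: otelcol
          image: otel/opentelemetry-collector-contrib:latest
          resources:
            requests:
              cpu: "2"        # 2 CPU cores per collector
              memory: 8Gi     # 8GB memory per collector
            limits:
              cpu: "2"
              memory: 8Gi
```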
Throughput-Based Sizing
The following table shows the number of collectors needed based on expected throughput. This assumes each collector has 4 CPU cores and 16GB of memory:
| Telemetry Throughput | Logs / second | Collectors |
|---|---|---|
| 5 GB/m | 250,000 | 2 |
| 10 GB/m | 500,000 | 3 |
| 20 GB/m | 1,000,000 | 5 |
| 100 GB/m | 5,000,000 | 25 |
It's important to over-provision your collector fleet to provide fault tolerance. If one or more collectors fail or are taken offline for maintenance, the remaining collectors must have enough spare capacity to handle the full telemetry throughput. For example, if your expected throughput calls for three collectors, deploy at least four so that losing one instance does not reduce capacity below demand.
Scaling
- Monitor these metrics to determine when to scale:
  - CPU utilization
  - Memory usage
  - Network throughput
  - Queue length
  - Error rates
- Configure auto-scaling based on:
  - CPU utilization > 70%
  - Memory usage > 80%
  - Request rate per collector
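On Kubernetes, one way to implement these thresholds is a HorizontalPodAutoscaler targeting the gateway Deployment. A minimal sketch, assuming the hypothetical otel-gateway Deployment shown earlier:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-gateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-gateway
  minReplicas: 2              # keep at least 2 collectors for fault tolerance
  maxReplicas: 10             # illustrative upper bound
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% CPU
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80   # scale out above 80% memory
```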
Load Balancer Configuration
Configure your load balancer with the following health check settings:
- Health check endpoint: /health
- Health check interval: 30 seconds
- Unhealthy threshold: 3 failures
- Healthy threshold: 2 successes
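On the collector side, these probes need something to respond. A minimal sketch using the health_check extension; the port and path shown here are assumptions and must match whatever your load balancer is configured to probe:

```yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133   # port the load balancer probes
    path: /health             # must match the load balancer's health check endpoint

service:
  extensions: [health_check]
```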
View the Load Balancing Best Practices page for more details.
Resilience
View the Resilience page for a more in-depth guide.
Configure:
- Batching - Aggregates telemetry signals into batches before exporting them
- Retry - Retries sending telemetry batches when there is an error or a network outage
- Persistent Queue - Stores retried batches in a sending queue on disk so they are not lost if a collector crashes
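For example, batching is configured with the batch processor; the values below are illustrative, not recommendations:

```yaml
processors:
  batch:
    send_batch_size: 8192   # number of spans/metric points/log records per batch
    timeout: 5s             # flush a partial batch after this interval
```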
Retry
For workloads that cannot afford to drop telemetry, consider increasing max_elapsed_time significantly. Keep in mind that a large max_elapsed_time combined with a long backend outage will cause the collector to buffer a significant amount of telemetry to disk.
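A sketch of retry settings on an exporter using the standard retry_on_failure options; the endpoint and intervals are illustrative:

```yaml
exporters:
  otlp:
    endpoint: otel-backend.example.com:4317   # hypothetical backend
    retry_on_failure:
      enabled: true
      initial_interval: 5s      # wait before the first retry
      max_interval: 30s         # cap on the backoff interval
      max_elapsed_time: 10m     # give up after this long; increase if telemetry must not be dropped
```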
Persistent Queue
The sending queue has three important options:
- Number of consumers: Determines how many batches will be retried in parallel
- Queue size: Determines how many batches are stored in the queue
- Persistent queuing: Allows the collector to buffer telemetry batches to disk
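A sketch of a persistent sending queue backed by the file_storage extension; the directory, endpoint, and sizes are assumptions:

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage   # needs enough disk for the queued batches

exporters:
  otlp:
    endpoint: otel-backend.example.com:4317    # hypothetical backend
    sending_queue:
      enabled: true
      num_consumers: 10       # batches retried in parallel
      queue_size: 5000        # batches held in the queue
      storage: file_storage   # buffer the queue to disk via the file_storage extension

service:
  extensions: [file_storage]
```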
Monitoring and Maintenance
View the Monitoring page for a more in-depth guide.
Health Monitoring
- Set up monitoring for (see the internal telemetry sketch after this list):
  - Collector instance health
  - Load balancer health
  - Data throughput
  - Error rates
  - Resource utilization
- Configure alerts for:
  - Collector failures
  - High latency
  - Error rate thresholds
  - Resource exhaustion
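Much of this data can come from the Collector's own internal telemetry. A minimal sketch of exposing it for scraping; the level and address are assumptions, and the exact settings vary by Collector version:

```yaml
service:
  telemetry:
    logs:
      level: info             # the Collector's own log verbosity
    metrics:
      level: detailed         # emit detailed internal metrics
      address: 0.0.0.0:8888   # Prometheus-format metrics endpoint to scrape
```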
Monitoring the Collectors
To monitor collector logs, set up a Bindplane Collector source that will send log files from the Collector itself:
- Add a "Bindplane Collector" source to your configuration
- Configure the source with default settings
- Push the configuration to your collectors
- View the logs in your destination of choice
Best Practices
- Resource Allocation
  - Size collectors for peak load
  - Include buffer for traffic spikes
  - Monitor resource usage
- Network Configuration
  - Use dedicated networks
  - Configure appropriate timeouts
  - Enable TLS for security (see the TLS sketch after this list)
- Data Management
  - Implement data buffering
  - Configure appropriate batch sizes
  - Set up retry policies
- Security
  - Enable TLS encryption
  - Implement authentication
  - Use network policies
  - Regular security updates
- Load Balancing
  - Configure health checks to ensure collectors are ready to receive traffic
  - Ensure even connection distribution among collectors
  - Support both TCP/UDP and HTTP/gRPC protocols
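As an illustration of enabling TLS, the sketch below secures the gateway's OTLP receiver and the connection to the backend; the certificate paths and endpoints are placeholders:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/otelcol/certs/server.crt   # hypothetical server certificate
          key_file: /etc/otelcol/certs/server.key

exporters:
  otlp:
    endpoint: otel-backend.example.com:4317          # hypothetical backend
    tls:
      ca_file: /etc/otelcol/certs/ca.crt             # CA used to verify the backend
```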
Troubleshooting
Common Issues
- Load Balancer Issues
  - Check health check configuration
  - Verify network connectivity
  - Review security groups/firewall rules
- Collector Failures
  - Check resource utilization
  - Review error logs
  - Verify configuration
- Data Loss
  - Check buffer configuration
  - Verify exporter settings
  - Review retry policies