Collector High Availability
Learn how to set up a highly available OpenTelemetry Collector deployment for production environments.
What Is High Availability (HA)?
High availability (HA) ensures that your telemetry collection and processing infrastructure keeps working even if individual Collector instances fail.
Why High Availability for the Collector?
- Avoid data loss when agent-mode collectors export to an unavailable backend.
- Ensure telemetry continuity during rolling updates or infrastructure failures.
- Enable horizontal scalability for load-balancing traces, logs, and metrics.
Note: Using the Agent-Gateway Architecture is the recommended deployment pattern for high availability.
Agent-Gateway Architecture
- Agent Collectors run on every host, container, or node.
- Gateway Collectors are centralized, scalable backend services receiving telemetry from agents.
- Each layer can be scaled independently and horizontally.
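For example, a minimal agent configuration might forward everything it receives to the gateway tier over OTLP. This is a sketch only; the gateway hostname is a placeholder for your load-balanced gateway address.

```yaml
# Agent collector (sketch): receive local OTLP traffic and forward it to the gateway tier.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  otlp:
    endpoint: otel-gateway.example.com:4317  # hypothetical load-balanced gateway address
    tls:
      insecure: true  # illustration only; enable TLS toward the gateway in production

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```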
Architecture
A typical high-availability OpenTelemetry Collector deployment consists of the following components (a minimal gateway configuration sketch follows the list):
- Multiple Collector Instances
  - Deployed across different availability zones/regions
  - Each instance capable of handling the full workload
  - Redundant storage for temporary data buffering
- Load Balancer
  - Distributes incoming telemetry data
  - Health checks to detect collector availability
  - Session affinity for consistent routing
- Automatic Failover
  - Occurs if a collector becomes unavailable
- Shared Storage Backend
  - Persistent storage for collector state
  - Shared configuration management
  - Metrics and traces storage
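A gateway instance, by contrast, listens on the standard OTLP ports behind the load balancer and exports to the backend. The sketch below is illustrative: the backend endpoint is a placeholder, and metrics and logs pipelines would be configured the same way as the traces pipeline shown.

```yaml
# Gateway collector (sketch): receive OTLP from agents via the load balancer
# and export to the observability backend.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}

exporters:
  otlphttp:
    endpoint: https://backend.example.com  # hypothetical backend endpoint

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```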
Sizing and Resource Requirements
View the Sizing and Scaling page for a more in-depth guide.
Gateway Collector Requirements
Minimum Configuration:
- 2 collectors behind a load balancer
- 2 CPU cores per collector
- 8GB memory per collector
- 60GB usable space for persistent queue per collector
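If the gateway tier runs on Kubernetes, the minimum configuration above might translate into a Deployment along these lines. This is a sketch only: the names, image, and replica count are assumptions, and the 60GB of persistent-queue disk would additionally require a mounted volume (omitted here).

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-gateway          # hypothetical name
spec:
  replicas: 2                 # at least 2 collectors behind the load balancer
  selector:
    matchLabels:
      app: otel-gateway
  template:
    metadata:
      labels:
        app: otel-gateway
    spec:
      containers:
        - name: otelcol
          image: otel/opentelemetry-collector-contrib:latest
          resources:
            requests:
              cpu: "2"        # 2 CPU cores per collector
              memory: 8Gi     # 8GB memory per collector
            limits:
              cpu: "2"
              memory: 8Gi
```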
Throughput-Based Sizing
The following table shows the number of collectors needed based on expected throughput. This assumes each collector has 4 CPU cores and 16GB of memory:
| Telemetry Throughput | Logs / second | Collectors |
|---|---|---|
| 5 GB/m | 250,000 | 2 |
| 10 GB/m | 500,000 | 3 |
| 20 GB/m | 1,000,000 | 5 |
| 100 GB/m | 5,000,000 | 25 |
It's important to over-provision your collector fleet to provide fault tolerance. If one or more collectors fail or are taken offline for maintenance, the remaining collectors must have enough spare capacity to handle the full telemetry throughput. For example, if your expected throughput calls for three collectors, deploy at least four so that losing one instance does not reduce capacity below demand.
Scaling
- Monitor these metrics to determine when to scale:
  - CPU utilization
  - Memory usage
  - Network throughput
  - Queue length
  - Error rates
- Configure auto-scaling based on:
  - CPU utilization > 70%
  - Memory usage > 80%
  - Request rate per collector
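On Kubernetes, one way to implement these thresholds is a HorizontalPodAutoscaler targeting the gateway Deployment. A minimal sketch, assuming the hypothetical otel-gateway Deployment shown earlier:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-gateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-gateway
  minReplicas: 2              # keep at least 2 collectors for fault tolerance
  maxReplicas: 10             # illustrative upper bound
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% CPU
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80   # scale out above 80% memory
```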
Load Balancer Configuration
Configure your load balancer with the following health check settings:
- Health check endpoint: /health
- Health check interval: 30 seconds
- Unhealthy threshold: 3 failures
- Healthy threshold: 2 successes
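On the collector side, these probes need something to respond. A minimal sketch using the health_check extension; the port and path shown here are assumptions and must match whatever your load balancer is configured to probe:

```yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133   # port the load balancer probes
    path: /health             # must match the load balancer's health check endpoint

service:
  extensions: [health_check]
```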
View the Load Balancing Best Practices page for more details.
Resilience
View the Resilience page for a more in-depth guide.
Configure:
- Batching - Aggregates telemetry signals into batches before exporting them
- Retry - Retries sending telemetry batches when there is an error or a network outage
- Persistent Queue - Stores retried batches in a sending queue on disk so they are not lost if a collector crashes
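For example, batching is configured with the batch processor; the values below are illustrative, not recommendations:

```yaml
processors:
  batch:
    send_batch_size: 8192   # number of spans/metric points/log records per batch
    timeout: 5s             # flush a partial batch after this interval
```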
Retry
For workloads that cannot afford to drop telemetry, consider increasing max_elapsed_time significantly. Keep in mind that a large max_elapsed_time combined with a long backend outage will cause the collector to buffer a significant amount of telemetry to disk.
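A sketch of retry settings on an exporter using the standard retry_on_failure options; the endpoint and intervals are illustrative:

```yaml
exporters:
  otlp:
    endpoint: otel-backend.example.com:4317   # hypothetical backend
    retry_on_failure:
      enabled: true
      initial_interval: 5s      # wait before the first retry
      max_interval: 30s         # cap on the backoff interval
      max_elapsed_time: 10m     # give up after this long; increase if telemetry must not be dropped
```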
Persistent Queue
The sending queue has three important options:
- Number of consumers: Determines how many batches will be retried in parallel
- Queue size: Determines how many batches are stored in the queue
- Persistent queuing: Allows the collector to buffer telemetry batches to disk
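A sketch of a persistent sending queue backed by the file_storage extension; the directory, endpoint, and sizes are assumptions:

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage   # needs enough disk for the queued batches

exporters:
  otlp:
    endpoint: otel-backend.example.com:4317    # hypothetical backend
    sending_queue:
      enabled: true
      num_consumers: 10       # batches retried in parallel
      queue_size: 5000        # batches held in the queue
      storage: file_storage   # buffer the queue to disk via the file_storage extension

service:
  extensions: [file_storage]
```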
Monitoring and Maintenance
View the Monitoring page for a more in-depth guide.
Health Monitoring
- Set up monitoring for (see the internal telemetry sketch after this list):
  - Collector instance health
  - Load balancer health
  - Data throughput
  - Error rates
  - Resource utilization
- Configure alerts for:
  - Collector failures
  - High latency
  - Error rate thresholds
  - Resource exhaustion
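Much of this data can come from the Collector's own internal telemetry. A minimal sketch of exposing it for scraping; the level and address are assumptions, and the exact settings vary by Collector version:

```yaml
service:
  telemetry:
    logs:
      level: info             # the Collector's own log verbosity
    metrics:
      level: detailed         # emit detailed internal metrics
      address: 0.0.0.0:8888   # Prometheus-format metrics endpoint to scrape
```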
Monitoring the Collectors
To monitor collector logs, set up a Bindplane Collector source that will send log files from the Collector itself:
- Add a "Bindplane Collector" source to your configuration
- Configure the source with default settings
- Push the configuration to your collectors
- View the logs in your destination of choice
Best Practices
- Resource Allocation
  - Size collectors for peak load
  - Include buffer for traffic spikes
  - Monitor resource usage
- Network Configuration
  - Use dedicated networks
  - Configure appropriate timeouts
  - Enable TLS for security (see the TLS sketch after this list)
- Data Management
  - Implement data buffering
  - Configure appropriate batch sizes
  - Set up retry policies
- Security
  - Enable TLS encryption
  - Implement authentication
  - Use network policies
  - Regular security updates
- Load Balancing
  - Configure health checks to ensure collectors are ready to receive traffic
  - Ensure even connection distribution among collectors
  - Support both TCP/UDP and HTTP/gRPC protocols
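As an illustration of enabling TLS, the sketch below secures the gateway's OTLP receiver and the connection to the backend; the certificate paths and endpoints are placeholders:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/otelcol/certs/server.crt   # hypothetical server certificate
          key_file: /etc/otelcol/certs/server.key

exporters:
  otlp:
    endpoint: otel-backend.example.com:4317          # hypothetical backend
    tls:
      ca_file: /etc/otelcol/certs/ca.crt             # CA used to verify the backend
```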
Troubleshooting
Common Issues
- Load Balancer Issues
  - Check health check configuration
  - Verify network connectivity
  - Review security groups/firewall rules
- Collector Failures
  - Check resource utilization
  - Review error logs
  - Verify configuration
- Data Loss
  - Check buffer configuration
  - Verify exporter settings
  - Review retry policies