How a Leading Healthcare Technology Provider Scaled OpenTelemetry Log Ingestion by 150%
“We turned a performance bottleneck into a 150% throughput gain and restored real-time visibility.”
A leading healthcare technology provider was running a high-volume telemetry pipeline built on the OpenTelemetry Collector with Kafka as the primary transport layer.
The pipeline was responsible for ingesting and routing hundreds of thousands of log events per second. Over time, ingestion began to lag behind production demand.
Throughput was capped at 12,000 events per second (EPS) per Kafka partition across 16 partitions, for a total of 192,000 EPS.
As consumer lag grew exponentially, the backlog threatened to impact operational visibility and compliance retention for regulated log data.
Through a combination of receiver tuning, encoding optimization, and transport-level adjustments, Bindplane helped the team scale to 30,000 EPS per partition — a 150% improvement in throughput — while stabilizing ingestion and clearing the backlog in under 48 hours.
Problem
The healthcare provider’s telemetry pipeline had reached its scalability ceiling.
Despite adding Kafka partitions and collector instances, overall throughput remained flat. The OpenTelemetry Kafka receiver was the bottleneck — unable to keep pace with incoming messages, it created growing consumer lag and delayed visibility across multiple downstream systems.
Challenge
The customer needed to increase throughput to meet production demand without overprovisioning infrastructure or introducing instability.
Performance issues within the OpenTelemetry Collector’s Kafka receiver limited scalability, and the root causes weren’t obvious.
Symptoms:
- Kafka consumer lag growing faster than ingestion
- Inconsistent CPU and memory utilization
- High serialization overhead from OTLP JSON
- Export protocol bottlenecks under sustained load
Requirements
To restore stability and scale efficiently, the customer required:
- Sustained 25–30K EPS per partition without data loss
- Compatibility with existing OpenTelemetry Collector architecture
- Controlled, measurable optimization (no guesswork)
- A repeatable benchmarking framework to validate performance changes
Solution
Bindplane engineers collaborated directly with the customer’s observability team to run a multi-week benchmarking effort.
Each configuration change was isolated, tested, and measured against real production workloads using OpenTelemetry-native tools.
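The case study does not publish the benchmarking harness itself, but the Collector's own internal telemetry is enough to make each change measurable. A minimal sketch, assuming a recent contrib distribution (the listen address below is illustrative, and newer Collector versions configure Prometheus readers instead of `address`):

```yaml
service:
  telemetry:
    metrics:
      # Detailed level exposes per-receiver and per-exporter counters on the
      # Collector's internal Prometheus endpoint (default port 8888).
      level: detailed
      address: 0.0.0.0:8888   # deprecated in newer versions in favor of readers
```

Comparing the rates of `otelcol_receiver_accepted_log_records` and `otelcol_exporter_sent_log_records` before and after each change, alongside Kafka consumer lag for the Collector's consumer group, yields a per-change EPS figure without adding any external tooling.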
Optimization steps and results (a hedged configuration sketch follows this list):
- Batch Processor Placement: Moving batching earlier in the pipeline reduced per-event processing overhead. Throughput: 12K → 17K EPS (+41%)
- Franz-Go Kafka Client: Replacing the default Kafka client with franz-go, a high-performance pure-Go Kafka library. Throughput: 17K → 23K EPS (+35%)
- Encoding Fix (Breakthrough): Switching from OTLP JSON, which was performing hidden format conversions, to raw JSON. Throughput: 23K → 30K EPS (+30%)
- Transport Protocol Optimization: Moving the export path from OTLP/gRPC to OTLP over HTTPS to reduce serialization overhead. Throughput: roughly +3K EPS, with improved stability under sustained load
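The customer's actual configuration is not shown in this case study, but a minimal sketch of where these changes land in a Collector config, assuming the contrib distribution's kafka receiver and an OTLP/HTTP backend, looks roughly like the following. Broker addresses, topic, endpoint, and batch sizes are placeholders, and key names vary somewhat between Collector versions:

```yaml
receivers:
  kafka:
    brokers: ["kafka-1:9092", "kafka-2:9092"]   # placeholder brokers
    topic: app-logs                             # placeholder topic
    group_id: otel-collector
    # "json" parses plain JSON log lines directly, avoiding the hidden
    # conversions observed with the OTLP JSON encoding in this engagement.
    encoding: json
    # Whether the franz-go client is the default or sits behind a feature gate
    # depends on the Collector version; check the kafkareceiver README.

processors:
  batch:
    send_batch_size: 8192   # illustrative sizes; tune against your own benchmarks
    timeout: 200ms
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert

exporters:
  # OTLP over HTTP(S) rather than gRPC gave the final throughput bump and
  # steadier behavior under sustained load.
  otlphttp:
    endpoint: https://otlp-gateway.example.com:4318   # placeholder endpoint

service:
  pipelines:
    logs:
      receivers: [kafka]
      processors: [batch, resource]   # batch runs first, ahead of other processing
      exporters: [otlphttp]
```

Placing `batch` ahead of the other processors means downstream components and the exporter handle fewer, larger payloads, which is what drove the first jump from 12K to 17K EPS per partition.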
Outcome:
The pipeline scaled from 192K to 480K EPS in aggregate throughput (30K EPS across each of the 16 partitions), a 150% performance gain.
Backlogs were cleared within 48 hours, restoring real-time visibility across the environment.
Future
The healthcare provider’s telemetry pipeline is now positioned for continued growth with:
- A clear performance benchmarking framework
- Verified Kafka receiver optimizations shared upstream with the OpenTelemetry community
- Plans to adopt future enhancements, including exporter-side batching and HA collector tuning (a sketch of exporter-side batching follows this list)
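Exporter-side batching is still evolving in the Collector, so the sketch below is an assumption about its eventual shape rather than a tested configuration: the `sending_queue` keys are stable, while the `batcher` keys are version-dependent and should be checked against the documentation for the version in use.

```yaml
exporters:
  otlphttp:
    endpoint: https://otlp-gateway.example.com:4318   # placeholder endpoint
    sending_queue:
      enabled: true
      num_consumers: 8
      queue_size: 10000
    # Experimental exporterhelper batching; key names vary by Collector version.
    batcher:
      enabled: true
      flush_timeout: 200ms
      min_size_items: 2000
      max_size_items: 8000
```

Batching at the exporter, after any routing or fan-out, would eventually let the team shrink or drop the pipeline-level batch processor.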
Bindplane continues to collaborate with the OpenTelemetry community to make these performance improvements available to everyone running OpenTelemetry at scale.
Conclusion
This engagement showcases how deep telemetry expertise and open-source collaboration can turn a critical production crisis into a long-term performance win.
By taking a data-driven approach to root cause analysis, the healthcare provider not only restored stability but also set a new internal performance baseline for Kafka-based ingestion.
The outcome highlights a broader truth: in modern observability, scaling isn’t just about infrastructure — it’s about precision engineering.
Bindplane’s OpenTelemetry-native approach allowed this customer to achieve 150% higher throughput, reduced ingest lag, and stronger resilience — all without rewriting or replacing their existing stack.
For organizations running telemetry pipelines at scale, this case proves what’s possible when open-source standards meet production-grade optimization.
