How a Leading Healthcare Technology Provider Scaled OpenTelemetry Log Ingestion by 150%
“We turned a performance bottleneck into a 150% throughput gain and restored real-time visibility.”
A leading healthcare technology provider was running a high-volume telemetry pipeline built on the OpenTelemetry Collector with Kafka as the primary transport layer.
The pipeline was responsible for ingesting and routing hundreds of thousands of log events per second. Over time, ingestion began to lag behind production demand.
Throughput was capped at 12,000 events per second (EPS) per Kafka partition across 16 partitions, for a total of 192,000 EPS.
As consumer lag grew exponentially, the backlog threatened to impact operational visibility and compliance retention for regulated log data.
Through a combination of receiver tuning, encoding optimization, and transport-level adjustments, Bindplane helped the team scale to 30,000 EPS per partition — a 150% improvement in throughput — while stabilizing ingestion and clearing the backlog in under 48 hours.
Problem
The healthcare provider’s telemetry pipeline had reached its scalability ceiling.
Despite adding Kafka partitions and collector instances, overall throughput remained flat. The OpenTelemetry Kafka receiver was the bottleneck — unable to keep pace with incoming messages, it created growing consumer lag and delayed visibility across multiple downstream systems.
Challenge
The customer needed to increase throughput to meet production demand without overprovisioning infrastructure or introducing instability.
Performance issues within the OpenTelemetry Collector’s Kafka receiver limited scalability, and the root causes weren’t obvious.
Symptoms:
- Kafka consumer lag growing faster than ingestion
- Inconsistent CPU and memory utilization
- High serialization overhead from OTLP JSON
- Export protocol bottlenecks under sustained load
Requirements
To restore stability and scale efficiently, the customer required:
- Sustained 25–30K EPS per partition without data loss
- Compatibility with existing OpenTelemetry Collector architecture
- Controlled, measurable optimization (no guesswork)
- A repeatable benchmarking framework to validate performance changes
Solution
Bindplane engineers collaborated directly with the customer’s observability team to run a multi-week benchmarking effort.
Each configuration change was isolated, tested, and measured against real production workloads using OpenTelemetry-native tools.
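The case study does not publish the benchmarking harness itself, but the Collector's own internal telemetry is enough to make each change measurable. A minimal sketch, assuming a recent contrib distribution (the listen address below is illustrative, and newer Collector versions configure Prometheus readers instead of `address`):

```yaml
service:
  telemetry:
    metrics:
      # Detailed level exposes per-receiver and per-exporter counters on the
      # Collector's internal Prometheus endpoint (default port 8888).
      level: detailed
      address: 0.0.0.0:8888   # deprecated in newer versions in favor of readers
```

Comparing the rates of `otelcol_receiver_accepted_log_records` and `otelcol_exporter_sent_log_records` before and after each change, alongside Kafka consumer lag for the Collector's consumer group, yields a per-change EPS figure without adding any external tooling.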
Optimization steps and results (a hedged configuration sketch follows this list):
- Batch Processor Placement: Moving batching earlier in the pipeline reduced per-event processing overhead. Throughput: 12K → 17K EPS (+41%)
- Franz-Go Kafka Client: Replacing the default Kafka client with franz-go, a high-performance pure-Go Kafka library. Throughput: 17K → 23K EPS (+35%)
- Encoding Fix (Breakthrough): Switching from OTLP JSON, which was performing hidden format conversions, to raw JSON. Throughput: 23K → 30K EPS (+30%)
- Transport Protocol Optimization: Moving the export path from OTLP/gRPC to OTLP over HTTPS to reduce serialization overhead. Throughput: roughly +3K EPS, with improved stability under sustained load
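The customer's actual configuration is not shown in this case study, but a minimal sketch of where these changes land in a Collector config, assuming the contrib distribution's kafka receiver and an OTLP/HTTP backend, looks roughly like the following. Broker addresses, topic, endpoint, and batch sizes are placeholders, and key names vary somewhat between Collector versions:

```yaml
receivers:
  kafka:
    brokers: ["kafka-1:9092", "kafka-2:9092"]   # placeholder brokers
    topic: app-logs                             # placeholder topic
    group_id: otel-collector
    # "json" parses plain JSON log lines directly, avoiding the hidden
    # conversions observed with the OTLP JSON encoding in this engagement.
    encoding: json
    # Whether the franz-go client is the default or sits behind a feature gate
    # depends on the Collector version; check the kafkareceiver README.

processors:
  batch:
    send_batch_size: 8192   # illustrative sizes; tune against your own benchmarks
    timeout: 200ms
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert

exporters:
  # OTLP over HTTP(S) rather than gRPC gave the final throughput bump and
  # steadier behavior under sustained load.
  otlphttp:
    endpoint: https://otlp-gateway.example.com:4318   # placeholder endpoint

service:
  pipelines:
    logs:
      receivers: [kafka]
      processors: [batch, resource]   # batch runs first, ahead of other processing
      exporters: [otlphttp]
```

Placing `batch` ahead of the other processors means downstream components and the exporter handle fewer, larger payloads, which is what drove the first jump from 12K to 17K EPS per partition.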
Outcome:
The pipeline scaled from 192K to 480K EPS in aggregate throughput (30K EPS across each of the 16 partitions), a 150% performance gain.
Backlogs were cleared within 48 hours, restoring real-time visibility across the environment.
Future
The healthcare provider’s telemetry pipeline is now positioned for continued growth with:
- A clear performance benchmarking framework
- Verified Kafka receiver optimizations shared upstream with the OpenTelemetry community
- Plans to adopt future enhancements, including exporter-side batching and HA collector tuning (a sketch of exporter-side batching follows this list)
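Exporter-side batching is still evolving in the Collector, so the sketch below is an assumption about its eventual shape rather than a tested configuration: the `sending_queue` keys are stable, while the `batcher` keys are version-dependent and should be checked against the documentation for the version in use.

```yaml
exporters:
  otlphttp:
    endpoint: https://otlp-gateway.example.com:4318   # placeholder endpoint
    sending_queue:
      enabled: true
      num_consumers: 8
      queue_size: 10000
    # Experimental exporterhelper batching; key names vary by Collector version.
    batcher:
      enabled: true
      flush_timeout: 200ms
      min_size_items: 2000
      max_size_items: 8000
```

Batching at the exporter, after any routing or fan-out, would eventually let the team shrink or drop the pipeline-level batch processor.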
Bindplane continues to collaborate with the OpenTelemetry community to make these performance improvements available to everyone running OpenTelemetry at scale.
Conclusion
This engagement showcases how deep telemetry expertise and open-source collaboration can turn a critical production crisis into a long-term performance win.
By taking a data-driven approach to root cause analysis, the healthcare provider not only restored stability but also set a new internal performance baseline for Kafka-based ingestion.
The outcome highlights a broader truth: in modern observability, scaling isn’t just about infrastructure — it’s about precision engineering.
Bindplane’s OpenTelemetry-native approach allowed this customer to achieve 150% higher throughput, reduced ingest lag, and stronger resilience — all without rewriting or replacing their existing stack.
For organizations running telemetry pipelines at scale, this case proves what’s possible when open-source standards meet production-grade optimization.
