
Kafka Performance Crisis: How We Scaled OpenTelemetry Log Ingestion by 150%

Boosting Throughput from 192K to 480K EPS Across 16 Partitions.

Dakota Paasman
I want to give a huge shoutout to Travis and Denton for helping me benchmark and test the Kafka receiver, as well as fine-tune the content of this blog post. ❤️

When your telemetry pipeline starts falling behind, the countdown to production impact has already begun.

One Bindplane customer operating a large-scale log ingestion pipeline built on the OpenTelemetry Collector and Kafka hit that breaking point. Instead of keeping pace with incoming data, their pipeline was ingesting just 12,000 events per second (EPS) per partition/collector—and this Kafka topic had 16 partitions. In aggregate, that was roughly 192K EPS.

After a multi-week performance triage, we scaled that number to 30,000 EPS per partition—a total of ~480K EPS—representing a 150% improvement in throughput.

Consumer lag had been growing—not linearly, but exponentially—and every hour that passed pushed the backlog further out of reach. Without intervention, critical logs risked being delayed or lost entirely.

What followed was a coordinated, data-driven optimization effort that uncovered misconfigurations, architectural trade-offs, and hidden performance ceilings. The result? A stable 30K EPS per partition, backlog cleared in under 48 hours, and a pipeline ready for sustained scale.

Outlining the Kafka Receiver Problem

By the time our team got involved, the Kafka receiver in this customer’s environment was consistently underperforming:

  • Baseline throughput: ~12K EPS
  • Target requirement: 25–28K EPS
  • Impact: Backlog growing daily, with risk of delayed or lost telemetry

The Kafka consumer lag graph looked like a hockey stick. Left unchecked, this meant delayed visibility, compromised incident response, and potential compliance issues for retained logs.

Performance Testing the Kafka Receiver

We approached this like a root cause hunt, not guesswork:

  • Parameter Isolation: Only one config change tested at a time
  • Load Simulation: Reproduced production traffic volumes in a controlled testbed
  • Profiling: CPU, memory, and throughput monitored at each pipeline stage
  • Comparative Benchmarking: Tested multiple Kafka clients, encoding types, transport protocols, and batching placements

Each run was documented in a performance matrix to track the impact of every permutation.

Fix #1 – Batching Strategy

Batch processor placement had a measurable impact.

Configurations Tested:

  1. Early Batching: Receiver → Batch → Processors
    • Works best for high-volume, low-complexity pipelines
    • Reduces per-record overhead early
  2. Late Batching: Receiver → Processors → Batch
    • Better for complex pipelines with filtering/transformation
    • Avoids batch-wide operations on data that’s later dropped

Note: The batch processor will eventually be replaced by exporter-side batching, but for now, correct placement can still yield gains.
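
To make the two placements concrete, here is a minimal sketch in Collector YAML. The broker address, topic name, batch sizes, transform processor, and exporter endpoint are illustrative assumptions, not the customer's actual configuration; the only point is the ordering of the processors list.

receivers:
  kafka:
    brokers: ["kafka-broker:9092"]    # assumed broker address
    topic: app-logs                   # assumed topic name

processors:
  batch:
    send_batch_size: 8192             # illustrative values; tune for your record sizes
    timeout: 200ms
  transform/parse:                    # stand-in for whatever per-record processing you run
    log_statements:
      - context: log
        statements:
          - set(attributes["pipeline"], "kafka-logs")

exporters:
  otlphttp:
    endpoint: https://backend.example.com:4318   # assumed destination

service:
  pipelines:
    # Early batching (what worked here): batch immediately after the receiver
    logs:
      receivers: [kafka]
      processors: [batch, transform/parse]
      exporters: [otlphttp]
    # Late batching alternative: move batch to the end of the processors list
    # logs:
    #   receivers: [kafka]
    #   processors: [transform/parse, batch]
    #   exporters: [otlphttp]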

(Figure: batch processor placement in the pipeline)

For this particular use case, we changed the location of the Batch processor. Putting it at the beginning of the pipeline increased the throughput from 12K to ~17K per partition.

Impact in our testing:

  • Before: 12K EPS
  • After: ~17K EPS

✅ +41% throughput gain, but consumer lag was still increasing.

Fix #2 – Enable the Franz-Go Kafka Client

The default Kafka client in the OpenTelemetry receiver wasn’t scaling well.

Franz-Go—a high-performance, pure-Go Kafka client—was available behind a feature gate and could be enabled at runtime:

--feature-gates=receiver.kafkareceiver.UseFranzGo

Huge shoutout to Marc for starting the discussion about Kafka receiver performance and for volunteering to add the Franz-Go client as a feature gate!

You can opt in to use the Franz-Go client by enabling the above feature gate when you run the OpenTelemetry Collector.

Additional Capabilities with Franz-Go:

  • Supports consuming from multiple topics via regex expressions
  • To enable regex topic consumption, prefix your topic name with ^
  • This behavior matches the librdkafka client’s implementation
  • If any topic in the deprecated topic setting has the ^ prefix, regex consuming will be enabled for all topics in that configuration
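
As a rough sketch of what regex topic consumption looks like (broker address, topic pattern, and consumer group are placeholders), assuming the Collector is started with the feature gate shown above:

# Start the Collector with: --feature-gates=receiver.kafkareceiver.UseFranzGo
receivers:
  kafka:
    brokers: ["kafka-broker:9092"]    # assumed broker address
    group_id: otel-collector          # assumed consumer group
    topic: "^app-logs-.*"             # leading ^ enables regex matching across topics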

Impact in our testing:

  • Before: 17K EPS
  • After: ~23K EPS

✅ +35% throughput gain, but consumer lag was still increasing! 🤔

We've shared these findings with the Kafka receiver maintainers. The Franz-Go client was clearly more efficient for us, and switching to it proved that the choice of Kafka client matters for high-throughput ingestion, but it wasn't enough to resolve the backlog on its own.

Fix #3 – Encoding (The Breakthrough)

This was the single largest gain.

The Problem:

The pipeline was using OTLP JSON encoding for logs that weren’t actually OTLP-formatted. This “worked” only because the receiver was performing costly format conversion in the background—burning CPU cycles and throttling throughput.

Encoding Performance Comparison:

(Figure: Kafka receiver encoding performance, OTLP JSON vs. standard JSON)

Impact:

Switching to standard JSON immediately pushed throughput to 30K EPS when combined with Franz-Go. The backlog began shrinking within minutes.
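
In the receiver configuration, that change comes down to a single field. A minimal sketch, with broker and topic names as placeholders:

receivers:
  kafka:
    brokers: ["kafka-broker:9092"]
    topic: app-logs
    # Before: treated plain JSON logs as OTLP JSON, forcing an expensive conversion
    # encoding: otlp_json
    # After: decode the records as the plain JSON they actually are
    encoding: json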

Impact in our testing:

  • Before: 23K EPS
  • After: ~30K EPS

✅ +30% throughput gain. The consumer lag finally started dropping!

Fix #3.5 – Transport Protocol Optimization

Destination and export protocol also mattered.

Findings:

  • /dev/null: Highest possible throughput (benchmark baseline)
  • SecOps backend: ~2K EPS drop due to downstream processing
  • HTTPS export: ~3K EPS faster than gRPC for this workload

Theory:

Converting from JSON to the gRPC protobuf format introduced significant overhead, adding serialization cost on top of gRPC’s connection and protocol management. This double hit to performance made gRPC slower under sustained load. Switching to HTTPS avoided the conversion step entirely, resulting in a 3K EPS gain with minimal downside for stability.
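
For reference, the exporter-side swap looks roughly like this, assuming an OTLP-capable backend (only the exporter and pipeline sections are shown; the customer's actual SecOps exporter and endpoint will differ):

exporters:
  # otlp:                                        # gRPC export (what we moved away from here)
  #   endpoint: backend.example.com:4317
  otlphttp:
    endpoint: https://backend.example.com:4318   # HTTP export, ~3K EPS faster in this workload

service:
  pipelines:
    logs:
      receivers: [kafka]
      processors: [batch]
      exporters: [otlphttp]                      # switched from otlp (gRPC) to otlphttp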

Lessons & Takeaways

Even after these gains, we observed suboptimal CPU and memory utilization during high-volume ingestion. This indicates the receiver may not be fully leveraging the resources available to it. Some of this could be addressed through further configuration tuning—both in the receiver and in Kafka itself—but it may also point to inherent limitations in the current receiver design.

In this case, the improvement was from 12K → 30K EPS per partition/collector. The Kafka topic had 16 partitions, meaning we effectively scaled from ~192K EPS (12K × 16) to ~480K EPS (30K × 16) in aggregate throughput.

Pushing beyond 30–40K EPS will likely require a combination of:

  • Further OpenTelemetry Collector configuration and high-availability tuning
  • Kafka configuration and partition tuning

Best Practices

Early Batching:

Works best for high-volume, low-complexity pipelines, as in this use case with the Kafka receiver.

Encoding Selection:

Match the encoding to your data's actual format. Using otlp_json for logs that aren't OTLP forces a costly conversion; switching to plain json was the breakthrough in this case.

(Figure: Kafka log encoding comparison)

Feature Gate:

Use the Franz-Go Client.

--feature-gates=receiver.kafkareceiver.UseFranzGo

Performance Testing Framework:

  • Simulate production traffic
  • Track EPS, lag, CPU, memory
  • Revert quickly if regressions appear
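
One low-effort way to get the Collector-side numbers is its own internal telemetry, scraped by whatever you already use for metrics. A minimal sketch follows; the exact keys vary a bit across Collector versions, and Kafka consumer lag is still best read from the Kafka side.

service:
  telemetry:
    metrics:
      level: detailed          # exposes per-component counts you can turn into EPS
      address: 0.0.0.0:8888    # Prometheus scrape endpoint; newer versions configure this via a readers block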

Giving Back to the OpenTelemetry Community

Misconfigured encoding or reliance on defaults could be silently throttling many enterprise OpenTelemetry deployments. We shared our findings with the Kafka receiver maintainers to help guide further development of the Franz-Go client and clearer encoding documentation.

Closing Thoughts

This wasn’t a single config tweak—it was a coordinated, data-driven rescue of a failing ingestion pipeline.

Through systematic tuning, we improved throughput from 12K to 30K EPS per partition/collector—scaling from ~192K EPS to ~480K EPS in aggregate across 16 partitions. That’s a 150% increase in throughput.

Backlog cleared, stability restored, and the pipeline is now capable of sustaining production load without falling behind.

If you’re running OpenTelemetry at scale, take a close look at your client library, encoding, and pipeline architecture—because those defaults might be costing you more than you realize.
