
Resilience with Zero Data Loss in High-Volume Telemetry Pipelines with OpenTelemetry and Bindplane

When processing massive volumes of telemetry data, ensuring every signal is delivered exactly once becomes critical. Imagine shipping a million logs only to discover half of them vanished—or worse, showed up twice.

Andy Keller

This was exactly the problem one Bindplane customer hit while processing enormous S3-stored log files. Our engineering team tackled it head-on, enhancing the S3 event receiver with offset tracking and proving the changes out with chaos-style failure testing.

Implementing Offsets for Resilient Data Processing

Offsets are the foundation of resilient data processing, tracking exactly which portions of a file have been successfully processed. When processing large files containing thousands or millions of logs, the system breaks these into manageable chunks. As each chunk is successfully sent, the offset is updated to reflect progress. Depending on the data type, offsets might represent byte positions in the file or simply a count of processed records.

For example, when processing a file with 100,000 logs in chunks of 1,000, the system updates the offset after each successful chunk transmission. If a failure occurs after processing 30,000 logs, the system can resume from position 30,001 rather than reprocessing from the beginning. This prevents both data duplication and data loss.
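To make the idea concrete, here is a minimal Go sketch of chunked processing with a committed offset. The Checkpoint type, chunkSize constant, and send callback are illustrative stand-ins rather than the actual receiver code; the key point is that the offset only advances after a chunk has been acknowledged.

```go
package main

import "fmt"

// Checkpoint records how far into a file we have successfully processed.
// Illustrative only; the real receiver persists richer state per S3 object.
type Checkpoint struct {
	File   string
	Offset int // number of records already sent
}

const chunkSize = 1000

// processFile resumes from the stored offset and advances it only after
// a chunk has been acknowledged by the next pipeline stage.
func processFile(records []string, cp *Checkpoint, send func([]string) error) error {
	for start := cp.Offset; start < len(records); start += chunkSize {
		end := start + chunkSize
		if end > len(records) {
			end = len(records)
		}
		if err := send(records[start:end]); err != nil {
			// The offset still points at the failed chunk, so a retry
			// (by this collector or another) picks up exactly here.
			return fmt.Errorf("chunk starting at %d failed: %w", start, err)
		}
		cp.Offset = end // commit progress only after success
	}
	return nil
}

func main() {
	records := make([]string, 100000)
	cp := &Checkpoint{File: "example.json.gz"}
	_ = processFile(records, cp, func(chunk []string) error { return nil })
	fmt.Println("processed up to offset", cp.Offset)
}
```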

We implemented offset storage using the OpenTelemetry Collector storage extension API, specifically leveraging the Redis Storage Extension for environments with multiple load-balanced collectors in Kubernetes. This allows all collectors in a cluster to share offset information, ensuring that if one collector fails mid-processing, another can seamlessly continue from the last successful offset position.
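The sketch below models the shape of that shared checkpoint store with a small local interface rather than importing the collector's storage package, so the KVStore interface, key format, and helper names here are assumptions for illustration. In a real deployment the store would be backed by the Redis storage extension, letting any collector in the cluster load the last committed offset and resume.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
)

// KVStore mirrors the general shape of the collector storage extension's
// key-value client. In production this would be the Redis-backed store,
// so every collector in the cluster sees the same offsets.
type KVStore interface {
	Get(ctx context.Context, key string) ([]byte, error)
	Set(ctx context.Context, key string, value []byte) error
}

type Checkpoint struct {
	File   string `json:"file"`
	Offset int    `json:"offset"`
}

// loadCheckpoint returns the last committed offset for an S3 object,
// or a zero checkpoint if no collector has touched the object yet.
func loadCheckpoint(ctx context.Context, store KVStore, objectKey string) (Checkpoint, error) {
	raw, err := store.Get(ctx, "offset:"+objectKey)
	if err != nil || raw == nil {
		return Checkpoint{File: objectKey}, err
	}
	var cp Checkpoint
	return cp, json.Unmarshal(raw, &cp)
}

// saveCheckpoint commits progress after a chunk is acknowledged, so another
// collector can resume from this point if this one dies mid-file.
func saveCheckpoint(ctx context.Context, store KVStore, cp Checkpoint) error {
	raw, err := json.Marshal(cp)
	if err != nil {
		return err
	}
	return store.Set(ctx, "offset:"+cp.File, raw)
}

// memStore is an in-memory stand-in that keeps the sketch runnable.
type memStore map[string][]byte

func (m memStore) Get(_ context.Context, k string) ([]byte, error) { return m[k], nil }
func (m memStore) Set(_ context.Context, k string, v []byte) error { m[k] = v; return nil }

func main() {
	ctx := context.Background()
	store := memStore{}
	_ = saveCheckpoint(ctx, store, Checkpoint{File: "AWSLogs/123/trail.json.gz", Offset: 30000})
	cp, _ := loadCheckpoint(ctx, store, "AWSLogs/123/trail.json.gz")
	fmt.Println("resume from record", cp.Offset)
}
```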

Testing Resilience with Controlled Failure Injection

To validate our offset implementation worked correctly under real-world conditions, we developed a Random Failure Processor in the BDOT Collector. This processor intentionally injects failures into the telemetry pipeline at a configurable rate, allowing us to simulate various failure scenarios without waiting for them to occur naturally in production.

Inspired by Netflix's Chaos Monkey methodology, this approach intentionally introduces controlled chaos into the system to prove its resilience. By dialing the failure rate up to extreme levels (even 50%), we could verify that our offset tracking and retry mechanisms functioned correctly under severe conditions.

The Random Failure Processor is simple in implementation but powerful for testing. It allowed us to confirm that even with unreasonably high failure rates, the pipeline eventually processed all data without duplications or omissions. This testing methodology provides confidence that the system will handle real-world intermittent failures gracefully.
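As a rough illustration of the idea, here is a minimal Go sketch of a pass-through stage that returns a retryable error for a configurable fraction of batches. The randomFailure type and rate field are hypothetical names, not the BDOT processor's actual configuration.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// randomFailure wraps the next stage of a pipeline and fails a configurable
// fraction of batches, forcing the retry and offset logic to prove itself.
type randomFailure struct {
	rate float64 // e.g. 0.5 fails roughly half of all batches
	next func([]string) error
}

func (p *randomFailure) consume(batch []string) error {
	if rand.Float64() < p.rate {
		// A retryable error: upstream keeps the offset where it was and
		// resends the same chunk, so nothing is lost or duplicated.
		return errors.New("injected random failure")
	}
	return p.next(batch)
}

func main() {
	delivered := 0
	p := &randomFailure{rate: 0.5, next: func(b []string) error { delivered += len(b); return nil }}

	batch := make([]string, 1000)
	// Retry each chunk until it gets through, as the real pipeline would.
	for sent := 0; sent < 10; {
		if err := p.consume(batch); err == nil {
			sent++
		}
	}
	fmt.Println("delivered", delivered, "records despite a 50% failure rate")
}
```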

Optimizing for Scale with JSON Stream Parsing

Processing CloudTrail logs presented another challenge due to their massive size, approximately 150MB compressed and 1GB uncompressed, containing around a million log records in a single file. Traditional approaches would load the entire JSON structure into memory before processing, consuming gigabytes of RAM throughout the operation.

We implemented a streaming approach using Go's JSON library to process these files token by token. Rather than loading the entire file at once, the system reads the opening structure, identifies the records array, and then processes each log entry individually. This allows us to create payloads of 1,000 logs, send them, and garbage collect the memory before moving to the next chunk. This memory-efficient approach enables easier processing of extremely large files.
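The following Go sketch shows the token-by-token pattern using the standard encoding/json decoder on a CloudTrail-style document ({"Records": [...]}); the chunk size and flush callback are illustrative stand-ins, not the receiver's actual internals.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// streamRecords walks a CloudTrail-style document ({"Records":[...]})
// token by token, decoding one record at a time instead of loading the
// whole ~1GB structure into memory. The flush callback stands in for
// building and sending a 1,000-log payload.
func streamRecords(r io.Reader, chunkSize int, flush func([]map[string]any) error) error {
	dec := json.NewDecoder(r)

	// Advance to the start of the Records array.
	for {
		tok, err := dec.Token()
		if err != nil {
			return err
		}
		if key, ok := tok.(string); ok && key == "Records" {
			if _, err := dec.Token(); err != nil { // consume the opening '['
				return err
			}
			break
		}
	}

	chunk := make([]map[string]any, 0, chunkSize)
	for dec.More() {
		var rec map[string]any
		if err := dec.Decode(&rec); err != nil {
			return err
		}
		chunk = append(chunk, rec)
		if len(chunk) == chunkSize {
			if err := flush(chunk); err != nil {
				return err
			}
			// Start a fresh slice so the previous payload can be garbage collected.
			chunk = make([]map[string]any, 0, chunkSize)
		}
	}
	if len(chunk) > 0 {
		return flush(chunk)
	}
	return nil
}

func main() {
	doc := `{"Records":[{"eventName":"PutObject"},{"eventName":"GetObject"},{"eventName":"DeleteObject"}]}`
	total := 0
	_ = streamRecords(strings.NewReader(doc), 2, func(c []map[string]any) error {
		total += len(c)
		return nil
	})
	fmt.Println("streamed", total, "records")
}
```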

Parsing Avro for Structured Log Ingestion

In addition to JSON, Bindplane also supports parsing logs stored in Apache Avro Object Container File (OCF) format. This format is commonly used for structured, schema-defined event data written to object storage. The BDOT Collector includes an Avro OCF parser that reads the embedded schema, iterates through each record, and emits logs in a consistent key-value format. This enables native ingestion of Avro-encoded telemetry from AWS S3 without requiring external decoding steps.
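For illustration, here is a small Go sketch that reads an OCF file and prints each record's fields as key-value pairs, using the community goavro library; the BDOT parser's actual implementation and output format may differ.

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/linkedin/goavro/v2"
)

// Reads an Avro Object Container File and emits each record as key-value
// pairs, similar in spirit to what the collector's Avro parser produces.
func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: avrodump <file.avro>")
	}
	f, err := os.Open(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// The OCF header carries the writer schema, so no external schema
	// or separate decoding step is needed.
	ocfr, err := goavro.NewOCFReader(f)
	if err != nil {
		log.Fatal(err)
	}

	for ocfr.Scan() {
		rec, err := ocfr.Read() // Avro records typically decode to map[string]interface{}
		if err != nil {
			log.Fatal(err)
		}
		if fields, ok := rec.(map[string]interface{}); ok {
			for k, v := range fields {
				fmt.Printf("%s=%v ", k, v)
			}
			fmt.Println()
		}
	}
	if err := ocfr.Err(); err != nil {
		log.Fatal(err)
	}
}
```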

Impact and Future Plans

Building resilient telemetry pipelines requires both thoughtful implementation and rigorous testing. Offset tracking ensures no data is lost or duplicated during processing, while the Random Failure Processor testing methodology provides confidence in the pipeline's ability to handle failure. Together, these improvements allow massive volumes of telemetry data to be processed reliably, even in environments where failures are inevitable, such as the S3 event receiver workload our customer was facing.

As part of our commitment to the OpenTelemetry community, we're working to contribute all three of these improvements upstream: Avro and JSON stream parsing, the Random Failure Processor, and the S3 event receiver offset implementation. We hope these contributions will improve the OpenTelemetry Collector and, in turn, help the community build more resilient telemetry pipelines.

Want to give Bindplane a try? Spin up a free instance of Bindplane Cloud and hit the ground running right away.
