
How to Build Resilient Telemetry Pipelines with the OpenTelemetry Collector: High Availability and Gateway Architecture

Do you remember when humans used to write step-by-step tutorials?

Adnan Rahic

remember-when-humans-used-to-write-tutorials

Let’s bring that back. Today you’ll learn how to configure high availability for the OpenTelemetry Collector so you don’t lose telemetry during node failures, rolling upgrades, or traffic spikes. The guide covers both Docker and Kubernetes, with hands-on config demos.

But first, let’s lay some groundwork.

How do you define High Availability (HA) with the OpenTelemetry Collector?

You want telemetry collection and processing to keep working even if individual Collector instances fail. It comes down to three main points:

  • Avoid data loss when exporting to a dead observability backend.
  • Ensure telemetry continuity during rolling updates or infrastructure failures.
  • Enable horizontal scalability for load-balancing traces, logs, and metrics.

To enable high availability, it’s recommended that you use the Agent-Gateway deployment pattern. This means:

  • Agent Collectors run on every host, container, or node.
  • Gateway Collectors are centralized, scalable back-end services receiving telemetry from Agent Collectors.
  • Each layer can be scaled independently and horizontally.
agent-gateway-deployment-pattern-otel-col

Please note, an Agent Collector and a Gateway Collector are essentially the same binary. They’re completely identical; the ONLY difference is WHERE each one runs. Think of it this way:

  • An Agent Collector runs close to the workload. In Kubernetes, that could be a sidecar or a deployment per namespace; in Docker, a service alongside your app in the docker-compose.yaml. This typically means the dev team owns this instance of the Collector.
  • A Gateway Collector is a central, standalone deployment of the Collector, think a standalone Collector in a dedicated namespace or even a dedicated Kubernetes cluster, typically owned by the platform team. It’s the final step of the telemetry pipeline, letting the platform team enforce policies like filtering logs, sampling traces, and dropping metrics before sending data to an observability backend (see the sketch below).
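
To make this concrete, here’s a minimal sketch of the same binary running with two different configs. The endpoints are placeholders I made up for illustration, not values used later in this guide:

yaml
# Agent Collector config: runs next to the workload and forwards everything to the gateway.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch:
exporters:
  otlp:
    endpoint: gateway.internal:4317   # placeholder address of the Gateway Collector
    tls:
      insecure: true
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
---
# Gateway Collector config: the same binary deployed centrally, enforcing policy and
# exporting to the observability backend.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch:
exporters:
  otlp:
    endpoint: backend.example.com:4317   # placeholder observability backend
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]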

Here’s an awesome explanation on StackOverflow. Yes, it’s still a thing. No, not everything is explained by AI. 😂

To satisfy all of these high availability requirements, I’ll walk you through how to configure:

  • Multiple Collector Instances. Each instance is capable of handling the full workload with redundant storage for temporary data buffering.
  • A Load Balancer. It’ll distribute incoming telemetry data and maintain consistent routing. Load balancers also support automatic failover if a collector becomes unavailable.
  • Shared Storage. Persistent storage for collector state and configuration management.

Now it’s time to get our hands dirty with some code.

Configure Agent-Gateway High Availability (HA) with the OpenTelemetry Collector

Let me first explain the concept using Docker and visualize it with Bindplane. This architecture transfers to any Linux or Windows VM setup as well. More on Kubernetes further below.

There are three options. You can use a load balancer like Nginx or Traefik, you can use the loadbalancing exporter that ships with the Collector, or, if you’re fully committed to a containerized environment, you can use native Kubernetes load balancing with services and a horizontal pod autoscaler.

Nginx Load Balancer

The Nginx option is the simpler, out-of-the-box solution.

I’ll set up the architecture with:

  • Three Gateway Collectors in parallel
  • One Nginx load balancer
  • One Agent Collector configured to generate telemetry (app simulation)
load-balancing-otel-col-with-nginx

This structure is the bare-bones minimum you’ll end up with. Note that it uses three separate services for the gateway collectors. The reason is that each collector needs its own file_storage path to store data in its persistent queue. In Docker, that means making sure each container gets a unique volume. Let me explain how that works.

Copy the content below into a docker-compose.yaml.

yaml
1version: '3.8'
2
3volumes:
4  gw1-storage:      # persistent queue for gateway-1
5  gw2-storage:      # persistent queue for gateway-2
6  gw3-storage:      # persistent queue for gateway-3
7  telgen-storage:   # persistent queue for telemetry generator
8  external-gw-storage: # persistent queue for external gateway
9
10services:
11  # ────────────── GATEWAYS (3×) ──────────────
12  gw1:
13    image: ghcr.io/observiq/bindplane-agent:1.79.2
14    container_name: gw1
15    hostname: gw1
16    command: ["--config=/etc/otel/config/config.yaml"]
17    volumes:
18      - ./config:/etc/otel/config
19      - gw1-storage:/etc/otel/storage  # 60 GiB+ queue
20    environment:
21      OPAMP_ENDPOINT: "wss://app.bindplane.com/v1/opamp"   # point to your Bindplane server
22      OPAMP_SECRET_KEY: "<secret>"
23      OPAMP_LABELS: ephemeral=true
24      MANAGER_YAML_PATH: /etc/otel/config/gw1-manager.yaml
25      CONFIG_YAML_PATH: /etc/otel/config/config.yaml
26      LOGGING_YAML_PATH: /etc/otel/config/logging.yaml
27
28  gw2:
29    image: ghcr.io/observiq/bindplane-agent:1.79.2
30    container_name: gw2
31    hostname: gw2
32    command: ["--config=/etc/otel/config/config.yaml"]
33    volumes:
34      - ./config:/etc/otel/config
35      - gw2-storage:/etc/otel/storage  # 60 GiB+ queue
36    environment:
37      OPAMP_ENDPOINT: "wss://app.bindplane.com/v1/opamp"   # point to your Bindplane server
38      OPAMP_SECRET_KEY: "<secret>"
39      OPAMP_LABELS: ephemeral=true
40      MANAGER_YAML_PATH: /etc/otel/config/gw2-manager.yaml
41      CONFIG_YAML_PATH: /etc/otel/config/config.yaml
42      LOGGING_YAML_PATH: /etc/otel/config/logging.yaml
43    
44  gw3:
45    image: ghcr.io/observiq/bindplane-agent:1.79.2
46    container_name: gw3
47    hostname: gw3
48    command: ["--config=/etc/otel/config/config.yaml"]
49    volumes:
50      - ./config:/etc/otel/config
51      - gw3-storage:/etc/otel/storage  # 60 GiB+ queue
52    environment:
53      OPAMP_ENDPOINT: "wss://app.bindplane.com/v1/opamp"   # point to your Bindplane server
54      OPAMP_SECRET_KEY: "<secret>"
55      OPAMP_LABELS: ephemeral=true
56      MANAGER_YAML_PATH: /etc/otel/config/gw3-manager.yaml
57      CONFIG_YAML_PATH: /etc/otel/config/config.yaml
58      LOGGING_YAML_PATH: /etc/otel/config/logging.yaml
59
60  # ────────────── OTLP LOAD-BALANCER ──────────────
61  otlp-lb:
62    image: nginx:1.25-alpine
63    volumes:
64      - ./nginx-otlp.conf:/etc/nginx/nginx.conf:ro
65    ports:
66      - "4317:4317"   # OTLP gRPC
67      - "4318:4318"   # OTLP HTTP/JSON
68    depends_on: [gw1, gw2, gw3]
69
70  # ────────────── TELEMETRY GENERATOR ──────────────
71  telgen:
72    image: ghcr.io/observiq/bindplane-agent:1.79.2
73    container_name: telgen
74    hostname: telgen
75    command: ["--config=/etc/otel/config/config.yaml"]
76    volumes:
77      - ./config:/etc/otel/config
78      - telgen-storage:/etc/otel/storage  # 60 GiB+ queue
79    environment:
80      OPAMP_ENDPOINT: "wss://app.bindplane.com/v1/opamp"   # point to your Bindplane server
81      OPAMP_SECRET_KEY: "<secret>"
82      OPAMP_LABELS: ephemeral=true
83      MANAGER_YAML_PATH: /etc/otel/config/telgen-manager.yaml
84      CONFIG_YAML_PATH: /etc/otel/config/config.yaml
85      LOGGING_YAML_PATH: /etc/otel/config/logging.yaml
86
87  # ────────────── EXTERNAL GATEWAY ──────────────
88  external-gw:
89    image: ghcr.io/observiq/bindplane-agent:1.79.2
90    container_name: external-gw
91    hostname: external-gw
92    command: ["--config=/etc/otel/config/external-gw-config.yaml"]
93    volumes:
94      - ./config:/etc/otel/config
95      - external-gw-storage:/etc/otel/storage  # 60 GiB+ queue
96    environment:
97      OPAMP_ENDPOINT: "wss://app.bindplane.com/v1/opamp"   # point to your Bindplane server
98      OPAMP_SECRET_KEY: "<secret>"
99      OPAMP_LABELS: ephemeral=true
100      MANAGER_YAML_PATH: /etc/otel/config/external-gw-manager.yaml
101      CONFIG_YAML_PATH: /etc/otel/config/external-gw-config.yaml
102      LOGGING_YAML_PATH: /etc/otel/config/logging.yaml

Open your Bindplane instance and click the Install Agent button.

overview-docker-services-with-nginx-load-balancer

Set the platform to Linux, since I’m demoing this with Docker, and hit next.

select-option-linux-for-install-agent-for-nginx

This screen now shows the environment variables you'll need to replace in the docker-compose.yaml.

agent-keys

Go ahead and replace the OPAMP_SECRET_KEY with your own secret key from Bindplane. If you’re using a self-hosted instance of Bindplane, replace the OPAMP_ENDPOINT as well. Use the values after the -e and -s flags, which represent the endpoint and the secret, respectively.
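
If you’d rather not hard-code the secret in the compose file at all, Docker Compose can substitute values from an .env file sitting next to docker-compose.yaml. A minimal sketch, assuming you put OPAMP_ENDPOINT and OPAMP_SECRET_KEY in that .env file:

yaml
# docker-compose.yaml (sketch): read the OpAMP settings from an .env file in the same
# directory instead of repeating the literal values in every service.
services:
  gw1:
    environment:
      OPAMP_ENDPOINT: ${OPAMP_ENDPOINT}       # e.g. wss://app.bindplane.com/v1/opamp
      OPAMP_SECRET_KEY: ${OPAMP_SECRET_KEY}   # the value after -s in the install command
      OPAMP_LABELS: ephemeral=true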

Create a nginx-otlp.conf file for the load balancer.

text
1worker_processes auto;
2events { worker_connections 1024; }
3
4stream {
5  upstream otlp_grpc {
6    server gw1:4317 max_fails=3 fail_timeout=15s;
7    server gw2:4317 max_fails=3 fail_timeout=15s;
8    server gw3:4317 max_fails=3 fail_timeout=15s;
9  }
10  server {
11    listen 4317;            # gRPC
12    proxy_pass otlp_grpc;
13    proxy_connect_timeout 1s;
14    proxy_timeout 30s;
15  }
16}
17
18http {
19  upstream otlp_http {
20    server gw1:4318 max_fails=3 fail_timeout=15s;
21    server gw2:4318 max_fails=3 fail_timeout=15s;
22    server gw3:4318 max_fails=3 fail_timeout=15s;
23  }
24  server {
25    listen 4318;            # HTTP/JSON
26    location / {
27      proxy_pass http://otlp_http;
28      proxy_next_upstream error timeout http_502 http_503 http_504;
29    }
30  }
31}

Create a ./config directory in the same root directory as your docker-compose.yaml, and create 3 files.

1> config/
2    config.yaml
3    telgen-config.yaml
4    logging.yaml

Paste this basic config into both config.yaml and telgen-config.yaml so the BDOT Collector has a base config to start with. I’ll then configure it remotely with Bindplane.

yaml
1receivers:
2  nop:
3processors:
4  batch:
5exporters:
6  nop:
7service:
8  pipelines:
9    metrics:
10      receivers: [nop]
11      processors: [batch]
12      exporters: [nop]
13  telemetry:
14    metrics:
15      level: none

And, a base setup for the logging.yaml.

yaml
1output: stdout
2level: info

Start the Docker Compose services.

bash
1docker compose up -d

Jump into Bindplane and create three configurations for:

  • telgen
  • otlp-lb-gw
  • external-gw
nginx-load-balanser-configs

The telgen configuration has a Telemetry Generator source.

telemetry-generator-source

And, an OTLP destination.

telemetry-generator-destination-otelcol-agent-mode

The OTLP destination is configured to send telemetry to the otlp-lb hostname, which is the hostname for the Nginx load balancer I’m running in Docker Compose.
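
Under the hood, that destination renders to a plain OTLP exporter in the agent’s config, roughly like this (the exact exporter name Bindplane generates may differ):

yaml
exporters:
  otlp:
    endpoint: otlp-lb:4317   # the Nginx service name from docker-compose.yaml
    tls:
      insecure: true         # assuming plain, unencrypted OTLP between containers in this demo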

Next, the otlp-lb-gw configuration has an OTLP source that listens on 0.0.0.0 and ports 4317 and 4318.

source-gateway-mode-collector-nginx-load-balancer
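
That source roughly translates to a standard OTLP receiver in the rendered collector config:

yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318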

The destination is also OTLP, but this time sending to the external-gw hostname.

destination-gatewa-mode-collector-nginx-load-balancer

Finally, the external-gw configuration is again using an identical OTLP source.

source-external-gateway-nginx-load-balancer

And, a Dev Null destination.

destination-external-gateway-nginx-load-balancer

This setup enables you to drop in whatever destination you want in the list of destinations for the external-gw configuration. Go wild! 😂

If you open the processor node for the Dev Null destination, you’ll see logs flowing through the load balancer.

snapshop-telemetry-nginx-load-balancer-external-gateway

While in the otlp-lb-gw configuration, if you open a processor node, you’ll see evenly distributed load across all three collectors.

gateway-collectors-nginx-load-balancer

That’s how you load balance telemetry across multiple collectors with Nginx.

If you would rather apply these configs via the Bindplane CLI, get the files on GitHub, here.

Load Balancing Exporter

The second option is to use the dedicated loadbalancing exporter in the collector. With this exporter you can specify multiple downstream collectors that will receive the telemetry traffic equally.

otel-col-loadbalancing-exporter-diagram

One quick note about the loadbalancing exporter before we start. You don’t always need it. Its main job is to make sure spans from the same trace stick together and get routed to the same backend collector. That’s useful for distributed tracing with sampling. But if you’re just shipping logs and metrics, or even traces without fancy sampling rules, you can probably skip it and stick with Nginx.
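
If you do need that trace affinity, the exporter’s routing_key setting is what keeps all spans of a trace on the same gateway. A minimal sketch, reusing the gw1/gw2/gw3 hostnames from this demo:

yaml
exporters:
  loadbalancing:
    routing_key: traceID     # route all spans of the same trace to the same backend collector
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      static:
        hostnames:
          - gw1:4317
          - gw2:4317
          - gw3:4317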

I’ll set up the architecture just as I did above, but with yet another collector instead of the Nginx load balancer:

  • Three Gateway Collectors in parallel
  • One Gateway Collector using the loadbalancing exporter
  • One Agent Collector configured to generate telemetry (app simulation)

This behaves identically to the Nginx load balancer, but with one less step and less configuration overhead. There’s no need to configure and run Nginx or manage Nginx-specific files; instead, you run one more instance of the collector with a trusty collector config.yaml that you’re already familiar with.

The drop-in replacement for the use case above is as follows. In the docker-compose.yaml, replace the otlp-lb Nginx service with another OpenTelemetry Collector service named lb.

yaml
1services:
2
3
4# ...
5
6  lb:
7    image: ghcr.io/observiq/bindplane-agent:1.79.2
8    container_name: lb
9    hostname: lb
10    command: ["--config=/etc/otel/config/lb-config.yaml"]
11    volumes:
12      - ./config:/etc/otel/config
13      - lb-storage:/etc/otel/storage
14    ports:
15      - "4317:4317"   # OTLP gRPC - external endpoint
16      - "4318:4318"   # OTLP HTTP/JSON - external endpoint
17    environment:
18      OPAMP_ENDPOINT: "wss://app.bindplane.com/v1/opamp"
19      OPAMP_SECRET_KEY: "<secret>"
20      OPAMP_LABELS: ephemeral=true
21      MANAGER_YAML_PATH: /etc/otel/config/lb-manager.yaml
22      CONFIG_YAML_PATH: /etc/otel/config/lb-config.yaml
23      LOGGING_YAML_PATH: /etc/otel/config/logging.yaml
24    depends_on: [gw1, gw2, gw3]
25
26# ...

Create a base lb-config.yaml for this collector instance in the ./config directory. Bindplane will update this remotely once you add a destination for the loadbalancing exporter.

yaml
1receivers:
2  nop:
3processors:
4  batch:
5exporters:
6  nop:
7service:
8  pipelines:
9    metrics:
10      receivers: [nop]
11      processors: [batch]
12      exporters: [nop]
13  telemetry:
14    metrics:
15      level: none

Go ahead and restart Docker Compose.

bash
1docker compose down
2docker compose up -d

This will start the new lb collector. In Bindplane, go ahead and create a new configuration called lb and add an OTLP source that listens on 0.0.0.0 and ports 4317 and 4318.

source-otlp-otelcol-load-balancer

Now, create a custom destination and paste the loadbalancing exporter configuration in the input field.

yaml
1loadbalancing:
2    protocol:
3      otlp:
4        tls:
5          insecure: true
6        timeout: 30s
7        retry_on_failure:
8          enabled: true
9          initial_interval: 5s
10          max_elapsed_time: 300s
11          max_interval: 30s
12        sending_queue:
13          enabled: true
14          num_consumers: 10
15          queue_size: 5000
16    resolver:
17      static:
18        hostnames:
19          - gw1:4317
20          - gw2:4317
21          - gw3:4317
destination-custom-loadbalancingexporter-otelcol-load-balancer

Note that the hostnames correspond to the hostnames of the gateway collectors configured in Docker Compose. Save this configuration and roll it out to the new lb collector. If you open the gw configuration in Bindplane and select a processor node, you’ll see telemetry flowing through all three gateway collector instances.

otelcol-loadbalancing-exporter-telemetry-flow

You’ll get an even nicer view of the split by checking the telemetry throughput across all collectors in the Agents view.

otelcol-loadbalancing-exporter-telemetry-flow-all-configs

The lb and external-gw collectors report the same throughput, while the three gateway collectors load balance the traffic equally between them.

The loadbalancing exporter behaves like a drop-in replacement for Nginx. I would call that a win: less configuration overhead, fewer moving parts, and no Nginx-specific configs to learn. Instead, you focus only on the collector.
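
One more option worth knowing about: instead of the static resolver, the loadbalancing exporter also ships a dns resolver that periodically re-resolves a single hostname, which is handy when the gateways sit behind one DNS record (for example a Kubernetes headless service). A sketch, assuming a hypothetical gateways.internal record that returns every gateway IP:

yaml
exporters:
  loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: gateways.internal   # hypothetical DNS name resolving to all gateways
        port: 4317
        interval: 30s                 # how often to re-resolve the record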

To get this sample up and running quickly, apply these configs via the Bindplane CLI; the files are on GitHub, here.

Since you now have a good understanding of how to configure OpenTelemetry Collector infrastructure for high availability, let's move into details about resilience specifically.

Building Resilience into Your Collector

When it comes to resilience, features like retry logic, persistent queues, and batching should be handled in the Agent Collectors. These are the instances sitting closest to your workloads; they’re most at risk of losing data if something goes wrong. The Agent’s job is to collect, buffer, and forward telemetry reliably, even when the backend is flaky or slow.

Here’s how you configure the OpenTelemetry Collector for resilience, so you don’t lose telemetry during network issues or telemetry backend outages:

  • Batching groups signals before export, improving efficiency.
  • Retry ensures failed exports are re-attempted. For critical workloads, increase max_elapsed_time to tolerate longer outages—but be aware this will increase the buffer size on disk.
  • Persistent Queue stores retries on disk, protecting against data loss if the Collector crashes. You can configure:
    • Number of consumers – how many parallel retry workers run
    • Queue size – how many batches are stored
    • Persistence – enables disk buffering for reliability

Retry & Persistent Queue

Luckily for you, Bindplane handles both retries and persistent queues out of the box for OTLP exporters.

Take a look at the telgen configuration. This is the collector we’re running in agent mode, simulating a bunch of telemetry traffic.

In the telgen-config.yaml, you'll see that the OTLP exporter otlp/lb is configured with both a persistent queue and retries.

yaml
1exporters:
2    otlp/lb:
3        compression: gzip
4        endpoint: gw:4317
5        retry_on_failure:
6            enabled: true
7            initial_interval: 5s
8            max_elapsed_time: 300s
9            max_interval: 30s
10        sending_queue:
11            enabled: true
12            num_consumers: 10
13            queue_size: 5000
14            storage: file_storage/lb
15        timeout: 30s
16        tls:
17            insecure: true

This is because the advanced settings for every OTLP exporter in Bindplane have this default configuration enabled.

resilience-otlp-exporter-advanced-settings

The persistent queue directory here is the storage directory that we configured by creating a volume in Docker.

yaml
1# docker-compose.yaml 
2
3...
4
5volumes:
6  gw1-storage:      # persistent queue for gateway-1
7  gw2-storage:      # persistent queue for gateway-2
8  gw3-storage:      # persistent queue for gateway-3
9  telgen-storage:   # persistent queue for telemetry generator
10  lb-storage:    # persistent queue for load-balancing gateway
11  external-gw-storage: # persistent queue for external gateway
12
13...

Bindplane then automatically configures a storage extension in the config and enables it like this:

yaml
1# telgen-config.yaml 
2
3...
4extensions:
5    file_storage/lb:
6        compaction:
7            directory: ${OIQ_OTEL_COLLECTOR_HOME}/storage
8            on_rebound: true
9        directory: ${OIQ_OTEL_COLLECTOR_HOME}/storage
10service:
11    extensions:
12        - file_storage/lb
13...

Note that the OIQ_OTEL_COLLECTOR_HOME environment variable is actually mapped to the /etc/otel directory.

Now your telemetry pipeline becomes resilient and HA-ready with data persistence to survive restarts, persistent queue buffering to handle temporary outages, and failover recovery to prevent data loss.

Batching

Batching is a whole other story, because you need to explicitly add a processor on the processor node in front of the destination for it to be enabled.

Agent-mode collectors should batch telemetry before sending it to the gateway collector. The OTLP receiver on the gateway side will receive batches and forward them to your telemetry backend of choice.

In the telgen configuration, click a processor node and add a batch processor.

resilience-batch-processor

This config will send a batch of telemetry signals every 200ms regardless of size, or as soon as a batch reaches 8192 items, whichever comes first. Applying this processor in Bindplane will generate a config like this:

yaml
1# telgen-config.yaml 
2
3...
4
5processors:
6    batch/lb: null
7    batch/lb-0__processor0:
8        send_batch_max_size: 0
9        send_batch_size: 8192
10        timeout: 200ms
11
12...

Kubernetes-native load balancing with HorizontalPodAutoscaler

Finally, after all the breakdowns, explanations, and diagrams, it’s time to show you what it would look like in the wild with a simple Kubernetes sample.

Kubernetes is the architecture preferred by both the Bindplane team and the OpenTelemetry community, and it will also maximize the benefits you get from Bindplane.

I’ll set up the architecture with:

  • One Agent-mode Collector running per node on the K8s cluster configured to generate telemetry (app simulation)
  • A Gateway Collector Deployment
    • Using a HorizontalPodAutoscaler scaling from 2 to 10 pods
    • And a ClusterIP service
    • Configured with persistent storage, sending queue, and retry
  • An external Gateway Collector running on another cluster acting as a mock telemetry backend

Luckily, getting all the K8s YAML manifests for the collectors is point-and-click in the Bindplane UI. However, you need to build the configurations first, before deploying the collectors to your K8s cluster.

For the sake of simplicity, I’ll show how to spin up two K8s clusters with kind and use them in this demo.

bash
1kind create cluster --name kind-2
2kind create cluster --name kind-1
3
4# make sure you set the context to the kind-1 cluster first
5kubectl config use-context kind-kind-1

Next, jump into Bindplane and create three configurations for:

  • telgen-kind-1
  • gw-kind-1
  • external-gw-kind-2
kubernetes-load-balancer-hpa-config-overview

The telgen-kind-1 configuration has a Custom source with a telemetrygeneratorreceiver.

yaml
1telemetrygeneratorreceiver:
2        generators:
3            - additional_config:
4                body: 127.0.0.1 - - [30/Jun/2025:12:00:00 +0000] \"GET /index.html HTTP/1.1\" 200 512
5                severity: 9
6              type: logs
7        payloads_per_second: 1
kubernetes-load-balancer-telemetry-generator

And, a Bindplane Gateway destination.

Note: This is identical to any OTLP destination.

kubernetes-load-balancer-bindplane-gateway-destination

The Bindplane Gateway destination is configured to send telemetry to the bindplane-gateway-agent.bindplane-agent.svc.cluster.local hostname, which is the hostname of the Bindplane Gateway Collector service in Kubernetes that you’ll start in a second.
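
In the rendered config, that destination boils down to an OTLP exporter pointed at the gateway Service’s cluster DNS name, roughly like this (the exact exporter name Bindplane generates may differ):

yaml
exporters:
  otlp:
    endpoint: bindplane-gateway-agent.bindplane-agent.svc.cluster.local:4317
    tls:
      insecure: true   # assuming plain OTLP gRPC inside the cluster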

The final step for this configuration is to click a processor node and add a batch processor.

kubernetes-telgen-batch

Next, the gw-kind-1 configuration has a Bindplane Gateway source that listens on 0.0.0.0 and ports 4317 and 4318.

kubernetes-load-balancer-gateway-source-cluster-1

The destination is OTLP, and sending telemetry to the IP address (172.18.0.2) and port (30317) of the external gateway running on the second K8s cluster.

Note: This might differ for your clusters. If you are using kind, like I am in this demo, the IP will be 172.18.0.2.

kubernetes-load-balancer-destination-external-gateway

Finally, the external-gw-kind-2 configuration is again using an OTLP source.

kubernetes-load-balancer-external-gateway-source

And, a Dev Null destination.

kubernetes-load-balancer-external-gateway-destination

Feel free to use the Bindplane CLI and these resources to apply all the configurations in one go without having to do it manually in the UI.

With the configurations created, you can install collectors easily by getting manifest files from your Bindplane account. Navigate to the install agents UI in Bindplane and select a Kubernetes environment. Use the Node platform and telgen-kind-1 configuration.

kubernetes-load-balancer-install-c

Clicking next will show a manifest file for you to apply in the cluster.

kubernetes-load-balancer-install-node-agent-2

Save this file as node-agent-kind-1.yaml. Here’s a sample of what it looks like. Or, see the full file on GitHub, here.

yaml
1---
2apiVersion: v1
3kind: Namespace
4metadata:
5  labels:
6    app.kubernetes.io/name: bindplane-agent
7  name: bindplane-agent
8---
9apiVersion: v1
10kind: ServiceAccount
11metadata:
12  labels:
13    app.kubernetes.io/name: bindplane-agent
14  name: bindplane-agent
15  namespace: bindplane-agent
16---
17apiVersion: rbac.authorization.k8s.io/v1
18kind: ClusterRole
19metadata:
20  name: bindplane-agent
21  labels:
22    app.kubernetes.io/name: bindplane-agent
23rules:
24- apiGroups:
25  - ""
26  resources:
27  - events
28  - namespaces
29  - namespaces/status
30  - nodes
31  - nodes/spec
32  - nodes/stats
33  - nodes/proxy
34  - pods
35  - pods/status
36  - replicationcontrollers
37  - replicationcontrollers/status
38  - resourcequotas
39  - services
40  verbs:
41  - get
42  - list
43  - watch
44- apiGroups:
45  - apps
46  resources:
47  - daemonsets
48  - deployments
49  - replicasets
50  - statefulsets
51  verbs:
52  - get
53  - list
54  - watch
55- apiGroups:
56  - extensions
57  resources:
58  - daemonsets
59  - deployments
60  - replicasets
61  verbs:
62  - get
63  - list
64  - watch
65- apiGroups:
66  - batch
67  resources:
68  - jobs
69  - cronjobs
70  verbs:
71  - get
72  - list
73  - watch
74- apiGroups:
75    - autoscaling
76  resources:
77    - horizontalpodautoscalers
78  verbs:
79    - get
80    - list
81    - watch
82---
83apiVersion: rbac.authorization.k8s.io/v1
84kind: ClusterRoleBinding
85metadata:
86  name: bindplane-agent
87  labels:
88    app.kubernetes.io/name: bindplane-agent
89roleRef:
90  apiGroup: rbac.authorization.k8s.io
91  kind: ClusterRole
92  name: bindplane-agent
93subjects:
94- kind: ServiceAccount
95  name: bindplane-agent
96  namespace: bindplane-agent
97---
98apiVersion: v1
99kind: Service
100metadata:
101  labels:
102    app.kubernetes.io/name: bindplane-agent
103  name: bindplane-node-agent
104  namespace: bindplane-agent
105spec:
106  ports:
107  - appProtocol: grpc
108    name: otlp-grpc
109    port: 4317
110    protocol: TCP
111    targetPort: 4317
112  - appProtocol: http
113    name: otlp-http
114    port: 4318
115    protocol: TCP
116    targetPort: 4318
117  selector:
118    app.kubernetes.io/name: bindplane-agent
119    app.kubernetes.io/component: node
120  sessionAffinity: None
121  type: ClusterIP
122---
123apiVersion: v1
124kind: Service
125metadata:
126  labels:
127    app.kubernetes.io/name: bindplane-agent
128    app.kubernetes.io/component: node
129  name: bindplane-node-agent-headless
130  namespace: bindplane-agent
131spec:
132  clusterIP: None
133  ports:
134  - appProtocol: grpc
135    name: otlp-grpc
136    port: 4317
137    protocol: TCP
138    targetPort: 4317
139  - appProtocol: http
140    name: otlp-http
141    port: 4318
142    protocol: TCP
143    targetPort: 4318
144  selector:
145    app.kubernetes.io/name: bindplane-agent
146    app.kubernetes.io/component: node
147  sessionAffinity: None
148  type: ClusterIP
149---
150apiVersion: v1
151kind: ConfigMap
152metadata:
153  name: bindplane-node-agent-setup
154  labels:
155    app.kubernetes.io/name: bindplane-agent
156    app.kubernetes.io/component: node
157  namespace: bindplane-agent
158data:
159  # This script assumes it is running in /etc/otel.
160  setup.sh: |
161    # Configure storage/ emptyDir volume permissions so the
162    # manager configuration can be written to it.
163    chown 10005:10005 storage/
164
165    # Copy config and logging configuration files to storage/
166    # hostPath volume if they do not already exist.
167    if [ ! -f storage/config.yaml ]; then
168      echo '
169      receivers:
170        nop:
171      processors:
172        batch:
173      exporters:
174        nop:
175      service:
176        pipelines:
177          metrics:
178            receivers: [nop]
179            processors: [batch]
180            exporters: [nop]
181        telemetry:
182          metrics:
183            level: none
184      ' > storage/config.yaml
185    fi
186    if [ ! -f storage/logging.yaml ]; then
187      echo '
188      output: stdout
189      level: info
190      ' > storage/logging.yaml
191    fi
192    chown 10005:10005 storage/config.yaml storage/logging.yaml
193---
194apiVersion: apps/v1
195kind: DaemonSet
196metadata:
197  name: bindplane-node-agent
198  labels:
199    app.kubernetes.io/name: bindplane-agent
200    app.kubernetes.io/component: node
201  namespace: bindplane-agent
202spec:
203  selector:
204    matchLabels:
205      app.kubernetes.io/name: bindplane-agent
206      app.kubernetes.io/component: node
207  template:
208    metadata:
209      labels:
210        app.kubernetes.io/name: bindplane-agent
211        app.kubernetes.io/component: node
212      annotations:
213        prometheus.io/scrape: "true"
214        prometheus.io/path: /metrics
215        prometheus.io/port: "8888"
216        prometheus.io/scheme: http
217        prometheus.io/job-name: bindplane-node-agent
218    spec:
219      serviceAccount: bindplane-agent
220      initContainers:
221        - name: setup
222          image: busybox:latest
223          securityContext:
224            # Required for changing permissions from
225            # root to otel user in emptyDir volume.
226            runAsUser: 0
227          command: ["sh", "/setup/setup.sh"]
228          volumeMounts:
229            - mountPath: /etc/otel/config
230              name: config
231            - mountPath: /storage
232              name: storage
233            - mountPath: "/setup"
234              name: setup
235      containers:
236        - name: opentelemetry-collector
237          image: ghcr.io/observiq/bindplane-agent:1.80.1
238          imagePullPolicy: IfNotPresent
239          securityContext:
240            readOnlyRootFilesystem: true
241            # Required for reading container logs hostPath.
242            runAsUser: 0
243          ports:
244            - containerPort: 8888
245              name: prometheus
246          resources:
247            requests:
248              memory: 200Mi
249              cpu: 100m
250            limits:
251              memory: 200Mi
252          env:
253            - name: OPAMP_ENDPOINT
254              value: wss://app.bindplane.com/v1/opamp
255            - name: OPAMP_SECRET_KEY
256              value: <secret>
257            - name: OPAMP_AGENT_NAME
258              valueFrom:
259                fieldRef:
260                  fieldPath: spec.nodeName
261            - name: OPAMP_LABELS
262              value: configuration=telgen-kind-1,container-platform=kubernetes-daemonset,install_id=0979c5c2-bd7a-41c1-89b8-2c16441886ab
263            - name: KUBE_NODE_NAME
264              valueFrom:
265                fieldRef:
266                  fieldPath: spec.nodeName
267            # The collector process updates config.yaml
268            # and manager.yaml when receiving changes
269            # from the OpAMP server.
270            #
271            # The config.yaml is persisted by saving it to the
272            # hostPath volume, allowing the agent to continue
273            # running after restart during an OpAMP server outage.
274            #
275            # The manager configuration must be re-generated on
276            # every startup due to how the bindplane-agent handles
277            # manager configuration. It prefers a manager config file
278            # over environment variables, meaning it cannot be
279            # updated using environment variables, if it is persisted.
280            - name: CONFIG_YAML_PATH
281              value: /etc/otel/storage/config.yaml
282            - name: MANAGER_YAML_PATH
283              value: /etc/otel/config/manager.yaml
284            - name: LOGGING_YAML_PATH
285              value: /etc/otel/storage/logging.yaml
286          volumeMounts:
287            - mountPath: /etc/otel/config
288              name: config
289            - mountPath: /run/log/journal
290              name: runlog
291              readOnly: true
292            - mountPath: /var/log
293              name: varlog
294              readOnly: true
295            - mountPath: /var/lib/docker/containers
296              name: dockerlogs
297              readOnly: true
298            - mountPath: /etc/otel/storage
299              name: storage
300      volumes:
301        - name: config
302          emptyDir: {}
303        - name: runlog
304          hostPath:
305            path: /run/log/journal
306        - name: varlog
307          hostPath:
308            path: /var/log
309        - name: dockerlogs
310          hostPath:
311            path: /var/lib/docker/containers
312        - name: storage
313          hostPath:
314            path: /var/lib/observiq/otelcol/container
315        - name: setup
316          configMap:
317            name: bindplane-node-agent-setup

In short, this manifest deploys the BDOT Collector as a DaemonSet on every node, using OpAMP to receive config from Bindplane. It includes:

  • RBAC to read Kubernetes objects (pods, nodes, deployments, etc.)
  • Services to expose OTLP ports (4317 gRPC, 4318 HTTP)
  • An init container that bootstraps a starter config for the collector, which is replaced by the telgen-kind-1 configuration once the agent starts and connects to Bindplane
  • Persistent hostPath storage for retries and disk buffering
  • Prometheus annotations for metrics scraping

Your file will include the correct OPAMP_ENDPOINT, OPAMP_SECRET_KEY, and OPAMP_LABELS.

Go ahead and apply this manifest to the first k8s cluster.

bash
1kubectl config use-context kind-kind-1
2kubectl apply -f node-agent-kind-1.yaml

Next, install another collector in the K8s cluster, but this time choose the Gateway platform and the gw-kind-1 configuration.

kubernetes-load-balancer-install-gw-agent-1

You’ll get another manifest file to apply, but this time it’s a Deployment. Save it as gateway-collector-kind-1.yaml.

kubernetes-load-balancer-install-gateway-agent-2

Here’s the full manifest as a deployment with a horizontal pod autoscaler. Or, check out what it looks like on GitHub.

yaml
1---
2apiVersion: v1
3kind: Namespace
4metadata:
5  labels:
6    app.kubernetes.io/name: bindplane-agent
7  name: bindplane-agent
8---
9apiVersion: v1
10kind: ServiceAccount
11metadata:
12  labels:
13    app.kubernetes.io/name: bindplane-agent
14  name: bindplane-agent
15  namespace: bindplane-agent
16---
17apiVersion: v1
18kind: Service
19metadata:
20  labels:
21    app.kubernetes.io/name: bindplane-agent
22    app.kubernetes.io/component: gateway
23  name: bindplane-gateway-agent
24  namespace: bindplane-agent
25spec:
26  ports:
27  - appProtocol: grpc
28    name: otlp-grpc
29    port: 4317
30    protocol: TCP
31    targetPort: 4317
32  - appProtocol: http
33    name: otlp-http
34    port: 4318
35    protocol: TCP
36    targetPort: 4318
37  - appProtocol: tcp
38    name: splunk-tcp
39    port: 9997
40    protocol: TCP
41    targetPort: 9997
42  - appProtocol: tcp
43    name: splunk-hec
44    port: 8088
45    protocol: TCP
46    targetPort: 8088
47  selector:
48    app.kubernetes.io/name: bindplane-agent
49    app.kubernetes.io/component: gateway
50  sessionAffinity: None
51  type: ClusterIP
52---
53apiVersion: v1
54kind: Service
55metadata:
56  labels:
57    app.kubernetes.io/name: bindplane-agent
58    app.kubernetes.io/component: gateway
59  name: bindplane-gateway-agent-headless
60  namespace: bindplane-agent
61spec:
62  clusterIP: None
63  ports:
64  - appProtocol: grpc
65    name: otlp-grpc
66    port: 4317
67    protocol: TCP
68    targetPort: 4317
69  - appProtocol: http
70    name: otlp-http
71    port: 4318
72    protocol: TCP
73    targetPort: 4318
74  selector:
75    app.kubernetes.io/name: bindplane-agent
76    app.kubernetes.io/component: gateway
77  sessionAffinity: None
78  type: ClusterIP
79---
80apiVersion: apps/v1
81kind: Deployment
82metadata:
83  name: bindplane-gateway-agent
84  labels:
85    app.kubernetes.io/name: bindplane-agent
86    app.kubernetes.io/component: gateway
87  namespace: bindplane-agent
88spec:
89  selector:
90    matchLabels:
91      app.kubernetes.io/name: bindplane-agent
92      app.kubernetes.io/component: gateway
93  template:
94    metadata:
95      labels:
96        app.kubernetes.io/name: bindplane-agent
97        app.kubernetes.io/component: gateway
98      annotations:
99        prometheus.io/scrape: "true"
100        prometheus.io/path: /metrics
101        prometheus.io/port: "8888"
102        prometheus.io/scheme: http
103        prometheus.io/job-name: bindplane-gateway-agent
104    spec:
105      serviceAccount: bindplane-agent
106      affinity:
107        podAntiAffinity:
108          preferredDuringSchedulingIgnoredDuringExecution:
109            - weight: 100
110              podAffinityTerm:
111                topologyKey: kubernetes.io/hostname
112                labelSelector:
113                  matchExpressions:
114                    - key: app.kubernetes.io/name
115                      operator: In
116                      values:  [bindplane-agent]
117                    - key: app.kubernetes.io/component
118                      operator: In
119                      values: [gateway]
120      securityContext:
121        runAsNonRoot: true
122        runAsUser: 1000000000
123        runAsGroup: 1000000000
124        fsGroup: 1000000000
125        seccompProfile:
126          type: RuntimeDefault
127      initContainers:
128        - name: setup-volumes
129          image: ghcr.io/observiq/bindplane-agent:1.80.1
130          securityContext:
131            runAsNonRoot: true
132            runAsUser: 1000000000
133            runAsGroup: 1000000000
134            readOnlyRootFilesystem: true
135            allowPrivilegeEscalation: false
136            seccompProfile:
137              type: RuntimeDefault
138            capabilities:
139              drop:
140                - ALL
141          command:
142            - 'sh'
143            - '-c'
144            - |
145              echo '
146              receivers:
147                nop:
148              processors:
149                batch:
150              exporters:
151                nop:
152              service:
153                pipelines:
154                  metrics:
155                    receivers: [nop]
156                    processors: [batch]
157                    exporters: [nop]
158                telemetry:
159                  metrics:
160                    level: none
161              ' > /etc/otel/storage/config.yaml
162              echo '
163              output: stdout
164              level: info
165              ' > /etc/otel/storage/logging.yaml
166          resources:
167            requests:
168              memory: 200Mi
169              cpu: 100m
170            limits:
171              memory: 200Mi
172          volumeMounts:
173            - mountPath: /etc/otel/storage
174              name: bindplane-gateway-agent-storage
175      containers:
176        - name: opentelemetry-container
177          image: ghcr.io/observiq/bindplane-agent:1.80.1
178          imagePullPolicy: IfNotPresent
179          securityContext:
180            runAsNonRoot: true
181            runAsUser: 1000000000
182            runAsGroup: 1000000000
183            readOnlyRootFilesystem: true
184            allowPrivilegeEscalation: false
185            seccompProfile:
186              type: RuntimeDefault
187            capabilities:
188              drop:
189                - ALL
190          resources:
191            requests:
192              memory: 500Mi
193              cpu: 250m
194            limits:
195              memory: 500Mi
196          ports:
197            - containerPort: 8888
198              name: prometheus
199          env:
200            - name: OPAMP_ENDPOINT
201              value: wss://app.bindplane.com/v1/opamp
202            - name: OPAMP_SECRET_KEY
203              value: <secret>
204            - name: OPAMP_AGENT_NAME
205              valueFrom:
206                fieldRef:
207                  fieldPath: metadata.name
208            - name: OPAMP_LABELS
209              value: configuration=gw-kind-1,container-platform=kubernetes-gateway,install_id=51dbe4d2-83d2-45c0-ab4a-e0c127a59649
210            - name: KUBE_NODE_NAME
211              valueFrom:
212                fieldRef:
213                  fieldPath: spec.nodeName
214            # The collector process updates config.yaml
215            # and manager.yaml when receiving changes
216            # from the OpAMP server.
217            - name: CONFIG_YAML_PATH
218              value: /etc/otel/storage/config.yaml
219            - name: MANAGER_YAML_PATH
220              value: /etc/otel/config/manager.yaml
221            - name: LOGGING_YAML_PATH
222              value: /etc/otel/storage/logging.yaml
223          volumeMounts:
224          - mountPath: /etc/otel/storage
225            name: bindplane-gateway-agent-storage
226          - mountPath: /etc/otel/config
227            name: config
228      volumes:
229        - name: config
230          emptyDir: {}
231        - name: bindplane-gateway-agent-storage
232          emptyDir: {}
233      # Allow exporters to drain their queue for up to
234      # five minutes.
235      terminationGracePeriodSeconds: 500
236---
237apiVersion: autoscaling/v2
238kind: HorizontalPodAutoscaler
239metadata:
240  name: bindplane-gateway-agent
241  namespace: bindplane-agent
242spec:
243  maxReplicas: 10
244  minReplicas: 2
245  scaleTargetRef:
246    apiVersion: apps/v1
247    kind: Deployment
248    name: bindplane-gateway-agent
249  metrics:
250  - type: Resource
251    resource:
252      name: cpu
253      target:
254        type: Utilization
255        averageUtilization: 60

Here’s a breakdown of what this manifest does:

  • Creates a dedicated namespace and service account for the Bindplane Gateway Collector (bindplane-agent).
  • Defines two Kubernetes services:
    • A standard ClusterIP service for OTLP (gRPC/HTTP) and Splunk (TCP/HEC) traffic.
    • A headless service for direct pod discovery, useful in peer-to-peer setups.
  • Deploys the Bindplane Agent as a scalable Deployment:
    • Runs the OpenTelemetry Collector image.
    • Bootstraps basic config via an initContainer.
    • Secures the runtime with strict securityContext settings.
    • Enables Prometheus metrics scraping via annotations.
  • Auto-scales the collector horizontally using an HPA:
    • Scales between 2 and 10 replicas based on CPU utilization.
  • Uses OpAMP to receive remote config and updates from Bindplane.
  • Mounts ephemeral storage for config and persistent queue support using emptyDir.

Your file will include the correct OPAMP_ENDPOINT, OPAMP_SECRET_KEY, and OPAMP_LABELS.

Apply it in the first k8s cluster.

bash
1kubectl apply -f gateway-collector-kind-1.yaml

Now, create a Gateway Collector identical to the one above, but use the external-gw-kind-2 configuration.

kubernetes-load-balancer-install-external-gateway-agent

You’ll get another manifest file, but this time apply it in your second cluster. Save it as gateway-collector-kind-2.yaml. Here’s what it looks like on GitHub. I won’t bother showing you the manifest YAML since it will be identical to the one above.

bash
1kubectl config use-context kind-kind-2
2kubectl apply -f gateway-collector-kind-2.yaml

Finally, to expose this external Gateway Collector’s service and enable OTLP traffic from cluster 1 to cluster 2, I’ll use this NodePort service called gateway-nodeport-service.yaml.

yaml
1apiVersion: v1
2kind: Service
3metadata:
4  name: bindplane-gateway-agent-nodeport
5  namespace: bindplane-agent
6  labels:
7    app.kubernetes.io/name: bindplane-agent
8    app.kubernetes.io/component: gateway
9spec:
10  type: NodePort
11  ports:
12  - name: otlp-grpc
13    port: 4317
14    targetPort: 4317
15    nodePort: 30317
16    protocol: TCP
17  - name: otlp-http
18    port: 4318
19    targetPort: 4318
20    nodePort: 30318
21    protocol: TCP
22  selector:
23    app.kubernetes.io/name: bindplane-agent
24    app.kubernetes.io/component: gateway

And, apply it with:

bash
1kubectl apply -f gateway-nodeport-service.yaml

Your final setup will look like this.

kubernetes-load-balancer-high-availability-collectors-architecture

One Agent-mode collector sends telemetry via a horizontally scaled Gateway-mode collector to an external Gateway running in a separate cluster. That external Gateway can stand in for any other telemetry backend of your choice.

You’ll have 5 collectors running in total.

kuberentes-load-balancing-high-availability-collectors-overview

And, three configurations, two of which will scale between 2 and 10 collector pods.

kuberentes-load-balancing-high-availability-collectors-configurations

To get this sample up and running quickly, apply these configs via the Bindplane CLI; the files are on GitHub, here.

Final thoughts

At the end of the day, high availability for the OpenTelemetry Collector means one thing: don’t lose telemetry when stuff breaks.

You want things to keep working when a telemetry backend goes down, a node restarts, or you’re pushing out updates. That’s why the Agent-Gateway pattern exists. That’s why we scale horizontally. That’s why we use batching, retries, and persistent queues.

Set it up once, and sleep better knowing your pipeline won’t fall over at the first hiccup. Keep signals flowing. No drops. No drama.

Want to give Bindplane a try? Spin up a free instance of Bindplane Cloud and hit the ground running right away.
