Metrics

Monitor performance and health metrics for Bindy DNS infrastructure.

Operator Metrics

Bindy exposes Prometheus-compatible metrics on port 8080 at /metrics. They cover reconciliation outcomes and latency, resource lifecycle, errors, leader election, and observation lag.

Accessing Metrics

The metrics endpoint is exposed on all operator pods:

# Port forward to the operator
kubectl port-forward -n dns-system deployment/bindy-controller 8080:8080

# View metrics
curl http://localhost:8080/metrics
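
For scripted checks, the endpoint output can be parsed directly. The following is a minimal Python sketch, assuming the port-forward above is running and that the endpoint returns the standard Prometheus text exposition format. The helper names (parse_metrics, fetch_metrics) and the sample values in EXAMPLE are hypothetical and are not part of Bindy.

```python
import re
from urllib.request import urlopen

# Matches `name{labels} value`; the label block is optional.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# Matches one `key="value"` pair, allowing backslash escapes in the value.
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse exposition text into (name, labels, value) tuples.

    Comment lines (# HELP, # TYPE) and blank lines are skipped.
    Escape sequences inside label values are kept as-is, not decoded.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(url="http://localhost:8080/metrics"):
    """Fetch and parse the endpoint; assumes the port-forward is active."""
    with urlopen(url, timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))


# Hypothetical scrape output, for illustration only.
EXAMPLE = """\
# TYPE bindy_firestoned_io_reconciliations_total counter
bindy_firestoned_io_reconciliations_total{resource_type="DNSZone",status="success"} 42
bindy_firestoned_io_reconciliations_total{resource_type="DNSZone",status="error"} 3
bindy_firestoned_io_resources_active{resource_type="ARecord"} 17
"""

if __name__ == "__main__":
    for name, labels, value in parse_metrics(EXAMPLE):
        print(name, labels, value)
```

Calling fetch_metrics() instead of parse_metrics(EXAMPLE) reads live data from the forwarded port.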

Available Metrics

All metrics use the namespace prefix bindy_firestoned_io_.

Reconciliation Metrics

bindy_firestoned_io_reconciliations_total (Counter) Total number of reconciliation attempts by resource type and outcome.

Labels:

resource_type: Kind of resource (Bind9Cluster, Bind9Instance, DNSZone, ARecord, AAAARecord, TXTRecord, CNAMERecord, MXRecord, NSRecord, SRVRecord, CAARecord)
status: Outcome (success, error, requeue)

# Successful reconciliations per second
rate(bindy_firestoned_io_reconciliations_total{status="success"}[5m])

# Error rate by resource type
sum(rate(bindy_firestoned_io_reconciliations_total{status="error"}[5m])) by (resource_type)

bindy_firestoned_io_reconciliation_duration_seconds (Histogram) Duration of reconciliation operations in seconds.

Labels:

resource_type: Kind of resource

Buckets: 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0

# Average reconciliation duration
rate(bindy_firestoned_io_reconciliation_duration_seconds_sum[5m])
/ rate(bindy_firestoned_io_reconciliation_duration_seconds_count[5m])

# 95th percentile latency (bucket counters must be converted to rates first)
histogram_quantile(0.95,
  sum(rate(bindy_firestoned_io_reconciliation_duration_seconds_bucket[5m])) by (le)
)
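
To make the percentile estimate concrete, here is a minimal Python sketch of the linear interpolation that histogram_quantile applies to cumulative buckets. The bucket bounds are the ones listed above plus the implicit +Inf bucket; the counts are invented for illustration, and the function is a simplified model, not Prometheus's implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    ending with the +Inf bucket. Requires 0 < q <= 1.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # first bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


BOUNDS = [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, math.inf]
# Hypothetical cumulative counts for 100 observed reconciliations.
COUNTS = [0, 10, 60, 90, 98, 100, 100, 100, 100, 100, 100]
BUCKETS = list(zip(BOUNDS, COUNTS))

if __name__ == "__main__":
    print("p50:", histogram_quantile(0.50, BUCKETS))
    print("p95:", histogram_quantile(0.95, BUCKETS))
```

Because the estimate is interpolated within a bucket, its precision is limited by the bucket width: with these bounds, any p95 between 0.5 s and 1.0 s is resolved only by assuming observations are spread evenly across that bucket.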

bindy_firestoned_io_requeues_total (Counter) Total number of requeue operations.

Labels:

resource_type: Kind of resource
reason: Reason for requeue (error, rate_limit, dependency_wait)

# Requeue rate by reason
sum(rate(bindy_firestoned_io_requeues_total[5m])) by (reason)

Resource Lifecycle Metrics

bindy_firestoned_io_resources_created_total (Counter) Total number of resources created.

Labels:

resource_type: Kind of resource

bindy_firestoned_io_resources_updated_total (Counter) Total number of resources updated.

Labels:

resource_type: Kind of resource

bindy_firestoned_io_resources_deleted_total (Counter) Total number of resources deleted.

Labels:

resource_type: Kind of resource

bindy_firestoned_io_resources_active (Gauge) Currently active resources being tracked.

Labels:

resource_type: Kind of resource

# Resource creation rate
rate(bindy_firestoned_io_resources_created_total[5m])

# Active resources by type
bindy_firestoned_io_resources_active

Error Metrics

bindy_firestoned_io_errors_total (Counter) Total number of errors by resource type and category.

Labels:

resource_type: Kind of resource
error_type: Category (api_error, validation_error, network_error, timeout, reconcile_error)

# Error rate by error type
sum(rate(bindy_firestoned_io_errors_total[5m])) by (error_type)

# Errors by resource type
sum(rate(bindy_firestoned_io_errors_total[5m])) by (resource_type)

Leader Election Metrics

bindy_firestoned_io_leader_elections_total (Counter) Total number of leader election events.

Labels:

status: Event type (acquired, lost, renewed)

bindy_firestoned_io_leader_status (Gauge) Current leader election status (1 = leader, 0 = follower).

Labels:

pod_name: Name of the pod

# Current leader
bindy_firestoned_io_leader_status == 1

# Leader election rate
rate(bindy_firestoned_io_leader_elections_total[5m])

Performance Metrics

bindy_firestoned_io_generation_observation_lag_seconds (Histogram) Lag between resource spec generation change and controller observation.

Labels:

resource_type: Kind of resource

Buckets: 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, 120.0

# Average observation lag
rate(bindy_firestoned_io_generation_observation_lag_seconds_sum[5m])
/ rate(bindy_firestoned_io_generation_observation_lag_seconds_count[5m])

Prometheus Configuration

The operator deployment includes Prometheus scrape annotations:

annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8080"
  prometheus.io/path: "/metrics"

Prometheus discovers and scrapes these metrics automatically when its Kubernetes service discovery is configured to honor the prometheus.io/* annotations.

Example Queries

# Reconciliation success rate (last 5 minutes)
sum(rate(bindy_firestoned_io_reconciliations_total{status="success"}[5m]))
/ sum(rate(bindy_firestoned_io_reconciliations_total[5m]))

# DNSZone reconciliation p95 latency
histogram_quantile(0.95,
  sum(rate(bindy_firestoned_io_reconciliation_duration_seconds_bucket{resource_type="DNSZone"}[5m])) by (le)
)

# Top 10 resource types by error rate (last hour)
topk(10,
  sum(rate(bindy_firestoned_io_errors_total[1h])) by (resource_type)
)

# Active resources per type
sum(bindy_firestoned_io_resources_active) by (resource_type)

# Requeue rate by resource type and reason
sum(rate(bindy_firestoned_io_requeues_total[5m])) by (resource_type, reason)
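
The success-rate query above divides the growth of the success counter by the growth of all reconciliation counters over the same window. The sketch below reproduces that arithmetic in Python from two scrapes of bindy_firestoned_io_reconciliations_total, keyed by the status label. The sample values are hypothetical, and the sketch ignores the window extrapolation that Prometheus's rate() performs.

```python
def counter_increase(previous, current):
    """Increase between two counter samples.

    A drop is treated as a counter reset (for example after a pod restart),
    so the current value is taken as the increase.
    """
    return current if current < previous else current - previous


def success_ratio(previous, current):
    """Share of reconciliations that succeeded between two scrapes.

    previous / current: dicts mapping the `status` label to a counter value.
    Returns None when no reconciliations happened in the interval.
    """
    increases = {
        status: counter_increase(previous.get(status, 0.0), value)
        for status, value in current.items()
    }
    total = sum(increases.values())
    if total == 0:
        return None
    return increases.get("success", 0.0) / total


if __name__ == "__main__":
    # Hypothetical counter values from two scrapes five minutes apart.
    earlier = {"success": 100, "error": 5, "requeue": 10}
    later = {"success": 190, "error": 10, "requeue": 15}
    print(success_ratio(earlier, later))
```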

Grafana Dashboard

Import the Bindy operator dashboard (coming soon) or create custom panels using the queries above.

Recommended panels:

Reconciliation Rate - Total reconciliations/sec by resource type
Reconciliation Latency - P50, P95, P99 latencies
Error Rate - Errors/sec by resource type and error category
Active Resources - Gauge showing current active resources
Leader Status - Current leader pod and election events
Resource Lifecycle - Created/Updated/Deleted rates

Resource Metrics

Pod Metrics

View CPU and memory usage:

# All DNS pods
kubectl top pods -n dns-system

# Specific instance
kubectl top pods -n dns-system -l instance=primary-dns

# Sort by CPU
kubectl top pods -n dns-system --sort-by=cpu

# Sort by memory
kubectl top pods -n dns-system --sort-by=memory

Node Metrics

# Node resource usage
kubectl top nodes

# Detailed node info
kubectl describe node <node-name>

DNS Query Metrics

Using BIND9 Statistics

Enable the BIND9 statistics channel (future enhancement):

spec:
  config:
    statisticsChannels:
      - address: "127.0.0.1"
        port: 8053

Query Counters

Monitor query rate and types:

Total queries received
Queries by record type (A, AAAA, MX, etc.)
Successful vs failed queries
NXDOMAIN responses

Performance Metrics

Query Latency

Measure DNS query response time:

# Test query latency
time dig @<dns-server-ip> example.com

# Multiple queries for average
for i in {1..10}; do time dig @<dns-server-ip> example.com +short; done
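
The shell loop above times the whole dig process, which includes process startup. As an alternative, the following Python sketch reads the "Query time" line that dig prints in its statistics section and summarizes it over several runs. It assumes dig is installed and on PATH; the function names and the server address in the usage line are hypothetical.

```python
import re
import statistics
import subprocess

# dig reports the server round trip as ";; Query time: <n> msec".
_QUERY_TIME = re.compile(r"^;; Query time: (\d+) msec", re.MULTILINE)


def parse_query_time_ms(dig_output):
    """Return the query time reported by dig in milliseconds, or None."""
    match = _QUERY_TIME.search(dig_output)
    return int(match.group(1)) if match else None


def measure(server, name, count=10):
    """Run dig `count` times and summarize the reported query times."""
    times = []
    for _ in range(count):
        result = subprocess.run(
            # +noall +stats prints only the statistics section.
            ["dig", f"@{server}", name, "+noall", "+stats"],
            capture_output=True,
            text=True,
            check=False,
        )
        ms = parse_query_time_ms(result.stdout)
        if ms is not None:
            times.append(ms)
    if not times:
        return None
    return {
        "samples": len(times),
        "mean_ms": statistics.mean(times),
        "max_ms": max(times),
    }
```

For example, measure("10.0.0.10", "example.com") returns a small summary dict, or None if no run produced a query time.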

Zone Transfer Metrics

Monitor zone transfer performance:

Transfer duration
Transfer size
Transfer failures
Lag between primary and secondary

Kubernetes Metrics

Resource Utilization

# View resource requests vs limits
kubectl describe pod -n dns-system <pod-name> | grep -A5 "Limits:\|Requests:"

Pod Health

# Pod status and restarts
kubectl get pods -n dns-system -o wide

# Events
kubectl get events -n dns-system --sort-by='.lastTimestamp'

Prometheus Integration

BIND9 Exporter

Deploy bind_exporter as a sidecar (future enhancement):

containers:
- name: bind-exporter
  image: prometheuscommunity/bind-exporter:latest
  args:
    - "--bind.stats-url=http://localhost:8053"
  ports:
    - name: metrics
      containerPort: 9119

Service Monitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: bindy-metrics
spec:
  selector:
    matchLabels:
      app: bind9
  endpoints:
  - port: metrics
    interval: 30s

Key Metrics to Monitor

Query Rate - Queries per second
Query Latency - Response time
Error Rate - Failed queries percentage
Cache Hit Ratio - Cache effectiveness
Zone Transfer Status - Success/failure of transfers
Resource Usage - CPU and memory utilization
Pod Health - Running vs desired replicas

Grafana Dashboards

Create dashboards for:

DNS Overview

Total query rate
Average latency
Error rate
Top queried domains

Instance Health

Pod status
CPU/memory usage
Restart count
Network I/O

Zone Management

Zones count
Records per zone
Zone transfer status
Serial numbers

Alerting Thresholds

Recommended alert thresholds:

Metric	Warning	Critical
CPU Usage	> 70%	> 90%
Memory Usage	> 70%	> 90%
Query Latency	> 100ms	> 500ms
Error Rate	> 1%	> 5%
Pod Restarts	> 3/hour	> 10/hour
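
As an illustration of how these thresholds might be applied in a script, the sketch below encodes the table and classifies a measured value. The metric keys and the classify function are hypothetical names chosen for this example; the numbers are the ones from the table, and a value exactly at a threshold is not treated as exceeding it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float


# Values taken from the table above; the keys are illustrative names.
THRESHOLDS = {
    "cpu_usage_percent": Threshold(70, 90),
    "memory_usage_percent": Threshold(70, 90),
    "query_latency_ms": Threshold(100, 500),
    "error_rate_percent": Threshold(1, 5),
    "pod_restarts_per_hour": Threshold(3, 10),
}


def classify(metric, value):
    """Return "critical", "warning", or "ok" for a measured value."""
    threshold = THRESHOLDS[metric]
    if value > threshold.critical:
        return "critical"
    if value > threshold.warning:
        return "warning"
    return "ok"


if __name__ == "__main__":
    print(classify("cpu_usage_percent", 75))
    print(classify("query_latency_ms", 600))
```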

Best Practices

Baseline metrics - Establish normal operating ranges
Set appropriate alerts - Avoid alert fatigue
Monitor trends - Look for gradual degradation
Capacity planning - Use metrics to plan scaling
Regular review - Review dashboards weekly
