Metrics
Monitor performance and health metrics for Bindy DNS infrastructure.
Operator Metrics
Bindy exposes Prometheus-compatible metrics on port 8080 at /metrics. They cover reconciliation, resource lifecycle, errors, leader election, and generation observation lag.
Accessing Metrics
The metrics endpoint is exposed on all operator pods:
# Port forward to the operator
kubectl port-forward -n dns-system deployment/bindy-controller 8080:8080
# View metrics
curl http://localhost:8080/metrics
Available Metrics
All metrics use the namespace prefix bindy_firestoned_io_.
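Because every series shares this prefix, the scrape output can be narrowed to Bindy's own metrics with a plain text filter. The helper below is a minimal sketch; the function name `bindy_series` is illustrative and not part of Bindy.

```shell
# Hypothetical helper: keep only Bindy series from Prometheus text-format
# input on stdin, dropping HELP/TYPE comments and runtime metrics.
bindy_series() {
  grep -E '^bindy_firestoned_io_'
}

# Usage (assumes the port-forward shown earlier is running):
#   curl -s http://localhost:8080/metrics | bindy_series
```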
Reconciliation Metrics
bindy_firestoned_io_reconciliations_total (Counter)
Total number of reconciliation attempts by resource type and outcome.
Labels:
resource_type: Kind of resource (Bind9Cluster, Bind9Instance, DNSZone, ARecord, AAAARecord, TXTRecord, CNAMERecord, MXRecord, NSRecord, SRVRecord, CAARecord)
status: Outcome (success, error, requeue)
# Successful reconciliations per second
rate(bindy_firestoned_io_reconciliations_total{status="success"}[5m])
# Error rate by resource type
rate(bindy_firestoned_io_reconciliations_total{status="error"}[5m])
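When Prometheus is not available, a rough success ratio can be read straight from a single scrape. The sketch below assumes sample lines carry no trailing timestamp, so the last field is the value, and it reports the ratio since the operator process started rather than over a time window. The function name `success_ratio` is hypothetical.

```shell
# Hypothetical helper: lifetime success ratio from one /metrics scrape on stdin.
# Assumes the last field of each sample line is the counter value.
success_ratio() {
  awk '
    index($0, "bindy_firestoned_io_reconciliations_total{") == 1 {
      total += $NF
      if ($0 ~ /status="success"/) ok += $NF
    }
    END {
      if (total > 0) printf "%.3f\n", ok / total
      else print "n/a"
    }
  '
}

# Usage:
#   curl -s http://localhost:8080/metrics | success_ratio
```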
bindy_firestoned_io_reconciliation_duration_seconds (Histogram)
Duration of reconciliation operations in seconds.
Labels:
resource_type: Kind of resource
Buckets: 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0
# Average reconciliation duration
rate(bindy_firestoned_io_reconciliation_duration_seconds_sum[5m])
/ rate(bindy_firestoned_io_reconciliation_duration_seconds_count[5m])
# 95th percentile latency
histogram_quantile(0.95,
  sum(rate(bindy_firestoned_io_reconciliation_duration_seconds_bucket[5m])) by (le)
)
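To make the percentile query less opaque: `histogram_quantile` finds the first cumulative bucket whose count reaches the requested rank and interpolates linearly inside it. The sketch below reproduces that estimate from one scrape. It assumes a single series (one resource_type) on stdin, buckets in ascending order ending with +Inf, no other label name ending in `le`, and a quantile in the range (0, 1]. The function name `bucket_quantile` is hypothetical.

```shell
# Hypothetical helper: estimate a quantile from cumulative histogram buckets.
# Usage: <bucket lines> | bucket_quantile 0.95
bucket_quantile() {
  awk -v q="$1" '
    index($0, "_bucket{") > 0 && match($0, /le="[^"]*"/) {
      n++
      bound[n] = substr($0, RSTART + 4, RLENGTH - 5)  # text between the quotes
      count[n] = $NF + 0                               # cumulative count
    }
    END {
      if (n == 0 || count[n] == 0) exit 1
      rank = q * count[n]
      for (i = 1; i <= n; i++) {
        if (count[i] >= rank) {
          lo   = (i > 1) ? bound[i - 1] + 0 : 0   # first bucket starts at 0
          prev = (i > 1) ? count[i - 1] : 0
          # A rank that lands in +Inf is reported as the last finite bound.
          if (bound[i] == "+Inf") { print lo; exit }
          printf "%.3f\n", lo + (bound[i] - lo) * (rank - prev) / (count[i] - prev)
          exit
        }
      }
    }
  '
}
```

Unlike the PromQL query, this works on lifetime counts rather than a 5-minute rate, so it smooths over recent changes in latency.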
bindy_firestoned_io_requeues_total (Counter)
Total number of requeue operations.
Labels:
resource_type: Kind of resource
reason: Reason for requeue (error, rate_limit, dependency_wait)
# Requeue rate by reason
rate(bindy_firestoned_io_requeues_total[5m])
Resource Lifecycle Metrics
bindy_firestoned_io_resources_created_total (Counter)
Total number of resources created.
Labels:
resource_type: Kind of resource
bindy_firestoned_io_resources_updated_total (Counter)
Total number of resources updated.
Labels:
resource_type: Kind of resource
bindy_firestoned_io_resources_deleted_total (Counter)
Total number of resources deleted.
Labels:
resource_type: Kind of resource
bindy_firestoned_io_resources_active (Gauge)
Currently active resources being tracked.
Labels:
resource_type: Kind of resource
# Resource creation rate
rate(bindy_firestoned_io_resources_created_total[5m])
# Active resources by type
bindy_firestoned_io_resources_active
Error Metrics
bindy_firestoned_io_errors_total (Counter)
Total number of errors by resource type and category.
Labels:
resource_type: Kind of resource
error_type: Category (api_error, validation_error, network_error, timeout, reconcile_error)
# Error rate by type
rate(bindy_firestoned_io_errors_total[5m])
# Errors by resource type
sum(rate(bindy_firestoned_io_errors_total[5m])) by (resource_type)
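The `sum(...) by (resource_type)` aggregation collapses the error_type label and keeps one total per resource type. The same grouping can be sketched against a raw scrape, which gives lifetime totals rather than a rate. The function name `errors_by_type` is hypothetical, and the last field of each sample line is assumed to be the value.

```shell
# Hypothetical helper: total errors per resource_type from a scrape on stdin.
errors_by_type() {
  awk '
    index($0, "bindy_firestoned_io_errors_total{") == 1 &&
    match($0, /resource_type="[^"]*"/) {
      rt = substr($0, RSTART + 15, RLENGTH - 16)  # text between the quotes
      total[rt] += $NF
    }
    END { for (rt in total) print rt, total[rt] }
  ' | sort
}

# Usage:
#   curl -s http://localhost:8080/metrics | errors_by_type
```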
Leader Election Metrics
bindy_firestoned_io_leader_elections_total (Counter)
Total number of leader election events.
Labels:
status: Event type (acquired,lost,renewed)
bindy_firestoned_io_leader_status (Gauge)
Current leader election status (1 = leader, 0 = follower).
Labels:
pod_name: Name of the pod
# Current leader
bindy_firestoned_io_leader_status == 1
# Leader election rate
rate(bindy_firestoned_io_leader_elections_total[5m])
Performance Metrics
bindy_firestoned_io_generation_observation_lag_seconds (Histogram)
Lag between resource spec generation change and controller observation.
Labels:
resource_type: Kind of resource
Buckets: 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, 120.0
# Average observation lag
rate(bindy_firestoned_io_generation_observation_lag_seconds_sum[5m])
/ rate(bindy_firestoned_io_generation_observation_lag_seconds_count[5m])
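The average queries work because every histogram also exports a `_sum` and a `_count` series, and their quotient is the mean observation. A minimal sketch of the same division on a raw scrape follows; it pools all resource types and covers the whole process lifetime. The function name `histogram_mean` is hypothetical.

```shell
# Hypothetical helper: mean observation of a histogram from a scrape on stdin.
# Usage: <scrape> | histogram_mean <metric name without suffix>
histogram_mean() {
  awk -v name="$1" '
    index($1, name "_sum") == 1   { sum += $NF }
    index($1, name "_count") == 1 { count += $NF }
    END {
      if (count > 0) printf "%.3f\n", sum / count
      else print "n/a"
    }
  '
}

# Usage:
#   curl -s http://localhost:8080/metrics |
#     histogram_mean bindy_firestoned_io_generation_observation_lag_seconds
```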
Prometheus Configuration
The operator deployment includes Prometheus scrape annotations:
annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8080"
  prometheus.io/path: "/metrics"
Prometheus will automatically discover and scrape these metrics if configured with Kubernetes service discovery.
Example Queries
# Reconciliation success rate (last 5 minutes)
sum(rate(bindy_firestoned_io_reconciliations_total{status="success"}[5m]))
/ sum(rate(bindy_firestoned_io_reconciliations_total[5m]))
# DNSZone reconciliation p95 latency
histogram_quantile(0.95,
  sum(rate(bindy_firestoned_io_reconciliation_duration_seconds_bucket{resource_type="DNSZone"}[5m])) by (le)
)
# Top 10 resource types by error rate (last hour)
topk(10,
  sum(rate(bindy_firestoned_io_errors_total[1h])) by (resource_type)
)
# Active resources per type
sum(bindy_firestoned_io_resources_active) by (resource_type)
# Requeue rate by resource type and reason
sum(rate(bindy_firestoned_io_requeues_total[5m])) by (resource_type, reason)
Grafana Dashboard
Import the Bindy operator dashboard (coming soon) or create custom panels using the queries above.
Recommended panels:
- Reconciliation Rate - Total reconciliations/sec by resource type
- Reconciliation Latency - P50, P95, P99 latencies
- Error Rate - Errors/sec by resource type and error category
- Active Resources - Gauge showing current active resources
- Leader Status - Current leader pod and election events
- Resource Lifecycle - Created/Updated/Deleted rates
Resource Metrics
Pod Metrics
View CPU and memory usage:
# All DNS pods
kubectl top pods -n dns-system
# Specific instance
kubectl top pods -n dns-system -l instance=primary-dns
# Sort by CPU
kubectl top pods -n dns-system --sort-by=cpu
# Sort by memory
kubectl top pods -n dns-system --sort-by=memory
Node Metrics
# Node resource usage
kubectl top nodes
# Detailed node info
kubectl describe node <node-name>
DNS Query Metrics
Using BIND9 Statistics
Enable the BIND9 statistics channel (future enhancement):
spec:
  config:
    statisticsChannels:
      - address: "127.0.0.1"
        port: 8053
Query Counters
Monitor query rate and types:
- Total queries received
- Queries by record type (A, AAAA, MX, etc.)
- Successful vs failed queries
- NXDOMAIN responses
Performance Metrics
Query Latency
Measure DNS query response time:
# Test query latency
time dig @<dns-server-ip> example.com
# Multiple queries for average
for i in {1..10}; do time dig @<dns-server-ip> example.com +short; done
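The `time` wrapper measures the whole dig process, including startup. dig also reports the server round trip itself in a `;; Query time: N msec` line, which can be averaged across runs. The sketch below does that; note that `+short` suppresses the statistics line, so it must be left off. The function name `avg_query_time` is hypothetical.

```shell
# Hypothetical helper: average the "Query time" values (in msec) that dig
# prints, reading one or more dig outputs from stdin.
avg_query_time() {
  awk '
    /Query time:/ { sum += $4; n++ }
    END {
      if (n > 0) printf "%.1f\n", sum / n
      else print "n/a"
    }
  '
}

# Usage (replace <dns-server-ip> with the server under test):
#   for i in $(seq 1 10); do dig @<dns-server-ip> example.com; done | avg_query_time
```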
Zone Transfer Metrics
Monitor zone transfer performance:
- Transfer duration
- Transfer size
- Transfer failures
- Lag between primary and secondary
Kubernetes Metrics
Resource Utilization
# View resource requests vs limits
kubectl describe pod -n dns-system <pod-name> | grep -A5 "Limits:\|Requests:"
Pod Health
# Pod status and restarts
kubectl get pods -n dns-system -o wide
# Events
kubectl get events -n dns-system --sort-by='.lastTimestamp'
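For a quick restart check, the pod listing can be filtered by its RESTARTS column. This sketch assumes the default table output of `kubectl get pods`, where RESTARTS is the fourth column, and reports lifetime restart counts rather than restarts per hour. The function name `restarted_pods` is hypothetical.

```shell
# Hypothetical helper: print pods whose restart count exceeds a threshold.
# Reads `kubectl get pods` table output (with header) on stdin.
# Usage: ... | restarted_pods [min_restarts]   (default 0)
restarted_pods() {
  awk -v min="${1:-0}" 'NR > 1 && $4 + 0 > min + 0 { print $1, $4 }'
}

# Usage:
#   kubectl get pods -n dns-system | restarted_pods 3
```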
Prometheus Integration
BIND9 Exporter
Deploy bind_exporter as a sidecar (future enhancement):
containers:
  - name: bind-exporter
    image: prometheuscommunity/bind-exporter:latest
    args:
      - "--bind.stats-url=http://localhost:8053"
    ports:
      - name: metrics
        containerPort: 9119
Service Monitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: bindy-metrics
spec:
  selector:
    matchLabels:
      app: bind9
  endpoints:
    - port: metrics
      interval: 30s
Key Metrics to Monitor
- Query Rate - Queries per second
- Query Latency - Response time
- Error Rate - Failed queries percentage
- Cache Hit Ratio - Cache effectiveness
- Zone Transfer Status - Success/failure of transfers
- Resource Usage - CPU and memory utilization
- Pod Health - Running vs desired replicas
Grafana Dashboards
Create dashboards for:
DNS Overview
- Total query rate
- Average latency
- Error rate
- Top queried domains
Instance Health
- Pod status
- CPU/memory usage
- Restart count
- Network I/O
Zone Management
- Zones count
- Records per zone
- Zone transfer status
- Serial numbers
Alerting Thresholds
Recommended alert thresholds:
| Metric | Warning | Critical |
|---|---|---|
| CPU Usage | > 70% | > 90% |
| Memory Usage | > 70% | > 90% |
| Query Latency | > 100ms | > 500ms |
| Error Rate | > 1% | > 5% |
| Pod Restarts | > 3/hour | > 10/hour |
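The table reads as a two-level comparison: a value above the critical threshold is critical, otherwise a value above the warning threshold is a warning. A minimal sketch of that rule for ad hoc checks in scripts follows; the function name `classify` is hypothetical, and the caller supplies values and thresholds in the same unit.

```shell
# Hypothetical helper: classify a measurement against warning and critical
# thresholds, using strict "greater than" as in the table.
# Usage: classify <value> <warning> <critical>
classify() {
  awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
    if (v + 0 > c + 0)      print "critical"
    else if (v + 0 > w + 0) print "warning"
    else                    print "ok"
  }'
}

# Example: query latency of 120 ms against the 100 ms / 500 ms thresholds
#   classify 120 100 500
```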
Best Practices
- Baseline metrics - Establish normal operating ranges
- Set appropriate alerts - Avoid alert fatigue
- Monitor trends - Look for gradual degradation
- Capacity planning - Use metrics to plan scaling
- Regular review - Review dashboards weekly