# High Availability

Design and implement highly available DNS infrastructure with Bindy.
## Overview

High availability (HA) DNS ensures continuous DNS service even during:

- Pod failures
- Node failures
- Availability zone outages
- Regional outages
- Planned maintenance
## HA Architecture Components

### 1. Multiple Replicas

Run multiple replicas of each Bind9Instance:

```yaml
apiVersion: bindy.firestoned.io/v1alpha1
kind: Bind9Instance
metadata:
  name: primary-dns
spec:
  replicas: 3  # Multiple replicas for pod-level HA
```

Benefits:
- Survives pod crashes
- Load distribution
- Zero-downtime updates (see the rollout sketch below)
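
Zero-downtime updates also depend on the rollout never removing a serving pod before its replacement is ready. A minimal sketch of the standard `apps/v1` rollout settings that achieve this, assuming you can set them on (or the operator applies them to) the underlying Deployment:

```yaml
# Sketch only: keep every replica serving while an update rolls out
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never take a pod down before its replacement is Ready
      maxSurge: 1         # add one new pod at a time
```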

### 2. Multiple Instances

Deploy separate primary and secondary instances:

```yaml
# Primary instance
apiVersion: bindy.firestoned.io/v1alpha1
kind: Bind9Instance
metadata:
  name: primary-dns
  labels:
    dns-role: primary
spec:
  replicas: 2
---
# Secondary instance
apiVersion: bindy.firestoned.io/v1alpha1
kind: Bind9Instance
metadata:
  name: secondary-dns
  labels:
    dns-role: secondary
spec:
  replicas: 2
```

Benefits:
- Role separation
- Independent scaling
- Failover capability

### 3. Geographic Distribution

Deploy instances across multiple regions:

```yaml
# US East primary
apiVersion: bindy.firestoned.io/v1alpha1
kind: Bind9Instance
metadata:
  name: primary-us-east
  labels:
    dns-role: primary
    region: us-east-1
spec:
  replicas: 2
---
# US West secondary
apiVersion: bindy.firestoned.io/v1alpha1
kind: Bind9Instance
metadata:
  name: secondary-us-west
  labels:
    dns-role: secondary
    region: us-west-2
spec:
  replicas: 2
---
# EU secondary
apiVersion: bindy.firestoned.io/v1alpha1
kind: Bind9Instance
metadata:
  name: secondary-eu-west
  labels:
    dns-role: secondary
    region: eu-west-1
spec:
  replicas: 2
```

Benefits:
- Regional failure tolerance
- Lower latency for global users
- Regulatory compliance (data locality)

## HA Patterns

### Pattern 1: Active-Passive

One active primary, multiple passive secondaries:

```mermaid
graph LR
    primary["Primary<br/>(Active)<br/>us-east-1"]
    sec1["Secondary<br/>(Passive)<br/>us-west-2"]
    sec2["Secondary<br/>(Passive)<br/>eu-west-1"]
    clients["Clients query any"]
    primary -->|AXFR| sec1
    sec1 -->|AXFR| sec2
    primary --> clients
    sec1 --> clients
    sec2 --> clients
    style primary fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px
    style sec1 fill:#e1f5ff,stroke:#01579b,stroke-width:2px
    style sec2 fill:#e1f5ff,stroke:#01579b,stroke-width:2px
    style clients fill:#fff9c4,stroke:#f57f17,stroke-width:2px
```

- Updates go to primary only
- Secondaries receive updates via zone transfer (see the serial check below)
- Clients query any available instance
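
To confirm the active-passive pattern is healthy, compare the SOA serial on the primary and each secondary; once transfers settle they should match. The Service hostnames below are assumptions based on the instance names used in this guide:

```bash
# Compare SOA serials (substitute the Services that front your instances)
dig @primary-dns.dns-system.svc.cluster.local example.com SOA +short
dig @secondary-dns.dns-system.svc.cluster.local example.com SOA +short
```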

### Pattern 2: Multi-Primary

Multiple primaries in different regions:

```mermaid
graph LR
    primary1["Primary<br/>(zone-a)<br/>us-east-1"]
    primary2["Primary<br/>(zone-b)<br/>eu-west-1"]
    primary1 <-->|Sync| primary2
    style primary1 fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px
    style primary2 fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px
```

- Different zones on different primaries (see the sketch below)
- Geographic distribution of updates
- Careful coordination required
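
A rough sketch of this layout using only the `Bind9Instance` fields shown earlier; the `zone-group` label is a hypothetical convention for deciding which primary owns which zones:

```yaml
# Sketch: one primary per region, each authoritative for a different zone group.
# The zone-group label is hypothetical - use whatever selector convention your zones reference.
apiVersion: bindy.firestoned.io/v1alpha1
kind: Bind9Instance
metadata:
  name: primary-zone-a
  labels:
    dns-role: primary
    region: us-east-1
    zone-group: zone-a
spec:
  replicas: 2
---
apiVersion: bindy.firestoned.io/v1alpha1
kind: Bind9Instance
metadata:
  name: primary-zone-b
  labels:
    dns-role: primary
    region: eu-west-1
    zone-group: zone-b
spec:
  replicas: 2
```

Coordinate zone assignment carefully so that no zone is ever primary in two places at once.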

### Pattern 3: Anycast

Same IP announced from multiple locations:

```mermaid
graph TB
    client["Client Query (192.0.2.53)"]
    dns_us["DNS<br/>US"]
    dns_eu["DNS<br/>EU"]
    dns_apac["DNS<br/>APAC"]
    client --> dns_us
    client --> dns_eu
    client --> dns_apac
    style client fill:#fff9c4,stroke:#f57f17,stroke-width:2px
    style dns_us fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px
    style dns_eu fill:#e1f5ff,stroke:#01579b,stroke-width:2px
    style dns_apac fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
```

- Requires BGP routing (see the MetalLB sketch below)
- Lowest latency routing
- Automatic failover
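
One way to announce the shared address from every cluster is MetalLB in BGP mode. A minimal sketch, assuming MetalLB is installed and BGP peering with your routers is configured separately (the pool name and address are examples):

```yaml
# Sketch only: advertise the anycast address from this cluster;
# repeat the same configuration in every region
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: dns-anycast
  namespace: metallb-system
spec:
  addresses:
    - 192.0.2.53/32
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: dns-anycast
  namespace: metallb-system
spec:
  ipAddressPools:
    - dns-anycast
```

The LoadBalancer Service in front of each regional DNS instance then takes its external address from this pool, so clients reach the nearest healthy region.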

## Pod-Level HA

### Anti-Affinity

Spread pods across nodes:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: primary-dns
spec:
  replicas: 3
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: instance
                      operator: In
                      values:
                        - primary-dns
                topologyKey: kubernetes.io/hostname
```

### Topology Spread

Distribute across availability zones:

```yaml
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          instance: primary-dns
```

## Service-Level HA

### Liveness and Readiness Probes

Ensure only healthy pods serve traffic:

```yaml
livenessProbe:
  exec:
    command: ["dig", "@localhost", "version.bind", "txt", "chaos"]
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  exec:
    command: ["dig", "@localhost", "version.bind", "txt", "chaos"]
  initialDelaySeconds: 5
  periodSeconds: 5
```
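
If `dig` is not available in the container image, a plain TCP check on port 53 is a lighter-weight (though less thorough) alternative:

```yaml
# Alternative readiness check: verifies only that named accepts TCP
# connections on port 53, not that it answers queries correctly
readinessProbe:
  tcpSocket:
    port: 53
  initialDelaySeconds: 5
  periodSeconds: 5
```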

### Pod Disruption Budgets

Limit concurrent disruptions:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: primary-dns-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      instance: primary-dns
```

## Monitoring HA

### Check Instance Distribution

```bash
# View instances across regions
kubectl get bind9instances -A -L region

# View pod distribution across nodes
kubectl get pods -n dns-system -o wide

# Check zone spread: map each pod to its node, then each node to its zone
kubectl get pods -n dns-system -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName
kubectl get nodes -L topology.kubernetes.io/zone
```

### Test Failover

```bash
# Simulate pod failure
kubectl delete pod -n dns-system <pod-name>

# Verify automatic recovery
kubectl get pods -n dns-system -w

# Test DNS during failover
while true; do dig @$SERVICE_IP example.com +short; sleep 1; done
```

## Disaster Recovery

### Backup Strategy

```bash
# Regular backups of all CRDs
kubectl get bind9instances,dnszones,arecords,aaaarecords,cnamerecords,mxrecords,txtrecords,nsrecords,srvrecords,caarecords \
  -A -o yaml > backup-$(date +%Y%m%d).yaml
```
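
Backups can also be automated in-cluster. A minimal sketch using a CronJob; the image, ServiceAccount, and PVC names are assumptions to adapt to your environment:

```yaml
# Sketch: nightly export of all Bindy CRDs to a PersistentVolumeClaim.
# The service account needs get/list access to these resources.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: dns-crd-backup
  namespace: dns-system
spec:
  schedule: "0 2 * * *"   # daily at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: dns-backup      # hypothetical SA with read access
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: bitnami/kubectl:latest   # any image that ships kubectl and a shell
              command: ["/bin/sh", "-c"]
              args:
                - >
                  kubectl get bind9instances,dnszones,arecords,aaaarecords,cnamerecords,mxrecords,txtrecords,nsrecords,srvrecords,caarecords
                  -A -o yaml > /backups/backup-$(date +%Y%m%d).yaml
              volumeMounts:
                - name: backups
                  mountPath: /backups
          volumes:
            - name: backups
              persistentVolumeClaim:
                claimName: dns-backups        # hypothetical PVC
```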

### Recovery Procedures

- **Single Pod Failure** - Kubernetes automatically recreates the pod
- **Instance Failure** - Clients fail over to other instances
- **Regional Failure** - Zone data available from other regions
- **Complete Loss** - Restore from backup

```bash
# Restore from backup
kubectl apply -f backup-20241126.yaml
```

## Operator High Availability
The Bindy operator itself can run in high availability mode with automatic leader election. This ensures continuous DNS management even if operator pods fail.

### Leader Election

Multiple operator instances use Kubernetes Lease objects for distributed leader election:

```mermaid
graph TB
    op1["Operator<br/>Instance 1<br/>(Leader)"]
    op2["Operator<br/>Instance 2<br/>(Standby)"]
    op3["Operator<br/>Instance 3<br/>(Standby)"]
    lease["Kubernetes API<br/>Lease Object"]
    op1 --> lease
    op2 --> lease
    op3 --> lease
    style op1 fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px
    style op2 fill:#e1f5ff,stroke:#01579b,stroke-width:2px
    style op3 fill:#e1f5ff,stroke:#01579b,stroke-width:2px
    style lease fill:#fff9c4,stroke:#f57f17,stroke-width:2px
```

How it works:
- All operator instances attempt to acquire the lease
- One instance becomes the leader and starts reconciling resources
- Standby instances wait and monitor the lease
- If the leader fails, a standby automatically takes over (~15 seconds)

### HA Operator Deployment

Deploy multiple operator replicas with leader election enabled:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bindy
  namespace: dns-system
spec:
  replicas: 3  # Run 3 instances for HA
  selector:
    matchLabels:
      app: bindy
  template:
    metadata:
      labels:
        app: bindy
    spec:
      serviceAccountName: bindy
      containers:
        - name: operator
          image: ghcr.io/firestoned/bindy:latest
          env:
            # Leader election configuration
            - name: ENABLE_LEADER_ELECTION
              value: "true"
            - name: LEASE_NAME
              value: "bindy-leader"
            - name: LEASE_NAMESPACE
              value: "dns-system"
            - name: LEASE_DURATION_SECONDS
              value: "15"
            - name: LEASE_RENEW_DEADLINE_SECONDS
              value: "10"
            - name: LEASE_RETRY_PERIOD_SECONDS
              value: "2"
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
```

### Configuration Options

Environment variables for leader election:

| Variable | Default | Description |
|---|---|---|
| `ENABLE_LEADER_ELECTION` | `true` | Enable/disable leader election |
| `LEASE_NAME` | `bindy-leader` | Name of the Lease resource |
| `LEASE_NAMESPACE` | `dns-system` | Namespace for the Lease |
| `LEASE_DURATION_SECONDS` | `15` | How long the leader holds the lease |
| `LEASE_RENEW_DEADLINE_SECONDS` | `10` | Leader must renew before this deadline |
| `LEASE_RETRY_PERIOD_SECONDS` | `2` | How often to attempt lease acquisition |
| `POD_NAME` | `$HOSTNAME` | Unique identity for this operator instance |

### Monitoring Leader Election

Check which operator instance is the current leader:

```bash
# View the lease object
kubectl get lease -n dns-system bindy-leader -o yaml
```

The output shows the current leader:

```yaml
spec:
  holderIdentity: bindy-7d8f9c5b4d-x7k2m  # Current leader pod
  leaseDurationSeconds: 15
  renewTime: "2025-11-30T12:34:56Z"
```

Monitor operator logs to see leadership changes:

```bash
# Watch operator logs
kubectl logs -n dns-system deployment/bindy -f
```

Look for leadership events:

```text
INFO  Attempting to acquire lease bindy-leader
INFO  Lease acquired, this instance is now the leader
INFO  Starting all controllers
WARN  Leadership lost! Stopping all controllers...
INFO  Lease acquired, this instance is now the leader
```

### Failover Testing

Test automatic failover:

```bash
# Find current leader
LEADER=$(kubectl get lease -n dns-system bindy-leader -o jsonpath='{.spec.holderIdentity}')
echo "Current leader: $LEADER"

# Delete the leader pod
kubectl delete pod -n dns-system $LEADER

# Watch for new leader election (typically ~15 seconds)
kubectl get lease -n dns-system bindy-leader -w

# Verify DNS operations continue uninterrupted
kubectl get bind9instances -A
```

### RBAC Requirements

Leader election requires additional permissions in the operator's Role:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: bindy
  namespace: dns-system
rules:
  # Leases for leader election
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "create", "update", "patch"]
```
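
You can verify the permissions are in place by impersonating the operator's ServiceAccount:

```bash
# Confirm the bindy service account can manage Lease objects
kubectl auth can-i create leases.coordination.k8s.io -n dns-system \
  --as=system:serviceaccount:dns-system:bindy
kubectl auth can-i update leases.coordination.k8s.io -n dns-system \
  --as=system:serviceaccount:dns-system:bindy
```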

### Troubleshooting

**Operator not reconciling resources:**

```bash
# Check which instance is leader
kubectl get lease -n dns-system bindy-leader -o jsonpath='{.spec.holderIdentity}'

# Verify that pod exists and is running
kubectl get pods -n dns-system

# Check operator logs
kubectl logs -n dns-system deployment/bindy -f
```

**Multiple operators reconciling (split brain):**

This should never happen with proper leader election. If you suspect it:

```bash
# Check lease configuration
kubectl get lease -n dns-system bindy-leader -o yaml

# Verify all operators use the same LEASE_NAME
kubectl get deployment -n dns-system bindy -o yaml | grep LEASE_NAME

# Force lease release (recreate it)
kubectl delete lease -n dns-system bindy-leader
```

**Leader election disabled but multiple replicas running:**

This will cause conflicts. Either:

- Enable leader election: set `ENABLE_LEADER_ELECTION=true`
- Or run a single replica: `kubectl scale deployment bindy --replicas=1 -n dns-system`

### Performance Impact

Leader election adds minimal overhead:

- **Failover time**: ~15 seconds (configurable via `LEASE_DURATION_SECONDS`)
- **Network traffic**: one lease renewal every 2 seconds, from the leader only
- **CPU/Memory**: negligible (<1% increase)

## Best Practices

- **Run 3+ Operator Replicas** - For operator HA with leader election
- **Run 3+ DNS Instance Replicas** - Spreads query load and tolerates simultaneous pod failures
- **Multi-AZ Deployment** - Spread across availability zones
- **Geographic Redundancy** - At least 2 regions for critical zones
- **Monitor Continuously** - Alert on degraded HA
- **Test Failover** - Regular disaster recovery drills (both operator and DNS instances)
- **Automate Recovery** - Use Kubernetes self-healing
- **Document Procedures** - Runbooks for incidents
- **Enable Leader Election** - Always run the operator with `ENABLE_LEADER_ELECTION=true` in production
- **Monitor Lease Health** - Alert if lease ownership changes frequently (indicates instability)

## Next Steps

- Zone Transfers - Configure zone replication
- Replication - Multi-region replication strategies
- Performance - Optimize for high availability