
Incident Response Playbooks - Bindy DNS Controller

Version: 1.0 | Last Updated: 2025-12-17 | Owner: Security Team | Compliance: SOX 404, PCI-DSS 12.10.1, Basel III


Overview

This document provides step-by-step incident response playbooks for security incidents involving the Bindy DNS Controller. Each playbook follows the NIST Incident Response Lifecycle: Preparation, Detection & Analysis, Containment, Eradication, Recovery, and Post-Incident Activity.

Objectives

  1. Rapid Response: Minimize time between detection and containment
  2. Clear Procedures: Provide step-by-step guidance for responders
  3. Minimize Impact: Reduce blast radius and prevent escalation
  4. Evidence Preservation: Maintain audit trail for forensics and compliance
  5. Continuous Improvement: Learn from incidents to strengthen defenses

Incident Classification

Severity Levels

| Severity | Definition | Response Time | Escalation |
|----------|------------|---------------|------------|
| 🔴 CRITICAL | Complete service outage, data breach, or active exploitation | Immediate (< 15 min) | CISO, CTO, VP Engineering |
| 🟠 HIGH | Degraded service, vulnerability with known exploit, unauthorized access | < 1 hour | Security Lead, Engineering Manager |
| 🟡 MEDIUM | Vulnerability without exploit, suspicious activity, minor service impact | < 4 hours | Security Team, On-Call Engineer |
| 🔵 LOW | Informational findings, potential issues, no immediate risk | < 24 hours | Security Team |

Response Team

Roles and Responsibilities

| Role | Responsibilities | Contact |
|------|------------------|---------|
| Incident Commander | Overall coordination, decision-making, stakeholder communication | On-call rotation |
| Security Lead | Threat analysis, forensics, remediation guidance | security@firestoned.io |
| Platform Engineer | Kubernetes cluster operations, pod management | platform@firestoned.io |
| DNS Engineer | BIND9 expertise, zone management | dns-team@firestoned.io |
| Compliance Officer | Regulatory reporting, evidence collection | compliance@firestoned.io |
| Communications | Internal/external communication, customer notifications | comms@firestoned.io |

On-Call Rotation

  • Primary: Security Lead (24/7 PagerDuty)
  • Secondary: Platform Engineer (escalation)
  • Tertiary: CTO (executive escalation)

Communication Protocols

Internal Communication

War Room (Incident > MEDIUM):

  • Slack Channel: #incident-[YYYY-MM-DD]-[number]
  • Video Call: Zoom war room (pinned in channel)
  • Status Updates: Every 30 minutes during active incident

Status Page:

  • Update status.firestoned.io for customer-impacting incidents
  • Templates: Investigating → Identified → Monitoring → Resolved

External Communication

Regulatory Reporting (CRITICAL incidents only):

  • PCI-DSS: Notify acquiring bank within 24 hours if cardholder data compromised
  • SOX: Document incident for quarterly IT controls audit
  • Basel III: Report cyber risk event to risk management committee

Customer Notification:

  • Criteria: Data breach, prolonged outage (> 4 hours), SLA violation
  • Channel: Email to registered contacts, status page
  • Timeline: Initial notification within 2 hours, updates every 4 hours

Playbook Index

| ID | Playbook | Severity | Trigger |
|----|----------|----------|---------|
| P1 | Critical Vulnerability Detected | 🔴 CRITICAL | GitHub issue, CVE alert, security scan |
| P2 | Compromised Controller Pod | 🔴 CRITICAL | Anomalous behavior, unauthorized access |
| P3 | DNS Service Outage | 🔴 CRITICAL | All BIND9 pods down, DNS queries failing |
| P4 | RNDC Key Compromise | 🔴 CRITICAL | Key leaked, unauthorized RNDC access |
| P5 | Unauthorized DNS Changes | 🟠 HIGH | Unexpected zone modifications |
| P6 | DDoS Attack | 🟠 HIGH | Query flood, resource exhaustion |
| P7 | Supply Chain Compromise | 🔴 CRITICAL | Malicious commit, compromised dependency |

Playbooks


P1: Critical Vulnerability Detected

Severity: 🔴 CRITICAL | Response Time: Immediate (< 15 minutes) | SLA: Patch deployed within 24 hours

Trigger

  • Daily security scan detects CRITICAL vulnerability (CVSS 9.0-10.0)
  • GitHub Security Advisory published for Bindy dependency
  • CVE announced with active exploitation in the wild
  • Automated GitHub issue created: [SECURITY] CRITICAL vulnerability detected

Detection

# Automated detection via GitHub Actions
# Workflow: .github/workflows/security-scan.yaml
# Frequency: Daily at 00:00 UTC

# Manual check:
cargo audit --deny warnings
trivy image ghcr.io/firestoned/bindy:latest --severity CRITICAL,HIGH

Response Procedure

Phase 1: Detection & Analysis (T+0 to T+15 min)

Step 1.1: Acknowledge Incident

# Acknowledge PagerDuty alert
# Create Slack war room: #incident-[date]-vuln-[CVE-ID]

Step 1.2: Assess Vulnerability

# Review GitHub issue or security scan results
# Questions to answer:
# - What is the vulnerable component? (dependency, base image, etc.)
# - What is the CVSS score and attack vector?
# - Is there a known exploit (Exploit-DB, Metasploit)?
# - Is Bindy actually vulnerable (code path reachable)?

Step 1.3: Check Production Exposure

# Verify if vulnerable version is deployed
kubectl get deploy -n dns-system bindy -o jsonpath='{.spec.template.spec.containers[0].image}'

# Check image digest
kubectl get pods -n dns-system -l app.kubernetes.io/name=bindy -o jsonpath='{.items[0].spec.containers[0].image}'

# Compare with vulnerable version from security advisory

Step 1.4: Determine Impact

  • If Bindy is NOT vulnerable (code path not reachable):

    • Update to patched version at next release (non-urgent)
    • Document exception in SECURITY.md
    • Close incident as FALSE POSITIVE
  • If Bindy IS vulnerable (exploitable in production):

    • PROCEED TO CONTAINMENT (Phase 2)

Phase 2: Containment (T+15 min to T+1 hour)

Step 2.1: Isolate Vulnerable Pods (if actively exploited)

# Scale down controller to prevent further exploitation
kubectl scale deploy -n dns-system bindy --replicas=0

# NOTE: This stops DNS updates but does NOT affect DNS queries
# BIND9 continues serving existing zones

Step 2.2: Review Audit Logs

# Check for signs of exploitation
kubectl logs -n dns-system -l app.kubernetes.io/name=bindy --tail=1000 | grep -i "error\|panic\|exploit"

# Review Kubernetes audit logs (if available)
# Look for: Unusual API calls, secret reads, privilege escalation attempts

Step 2.3: Assess Blast Radius

  • Controller compromised? Check for unauthorized DNS changes, secret reads
  • BIND9 affected? Check if RNDC keys were stolen
  • Data exfiltration? Review network logs for unusual egress traffic

Phase 3: Eradication (T+1 hour to T+24 hours)

Step 3.1: Apply Patch

Option A: Update Dependency (Rust crate)

# Update specific dependency
cargo update -p <vulnerable-package>

# Verify fix
cargo audit

# Run tests
cargo test

# Build new image (capture the tag once so build and push reference the same image;
# evaluating $(date +%s) twice would produce two different tags)
TAG=ghcr.io/firestoned/bindy:hotfix-$(date +%s)
docker build -t "$TAG" .

# Push to registry
docker push "$TAG"

Option B: Update Base Image

# Update Dockerfile to latest Chainguard image
# docker/Dockerfile:
FROM cgr.dev/chainguard/static:latest-dev  # Prefer pinning by digest (@sha256:...) in production

# Rebuild and push (reuse one tag so build and push reference the same image)
TAG=ghcr.io/firestoned/bindy:hotfix-$(date +%s)
docker build -t "$TAG" .
docker push "$TAG"

Option C: Apply Workaround (if no patch available)

  • Disable vulnerable feature flag
  • Add input validation to prevent exploit
  • Document workaround in SECURITY.md

Step 3.2: Verify Fix

# Scan the patched image (substitute the exact hotfix tag built in Step 3.1)
trivy image ghcr.io/firestoned/bindy:<hotfix-tag> --severity CRITICAL,HIGH

# Expected: No CRITICAL vulnerabilities found

Step 3.3: Emergency Release

# Tag release
git tag -s hotfix-v0.1.1 -m "Security hotfix: CVE-XXXX-XXXXX"
git push origin hotfix-v0.1.1

# Trigger release workflow
# Verify signed commits, SBOM generation, vulnerability scans pass

Phase 4: Recovery (T+24 hours to T+48 hours)

Step 4.1: Deploy Patched Version

# Update deployment manifest (GitOps)
# deploy/controller/deployment.yaml:
spec:
  template:
    spec:
      containers:
      - name: bindy
        image: ghcr.io/firestoned/bindy:hotfix-v0.1.1  # Patched version

# Apply via FluxCD (GitOps) or manually
kubectl apply -f deploy/controller/deployment.yaml

# Verify rollout
kubectl rollout status deploy/bindy -n dns-system

# Confirm pods running patched version
kubectl get pods -n dns-system -l app.kubernetes.io/name=bindy -o jsonpath='{.items[0].spec.containers[0].image}'

Step 4.2: Verify Service Health

# Check controller logs
kubectl logs -n dns-system -l app.kubernetes.io/name=bindy --tail=100

# Verify reconciliation working
kubectl get dnszones --all-namespaces
kubectl describe dnszone -n team-web example-com

# Test DNS resolution
dig @<bind9-ip> example.com

Step 4.3: Run Security Scans

# Full security scan
cargo audit
trivy image ghcr.io/firestoned/bindy:hotfix-v0.1.1

# Expected: All clear

Phase 5: Post-Incident (T+48 hours to T+1 week)

Step 5.1: Document Incident

  • Update CHANGELOG.md with hotfix details
  • Document root cause in incident report
  • Update SECURITY.md if needed (known issues, exceptions)

Step 5.2: Notify Stakeholders

  • Update status page: "Resolved - Security patch deployed"
  • Send email to compliance team (attach incident report)
  • Notify customers if required (data breach, SLA violation)

Step 5.3: Post-Incident Review (PIR)

  • What went well? (Detection, response time, communication)
  • What could improve? (Patch process, testing, automation)
  • Action items: (Update playbook, add monitoring, improve defenses)

Step 5.4: Update Metrics

  • MTTR (Mean Time To Remediate): ____ hours
  • SLA compliance: ✅ Met / ❌ Missed
  • Update vulnerability dashboard

Success Criteria

  • ✅ Patch deployed within 24 hours
  • ✅ No exploitation detected in production
  • ✅ Service availability maintained (or minimal downtime)
  • ✅ All security scans pass post-patch
  • ✅ Incident documented and reported to compliance

P2: Compromised Controller Pod

Severity: 🔴 CRITICAL | Response Time: Immediate (< 15 minutes) | Impact: Unauthorized DNS modifications, secret theft, lateral movement

Trigger

  • Anomalous controller behavior (unexpected API calls, network traffic)
  • Unauthorized modifications to DNS zones
  • Security alert from SIEM or IDS
  • Pod logs show suspicious activity (reverse shell, file downloads)

Detection

# Monitor controller logs for anomalies
kubectl logs -n dns-system -l app.kubernetes.io/name=bindy --tail=500 | grep -E "(shell|wget|curl|nc|bash)"

# Check for unexpected processes in pod
kubectl exec -n dns-system <controller-pod> -- ps aux

# Review Kubernetes audit logs
# Look for: Unusual secret reads, excessive API calls, privilege escalation attempts

Response Procedure

Phase 1: Detection & Analysis (T+0 to T+15 min)

Step 1.1: Confirm Compromise

# Check controller logs
kubectl logs -n dns-system <controller-pod> --tail=1000 > /tmp/controller-logs.txt

# Indicators of compromise (IOCs):
# - Reverse shell activity (nc, bash -i, /dev/tcp/)
# - File downloads (wget, curl to suspicious domains)
# - Privilege escalation attempts (sudo, setuid)
# - Crypto mining (high CPU, connections to mining pools)

Step 1.2: Assess Impact

# Check for unauthorized DNS changes
kubectl get dnszones --all-namespaces -o yaml > /tmp/dnszones-snapshot.yaml

# Compare with known good state (GitOps repo; a plain file-vs-directory diff fails,
# so concatenate the zone manifests -- ordering here is illustrative)
diff /tmp/dnszones-snapshot.yaml <(cat /path/to/gitops/dnszones/*.yaml)

# Check for secret reads
# Review Kubernetes audit logs for GET /api/v1/namespaces/dns-system/secrets/*

Phase 2: Containment (T+15 min to T+1 hour)

Step 2.1: Isolate Controller Pod

# Apply network policy to block all egress (prevent data exfiltration)
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: bindy-controller-quarantine
  namespace: dns-system
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: bindy
  policyTypes:
  - Egress
  egress: []  # Block all egress
EOF

# Delete compromised pod (force recreation)
kubectl delete pod -n dns-system <controller-pod> --force --grace-period=0

Step 2.2: Rotate Credentials

# Rotate RNDC key (if potentially stolen)
# Generate new key
tsig-keygen -a hmac-sha256 rndc-key > /tmp/new-rndc-key.conf

# Update secret
kubectl create secret generic rndc-key-new \
  --from-file=rndc.key=/tmp/new-rndc-key.conf \
  -n dns-system \
  --dry-run=client -o yaml | kubectl apply -f -

# Update BIND9 pods to use new key (restart required)
kubectl rollout restart statefulset/bind9-primary -n dns-system
kubectl rollout restart statefulset/bind9-secondary -n dns-system

# Delete old secret
kubectl delete secret rndc-key -n dns-system

Step 2.3: Preserve Evidence

# Save pod logs before deletion
kubectl logs -n dns-system <controller-pod> --all-containers > /tmp/forensics/controller-logs-$(date +%s).txt

# Capture pod manifest
kubectl get pod -n dns-system <controller-pod> -o yaml > /tmp/forensics/controller-pod-manifest.yaml

# Save Kubernetes events
kubectl get events -n dns-system --sort-by='.lastTimestamp' > /tmp/forensics/events.txt

# Export audit logs (if available)
# - ServiceAccount API calls
# - Secret access logs
# - DNS zone modifications

Phase 3: Eradication (T+1 hour to T+4 hours)

Step 3.1: Root Cause Analysis

# Analyze logs for initial compromise vector
# Common vectors:
# - Vulnerability in controller code (RCE, memory corruption)
# - Compromised dependency (malicious crate)
# - Supply chain attack (malicious image)
# - Misconfigured RBAC (excessive permissions)

# Check image provenance
kubectl get pod -n dns-system <controller-pod> -o jsonpath='{.spec.containers[0].image}'

# Verify image signature and SBOM
# If signature invalid or SBOM shows unexpected dependencies β†’ supply chain attack

Step 3.2: Patch Vulnerability

  • If controller code vulnerability: Apply patch (see P1)
  • If supply chain attack: Investigate upstream, rollback to known good image
  • If RBAC misconfiguration: Fix RBAC, re-run verification script

Step 3.3: Scan for Backdoors

# Scan all images for malware
trivy image ghcr.io/firestoned/bindy:latest --scanners vuln,secret,misconfig

# Check for unauthorized SSH keys, cron jobs, persistence mechanisms
kubectl exec -n dns-system <new-controller-pod> -- ls -la /root/.ssh/
kubectl exec -n dns-system <new-controller-pod> -- cat /etc/crontab

Phase 4: Recovery (T+4 hours to T+24 hours)

Step 4.1: Deploy Clean Controller

# Verify image integrity
# - Signed commits in Git history
# - Signed container image with provenance
# - Clean vulnerability scan

# Deploy patched controller
kubectl rollout restart deploy/bindy -n dns-system

# Remove quarantine network policy
kubectl delete networkpolicy bindy-controller-quarantine -n dns-system

# Verify health
kubectl get pods -n dns-system -l app.kubernetes.io/name=bindy
kubectl logs -n dns-system -l app.kubernetes.io/name=bindy --tail=100

Step 4.2: Verify DNS Zones

# Restore DNS zones from GitOps (if unauthorized changes detected)
# 1. Revert changes in Git
# 2. Force FluxCD reconciliation
flux reconcile kustomization bindy-system --with-source

# Verify all zones match expected state
kubectl get dnszones --all-namespaces -o yaml | diff - <(cat /path/to/gitops/dnszones/*.yaml)

Step 4.3: Validate Service

# Test DNS resolution
dig @<bind9-ip> example.com

# Verify controller reconciliation
kubectl get dnszones --all-namespaces
kubectl describe dnszone -n team-web example-com | grep "Ready.*True"

Phase 5: Post-Incident (T+24 hours to T+1 week)

Step 5.1: Forensic Analysis

  • Engage forensics team if required
  • Analyze preserved logs for IOCs
  • Timeline of compromise (initial access → lateral movement → exfiltration)

Step 5.2: Notify Stakeholders

  • Compliance: Report to SOX/PCI-DSS auditors (security incident)
  • Customers: If DNS records were modified or data exfiltrated
  • Regulators: If required by Basel III (cyber risk event reporting)

Step 5.3: Improve Defenses

  • Short-term: Implement missing network policies (L-1)
  • Medium-term: Add runtime security monitoring (Falco, Tetragon)
  • Long-term: Implement admission controller for image verification

Step 5.4: Update Documentation

  • Update incident playbook with lessons learned
  • Document new IOCs for detection rules
  • Update threat model (docs/security/THREAT_MODEL.md)

Success Criteria

  • ✅ Compromised pod isolated within 15 minutes
  • ✅ No lateral movement to other pods/namespaces
  • ✅ Credentials rotated (RNDC keys)
  • ✅ Root cause identified and patched
  • ✅ DNS service fully restored with verified integrity
  • ✅ Forensic evidence preserved for investigation

P3: DNS Service Outage

Severity: 🔴 CRITICAL | Response Time: Immediate (< 15 minutes) | Impact: All DNS queries failing, service unavailable

Trigger

  • All BIND9 pods down (CrashLoopBackOff, OOMKilled)
  • DNS queries timing out
  • Monitoring alert: "DNS service unavailable"
  • Customer reports: "Cannot resolve domain names"

Response Procedure

Phase 1: Detection & Analysis (T+0 to T+10 min)

Step 1.1: Confirm Outage

# Test DNS resolution
dig @<bind9-loadbalancer-ip> example.com

# Check pod status
kubectl get pods -n dns-system -l app.kubernetes.io/name=bind9

# Check service endpoints
kubectl get svc -n dns-system bind9-dns -o wide
kubectl get endpoints -n dns-system bind9-dns

Step 1.2: Identify Root Cause

# Check pod logs
kubectl logs -n dns-system <bind9-pod> --tail=200

# Common root causes:
# - OOMKilled (memory exhaustion)
# - CrashLoopBackOff (configuration error, missing ConfigMap)
# - ImagePullBackOff (registry issue, image not found)
# - Pending (insufficient resources, node failure)

# Check events
kubectl describe pod -n dns-system <bind9-pod>

Phase 2: Containment & Quick Fix (T+10 min to T+30 min)

Scenario A: OOMKilled (Memory Exhaustion)

# Increase memory limit
kubectl patch statefulset bind9-primary -n dns-system -p '
spec:
  template:
    spec:
      containers:
      - name: bind9
        resources:
          limits:
            memory: "512Mi"  # Increase from 256Mi
'

# Restart pods
kubectl rollout restart statefulset/bind9-primary -n dns-system

Scenario B: Configuration Error

# Check ConfigMap
kubectl get cm -n dns-system bind9-config -o yaml

# Common issues:
# - Syntax error in named.conf
# - Missing zone file
# - Invalid RNDC key

# Fix configuration (update ConfigMap)
kubectl edit cm bind9-config -n dns-system

# Restart pods to apply new config
kubectl rollout restart statefulset/bind9-primary -n dns-system

Scenario C: Image Pull Failure

# Check image pull secret
kubectl get secret -n dns-system ghcr-pull-secret

# Verify image exists
docker pull ghcr.io/firestoned/bindy:latest

# If image missing, rollback to previous version
kubectl rollout undo statefulset/bind9-primary -n dns-system

Phase 3: Recovery (T+30 min to T+2 hours)

Step 3.1: Verify Service Restoration

# Check all pods healthy
kubectl get pods -n dns-system -l app.kubernetes.io/name=bind9

# Test DNS resolution (all zones)
dig @<bind9-ip> example.com
dig @<bind9-ip> test.example.com

# Check service endpoints
kubectl get endpoints -n dns-system bind9-dns
# Should show all healthy pod IPs

Step 3.2: Validate Data Integrity

# Verify all zones loaded
kubectl exec -n dns-system <bind9-pod> -- rndc status

# Check zone serial numbers (ensure no data loss)
dig @<bind9-ip> example.com SOA

# Compare with expected serial (from GitOps)
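The serial comparison can be scripted. A minimal sketch, with a hard-coded sample SOA answer standing in for a live query (zone name and serial are illustrative; in an incident, replace the printf with `dig @<bind9-ip> example.com SOA +short`):

```shell
# Sketch: compare the live SOA serial with the serial recorded in GitOps.
serial_of() {
  # `dig +short SOA` prints: mname rname serial refresh retry expire minimum
  awk '{print $3}'
}

expected=2025121701   # serial from the GitOps zone file (illustrative)
live=$(printf 'ns1.example.com. admin.example.com. 2025121701 7200 3600 1209600 3600\n' | serial_of)

if [ "$live" -eq "$expected" ]; then
  echo "serial OK: $live"
else
  echo "serial MISMATCH: live=$live expected=$expected"
fi
```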

Phase 4: Post-Incident (T+2 hours to T+1 week)

Step 4.1: Root Cause Analysis

  • Why did BIND9 exhaust memory? (Too many zones, memory leak, query flood)
  • Why did configuration break? (Controller bug, bad CRD validation, manual change)
  • Why did image pull fail? (Registry downtime, authentication issue)

Step 4.2: Preventive Measures

  • Add horizontal pod autoscaling (HPA based on CPU/memory)
  • Add health checks (liveness/readiness probes for BIND9)
  • Add configuration validation (admission webhook for ConfigMaps)
  • Add chaos engineering tests (kill pods, exhaust memory, test recovery)
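For the health-check item above, a sketch of probe definitions for the BIND9 pod template (binary path, probe zone, and thresholds are assumptions about the image; merge into the bind9-primary/secondary StatefulSets):

```yaml
# Illustrative probes for the bind9 container (pod template fragment).
containers:
- name: bind9
  readinessProbe:
    exec:
      command: ["dig", "@127.0.0.1", "+time=2", "+tries=1", "example.com", "SOA"]
    periodSeconds: 10
    failureThreshold: 3
  livenessProbe:
    tcpSocket:
      port: 53
    initialDelaySeconds: 15
    periodSeconds: 20
```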

Step 4.3: Update SLO/SLA

  • Document actual downtime
  • Calculate availability percentage
  • Update SLA reports for customers

Success Criteria

  • ✅ DNS service restored within 30 minutes
  • ✅ All zones serving correctly
  • ✅ No data loss (zone serial numbers match)
  • ✅ Root cause identified and documented
  • ✅ Preventive measures implemented

P4: RNDC Key Compromise

Severity: 🔴 CRITICAL | Response Time: Immediate (< 15 minutes) | Impact: Attacker can control BIND9 (reload zones, freeze service, etc.)

Trigger

  • RNDC key found in logs, Git commit, or public repository
  • Unauthorized RNDC commands detected (audit logs)
  • Security scan detects secret in code or environment variables

Response Procedure

Phase 1: Detection & Analysis (T+0 to T+15 min)

Step 1.1: Confirm Compromise

# Search for leaked key in logs
grep -r "rndc-key" /var/log/ /tmp/

# Search Git history for accidentally committed keys
git log -S "rndc-key" --all

# Check GitHub secret scanning alerts
# GitHub β†’ Security β†’ Secret scanning alerts

Step 1.2: Assess Impact

# Check BIND9 logs for unauthorized RNDC commands
kubectl logs -n dns-system <bind9-pod> --tail=1000 | grep "rndc command"

# Check for malicious activity:
# - rndc freeze (stop zone updates)
# - rndc reload (load malicious zone)
# - rndc querylog on (enable debug logging for reconnaissance)

Phase 2: Containment (T+15 min to T+1 hour)

Step 2.1: Rotate RNDC Key (Emergency)

# Generate new RNDC key
tsig-keygen -a hmac-sha256 rndc-key-emergency > /tmp/rndc-key-new.conf

# Extract key from generated file
cat /tmp/rndc-key-new.conf

# Create new Kubernetes secret
kubectl create secret generic rndc-key-rotated \
  --from-literal=key="<new-key-here>" \
  -n dns-system

# Update controller deployment to use new secret
kubectl set env deploy/bindy -n dns-system RNDC_KEY_SECRET=rndc-key-rotated

# Update BIND9 StatefulSets to mount the rotated secret
# (kubectl has no "set volume" subcommand; patch the volume's secretName instead --
#  the volume index below is illustrative, check the StatefulSet spec)
kubectl patch statefulset bind9-primary -n dns-system --type=json -p \
  '[{"op":"replace","path":"/spec/template/spec/volumes/0/secret/secretName","value":"rndc-key-rotated"}]'

# Restart all BIND9 pods
kubectl rollout restart statefulset/bind9-primary -n dns-system
kubectl rollout restart statefulset/bind9-secondary -n dns-system

# Delete compromised secret
kubectl delete secret rndc-key -n dns-system

Step 2.2: Block Network Access (if attacker active)

# Apply network policy to block RNDC port (953) from external access
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: bind9-rndc-deny-external
  namespace: dns-system
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: bind9
  policyTypes:
  - Ingress
  ingress:
  # Allow DNS queries (port 53)
  - from:
    - namespaceSelector: {}
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
  # Allow RNDC only from controller
  - from:
    - podSelector:
        matchLabels:
          app.kubernetes.io/name: bindy
    ports:
    - protocol: TCP
      port: 953
EOF

Phase 3: Eradication (T+1 hour to T+4 hours)

Step 3.1: Remove Leaked Secrets

If secret in Git:

# Remove from Git history (use BFG Repo-Cleaner)
git clone --mirror git@github.com:firestoned/bindy.git
bfg --replace-text passwords.txt bindy.git
cd bindy.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push --force

# Notify all team members to re-clone repository

If secret in logs:

# Rotate logs immediately
kubectl delete pod -n dns-system <controller-pod>  # Forces log rotation

# Purge old logs from log aggregation system
# (Depends on logging backend: Elasticsearch, CloudWatch, etc.)

Step 3.2: Audit All Secret Access

# Review Kubernetes audit logs
# Find all ServiceAccounts that read rndc-key secret in last 30 days
# Check if any unauthorized access occurred
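A minimal sketch of that audit query, assuming the exported audit log is JSON with one event per line (the sample events are illustrative; point AUDIT_LOG at the real exported log):

```shell
# Sketch: grep exported API-server audit events for reads of the rndc-key Secret.
AUDIT_LOG=$(mktemp)
cat > "$AUDIT_LOG" <<'EOF'
{"verb":"get","user":{"username":"system:serviceaccount:dns-system:bindy"},"objectRef":{"resource":"secrets","namespace":"dns-system","name":"rndc-key"}}
{"verb":"list","user":{"username":"alice"},"objectRef":{"resource":"pods","namespace":"team-web"}}
EOF

# Who touched the rndc-key secret, and with which verb
grep '"name":"rndc-key"' "$AUDIT_LOG" | grep -oE '"verb":"[^"]*"|"username":"[^"]*"'
```

Anything outside the controller's ServiceAccount in that output warrants investigation.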

Phase 4: Recovery (T+4 hours to T+24 hours)

Step 4.1: Verify Key Rotation

# Test RNDC with new key
kubectl exec -n dns-system <controller-pod> -- \
  rndc -s <bind9-ip> -k /etc/bindy/rndc/rndc.key status

# Expected: Command succeeds with new key

# Test DNS service
dig @<bind9-ip> example.com

# Expected: DNS queries work normally

Step 4.2: Update Documentation

# Update secret rotation procedure in SECURITY.md
# Document rotation frequency (e.g., quarterly, or after incident)

Phase 5: Post-Incident (T+24 hours to T+1 week)

Step 5.1: Implement Secret Detection

# Add pre-commit hook to detect secrets
# .git/hooks/pre-commit:
#!/bin/bash
git diff --cached --name-only | xargs -r grep -lE "(rndc-key|BEGIN RSA PRIVATE KEY)" && {
  echo "ERROR: Secret detected in commit. Aborting."
  exit 1
}
exit 0  # grep found nothing: allow the commit (without this, a clean commit would be blocked)

# Enable GitHub secret scanning (if not already enabled)
# GitHub β†’ Settings β†’ Code security and analysis β†’ Secret scanning: Enable

Step 5.2: Automate Key Rotation

# Implement automated quarterly key rotation
# Add CronJob to generate and rotate keys every 90 days
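A skeleton of such a CronJob; the `rndc-rotator` ServiceAccount and the `bindy-tools` image are hypothetical, and real rotation must also restart BIND9 and the controller:

```yaml
# Sketch only: scheduled RNDC key rotation.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: rndc-key-rotation
  namespace: dns-system
spec:
  schedule: "0 3 1 */3 *"   # 03:00 on the 1st, every 3 months
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: rndc-rotator   # needs RBAC to update the secret and restart pods
          restartPolicy: OnFailure
          containers:
          - name: rotate
            image: ghcr.io/firestoned/bindy-tools:latest   # hypothetical image with bind utils + kubectl
            command:
            - /bin/sh
            - -c
            - |
              tsig-keygen -a hmac-sha256 rndc-key > /tmp/rndc.key
              kubectl create secret generic rndc-key \
                --from-file=rndc.key=/tmp/rndc.key \
                -n dns-system --dry-run=client -o yaml | kubectl apply -f -
              kubectl rollout restart statefulset/bind9-primary statefulset/bind9-secondary -n dns-system
```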

Step 5.3: Improve Secret Management

  • Consider external secret manager (HashiCorp Vault, AWS Secrets Manager)
  • Implement secret access audit trail (H-3)
  • Add alerts on unexpected secret reads

Success Criteria

  • ✅ RNDC key rotated within 1 hour
  • ✅ Leaked secret removed from all locations
  • ✅ No unauthorized RNDC commands executed
  • ✅ DNS service fully functional with new key
  • ✅ Secret detection mechanisms implemented
  • ✅ Audit trail reviewed and documented

P5: Unauthorized DNS Changes

Severity: 🟠 HIGH | Response Time: < 1 hour | Impact: DNS records modified without approval, potential traffic redirection

Trigger

  • Unexpected changes to DNSZone custom resources
  • DNS records pointing to unknown IP addresses
  • GitOps detects drift (actual state ≠ desired state)
  • User reports: "DNS not resolving correctly"

Response Procedure

Phase 1: Detection & Analysis (T+0 to T+30 min)

Step 1.1: Identify Unauthorized Changes

# Get current DNSZone state
kubectl get dnszones --all-namespaces -o yaml > /tmp/current-dnszones.yaml

# Compare with GitOps source of truth (concatenate the zone manifests;
# a plain file-vs-directory diff fails)
diff /tmp/current-dnszones.yaml <(cat /path/to/gitops/dnszones/*.yaml)

# Check Kubernetes audit logs for who made changes
# Look for: kubectl apply, kubectl edit, kubectl patch on DNSZone resources

Step 1.2: Assess Impact

# Which zones were modified?
# What records changed? (A, CNAME, MX, TXT)
# Where is traffic being redirected?

# Test DNS resolution
dig @<bind9-ip> suspicious-domain.com

# Check if malicious IP is reachable
nslookup suspicious-domain.com
curl -I http://<suspicious-ip>/

Phase 2: Containment (T+30 min to T+1 hour)

Step 2.1: Revert Unauthorized Changes

# Revert to known good state (GitOps)
kubectl apply -f /path/to/gitops/dnszones/team-web/example-com.yaml

# Force controller reconciliation
kubectl annotate dnszone -n team-web example-com \
  reconcile-at="$(date +%s)" --overwrite

# Verify zone restored
kubectl get dnszone -n team-web example-com -o yaml | grep "status"

Step 2.2: Revoke Access (if compromised user)

# Identify user who made unauthorized change (from audit logs)
# Example: user=alice, namespace=team-web

# Remove user's RBAC permissions
kubectl delete rolebinding dnszone-editor-alice -n team-web

# Force user to re-authenticate
# (Depends on authentication provider: OIDC, LDAP, etc.)

Phase 3: Eradication (T+1 hour to T+4 hours)

Step 3.1: Root Cause Analysis

  • Compromised user credentials? Rotate passwords, check for MFA bypass
  • RBAC misconfiguration? User had excessive permissions
  • Controller bug? Controller reconciled incorrect state
  • Manual kubectl change? Bypassed GitOps workflow

Step 3.2: Fix Root Cause

# Example: RBAC was too permissive
# Fix RoleBinding to limit scope
kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dnszone-editor-alice
  namespace: team-web
subjects:
- kind: User
  name: alice
roleRef:
  kind: Role
  name: dnszone-editor  # Role scoped to DNSZone get/list/create/update only, no delete
  apiGroup: rbac.authorization.k8s.io
EOF

Phase 4: Recovery (T+4 hours to T+24 hours)

Step 4.1: Verify DNS Integrity

# Test all zones
for zone in $(kubectl get dnszones --all-namespaces -o jsonpath='{.items[*].spec.zoneName}'); do
  echo "Testing $zone"
  dig @<bind9-ip> $zone SOA
done

# Expected: All zones resolve correctly with expected serial numbers

Step 4.2: Restore User Access (if revoked)

# After confirming user is not compromised, restore access
kubectl apply -f /path/to/gitops/rbac/team-web/alice-rolebinding.yaml

Phase 5: Post-Incident (T+24 hours to T+1 week)

Step 5.1: Implement Admission Webhooks

# Add ValidatingWebhook to prevent suspicious DNS changes
# Example: Block A records pointing to private IPs (RFC 1918)
# Example: Require approval for changes to critical zones (*.bank.com)
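Until a webhook exists, the private-IP rule can at least run as a CI check. A minimal sketch, assuming DNSZone manifests carry record IPs under a `value:` key (hypothetical schema; the sample file is illustrative, and in CI the grep would run across the GitOps dnszones/ directory):

```shell
# CI sketch: flag A records pointing at RFC 1918 private ranges before merge.
ZONE_FILE=$(mktemp)
cat > "$ZONE_FILE" <<'EOF'
records:
- name: www
  type: A
  value: 203.0.113.10
- name: internal
  type: A
  value: 10.0.0.5
EOF

# RFC 1918: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
grep -En 'value: *(10\.|172\.(1[6-9]|2[0-9]|3[01])\.|192\.168\.)' "$ZONE_FILE"
```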

Step 5.2: Add Drift Detection

# Implement automated GitOps drift detection
# Alert if cluster state ≠ Git state for > 5 minutes
# Tool: FluxCD notification controller + Slack webhook
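With FluxCD, the notification wiring could look like the following sketch (API versions and the `slack-webhook-url` secret name are assumptions; check your Flux version):

```yaml
# Sketch: route reconciliation failures for the bindy-system Kustomization to Slack.
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: slack
  namespace: flux-system
spec:
  type: slack
  channel: dns-alerts
  secretRef:
    name: slack-webhook-url   # secret holding the webhook address
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: bindy-drift
  namespace: flux-system
spec:
  providerRef:
    name: slack
  eventSeverity: error
  eventSources:
  - kind: Kustomization
    name: bindy-system
```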

Step 5.3: Enforce GitOps Workflow

# Remove direct kubectl access for users
# Require all changes via Pull Requests in GitOps repo
# Implement branch protection: 2+ reviewers required

Success Criteria

  • ✅ Unauthorized changes reverted within 1 hour
  • ✅ Root cause identified (user, RBAC, controller bug)
  • ✅ Access revoked/fixed to prevent recurrence
  • ✅ DNS integrity verified (all zones correct)
  • ✅ Drift detection and admission webhooks implemented

P6: DDoS Attack

Severity: 🟠 HIGH | Response Time: < 1 hour | Impact: DNS service degraded or unavailable due to query flood

Trigger

  • High query rate (> 10,000 QPS per pod)
  • BIND9 pods high CPU/memory utilization
  • Monitoring alert: "DNS response time elevated"
  • Users report: "DNS slow or timing out"

Response Procedure

Phase 1: Detection & Analysis (T+0 to T+15 min)

Step 1.1: Confirm DDoS Attack

# Check BIND9 query rate
kubectl exec -n dns-system <bind9-pod> -- rndc status | grep "queries resulted"

# Check pod resource utilization
kubectl top pods -n dns-system -l app.kubernetes.io/name=bind9

# Analyze query patterns
kubectl exec -n dns-system <bind9-pod> -- rndc dumpdb -zones
kubectl exec -n dns-system <bind9-pod> -- cat /var/cache/bind/named_dump.db | head -100

Step 1.2: Identify Attack Type

  • Volumetric attack: Millions of queries from many IPs (botnet)
  • Amplification attack: Abusing AXFR or ANY queries
  • NXDOMAIN attack: Flood of queries for non-existent domains
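A captured query log helps distinguish these patterns. A sketch assuming the default BIND9 query-log line format, `client ... IP#port (name): query: NAME IN TYPE ...` (the sample lines are illustrative; in a real incident, feed `kubectl logs` output instead):

```shell
# Sketch: classify the flood from a captured BIND9 query log.
QUERY_LOG=$(mktemp)
cat > "$QUERY_LOG" <<'EOF'
client @0x7f 192.0.2.10#53124 (a1.evil.test): query: a1.evil.test IN A +E(0)
client @0x7f 192.0.2.11#53124 (a2.evil.test): query: a2.evil.test IN A +E(0)
client @0x7f 198.51.100.7#4411 (example.com): query: example.com IN ANY +E(0)
EOF

# Top source IPs (volumetric attack indicator)
awk '{split($3, a, "#"); print a[1]}' "$QUERY_LOG" | sort | uniq -c | sort -rn | head

# Count of ANY queries (amplification indicator)
grep -c ' IN ANY ' "$QUERY_LOG"
```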

Phase 2: Containment (T+15 min to T+1 hour)

Step 2.1: Enable Rate Limiting (BIND9)

# Update BIND9 configuration
kubectl edit cm -n dns-system bind9-config

# Add rate-limit directive:
# named.conf:
rate-limit {
    responses-per-second 10;
    nxdomains-per-second 5;
    errors-per-second 5;
    window 10;
};

# Restart BIND9 to apply config
kubectl rollout restart statefulset/bind9-primary -n dns-system

Step 2.2: Scale Up BIND9 Pods

# Horizontal scaling
kubectl scale statefulset bind9-secondary -n dns-system --replicas=5

# Vertical scaling (if needed)
kubectl patch statefulset bind9-primary -n dns-system -p '
spec:
  template:
    spec:
      containers:
      - name: bind9
        resources:
          requests:
            cpu: "1000m"
            memory: "1Gi"
          limits:
            cpu: "2000m"
            memory: "2Gi"
'

Step 2.3: Block Malicious IPs (if identifiable)

# If attack comes from small number of IPs, block at firewall/LoadBalancer
# Example: AWS Network ACL, GCP Cloud Armor

# Add NetworkPolicy to block specific CIDRs
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-attacker-ips
  namespace: dns-system
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: bind9
  policyTypes:
  - Ingress
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.0.2.0/24  # Attacker CIDR
        - 198.51.100.0/24  # Attacker CIDR
EOF

Phase 3: Eradication (T+1 hour to T+4 hours)

Step 3.1: Engage DDoS Protection Service

# If volumetric attack (> 10 Gbps), edge DDoS protection required
# Options:
# - Cloudflare DNS (proxy DNS through Cloudflare)
# - AWS Shield Advanced
# - Google Cloud Armor

# Migrate DNS to Cloudflare (example):
# 1. Add zone to Cloudflare
# 2. Update NS records at domain registrar
# 3. Configure Cloudflare β†’ Origin (BIND9 backend)

Step 3.2: Implement Response Rate Limiting (RRL)

# BIND9 RRL configuration (more aggressive)
rate-limit {
    responses-per-second 5;
    nxdomains-per-second 2;
    referrals-per-second 5;
    nodata-per-second 5;
    errors-per-second 2;
    window 5;
    log-only no;  # Actually drop packets (not just log)
    slip 2;  # Send truncated response every 2nd rate-limited query
    max-table-size 20000;
};
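
The `slip 2` setting means every second response that would otherwise be dropped is instead sent truncated (TC=1), so legitimate clients behind a spoofed address retry over TCP while reflection traffic is still roughly halved. The split can be sketched with illustrative numbers:

```shell
# Illustrative arithmetic: with slip 2, half of rate-limited responses go out
# truncated (prompting a TCP retry) and the rest are silently dropped.
awk -v limited=1000 -v slip=2 'BEGIN {
    truncated = int(limited / slip)
    dropped   = limited - truncated
    printf "truncated: %d, dropped: %d\n", truncated, dropped
}'
# → truncated: 500, dropped: 500
```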

Phase 4: Recovery (T+4 hours to T+24 hours)

Step 4.1: Monitor Service Health

# Check query rate stabilized
kubectl exec -n dns-system <bind9-pod> -- rndc status

# Check pod resource utilization
kubectl top pods -n dns-system

# Test DNS resolution
dig @<bind9-ip> example.com

# Expected: Normal response times (< 50ms)
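
The latency check can be scripted so responders do not have to eyeball dig output. A minimal sketch, assuming dig's standard `;; Query time: N msec` statistics line (shown here as an inline sample rather than a live query):

```shell
# Hypothetical sketch: extract dig's query time and compare to the 50 ms target.
# The sample line mirrors dig's ";; Query time: N msec" statistics output.
sample=';; Query time: 23 msec'
ms=$(printf '%s\n' "$sample" | awk '/Query time:/ { print $4 }')
if [ "$ms" -lt 50 ]; then echo "OK: ${ms}ms"; else echo "SLOW: ${ms}ms"; fi
```

In practice, replace `sample` with `dig @<bind9-ip> example.com | grep 'Query time:'`.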

Step 4.2: Scale Down (if attack subsided)

# Return to normal replica count
kubectl scale statefulset bind9-secondary -n dns-system --replicas=2

Phase 5: Post-Incident (T+24 hours to T+1 week)

Step 5.1: Implement Permanent DDoS Protection

  • Edge DDoS protection: Cloudflare, AWS Shield, Google Cloud Armor
  • Anycast DNS: Distribute load across multiple geographic locations
  • Autoscaling: HPA based on query rate, CPU, memory
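
The autoscaling bullet can be sketched as an HPA on the secondaries. This is a sketch only: it scales on CPU (a query-rate metric would need a metrics adapter such as prometheus-adapter), and the target values are assumptions.

```yaml
# Sketch: HPA for the BIND9 secondaries; min/max replicas and the 70% CPU
# target are illustrative assumptions, not tuned values.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: bind9-secondary
  namespace: dns-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: bind9-secondary
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```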

Step 5.2: Improve Monitoring

# Add Prometheus metrics for query rate
# Add alerts:
# - Query rate > 5000 QPS per pod
# - NXDOMAIN rate > 50%
# - Response time > 100ms (p95)
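
The alert thresholds above can be expressed as Prometheus rules. Metric names here assume the community bind_exporter and may differ in your setup; treat this as a hedged starting point, not a drop-in rule file.

```yaml
# Sketch of the alerts above as Prometheus rules; bind_incoming_queries_total
# and bind_responses_total assume bind_exporter metric names.
groups:
- name: dns-ddos
  rules:
  - alert: HighQueryRate
    expr: sum by (pod) (rate(bind_incoming_queries_total[5m])) > 5000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Query rate > 5000 QPS on {{ $labels.pod }}"
  - alert: HighNXDOMAINRatio
    expr: >
      sum(rate(bind_responses_total{result="NXDOMAIN"}[5m]))
      / sum(rate(bind_responses_total[5m])) > 0.5
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "NXDOMAIN responses exceed 50% of traffic"
```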

Step 5.3: Document Attack Details

  • Attack duration: ____ hours
  • Peak query rate: ____ QPS
  • Attack type: Volumetric / Amplification / NXDOMAIN
  • Attack sources: IP ranges, ASNs, geolocation
  • Mitigation effectiveness: RRL / Scaling / Edge protection

Success Criteria

  • βœ… DNS service restored within 1 hour
  • βœ… Query rate normalized (< 1000 QPS per pod)
  • βœ… Response times < 50ms (p95)
  • βœ… Permanent DDoS protection implemented (Cloudflare, etc.)
  • βœ… Autoscaling and monitoring in place

P7: Supply Chain Compromise

Severity: πŸ”΄ CRITICAL
Response Time: Immediate (< 15 minutes)
Impact: Malicious code in controller, backdoor access, data exfiltration

Trigger

  • Malicious commit detected in Git history
  • Dependency vulnerability with active exploit (supply chain attack)
  • Image signature verification fails
  • SBOM shows unexpected dependency or binary

Response Procedure

Phase 1: Detection & Analysis (T+0 to T+30 min)

Step 1.1: Identify Compromised Component

# Check Git commit signatures
git log --show-signature | grep "BAD signature"

# Check image provenance
docker buildx imagetools inspect ghcr.io/firestoned/bindy:latest --format '{{ json .Provenance }}'

# Expected: Valid signature from GitHub Actions

# Check SBOM for unexpected dependencies
# Download SBOM from GitHub release artifacts
curl -L https://github.com/firestoned/bindy/releases/download/v1.0.0/sbom.json | jq '.components[].name'

# Expected: Only known dependencies from Cargo.toml
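
Comparing the SBOM against Cargo's own dependency view can be automated with `comm`. A hypothetical sketch: the inline `printf` lines stand in for `jq -r '.components[].name' sbom.json` and `cargo metadata --format-version 1 | jq -r '.packages[].name'`, and `comm` requires sorted input.

```shell
# Hypothetical sketch: list SBOM components that Cargo does not know about.
# Inline sample data stands in for the jq/cargo-metadata pipelines described above.
printf '%s\n' serde tokio evil-miner | sort -u > /tmp/sbom-deps.txt
printf '%s\n' serde tokio | sort -u > /tmp/cargo-deps.txt

# Lines only in the SBOM: unexpected components worth investigating
comm -23 /tmp/sbom-deps.txt /tmp/cargo-deps.txt
# → evil-miner
```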

Step 1.2: Assess Impact

# Check if compromised version deployed to production
kubectl get deploy -n dns-system bindy -o jsonpath='{.spec.template.spec.containers[0].image}'

# If compromised image is running β†’ CRITICAL (proceed to containment)
# If compromised image NOT deployed β†’ HIGH (patch and prevent deployment)

Phase 2: Containment (T+30 min to T+2 hours)

Step 2.1: Isolate Compromised Controller

# Scale down compromised controller
kubectl scale deploy -n dns-system bindy --replicas=0

# Apply network policy to block egress (prevent exfiltration)
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: bindy-quarantine
  namespace: dns-system
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: bindy
  policyTypes:
  - Egress
  egress: []
EOF

Step 2.2: Preserve Evidence

# Save pod logs (create the evidence directory first)
mkdir -p /tmp/forensics
kubectl logs -n dns-system -l app.kubernetes.io/name=bindy --all-containers > /tmp/forensics/controller-logs.txt

# Save compromised image for analysis
docker pull ghcr.io/firestoned/bindy:compromised-tag
docker save ghcr.io/firestoned/bindy:compromised-tag > /tmp/forensics/compromised-image.tar

# Scan for malware
trivy image ghcr.io/firestoned/bindy:compromised-tag --scanners vuln,secret,misconfig

Step 2.3: Rotate All Credentials

# Rotate RNDC keys
# See P4: RNDC Key Compromise

# Rotate ServiceAccount tokens (if the controller potentially stole them).
# On clusters still using legacy token Secrets, delete them; on Kubernetes 1.24+
# the projected tokens are short-lived and a restart issues fresh ones.
kubectl delete secret -n dns-system $(kubectl get secrets -n dns-system -o name | grep bindy-token)
kubectl rollout restart deploy/bindy -n dns-system  # New pods get fresh tokens

Phase 3: Eradication (T+2 hours to T+8 hours)

Step 3.1: Root Cause Analysis

# Identify how malicious code was introduced:
# - Compromised developer account?
# - Malicious dependency in Cargo.toml?
# - Compromised CI/CD pipeline?
# - Insider threat?

# Check Git history for unauthorized commits
git log --all --show-signature

# Check CI/CD logs for anomalies
# GitHub Actions β†’ Workflow runs β†’ Check for unusual activity

# Check dependency sources: git and path dependencies print their source in
# parentheses in cargo tree output; crates.io packages do not
cargo tree | grep -E '\((https?|/)'
# Expected: no output (all dependencies come from crates.io)

Step 3.2: Clean Git History (if malicious commit)

# Identify malicious commit
git log --all --oneline | grep "suspicious"

# If the malicious commit is only on an unmerged branch, drop it from history
git rebase --onto <malicious-commit-sha>^ <malicious-commit-sha> feature-branch
git push --force origin feature-branch

# If the malicious commit is already merged to main, revert it instead
# (rewriting history on a shared branch is disruptive)
git revert <malicious-commit-sha>

# In either case, contact GitHub Security for help with incident response
# and forensics

Step 3.3: Rebuild from Clean Source

# Checkout known good commit (before compromise)
git checkout <last-known-good-commit>

# Rebuild binaries
cargo build --release

# Rebuild container image with a fresh tag; capture the tag once, since
# $(date +%s) produces a different value each time it is evaluated
CLEAN_TAG="clean-$(date +%s)"
docker build -t "ghcr.io/firestoned/bindy:${CLEAN_TAG}" .

# Scan for vulnerabilities
cargo audit
trivy image "ghcr.io/firestoned/bindy:${CLEAN_TAG}"

# Expected: All clean

# Push to registry
docker push "ghcr.io/firestoned/bindy:${CLEAN_TAG}"

Phase 4: Recovery (T+8 hours to T+24 hours)

Step 4.1: Deploy Clean Controller

# Update deployment manifest with the exact tag pushed in Step 3.3
# (re-evaluating $(date +%s) here would reference a different, nonexistent tag)
kubectl set image deploy/bindy -n dns-system \
  bindy=ghcr.io/firestoned/bindy:<clean-tag>

# Remove quarantine network policy
kubectl delete networkpolicy bindy-quarantine -n dns-system

# Verify health
kubectl get pods -n dns-system -l app.kubernetes.io/name=bindy
kubectl logs -n dns-system -l app.kubernetes.io/name=bindy --tail=100

Step 4.2: Verify Service Integrity

# Test DNS resolution
dig @<bind9-ip> example.com

# Verify all zones match the GitOps source of truth (diff against a rendered
# file, not a directory; field ordering may need normalizing, e.g. with yq)
kubectl get dnszones --all-namespaces -o yaml > /tmp/live-dnszones.yaml
diff /tmp/live-dnszones.yaml <(cat /path/to/gitops/dnszones/*.yaml)

# Expected: No drift

Phase 5: Post-Incident (T+24 hours to T+1 week)

Step 5.1: Implement Supply Chain Security

# Enable Dependabot security updates
# .github/dependabot.yml:
version: 2
updates:
  - package-ecosystem: "cargo"
    directory: "/"
    schedule:
      interval: "daily"
    open-pull-requests-limit: 10

# Pin dependencies by hash (Cargo.lock already does this)
# Verify Cargo.lock is committed to Git

# Implement image signing verification
# Add admission controller (Kyverno, OPA Gatekeeper) to verify image signatures before deployment
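
A Kyverno policy enforcing signature verification might look like the sketch below. This assumes images are cosign-signed in CI; the public key and image pattern are placeholders to adapt.

```yaml
# Sketch: Kyverno ClusterPolicy requiring a valid cosign signature on bindy
# images before pods are admitted. Key material is a placeholder.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-bindy-signature
spec:
  validationFailureAction: Enforce
  webhookTimeoutSeconds: 30
  rules:
  - name: check-signature
    match:
      any:
      - resources:
          kinds: [Pod]
          namespaces: [dns-system]
    verifyImages:
    - imageReferences: ["ghcr.io/firestoned/bindy:*"]
      attestors:
      - entries:
        - keys:
            publicKeys: |-
              -----BEGIN PUBLIC KEY-----
              <cosign-public-key>
              -----END PUBLIC KEY-----
```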

Step 5.2: Implement Code Review Enhancements

# Require 2+ reviewers for all PRs (already implemented)
# Add CODEOWNERS for sensitive files:
# .github/CODEOWNERS:
/Cargo.toml @security-team
/Cargo.lock @security-team
/Dockerfile @security-team
/.github/workflows/ @security-team

Step 5.3: Notify Stakeholders

  • Users: Email notification about supply chain incident
  • Regulators: Report to SOX/PCI-DSS auditors (security incident)
  • GitHub Security: Report compromised dependency or account

Step 5.4: Update Documentation

  • Document supply chain incident in threat model
  • Update supply chain security controls in SECURITY.md
  • Add supply chain attack scenarios to threat model

Success Criteria

  • βœ… Compromised component identified within 30 minutes
  • βœ… Malicious code removed from Git history
  • βœ… Clean controller deployed within 24 hours
  • βœ… All credentials rotated
  • βœ… Supply chain security improvements implemented
  • βœ… Stakeholders notified and incident documented

Post-Incident Activities

Post-Incident Review (PIR) Template

Incident ID: INC-YYYY-MM-DD-XXXX
Severity: πŸ”΄ / 🟠 / 🟑 / πŸ”΅
Incident Commander: [Name]
Date: [YYYY-MM-DD]
Duration: [Detection to resolution]

Summary

[1-2 paragraph summary of incident]

Timeline

| Time | Event | Action Taken |
|------|-------|--------------|
| T+0 | [Detection event] | [Action] |
| T+15min | [Analysis] | [Action] |
| T+1hr | [Containment] | [Action] |
| T+4hr | [Eradication] | [Action] |
| T+24hr | [Recovery] | [Action] |

Root Cause

[Detailed root cause analysis]

What Went Well βœ…

  • [Detection was fast]
  • [Playbook was clear]
  • [Team communication was effective]

What Could Improve ❌

  • [Monitoring gaps]
  • [Playbook outdated]
  • [Slow escalation]

Action Items

| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| [Implement network policies] | Platform Team | 2025-01-15 | πŸ”„ In Progress |
| [Add monitoring alerts] | SRE Team | 2025-01-10 | βœ… Complete |
| [Update playbook] | Security Team | 2025-01-05 | βœ… Complete |

Metrics

  • MTTD (Mean Time To Detect): [X] minutes
  • MTTR (Mean Time To Remediate): [X] hours
  • SLA Met: βœ… Yes / ❌ No
  • Downtime: [X] minutes
  • Customers Impacted: [N]

References


Last Updated: 2025-12-17
Next Review: 2025-03-17 (Quarterly)
Approved By: Security Team