Skip to main content

CoreDNS Denial-of-Service in Production Cluster

A misconfigured CoreDNS forward plugin caused cascading DNS failures across all production workloads for 47 minutes.

critical
kubernetesdnssecuritynetworking

Summary

On April 13 2024, a routine CoreDNS configuration change introduced a forwarding loop that exhausted DNS resolver capacity. All production services relying on cluster DNS experienced degraded name resolution for 47 minutes, resulting in elevated 5xx error rates across customer-facing APIs.

Timeline

Time (UTC)Event
14:02Engineer applies updated CoreDNS ConfigMap to enable conditional forwarding for internal.corp zone
14:05Monitoring detects spike in CoreDNS pod CPU and memory usage
14:08First PagerDuty alert fires: CoreDNS latency > 500ms
14:12API gateway error rate crosses 10% threshold; customer impact begins
14:18On-call engineer begins investigation, suspects upstream resolver issue
14:31Root cause identified: forward plugin creates loop between CoreDNS and itself on :53
14:34ConfigMap rolled back to previous version
14:42CoreDNS pods stabilize; DNS resolution latency returns to baseline
14:49All downstream services recover; incident marked resolved

Root Cause

The updated ConfigMap added a forward block for the internal.corp zone that pointed to the cluster’s own kube-dns service IP. Because CoreDNS is the backend for kube-dns, this created a forwarding loop. Each incoming query generated another query to itself, consuming CPU and socket buffers until the pods hit OOM limits and began crash-looping.

internal.corp:53 {
    forward . 10.96.0.10   # <- this IS kube-dns / CoreDNS itself
}

The correct target should have been the corporate DNS resolver at 10.200.0.53.

Impact

  • Duration: 47 minutes (14:02 – 14:49 UTC)
  • User-facing: ~12% of API requests returned 5xx during the window
  • Internal: All cluster-internal service-to-service calls using DNS experienced intermittent failures
  • Data loss: None confirmed

Action Items

  • Add OPA/Kyverno policy to reject CoreDNS ConfigMap changes that reference kube-dns cluster IP as a forward target
  • Implement canary rollout for CoreDNS config changes using a staging namespace
  • Add synthetic DNS probe that alerts within 60s of resolution failure
  • Document CoreDNS change procedure in the how-to guide

Lessons Learned