Possible causes of sudden DNS resolution failure and emergency repair steps
DNS resolution plays a crucial role in the stable operation of a website. If resolution suddenly fails, visitors will be unable to access the server via the domain name, potentially leading to business interruption, user loss, or all API requests failing. Many website owners, when encountering DNS failures, only notice that the domain name is inaccessible. However, to truly resolve the problem, it's essential to identify the root cause of the DNS failure. These causes can stem from the domain name itself, the DNS provider, record configuration, server network, ISP policies, or even malicious hijacking and poisoning. To avoid prolonged downtime, quickly locating and urgently repairing the issue is a core capability for maintaining business continuity.
In actual failures, the most common cause of sudden DNS resolution failure is the accidental deletion or overwriting of domain records. For example, accidentally modifying host records, TTL, or resolution lines in the domain control panel, or errors in API automation scripts can lead to record anomalies. For users using automated certificates or dynamic resolution, script execution errors can easily lead to the accidental deletion of existing A records, causing direct domain name resolution failure. Checking whether the record exists and correctly points to the server is the first step in troubleshooting. The following command can be used to verify the current resolution:
dig yourdomain.com +trace
nslookup yourdomain.com
If the answer section is empty (or the query returns NXDOMAIN), or an unexpected IP address is returned, the DNS record itself is the problem and should be restored immediately from a backup or from the provider's change history.
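The record check described above can be scripted. The following is a minimal sketch, assuming the text produced by `dig yourdomain.com +short` has already been captured as a string; the function name `classify_answer` and the addresses used are illustrative placeholders, not part of any tool.

```python
import ipaddress

def classify_answer(dig_short_output: str, expected_ips: set) -> str:
    """Classify captured `dig +short` output as 'missing', 'ok', or 'mismatch'."""
    answers = set()
    for line in dig_short_output.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            answers.add(str(ipaddress.ip_address(line)))
        except ValueError:
            # Not an IP address (for example a CNAME target); ignore it.
            continue
    if not answers:
        return "missing"      # no address records came back
    if answers <= expected_ips:
        return "ok"           # every returned address is one we configured
    return "mismatch"         # at least one address we did not configure

# Placeholder addresses from the documentation ranges.
print(classify_answer("203.0.113.10\n", {"203.0.113.10"}))
```

A "missing" or "mismatch" result is the signal to restore the record before looking at other causes.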
DNS server failure is also a common cause of sudden DNS resolution failures. If you use a third-party DNS provider, a large-scale service outage or cluster anomaly may cause resolution delays, failures, or incorrect IP addresses. In such cases, access failures typically affect users everywhere, not just in specific regions. The DNS provider's status page can quickly confirm whether a systemic failure has occurred. If a provider issue is confirmed, the fastest remedy is to switch to an alternative DNS provider: recreate the records on the new platform, then change the domain's NS records at the registrar to point to it. An NS change can take hours to take effect everywhere, because resolvers cache the old delegation, but it is a necessary measure in the event of a severe failure.
Domain expiration is another significant cause. Many users do not track expiration dates closely, especially when they are busy with day-to-day operations or manage a large number of domains. Once a domain expires, resolution typically stops, either immediately or after a short grace period, depending on the registrar. Worse still, some registrars point expired domains to advertising pages or stop all resolution services. The domain status can be checked with whois:
whois yourdomain.com
When the status shows an expired or redemption state (for example `redemptionPeriod`), you must renew the domain immediately to restore DNS resolution. Resolution usually resumes within tens of minutes after renewal, but some registrars take longer.
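The status check can be automated by scanning saved whois output for worrying status codes. This is a sketch only: whois formats differ between registries, so the `Domain Status:` line layout assumed here may not match every TLD, and `risky_statuses` is an illustrative name. The codes listed are standard EPP status codes.

```python
# EPP status codes that usually mean the domain is no longer resolving
# or is about to be deleted.
RISKY = {"redemptionperiod", "pendingdelete", "clienthold", "serverhold"}

def risky_statuses(whois_text: str) -> list:
    """Return the risky status codes found in whois output, in order of appearance."""
    found = []
    for line in whois_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "domain status":
            parts = value.split()
            # The first token is the status code; a URL often follows it.
            if parts and parts[0].lower() in RISKY:
                found.append(parts[0])
    return found

sample = """\
Domain Name: YOURDOMAIN.COM
Domain Status: clientTransferProhibited https://icann.org/epp#clientTransferProhibited
Domain Status: redemptionPeriod https://icann.org/epp#redemptionPeriod
"""
print(risky_statuses(sample))
```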
DNS cache pollution is another major cause of sudden DNS resolution failures, especially for websites targeting users in mainland China. When a record is incorrectly cached by an ISP resolver or hijacked in transit, users are directed to the wrong IP address, so the website becomes inaccessible or even redirects to an unfamiliar page. In this case, query results differ from one region or resolver to another. Troubleshooting can be done with multi-region DNS lookup tools, or by querying several public resolvers directly and comparing the answers:
dig @8.8.8.8 yourdomain.com +short
dig @1.1.1.1 yourdomain.com +short
If some resolvers or regions return IP addresses that you never configured, this points to DNS pollution or hijacking. (Different answers are expected if you deliberately use a CDN or region-specific resolution lines.) In that case, enable DNSSEC, switch to a high-security DNS service, and avoid the default public DNS servers of domestic ISPs whenever possible. For global DNS platforms, the probability of pollution is usually lower.
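Once answers have been collected from several resolvers or regions, the comparison can be sketched as below. The function name `unexpected_answers`, the resolver labels, and the addresses are hypothetical; the set of expected addresses is assumed to be everything you have legitimately configured, including CDN or per-region records.

```python
def unexpected_answers(answers: dict, expected: set) -> dict:
    """Map each resolver to the addresses it returned that were never configured."""
    report = {}
    for resolver, ips in answers.items():
        extra = set(ips) - expected
        if extra:
            report[resolver] = sorted(extra)
    return report

collected = {
    "8.8.8.8": {"203.0.113.10"},
    "isp-resolver": {"192.0.2.66"},   # hypothetical polluted answer
}
print(unexpected_answers(collected, {"203.0.113.10"}))
```

An empty result means every resolver returned only known addresses, so pollution is unlikely to be the cause.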
Server network issues can also make DNS appear unresponsive. When the server is down, the network is blocked, or a firewall is blocking ports 80 and 443, DNS resolution may appear normal, but users still cannot access the site, easily leading to the misconception that resolution has failed. Therefore, when troubleshooting DNS, it is essential to simultaneously check the server's network responsiveness.
ping your_server_ip
telnet your_server_ip 80
If the server is unresponsive, prioritize fixing server network or bandwidth issues rather than modifying DNS.
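Where telnet is not installed, the same port test can be done with a few lines of Python. This is a minimal sketch using only the standard library; `port_open` is an illustrative name, and `your_server_ip` in the usage comment is the same placeholder as in the commands above.

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

# Example usage (replace the placeholder with the real address):
# for port in (80, 443):
#     print(port, port_open("your_server_ip", port))
```

If DNS returns the correct address but both ports fail this check, the fault lies with the server or its firewall rather than with resolution.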
Setting the TTL (Time To Live) too long can also cause DNS changes to appear "ineffective." For example, if a server is migrated to a new IP address while the record TTL is still 600 seconds or longer, resolvers that cached the old answer can keep returning the old IP address for up to ten minutes (or longer, with a larger TTL) after the change. This looks like a DNS error, but the cache simply has not expired yet. To prepare for emergency switches, reduce the TTL to 30-60 seconds before migrating services, and do so at least one full old-TTL period in advance so that caches holding the long TTL have expired by the time of the switch.
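The timing involved is simple arithmetic: a resolver that cached the record just before a change may keep serving it for one full TTL. The sketch below works this out; `cache_clear_deadline` is an illustrative name and the timestamps are made up.

```python
from datetime import datetime, timedelta, timezone

def cache_clear_deadline(changed_at: datetime, ttl_seconds: int) -> datetime:
    """Latest moment a resolver that honors the TTL can still serve the old answer."""
    return changed_at + timedelta(seconds=ttl_seconds)

changed = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(cache_clear_deadline(changed, 600).isoformat())  # 2024-01-01T12:10:00+00:00
```

The same function gives the earliest safe migration time: pass the moment the TTL was lowered together with the old TTL.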
Incorrectly configured resolution lines can also cause resolution failures in certain regions. For example, if records are configured for the "domestic," "overseas," and "China Mobile/China Telecom/China Unicom" lines but no record exists for the default line, queries that match none of the configured lines receive no answer, causing intermittent access failures. Check whether the resolution platform displays a message such as "No effective record for the current network." Adding an A record for the default line restores access.
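The effect of a missing default can be illustrated with a toy model of line matching. This is a simplified sketch, not how any particular DNS platform is implemented; the line names and addresses are placeholders.

```python
from typing import Optional

def pick_record(records: dict, client_line: str, default_line: str = "default") -> Optional[str]:
    """Return the record for the client's line, falling back to the default line."""
    if client_line in records:
        return records[client_line]
    # No line matched: only a default record can answer this query.
    return records.get(default_line)

without_default = {"domestic": "203.0.113.10", "overseas": "198.51.100.7"}
with_default = dict(without_default, default="203.0.113.10")

print(pick_record(without_default, "education-network"))  # None
print(pick_record(with_default, "education-network"))     # 203.0.113.10
```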
Sometimes, DNS resolution failures are not caused by systemic errors but by malicious tampering. Attackers exploit stolen API keys, DNS service vulnerabilities, or stolen domain accounts to redirect DNS resolution to malicious IPs, leaving websites inaccessible or replaced with phishing pages. To prevent such issues, first protect your domain registrar account and enable two-step verification. Second, if your DNS service supports APIs, prefer scoped keys or tokens and grant only the minimum necessary privileges. For example, Cloudflare supports API tokens that grant edit permissions only for specified domains, which limits what an attacker can do with a leaked token. If records are found to have been modified, immediately lock the domain, revoke all API keys, restore the DNS records, and check for backdoor scripts.
To ensure rapid DNS service recovery, emergency remediation steps are essential. First, check if DNS records are returning correctly; use `dig` to verify if the expected IP is being returned. If records are missing, immediately restore historical records. Ensure that A, AAAA, and CNAME records are configured correctly and that `http://` or other incorrect content is not entered. Next, check the DNS provider's status; if it is experiencing problems, quickly switch to an alternative DNS, for example, by setting the NS to:
ns1.dnspod.net
ns2.dnspod.net
Choosing a DNS service with global coverage that suits your needs contributes to more stable responses. Next, check the domain status to confirm whether it has expired or is in the redemption period. If the domain has expired, renew it immediately and wait for DNS to recover automatically. If poisoning is suspected, enable DNSSEC or switch to a less susceptible service, and shorten the TTL so that corrected records propagate faster.
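The record-content check mentioned in these steps (for example, catching an `http://` prefix pasted into a record value) can be sketched as a small validator. The function name `record_problem` and its messages are illustrative; it covers only A, AAAA, and CNAME values and does not query DNS.

```python
import ipaddress
from typing import Optional

def record_problem(rtype: str, value: str) -> Optional[str]:
    """Return a description of what is wrong with a record value, or None if it looks fine."""
    rtype = rtype.upper()
    value = value.strip()
    if rtype in ("A", "AAAA"):
        try:
            ip = ipaddress.ip_address(value)
        except ValueError:
            return f"{value!r} is not an IP address"
        wanted = 4 if rtype == "A" else 6
        if ip.version != wanted:
            return f"{rtype} record needs an IPv{wanted} address"
        return None
    if rtype == "CNAME":
        if not value or " " in value:
            return "CNAME target must be a single hostname"
        if "://" in value or "/" in value:
            return "CNAME target must be a bare hostname, not a URL"
        return None
    return None  # other record types are not checked in this sketch

print(record_problem("A", "http://203.0.113.10"))
```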
If server network failures cause access problems, log in to the server as soon as possible to check network connectivity and port status. If using Nginx, check if it is running normally:
systemctl status nginx
For issues caused by firewall blocking, ensure that necessary ports are open:
firewall-cmd --add-port=80/tcp --permanent
firewall-cmd --add-port=443/tcp --permanent
firewall-cmd --reload
If the DNS changes still do not take effect, clear the local DNS cache. In Windows, execute:
ipconfig /flushdns
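On Linux systems that use systemd-resolved, execute (older releases provide the same function as systemd-resolve --flush-caches):
resolvectl flush-caches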
Mobile users can turn airplane mode on and then off again, which reconnects the device to the network and usually clears the device's own DNS cache; it does not clear the ISP's cache, which only expires with the TTL. For CDN acceleration users, verify that the origin address configured at the CDN is correct, especially after the server IP changes. If the CDN still points to the old IP, the website will be inaccessible. After updating the CDN origin configuration, wait for the nodes to synchronize.
To prevent sudden DNS failures, improvements should be made in three areas: security, stability, and structure. First, enable DNSSEC to make forged answers detectable and use a high-security DNS provider. Second, enable the registrar's domain lock to prevent malicious tampering after account theft. Third, keep backups of DNS records so that they can be restored quickly after accidental deletion. For large businesses, it is best to build in redundancy with multiple DNS service providers, so that when one service fails, the other continues to answer queries. In addition, avoid setting the TTL too high, so that caches across the network refresh quickly in the event of a failure.
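A record backup is most useful when it can be compared against the live zone. The sketch below assumes both are available as dictionaries keyed by (name, type) with a set of values; how they are exported depends on the DNS provider and is not shown. `diff_records` and the sample data are illustrative.

```python
def diff_records(backup: dict, live: dict):
    """Return (missing, changed): records absent from live, and records whose values differ."""
    missing = {k: v for k, v in backup.items() if k not in live}
    changed = {k: (v, live[k]) for k, v in backup.items() if k in live and live[k] != v}
    return missing, changed

backup = {
    ("www", "A"): {"203.0.113.10"},
    ("api", "A"): {"203.0.113.11"},
}
live = {
    ("www", "A"): {"192.0.2.66"},   # hypothetical unexpected value
}
missing, changed = diff_records(backup, live)
print(missing)
print(changed)
```

Running such a comparison on a schedule turns an accidental deletion or tampered record into an alert rather than an outage report.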