How to fix a DNS server malfunction? Don't restart it!
DNS server anomalies are perhaps one of the most nerve-wracking problems in a system administrator's career. Their danger lies not in their difficulty to fix, but in their symptoms resembling network paralysis, making troubleshooting particularly convoluted. You might spend an hour checking the server, firewall, and switch, only to find out it's just an extra space in the DNS configuration. Today, we'll discuss how to properly fix DNS server anomalies. From symptom identification to root cause location, from emergency repairs to complete restoration, we'll aim to ensure you're prepared and capable when facing similar situations in the future.
I. First, Diagnose the Problem: Is it Really a "DNS Anomaly"? Don't Be Fooled by Illusions
Many people go wrong at the very first step: on seeing that a website is inaccessible, they immediately assume "DNS is down" and rush through a series of operations. The result is often a server crash or a local network outage on top of the original problem, and a lot of wasted time and effort.
The core symptom of a DNS anomaly is only one: domain name resolution failure.
How to verify? A few simple commands:
nslookup yourdomain.com
dig yourdomain.com
ping yourdomain.com
If nslookup or dig returns "server can't find" or "connection timed out," the problem is at the DNS level. If the command returns the IP address you expect but the website is still inaccessible, the fault is in the server or the network path, not in DNS.
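The same check can be scripted. Below is a minimal Python sketch, assuming the system resolver (reached through the standard library's `socket.getaddrinfo`) is the one you want to test. The function name `check_resolution` and its `resolver` parameter are illustrative names, not part of any tool mentioned above.

```python
import socket

def check_resolution(hostname, resolver=socket.getaddrinfo):
    """Report whether hostname resolves through the system resolver.

    `resolver` is injectable so the function can be exercised offline;
    by default it is the standard library's socket.getaddrinfo.
    """
    try:
        infos = resolver(hostname, None)
    except socket.gaierror as exc:
        # Resolution itself failed: this points at DNS, not at the web server.
        return {"resolved": False, "addresses": [], "error": str(exc)}
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    addresses = sorted({info[4][0] for info in infos})
    return {"resolved": True, "addresses": addresses, "error": None}
```

Calling `check_resolution("yourdomain.com")` returns a dictionary: if `resolved` is false, keep looking at DNS; if it is true and the addresses are the ones you expect, look at the server or the network instead.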
Another classic scenario: the website is accessible in some regions but not in others. This is often not due to your DNS server being down, but rather DNS propagation delays or inconsistent recursive DNS caches in different regions. In this case, modifying your own server won't help; you can only wait.
After confirming it's a DNS problem, the next step is to distinguish: is the authoritative DNS down, or the recursive DNS down?
If these two concepts are confused, the repair direction will be completely wrong.
Authoritative DNS: Provided by your own domain registrar or DNS service provider, responsible for answering "what IP address corresponds to my domain name."
Recursive DNS: The resolver users rely on when accessing the internet, such as the one assigned by your ISP or a public service like 114.114.114.114. It queries the authoritative DNS on the user's behalf.
If your website is inaccessible only to you but accessible to everyone else, the problem lies with the recursive DNS your local network uses; changing your DNS address will solve it. If your website is inaccessible to everyone, the problem is with your authoritative DNS, and that is where the real repair work is needed.
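The rule of thumb above can be written down as a small decision function. This is only a sketch of the reasoning in this section; the function name `classify_scope` and the wording of the returned strings are made up for illustration.

```python
def classify_scope(works_for_you, works_for_others):
    """Map two observations to the DNS layer most likely at fault."""
    if works_for_you and works_for_others:
        return "no DNS fault observed"
    if not works_for_you and works_for_others:
        # Only you are affected: your resolver or local cache is suspect.
        return "recursive DNS or local cache: change resolver, flush cache"
    if not works_for_you and not works_for_others:
        # Everyone is affected: the source of truth is suspect.
        return "authoritative DNS or domain status: check NS records and provider"
    # Works for you but not for others: some resolvers hold different answers.
    return "partial failure: some recursive resolvers hold stale or wrong answers"
```

For example, `classify_scope(False, True)` points at the recursive side, while `classify_scope(False, False)` points at the authoritative side.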
II. Five Common Reasons for Authoritative DNS Anomalies
The authoritative DNS is like the "registration office" for your domain name on the internet. When someone asks, "Where is example.com?", the authoritative DNS provides the answer. If the authoritative DNS goes down, users worldwide will be unable to find your website.
I will categorize the common reasons for authoritative DNS anomalies into five types.
1. DNS Provider Failure
This is the most "unjustly accused" type—the problem isn't with you, but with your DNS provider. For example, the large-scale DNS failure of Alibaba Cloud in 2021 and the DNS resolution anomaly of Cloudflare in 2023 both caused countless websites to crash simultaneously.
Symptoms: Monitoring shows that the resolution of all domain names fails simultaneously, or the resolution latency spikes to several seconds. Checking the provider's status page, it indeed displays "investigating incident."
Repair Methods:
If your service provider offers backup DNS addresses (e.g., two NS records, one primary and one backup), confirm that your domain has both configured.
If the outage lasts a long time, consider temporarily switching DNS providers. First recreate your records at the new provider, then change the domain's NS records in your domain registrar's backend to the new provider's name servers. NS changes typically take several hours to a day to take effect globally, so this is not instant, but it will resolve the issue.
The most drastic measure: If your domain registrar's built-in DNS resolution is still working, simply switch back to the registrar's DNS. Although it has fewer features, it's more stable.
Prevention: Don't put all your eggs in one basket. Configure at least two NS records for your domain, using DNS from different providers. Although most DNS providers claim 100% availability, having a backup is always a good thing.
2. DNS Configuration Errors
This is a major area of human error. A single bad edit, such as adding or omitting a trailing dot, mistyping an NS record, or entering a non-existent IP address, can take the entire domain offline.
Common disastrous actions:
A record pointing to an internal IP address (192.168.x.x or 127.0.0.1)
NS record entered incorrectly, pointing to a non-existent DNS server
Domain name resolution suspended (e.g., lack of real-name authentication, expired domain)
The last NS record was mistakenly deleted, resulting in "no authoritative server available for lookup"
Repair methods:
Log in to your DNS service provider's console and check each DNS record. Pay special attention to A, AAAA, CNAME, and NS records.
Use `dig +trace yourdomain.com` to view the DNS resolution path and see where it gets stuck. If it gets stuck on an unresponsive NS server, then that NS record is the problem.
If the NS record points to the wrong address, first go to your domain registrar's backend and change the NS record back to the correct address.
Prevention: Before changing DNS configurations, verify in a test environment. Do not experiment directly on an online domain. Take screenshots before making important changes so you can quickly roll back if you make a mistake.
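The first item in the list of disastrous actions, an A record pointing to an internal address, is easy to check mechanically before a change goes live. Here is a minimal sketch using Python's standard `ipaddress` module. It assumes you have the records available as simple (name, IP) pairs, for example copied by hand from the provider's console; the function name and the input structure are illustrative.

```python
import ipaddress

def find_non_public_records(records):
    """Return (name, ip, reason) for A/AAAA values that should not be public.

    `records` is an iterable of (name, ip_string) pairs; this input shape
    is an assumption made for the sketch.
    """
    problems = []
    for name, ip_text in records:
        try:
            ip = ipaddress.ip_address(ip_text)
        except ValueError:
            problems.append((name, ip_text, "not a valid IP address"))
            continue
        if ip.is_loopback:
            problems.append((name, ip_text, "loopback address"))
        elif ip.is_private:
            problems.append((name, ip_text, "private (internal) address"))
    return problems
```

An empty result does not prove the records are right, only that none of them points at a loopback or private range.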
3. DNS Server Resource Exhaustion (DDoS Attack)
DNS servers are the "address system" of the internet and a prime target for DDoS attacks. Attackers flood the server with massive amounts of forged DNS requests, saturating its bandwidth, CPU, and connections, preventing legitimate requests from getting through.
Symptoms: Monitoring shows that DNS server traffic suddenly spikes to its bandwidth limit, CPU usage reaches 100%, and logs are filled with abnormal requests from different IPs.
Remediation Methods:
If your DNS provider offers DDoS protection, enable it immediately. Most commercial DNS providers have built-in protection, but there may be trigger thresholds, requiring manual activation of "high-defense mode."
If you are using a self-built DNS server, the situation is more challenging. Temporary measures: Implement rate limiting on the firewall or block abnormal IP ranges using iptables. However, self-built servers are largely ineffective against large-scale DDoS attacks, ultimately requiring upstream scrubbing.
Prevention: The DDoS protection capabilities of commercial DNS providers are far superior to those of your self-built server. If your business has high availability requirements, do not host your authoritative DNS on a small, self-built server.
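To make the idea of rate limiting concrete, here is a small conceptual sketch of a fixed-window limiter keyed by source IP. It is not a replacement for firewall rules or upstream scrubbing, and because attack traffic often uses forged source addresses, per-IP limits alone are a weak defense. The class name, parameters, and the choice of a fixed window are all illustrative assumptions.

```python
class FixedWindowLimiter:
    """Allow at most `limit` queries per source IP in each `window`-second slot."""

    def __init__(self, limit, window=1.0):
        self.limit = limit
        self.window = window
        self._state = {}  # source_ip -> (slot_number, queries_seen_in_slot)

    def allow(self, source_ip, now):
        """Return True if a query from source_ip at time `now` (seconds) may pass."""
        slot = int(now // self.window)
        last_slot, count = self._state.get(source_ip, (slot, 0))
        if last_slot != slot:
            # A new time slot has started, so the counter starts over.
            count = 0
        count += 1
        self._state[source_ip] = (slot, count)
        return count <= self.limit
```

With `FixedWindowLimiter(limit=2)`, the third query from the same address within one second is rejected, while other addresses and later seconds are unaffected.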
4. DNS Server Downtime
Users with self-built authoritative DNS servers may encounter this issue. The server itself may have crashed, or the service process may have stopped.
Symptoms: `dig` queries return "connection refused" or "no response".
Repair methods:
SSH into the DNS server and use `systemctl status named` (or `systemctl status unbound`, `systemctl status dnsmasq`) to check the process status.
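If the process has crashed, use `systemctl restart named` (substituting your service's unit name) to restart it. Before restarting, check the recent log entries for errors, for example with `tail -n 100 /var/log/messages` or `journalctl -u named`. If the process failed to start because of a configuration error, restarting will not solve the problem; fix the configuration first.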
If the server has crashed, recover the server first. If it's a cloud server, try a forced restart from the console. If it's a physical machine, contact the data center.
If the server cannot be recovered quickly, immediately go to your domain registrar's backend and change the NS record to an alternate DNS (if available), or temporarily change it to the address of your cloud DNS service provider.
Prevention: Authoritative DNS servers must be configured as primary and backup servers. At least two servers should be located in different data centers and configured with different NS records. If one fails, the other can still take over.
5. DNSSEC Configuration Error
DNSSEC is a security extension of DNS, adding digital signatures. However, its configuration is complex, and if an error occurs, recursive DNS servers with DNSSEC verification enabled will refuse to resolve your domain name.
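Symptoms: The domain resolves through recursive DNS servers that do not validate DNSSEC, but fails (typically with SERVFAIL) through validating ones such as 8.8.8.8. Running `dig +dnssec yourdomain.com` against a validating resolver shows the failure.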
Repair Method:
Log in to your DNS service provider's backend and check the DNSSEC configuration. If it's not needed, disable it.
If it needs to be enabled, confirm that the key configuration is correct and that the DS record is also configured synchronously in the domain registrar's backend. Many people only configure it on one end, leading to verification failure.
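The most direct method: Delete the DS record in the domain registrar's backend, then disable the DNSSEC function at your DNS provider. Removing the DS record first matters, because a DS record that points at an unsigned zone makes validating resolvers reject the domain. Reconfigure DNSSEC after resolution returns to normal.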
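One common way to confirm that DNSSEC is the culprit is to query a validating resolver twice: once normally and once with dig's `+cdflag` option (checking disabled). If the first answer is SERVFAIL and the second is NOERROR, validation is what is failing. The sketch below only parses dig output you have already captured as text; the function names are illustrative.

```python
import re

# dig prints a header line such as:
# ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 1234
_STATUS_RE = re.compile(r"status:\s*([A-Z]+)")

def dig_status(dig_output):
    """Extract the response code (e.g. NOERROR, SERVFAIL) from dig output text."""
    match = _STATUS_RE.search(dig_output)
    return match.group(1) if match else None

def looks_like_dnssec_failure(normal_output, cd_output):
    """True when a validating resolver fails normally but answers with checking disabled.

    normal_output: text of `dig @8.8.8.8 yourdomain.com`
    cd_output:     text of `dig @8.8.8.8 yourdomain.com +cdflag`
    """
    return dig_status(normal_output) == "SERVFAIL" and dig_status(cd_output) == "NOERROR"
```

A true result is a strong hint, not proof; the DS record and the zone's keys still need to be compared by hand.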
III. Recursive DNS Anomaly: Your "Navigator" is Broken
Recursive DNS is the user's "navigator." When a GPS navigator breaks, the road is still there, but the user doesn't know how to get to the destination. This situation is more common than authoritative DNS anomalies and usually only affects a specific group of users.
1. ISP DNS Failure
The stability of recursive DNS used by some smaller domestic ISPs is questionable. Periodic glitches, slow responses, and occasional incorrect results are common occurrences.
Symptoms: Users on a specific network (e.g., broadband users in a certain neighborhood) cannot access your website, while other networks function normally. Users themselves may also experience slow access to other websites.
Repair Method (User Side):
Change DNS. Manually change the DNS on your computer or router to a public DNS, such as 114.114.114.114, 223.5.5.5, or 119.29.29.29.
After changing, clear the cache using `ipconfig /flushdns` (Windows) or `sudo systemctl restart systemd-resolved` (Linux), and then try again.
Repair Methods (Website Owner's Perspective):
You can't fix your ISP's DNS for users. However, you can suggest users change their DNS, for example, by writing in the website announcement, "If you encounter access abnormalities, please try changing your DNS to 114.114.114.114."
You can also use third-party monitoring to check the DNS resolution status in different regions and with different ISPs. If a particular ISP experiences widespread abnormalities, it may be a problem with that ISP.
2. Local DNS Cache Pollution
Sometimes it's not a problem with the DNS server, but rather that your own computer or router is caching incorrect results.
Symptoms: The IP address resolved by the domain name is clearly incorrect (e.g., returning 127.0.0.1, or a completely unrelated IP).
Repair Methods:
Windows: Run `ipconfig /flushdns` as administrator. (`ipconfig /registerdns` re-registers the machine's own name and is not needed to clear the cache.)
macOS: `sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder`.
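Linux: `sudo systemctl restart systemd-resolved` (or `sudo resolvectl flush-caches` on systems that use systemd-resolved), or restart nscd if that is the caching daemon in use.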
Router: Restart the router, or clear the DNS cache in the router's management interface (if this option is available).
3. Hosts file tampered with
This is an easily overlooked problem. Mappings in the hosts file take precedence over DNS. If malware or an accidental edit adds `127.0.0.1 example.com` to the hosts file, this domain name will always resolve to the local machine.
Repair method:
Windows: Open `C:\Windows\System32\drivers\etc\hosts` with Notepad and check for any abnormal entries. Delete or comment them out (add `#` before them).
macOS/Linux: Check `/etc/hosts` in the same way.
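Checking the hosts file by eye works for a handful of lines; for longer files, a short script helps. The sketch below scans hosts-file text for active (not commented-out) entries that map a given domain. The function name is illustrative, and it assumes the usual format of an IP address followed by one or more names per line.

```python
def find_hosts_overrides(hosts_text, domain):
    """Return (line_number, ip) for active hosts-file entries that map `domain`."""
    hits = []
    wanted = domain.lower()
    for number, line in enumerate(hosts_text.splitlines(), start=1):
        # Everything after '#' is a comment, so commented-out entries are ignored.
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # blank line, comment-only line, or an IP with no names
        ip = fields[0]
        names = [name.lower() for name in fields[1:]]
        if wanted in names:
            hits.append((number, ip))
    return hits
```

On macOS or Linux you could pass it the contents of `/etc/hosts`, for example `find_hosts_overrides(open("/etc/hosts").read(), "example.com")`; any hit means the domain bypasses DNS on that machine.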
IV. Practical Troubleshooting Process: A Six-Step Method from Confusion to Locating the Problem
If the above categories are a bit confusing, don't worry. I've compiled a general troubleshooting process. Follow the steps, and you're unlikely to miss anything.
Step 1: Identify the Scope of the Problem
Is it only you who can't access the site, or is everyone unable to access it? Use online tools to test DNS resolution from across the country. If only you have the problem, the issue is local; if resolution fails nationwide, the problem lies with the authoritative DNS.
Step 2: Check Domain Status
Log in to your domain registrar's backend and confirm that the domain is not expired, not locked, and that real-name authentication is complete. If these three basic requirements are not met, even a perfectly configured DNS will be useless.
Step 3: Check NS Records
Use `dig +trace yourdomain.com` to view the DNS resolution path. If it gets stuck on a particular NS server, the problem lies with that server. If the NS record doesn't point to your configured DNS provider, the NS configuration in your domain registrar's backend may have been modified.
Step 4: Check the DNS Provider's Control Panel
Log in to your DNS provider's backend and check if the DNS resolution records are normal, and whether there are any notifications of unpaid fees or service suspension. Many free versions of DNS providers have request limits; exceeding these limits will result in service suspension.
Step 5: Test Different Recursive DNS Servers
Compare `dig @8.8.8.8 yourdomain.com` and `dig @114.114.114.114 yourdomain.com`. If 8.8.8.8 can resolve the domain but 114.114.114.114 cannot, the recursive DNS server at 114.114.114.114 has the problem, not your authoritative DNS.
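When more than two resolvers are involved, the comparison is easier to read as a summary. The sketch below takes the answers you collected (by hand from the dig commands above, or by any other means) and reports which resolvers fail and whether the rest agree. The function name, the input shape, and the verdict strings are assumptions made for illustration.

```python
def compare_resolvers(answers):
    """Summarize answers from several recursive resolvers.

    `answers` maps a resolver address to the set of IPs it returned;
    an empty set means the lookup failed on that resolver.
    """
    failing = sorted(r for r, ips in answers.items() if not ips)
    working = {r: frozenset(ips) for r, ips in answers.items() if ips}
    distinct_answers = set(working.values())
    if not working:
        verdict = "all resolvers fail: suspect the authoritative DNS or the domain itself"
    elif failing:
        verdict = "some resolvers fail: suspect those resolvers (or DNSSEC validation)"
    elif len(distinct_answers) > 1:
        verdict = "resolvers disagree: stale caches or region-specific answers"
    else:
        verdict = "all resolvers agree"
    return {"failing": failing, "verdict": verdict}
```

For the case described in this step, `compare_resolvers({"8.8.8.8": {"8.8.4.4"}, "114.114.114.114": set()})` lists 114.114.114.114 as the failing resolver.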
Step 6: Check the Logs
If it's a self-hosted DNS server, use `tail -f /var/log/messages` or the DNS service process log to check for errors. Common errors include: incorrect configuration syntax, incorrect file permissions, port being occupied, and upstream recursive DNS being unreachable.
DNS server troubleshooting essentially boils down to two things: locating the problematic DNS server and replacing it.
If the authoritative DNS server is down, switch to an alternative DNS server or switch service providers. If the recursive DNS server is down, switch to a public DNS server or clear the cache. If the configuration is incorrect, correct it. If it's under attack, use DDoS protection. If the domain has expired, renew it.
It sounds simple, but when this actually happens, people often panic under the alarms and user complaints, skip the troubleshooting steps, and just "guess blindly." They restart the server, restart the router, and spend ages changing configurations, only to find out in the end that they simply forgot to renew the domain name.