We use NextDNS as our DNS resolver. That had never been an issue, but we recently reconfigured the server to use DNS-over-TLS (DoT) and to validate DNSSEC (i.e. DNSSEC=yes). The change started causing significant and recurring problems on the server.
The problem wasn’t with NextDNS, though; their service has been working flawlessly for us. It’s a long-standing, well-known problem with systemd-resolved that the maintainers have largely ignored. When you first enable DNSSEC with systemd-resolved, everything works fine for a while. After some period, however, systemd-resolved stops resolving altogether with the error DNSSEC validation failed … incompatible-server.
The workaround is to restart the systemd-resolved service. DNS resolution then works again, until it doesn’t, and the cycle repeats. While systemd-resolved is in its degraded state and not resolving anything, the server thinks the whole of the Internet is borked. Without some sort of automated mitigation, the server can’t resolve a single hostname until somebody notices and manually restarts the systemd-resolved service.
Seemingly, systemd-resolved receives a SERVFAIL from its upstream resolver, panics, downgrades itself, and fails to resolve all future queries until the service is restarted. This could be avoided by not using DNSSEC, but that’s not really a solution. DNSSEC has its advantages, and we wanted to keep using it.
The workaround we decided to implement was a watchdog service of sorts that monitors systemd-resolved for this degradation. When it detects it, it immediately resets what systemd-resolved has learnt about its upstream servers, without restarting the whole service.
The watchdog does this by flushing the caches with resolvectl flush-caches and then running the resolvectl reset-server-features command. The latter flushes all feature-level information the resolver has learnt about specific servers and ensures that the server feature probing logic starts from the beginning with the next look-up request. This is mostly equivalent to sending the SIGRTMIN+1 signal to the systemd-resolved service.
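To try the reset by hand before automating it, something like the following sketch could be used. The run helper is a hypothetical dry-run wrapper (not part of systemd) that only prints the commands unless RUN=1 is set, and the systemctl kill line assumes your systemctl accepts the real-time signal name in this form.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints each command unless RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# What the watchdog executes (needs root when actually run).
run resolvectl flush-caches
run resolvectl reset-server-features

# Roughly the same effect as reset-server-features, by signalling the
# service directly (assumes systemctl understands "SIGRTMIN+1").
run systemctl kill --signal=SIGRTMIN+1 systemd-resolved.service
```

Running it as-is only echoes the three commands; running it as root with RUN=1 set actually executes them.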
[Unit]
Description=Auto-Fix for DNSSec With systemd-resolved
StartLimitIntervalSec=0

[Service]
ExecStart=sh -c 'journalctl -n0 -fu systemd-resolved | grep -m1 "DNSSEC validation failed.*incompatible-server" && resolvectl flush-caches && resolvectl reset-server-features'
Restart=always
User=root

[Install]
WantedBy=systemd-resolved.service
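The matching logic in that ExecStart line can also be written as a small shell loop, which makes it easier to see (and test) which log lines trigger the reset. This is a sketch of an alternative, not the unit we actually run; on_dnssec_failure is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: a line-by-line variant of the watchdog's matching logic.
# on_dnssec_failure is a hypothetical helper, not part of systemd.

# Read log lines from stdin and run the given command for every line that
# reports the DNSSEC "incompatible-server" failure.
on_dnssec_failure() {
    while IFS= read -r line; do
        case $line in
            *"DNSSEC validation failed"*"incompatible-server"*)
                # Detach stdin so the command cannot consume log lines.
                "$@" </dev/null
                ;;
        esac
    done
}

# Intended use (needs root and a running systemd-resolved); left commented
# out so that sourcing this file has no side effects:
# journalctl -n0 -fu systemd-resolved |
#     on_dnssec_failure sh -c 'resolvectl flush-caches && resolvectl reset-server-features'
```

Unlike the grep -m1 pipeline, which handles one match and then relies on Restart=always to start over, the loop keeps reading and reacts to each matching line as it arrives.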
Save the unit above as /etc/systemd/system/systemd-resolved-autofix-dnssec.service, then run each of the following commands to get it up and running:
sudo chown root:root /etc/systemd/system/systemd-resolved-autofix-dnssec.service
sudo chmod 644 /etc/systemd/system/systemd-resolved-autofix-dnssec.service
sudo systemctl start systemd-resolved-autofix-dnssec.service
sudo systemctl enable systemd-resolved-autofix-dnssec.service
sudo systemctl status systemd-resolved-autofix-dnssec.service
Command #5 above (systemctl status) should report the unit as Active: active (running).
With the watchdog up and running, should systemd-resolved again downgrade itself to the point of failure because of this particular DNSSEC issue, the watchdog will automagically slap it back to reality. While this is by no means a fix for systemd-resolved, it’s a somewhat viable workaround for the problem.