checker: report transient apex-lookup failures as Unknown, not Crit
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone/tag Build is passing

apexLookupRule mapped every findApex failure to Crit, including transport
and resolver faults like "lookup nemunai.re on 127.0.0.11:53: server
misbehaving" — a flaky recursive resolver, not a broken delegation. That
made the check flap into Crit whenever the resolver hiccuped, the same
class of false negative the chain path already fixed.

Mark apex-lookup failures that stem from a transport/resolver fault
(resolveZoneNSAddrs net errors, recursiveExchange transport errors, and
SERVFAIL/REFUSED seen during the SOA walk) as transient via a typed
error, surface it as ApexLookupTransient, and have apexLookupRule report
Unknown for those. Definitive failures (NXDOMAIN-only walk, no resolvable
NS) still drive Crit.
This commit is contained in:
nemunaire 2026-06-18 10:05:51 +09:00
commit da6def100c
7 changed files with 123 additions and 23 deletions

View file

@ -2,6 +2,8 @@ package checker
import (
"context"
"errors"
"fmt"
"net"
"testing"
@ -23,6 +25,22 @@ func TestIsTransientRcode(t *testing.T) {
}
}
func TestIsTransientApexError(t *testing.T) {
wrapped := transientApexError{errors.New("server misbehaving")}
if !isTransientApexError(wrapped) {
t.Errorf("transientApexError should be classified as transient")
}
if !isTransientApexError(fmt.Errorf("wrapped: %w", wrapped)) {
t.Errorf("error wrapping a transientApexError should be transient")
}
if isTransientApexError(errors.New("could not locate apex of example.com.")) {
t.Errorf("plain error should not be classified as transient")
}
if isTransientApexError(nil) {
t.Errorf("nil error should not be classified as transient")
}
}
// startTestServer spins up a UDP DNS server that answers every query with the
// given handler, returning its address and a shutdown func.
func startTestServer(t *testing.T, handler dns.HandlerFunc) (string, func()) {