validating ... bad cache hit

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

validating ... bad cache hit

Bind-Users forum mailing list
Hi,

we got reports about a temporary resolution failure for some names
under norid.no this morning.  Digging through the logs, the first
instance appears to be

Apr 24 08:35:02 resolver named[244]: validating zabbix-test.norid.no/CNAME: bad cache hit (norid.no/DNSKEY)

and a couple of minutes later, a rash of entries pointing to the same
bad cache hit.  The last entry after this pattern was some 10 minutes
later.

Looking at the code in BIND 9.14.10 (BIND 9.16.2 doesn't appear to be
significantly different in this regard), there appears to be a "cache
of bad records" implemented by lib/dns/badcache.c.  There are two
invocations of dns_resolver_addbadcache() in lib/dns/resolver.c, with
fairly complicated preconditions to reach each of those two points.
However, it appears that if I have not turned on query tracing (we
have not; I think we did previously, but found it to be a severe and
noticeable performance hit), I will not get any logging indicating
which of the two conditions hit, or why, so the trace for the root
cause for why norid.no/DNSKEY was temporarily marked bad goes cold at
this point as far as I can see.

Our logging is configured to (among other things) log the
"dnssec" and "security" categories at severity info and higher.

Is there something which can be done to improve the diagnostics for
such situations?  I don't suppose there is anything more to be found
for this particular problem at the moment?

Regards,

- Håvard
_______________________________________________
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list

bind-users mailing list
[hidden email]
https://lists.isc.org/mailman/listinfo/bind-users
Reply | Threaded
Open this post in threaded view
|

Re: validating ... bad cache hit

Tony Finch
Havard Eidnes via bind-users <[hidden email]> wrote:
>
> Looking at the code in BIND 9.14.10 (BIND 9.16.2 doesn't appear to be
> significantly different in this regard), there appears to be a "cache
> of bad records" implemented by lib/dns/badcache.c.  There are two
> invocations of dns_resolver_addbadcache() in lib/dns/resolver.c, with
> fairly complicated preconditions to reach each of those two points.

I've had a very quick look at the code, and it looks to me like one case
is due to lack of authoritative server IP addresses, and one is due to
validation failure. I think you could work out whether it is the first
case by dumping the cache and looking for relevant adb entries for the
zone's nameservers. (But you might have to do so within the 10 minute lame
TTL.)

Tony.
--
f.anthony.n.finch  <[hidden email]>  http://dotat.at/
Shetland Isles: Northeast 3 to 5, becoming variable 3 or less. Slight or
moderate. Fair. Good.
_______________________________________________
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list

bind-users mailing list
[hidden email]
https://lists.isc.org/mailman/listinfo/bind-users
Reply | Threaded
Open this post in threaded view
|

Re: validating ... bad cache hit

Bind-Users forum mailing list
>> [...] There are two invocations of dns_resolver_addbadcache() in
>> lib/dns/resolver.c, with fairly complicated preconditions to reach
>> each of those two points.
>
> I've had a very quick look at the code, and it looks to me like one
> case is due to lack of authoritative server IP addresses, and one is
> due to validation failure. I think you could work out whether it is
> the first case by dumping the cache and looking for relevant adb
> entries for the zone's nameservers. (But you might have to do so
> within the 10 minute lame TTL.)

Hm.

I find both of these quite unlikely.  norid.no is seved by 4 name
servers with 8 addresses (v4 + v6), and the TTLs for the addresses are
typically quite staggered.

If it was due to validation failure, I would have thought that it
would be more persistent than only last for 10 minutes.  The current
RRSIGs over the CNAME in question has validity-start 20200422025456
(two-days-ago-ish) and for the pointed-to A record, validity-start is
20200423055129 (one-day-ago-ish), so it seems to me to be likely that
they were published in that state in the DNS also this morning.  Plus
the signer (OpenDNSSEC) last re-signed the zone 06:51 this morning,
and then next on 08:51.

So I'm still quite confused as to why this happened.

Regards,

- Håvard
_______________________________________________
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list

bind-users mailing list
[hidden email]
https://lists.isc.org/mailman/listinfo/bind-users
Reply | Threaded
Open this post in threaded view
|

Re: validating ... bad cache hit

Tony Finch
Havard Eidnes via bind-users <[hidden email]> wrote:
>
> If it was due to validation failure, I would have thought that it
> would be more persistent than only last for 10 minutes.

Looking for vaguely plausible causes I guess what might have happened is
there was a DNSKEY lookup failure (transient network problem?), so
validator.c:fetch_callback_dnskey() returned DNS_R_BROKENCHAIN, so the
DNSKEY query was added to the bad cache. After that happens, validation
attempts failed for lame-ttl seconds because of the
dns_resolver_getbadcache() check in validator.c:view_find().

Tony.
--
f.anthony.n.finch  <[hidden email]>  http://dotat.at/
West Forties, Cromarty, Forth, Tyne, Dogger: Variable, mainly northerly, 2 to
4. Slight or moderate. Fog patches in Forth and Tyne. Moderate or good,
occasionally very poor in Forth and Tyne.
_______________________________________________
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list

bind-users mailing list
[hidden email]
https://lists.isc.org/mailman/listinfo/bind-users