Name resolution failure on a caching server -- many '; pending-answer' records in the cache

TPCbind
Dear All,
     I run a caching server on a section of the departmental LAN.
Occasionally network congestion results in timeouts & name resolution
failures.  Lookups performed on name servers outside my LAN section
fail with NXDOMAIN.  Querying my name server for items not in its
cache gets the same result.

My problem is that long after the congestion has subsided, queries to
my name server still result in NXDOMAIN failure.  AFAICT this
situation remains indefinitely, until the cache is flushed with 'rndc
flush' or BIND is restarted.  When it is in this state, dumping the
cache with 'rndc dumpdb' shows numerous entries like this:

--------------------------------------------------------------------------------------------
; pending-additional
thdow.bbc.co.uk.        76632   NS      ns3.bbc.net.uk.
                        76632   NS      ns4.bbc.co.uk.
                        76632   NS      ns4.bbc.net.uk.
                        76632   NS      ns3.bbc.co.uk.
; pending-answer
ns0.thdow.bbc.co.uk.    2082    \-AAAA  ;-$NXRRSET
; thdow.bbc.co.uk. SOA ns.bbc.co.uk. hostmaster.bbc.co.uk. 2015122100 1800 600 864000 86400
; pending-answer
                        76632   A       212.58.240.162
; pending-answer
www.bbc.co.uk.          30      CNAME   www.bbc.net.uk.
; glue
--------------------------------------------------------------------------------------------

and attempts to look up e.g. www.bbc.co.uk result in NXDOMAIN.
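
To see how widespread the 'pending-*' entries are in a full dump, the label comments can be tallied. The following is only a rough sketch: it assumes each label sits alone on a comment line of the form "; label", as in the excerpt above, and the names count_trust_labels and SAMPLE are made up for illustration.

```python
import re
from collections import Counter

# Matches a comment holding only a label, e.g. "; glue" or
# "; pending-answer". Longer comments (such as the SOA line) do not match.
TRUST_LINE = re.compile(r"^; ([a-z]+(?:-[a-z]+)*)$")


def count_trust_labels(dump_text):
    """Count how often each label comment appears in a cache dump."""
    counts = Counter()
    for line in dump_text.splitlines():
        match = TRUST_LINE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


# The excerpt quoted above, as a raw string so "\-AAAA" is kept literally.
SAMPLE = r"""
; pending-additional
thdow.bbc.co.uk.        76632   NS      ns3.bbc.net.uk.
                        76632   NS      ns4.bbc.co.uk.
                        76632   NS      ns4.bbc.net.uk.
                        76632   NS      ns3.bbc.co.uk.
; pending-answer
ns0.thdow.bbc.co.uk.    2082    \-AAAA  ;-$NXRRSET
; thdow.bbc.co.uk. SOA ns.bbc.co.uk. hostmaster.bbc.co.uk. 2015122100 1800 600 864000 86400
; pending-answer
                        76632   A       212.58.240.162
; pending-answer
www.bbc.co.uk.          30      CNAME   www.bbc.net.uk.
; glue
"""

if __name__ == "__main__":
    for label, n in count_trust_labels(SAMPLE).most_common():
        print(label, n)
```

Running the same function over the whole dump file, once in the broken state and once after a flush, would show whether the 'pending-*' share really differs between the two.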

Browsing the documentation I noticed the parameter 'max-ncache-ttl',
which is unset in my named.conf and apparently defaults to 3 hours.
However, the problem persists long after 3 hours have elapsed following
incidents of network congestion.
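
As a worked check of that reasoning: under RFC 2308 a negative answer is cached for the smaller of the SOA record's own TTL and its MINIMUM field, and BIND additionally caps that at max-ncache-ttl (10800 seconds by default). The sketch below uses the MINIMUM of 86400 from the SOA comment in the dump; the SOA TTL is not visible there, so the value passed for it is an assumption, and the function name is invented for illustration.

```python
def negative_cache_ttl(soa_ttl, soa_minimum, max_ncache_ttl=10800):
    """Upper bound, in seconds, on how long a negative answer is cached.

    RFC 2308 takes min(SOA TTL, SOA MINIMUM); BIND then applies
    max-ncache-ttl as a further cap (default 10800 s = 3 hours).
    """
    return min(soa_ttl, soa_minimum, max_ncache_ttl)


if __name__ == "__main__":
    # soa_minimum=86400 is the last field of the SOA shown in the dump.
    # soa_ttl=86400 is an ASSUMED value, since the dump does not show it.
    print(negative_cache_ttl(soa_ttl=86400, soa_minimum=86400))
```

With these inputs the cap is what bites, and the remaining TTL of 2082 on the \-AAAA entry in the dump is consistent with a ceiling of 10800. Negative caching alone therefore should not keep a name failing for longer than 3 hours.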

I could set up a cron job to check name resolution on external domains
and flush the cache when it fails.  I am assuming there must be a better
solution!  Should I set max-ncache-ttl to something fairly short in my
named.conf, on the assumption that the default is, for some reason,
actually much longer than 3 hours?
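
For reference, the cron-job workaround could look roughly like the sketch below. It is a stopgap rather than a fix, and it rests on several assumptions: dig and rndc are on the PATH, the caching server listens on 127.0.0.1, and the probe names are placeholders to be chosen locally. The function names are invented for illustration.

```python
import ipaddress
import subprocess


def _is_ip(text):
    """True if text parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(text)
        return True
    except ValueError:
        return False


def resolves(name, server="127.0.0.1", run=subprocess.run):
    """Ask the given server for an A record; True if an address comes back."""
    result = run(
        ["dig", "+short", "+time=2", "+tries=1", "@" + server, name, "A"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return False
    return any(_is_ip(line.strip()) for line in result.stdout.splitlines())


def check_and_flush(names, server="127.0.0.1", run=subprocess.run):
    """Flush the cache only if none of the probe names resolves.

    Returns True when a flush was issued.
    """
    if any(resolves(name, server, run) for name in names):
        return False
    run(["rndc", "flush"], capture_output=True, text=True)
    return True


if __name__ == "__main__":
    # Placeholder probe names; pick several unrelated external domains.
    print(check_and_flush(["www.bbc.co.uk", "www.isc.org"]))
```

Requiring every probe to fail before flushing avoids throwing away the cache because a single external domain happens to be down. Note that this would also flush during the congestion itself, so some delay or retry logic may be wanted.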

BTW, is there a way to dump out all the parameters from a running named,
just to see all their values?


Any ideas on how to solve or further diagnose the problem?

Many thanks
Tom Crane

System details:
OS:    Scientific Linux CERN SLC release 6.7 (Carbon) [NB: SLC is a derivative of RHEL]
BIND:  bind-9.8.2-0.37.rc1.el6_7.5.x86_64

Ps. I originally posted in Usenet NG comp.protocols.dns.bind but
got no followups and then noticed all messages in that NG had this
ML's fields 'NNTP-Posting-Host: lists.isc.org' and 'X-Original-To:
[hidden email]' etc. in their headers.  Is c.p.d.b
actually a moderated group now or exclusively tied to this ML via
a mail2news gateway?

--
Tom Crane, Dept. Physics, Royal Holloway, University of London, Egham Hill,
Egham, Surrey, TW20 0EX, England.
Email:  T dot Crane at rhul dot ac dot uk

_______________________________________________
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list

bind-users mailing list
[hidden email]
https://lists.isc.org/mailman/listinfo/bind-users

RE: Name resolution failure on a caching server -- many '; pending-answer' records in the cache

Kevin Darcy
NXDOMAIN is not a "failure" response. Are you *sure* you're getting NXDOMAIN? If you're using nslookup to test, be aware that it will do suffix searching by default, so if the original query, e.g. www.bbc.co.uk, fails, it'll quietly (unless debug mode is in effect) start appending suffixes. Looking up those suffixed names, e.g. www.bbc.co.uk.example.com, most likely gets an NXDOMAIN, so nslookup reports NXDOMAIN as the overall result of the query. So, it's basically a misreporting of the error by nslookup.
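
The suffix searching described above can be modelled with a few lines. This is a simplified sketch of the order of attempts as described in this message, not nslookup's actual implementation; real resolvers also take settings such as ndots into account. The search list "example.com" and the function name are illustrative.

```python
def candidate_names(name, search_list):
    """Names a suffix-searching client would try, in order.

    A trailing dot marks the name as fully qualified, which
    suppresses the search list entirely.
    """
    if name.endswith("."):
        return [name]
    candidates = [name + "."]
    for suffix in search_list:
        candidates.append(f"{name}.{suffix.rstrip('.')}.")
    return candidates


if __name__ == "__main__":
    for candidate in candidate_names("www.bbc.co.uk", ["example.com"]):
        print(candidate)
```

If the first candidate times out or gets SERVFAIL and the suffixed one gets NXDOMAIN, the NXDOMAIN is what ends up being reported. Querying with a trailing dot (www.bbc.co.uk.) or using dig, which does not apply the search list by default, avoids the ambiguity.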

Note that only 1 of the records in your cache dump is actually relevant -- the CNAME from www.bbc.co.uk to www.bbc.net.uk -- and the others are for a different part of the namespace (thdow.bbc.co.uk).

If you do an explicit query of the CNAME, when the problem is occurring, does it resolve? I would expect, even though the cache entry is marked "pending-answer", it will still resolve. But, without the target of the CNAME also resolving, the lookup as a whole cannot succeed.
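
To illustrate why the CNAME alone is not enough, here is a toy model of following a CNAME chain through a set of cached records. It is only a sketch: the dictionary layout, the function name and the address 192.0.2.10 (a documentation address) are all made up, and a real resolver would go out and fetch a missing target rather than give up.

```python
def resolve_a(name, records, max_hops=8):
    """Follow CNAMEs through `records` until an A rrset is found.

    `records` maps an owner name to a dict of rrsets, e.g.
    {"www.bbc.co.uk.": {"CNAME": "www.bbc.net.uk."}}.
    Returns the list of addresses, or None if the chain dead-ends.
    """
    for _ in range(max_hops):
        rrsets = records.get(name, {})
        if "A" in rrsets:
            return rrsets["A"]
        if "CNAME" not in rrsets:
            return None
        name = rrsets["CNAME"]
    return None  # chain too long or looping


if __name__ == "__main__":
    cache = {"www.bbc.co.uk.": {"CNAME": "www.bbc.net.uk."}}
    print(resolve_a("www.bbc.co.uk.", cache))  # target missing

    cache["www.bbc.net.uk."] = {"A": ["192.0.2.10"]}  # hypothetical address
    print(resolve_a("www.bbc.co.uk.", cache))
```

In the first call the CNAME itself is present, yet the lookup as a whole fails because nothing is known about www.bbc.net.uk. That is why querying the CNAME and its target separately helps narrow down which step is broken.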

- Kevin

Re: Name resolution failure on a caching server -- many '; pending-answer' records in the cache

TPCbind
Thanks for the followup.

>
> NXDOMAIN is not a "failure" response. Are you *sure* you're getting NXDOMAIN?

Yes. Pretty sure. With hindsight I should have run the tests inside a 'script' session.

> If you're using nslookup to test, be aware that it will do suffix searching by default, so if the original query, e.g. www.bbc.co.uk, fails, it'll quietly (unless debug mode is in effect) start appending suffixes. Looking up those suffixed names, e.g. www.bbc.co.uk.example.com, most likely gets an NXDOMAIN, so nslookup reports NXDOMAIN as the overall result of the query. So, it's basically a misreporting of the error by nslookup.

Yes. I was mostly using nslookup.  I'll try dig too next time this occurs.
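
When testing with dig, the response code is printed in the ";; ->>HEADER<<-" line, which makes it easy to tell NXDOMAIN apart from SERVFAIL or a plain timeout. A small sketch for pulling it out of captured output follows; the function name is invented, and the header text in the example is illustrative rather than taken from a real run.

```python
import re

# dig prints the rcode in its header line, for example:
# ";; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 12345"
STATUS = re.compile(r"^;; ->>HEADER<<- .*\bstatus: ([A-Z0-9]+)", re.MULTILINE)


def dig_status(output):
    """Return the rcode name from dig output, or None if no reply was shown."""
    match = STATUS.search(output)
    return match.group(1) if match else None


if __name__ == "__main__":
    example = (
        "; <<>> DiG <<>> www.bbc.co.uk\n"
        ";; Got answer:\n"
        ";; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 12345\n"
    )
    print(dig_status(example))
```

A result of None means no header was printed at all, which is what happens when dig gives up with "connection timed out; no servers could be reached".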

>
> Note that only 1 of the records in your cache dump is actually relevant -- the CNAME from www.bbc.co.uk to www.bbc.net.uk -- and the others are for a different part of the namespaces (thdow.bbc.co.uk).

I'll contact you privately with a link to the whole cache.  Every cache entry tagged 'pending-*' that I tried querying failed to resolve, many hours after the network congestion had ended.

>
> If you do an explicit query of the CNAME, when the problem is occurring, does it resolve? I would expect, even though the cache entry is marked "pending-answer", it will still resolve. But, without the target of the CNAME also resolving, the lookup as a whole cannot succeed.

I'll try that next time.

Regards
Tom.

--
Tom Crane, Dept. Physics, Royal Holloway, University of London, Egham Hill,
Egham, Surrey, TW20 0EX, England.
Email:  [hidden email]
Fax:    +44 (0) 1784 472794