resolv.conf question / timeout behaviour


Bind-Users forum mailing list
Hi,

at my workplace we have a three-resolver setup in /etc/resolv.conf.

We have occasionally, though rarely, seen DNS response times of around
14000ms when the *first* listed resolver was down for maintenance. The
application we test this with is Oracle/TNSPing. As a mitigation we set
timeout:1, but we recently saw a TNSPing response of 9000ms again.

I noticed in man resolv.conf this section on "timeout":

              timeout:n
                     Sets the amount of time the resolver will wait for
                     a response from a remote name server before
                     retrying the query via a different name server.
|                    This may not be the total time taken by any
|                    resolver API call and there is no guarantee that a
|                    single resolver API call maps to a single timeout.
                     Measured in seconds, the default is RES_TIMEOUT
                     (currently 5, see <resolv.h>).  The value for this
                     option is silently capped to 30.

I am intrigued by the sentence marked with "|" above. Does anybody know
what it means in detail, and could somebody explain it, please?

My explanation for the 9000ms is that Oracle's many processes all try to
resolve the DNS name at the same time and *keep hitting* the first resolver,
so "timeout" cannot kick in because the requests come in parallel from
different processes, hence the high overall response time.
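
One possible reading of the marked sentence: a single resolver API call (for example getaddrinfo) can send more than one DNS query, such as one per search-list candidate, and each query can wait out its own timeout on an unresponsive server. The Python sketch below is a deliberately simplified model and not the libc algorithm. It assumes the queries run one after another, that every query tries the name servers in listed order, and that each dead server costs exactly one timeout. The function name, its parameters, and the example numbers are hypothetical; only the cap of 30 seconds comes from the man page excerpt above.

```python
MAX_TIMEOUT_S = 30  # the man page says the option is silently capped to 30


def worst_case_wait(timeout_s, dead_servers_first, queries_per_call):
    """Seconds one API call spends waiting on dead servers (simplified model).

    timeout_s          -- value of the resolv.conf "timeout" option
    dead_servers_first -- unresponsive servers listed before the first live one
    queries_per_call   -- sequential DNS queries behind one API call (assumed)
    """
    if timeout_s < 0 or dead_servers_first < 0 or queries_per_call < 0:
        raise ValueError("arguments must be non-negative")
    effective_timeout = min(timeout_s, MAX_TIMEOUT_S)
    # Each query waits one timeout per dead server before reaching a live one.
    per_query = effective_timeout * dead_servers_first
    return per_query * queries_per_call


if __name__ == "__main__":
    # timeout:1, first of three servers down, hypothetical 9 sequential queries
    print(worst_case_wait(1, 1, 9))
```

Under these assumptions, nine sequential queries with timeout:1 would add up to about 9 seconds of waiting. Whether that is what happened in the TNSPing case is not established here.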


Kind Regards

Thomas Preissler
_______________________________________________
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list

ISC funds the development of this software with paid support subscriptions. Contact us at https://www.isc.org/contact/ for more information.


bind-users mailing list
[hidden email]
https://lists.isc.org/mailman/listinfo/bind-users
Re: resolv.conf question / timeout behaviour

Matus UHLAR - fantomas
On 31.03.21 10:56, Tom Preissler via bind-users wrote:
>at my work place we have a three resolver setup in /etc/resolv.conf.

resolv.conf is not a BIND thing; it configures the system resolver libraries.

>We had sometimes, though rarely, response times for DNS like 14000ms,
>due to the fact that the *first* listed resolver is down for maintenance
>reasons. The application we test this with is Oracle/TNSPing.

If this is an issue, you can run a local caching DNS server such as BIND or
dnsmasq. They can handle such timeouts better than most libraries.

>As a mitigation we therefore put in timeout:1, but we just recently got
>again a TNSPing response of 9000ms.
>
>I noticed in man resolv.conf this section on "timeout":
>
>              timeout:n
>                     Sets the amount of time the resolver will wait for
>                     a response from a remote name server before
>                     retrying the query via a different name server.
>|                    This may not be the total time taken by any
>|                    resolver API call and there is no guarantee that a
>|                    single resolver API call maps to a single timeout.
>                     Measured in seconds, the default is RES_TIMEOUT
>                     (currently 5, see <resolv.h>).  The value for this
>                     option is silently capped to 30.
>
>I am intrigued by the above sentence marked with "|". Does anybody
>know what that means in detail, can anybody explain that please?
>
>I explained the reason for the 9000ms so that Oracle and its many processes
>all come together to resolve the DNS name and they *keep hitting* the first
>resolver - and "timeout" can't kick in due to parallel requests from different
>processes, hence the high overall response time.


--
Matus UHLAR - fantomas, [hidden email] ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Spam = (S)tupid (P)eople's (A)dvertising (M)ethod

Re: resolv.conf question / timeout behaviour

Tony Finch
Tom Preissler <[hidden email]> wrote:
>
> at my work place we have a three resolver setup in /etc/resolv.conf.
>
> We had sometimes, though rarely, response times for DNS like 14000ms,
> due to the fact that the *first* listed resolver is down for maintenance
> reasons.

Sadly the traditional unix stub resolver behaves REALLY BADLY if any of
its servers are unavailable. It does not keep enough information about
server performance and isn't really designed to be able to do that. The
resolv.conf tuning options are too coarse to help in any meaningful way.

Because of this, if it's important for you to avoid multi-second DNS
lookup times (and it usually is!), you need to design your system so that
the libc resolver never tries to talk to a DNS server that isn't
available.

As Matus Uhlar said, one way is to run a resolver daemon (e.g. BIND
configured to forward to your recursive servers) on each machine. Resolver
daemons are better able to keep track of which server is up, and they are
less likely to be unavailable when the client software needs them since
they are on the same machine. Most operating systems have resolver daemons
now; it's basically only oldskool unix that needs extra setup.
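
To illustrate what "keeping track of which server is up" can mean, here is a minimal Python sketch. It is not how BIND or any particular resolver daemon works. It only shows the general idea of remembering recent failures so that a server known to be down is tried last instead of first. The class name, the 30-second penalty window, and the example addresses (from the documentation range 192.0.2.0/24) are all assumptions made for illustration.

```python
import time


class ServerTracker:
    """Remember recent failures so that suspect servers are tried last."""

    def __init__(self, servers, penalty_s=30.0, clock=time.monotonic):
        self.servers = list(servers)   # configured order, as in resolv.conf
        self.penalty_s = penalty_s     # how long a failure is remembered
        self.clock = clock             # injectable clock, useful for testing
        self.failed_at = {}            # server -> time of last observed failure

    def mark_failed(self, server):
        self.failed_at[server] = self.clock()

    def mark_ok(self, server):
        self.failed_at.pop(server, None)

    def is_suspect(self, server):
        failed = self.failed_at.get(server)
        return failed is not None and self.clock() - failed < self.penalty_s

    def order(self):
        # sorted() is stable and False sorts before True, so healthy servers
        # keep their configured order and suspect servers move to the end.
        return sorted(self.servers, key=self.is_suspect)


if __name__ == "__main__":
    tracker = ServerTracker(["192.0.2.1", "192.0.2.2", "192.0.2.3"])
    tracker.mark_failed("192.0.2.1")
    print(tracker.order())
```

A stub resolver that is set up afresh in every process has no shared place to keep this kind of state, whereas a long-running daemon on the same machine does.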

Another way is a high availability setup for your recursive servers. I use
keepalived (my servers are on a resilient layer 2 network that spans
multiple locations); or you can use anycast if you need to do failover at
layer 3.

Of course, you can do both :-)

Tony.
--
f.anthony.n.finch  <[hidden email]>  https://dotat.at/
Faeroes: North backing west 5 or 6, decreasing 3 or 4 for a time.
Moderate or rough. Fair. Good.


Re: resolv.conf question / timeout behaviour

Bind-Users forum mailing list
On 3/31/21 10:00 AM, Tony Finch wrote:
> Because of this, if it's important for you to avoid multi-second
> DNS lookup times ... you need to design your system so that the libc
> resolver never tries to talk to a DNS server that isn't available.

I've seen various client OSs fail in really weird ways when the first
DNS server in the list doesn't respond quickly enough, let alone at all.

> Another way is a high availability setup for your recursive servers.

+1 to something like VRRP / CARP / routing tricks to make sure that the
Virtual / Service IP that clients use as the first DNS server is always
available, even if the first and second IPs are on the same system for a
few minutes while the other is patched.



--
Grant. . . .
unix || die

