strange problem with query being dropped/ignored by the BIND process


strange problem with query being dropped/ignored by the BIND process

Marc Richter
Hi,

we have a setup here consisting of a recursive DNS server and two
monitoring servers. The monitoring servers send a test query to the DNS
server once every two minutes to check whether it is answering properly.

We now have the problem that these test queries time out from time
to time, (correctly) resulting in alarms in our monitoring system.

I have checked this now and noticed that each time we see that alarm, the
query sent by the monitoring server is not being answered at all.
To debug that I ran tcpdump on both the monitoring server and the recursive
DNS server. I see the query being sent out on the monitoring server and I
also see the query being received on the DNS server, however there is no
response sent to this query at all.
Looking at the query log, which I enabled temporarily, the query is also
not logged there so it looks like BIND is ignoring that query somewhere,
although it is properly received by the IP stack of the server.
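
One way to narrow this down is to match queries against responses in the capture itself. The sketch below is only an illustration: the function name unanswered_queries is made up, it assumes the text output format of `tcpdump -n` for DNS over UDP/IPv4, and on Solaris it may need nawk or gawk instead of the default awk.

```shell
# unanswered_queries: read `tcpdump -n` text output on stdin and print
# "client-address.port query-id" for every DNS query without a response.
# Assumed line format (IPv4 only):
#   12:00:00.000001 IP 10.0.0.1.53000 > 10.0.0.2.53: 12345+ A? example.com. (29)
unanswered_queries() {
    awk '
    $2 == "IP" && $4 == ">" {
        src = $3
        dst = $5; sub(/:$/, "", dst)         # strip the trailing colon
        id = $6;  sub(/[^0-9].*$/, "", id)   # drop flag characters such as "+"
        if (id == "") next
        if (dst ~ /\.53$/)      { pending[src " " id] = 1 }    # query to the server
        else if (src ~ /\.53$/) { delete pending[dst " " id] } # response from it
    }
    END { for (k in pending) print k }
    ' | sort
}

# Usage (capture.pcap is a placeholder file name):
#   tcpdump -n -r capture.pcap udp port 53 | unanswered_queries
```

Queries that were still in flight when the capture ended are listed as well, so the last few entries should be read with care.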

Do you have any suggestions on how to debug this further, to find out
where these queries are stuck, dropped or ignored? I have run out of ideas.

The environment is:
BIND 9.9.9-P5 (Extended Support Version) <id:1ab232a>
running on SunOS sun4v 5.11 11.3


Thanks !
Marc
_______________________________________________
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list

bind-users mailing list
[hidden email]
https://lists.isc.org/mailman/listinfo/bind-users

Re: strange problem with query being dropped/ignored by the BIND process

Ben Croswell
Have you checked deeper at the OS level? I have seen on Linux DNS servers silent drops of queries on very busy servers that were exhausting UDP receive buffers.


Re: [E] Re: strange problem with query being dropped/ignored by the BIND process

Marc Richter
Hi Ben,

thanks for the answer.

Yeah, I think you are right. I see a lot of udpInOverflows on the system,
which suggest that the receive buffer is too small indeed.
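
To see whether the counter is still growing, it can be sampled twice and compared. A small sketch, assuming that `netstat -s` on this system prints the counter in the usual `name = value` layout; the helper name counter_value is hypothetical.

```shell
# counter_value NAME: read `netstat -s` style output on stdin and print
# the value that follows "NAME =". Also handles "NAME =12345" (no space
# after the equals sign), which can occur with large values.
counter_value() {
    awk -v name="$1" '
    {
        gsub(/=/, " = ")                 # make sure "=" is its own field
        for (i = 1; i + 2 <= NF; i++)
            if ($i == name && $(i + 1) == "=") { print $(i + 2); exit }
    }'
}

# Usage: growth of the counter over one minute
#   before=$(netstat -s | counter_value udpInOverflows)
#   sleep 60
#   after=$(netstat -s | counter_value udpInOverflows)
#   echo "udpInOverflows grew by $(( after - before ))"
```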

Is there any recommendation or best-practice advice on what the
buffers should ideally be set to on Solaris?
I searched the ISC Knowledge Base, but didn't find any useful advice.

Regards
Marc

--
Marc Richter
Engr III Cslt-Ntwk Eng&Ops

Sebrathweg 20
44149 Dortmund
Germany

O +49 231 972 1293
F +49 231 972 2587
E [hidden email]

Re: strange problem with query being dropped/ignored by the BIND process

Marc Richter
Hi again,

I have checked this again today.

Send and receive buffers are both 1 MB, the server has 8 CPUs, and during
startup BIND reports this:

        found 8 CPUs, using 8 worker threads
        using 7 UDP listeners per interface
        using up to 32768 sockets

We only have about 1,500 queries per second on this server. CPU (30%) and
memory (50%) usage are also not an issue here.

Now Oracle support is saying that the buffer sizes are fine and that we need
to "speed up the application" so that it reads the data from the receive
buffer faster and thus prevents packet drops.

Do you think that is a reasonable statement in this environment ?
What would be the best way to "speed up the application" ? Just increase
the worker threads ?
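
A rough back-of-the-envelope check can help judge that statement. The sketch below assumes an average of 100 bytes of buffer space per queued query, which is a guess and not a measured value, and it ignores the responses from upstream servers that a recursive server also has to receive. buffer_headroom_ms is a hypothetical helper name.

```shell
# buffer_headroom_ms BUF_BYTES QPS BYTES_PER_QUERY
# Print how many milliseconds it would take for incoming queries to fill
# the receive buffer if the application did not read from it at all.
buffer_headroom_ms() {
    buf_bytes=$1
    qps=$2
    bytes_per_query=$3   # assumption: average buffer space used per query
    echo $(( buf_bytes * 1000 / (qps * bytes_per_query) ))
}

# Usage: 1 MB buffer, 1500 queries per second, 100 bytes per query (assumed)
#   buffer_headroom_ms 1048576 1500 100
```

With these numbers the result is 6990, so the buffer would take roughly seven seconds to fill at the average rate. Under these assumptions, overflows might point to short bursts or stalls rather than to the average load.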

Regards
Marc



Re: strange problem with query being dropped/ignored by the BIND process

Dennis Clarke
On 06/29/2017 12:52 PM, Marc Richter wrote:

> Hi again,
>
> I have checked this again today.
>
> Send & receive buffers are both 1MB, the Server has 8 CPUs and during
> startup BIND is reporting this:
>
> found 8 CPUs, using 8 worker threads
> using 7 UDP listeners per interface
> using up to 32768 sockets
>
> We only have about 1.500 queries per second on this server. CPU(30%) and
> memory(50%) usage also is not an issue here.

Do you have any adjustments in /etc/system ?

I will assume you haven't messed with ip_forwarding, so let's just
look at your network stack config. You don't need to publish your
results to the mailing list, but have a look at:

# ndd -get /dev/ip \? | grep "read"
# ndd -get /dev/tcp \? | grep "read"

Here you have the full range of stack kernel tunables. At the very least
the ones you can read data from.

You probably already did this but create a quick script :

#!/bin/sh
# Print selected TCP tunables as "name = value" lines.
/usr/bin/printf "\n"

for param in tcp_wscale_always tcp_tstamp_if_wscale tcp_max_buf \
             tcp_cwnd_max tcp_xmit_hiwat tcp_recv_hiwat
do
    /usr/bin/printf "%s = " "$param"
    ndd -get /dev/tcp "$param"
done


Run that.


What I see here on three different Solaris 10 servers used for various purposes is:

M5 # /tmp/foo.sh

tcp_wscale_always = 1
tcp_tstamp_if_wscale = 1
tcp_max_buf = 1048576
tcp_cwnd_max = 1048576
tcp_xmit_hiwat = 49152
tcp_recv_hiwat = 49152


st0 # /tmp/foo.sh

tcp_wscale_always = 1
tcp_tstamp_if_wscale = 1
tcp_max_buf = 1048576
tcp_cwnd_max = 1048576
tcp_xmit_hiwat = 49152
tcp_recv_hiwat = 49152


st1 #

tcp_wscale_always = 1
tcp_tstamp_if_wscale = 1
tcp_max_buf = 16777216
tcp_cwnd_max = 8388608
tcp_xmit_hiwat = 65535
tcp_recv_hiwat = 65535


The first two are defaults whereas the last unit needs to sling around
terabytes daily.  I am curious what your system thinks it is doing
with its tcp/ip stack.

Since you are on contract ( me too .. aren't we all these days ) then I
have to assume you have reasonable kernel updates and tcp patches in
this Solaris server ?

Dennis





Re: strange problem with query being dropped/ignored by the BIND process

Marc Richter
Hi Dennis,

> Do you have any adjustments in /etc/system ?

No. And as mentioned before this is a Solaris 11 system, so /etc/system is
(mostly) irrelevant, as the IP settings are all done with ipadm now.

>
> # ndd -get /dev/ip \? | grep "read"
> # ndd -get /dev/tcp \? | grep "read"
>

That, as well as the script and examples you provided, won't help me a lot,
as I am looking at UDP receive buffer overflows, not TCP.

I have set udp_max_buf to 4MB now and udp_send_buf & udp_recv_buf to 2MB
each, then restarted BIND.
It seems to be working better now, as I no longer see as many receive
buffer overflows.
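
For reference, a change like this could be applied with ipadm roughly as follows. This is a sketch only: the function just prints the commands instead of running them, and the property names max_buf, send_buf and recv_buf are an assumption about this Solaris 11 release that should be checked against the output of `ipadm show-prop udp`. The name udp_buffer_commands is hypothetical.

```shell
# udp_buffer_commands MAX_MB BUF_MB
# Print (do not execute) ipadm commands that set the UDP buffer limit to
# MAX_MB megabytes and the default send/receive buffers to BUF_MB megabytes.
udp_buffer_commands() {
    max_bytes=$(( $1 * 1024 * 1024 ))
    buf_bytes=$(( $2 * 1024 * 1024 ))
    # Raise the ceiling first; the other two values are expected to be
    # limited by it. Property names are assumed, verify before use.
    echo "ipadm set-prop -p max_buf=${max_bytes} udp"
    echo "ipadm set-prop -p send_buf=${buf_bytes} udp"
    echo "ipadm set-prop -p recv_buf=${buf_bytes} udp"
}

# Usage: show the commands for a 4 MB limit and 2 MB buffers
#   udp_buffer_commands 4 2
```

Printing the commands first makes it easy to review them before running them as root and restarting BIND.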

However, the initial question still stands. How can I reconfigure BIND to
pick up the data faster from the receive buffer ?

> Since you are on contract ( me too .. arn't we all these days ) then I
> have to assume you have reasonable kernel updates and tcp patches in
> this Solaris server ?

Yes, of course.

Regards
Marc

Re: strange problem with query being dropped/ignored by the BIND process

Bob Harold

I tend to distrust "CPU(30%)" if it is averaged over more than one CPU. Could you run "top" and press "1" so that it shows each CPU separately? With 8 CPUs, "30%" could be one CPU at 100% and the others lower, where the one CPU at 100% is your bottleneck.

Just a guess at something to look at.

--
Bob Harold



Re: [E] Re: strange problem with query being dropped/ignored by the BIND process

Marc Richter
Hi Bob,

> I tend to distrust  "CPU(30%)" if it is averaged over more than one
> cpu.  Could you run "top" and hit the number "1" so that it shows each
> cpu separately?  With 8 cpu's, "30%" could be one cpu at 100% and others
> lower, where the one cpu at 100% is your bottleneck.

I checked that with mpstat earlier already and the load is evenly
distributed amongst all CPUs. None of the CPUs is overloaded.
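
A check like this can also be scripted. The following sketch assumes the standard mpstat layout with a header line starting with CPU and an idl column; busy_cpus and the 10% threshold are illustrative choices.

```shell
# busy_cpus [MIN_IDLE]: read mpstat output on stdin and print "cpu-id idle%"
# for every CPU whose idle percentage is below MIN_IDLE (default 10).
busy_cpus() {
    awk -v min_idle="${1:-10}" '
    # Header line: remember which column holds the idle percentage.
    $1 == "CPU" { for (i = 1; i <= NF; i++) if ($i == "idl") col = i; next }
    # Data line: report the CPU if its idle value is below the threshold.
    col && NF >= col && ($col + 0) < (min_idle + 0) { print $1, $col }
    '
}

# Usage: two 5-second samples; the first report covers the time since boot,
# so only the later lines reflect the current load.
#   mpstat 5 2 | busy_cpus 10
```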

Regards
Marc
