Tuning suggestions for high-core-count Linux servers

Tuning suggestions for high-core-count Linux servers

Browne, Stuart
Hi,

I've been able to get my hands on some rather nice servers with 2 x 12-core Intel CPUs and was wondering if anybody had any decent tuning tips to get BIND to respond at a faster rate.

I'm seeing that adding CPUs beyond a single die gives pretty much no real improvement. I understand the NUMA boundaries etc., but this hasn't been my experience on previous iterations of Intel CPUs, at least not this dramatically. When I use more than a single die, CPU utilization continues to match the core count; however, throughput doesn't increase to match.

All the testing I've done so far (dnsperf from multiple source machines) seems to plateau around 340k qps per BIND host.

Some notes:
- Primarily looking at UDP throughput here
- Intention is for high-throughput, authoritative only
- The zone files used for testing are fairly small and reside completely in-memory; no disk IO involved
- RHEL7, bind 9.10 series, iptables 'NOTRACK' firmly in place
- Current configure:

built by make with '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--program-prefix=' '--disable-dependency-tracking' '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include' '--libdir=/usr/lib64' '--libexecdir=/usr/libexec' '--sharedstatedir=/var/lib' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--localstatedir=/var' '--with-libtool' '--enable-threads' '--enable-ipv6' '--with-pic' '--enable-shared' '--disable-static' '--disable-openssl-version-check' '--with-tuning=large' '--with-libxml2' '--with-libjson' 'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 'CFLAGS= -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -fPIC' 'LDFLAGS=-Wl,-z,relro ' 'CPPFLAGS= -DDIG_SIGCHASE -fPIC'

Things tried:
- Using 'taskset' to bind to a single CPU die and limiting BIND to '-n' CPUs doesn't improve much beyond letting BIND make its own decision
- NIC interfaces are set for TOE
- rmem & wmem changes (beyond a point) seem to do little to improve performance; mainly they just make throughput more consistent

I've yet to investigate the switch throughput or tweaking (don't yet have access to it).

So, any thoughts?

Stuart
_______________________________________________
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list

bind-users mailing list
[hidden email]
https://lists.isc.org/mailman/listinfo/bind-users

RE: Tuning suggestions for high-core-count Linux servers

MURTARI, JOHN
Stuart,
        You didn't mention what OS you are using; I assume some version of Linux. What you are seeing may not be a BIND limit, but the OS. One thing we noted with Red Hat is that the kernel just couldn't keep up with the inbound UDP packets (queue overflow). The kernel does keep a count of dropped UDP packets; unfortunately, I can't recall the command we used to monitor it. Found this on Google: https://linux-tips.com/t/udp-packet-drops-and-packet-receive-error-difference/237
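For reference, those counters can be read like this (a sketch; `netstat -su` is the usual command, and the raw values live in /proc/net/snmp):

```shell
# Show kernel UDP drop counters. "InErrors" climbing under load means the
# kernel is discarding datagrams (commonly a full socket receive buffer;
# "RcvbufErrors" counts that case specifically on recent kernels).
if command -v netstat >/dev/null 2>&1; then
    netstat -su
fi

# The same counters straight from the kernel: pair the "Udp:" header row
# in /proc/net/snmp with the "Udp:" value row that follows it.
awk '$1 == "Udp:" { if (n++) { for (i = 2; i <= NF; i++) print h[i], $i }
                    else     { for (i = 2; i <= NF; i++) h[i] = $i } }' /proc/net/snmp
```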

        Perhaps the other folks have better details.
        Best regards!
John


Re: Tuning suggestions for high-core-count Linux servers

Reindl Harald

On 31.05.2017 at 14:42, MURTARI, JOHN wrote:
> Stuart, You didn't mention what OS you are using



Re: Tuning suggestions for high-core-count Linux servers

Mathew Ian Eis
In reply to this post by Browne, Stuart
360k qps is actually quite good… the best I have heard of until now on EL was 180k [1]. There, it was recommended to manually tune the number of subthreads with the -U parameter.

Since you’ve mentioned rmem/wmem changes, specifically you want to:

1. check for send buffer overflow, as indicated in the named logs:
31-Mar-2017 12:30:55.521 client: warning: client 10.0.0.5#51342 (test.com): error sending response: unset

fix: increase the send buffer sizes via sysctl:
net.core.wmem_max
net.core.wmem_default

2. check for receive buffer overflow, as indicated by netstat:
# netstat -u -s
Udp:
    34772479 packet receive errors

fix: increase the receive buffer sizes via sysctl:
net.core.rmem_max
net.core.rmem_default

… and other ideas:

3. check the 2nd column of /proc/net/softnet_stat for any non-zero numbers (packets dropped because the backlog queue was full).
If any are non-zero, increase net.core.netdev_max_backlog

4. You may also want to increase net.unix.max_dgram_qlen (although EL7 defaults this to 512, so it is not much of an issue - just double-check that it is 512).

5. Try running dropwatch to see where packets are being lost. If it shows nothing, you need to look outside the system; if it shows something, you have a hint where to tune next.
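Item 3 can be checked with a short one-liner; the softnet_stat values are zero-padded hex, and column 2 is the per-CPU count of packets dropped because the backlog queue was full:

```shell
# Flag any CPU whose softnet backlog-drop counter (column 2, hex) is non-zero.
awk '$2 != "00000000" { bad = 1; print "cpu", NR-1, "drops(hex):", $2 }
     END { if (!bad) print "no backlog drops" }' /proc/net/softnet_stat
```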

Please post your outcomes in any case, since you are already having some excellent results.

[1] https://lists.dns-oarc.net/pipermail/dns-operations/2014-April/011543.html

Regards,

Mathew Eis
Northern Arizona University
Information Technology Services


RE: Tuning suggestions for high-core-count Linux servers

Browne, Stuart
Cheers Mathew.

1)  Not seeing that error, seeing this one instead:

01-Jun-2017 01:46:27.952 client: warning: client 192.168.0.23#38125 (x41fe848-f3d1-4eec-967e-039d075ee864.perf1000): error sending response: would block

Only seeing a few of them per run (out of ~70 million requests).

Whilst I can see where this is raised in the BIND code (lib/isc/unix/socket.c in doio_send), I don't understand the underlying reason for it being set (errno == EWOULDBLOCK || errno == EAGAIN).

I've not bumped wmem/rmem up as much as the link suggests (only to 16MB, not 40MB), but no real difference after the tweaks. I did another run with stupidly large core.{rmem,wmem}_{max,default} (64MB); this actually degraded performance a bit, so over-tuning isn't good either. Need to figure out a good balance here.
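For reference, the 16MB settings above expressed as a sysctl.d fragment (illustrative values only, to be re-tested per workload; applied with `sysctl --system`):

```ini
# /etc/sysctl.d/90-dns-buffers.conf -- example values only (16 MB buffers)
net.core.rmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_max = 16777216
net.core.wmem_default = 16777216
# backlog / datagram queue knobs discussed elsewhere in the thread
# (10000 is an arbitrary starting point; 512 is the EL7 default)
net.core.netdev_max_backlog = 10000
net.unix.max_dgram_qlen = 512
```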

I'd love to figure out what the math here should be.  'X number of simultaneous connections multiplied by Y socket memory size = rmem' or some such.

2) I am still seeing some UDP receive errors and receive buffer errors; about 1.3% of received packets.

From a 'netstat' point of view, I see:

Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
udp   382976  17664 192.168.1.21:53         0.0.0.0:*

The numbers in the receive queue stay in the 200-300k range whilst the send-queue floats around the 20-40k range. wmem already bumped.

3) Huh, didn't know about this one. Bumped up the backlog; a small increase in throughput for my tests. Still need to figure out how to read softnet_stat. More google-fu in my future.

After a reboot and the wmem/rmem/backlog increases, no longer any non-zero in the 2nd column.

4) Yes, max_dgram_qlen is already set to 512.

5) Oo! new tool! :)

--
...
11 drops at location 0xffffffff815df171
854 drops at location 0xffffffff815e1c64
12 drops at location 0xffffffff815df171
822 drops at location 0xffffffff815e1c64
...
--

I'm pretty sure it's just showing more detail of the 'netstat -u -s' output. More google-fu to figure out how to use that information for good rather than, well... frustration? :)

Will keep spinning tests, but using smaller increments to the wmem/rmem values, to see if I can eke anything more than 360k out of it.

Thanks for your suggestions, Mathew!

Stuart



Re: Tuning suggestions for high-core-count Linux servers

Plhu

  Hello Stuart,
a few simple ideas for your tests:
 - have you inspected per-thread CPU usage? Are some of the threads overloaded?
 - have you tried to get statistics from the BIND server using the
 XML or JSON interface? It may give you another insight into the errors.
 - I may have missed the connection count you use for testing - can you
 post it? Also, how many entries do you have in your database? Can you
 share your named.conf (without any compromising entries)?
 - what is your network environment? How many switches/routers are there
 between your simulator and the BIND server host?
 - is BIND the only running process on the tested server?
 - what CPUs is the BIND server being run on?
 - is numad running, and when trying taskset, did you select CPUs on the
 same processor? What does numastat show during the test?
 - how many UDP sockets are in use during your test?
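For the XML/JSON statistics interface mentioned above, a minimal named.conf fragment (the port and ACL are illustrative; Stuart's build has --with-libxml2 and --with-libjson, so both formats should be available):

```conf
# named.conf: expose the statistics channel on loopback only
statistics-channels {
    inet 127.0.0.1 port 8053 allow { 127.0.0.1; };
};
```

After `rndc reconfig`, the counters should be retrievable with e.g. `curl http://127.0.0.1:8053/xml/v3` or `curl http://127.0.0.1:8053/json` on the 9.10 series; the socket and memory statistics there can corroborate the netstat numbers.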

Curious for the responses.

  Lukas


Re: Tuning suggestions for high-core-count Linux servers

Mathew Ian Eis
In reply to this post by Browne, Stuart
Howdy Stuart,

>  Re: net.core.rmem - I'd love to figure out what the math here should be. 'X number of simultaneous connections multiplied by Y socket memory size = rmem' or some such.

Basically the math here is “large enough that you can queue up the 9X.XXXth percentile of traffic bursts without dropping them, but not so large that you waste processing time fiddling with the queue”. Since that percentile varies widely across environments it’s not easy to provide a specific formula. And on that note:

> Will keep spinning test but using smaller increments to the wmem/rmem values

Tightening them is nice for finding theoretical limits, but in practice not so much. Be careful about making them too tight, lest under your “bursty” production loads you drop all sorts of queries without intending to.

> Re: dropwatch - Oo! new tool! More google-fu to figure out how to use that information for good

dropwatch is an easy indicator of whether the throughput issue is on or off the system. Seeing packets being dropped in the system combined with apparently low CPU usage suggests you might be able to increase throughput. `dropwatch -l kas` should tell you the methods that are dropping the packets, which can help you understand where in the kernel they are being dropped and why. For anything beyond that, I expect your Google-fu is as good as mine ;-)

If your CPU utilization is still apparently low, you might be onto something with taskset/numa… Related things I have toyed with but don’t currently have in production:

- increasing kernel.sched_migration_cost a couple of orders of magnitude
- setting kernel.sched_autogroup_enabled=0
- stopping irqbalance (systemctl stop irqbalance)
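Those knobs, sketched as a sysctl.d fragment (values are illustrative; on RHEL7's 3.10 kernel the migration-cost sysctl is spelled kernel.sched_migration_cost_ns and is in nanoseconds, default 500000):

```ini
# /etc/sysctl.d/91-sched-experiments.conf -- experimental values, not defaults
# roughly 100x the usual 0.5 ms default
kernel.sched_migration_cost_ns = 50000000
kernel.sched_autogroup_enabled = 0
```

irqbalance is then stopped separately with `systemctl stop irqbalance` (and re-enabled if it doesn't help).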

Lastly (mostly for posterity for the list, please don’t take this as “rtfm” if you’ve seen them already) here are some very useful in-depth (but generalized) performance tuning guides:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html-single/Performance_Tuning_Guide/
https://access.redhat.com/sites/default/files/attachments/201501-perf-brief-low-latency-tuning-rhel7-v1.1.pdf

… and for one last really crazy idea, you could try running a pair of named instances on the machine and fronting them with nginx’s supposedly scalable UDP load balancer. (As long as you don’t take a performance hit, it also opens up other interesting possibilities, like being able to shift production load off a named backend for maintenance.)
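A rough sketch of that nginx idea (UDP proxying needs nginx built with the stream module, 1.9.13 or later; the backend ports and two-instance layout are hypothetical):

```nginx
# Two named instances listening on 127.0.0.1:5301/5302;
# nginx fans incoming UDP/53 queries out to them.
stream {
    upstream named_backends {
        server 127.0.0.1:5301;
        server 127.0.0.1:5302;
    }
    server {
        listen 53 udp reuseport;
        proxy_pass named_backends;
        # DNS over UDP: expect exactly one response datagram per query
        proxy_responses 1;
        proxy_timeout 1s;
    }
}
```

Taking an instance out of the upstream block (and reloading nginx) would let you drain it for maintenance without dropping the service address.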

Best of luck! Let us know where you cap out!

Regards,

Mathew Eis
Northern Arizona University
Information Technology Services

-----Original Message-----
From: "Browne, Stuart" <[hidden email]>
Date: Thursday, June 1, 2017 at 12:27 AM
To: Mathew Ian Eis <[hidden email]>, "[hidden email]" <[hidden email]>
Subject: RE: Tuning suggestions for high-core-count Linux servers

    Cheers Matthew.
   
    1)  Not seeing that error, seeing this one instead:
   
    01-Jun-2017 01:46:27.952 client: warning: client 192.168.0.23#38125 (x41fe848-f3d1-4eec-967e-039d075ee864.perf1000): error sending response: would block
   
    Only seeing a few of them per run (out of ~70 million requests).
   
    Whilst I can see where this is raised in the BIND code (lib/isc/unix/socket.c in doio_send), I don't understand the underlying reason for it being set (errno == EWOULDBLOCK || errno == EAGAIN).
   
    I've not bumped wmem/rmem up as much as the link (only to 16MB, not 40MB), but no real difference after tweaks. I did another run with stupidly-large core.{rmem,wmem}_{max,default} (64MB), this actually degraded performance a bit so over tuning isn't good either. Need to figure out a good balance here.
   
    I'd love to figure out what the math here should be.  'X number of simultaneous connections multiplied by Y socket memory size = rmem' or some such.
   
    2) I am still seeing some udp receive errors and receive buffer errors; about 1.3% of received packets.
   
    From a 'netstat' point of view, I see:
   
    Active Internet connections (servers and established)
    Proto Recv-Q Send-Q Local Address           Foreign Address         State
    udp   382976  17664 192.168.1.21:53         0.0.0.0:*
   
    The numbers in the receive queue stay in the 200-300k range whilst the send-queue floats around the 20-40k range. wmem already bumped.
   
    3) Huh, didn't know about this one. Bumped up the backlog, small increase in throughput for my tests. Still need to figure out how to read sofnet_stat. More google-fu in my future.
   
    After a reboot and the wmem/rmem/backlog increases, no longer any non-zero in the 2nd column.
   
    4) Yes, max_dgram_qlen is already set to 512.
   
    5) Oo! new tool! :)
   
    --
    ...
    11 drops at location 0xffffffff815df171
    854 drops at location 0xffffffff815e1c64
    12 drops at location 0xffffffff815df171
    822 drops at location 0xffffffff815e1c64
    ...
    --
   
    I'm pretty sure it's just showing more details of the 'netstat -u -s'. More google-fu to figure out how to use that information for good rather than, well, .. frustration? .. :)
   
    Will keep spinning test but using smaller increments to the wmem/rmem values, see if I can eek anything more than 360k out of it.
   
    Thanks for your suggestions Matthew!
   
    Stuart
   
   
    -----Original Message-----
    From: Mathew Ian Eis [mailto:[hidden email]]
    Sent: Thursday, 1 June 2017 10:30 AM
    To: [hidden email]
    Cc: Browne, Stuart
    Subject: [EXTERNAL] Re: Tuning suggestions for high-core-count Linux servers
   
    360k qps is actually quite good… the best I have heard of until now on EL was 180k [1]. There, it was recommended to manually tune the number of subthreads with the -U parameter.
   
   
   
    Since you’ve mentioned rmem/wmem changes, specifically you want to:
   
   
   
    1. check for send buffer overflow; as indicated in named logs:
   
    31-Mar-2017 12:30:55.521 client: warning: client 10.0.0.5#51342 (test.com): error sending response: unset
   
   
   
    fix: increase rmem via sysctl:
   
    net.core.rmem_max
   
    net.core.rmem_default
   
   
   
    2. check for receive buffer overflow; as indicated by netstat:
   
    # netstat -u -s
   
    Udp:
   
        34772479 packet receive errors
   
   
   
    fix: increase wmem and backlog via sysctl:
   
    net.core.wmem_max
   
    net.core.wmem_default
   
   
   
    … and other ideas:
   
   
   
    3. check 2nd column in /proc/net/softnet_stat for any non-zero numbers (indicating dropped packets).
   
    If any are non-zero, increase net.core.netdev_max_backlog
   
   
   
    4. You may also want to want to increase net.unix.max_dgram_qlen (although since EL7 has default this to 512, this is not much of an issue - double check that it is 512).
   
   
   
    5. Try running dropwatch to see where packets are being lost. If it shows nothing then you need to look outside the system. If it shows something you may have a hint where to tune next.
   
   
   
    Please post your outcomes in any case, since you are already having some excellent results.
   
   
   
    [1] https://lists.dns-oarc.net/pipermail/dns-operations/2014-April/011543.html
   
   
    Regards,
   
   
   
    Mathew Eis
   
    Northern Arizona University
   
    Information Technology Services
   
   
   
    -----Original Message-----
   
    From: bind-users <[hidden email]> on behalf of "Browne, Stuart" <[hidden email]>
   
    Date: Wednesday, May 31, 2017 at 12:25 AM
   
    To: "[hidden email]" <[hidden email]>
   
    Subject: Tuning suggestions for high-core-count Linux servers
   
   
   
        Hi,
   
       
   
        I've been able to get my hands on some rather nice servers with 2 x 12 core Intel CPU's and was wondering if anybody had any decent tuning tips to get BIND to respond at a faster rate.
   
       
   
        I'm seeing that pretty much cpu count beyond a single die doesn't get any real improvement. I understand the NUMA boundaries etc., but this hasn't been my experience on previous iterations of the Intel CPU's, at least not this dramatically. When I use more than a single die, CPU utilization continues to match the core count however throughput doesn't increase to match.
   
       
   
        All the testing I've been doing for now (dnsperf from multiple sources for now) seems to be plateauing around 340k qps per BIND host.
   
       
   
        Some notes:
   
        - Primarily looking at UDP throughput here
   
        - Intention is for high-throughput, authoritative only
   
        - The zone files used for testing are fairly small and reside completely in-memory; no disk IO involved
   
        - RHEL7, bind 9.10 series, iptables 'NOTRACK' firmly in place
   
        - Current configure:
   
       
   
        built by make with '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--program-prefix=' '--disable-dependency-tracking' '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include' '--libdir=/usr/lib64' '--libexecdir=/usr/libexec' '--sharedstatedir=/var/lib' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--localstatedir=/var' '--with-libtool' '--enable-threads' '--enable-ipv6' '--with-pic' '--enable-shared' '--disable-static' '--disable-openssl-version-check' '--with-tuning=large' '--with-libxml2' '--with-libjson' 'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 'CFLAGS= -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -fPIC' 'LDFLAGS=-Wl,-z,relro ' 'CPPFLAGS= -DDIG_SIGCHASE -fPIC'
   
       
   
        Things tried:
   
        - Using 'taskset' to bind to a single CPU die and limiting BIND to '-n' CPUs doesn't improve much beyond letting BIND make its own decisions
   
        - NIC interfaces are set for TOE
   
        - rmem & wmem changes (beyond a point) seem to do little to improve performance, mainly just make throughput more consistent
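For reference, the rmem/wmem changes referred to are the usual net.core socket-buffer knobs; the values below are purely illustrative, not the ones actually tested (run as root):

```shell
# Larger socket buffers absorb traffic bursts; past a point they only
# smooth throughput rather than raise it, as noted above.
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.rmem_default=8388608
sysctl -w net.core.wmem_max=16777216
sysctl -w net.core.wmem_default=8388608
```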
   
       
   
        I've yet to investigate switch throughput or tweak it (I don't yet have access to the switch).
   
       
   
        So, any thoughts?
   
       
   
        Stuart
   
   
   
   
   
   

_______________________________________________
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list

bind-users mailing list
[hidden email]
https://lists.isc.org/mailman/listinfo/bind-users

RE: Tuning suggestions for high-core-count Linux servers

Browne, Stuart
<lots of stuff snipped out>

> -----Original Message-----
> From: Mathew Ian Eis [mailto:[hidden email]]
>
<snip>
>
> Basically the math here is “large enough that you can queue up the
> 9X.XXXth percentile of traffic bursts without dropping them, but not so
> large that you waste processing time fiddling with the queue”. Since that
> percentile varies widely across environments it’s not easy to provide a
> specific formula. And on that note:

Yup. Experimentation seems to be the name of the day.

> > Will keep spinning test but using smaller increments to the wmem/rmem
> > values
>
> Tightening is nice for finding some theoretical limits but in practice
> not so much. Be careful about making them too tight, lest under your
> “bursty” production loads you drop all sorts of queries without intending
> to.

Yup.

> dropwatch is an easy indicator of whether the throughput issue is on or
> off the system. Seeing packets being dropped in the system combined with
> apparently low CPU usage suggests you might be able to increase
> throughput. `dropwatch -l kas` should tell you the methods that are
> dropping the packets, which can help you understand where in the kernel
> they are being dropped and why. For anything beyond that, I expect your
> Google-fu is as good as mine ;-)

I like the '-l kas' output:

830 drops at udp_queue_rcv_skb+374 (0xffffffff815e1c64)
 15 drops at __udp_queue_rcv_skb+91 (0xffffffff815df171)

Well and truly buried in the code.

https://blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data/#udpqueuercvskb

This seems like a nice explanation as to what's going on. Still reading through it all.
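A quick way to cross-check those kernel-side drops without dropwatch (a sketch, not from the original runs): /proc/net/snmp keeps per-protocol counters, and "InErrors"/"RcvbufErrors" rise when udp_queue_rcv_skb drops a datagram because a socket receive buffer is full.

```shell
# Pull a named UDP counter out of /proc/net/snmp. The first "Udp:" line
# holds the column headers, the second holds the values.
udp_counter() {   # usage: udp_counter <name> < /proc/net/snmp
    awk -v want="$1" '/^Udp:/ {
        if (!seen) { split($0, h); seen = 1; next }   # header line
        for (i = 2; i <= NF; i++)
            if (h[i] == want) print $i                # matching value
    }'
}

if [ -r /proc/net/snmp ]; then
    echo "InErrors:     $(udp_counter InErrors     < /proc/net/snmp)"
    echo "RcvbufErrors: $(udp_counter RcvbufErrors < /proc/net/snmp)"
fi
```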


> If your CPU utilization is still apparently low, you might be onto
> something with taskset/numa… Related things I have toyed with but don’t
> currently have in production:
>
> increasing kernel.sched_migration_cost a couple of orders of magnitude
>
> setting kernel.sched_autogroup_enabled=0
>
> systemctl stop irqbalance

I've had irqbalance stopped previously, and sched_autogroup_enabled is already set to 0. Initial mucking about a bit with sched_migration_cost gets a few more QPS through, so will run more tests.

Thanks for this one, hadn't used it before.

> Lastly (mostly for posterity for the list, please don’t take this as
> “rtfm” if you’ve seen them already) here are some very useful in-depth
> (but generalized) performance tuning guides:

Will give them a read. I do like manuals :P

<urls stripped due to stupid spam filtering corrupting their readability>

> … and for one last really crazy idea, you could try running a pair of
> named instances on the machine and fronting them with nginx’s supposedly
> scalable UDP load balancer. (As long as you don’t get a performance hit,
> it also opens up other interesting possibilities like being able to shift
> production load for maintenance on the named backends).

Yeah, I've had this thought.

I'm fairly sure I've reached the limit of what BIND can do in a single NUMA node for the moment.

I will report back if any great inspiration or successful increases in throughput occur.

Stuart

RE: [EXTERNAL] Re: Tuning suggestions for high-core-count Linux servers

Browne, Stuart
In reply to this post by Plhu


> -----Original Message-----
> From: Plhu [mailto:[hidden email]]


> a few simple ideas to your tests:
>  - have you inspected the per-thread CPU? Aren't some of the threads
> overloaded?

I've tested both the auto-calculated values (one thread per available core) and explicitly overridden this. NUMA boundaries seem to be where things get wonky.

>  - have you tried to get the statistics from the Bind server using the
>  XML or JSON interface? It may bring you another insight to the errors.


>  - I may have missed the connection count you use for testing - can you
>  post it? More, how may entries do you have in your database? Can you
>  share your named.conf (without any compromising entries)?

I'm testing to flood: approximately 5 test instances x 400 clients each (dnsperf), with a 500-query backlog per test instance.

Theoretically this should mean up to 4,500 active or back-logged connections (or just 2,500 if I read that documentation wrong).

>  - what is your network environment? How many switches/routers are there
>  between your simulator and the Bind server host?

This is a very closed environment. Server-Switch-Server, all 10Gbit or 25Gbit. Verified the switch stats today; it's capable of 10x what I'm currently pushing through it.

>  - is Bind the only running process on the tested server?

As always, there's the rest of the OS helper stuff, but BIND is the only thing actively doing anything (beyond the monitoring I'm doing). So no, nothing else is drawing massive amounts of either CPU or network resources.

>  - what CPUs is the Bind server being run on?

From /proc/cpuinfo:
        Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz

2 of them.


>  - is there numad running and while trying the taskset, have you
>  selected the CPUs on the same processor? What does numastat show during
>  the test?

I was manually issuing taskset after confirming the CPU allocations:

taskset -c 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47 /usr/sbin/named -u named -n 24 -f

This is all of the cores (including HT siblings) on the 2nd socket. There was almost no performance difference between 12 (just the physical cores, no HT) and 24 (with the HT siblings).
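As an aside, `taskset` invoked without `-c` expects a hexadecimal CPU bitmask rather than a core list; a tiny hypothetical helper (POSIX sh, not part of the setup above) for converting between the two:

```shell
# Convert a comma-separated core list into the hex affinity bitmask
# that `taskset <mask>` expects.
cores_to_mask() {
    mask=0
    saved_ifs=$IFS; IFS=,
    for core in $1; do
        mask=$((mask | (1 << core)))   # set the bit for each core
    done
    IFS=$saved_ifs
    printf '0x%x\n' "$mask"
}

cores_to_mask 1,3,5,7    # -> 0xaa
```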

>  - how many UDP sockets are in use during your test?

See above.

>
> Curious for the responses.
>
>   Lukas

Stuart

Re: Tuning suggestions for high-core-count Linux servers

Browne, Stuart
In reply to this post by Browne, Stuart
Just some interesting investigation results. One of the URLs Mathew Ian Eis linked to talked about using a tool called 'perf'. For the hell of it, I gave it a shot.

Sure enough it tells some very interesting things.

When BIND was restricted to using a single NUMA node, the biggest call (to _raw_spin_lock) showed 7.05% overhead.

When BIND was allowed to use both NUMA nodes, the same call showed 49.74% overhead; an astonishing difference.

As it was running unrestricted, memory from both nodes was in use:

[root@kr20s2601 ~]# numastat -p 22441

Per-node process memory usage (in MBs) for PID 22441 (named)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.45            0.12            0.57
Stack                        0.71            0.64            1.35
Private                      5.28         9415.30         9420.57
----------------  --------------- --------------- ---------------
Total                        6.43         9416.07         9422.50

Given the numbers here, you wouldn't think it should make much of a difference.

Sadly, I didn't record which CPU the UDP listener was attached to.
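The profiles below came from perf; an invocation along these lines (the exact options are an assumption, not quoted from the original runs; needs root, run alongside the load test) produces that kind of data:

```shell
# Sample system-wide for the duration of a test run, with call graphs,
# then print the hottest symbols.
perf record -a -g -o perf.data.24 -- sleep 180
perf report -i perf.data.24 --stdio | head -30
```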

Anyway, what I've changed so far:

    vm.swappiness = 0
    vm.dirty_ratio = 1
    vm.dirty_background_ratio = 1
    kernel.sched_min_granularity_ns = 10000000
    kernel.sched_migration_cost_ns = 5000000

Query rate thus far reached (on 24 cores, numa node restricted): 426k qps
Query rate thus far reached (on 48 cores, numa nodes unrestricted): 321k qps

Stuart

'perf' data collected during a 3 minute test run:

[root@kr20s2601 ~]# ls -al perf.data*
-rw-------. 1 root root  717350012 Jun  2 08:36 perf.data.24
-rw-------. 1 root root 1366620296 Jun  2 08:53 perf.data.48

'perf' top 5 (24 cores, numa restricted):

Overhead  Command  Shared Object         Symbol
   7.05%  named    [kernel.kallsyms]     [k] _raw_spin_lock
   6.96%  named    libpthread-2.17.so    [.] pthread_mutex_lock
   3.84%  named    libc-2.17.so          [.] vfprintf
   2.36%  named    libdns.so.165.0.7     [.] dns_name_fullcompare
   2.02%  named    libisc.so.160.1.2     [.] isc_log_wouldlog

'perf' top 5 (48 cores):

Overhead  Command  Shared Object         Symbol
  49.74%  named    [kernel.kallsyms]     [k] _raw_spin_lock
   4.52%  named    libpthread-2.17.so    [.] pthread_mutex_lock
   3.09%  named    libisc.so.160.1.2     [.] isc_log_wouldlog
   1.84%  named    [kernel.kallsyms]     [k] _raw_spin_lock_bh
   1.56%  named    libc-2.17.so          [.] vfprintf

Re: Tuning suggestions for high-core-count Linux servers

Phil Mayers
On 02/06/17 08:12, Browne, Stuart wrote:
> Just some interesting investigation results. One of the URL's Matthew
> Ian Eis linked to talked about using a tool called 'perf'. For the
> hell of it, I gave it a shot.

perf is super-powerful.

On a sufficiently recent kernel you can also do interesting things with
the enhanced eBPF-based tracing - see:

http://www.brendangregg.com/ebpf.html

...but those are not going to be usable on a RH7 kernel I believe :o(

Re: Tuning suggestions for high-core-count Linux servers

Ray Bellis
In reply to this post by Browne, Stuart
On 02/06/2017 08:12, Browne, Stuart wrote:

> Query rate thus far reached (on 24 cores, numa node restricted): 426k qps
> Query rate thus far reached (on 48 cores, numa nodes unrestricted): 321k qps

In our internal Performance Lab I've achieved nearly 900 kqps on small
authoritative zones when we had hyperthreading enabled, and 700 kqps
without.

The lab uses Dell R430s running Fedora Core 23 with Intel X710 10GB NICs
and each populated with a single Xeon E5-2680 v3 2.5 GHz 12-core CPU.

These systems have had *negligible* tuning applied - the vast majority
of the system settings changes I've made have been to improve the
repeatability of results, not the absolute performance.

The only major setting I've found which both helps performance and
improves consistency is to ensure that each NIC rx/tx queue IRQ is
assigned to a specific CPU core, with irqbalance disabled.

This is with a _single_ dnsperf client, too.  The settings I use are
-c24 -q82 -T6 -x2048.   However I do use a tweaked version of dnsperf
which assigns each thread pair (it uses separate threads for rx and tx)
to its own core.

You may find the presentation I made at the recent DNS-OARC workshop of
interest:

<https://indico.dns-oarc.net/event/26/session/3/contribution/18>

You didn't mention precisely which 9.10 series version you're running.
Note that versions prior to 9.10.4 defaulted to a -U value of ncores/2,
but investigation showed that on modern systems this was sub-optimal so
it was changed to ncores-1.  This makes a *very* big difference.

kind regards,

Ray Bellis
ISC Research Fellow

Re: Tuning suggestions for high-core-count Linux servers

Ray Bellis
In reply to this post by Mathew Ian Eis
On 01/06/2017 23:26, Mathew Ian Eis wrote:

> … and for one last really crazy idea, you could try running a pair of
> named instances on the machine and fronting them with nginx’s
> supposedly scalable UDP load balancer. (As long as you don’t get a
> performance hit, it also opens up other interesting possibilities
> like being able to shift production load for maintenance on the named
> backends).

It's relatively trivial to patch the BIND source to enable SO_REUSEPORT
on the more recent Linux kernels that support it (3.8+, ISTR?) so that
you can just start two BIND instances listening on the exact same ports
and the kernel will do the load balancing for you.

For a NUMA system, make sure each instance is locked to one die, but
beware of NUMA bus transfers caused by incoming packet buffers being
handled by a kernel task running on one die but then delivered to a BIND
instance running on another.
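One way to express that per-die locking (a sketch: the per-instance config paths are hypothetical, and it assumes a SO_REUSEPORT-patched named so both instances can bind the same address and port):

```shell
# One named instance per NUMA node, CPU- and memory-bound to its die.
numactl --cpunodebind=0 --membind=0 /usr/sbin/named -u named -n 12 -c /etc/named-node0.conf
numactl --cpunodebind=1 --membind=1 /usr/sbin/named -u named -n 12 -c /etc/named-node1.conf
```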

In the meantime we're also looking at SO_REUSEPORT even for single
instance installations because it appears to offer an advantage over
letting multiple threads all fight over one shared file descriptor.

Ray

Re: Tuning suggestions for high-core-count Linux servers

Paul Kosinski-2
In reply to this post by Browne, Stuart
It's been some years now, but I had worked on developing code for a high
throughput network server (not BIND). We found that on multi-socketed
NUMA machines we could have similar contention problems, and it was
quite important to make sure that threads which needed access to the
same memory areas weren't split across sockets. Luckily, the various
services being run were sufficiently separate that we could assign the
service processes to different sockets and avoid a lot of contention.

With BIND, it's basically all one service, so this is not directly
possible.

It might be possible, however, to run two (or more) *separate*
instances of BIND and do some strictly internal routing of the IP
traffic to those separate instances, or even to have separate NICs
feeding the separate processes. In other words, have several BIND
servers in one chassis, each with its own NUMA memory area.



On Fri, 2 Jun 2017 07:12:09 +0000
"Browne, Stuart" <[hidden email]> wrote:

> <full quoted message snipped>

RE: [EXTERNAL] Re: Tuning suggestions for high-core-count Linux servers

Browne, Stuart
In reply to this post by Ray Bellis
Ugh, let me try that again (apologies if you got the half-composed version).

<lots of snip>

> The lab uses Dell R430s running Fedora Core 23 with Intel X710 10GB NICs
> and each populated with a single Xeon E5-2680 v3 2.5 GHz 12-core CPU.

R630 chassis I believe, same NICs, smaller processor (E5-2650 v4 @ 2.2GHz).

<snip>

> The only major setting I've found which both helps performance and
> improves consistency is to ensure that each NIC rx/tx queue IRQ is
> assigned to a specific CPU core, with irqbalance disabled.

I've been stopping irqbalance, and have confirmed that the rx/tx queue IRQ's aren't jumping around.

> This is with a _single_ dnsperf client, too.  The settings I use are
> -c24 -q82 -T6 -x2048.   However I do use a tweaked version of dnsperf
> which assigns each thread pair (it uses separate threads for rx and tx)
> to its own core.

I didn't think of using -T. *tries that* ..

> You may find the presentation I made at the recent DNS-OARC workshop of
> interest:
>
> https://indico.dns-oarc.net/event/26/session/3/contribution/18

Reading it now. Many thanks.
 
> You didn't mention precisely which 9.10 series version you're running.
> Note that versions prior to 9.10.4 defaulted to a -U value of ncores/2,
> but investigation showed that on modern systems this was sub-optimal so
> it was changed to ncores-1.  This makes a *very* big difference.

BIND 9.10.4-P8.

> kind regards,
>
> Ray Bellis
> ISC Research Fellow

Stuart

RE: Tuning suggestions for high-core-count Linux servers

Browne, Stuart
In reply to this post by Browne, Stuart
So, a different tack today: monitoring '/proc/net/softnet_stat' to try to reduce potential errors on the interface.

End result: 517k qps.

Final changes for the day:
sysctl -w net.core.netdev_max_backlog=32768
sysctl -w net.core.netdev_budget=2700
/root/nic_balance.sh em1 0 2

netdev_max_backlog:

An increase in the 2nd column of /proc/net/softnet_stat indicates this value needs raising. The default starts at a reasonable amount; however, even 500k qps pushes the limits of this buffer when pinning IRQs to cores. Doubled it.

netdev_budget:

An increase in the 3rd column of /proc/net/softnet_stat indicates this value needs raising. The default is quite low (300) and is easily blown through, especially if all of the NIC IRQs are pinned to a single CPU core. Tried various values until the increase was small (at 2700).

As the best numbers have been with 2 cores, however, this value could probably be lowered. It seemed stable at 2700, so I didn't re-test at lower values.
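The two softnet_stat columns mentioned above can be watched with a few lines of sh (a sketch; the file's values are per-CPU and hex-encoded):

```shell
# Sum the per-CPU "dropped" (2nd) and "time_squeeze" (3rd) columns of
# /proc/net/softnet_stat. A rising column 2 points at
# net.core.netdev_max_backlog; a rising column 3 at net.core.netdev_budget.
sum_softnet() {   # usage: sum_softnet < /proc/net/softnet_stat
    dropped=0 squeezed=0
    while read -r _ d s _; do
        dropped=$((dropped + 0x$d))
        squeezed=$((squeezed + 0x$s))
    done
    echo "dropped=$dropped time_squeeze=$squeezed"
}

if [ -r /proc/net/softnet_stat ]; then
    sum_softnet < /proc/net/softnet_stat
fi
```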

'/root/nic_balance.sh em1 0 2':
(Custom Script based off of RH 20150325_network_performance_tuning.pdf)

Pin all the IRQ's for the 'em1' NIC to the first 2 CPU cores of the local NUMA node.

This had the most noticeable effect. By default, the 'irqbalance' service and the system in general will create numerous rx/tx queues for the NIC, each with its own soft interrupt. When these are spread across multiple NUMA nodes, each ingress packet gets delayed as it is switched over to the NUMA node where the rest of the process lives.

At low throughput, this isn't a concern. At high throughput, this becomes quite noticeable; roughly 100k qps difference.

I tried various levels of tuning (spread across 12 cores, spread across 8, 4 and pinned to a single core), finding 2 cores the best on the bare-metal node.
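A hypothetical sketch of what a script like '/root/nic_balance.sh em1 0 2' might do (the real script isn't shown in this thread; this follows the RH tuning guide's approach, and needs root with irqbalance stopped):

```shell
#!/bin/sh
# Usage: nic_balance.sh <iface> <first_core> <core_count>
# Round-robin every IRQ belonging to <iface> across <core_count> cores
# starting at <first_core>.
iface=$1 first=$2 count=$3
i=0
for irq in $(awk -F: -v nic="$iface" '$0 ~ nic { gsub(/ /, "", $1); print $1 }' /proc/interrupts); do
    echo $((first + i % count)) > "/proc/irq/$irq/smp_affinity_list"
    i=$((i + 1))
done
```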

...

Whilst 'softnet_stat' didn't show any dropped packets (2nd column), 'netstat -s -u' still shows 'packet receive errors'. I'm still uncertain how the two counters differ and how to fix the errors netstat reports.

Stuart