Trying again on SERVFAIL

Trying again on SERVFAIL

Alessandro Vesely
Hi,

is there a way to know that a query has already been tried a few minutes ago,
and failed?

It seldom happens, but sometimes the DKIM mail filter gets a SERVFAIL when it
tries to authenticate an incoming message.  SERVFAIL occurs when a DNSSEC check
fails.  Trying again is useless, it has to be treated as a permanent error.

Any idea about how to tell a really temporary error?

Best
Ale
_______________________________________________
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list

ISC funds the development of this software with paid support subscriptions. Contact us at https://www.isc.org/contact/ for more information.


bind-users mailing list
[hidden email]
https://lists.isc.org/mailman/listinfo/bind-users

Re: Trying again on SERVFAIL

Bind-Users forum mailing list
> is there a way to know that a query has already been tried a few
> minutes ago, and failed?

From whose perspective?

A well-behaved application could remember it asked the same query
a short while ago, of course, but that's up to the application.
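A minimal sketch of what "remembering" could look like on the application
side, assuming a plain in-memory table is acceptable.  The class name
`FailureMemo`, its methods, and the 300-second hold time are illustrative
assumptions, not part of any DKIM filter or resolver library.

```python
import time


class FailureMemo:
    """Remember recent lookup failures for a bounded time (hypothetical helper)."""

    def __init__(self, hold_seconds=300.0, clock=time.monotonic):
        self.hold_seconds = hold_seconds  # assumed hold time, tune to taste
        self.clock = clock                # injectable so the logic can be tested
        self._failed = {}                 # (qname, qtype) -> time of last failure

    def record_failure(self, qname, qtype):
        # DNS names compare case-insensitively, so normalise the key.
        self._failed[(qname.lower(), qtype)] = self.clock()

    def failed_recently(self, qname, qtype):
        key = (qname.lower(), qtype)
        stamp = self._failed.get(key)
        if stamp is None:
            return False
        if self.clock() - stamp > self.hold_seconds:
            del self._failed[key]         # entry expired, forget it
            return False
        return True
```

The filter would call `record_failure()` whenever a key lookup ends in
SERVFAIL and consult `failed_recently()` before deciding how to treat the
next failure for the same name.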

Or is the perspective that of a recursive resolver?  As far as I
remember, BIND used as a recursive resolver will "cache" this
knowledge, but I'm not entirely certain for how long, since it
can't use the method from an NXDOMAIN reply which includes the
SOA record (and uses the re-purposed "minimum" field for the TTL
for the negative cache entry).
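For reference, the NXDOMAIN method mentioned above can be sketched as
follows.  RFC 2308 takes the negative-cache TTL as the smaller of the SOA
record's own TTL and its MINIMUM field; the `local_cap` parameter is an
assumption standing in for whatever upper limit a resolver applies locally.
A SERVFAIL reply carries no SOA, so none of this is available to it.

```python
def negative_cache_ttl(soa_ttl, soa_minimum, local_cap=None):
    """Return how long a negative answer may be cached, in seconds.

    soa_ttl     -- TTL of the SOA record found in the authority section
    soa_minimum -- the SOA MINIMUM field, re-purposed as the negative TTL
    local_cap   -- optional resolver-side upper bound (hypothetical parameter)
    """
    ttl = min(soa_ttl, soa_minimum)
    if local_cap is not None:
        ttl = min(ttl, local_cap)
    return ttl
```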

> It seldom happens, but sometimes the DKIM mail filter gets a
> SERVFAIL when it tries to authenticate an incoming message.
> SERVFAIL occurs when a DNSSEC check fails.

...or when none of the name servers for the containing zone
responds with an answer.  I.e. it's not *just* DNSSEC failure
which can trigger SERVFAIL.

> Trying again is useless, it has to be treated as a permanent
> error.

Well, now...  Basically nothing in the DNS is permanent, because
it is not completely static; hence most information in the DNS
has a TTL attached to it.  So the question then becomes how an
application, say a mail server should treat SERVFAIL.  It may
very well be that the "maximum retry time" of the mail server is
far longer than any of the TTLs for the pieces of DNS data that
you could not look up, so it may be appropriate to treat SERVFAIL
as a signal to "re-queue the message and try again in 30
minutes", so in essence converting SERVFAIL into a "temporary
failure" in the context of the mail server.

SERVFAIL doesn't mean that the domain name you tried to look up
currently doesn't exist in the DNS, you just can't know one way
or the other.
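The mapping described above could be sketched like this.  It is only an
illustration: the `Lookup` enum and the three disposition strings are
made-up names, and a real filter would have more cases to consider.

```python
from enum import Enum


class Lookup(Enum):
    """Possible outcomes of the DKIM key lookup (illustrative set)."""
    ANSWER = "answer"
    NXDOMAIN = "nxdomain"
    SERVFAIL = "servfail"
    TIMEOUT = "timeout"


def smtp_disposition(outcome):
    """Map a lookup outcome to a hypothetical mail-server decision."""
    if outcome is Lookup.ANSWER:
        return "verify"    # key retrieved, go on and check the signature
    if outcome is Lookup.NXDOMAIN:
        return "permfail"  # the name does not exist right now
    # SERVFAIL or timeout: we cannot know one way or the other, so answer
    # with a 4xx code and let the sending MTA re-queue and retry later.
    return "tempfail"
```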

> Any idea about how to tell a really temporary error?

You again have to specify the context.

Regards,

- Håvard

Re: Trying again on SERVFAIL

Alessandro Vesely
Hi Havard,

thanks for your reply.

On Tue 09/Feb/2021 18:15:43 +0100 Havard Eidnes wrote:
>> is there a way to know that a query has already been tried a few
>> minutes ago, and failed?
>
>  From whose perspective?
>
> A well-behaved application could remember it asked the same query
> a short while ago, of course, but that's up to the application.


For an application, caching queries feels like stealing the resolver's job.


> Or is the perspective that of a recursive resolver?  As far as I
> remember, BIND used as a recursive resolver will "cache" this
> knowledge, but I'm not entirely certain for how long, since it
> can't use the method from an NXDOMAIN reply which includes the
> SOA record (and uses the re-purposed "minimum" field for the TTL
> for the negative cache entry).


I too recall that NXDOMAIN can be cached for a while.  I'd guess some kinds of
failures are also cached.


>> It seldom happens, but sometimes the DKIM mail filter gets a
>> SERVFAIL when it tries to authenticate an incoming message.
>> SERVFAIL occurs when a DNSSEC check fails.
>
> ...or when none of the name servers for the containing zone
> responds with an answer.  I.e. it's not *just* DNSSEC failure
> which can trigger SERVFAIL.


Yes, of course.  Yet, however sporadic, DNSSEC failure seems to be the most
frequent case.


>> Trying again is useless, it has to be treated as a permanent
>> error.
>
> Well, now...  Basically nothing in the DNS is permanent, because
> it is not completely static; hence most information in the DNS
> has a TTL attached to it.  So the question then becomes how an
> application, say a mail server should treat SERVFAIL.  It may
> very well be that the "maximum retry time" of the mail server is
> far longer than any of the TTLs for the pieces of DNS data that
> you could not look up, so it may be appropriate to treat SERVFAIL
> as a signal to "re-queue the message and try again in 30
> minutes", so in essence converting SERVFAIL into a "temporary
> failure" in the context of the mail server.


That's what I've been doing.  For an incoming message, a temporary failure
means replying with a 4xx code.  The sender keeps the message in its queue, and
eventually gives up.  Once upon a time, MTAs used to retry sending for five
days.  Nowadays, several servers don't let queued messages grow older than one day.

In the most severe case, a failed DKIM signature might entail a reject.  So the
best course of action seems to be to reserve temporary failures for this case.

Still, being able to differentiate a local network congestion from a remote bad
configuration would help.
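One heuristic sometimes used for this, not mentioned elsewhere in the thread,
is to repeat the failing query with the CD (checking disabled) bit set, as
`dig +cdflag` does.  If the plain query returns SERVFAIL but the CD query
succeeds, validation is the likely culprit; if both fail, the problem is more
likely reachability.  The sketch below only encodes that decision; the
function name and the returned labels are assumptions, and the two RCODEs
have to be obtained by whatever resolver library is at hand.

```python
SERVFAIL = 2  # DNS RCODE value for "server failure"


def classify_servfail(rcode_validating, rcode_checking_disabled):
    """Guess why a lookup failed from two RCODEs (heuristic, not proof).

    rcode_validating        -- RCODE of the normal query
    rcode_checking_disabled -- RCODE of the same query sent with CD=1
    """
    if rcode_validating != SERVFAIL:
        return "no-failure"
    if rcode_checking_disabled == SERVFAIL:
        # Fails even without validation: unreachable or broken servers,
        # or trouble on the local network.
        return "resolution-failure"
    # Works only when validation is switched off: probably DNSSEC.
    return "validation-failure"
```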


Best
Ale

Re: Trying again on SERVFAIL

J Doe
On 2021-02-10 3:05 a.m., Alessandro Vesely wrote:
> Hi Havard,
>

<snip>

>
> That's what I've been doing.  For an incoming message, a temporary
> failure means replying with a 4xx code.  The sender keeps the message in
> its queue, and eventually gives up.  Once upon a time, MTAs used to retry
> sending for five days.  Nowadays, several servers don't let queued
> messages grow older than one day.
>
> In the most severe case, a failed DKIM signature might entail a reject.
> So the best course of action seems to be to reserve temporary failures
> for this case.
>
> Still, being able to differentiate a local network congestion from a
> remote bad configuration would help.
>
>
> Best
> Ale

Hi Ale and list,

This isn't an answer to your original question, but I was curious about
something you mentioned near the end of your message, where you wrote:
"Once upon a time . . . Nowadays, several servers don't let queued
messages grow older than one day".

Out of curiosity, what servers have you encountered that no longer use
the five-day cutoff?

Thanks,

- J

Re: Trying again on SERVFAIL

Bind-Users forum mailing list
In reply to this post by Alessandro Vesely
> Still, being able to differentiate a local network congestion from a
> remote bad configuration would help.

That's true.  There's

  https://tools.ietf.org/html/draft-ietf-dnsop-extended-error-16

which looks promising, trying to make it possible to distinguish
between the various reasons a recursor might choose to return a
SERVFAIL response.  It uses an EDNS option to communicate the
additional information.

As for its implementation status in general or in BIND in
particular I'll admit that I don't know off-hand.
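The draft has since been published as RFC 8914.  As a rough sketch of what a
client would have to do with it: the option uses EDNS option code 15, and its
payload is a 16-bit info code followed by optional UTF-8 text.  The helper
name `parse_ede` and the short table of codes are illustrative (the table is
a small excerpt, not the full registry), and extracting the option from the
OPT record is left to the DNS library in use.

```python
import struct

EDE_OPTION_CODE = 15  # EDNS option code assigned to Extended DNS Errors

# A few info codes relevant to telling SERVFAIL causes apart (excerpt only).
EDE_INFO_CODES = {
    6: "DNSSEC Bogus",
    7: "Signature Expired",
    9: "DNSKEY Missing",
    22: "No Reachable Authority",
    23: "Network Error",
}


def parse_ede(option_data):
    """Decode the payload of one Extended DNS Error option.

    Returns (info_code, description, extra_text).
    """
    if len(option_data) < 2:
        raise ValueError("EDE option needs at least a 2-byte info code")
    (info_code,) = struct.unpack("!H", option_data[:2])
    extra_text = option_data[2:].decode("utf-8", errors="replace")
    return info_code, EDE_INFO_CODES.get(info_code, "unknown"), extra_text
```

With something like this, a filter could treat "No Reachable Authority" or
"Network Error" as worth retrying and "DNSSEC Bogus" as a remote
configuration problem, provided the resolver actually sends the option.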

Regards,

- Håvard

Re: Trying again on SERVFAIL

Alessandro Vesely
On Thu 11/Feb/2021 10:44:58 +0100 Havard Eidnes wrote:

>> Still, being able to differentiate a local network congestion from a
>> remote bad configuration would help.
>
> That's true.  There's
>
>    https://tools.ietf.org/html/draft-ietf-dnsop-extended-error-16
>
> which looks promising, trying to make it possible to distinguish
> between the various reasons a recursor might choose to return a
> SERVFAIL response.  It uses an EDNS option to communicate the
> additional information.


Commendable effort!


> As for its implementation status in general or in BIND in
> particular I'll admit that I don't know off-hand.


Yeah, by the time it lands on Debian's glibc we'll have grown a long long
beard.  I'm still missing RES_TRUSTAD...


Best
Ale

Re: Trying again on SERVFAIL

Alessandro Vesely
In reply to this post by J Doe
On Wed 10/Feb/2021 22:38:05 +0100 J Doe wrote:
>
> Out of curiosity, what servers have you encountered that no longer use the
> five-day cutoff?


I didn't take note, but I read discussions on the topic.  Users expect mail to
be delivered almost instantly.  The "warning, still trying" message should
arrive somewhere between submission and the final bounce.  In many people's
experience, a warning that only arrives the next day is unacceptably late.
Once you reduce the warning delay to a few hours, the total maximum queue
lifetime cannot reasonably remain five days.

On my server, although I keep the default of five days, I cut the queue time
for specific messages, such as complaints or DMARC reports, to ten hours.

Quoting from the web:

     Queue lifetimes over a day is just Cargo Cult system administration, and a
     holdover from when the internet was much less "always on".
     https://serverfault.com/questions/735269/is-it-a-good-idea-to-reduce-the-give-up-time-for-e-mail-delivery#answer-826351


Best
Ale

Re: Trying again on SERVFAIL

Mark Andrews
Machines still fall over. They take the same amount of time to fix now as they did 30 years ago.

You still have to diagnose the fault. You still have to get the replacement part. You still have to potentially restore from backups. Sometimes you can switch to a standby machine which makes things faster.

I’ve seen day-long outages in the last 7 days. They still happen. Personally I was happy the emails queued.
--
Mark Andrews


Re: Trying again on SERVFAIL

Ondřej Surý
Mark is right. The internet isn’t always on and it isn’t only composed of big tech companies with lots of resources.

The internet consists of a lot of small systems made by people like you and me and we don’t have infinite resources to keep everything always on.

And honestly I find your quote about Cargo Cult very offensive to all those normal people maintaining the rest of the internet infrastructure that isn’t the current <n>-umvirate.

Ondrej
--
Ondřej Surý (He/Him)
[hidden email]


Re: Trying again on SERVFAIL

Alessandro Vesely
On Thu 11/Feb/2021 14:47:13 +0100 Ondřej Surý wrote:
> Mark is right. The internet isn’t always on and it isn’t only composed of big tech companies with lots of resources.
>
> The internet consists of a lot of small systems made by people like you and me and we don’t have infinite resources to keep everything always on.


100% agreed.


> And honestly I find your quote about Cargo Cult very offensive to all those normal people maintaining the rest of the internet infrastructure that isn’t the current <n>-umvirate.


I don't share that point of view.  I cited it as evidence of a way of thinking.

I find it somewhat green, happy-go-lucky, but not offensive.  After all, if you
limit the range to personal messages, it's a legitimate way to conceive email
services.


Best
Ale

Re: Trying again on SERVFAIL

Bind-Users forum mailing list
In reply to this post by Alessandro Vesely
> Yeah, by the time it lands on Debian's glibc we'll have grown a long
> long beard.  I'm still missing RES_TRUSTAD...

Oh, this set me off on a tangent.  I hadn't heard of RES_TRUSTAD
before, so I found

  https://man7.org/linux/man-pages/man5/resolv.conf.5.html

which under "trust-ad" contains this text:

          If the trust-ad option is active, the stub resolver
          sets the AD bit in outgoing DNS queries (to enable AD
          bit support), [...]

I could not get that to rhyme with what I had perceived to be the
semantics of the AD bit, so I looked up RFC 4035 where near the
end of section 3 (just before 3.1), I find this text:

   The AD bit is controlled by name servers; a security-aware
   name server MUST ignore the setting of the AD bit in queries.

So ... I can't get the glibc behaviour to mesh with the standard
on this particular point.

Regards,

- Håvard

Re: Trying again on SERVFAIL

Brett Delmage
In reply to this post by Ondřej Surý
>  The internet isn’t always on and it isn’t only composed of big tech
> companies with lots of resources.

like Google's gmail, which has had hours-long service outages from time to
time? ;-)

RES_TRUSTAD, was Trying again on SERVFAIL

Alessandro Vesely
In reply to this post by Bind-Users forum mailing list
On Thu 11/Feb/2021 17:44:20 +0100 Havard Eidnes wrote:

>> Yeah, by the time it lands on Debian's glibc we'll have grown a long
>> long beard.  I'm still missing RES_TRUSTAD...
>
> Oh, this set me off on a tangent.  I hadn't heard of RES_TRUSTAD
> before, so I found
>
>    https://man7.org/linux/man-pages/man5/resolv.conf.5.html
>
> which under "trust-ad" contains this text:
>
>            If the trust-ad option is active, the stub resolver
>            sets the AD bit in outgoing DNS queries (to enable AD
>            bit support), [...]


It's similar to dig's man page:

       +[no]adflag
            Set [do not set] the AD (authentic data) bit in the query.
            This requests the server to return whether all of the answer
            and authority sections have all been validated as secure
            according to the security policy of the server. AD=1
            indicates that all records have been validated as secure and
            the answer is not from a OPT-OUT range. AD=0 indicate that
            some part of the answer was insecure or not validated. This
            bit is set by default.


> I could not get that to rhyme with what I had perceived to be the
> semantics of the AD bit, so I looked up RFC 4035 where near the
> end of section 3 (just before 3.1), I find this text:
>
>     The AD bit is controlled by name servers; a security-aware
>     name server MUST ignore the setting of the AD bit in queries.


That's the name server, not the resolver.


> So ... I can't get the glibc behaviour to mesh with the standard
> on this particular point.


It's set in RFC 6840:

5.7.  Setting the AD Bit on Queries

    The semantics of the Authentic Data (AD) bit in the query were
    previously undefined.  Section 4.6 of [RFC4035] instructed resolvers
    to always clear the AD bit when composing queries.

    This document defines setting the AD bit in a query as a signal
    indicating that the requester understands and is interested in the
    value of the AD bit in the response.  This allows a requester to
    indicate that it understands the AD bit without also requesting
    DNSSEC data via the DO bit.
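To make the bit in question concrete, here is a minimal sketch of a 12-byte
DNS query header with AD set, and of reading AD back from a response header.
The flag masks follow the standard header layout; the function names are
made up for illustration, and this is not how glibc implements trust-ad.

```python
import struct

FLAG_RD = 0x0100  # recursion desired
FLAG_AD = 0x0020  # authentic data


def build_query_header(query_id, want_ad=True):
    """Pack a DNS header for a one-question recursive query."""
    flags = FLAG_RD
    if want_ad:
        # Per RFC 6840 section 5.7: signals interest in the response's AD bit.
        flags |= FLAG_AD
    # Fields: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    return struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)


def response_has_ad(header):
    """Return True if the AD bit is set in a received DNS header."""
    (flags,) = struct.unpack("!H", header[2:4])
    return bool(flags & FLAG_AD)
```

Whether the AD bit in the answer can be believed still depends on trusting
the resolver and the path to it, which is what the trust-ad option is meant
to express.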



Best
Ale

Re: RES_TRUSTAD, was Trying again on SERVFAIL

Bind-Users forum mailing list
>> So ... I can't get the glibc behaviour to mesh with the standard
>> on this particular point.
>
> It's set in RFC 6840:

I stand corrected, thanks.

- Håvard