NServiceBus supports two types of retries: First Level Retries (FLR), which happen immediately by bouncing the message back to the top of the input queue, and Second Level Retries (SLR), which happen after a configurable delay.
In short, FLR aims to handle transient exceptions such as deadlocks, while SLR handles semi-transient ones such as the database being down. Usually all of this works out of the box without you doing anything, but in some scenarios you might want to skip SLR for exceptions where you know there is no chance of success.
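To make the two retry layers concrete, here is a minimal Python model of the flow described above. It is not NServiceBus code (NServiceBus itself is .NET). Each FLR round retries immediately, each SLR step waits before starting another FLR round, and when everything is exhausted the message ends up in the error queue. The model assumes that every SLR redelivery gets a fresh FLR round. `FLR_ATTEMPTS_PER_ROUND` and `SLR_DELAYS` are hypothetical values chosen for illustration, not claimed defaults, and the delays are only recorded, never actually waited for.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import Callable, List

# Hypothetical settings for illustration only (not claimed NServiceBus defaults).
FLR_ATTEMPTS_PER_ROUND = 5
SLR_DELAYS = [timedelta(seconds=10), timedelta(seconds=20), timedelta(seconds=30)]


@dataclass
class Outcome:
    attempts: int = 0  # total handler invocations
    scheduled_delays: List[timedelta] = field(default_factory=list)  # SLR waits
    succeeded: bool = False  # False means the message ends up in the error queue


def process(handler: Callable[[int], None]) -> Outcome:
    """Deliver a message to `handler` using FLR rounds separated by SLR delays.

    `handler` receives the 1-based attempt number and raises on failure.
    """
    outcome = Outcome()
    # One FLR round up front, then one more FLR round after each SLR delay.
    for round_no in range(len(SLR_DELAYS) + 1):
        if round_no > 0:
            # SLR: a real bus defers the message; this model only records the wait.
            outcome.scheduled_delays.append(SLR_DELAYS[round_no - 1])
        for _ in range(FLR_ATTEMPTS_PER_ROUND):
            outcome.attempts += 1
            try:
                handler(outcome.attempts)
            except Exception:
                continue  # FLR: try again immediately
            outcome.succeeded = True
            return outcome
    return outcome  # every retry exhausted: error queue


def always_fails(attempt: int) -> None:
    raise RuntimeError("database is down")


def fails_seven_times(attempt: int) -> None:
    if attempt <= 7:
        raise RuntimeError("deadlock, then a brief outage")


flaky = process(fails_seven_times)  # succeeds on attempt 8, after one 10 s SLR delay
dead = process(always_fails)        # 20 attempts, 3 SLR delays, then the error queue
```

The point of the split shows up in `flaky`. A short outage that outlasts the first FLR round is still absorbed after a single SLR delay, without anyone touching the message.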
Why you usually don’t need to worry about this
I usually suggest avoiding tweaking the retries for as long as you can, since exceptions in production that aren’t solvable through retries should be rare. To achieve this we should always design our messages to have “no good reason to fail”, or more precisely “no good business reason to fail”. If messages have no business reason to fail, the only failures you’ll get are infrastructure failures, which are very likely to be solvable through retries. I usually classify my messages into events and “other messages” (commands plus request/response). Of those, an event should never have a business reason to fail: an event records something important that has already happened in the business, so it’s not as if we can roll it back.
If we follow the guidelines above, fiddling with the retries is most likely a case of premature optimization.
Kijana has posted a lot of good insights regarding this over at the NServiceBus group.
All that said, the purist approach is always, well, a purist approach :)
Why you might need fine-grained control over the SLR
While messages with no business reason to fail are always achievable in theory, in practice that doesn’t fit every real-life situation, so let’s talk about a few scenarios where you might want more control over what gets retried.
If you have messages that can fail for a business reason, excessive retries can cause issues with:
- Performance – Retrying puts extra load on your infrastructure, such as databases; if we fail fast we reduce this load.
- SLAs – If you have a tight SLA for a given message type, all time spent retrying is time lost for “manually” fixing the problem within your SLA. In these situations we want to fail fast and let NServiceBus move the message to the error queue so that our maintenance team can take action.
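The SLA point is simple arithmetic, sketched here in Python under stated assumptions. The SLR delays add up before the message ever reaches the error queue, and that sum comes straight out of the time the maintenance team has left. The 10-minute SLA and the 1/2/3-minute SLR delays are hypothetical numbers. FLR attempts are treated as instantaneous and handler execution time is ignored, so the result is only a lower bound on the time lost.

```python
from datetime import timedelta
from typing import List


def time_spent_retrying(slr_delays: List[timedelta]) -> timedelta:
    """Lower bound on the time a failing message spends in SLR before it
    reaches the error queue: the sum of the SLR delays. FLR attempts are
    treated as instantaneous and handler execution time is ignored."""
    return sum(slr_delays, timedelta())


def time_left_for_manual_fix(sla: timedelta, slr_delays: List[timedelta]) -> timedelta:
    """What remains of the SLA once the message lands in the error queue."""
    return sla - time_spent_retrying(slr_delays)


# Hypothetical numbers: a 10-minute SLA and SLR delays of 1, 2 and 3 minutes.
sla = timedelta(minutes=10)
slr_delays = [timedelta(minutes=1), timedelta(minutes=2), timedelta(minutes=3)]

with_retries = time_left_for_manual_fix(sla, slr_delays)  # 4 minutes left
failing_fast = time_left_for_manual_fix(sla, [])          # the full 10 minutes
```

With these numbers, retries that were never going to succeed eat 60% of the SLA before anyone is even notified.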
Enough talk, here’s how to do it.
Disabling SLR for certain exceptions
SLR lets you supply a custom retry policy, which controls the delay between retries for a given message. Here we’ll combine that with the fact that if the policy returns TimeSpan.MinValue as the delay, NServiceBus interprets it as “don’t do any more retries”. The custom retry policy gets access to the TransportMessage that failed, and fortunately NServiceBus adds headers containing the exception details to that message. Those headers are:
So to create a custom policy that suppresses SLR when the exception is of type MyBusinessException, the only thing you have to do is this:
As you can see, we just look at the headers and return TimeSpan.MinValue to abort SLR. If you have other requirements, just modify the policy to look at any of the other headers listed above.
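The policy described above boils down to a small decision function. Below is a minimal Python model of that decision logic. It is not the NServiceBus API: the real policy is a C# callback that receives the failed TransportMessage and returns a TimeSpan. In the model, `timedelta.min` stands in for `TimeSpan.MinValue`. The header key `NServiceBus.ExceptionInfo.ExceptionType` is an assumption about what NServiceBus writes. `BUSINESS_EXCEPTIONS`, `default_delay` and its 10/20/30-second schedule are hypothetical. To keep the model small, the number of SLR attempts so far is passed in as a plain argument.

```python
from datetime import timedelta
from typing import Dict

# Stand-in for .NET's TimeSpan.MinValue: Python's most negative timedelta.
NO_MORE_RETRIES = timedelta.min

# Assumed header key; check the exact keys your NServiceBus version writes.
EXCEPTION_TYPE_HEADER = "NServiceBus.ExceptionInfo.ExceptionType"

# Hypothetical fully qualified name of the business exception to fail fast on.
BUSINESS_EXCEPTIONS = {"MyCompany.Messages.MyBusinessException"}


def default_delay(retries_so_far: int) -> timedelta:
    """Hypothetical default schedule: 10 s, 20 s, 30 s, then give up."""
    if retries_so_far >= 3:
        return NO_MORE_RETRIES
    return timedelta(seconds=10 * (retries_so_far + 1))


def retry_policy(headers: Dict[str, str], retries_so_far: int) -> timedelta:
    """Return the next SLR delay, or NO_MORE_RETRIES to send the message
    straight to the error queue."""
    exception_type = headers.get(EXCEPTION_TYPE_HEADER)  # None if header missing
    if exception_type in BUSINESS_EXCEPTIONS:
        return NO_MORE_RETRIES  # a business failure will not fix itself: fail fast
    return default_delay(retries_so_far)  # everything else keeps the normal schedule
```

The model matches on the exact type name, so subclasses of the business exception are not caught. Handling those, or checking an inner exception, would mean looking at more than this one header.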
Hope this helps!