In the latest release of NServiceBus (3.2) we have introduced the concept of Second Level Retries (SLR). We've always had retries in NServiceBus, but those retries were aimed at transient errors like deadlocks and were therefore performed instantly. Other, less transient errors were just forwarded to a central error queue, where they would be managed by the team maintaining the application. SLR aims to solve the type of errors that can be classified as semi-transient. Examples would be a database being down for a short period of time, a web service being down for a few seconds, etc. In previous versions of NServiceBus those errors would have been moved to the error queue, but with SLR there is a reasonably good chance that they will be retried successfully, thereby not burdening the maintenance team.

Use with care

As I mention in my post on error handling in a message-oriented world, retrying is potentially dangerous: while a message is being retried no one knows about the problem, and if the error is non-transient you could potentially retry forever. The bottom line is that retries risk breaking your SLAs without you even knowing about it. So as a rule of thumb you should only retry until you reach:

MaxRetryTime = SLA for the particular message type - (Current time - Time the message was sent) - Your average response time for failures sent to the error queue

This will give you a chance to correct non-transient errors before you break your SLA.
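To make the rule of thumb concrete, here is a minimal sketch of the calculation in Python. It is not part of NServiceBus; the function name max_retry_time, its parameters, and the example numbers (a 30 minute SLA, a 20 minute average response time for error queue failures) are assumptions made up purely for illustration.

```python
from datetime import datetime, timedelta, timezone


def max_retry_time(sla, sent_at, now, avg_error_queue_response):
    """Return how long retrying can continue before the SLA is at risk.

    Hypothetical helper that mirrors the rule of thumb:
    SLA - (current time - time sent) - average error queue response time.
    """
    elapsed = now - sent_at
    budget = sla - elapsed - avg_error_queue_response
    # A negative budget means there is no room left for retries at all,
    # so clamp it to zero.
    return max(budget, timedelta(0))


# Assumed example values, not taken from any real system.
sent_at = datetime(2012, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
now = sent_at + timedelta(minutes=2)

budget = max_retry_time(
    sla=timedelta(minutes=30),
    sent_at=sent_at,
    now=now,
    avg_error_queue_response=timedelta(minutes=20),
)
print(budget)  # 0:08:00
```

With these numbers you have 8 minutes of retrying left. If the budget comes out as zero, the message should go straight to the error queue so that someone gets a chance to look at it.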

NServiceBus will help you as much as it can by not retrying non-transient errors such as deserialization exceptions, but since we can't classify exceptions from user code this is of course not bulletproof.

I'm not going into the details on how to configure SLR since that is covered by the NServiceBus documentation:

Second Level Retries

We also included a sample that walks you through the full chain of error management in NServiceBus.

Happy retrying!