NServiceBus v4 – Beta1 is out

I just want to let you know that we released the first beta of the upcoming NServiceBus v4 last Friday. The main focus for v4 has been to make NServiceBus run on a wider range of queuing infrastructures while still giving you the same developer experience.

Out of the box, v4 will support ActiveMQ, RabbitMQ, WebSphere MQ and SQL Server; yes, you can use your old and trusty database as a queuing platform. While adding support for all those new transports we had to do quite a lot of refactoring deep in the bowels of NServiceBus. The end result is a much cleaner codebase, and it’s my hope that adding new transports will be a breeze going forward. There is already a Redis transport in the works by our amazing community.

We have of course made a lot of other improvements as well, so please take a look at the release notes for the full scope.

From now on we’ll be focusing on stabilizing the release and ramping up on documentation, so hopefully you’ll see an increase in v4-related blog posts here as well.

You can grab the new bits either as an MSI download over at our site or via NuGet.

If all goes well we hope to get the final version out of the door within the next 3-4 weeks.

Go ahead and take it for a spin!

Posted in NServiceBus | Comments closed

Pluralsight interview

I was interviewed by the good folks over at Pluralsight a few weeks ago and the result is now online. If you’re interested in a few war stories from my dark past and also a glimpse into what’s coming up in NServiceBus vNext you can listen to the full interview here:

http://blog.pluralsight.com/2012/12/04/meet-the-author-andreas-ohlund-on-introduction-to-nservicebus/

Posted in NServiceBus | Comments closed

Monitoring your dead letter queues

NServiceBus relies heavily on the underlying queuing system to make sure that messages get delivered in a robust and timely fashion. In order to spot misconfigurations it’s very important to monitor the dead letter queues (DLQs). You can think of the dead letter queue as a dumping ground for messages that can’t be delivered by the queuing system. This post focuses on MSMQ but the general ideas apply to the other queuing systems as well.

Can all messages end up in the DLQ?

If you’re using NServiceBus this would be all messages, since we make sure to set the required flags that tell MSMQ to enable the DLQ (negative source journaling in MSMQ lingo). This means there is no way you can lose valuable business data even if the queuing system is misconfigured. More info on the dead letter queues in MSMQ can be found here.

When do they end up there?

Messages get moved to the DLQ when the queuing system has tried to deliver a message for a while but decides to give up. This time is usually configurable; in MSMQ this happens when either the time-to-reach-queue (TTRQ) or the time-to-be-received (TTBR) for the message has expired.

The TTRQ is usually 4 days by default but can be adjusted per machine. TTBR is also 4 days by default but can be controlled per message type by adding the [TimeToBeReceived] attribute to your NServiceBus message definitions.

So how do I monitor this?

While the error queue in NServiceBus is usually system wide, the DLQs are machine specific. This means that you have to monitor all your machines to detect messages ending up in any of the DLQs. There are many ways to do this. One is to write a PowerShell script that periodically looks at the DLQs (MSMQ has 2 different ones, one for non-transactional and one for transactional messages) and sounds the alarm if they contain messages. To do this just use the System.Messaging.MessageQueue class in your script. The addresses are:

DIRECT=OS:{your machine}\SYSTEM$;DEADLETTER 
DIRECT=OS:{your machine}\SYSTEM$;DEADXACT
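The check described above could look roughly like the sketch below. It is written in C# to match the rest of the code on this blog, but the same System.Messaging calls can be made from a PowerShell script. The class and method names (DeadLetterMonitor, DeadLetterPaths, CountMessages) are made up for illustration, and I’m assuming the process runs on the .NET Framework with MSMQ installed and enough permissions to read the system queues.

```csharp
using System;
using System.Messaging; // .NET Framework only; requires the MSMQ Windows feature

// Hypothetical helper, not part of NServiceBus.
public static class DeadLetterMonitor
{
    // Builds the format names of the two machine-level dead letter queues.
    public static string[] DeadLetterPaths(string machine)
    {
        return new[]
        {
            @"FormatName:DIRECT=OS:" + machine + @"\SYSTEM$;DEADLETTER",
            @"FormatName:DIRECT=OS:" + machine + @"\SYSTEM$;DEADXACT"
        };
    }

    // Counts messages by walking the queue with an enumerator.
    // MoveNext only peeks, so nothing is removed from the DLQ.
    public static int CountMessages(string path)
    {
        var count = 0;
        using (var queue = new MessageQueue(path))
        using (var messages = queue.GetMessageEnumerator2())
        {
            while (messages.MoveNext())
                count++;
        }
        return count;
    }

    public static void Main()
    {
        foreach (var path in DeadLetterPaths(Environment.MachineName))
        {
            var count = CountMessages(path);
            if (count > 0)
                Console.WriteLine("ALERT: {0} message(s) in {1}", count, path);
        }
    }
}
```

Run it on a schedule on every machine and hook the alert line up to whatever alarm mechanism you already have.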

Another option is to use your favorite monitoring tool and watch the “MSMQ Queue -> Computer Queues -> Messages in Queue” performance counter.

The image below shows this on my machine.

Note that the “Messages in Queue” counter gives you the number of messages in both DLQs.

TL;DR

Make sure to monitor your DLQs to detect message queuing issues!

Hope this helps!

Posted in Messaging, NServiceBus, Ops | Comments closed

How to debug RavenDB through fiddler using NServiceBus

This is mostly a note to self since I always seem to forget how to do it :)

To set up an NServiceBus endpoint to make all calls to RavenDB through Fiddler you need to do the following:

Configure the proxy for your endpoint by adding the following to your app.config:

    <system.net>
      <defaultProxy>
        <proxy usesystemdefault="False" bypassonlocal="True" proxyaddress="http://127.0.0.1:8888"/>
      </defaultProxy>
    </system.net>

With the proxy set up, we just need to change the Raven connection string to go through Fiddler by adding:

    <connectionStrings>
      <add name="NServiceBus.Persistence" connectionString="url=http://localhost.fiddler:8080"/>
    </connectionStrings>

That should be it, happy debugging!

Posted in NServiceBus | Comments closed

What message should start a saga?

One common mistake people make when building sagas is being too restrictive when it comes to defining the messages that are allowed to start a saga. NServiceBus allows you to have multiple messages start a given saga for a very good reason.

I’ve touched on this subject in a previous post about ordering of messages. Let’s revisit the underlying reason why ordering can’t be assumed.

The network is reliable

As we all know this isn’t really true and is indeed the first fallacy of distributed computing. This means that:

You can’t assume when a message will arrive and, even worse, you can’t assume that a message will arrive at all!

It might seem like we’re stuck between a rock and a hard place, but this is where sagas can come to our rescue. Sagas’ ability to serialize message processing across a correlated dimension and the way they enable us to issue compensating actions is just what we need to handle this situation. The tricky part is realizing that this problem even exists.

Things are not always what they appear

Our brains seem to be built to order things in the way they logically appear, and that is what trips us up in these situations. Let’s illustrate this with an example.

Imagine a saga handling the shipment of a given order. The business rule is that we can ship when the order has been accepted (sales) and paid for (billing). This would lead us to a saga like this:

    public class ShippingPolicy : Saga<ShippingPolicyData>,
        IAmStartedByMessages<OrderAccepted>,
        IHandleMessages<OrderBilled>
    {
        public void Handle(OrderAccepted message)
        {
            Data.OrderId = message.OrderId;
            Data.Accepted = true;
            DispatchOrder();
        }

        public void Handle(OrderBilled message)
        {
            Data.OrderId = message.OrderId;
            Data.Billed = true;
            DispatchOrder();
        }

        void DispatchOrder()
        {
            if (Data.Accepted && Data.Billed)
                Bus.Send<ShipOrder>(m => m.OrderId = Data.OrderId);
        }

        public override void ConfigureHowToFindSaga()
        {
            ConfigureMapping<OrderAccepted>(s => s.OrderId, m => m.OrderId);
            ConfigureMapping<OrderBilled>(s => s.OrderId, m => m.OrderId);
        }
    }

As we can see, we start the saga for OrderAccepted, and when OrderBilled arrives we start our shipping process. All is good until we start getting bug reports from the business claiming that orders sometimes get “stuck” without being shipped.

After a grueling time looking at logs and audit messages, what seems to be an impossible scenario appears: the OrderBilled event sometimes seems to arrive before the OrderAccepted event. Puzzled by this discovery, we also overhear our ops team talking about a faulty network card on one of the backend servers hosting our sales components. After some more digging it seems that the OrderAccepted event is being delayed when sent to the Shipping server due to that network card issue.

Reasons for situations like this can be firewall issues, network issues, messages ending up in error queues, bugs, etc. So while logically these messages should appear in a given order (you can’t bill an order that isn’t accepted), in a real life messaging scenario this logical ordering won’t hold true. Looking back at our saga, we can see that if OrderBilled arrives first no saga instance will be created. Therefore, when OrderAccepted arrives a new saga will be created, and it will wait forever since OrderBilled has already been “processed”. The fix is easy though: just have both messages start the saga by changing IHandleMessages<OrderBilled> to IAmStartedByMessages<OrderBilled>. This will cause both messages to create a new saga instance if needed and solves the issue for us.
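To make the failure mode concrete, here is a small self-contained simulation of the scenario. It does not use NServiceBus at all: ShippingSimulator and ShippingState are hypothetical stand-ins for the saga infrastructure, and I’m assuming (as described above) that a message which finds no saga and isn’t allowed to start one is simply dropped.

```csharp
using System;
using System.Collections.Generic;

public class ShippingState
{
    public bool Accepted;
    public bool Billed;
    public bool Shipped;
}

// Hypothetical in-memory stand-in for saga lookup and dispatch.
public class ShippingSimulator
{
    readonly HashSet<string> starters;
    readonly Dictionary<int, ShippingState> sagas = new Dictionary<int, ShippingState>();

    // startingMessages plays the role of the IAmStartedByMessages<T> declarations.
    public ShippingSimulator(params string[] startingMessages)
    {
        starters = new HashSet<string>(startingMessages);
    }

    public void Deliver(string messageType, int orderId)
    {
        ShippingState state;
        if (!sagas.TryGetValue(orderId, out state))
        {
            // No saga found: only a starting message may create one,
            // anything else is dropped.
            if (!starters.Contains(messageType))
                return;
            state = new ShippingState();
            sagas[orderId] = state;
        }

        if (messageType == "OrderAccepted") state.Accepted = true;
        if (messageType == "OrderBilled") state.Billed = true;
        state.Shipped = state.Accepted && state.Billed;
    }

    public bool IsShipped(int orderId)
    {
        ShippingState state;
        return sagas.TryGetValue(orderId, out state) && state.Shipped;
    }
}

public static class Program
{
    public static void Main()
    {
        // Only OrderAccepted may start the saga, and OrderBilled arrives first.
        var restrictive = new ShippingSimulator("OrderAccepted");
        restrictive.Deliver("OrderBilled", 42);
        restrictive.Deliver("OrderAccepted", 42);
        Console.WriteLine(restrictive.IsShipped(42)); // False: the order is stuck

        // Both messages may start the saga, same arrival order.
        var tolerant = new ShippingSimulator("OrderAccepted", "OrderBilled");
        tolerant.Deliver("OrderBilled", 42);
        tolerant.Deliver("OrderAccepted", 42);
        Console.WriteLine(tolerant.IsShipped(42)); // True
    }
}
```

The only difference between the two runs is which messages are allowed to start the saga; the arrival order is identical.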

In this example we looked at messages (events) coming from 2 different services. But what about events from the same service, surely they will arrive in order?

Unfortunately no, the same problem exists here as well and it’s even harder to spot. Imagine adding the rule that we can’t ship canceled orders. Even though OrderAccepted and OrderCanceled come from the same server, there is nothing to say that OrderCanceled will always arrive last. What if OrderAccepted fails to be processed and ends up in your error queue for a few hours?

In closing

As we have seen, messages that can be expected to always hit an existing saga aren’t quite as common as you might think. I would go as far as to say that the only messages that can’t start a saga are messages sent or initiated by the saga instance itself: either timeouts set by the saga, or messages received as a reaction to a message sent by the saga itself (request/response interactions).

So go back and review your sagas, there are probably a few places where you need to start using IAmStartedByMessages!

Posted in Best practices, Messaging, NServiceBus | Comments closed

Video on how to use message mutators

Pluralsight has been kind enough to publish the episode in my course where I go through how message mutators work in NServiceBus.

Enjoy!

Posted in NServiceBus, Video | Comments closed

Disabling Second Level Retries for specific exceptions

NServiceBus supports two different types of retries: First Level Retries (FLR), which happen immediately by bouncing the message back to the top of the input queue, and Second Level Retries (SLR), which happen after a configurable amount of time.

In short, FLR aims to solve transient exceptions like deadlocks, while SLR solves more semi-transient ones like the database being down. Usually all this works out of the box without you doing anything, but in some scenarios you might want to skip the SLR for exceptions where you know that there is no chance of success.

Why you usually don’t need to worry about this

I usually suggest that you avoid tweaking the retries for as long as you can, since exceptions in production that aren’t solvable through retries should be rare. To achieve this we should always design our messages to have “no good reason to fail”, which we can refine to “no good business reason to fail”. If messages have no business reason to fail, the only failures you’ll get are infrastructure failures, which are very likely to be solvable through retries. I usually classify my messages into events and “other messages” (commands + request/response). Of those, an event should definitely never have a business reason to fail: an event is something important that has already happened in the business, so it’s not like we can roll it back.

If we follow the guidelines above fiddling with the retries is most likely a case of premature optimization.

Kijana has posted a lot of good insights regarding this over at the NServiceBus group.

All that said, the purist approach is always, well, a purist approach :)

Why you might need fine grained control over the SLR

While messages that have no business reason to fail are always achievable in theory, in practice that isn’t applicable to every real life situation, so let’s talk about a few scenarios where you might want to get more control over what gets retried.

If you have messages that can fail for a business reason excessive retries can cause issues with:

  1. Performance – Retrying causes extra load on your infrastructure, like databases. If we fail fast we decrease this type of load.
  2. SLAs – If you have a tight SLA for the given message type, all time spent retrying is lost time when it comes to “manually” fixing the problem within your SLA. In these situations we want to fail fast and let NServiceBus move the message to the error queue so that our maintenance team can take action.

I believe that the latter is the most valid one. If #1 (performance) is an issue, and remember that you get throttling out of the box with NServiceBus, it means that your system is regularly throwing business exceptions in production, and I think that should be considered a “bug”.

So if SLAs are the only valid reason to suppress retries, there is really no need to get fine grained control over FLR, since the FLR happens immediately without any delay and wouldn’t affect your SLAs. If that is a problem, just tune the number of FLRs down; the default is 5. NServiceBus doesn’t actually give you any way to do more, since we don’t want users to get overly smart with the FLR.

SLR is a different story: if you have a 30 minute SLA, spending 10 of those minutes retrying a business exception only leaves you with 20 minutes to take action.

Enough talk, here’s how to do it.

Disabling SLR for certain exceptions

SLR allows users to supply a custom retry policy. A custom retry policy gives you the ability to control the delay between retries for a given message. In this case we’ll combine this with the fact that if you specify the delay to be TimeSpan.MinValue, NServiceBus will interpret this as “don’t do any more retries”. When creating a custom retry policy we get access to the TransportMessage that failed, and fortunately NServiceBus adds headers to that message that contain the exception details. Those headers are:

  • NServiceBus.ExceptionInfo.Reason
  • NServiceBus.ExceptionInfo.ExceptionType
  • NServiceBus.ExceptionInfo.InnerExceptionType
  • NServiceBus.ExceptionInfo.Message
  • NServiceBus.ExceptionInfo.Source
  • NServiceBus.ExceptionInfo.StackTrace
  • NServiceBus.ExceptionInfo.HelpLink

So to create a custom policy that suppresses SLR if the exception is of the type MyBusinessException, the only thing you have to do is this:

    public class ChangeRetryPolicy : INeedInitialization
    {
        public void Init()
        {
            SecondLevelRetries.RetryPolicy = (tm) =>
            {
                // retry max 3 times
                if (TransportMessageHelpers.GetNumberOfRetries(tm) >= 3)
                {
                    // To send back a value less than zero tells the
                    // SecondLevelRetry satellite not to retry this message
                    // anymore.
                    return TimeSpan.MinValue;
                }

                if (tm.Headers["NServiceBus.ExceptionInfo.ExceptionType"] == typeof(MyBusinessException).FullName)
                    return TimeSpan.MinValue;

                return TimeSpan.FromSeconds(5);
            };
        }
    }

As you can see we just look at the headers and return TimeSpan.MinValue to abort the SLR. If you have other requirements just modify the policy to look at any of the other headers listed above.
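If you want to look at more than one header, it can help to pull the decision out into a plain function over the header dictionary so it is easy to test on its own. The sketch below does that and also checks the inner exception type. RetryDecision and NextDelay are hypothetical names of my own, not NServiceBus APIs; only the header keys and the TimeSpan.MinValue convention come from the description above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, independent of NServiceBus types.
public static class RetryDecision
{
    const string ExceptionTypeHeader = "NServiceBus.ExceptionInfo.ExceptionType";
    const string InnerExceptionTypeHeader = "NServiceBus.ExceptionInfo.InnerExceptionType";

    // Returns TimeSpan.MinValue to stop retrying,
    // otherwise the delay before the next attempt.
    public static TimeSpan NextDelay(
        IDictionary<string, string> headers,
        int retriesSoFar,
        ISet<string> fatalExceptionTypes)
    {
        // retry max 3 times
        if (retriesSoFar >= 3)
            return TimeSpan.MinValue;

        // Give up right away if either the exception or its inner
        // exception is one we know retries can't fix.
        foreach (var header in new[] { ExceptionTypeHeader, InnerExceptionTypeHeader })
        {
            string typeName;
            if (headers.TryGetValue(header, out typeName) && fatalExceptionTypes.Contains(typeName))
                return TimeSpan.MinValue;
        }

        return TimeSpan.FromSeconds(5);
    }

    public static void Main()
    {
        var fatal = new HashSet<string> { "MyCompany.MyBusinessException" };
        var headers = new Dictionary<string, string>
        {
            { ExceptionTypeHeader, "System.Exception" },
            { InnerExceptionTypeHeader, "MyCompany.MyBusinessException" }
        };

        // Prints True: the inner exception type is on the fatal list.
        Console.WriteLine(NextDelay(headers, 0, fatal) == TimeSpan.MinValue);
    }
}
```

Using TryGetValue also means a message that is missing one of the headers falls through to the normal delay instead of throwing inside the policy.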

Hope this helps!

Posted in In depth, NServiceBus | Comments closed

NServiceBus sagas and concurrency

The question of how NServiceBus sagas handle concurrency came up on Stack Overflow, and my answer got long winded and turned into a blog post.

If your endpoint is running with more than 1 worker thread, there is the possibility that multiple messages will hit the same saga instance at the exact same time. To give you ACID semantics in this situation NServiceBus will use the underlying storage to produce consistent behavior by only allowing one of the threads to commit. Most of this is handled automatically for you by NServiceBus, but there are a few things you need to be aware of.

We can divide concurrent access to saga instances into 2 scenarios. The first is when there is no existing saga instance and multiple threads try to create what should be the same saga instance. The other is when a saga instance already exists in storage and multiple threads try to update that same instance. Let’s look at both scenarios in detail and at what options you have.

Concurrent access to non existing saga instances

Sagas are started by the message types you handle as IAmStartedByMessages<T>. If more than one of those messages are processed concurrently and map to the same saga instance, there is a risk that more than one thread will try to create a new saga instance. In this case we can only allow one thread to commit. The others will roll back and the built-in retries in NServiceBus will kick in. On the next retry the saga instance will be found, so the race condition is resolved: the message now results in an update of that saga instance instead. That update can of course also run into concurrency problems, but they are solved as described in the next section. NServiceBus solves the creation race by requiring a unique constraint in your database for the property that you’re correlating on. In NServiceBus 2.X you had to create this constraint yourself, but in 3.X we have the [Unique] attribute. If you put that attribute on one of your saga data properties NServiceBus will create the constraint for you; this works for both the NHibernate and the RavenDB saga persisters. With that constraint in place only one thread will be allowed to create a new saga instance.
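The create-or-retry behavior can be illustrated without a database. In the sketch below a ConcurrentDictionary key stands in for the unique constraint and a loop stands in for the NServiceBus retries; InMemorySagaStore and OrderSagaData are hypothetical names, and this is only a model of the idea, not how the saga persisters are implemented.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public class OrderSagaData
{
    public int OrderId;
    public int MessagesHandled;
}

// Hypothetical model: the dictionary key acts as the unique constraint on OrderId.
public class InMemorySagaStore
{
    readonly ConcurrentDictionary<int, OrderSagaData> sagas =
        new ConcurrentDictionary<int, OrderSagaData>();

    int creations;
    public int Creations { get { return creations; } }

    public void Handle(int orderId)
    {
        while (true) // each loop iteration models one (re)try of the message
        {
            OrderSagaData saga;
            if (sagas.TryGetValue(orderId, out saga))
            {
                // Saga already exists: this becomes an update.
                lock (saga) saga.MessagesHandled++;
                return;
            }

            // No saga found: try to create it. Only one thread can win.
            if (sagas.TryAdd(orderId, new OrderSagaData { OrderId = orderId, MessagesHandled = 1 }))
            {
                Interlocked.Increment(ref creations);
                return;
            }
            // "Constraint violation": another thread created it first, so retry.
        }
    }

    public int MessagesHandled(int orderId)
    {
        OrderSagaData saga;
        if (!sagas.TryGetValue(orderId, out saga)) return 0;
        lock (saga) return saga.MessagesHandled;
    }
}

public static class Program
{
    public static void Main()
    {
        var store = new InMemorySagaStore();

        // Eight starting messages for the same order, processed concurrently.
        Parallel.For(0, 8, i => store.Handle(42));

        // One saga created, all eight messages handled.
        Console.WriteLine("{0} {1}", store.Creations, store.MessagesHandled(42)); // 1 8
    }
}
```

However the threads interleave, exactly one of them creates the instance and the rest end up as updates after their retry.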

In the future we’ll use this constraint to do even smarter things like auto mapping of your messages.

Concurrent access to existing saga instances

For this to work in a predictable way we rely on the optimistic concurrency support of the underlying database. This means that if more than one thread tries to update the same saga instance, the database will detect this and only allow one of them to commit. If this happens the retries will kick in and the race condition is resolved.

If you’re using the RavenDB saga persister you don’t have to do anything, since we turn on optimistic concurrency automatically for you. When running with the NHibernate saga persister we require you to add a “Version” property to your saga data so that NHibernate can work its magic. In NServiceBus 4.0 we will make this even easier for you by enabling the optimistic-all option if no Version property is found.

Another option is to use a transaction isolation level of serializable, but that will cause excessive locking so the performance degradation would be considerable. Note that “Serializable” is the default isolation level for TransactionScopes. In NServiceBus 4.0 we’ll lower it to “ReadCommitted” for you since we think that is a more sensible default.
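For readers who haven’t worked with version-based optimistic concurrency before, the sketch below shows the mechanism in isolation. VersionedStore, VersionedSaga and ConcurrencyException are hypothetical in-memory stand-ins; a real persister does the version check in the database, and NServiceBus does the retry for you.

```csharp
using System;

public class VersionedSaga
{
    public int Version;
    public bool Accepted;
    public bool Billed;

    public VersionedSaga Copy()
    {
        return new VersionedSaga { Version = Version, Accepted = Accepted, Billed = Billed };
    }
}

public class ConcurrencyException : Exception { }

// Hypothetical store holding a single saga instance.
public class VersionedStore
{
    readonly object gate = new object();
    VersionedSaga current = new VersionedSaga();

    public VersionedSaga Load()
    {
        lock (gate) return current.Copy();
    }

    // Commits only if nobody else has saved since the caller loaded.
    public void Save(VersionedSaga updated)
    {
        lock (gate)
        {
            if (updated.Version != current.Version)
                throw new ConcurrencyException();

            var committed = updated.Copy();
            committed.Version++;
            current = committed;
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var store = new VersionedStore();

        // Two "threads" load the same version of the saga.
        var a = store.Load();
        var b = store.Load();

        a.Accepted = true;
        store.Save(a); // commits, version 0 -> 1

        b.Billed = true;
        try
        {
            store.Save(b); // stale version, rejected
        }
        catch (ConcurrencyException)
        {
            // The retry: reload the latest state and apply the change again.
            b = store.Load();
            b.Billed = true;
            store.Save(b); // commits, version 1 -> 2
        }

        var final = store.Load();
        Console.WriteLine("{0} {1} {2}", final.Version, final.Accepted, final.Billed); // 2 True True
    }
}
```

Note that the retried handler sees the state committed by the first one, so neither update is lost.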

Hope this helps!

Posted in NServiceBus, Sagas | Comments closed

Introduction to NServiceBus is now available on Pluralsight

My introductory NServiceBus course for Pluralsight is finally live. It’s about 4 hours of content that will get you up to speed with NServiceBus and walk you through how to build message driven systems using a service bus like NServiceBus.

The amount of time spent on recording and editing it could be compared to your average IT project: initial estimate * PI :)

I’ve learnt a lot while recording it but listening to my own voice week after week will probably require years of therapy to recover from…

Anyway here’s the outline:

http://pluralsight.com/training/Courses/TableOfContents/nservicebus

What’s next

There is no content covering sagas since I hope to do a separate course focusing on the saga concept in general, which business problems it can help solve and how it’s implemented in NServiceBus.

Enjoy!

Posted in NServiceBus | Comments closed

Zero downtime deployments

Zero downtime “deployments”, or “upgrades” if you like (roughly the same number of hits on Google), is that thing that everyone says they are doing, but when asked how they are actually doing it there is usually no clear answer.

Let’s try to dig a bit deeper into the concept. By definition it seems to imply that we can deploy or upgrade our system with “zero” downtime.

Downtime for who?

To answer that question we can follow the money. Since business people pay us tech guys to build and run systems for them, it’s clear that the downtime that matters is business downtime. This means that servers, databases and other infrastructure pieces can be “down” as long as we’re not down from a business perspective. That’s good, since it actually gives us a chance to modify those things without immediately being “down”.

So when are we “down” then?

We are down when our system no longer delivers the business value that it was built to deliver. This is where Service Level Agreements (SLAs) come into play. SLAs put actual numbers on how our system should perform in a wide range of aspects. Examples of SLAs are that our website should respond to user interactions within 10 seconds, or be accessible some percentage of the year. Another example that is more business related is that placed orders should be processed within 5 minutes of being received from the customer. If we fail to deliver on those agreements we’re by definition “down” in the eyes of the business.

Availability

If we measure the time we’re up and meeting our SLAs during a period of time, let’s say a year, and divide that by the total time of that period, we get our availability. When dealing with availability we usually talk about “number of nines”; 2 nines means 99% of the year. Shooting for “5 nines”, 99.999%, is quite popular these days and that gives you a massive :) 5.256 minutes of downtime per year. I think we all realize that if you have only 5 minutes of accepted downtime per year, wasting any of it on deployments won’t cut it, no matter how lightning fast our deployments are, since we need that buffer to handle unexpected failures. Even if we don’t shoot for 5 nines, the pressure to deliver business value continuously forces us to release more often, and to pull that off we need to start looking into doing those releases without incurring any downtime. In other words, we need to do zero downtime deployments.
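The downtime budget for a given number of nines is simple arithmetic, and the little sketch below reproduces the 5.256 minutes figure. It assumes a 365 day year (525,600 minutes), and the Availability class is just a name I picked for the example.

```csharp
using System;

public static class Availability
{
    // 365 days * 24 hours * 60 minutes = 525,600, ignoring leap years.
    const double MinutesPerYear = 365 * 24 * 60;

    // "n nines" of availability leaves a fraction of 10^-n of the year as downtime.
    public static double AllowedDowntimeMinutes(int nines)
    {
        var unavailableFraction = Math.Pow(10, -nines);
        return MinutesPerYear * unavailableFraction;
    }

    public static void Main()
    {
        for (var nines = 2; nines <= 5; nines++)
            Console.WriteLine("{0} nines: {1:F3} minutes per year", nines, AllowedDowntimeMinutes(nines));
    }
}
```

Going from 2 nines to 5 nines shrinks the yearly budget from 5,256 minutes (about 3.65 days) to 5.256 minutes.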

Designing for zero downtime deployments

To have a fighting chance of doing ZDD you have to design your system for it. Coming up with that design isn’t easy; you have to spend time with your business to help them formalize their SLAs. You need to realize that not all parts of your system have the same SLAs and therefore not the same need for ZDD. You can then use that knowledge to partition your system in such a way that ZDD is only needed for the most important parts of the system, where you have the tightest SLAs.

Making these partitions talk to each other is where messaging shines. The reliability that comes with durable messaging is one of the key things that will enable you to evolve and deploy the different parts of your system independently, ultimately making Zero Downtime Deployments achievable.

Posted in Uncategorized | Comments closed