Geeks With Blogs
Matt Gilbert An attempt at techy stuff
In my current company, we have a fairly good DR story with our “Global” (UK based) BizTalk platform and it’s something that is regularly tested (successfully I might add :) ). We also have a smaller BizTalk deployment in the US which follows a different model but which was also proven successful this year for the first time. More on that in a bit.
 
The choice of DR model you adopt for BizTalk will depend on a number of factors:
Criticality – how important is the information flowing through or orchestrated by BizTalk, either now or planned?
Recovery Point Objective – Your RPO is dictated by how accurate/up to date your system needs to be when DR is invoked. Can you afford to lose the last 15 minutes of activity? Can you afford to lose nothing?
Recovery Time Objective – Your RTO is dictated by the role BizTalk plays in your environment and is tied to Criticality and RPO. Do you need to be back online in seconds, minutes, hours or days? Your figures for both RTO and RPO will be driven by the SLA you have with the Business for meeting your DR targets.
Cost – are you paying for a mirror of production or a reduced capacity? Are your servers going to be on cold or hot standby? Do you have Software Assurance on your production licenses? Do you need to extend an existing DR contract to accommodate the BizTalk DR requirements?
 
There are obviously more things to consider but these are some important ones to be starting with. Let’s take a look at each in a little more detail.
 
Criticality
Most BizTalk implementations sit at the heart of a business and are therefore usually flagged as a “Class A” system when it comes to planning for DR. What flows through BizTalk can be a mixed bag, however, so you may want to consider, as the BizTalk team, internally classifying the various interfaces you deal with by priority. The chances are you will have done this already as part of the day-to-day support process (e.g. you know that if interface A breaks at 2am, people get up to fix it, but if a message is suspended on process B it can wait until the next working day). You can then apply the same thinking to your recovery planning if you need to phase when elements of your solution come back online.
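To make the idea of phased recovery concrete, here is a minimal Python sketch that groups interfaces into recovery phases by priority. The interface names, the numeric priority scheme and the `recovery_phases` helper are all hypothetical and only illustrate the thinking; they are not taken from our platform.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Interface:
    name: str      # hypothetical interface name
    priority: int  # 1 = people get up at 2am, 2 = next working day, ...


def recovery_phases(interfaces):
    """Group interface names by priority, most critical phase first."""
    phases = defaultdict(list)
    for interface in interfaces:
        phases[interface.priority].append(interface.name)
    # Lower priority number means earlier recovery phase.
    return [sorted(phases[p]) for p in sorted(phases)]


interfaces = [
    Interface("interface_a", 1),
    Interface("process_b", 2),
    Interface("interface_c", 1),
]
print(recovery_phases(interfaces))
```

The first list returned is what you would bring back online first; anything in later lists can wait until the critical interfaces are confirmed healthy.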
 
RPO
Your RPO represents the data point you are trying to get back to when you recover your systems. This is one of the key things that will force your hand when it comes to deciding on a DR solution. Some things that will push you towards a very stringent RPO, as close to real-time as possible, are:
- a high throughput of critical transactions which will cost the business money if they are lost
- long-running transactions and complex business processes where the loss of messages or recent activity makes it very hard to replay or resynchronise systems.
However, it might be that the processes you are orchestrating or messages you are routing are considered low priority or easily recovered and replayed. In that case, your target RPO might be minutes or even hours before DR was invoked (if you have one at all).
 
RTO
As mentioned already, RTO is going to be derived from the Criticality and RPO that have been agreed. It’s no good having an RTO of 2 hours if you have critical systems which require you to be back online in less than 5 minutes. Your RTO will help make the decision about the type of DR environment you need to establish and that will then have an effect on the costs involved.
 
Cost
Obviously, the closer to production and to zero data loss you aim for in your DR solution, the more complex and expensive it becomes. You also need to consider the licensing implications of having BizTalk servers (and SQL servers…and anything else) at a DR site. Production licenses purchased under Software Assurance get some breaks here as discussed in this document but otherwise more boxes can often mean more money, and not just in terms of hardware.
Low cost options (beyond not bothering with DR) include:
- Having a cold-standby server ready to fire up when DR is invoked (possible license required)
- Having hardware provided only and building the platform from scratch (obviously this impacts RTO a great deal)
Some high cost options would be:
- Matching all your production environment server for server (with licenses)
- Achieving a near real-time RPO with SAN Mirroring, cross-site clustering or similar
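The trade-off between RTO, RPO and cost can be sketched as a simple filter: discard the options that miss either target, then take the cheapest of what is left. The Python below is only an illustration under loud assumptions; the `DrOption` type, the `cheapest_option` helper and every figure in `OPTIONS` are made up, and real estimates would come from your own testing and contracts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DrOption:
    name: str
    rto_minutes: float  # estimated time to be back online
    rpo_minutes: float  # estimated worst-case data loss
    relative_cost: int  # illustrative ranking only, not a price


def cheapest_option(options, target_rto, target_rpo) -> Optional[DrOption]:
    """Return the lowest-cost option meeting both targets, or None."""
    viable = [o for o in options
              if o.rto_minutes <= target_rto and o.rpo_minutes <= target_rpo]
    return min(viable, key=lambda o: o.relative_cost, default=None)


# All figures below are invented for illustration.
OPTIONS = [
    DrOption("rebuild from scratch on supplied hardware", 3 * 24 * 60, 24 * 60, 1),
    DrOption("cold-standby server", 4 * 60, 60, 2),
    DrOption("full mirror with SAN mirroring", 5, 1, 5),
]

choice = cheapest_option(OPTIONS, target_rto=8 * 60, target_rpo=2 * 60)
print(choice.name if choice else "no option meets the targets")
```

With an 8 hour RTO and a 2 hour RPO the cold-standby server is enough in this made-up data set; tighten either target to a few minutes and only the full mirror survives the filter.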
 
Other considerations
Capacity: Does your DR offering need to be as performant as your production environment? Can you get away with lower grade hardware, less memory or fewer servers? Is the disk sub-system on the DR SQL server going to be good enough for the low latency or high volume requirements you have? If you have fewer servers, does this affect the way you distribute your hosts and workload? Has any known reduction in capability (should DR be invoked) been publicised so you aren’t being asked to meet expectations that are unrealistic?
Peripheral components: Don’t forget those things you own and interact with that are part of your solution, like custom databases, web sites/services and repositories. Your DR story needs to include them too, and they need factoring into your plans for achieving RTO and RPO and for how you are going to perform in DR.
Testing: Having a good DR story is all very well but if you don’t test it you cannot have any confidence it’s going to work! Regularly scheduled DR tests are a must and, from experience, these need to be run in a closed-off network; if you are simulating your data-centre going offline – make sure your DR site isn’t talking to it!
 
So what are we doing then?
We have two BizTalk platforms, one in the UK supporting the “Global” application and systems and a smaller one in the US which deals with local systems. For various historical reasons we have different DR solutions for each one.
Our UK DR solution consists of three servers (1 BTS, 1 SQL and 1 Comms (FTP, WebSphere MQ etc)). Compare this to the production environment, which is 4 BizTalk servers, an active-active SQL cluster and an active-passive comms cluster. The reduced capacity is a known and accepted limitation. The servers are in a dormant state (ENTSSO is disabled on the BizTalk box) and when deployments are made to production they are replayed to DR a few days later (e.g. the Monday following a Saturday deployment). The bulk of what we do in BizTalk is message routing and transformation, and we have no long-running processes whose consistency we would need to maintain across a disaster. That means our solution is not as costly as it could otherwise have been. We are, however, responsible for a large number of high-volume critical interfaces, so our RTO is matched with theirs. If DR is invoked, we simply have to bring ENTSSO and the BizTalk services back online and we should be good to go. This has been tested yearly since we went live about 5 years ago and has always been successful. The big win for us is effectively not being set an RPO. The critical systems we deal with have a very generous 15 minute RPO due to their log shipping schedule and, as most traffic passes through us in seconds, there is no point trying to match it.
Our US solution had always failed up until the last test. The RTO was in days and there was no dedicated hardware. The support team in the US used manual build scripts to rebuild the platform once hardware was provided to them. Due to poor documentation and the complexity of deployments and 3rd party tools, this was never successful. This year we tried something different – VM snapshots of the BizTalk servers spun up on top of database backups. This worked :) The approach has obvious flaws though: you need to re-image the VM on every change to the BTS environment, or replay any changes made between the last time the VM was imaged and the DR invocation (which impacts the speed of recovery).
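The replay problem with the snapshot approach comes down to knowing which deployments happened after the image was taken. A minimal Python sketch of that bookkeeping follows; the `changes_to_replay` helper, the dates and the deployment descriptions are all hypothetical, and it assumes you keep some dated record of production changes.

```python
from datetime import date


def changes_to_replay(snapshot_taken, deployments):
    """Return names of deployments made after the snapshot, oldest first."""
    pending = sorted((when, name) for when, name in deployments
                     if when > snapshot_taken)
    return [name for _, name in pending]


# Hypothetical history: (date deployed to production, description).
deployments = [
    (date(2008, 11, 29), "new send port for orders"),
    (date(2008, 10, 4), "map fix for invoices"),
    (date(2008, 11, 8), "adapter upgrade"),
]

pending = changes_to_replay(date(2008, 10, 31), deployments)
print(pending)
```

The longer that list gets, the longer recovery takes, which is the argument for re-imaging the VM whenever the list stops being trivially short.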
 
So there is a lot to consider and every company is going to have different requirements, but hopefully some of what I’ve detailed above will get you started along the road to a good DR offering.
Posted on Monday, December 15, 2008 4:01 PM | BizTalk, Disaster Recovery


Comments on this post: Planning for BizTalk DR

# re: Planning for BizTalk DR
Matt
The US solution, from a BizTalk perspective, had been successful with the manual build scripts (not saying I liked the approach), so it is misleading to say it failed. The US team were successfully able to rebuild the BizTalk environment in the DR site; the issue was around getting the domain up and running. Since the domain was not brought online successfully, we could not call the DR a success.

Derek
Left by Derek on Jun 30, 2009 12:54 PM


Copyright © mattjgilbert | Powered by: GeeksWithBlogs.net