TUG: Toronto Users Group for Midrange Systems
eServer magazine

September 1997: Volume 13, Number 1

Communicating with Sam

Disaster Recovery Planning & High Availability

By Sam Johnston

From the questions submitted, here is the selected topic for this issue...


We currently have a disaster recovery plan in place for our AS/400. However, recently I have started to hear the term "high availability", both in reference to the AS/400 and our network. How does disaster recovery fit with high availability, and what are the implications from a networking perspective?

Sam's Answer:

The question of high availability is largely related to business practices and strategies, although significant technology can be required depending upon the strategy you pursue to deliver availability. You have correctly identified the trend of increased attention to both System Availability and Network Availability, largely due to the increased reliance on information technology to drive our businesses.

Disaster recovery is essentially one strategy that can be pursued to deliver System Availability. Traditionally, this has been the most popular means for AS/400 organizations to ensure that they can recover essential data and restore service levels after a catastrophic event at the host site. However, the increased importance of information systems in driving daily business operations has forced companies to evaluate their availability strategies.

The correct level of systems availability is largely a business decision, and involves a spectrum of technical approaches which have evolved over time. The most basic availability strategy is simply backing up data on a regular basis, which of course protects you from significant losses of data in the event of system failure, but neither addresses quick restoration of business activity nor protects you from physical destruction of the host site. The need to have employees productive using current data as quickly as possible created the need for disaster recovery, in which a third party provides the organization with a back-up host where it can rebuild its system. This delivers a moderate level of availability, but it can still take several days before business activity is restored.

For many companies, especially those that are highly automated or information intensive, the cost of several days without essential systems could be catastrophic. Many of those companies are now pursuing "high system availability" strategies, or in some cases "zero system unavailability". High availability can generally be achieved via a redundant host, likely less robust than the production AS/400, located at a secure second site as geographically separated as possible from the main host site. Essentially, the redundant host is kept current with essential modules of the enterprise software and regularly backed-up data. This strategy means that, provided you have the right network infrastructure, users could resume business activity within perhaps minutes, albeit initially having to retrace business activity that had taken place since the last back-up. At the extreme, there are companies that cannot accept the cost of retracing any business activity, and they are pursuing zero tolerance for system downtime. This, however, is a costly approach, as it often involves two identical host servers, each with the appropriate enterprise software licences, as well as sophisticated software for mirroring the two servers for complete synchronization and redundancy.
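Because the choice between these strategies is a business decision, it can help to put rough numbers on it. The following is only a minimal sketch in Python: every figure, and the names Strategy, outage_cost and cheapest, are hypothetical and chosen for illustration. It compares the yearly cost of running each strategy plus the expected cost of outages, where an outage costs the time to restore service plus the time spent retracing work done since the last back-up.

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    name: str
    hours_to_restore: float   # assumed time until users are working again
    hours_to_retrace: float   # assumed work redone since the last back-up
    annual_cost: float        # assumed yearly cost of running the strategy


def outage_cost(strategy, cost_per_hour):
    """Business cost of one outage: hours down plus hours spent retracing work."""
    return (strategy.hours_to_restore + strategy.hours_to_retrace) * cost_per_hour


def cheapest(strategies, cost_per_hour, outages_per_year):
    """Return the strategy with the lowest expected total cost per year."""
    return min(
        strategies,
        key=lambda s: s.annual_cost + outages_per_year * outage_cost(s, cost_per_hour),
    )


# Hypothetical figures only; substitute your own estimates.
STRATEGIES = [
    Strategy("disaster recovery", hours_to_restore=72.0, hours_to_retrace=8.0, annual_cost=50_000.0),
    Strategy("high availability", hours_to_restore=0.5, hours_to_retrace=8.0, annual_cost=200_000.0),
    Strategy("zero downtime", hours_to_restore=0.0, hours_to_retrace=0.0, annual_cost=600_000.0),
]

if __name__ == "__main__":
    for hourly in (1_000, 100_000, 1_000_000):
        best = cheapest(STRATEGIES, cost_per_hour=hourly, outages_per_year=0.1)
        print(f"cost per hour {hourly}: {best.name}")
```

With these assumed inputs, the preferred strategy moves from disaster recovery to high availability to zero downtime as the hourly cost of being down rises, which is the spectrum described above.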

Networking has played an important role both in creating the demand for high availability and in delivering it, owing to the evolution of the network centric computing model. Distributed processing, the predecessor to network centric computing, relied upon Client/Server applications which, due to their interactive nature, were too large and costly to move across the WAN with the technologies and bandwidth costs prevalent in the early 1990s. Under the distributed model, each remote site, such as a plant or regional sales office, had its own server, and the head office often acted only as a financial consolidation centre. Disaster recovery was less relevant to a smaller site, where a disaster such as a fire might eliminate the server but would also likely eliminate all the economic activity that a disaster recovery plan would have enabled. At the head office, disaster recovery ensured that consolidation would be interrupted for only a short period of time, and even if a disaster struck the main host, essential economic activity would continue at each distributed site.

Over the past two years, as the cost of quality bandwidth has decreased and the Internet has driven new technologies, host centric computing has reinvented itself as network centric computing. Although PC file servers may still exist at remote sites for non-mission critical data, we have noticed a significant trend of companies, via the upgrading process, consolidating multiple distributed AS/400s into a single centrally located host AS/400. With this trend comes mission critical reliance on the host and the WAN for supporting economic activity at the head office and all the remote sites. This trend is driving the concern with Network and System Availability, as they essentially become one and the same in a network centric computing model.

Regardless of the availability strategy you select and degree of availability protection that is required, it is crucial that you consider the network implications.

For example, companies often have back-up communication lines between the host site and the disaster recovery site, which is fine if you only need to protect against system failure. However, the WAN that carries host transactions to the remote sites often has its only connection point at the host site. If there is a true disaster such as a fire at this site, there is little hope that the communication lines will survive, and the back-up path to the disaster recovery site will do little to connect your WAN traffic to the new production server.
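The single connection point described above can be checked on paper, or with a few lines of code. The sketch below is illustrative only: the site names and both topologies are hypothetical, and reachable is a plain breadth-first search, not a feature of any network management product. It asks whether a remote site can still reach the recovery site once the host site is lost.

```python
from collections import deque


def reachable(links, start, failed=frozenset()):
    """Return the set of sites reachable from start, skipping failed sites."""
    if start in failed:
        return set()
    # Build an adjacency map from the links that do not touch a failed site.
    adjacency = {}
    for a, b in links:
        if a in failed or b in failed:
            continue
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        site = queue.popleft()
        for neighbour in adjacency.get(site, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


# Hypothetical topology: the WAN's only connection point is at the host site.
SINGLE_ENTRY = [
    ("plant", "wan"),
    ("sales_office", "wan"),
    ("wan", "host_site"),
    ("host_site", "recovery_site"),  # back-up line between host and recovery site
]

# Same network with an assumed redundant WAN node at the recovery site.
REDUNDANT = SINGLE_ENTRY + [("wan", "recovery_site")]

if __name__ == "__main__":
    lost = {"host_site"}
    for label, links in (("single entry", SINGLE_ENTRY), ("redundant node", REDUNDANT)):
        ok = "recovery_site" in reachable(links, "plant", failed=lost)
        print(f"{label}: plant reaches recovery site after host loss = {ok}")
```

In the first topology the plant loses its path to the recovery site as soon as the host site fails, even though the back-up line exists; in the second, the redundant WAN node keeps the path alive.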

Consequently, it is crucial to have a network design that merges with your availability strategy. For example, a large Frame Relay network likely needs a redundant node at the disaster recovery site. This can be expensive, but carriers such as AT&T are recognizing the need for higher availability and offer Frame Relay services that are scaled up to the required line speed only in the event of a disaster, reducing the ongoing operating cost while eliminating the 4 to 6 weeks that it would take to install a new node. It is also important that you have a network management plan in place that is tested along with your disaster recovery plan. Remember that your availability plan, to be truly effective and easy to manage in the event of a disaster, will impact all aspects of your network, including: addressing; bandwidth design and selection; hardware brand and model selection, including evaluating redundant hardware; routing protocols; and network management software and strategy.

Of course, if you are one of those companies with a central host supporting multiple sites of mission critical processing over a WAN, you are quickly noting the link between System and Network Availability. A cohesive strategy that encompasses both will be generally more cost effective, and certainly provide better protection. For example, perhaps you do select a high availability strategy with a redundant host located at a second site. By incorporating the availability site into your normal network operations via a redundant WAN node and a local backbone connection between the host and availability site, you obtain several benefits:

Remember, network and system availability are one and the same in a networked world, and they both remain fundamentally business decisions. If you correctly assess your business needs as they relate to availability, the technology you need to deploy will become self-evident.

Note: Any TUG member wishing to submit a question to Sam can e-mail or forward their typewritten material to the TUG office, or to Intesys. We would be pleased to publish your question and Sam's answer in an upcoming issue of the TUG eServer magazine.