How to Conduct an " End-to-End " Disaster Recovery Exercise in Real Time PDF


By Shankar Subramaniyan, CISSP, CISM, PMP, ABCP ([email protected])

Abstract

Many organizations conduct traditional disaster recovery exercises in which testing is done in silos and the scope is limited to host level recovery of individual systems. With growing technology changes and globalization trends, the intricacy and interdependencies of applications have become more complex in recent years, and major applications are spread across multiple locations and multiple servers. In this scenario, a traditional recovery exercise focusing on server (host) level recovery will not adequately ensure the complete recovery of the application without inconsistencies among the various interdependent subcomponents. In a widespread disaster scenario involving major outages at the data center level, it is fairly certain that this kind of limited exercise will not be sufficient to assure a realistic readiness status and the overall Recovery Time Objective (RTO) for multiple applications. Therefore, organizations should increase the scope and complexity of disaster recovery exercises over time and ensure that each exercise is process-oriented and focused on “end-to-end” recovery. This article addresses some of the technical challenges faced in end-to-end disaster recovery exercises, which attempt a full life cycle of transactions across disaster recovery applications and their dependencies, and simulate business activities during the exercises.


Introduction

Growing reliance on information technology, along with compliance and regulatory requirements, has led many organizations to focus on Business Continuity and Disaster Recovery (DR) solutions. Availability has become a major concern for business survival. It is therefore essential to take a detailed look at disaster recovery testing and the specific steps needed to ensure a disaster recovery plan performs as expected. An end-to-end disaster recovery exercise provides a realistic readiness status and brings out the complexities and intricacies involved in recovering multiple applications in the case of a widespread disaster, including a data center level outage. There are many challenges in an “end-to-end” disaster recovery exercise compared to a traditional one, since all dependencies must be considered and an end-to-end view is needed to understand the full functionality of the applications. This article illustrates some of the challenges faced in actual “end-to-end” disaster recovery exercises conducted for applications that interfaced with external third parties and relied heavily on middleware components and batch jobs.

Why a Disaster Recovery Exercise is Required

Disaster recovery plans represent a considerable amount of complex and interrelated technologies and tasks that are required to support an organization’s disaster recovery capability. Constant changes in personnel, technologies, and application systems demand periodic plan validation to assure that the recovery plans are functional and remain so in the future. Without this validation, an organization cannot demonstrate that the documented recovery plans support the current recovery operations needed to sustain critical business functions in time of disaster. A periodic disaster recovery exercise is required to validate the documented recovery procedures, assumptions and associated technology used in the restoration of the production environment.

Issues in Traditional Disaster Recovery Exercises

How many organizations attempt a full life cycle of transactions across disaster recovery applications and their dependencies, and simulate business activity as part of disaster recovery exercises? Many organizations conduct traditional disaster recovery exercises in which testing is done in silos and the scope is restricted to host level recovery of individual systems. In most of these exercises, the participating team comprises only the information technology team, without involving any business users. Generally, the primary objectives in such exercises are restricted to recovery of standalone systems without any integration with upstream or downstream dependencies. Typical application validation carried out in this type of exercise includes login validation, form navigation and search validation, without testing any connections to other dependent applications or any business activity. Most of the time, traditional disaster recovery test activities are limited to travel and the restoration of hosts at the recovery site and nothing further. The major drawback of this type of testing is that one will not know, until an actual disaster, how the integration is going to work, what the main dependencies are, and what the impact of any network latency related issues may be [3]. With growing technology changes and globalization trends, the intricacy and interdependencies of applications have become more complex in recent years, and major applications are spread across multiple locations and multiple servers. In this scenario, a traditional recovery exercise focusing on server (host) level recovery is not adequate to fully recover the application without inconsistencies among the various interdependent subcomponents. This kind of limited exercise, not involving end-to-end disaster recovery activities and without attempting to simulate business activity, is not sufficient to reflect the preparedness to handle a real disaster or to assure the required overall Recovery Time Objective (RTO) for multiple applications.

Why We Need an End-to-End Disaster Recovery Exercise

A limited scope disaster recovery exercise, not involving end-to-end disaster recovery activities and without attempting to simulate business activity, is typically based on asset level (for example, a specific server or application) outage scenarios and not on widespread site level (datacenter or city level) outages. Therefore, in order to ensure effective disaster recovery preparedness, organizations should plan for an end-to-end disaster recovery exercise including all interdependent applications in scope. This brings out the practical issues involved in performing business transactions in the disaster recovery environment and verifies the real effectiveness of disaster recovery procedures.

Challenges in an “End-to-End” Disaster Recovery Exercise

An end-to-end disaster recovery exercise focuses on complete recovery of applications and their dependencies across the various layers, including the presentation, business logic, integration and data layers. It takes into account the required data consistency among the various interdependent subcomponents and views recovery from the business process perspective.


Since an end-to-end disaster recovery exercise attempts a full life cycle of transactions across disaster recovery applications and their dependencies, and simulates business activity during the exercise, there are many challenges in conducting an “end-to-end” recovery exercise. Typical challenges faced are:

• Isolating the DR environment
• Replacing hard coded IP addresses and host names
• Connecting to dependent systems not having a corresponding disaster recovery environment
• Proper sequencing of applications
• Thorough preparation and coordination
• Ensuring a back out plan and data replication during the exercise

This article assumes a parallel exercise scenario and highlights the common technical challenges faced in conducting an end-to-end disaster recovery parallel exercise at a warm site. In a parallel exercise, the DR environment is brought up without interrupting or shutting down the production environment.

Isolating the DR Environment

As everyone will agree, we need to perform the disaster recovery exercise without any interruption to production. This is easily said, but it is the toughest challenge for the disaster recovery coordinator, especially when a parallel test must be performed at a warm site. Isolating the DR environment while conducting full life cycle testing requires a great deal of planning and coordination. The key issue with a full life cycle test is the potential interruption to production systems by unintended access, either by other applications or by batch jobs. This may result in transactions being updated in the production environment during the test, since the restored systems might have the same host names or IP addresses as the production systems. Any production interruption, such as a duplicate financial transaction paying a vendor or a missing critical transaction caused by a disaster recovery exercise, could put the disaster recovery effort in jeopardy. One should ensure that disaster recovery instances are not connected to the production environment at any layer, including the database and network layers. For example, at the database layer, the tnsnames.ora file or database (DB) links should be updated to ensure that only DR instances are speaking to each other. At the network layer, appropriate firewall rules should be implemented to block any traffic from the disaster recovery environment to the production environment. In an isolated DR environment, there will be challenges for desktop clients and end users to connect to the DR environment and to verify whether the production or DR environment is being accessed. These challenges can be overcome by allowing access to the disaster recovery environment via DR-Citrix, DR host names or direct DR URLs as applicable. The end user client machine’s local hosts file and configuration files need to be configured to point to DR host names instead of production host names during the DR exercise.
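To illustrate the isolation check, the following is a minimal sketch (not from the original article) of a script that scans a DR configuration tree for lingering production host names or IP addresses. The host names, file extensions and directory path shown are hypothetical examples.

#!/usr/bin/env python3
"""Illustrative isolation check: scan DR config files for production host references.

All host names, file extensions and directory paths below are hypothetical
examples; substitute your own inventory.
"""
from pathlib import Path

# Hypothetical production hosts that must NOT appear in the DR environment.
PROD_HOSTS = {"prod-db01.example.com", "prod-app01.example.com", "10.10.1.25"}

def find_prod_references(config_root: str) -> list[tuple[str, int, str]]:
    """Return (file, line number, offending host) for each production reference."""
    hits = []
    for path in Path(config_root).rglob("*"):
        if not path.is_file() or path.suffix not in {".ora", ".cfg", ".properties", ".xml"}:
            continue
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            for host in PROD_HOSTS:
                if host in line:
                    hits.append((str(path), lineno, host))
    return hits

if __name__ == "__main__":
    findings = find_prod_references("/opt/dr/app_configs")  # hypothetical path
    for file, lineno, host in findings:
        print(f"WARNING: {file}:{lineno} still references production host {host}")
    if not findings:
        print("No production host references found in DR configuration.")

A check like this can be run before the exercise window opens, alongside the firewall rule reviews at the network layer.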

Replace the Hard-Coded IP Address and Host Name

In many organizations, a major issue in disaster recovery exercises is hard-coded IP addresses and host names in applications, particularly in batch jobs. Interfaces and batch jobs might fail, or might interrupt production systems during the exercise, if there are any hard-coded IP addresses or host names. Hence one needs to thoroughly analyze all the involved systems and identify any hard-coded IP address or host name. As a best practice [2], one should always reference alias names and avoid any hard-coding of host names or IP addresses. One of the important tasks for effective disaster recovery implementation is to convert every application to reference alias names, not the primary host names listed in DNS or IP addresses. However, it can be a tough job to replace the hard-coded host names or IP addresses in applications that were developed several years ago. In such cases, it is suggested to use automated scripts as much as possible to replace the production host name or production IP address with the respective DR host name or DR IP address. These DR scripts should be documented, and care should be taken to ensure they are not overwritten during storage replication to the DR environment.
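As a simple illustration of such an automated replacement script (a sketch, not the author's actual tooling; the host mapping and directory path are hypothetical):

#!/usr/bin/env python3
"""Illustrative sketch: replace production host names/IPs with DR equivalents.

The mapping and target directory are hypothetical; back up files before running.
"""
from pathlib import Path

# Hypothetical production -> DR substitutions.
PROD_TO_DR = {
    "prod-app01.example.com": "dr-app01.example.com",
    "prod-db01.example.com": "dr-db01.example.com",
    "10.10.1.25": "10.20.1.25",
}

def rewrite_file(path: Path) -> bool:
    """Rewrite one file in place; return True if anything changed."""
    text = path.read_text(errors="ignore")
    updated = text
    for prod, dr in PROD_TO_DR.items():
        updated = updated.replace(prod, dr)
    if updated != text:
        path.write_text(updated)
        return True
    return False

if __name__ == "__main__":
    changed = [p for p in Path("/opt/dr/batch_jobs").rglob("*.cfg")  # hypothetical path
               if p.is_file() and rewrite_file(p)]
    print(f"Updated {len(changed)} file(s)")
    for p in changed:
        print(f"  {p}")

Keeping such a script outside the replicated storage paths, and running it from the DR copy of the configuration, helps ensure it is not overwritten by the next replication cycle.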

Connecting to Dependent Systems Not Having a Corresponding Disaster Recovery Environment

One of the key challenges in an end-to-end disaster recovery exercise is how to test the connecting interfaces with other applications that do not have a corresponding disaster recovery environment. For instance, as represented in figure 1, let us assume an application X, which is hosted at the disaster recovery site and needs to interface with application Y, which is hosted at a third party site. If application Y has a corresponding disaster recovery system, then we can connect both disaster recovery systems during the exercise. Otherwise, one needs to look into the option of using another available environment of application Y, such as its development, test or pre-production system, for testing. Flowcharts, data feeds, and architecture comparisons for production and disaster recovery help in identifying all the required components for the successful functioning of applications in a disaster scenario. An interface architecture comparison between the production and DR environments is shown in figure 1. In this DR interface architecture drawing, since Y, a vendor application, did not have a DR environment, the DR exercise was conducted by connecting from the DR environment of application X to the test environment of application Y.


Figure 1: DR Interface Architecture Drawing
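One practical way to support this kind of substitution is to keep partner interface endpoints in per-environment configuration rather than in application code. Below is a minimal sketch under that assumption; the environment names and URLs are hypothetical, not from the original article.

"""Illustrative sketch: select partner endpoints per environment.

Application X picks application Y's endpoint from configuration, so the DR
exercise can point at Y's test environment when Y has no DR environment.
All URLs and environment names below are hypothetical.
"""

ENDPOINTS = {
    "production": {"app_y": "https://y.vendor.example.com/api"},
    # Y has no DR site, so the DR exercise points at Y's test environment.
    "dr_exercise": {"app_y": "https://y-test.vendor.example.com/api"},
}

def endpoint_for(partner: str, environment: str) -> str:
    """Return the interface URL for a partner application in a given environment."""
    try:
        return ENDPOINTS[environment][partner]
    except KeyError as exc:
        raise ValueError(f"No endpoint defined for {partner} in {environment}") from exc

if __name__ == "__main__":
    print(endpoint_for("app_y", "dr_exercise"))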

Proper Sequencing of Applications and Overall RTO

A crucial challenge in most disaster recovery exercises is the proper identification and sequencing of upstream and downstream dependencies. When performing a disaster recovery exercise with a full life cycle of transactions for 20 or 30 applications, the sequencing of applications becomes very critical. The sequence should be planned properly based on the dependencies and the agreed overall Recovery Time Objective (RTO) requirements for the applications. Documenting all the critical interfaces for a disaster recovery scenario helps in ensuring proper sequencing of applications. While considering the dependencies of an application, the interfaces need to be analyzed for the business requirement of the data and the frequency at which they run. Figure 2 illustrates the resulting application dependency analysis diagram. As illustrated in this diagram, D1 is the application that needs to be brought up first in the DR environment, before bringing up application X, because D1 provides critical input data to X, without which X cannot function appropriately. In most cases, inbound interfaces that feed data to an application need to be brought up first at the DR site. In this example, the applications marked D1, D2, D3 and D4 are brought up first, after which application X is brought up as D5. Under this scenario, the RTO for application X (D5) depends on the RTOs of the four dependent applications (D1, D2, D3 and D4), and this overall RTO should also meet the business requirements. Applications for outbound interfaces are brought up subsequently. Applications can also be brought up in parallel instead of in sequence, as the business requirements allow.

Application Dependency Analysis

D1

(XML)

D6

DB-Snap, Ctrl-M OPC Data/ FTP

D9 D4

DB link

Application X

D2

Web Services FTP, Ctrl-M, (Secure FTP)

D5

D7

DB link D10

DB-MV, Online

D3

D8

DB Replication

Interface Mechanism Legend FTP

Interface Type Legend Outbound dataflow Inbound dataflow Bi-directional data flow Access : No data flow Or Transient data flow

D- Application Sequence

Data push/pull initiator or Control-M source ( Some processing at other end )

:

FTP or Secure FTP

HTTP(s) :

HTTP or HTTPS

XCom

:

XCOM

Ctrl-M

:

Control-M

DB-Snap :

DB Snapshot

DB-MV

DB Materialized View Direct DB Connection Web Services

:

DB-Conn : WS

:

Web Methods

Figure 2 : Application Dependency Analysis
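To make the sequencing idea concrete, here is a minimal sketch (not from the article) that orders applications by their dependencies and estimates the overall RTO along the slowest dependency chain. The application names and recovery times are hypothetical and mirror the D1 through D5 scenario above.

"""Illustrative sketch: derive a DR bring-up order and an overall RTO estimate.

Dependencies and recovery times (in hours) below are hypothetical examples:
X depends on D1-D4, which must be recovered before it.
"""
from graphlib import TopologicalSorter

# app -> set of apps that must be recovered before it
DEPENDS_ON = {
    "D1": set(), "D2": set(), "D3": set(), "D4": set(),
    "X": {"D1", "D2", "D3", "D4"},
}
RECOVERY_HOURS = {"D1": 2, "D2": 1, "D3": 3, "D4": 2, "X": 4}

def bring_up_order() -> list[str]:
    """Return a valid recovery sequence (dependencies first)."""
    return list(TopologicalSorter(DEPENDS_ON).static_order())

def overall_rto(app: str) -> float:
    """Estimated RTO: the app's own recovery time plus its slowest dependency chain."""
    deps = DEPENDS_ON[app]
    longest_dep = max((overall_rto(d) for d in deps), default=0)
    return RECOVERY_HOURS[app] + longest_dep

if __name__ == "__main__":
    print("Recovery sequence:", " -> ".join(bring_up_order()))
    print("Estimated overall RTO for X:", overall_rto("X"), "hours")

This sketch assumes the dependent applications are recovered in parallel, so X waits only for the slowest of them; if they must be recovered strictly in sequence, their recovery times would be summed instead.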

Thorough Preparation and Coordination

In disaster recovery exercises, one can tend to skip the proper sequence of exercises or overlook its importance. But on the road to an end-to-end disaster recovery exercise, it is crucial to follow a proper sequence of testing, namely walk-through, simulation, parallel and then full interruption exercises. A walk-through and a simulation test are required first among the various participating teams (network/firewall, server, database, middleware and the various applications) to ensure that everyone knows what the scenario is, who needs to do what, and in what sequence. These tests bring out the potential risks to the production environment during the DR exercise and any coordination or sequencing issues in the recovery procedures. Thorough preparation and coordination, involving a great deal of planning, involvement from all the participating teams, and "mini" tests of all the subcomponents, will identify most of the potential issues before they occur and eliminate most of the human errors.

[Diagram omitted: continuous SAN replication from the primary volume (P-Vol) at the local data center to the secondary volume (S-Vol) at the remote site, with point-in-time copies presented to hosts for the DR exercise. SAN = Storage Area Network.]

Figure 3: Performing Disaster Recovery Exercise Using Point-in-Time Copy of Data

Ensuring Back Out Plan and Data Replication During Exercise

As always, one needs to ensure an appropriate process is in place for a solid back out plan (a restore point prior to the test start) and for aborting the exercise in the event of an anomaly or a critical business need arising during the exercise. One also needs to ensure that data replication to the disaster recovery site is not stopped during testing if a continuous data replication process is in place. As shown in figure 3, if storage array based Storage Area Network (SAN) replication is used, then technologies like point-in-time copies of data can be used for the disaster recovery exercise [4], by presenting point-in-time copies of data to the hosts instead of attaching the SAN volumes directly to the hosts. In this way, data replication does not have to be stopped during the exercise. In the figure above, there is continuous data replication from the local data center (the primary site) to the remote site even during the DR exercise, and testers work against the point-in-time copies of data. Also, as a best practice, a point-in-time copy and backup is taken at the primary site, which can help in resolving any major issues due to data corruption at the primary site. In storage array based replication, there is a risk that when the data in the primary site SAN is corrupted, the secondary site SAN will also contain the corrupted data, and both will become unusable. Hence one should consider this risk and design the DR replication solution accordingly.
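The workflow above can be sketched as a simple pre-exercise checklist. The functions below are hypothetical placeholders for vendor-specific storage tooling (snapshot creation, snapshot presentation, replication status), not a real storage API; the volume and host names are also hypothetical.

"""Illustrative sketch of the exercise workflow described above."""

def create_point_in_time_copy(volume: str) -> str:
    """Placeholder: ask the storage array for a point-in-time copy of a volume."""
    snapshot_id = f"{volume}-pit-snapshot"
    print(f"Created point-in-time copy {snapshot_id} (restore point for the back out plan)")
    return snapshot_id

def present_to_dr_hosts(snapshot_id: str, hosts: list[str]) -> None:
    """Placeholder: present the snapshot, not the live replica, to the DR test hosts."""
    for host in hosts:
        print(f"Presented {snapshot_id} to {host}")

def replication_is_running(primary: str, secondary: str) -> bool:
    """Placeholder: confirm P-Vol -> S-Vol replication stays active during the test."""
    print(f"Checked replication {primary} -> {secondary}: active")
    return True

if __name__ == "__main__":
    snap = create_point_in_time_copy("S-Vol-finance")
    present_to_dr_hosts(snap, ["dr-app01", "dr-db01"])
    assert replication_is_running("P-Vol-finance", "S-Vol-finance"), "Abort: replication stopped"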

Simplified and Automated Recovery Procedures to Resolve Issues Involving the Testing Team

A traditional recovery testing team consists of several groups, such as the operating system, database, middleware, networking, storage and datacenter operations teams. Since multiple teams (about 8-9) are involved in testing, scheduling the test becomes complicated. Also, if a high priority production issue arises during the test, testers may need to leave in the middle of testing, since in most cases they also support the production environment. In order to reduce the complexity in scheduling and to avoid interruptions, it is recommended to reduce these dependencies to a minimum and create a DR tester role that can run all the recovery steps alone, contacting the respective (database, network, storage, OS) system administrators only when there is an issue. The important aspect of this arrangement is that the recovery steps should be documented in such a way that they can be understood and executed by a helpdesk level (L4) person who does not have specialized administrator (L2/L1) skills. Beyond testing, in an actual disaster there is a tremendous amount of pressure and stress to get everything back up, running and available to users. In manual processes, mistakes will be made for a variety of reasons. Thus, it is suggested to automate the recovery process as much as possible. Having simplified and automated disaster recovery processes eliminates unnecessary time delays and manual errors during recovery. In most cases, simple scripts can help in redu...
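As an illustration of such a simplified, scripted runbook that a single DR tester could execute (a sketch only; the step names and script paths are hypothetical, and the actual recovery steps are environment specific):

"""Illustrative sketch: a simple runbook runner a single DR tester could execute.

Each step is a description plus a command; the commands are hypothetical
placeholders for environment-specific recovery scripts.
"""
import subprocess

# Hypothetical, documented recovery steps in the order they must run.
RUNBOOK = [
    ("Mount DR storage volumes", ["/opt/dr/bin/mount_dr_volumes.sh"]),
    ("Start DR database instances", ["/opt/dr/bin/start_dr_databases.sh"]),
    ("Start middleware and application services", ["/opt/dr/bin/start_dr_apps.sh"]),
    ("Run application smoke tests", ["/opt/dr/bin/smoke_tests.sh"]),
]

def run_runbook() -> bool:
    """Execute each step in order; stop and report on the first failure."""
    for description, command in RUNBOOK:
        print(f"STEP: {description}")
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"FAILED: {description} - escalate to the responsible admin team")
            print(result.stderr)
            return False
        print("OK")
    return True

if __name__ == "__main__":
    success = run_runbook()
    print("Recovery complete" if success else "Recovery halted; see failure above")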

