High Availability and Disaster Recovery
- Chilukuri Srinivasa Reddy (Unlicensed)
- Mayuresh Balaji Kamble (Unlicensed)
- Enterprise IT
- Shilpa K (Deactivated)
Overview
Many elements contribute to customer satisfaction. High Availability (HA) and Disaster Recovery (DR) play a vital role by providing continuous access to services and data, even when a system fails or an entire geographical site is impacted. The loss can range from a single server to the physical infrastructure of a complete data center. The following topics describe high availability and disaster recovery in detail.
SummitAI assumes that you are familiar with the following standard High Availability and Disaster Recovery practices in the IT industry:
- Clustering technologies such as database clusters that spread workloads across multiple servers.
- Load balancing with application monitoring. It routes incoming application requests to healthy application nodes and raises events so that failures can be handled proactively.
- Self-healing systems that move workloads or allocate additional capacity in case of any detected failures.
Audience
This document is intended for admins using the SummitAI IT Management Suite.
SummitAI Components
SummitAI services are offered using the following infrastructure components:
SummitAI Application Server
The SummitAI application is an IIS-based application. It accepts requests from both internal users and public users. In the DR environment, two Application Pools are available to support the MZ and DMZ users:
- MZ Zone Servers
  - App 1
- DMZ Zone Servers
  - App 2
Requests to the Application Pools are load balanced by the Network Load Balancer for both internal and public users. Mobile app users connect through the DMZ App Servers.
SummitAI Data Collector
The SummitAI Data Collector receives encrypted data from the SummitAI proxy. It then decrypts and decompresses the data and publishes it to the respective tables of the SummitAI Database Server. One common Data Collector layer serves both internal and public users.
SummitAI Database Server
SummitAI uses a Microsoft SQL Server (MSSQL) database server hosted in the customer's internal data center (DR). It is a single database environment; storage replication from the DC is handled by the automation software used for data synchronization.
SummitAI Proxy Servers
- Master Proxy
Master Proxies monitor the servers and network devices across the different ITOM groups based on the branch categorization. With Branch Proxies, Master Proxies are used only for patch synchronization. The Master Proxy also handles SAM agent communication from all internal devices and endpoints.
- Branch Proxies
Branch Proxies handle Asset and Patch Management for the branch devices and endpoints.
- DMZ Proxy
The DMZ Proxy handles Asset and Patch Management for the roaming devices and endpoints that are on the public subnet.
SummitAI Agent
SummitAI Agents (Asset Agents and Server Agents) are deployed on the endpoint devices to collect information and support other activities.
For more information, see SummitAI Components.
High Availability
High Availability (HA) eliminates single points of failure. It means little or no downtime when an individual server or database fails. A highly available system or component can operate continuously for a long period despite crashes or failures. Highly available servers and databases keep the infrastructure readily available and let business processes continue without any delay in service. The load balancer helps ensure high availability. It routes client requests to the correct server and maximizes performance and capacity utilization. Load balancers monitor server health, bring additional servers online when there is a spike in traffic, and reboot a server that is down.
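The routing behaviour described above can be sketched as a round-robin selection that skips nodes failing a health check. This is a minimal illustration only, not the implementation of SummitAI or of any vendor's load balancer; the class name, the node names, and the is_healthy callback are hypothetical.

```python
from typing import Callable, Iterable


class HealthAwareBalancer:
    """Round-robin selection that skips nodes reported as unhealthy."""

    def __init__(self, nodes: Iterable[str], is_healthy: Callable[[str], bool]):
        self._nodes = list(nodes)
        if not self._nodes:
            raise ValueError("at least one node is required")
        self._is_healthy = is_healthy  # hypothetical health probe
        self._next = 0

    def pick(self) -> str:
        """Return the next healthy node, trying each node at most once."""
        for _ in range(len(self._nodes)):
            node = self._nodes[self._next]
            self._next = (self._next + 1) % len(self._nodes)
            if self._is_healthy(node):
                return node
        raise RuntimeError("no healthy application node available")
```

A real load balancer would typically cache health results from periodic probes rather than calling the check on every request.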
High Availability for Application Server
The following screenshot displays the high availability architecture for Application Servers:
Any load balancing solution depends on network connectivity. A client that is connected to a particular Application Server must keep using that server for an operation in progress; if a new connection for an existing operation is opened to a different Application Server, it receives errors. For this reason, the connection timeout configured on the load balancer is important.
Following are the benefits of a highly available architecture for the Application Servers:
- High Availability of the Application Server infrastructure.
- Loss of a single Application Server does not bring down the Application Server services as a whole.
- Scheduled maintenance can occur on an Application Server node without any effect on the availability of the Application Server services.
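The first two benefits can be quantified with a simple calculation. The sketch below assumes that node failures are independent and that the load balancer itself never fails; both are simplifications. The 99% figure in the usage note is illustrative, not a SummitAI measurement.

```python
def combined_availability(node_availability: float, node_count: int) -> float:
    """Availability of a pool that is up while at least one node is up.

    Assumes independent node failures (a simplification).
    """
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # The pool is down only when every node is down at the same time.
    return 1.0 - (1.0 - node_availability) ** node_count
```

Under these assumptions, two nodes that are each available 99% of the time give a pool availability of about 99.99%.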
High Availability for Database Server
Microsoft SQL Server provides the following HA technologies:
Log Shipping
Log Shipping is a High Availability solution at the database level. It provides critical databases with a manageable recovery point and recovery time. It involves a primary database server and one or more secondary servers, which can also be used for reporting.
Database Mirroring
Database Mirroring is one of the High Availability solutions. It is configured on databases that use the full recovery model. It involves a primary server, a secondary server, and an optional third server (the witness). The optional server does the following actions:
- monitors the connection between the primary and the secondary servers
- ensures the availability of servers
- performs the automatic failover
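The role of the optional third server (called the witness in SQL Server terminology) in automatic failover can be sketched as a simple decision rule. This is a simplified model, not SQL Server's internal logic: it assumes mirroring runs in synchronous (high-safety) mode, and the function and parameter names are hypothetical.

```python
def can_fail_over_automatically(
    primary_reachable: bool,
    secondary_sees_witness: bool,
    secondary_synchronized: bool,
) -> bool:
    """Decide whether the secondary may take over without operator action.

    Simplified rule: the primary must be lost, the secondary and the
    witness must still see each other (so they agree the primary is gone),
    and the secondary must hold a synchronized copy of the database.
    """
    return (
        not primary_reachable
        and secondary_sees_witness
        and secondary_synchronized
    )
```

Without the optional server, the last two conditions cannot both be confirmed automatically, so failover has to be initiated manually.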
Always On Failover Cluster
Always On Failover Cluster is one of the High Availability solutions. It is an instance-level solution built on the Windows Server Failover Clustering functionality. It uses several servers (nodes) with the same hardware and software components to provide high availability for the failover cluster instance. After the failover cluster is configured and the SQL Server services and resource groups are started on the shared storage, a single cluster node owns the network name and the virtual IPs at any given time.
Always On Availability Groups
Always On Availability Groups is one of the High Availability solutions. It is a database-level solution built on the Windows Server Failover Clustering functionality. It has one primary server and up to eight secondary servers. All databases are available on the primary server, which accepts read/write connections. Secondary servers can accept read-only connections for reporting.
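The split between read/write traffic on the primary and read-only reporting traffic on the secondaries can be sketched as follows. In SQL Server this routing is normally handled by the availability group configuration rather than by application code, so treat this purely as an illustration; the enum, the function, and the host names are hypothetical.

```python
from enum import Enum
from typing import Sequence


class Intent(Enum):
    READ_WRITE = "ReadWrite"
    READ_ONLY = "ReadOnly"


def choose_replica(intent: Intent, primary: str, secondaries: Sequence[str]) -> str:
    """Send reporting traffic to a secondary and everything else to the primary."""
    if intent is Intent.READ_ONLY and secondaries:
        # Simplest possible policy: always use the first listed secondary.
        return secondaries[0]
    # Writes, and reads when no secondary exists, go to the primary.
    return primary
```

For example, choose_replica(Intent.READ_ONLY, "sql-primary", ["sql-secondary-1"]) selects the secondary, while the same call with Intent.READ_WRITE selects the primary.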
For more information, see the web page http://msdn.microsoft.com/en-us/library/ms190202.aspx
Disaster Recovery
In its simplest form, Disaster Recovery (DR) is restoring from a backup. More broadly, it involves setting policies and procedures that enable the recovery or continuation of vital infrastructure and systems following a natural or other disaster. A DR setup involves a primary and a secondary data center at different geographical locations, so that a natural calamity affecting one site does not impact IT services. The primary and secondary sites are kept in sync with each other.
For example, if the primary data center experiences a catastrophic event, such as an earthquake or flood, the secondary data center should be brought online to serve users. Business continuity requires a dependable setup, and a backup plan that fits the budget must be in place in advance of a disaster. There are two main structures for backup architecture:
Active-Active
Active-Active architecture optimizes for uptime. In this setup, the DR Site servers are in sync with the Primary Site servers. When one or more servers fail, the remaining servers automatically balance the load.
Figure: SummitAI Always On Active-Active Architecture
The above architecture has the following infrastructure:
- Database (Enterprise Edition) configured in Always On in synchronous mode for a two-node cluster and in asynchronous mode with the DR site.
- Application servers within the same data center in load balancing mode.
- F5 Networks BIG-IP DNS (Global Traffic Manager) to route the traffic to the available zone. It must be configured to route traffic from all geographies to the primary site and, when the primary site is down, to the secondary site.
- A Failover Cluster with automatic failover.
- A Recovery Time Objective (RTO) of five minutes: the target time within which the business process is restored after a disaster, to avoid the unacceptable consequences of a break in continuity.
- A Recovery Point Objective (RPO) of 15 minutes: the maximum window of data, measured in time, that may be lost following a disruptive event such as a cyberattack, natural disaster, or communications failure.
- Manual failback, because the database does not automatically shift back to the primary DC upon recovery.
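The RTO and RPO targets in the list above can be checked against an incident timeline. The sketch below is a minimal illustration; the function name and the timestamps in the usage note are hypothetical, and only the five-minute RTO and 15-minute RPO come from this document.

```python
from datetime import datetime, timedelta
from typing import Tuple


def meets_objectives(
    last_replicated_at: datetime,
    failure_at: datetime,
    service_restored_at: datetime,
    rpo: timedelta,
    rto: timedelta,
) -> Tuple[bool, bool]:
    """Return (rpo_met, rto_met) for one incident."""
    # RPO looks backwards from the failure: how much recent data was lost?
    data_loss_window = failure_at - last_replicated_at
    # RTO looks forwards from the failure: how long was the service down?
    downtime = service_restored_at - failure_at
    return data_loss_window <= rpo, downtime <= rto
```

For instance, if the last replication completed 10 minutes before the failure and service was restored 4 minutes after it, both the 15-minute RPO and the five-minute RTO are met.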
Active-Passive
Active-Passive architecture is less complex. In this setup, the DR Site maintains a separate set of critical infrastructure that sits idle until it is needed.
Figure: SummitAI Always On Active-Passive Architecture
The above architecture has the following infrastructure:
- Database (Enterprise Edition) configured in Always On in synchronous mode for a two-node cluster and with log shipping enabled for the secondary DC.
- Application servers within the same data center in load balancing mode.
- A Failover Cluster with manual failover.
- A Recovery Time Objective (RTO) of less than three hours: the target time within which the business process is restored after a disaster, to avoid the unacceptable consequences of a break in continuity.
- A Recovery Point Objective (RPO) of 30 minutes: the maximum window of data, measured in time, that may be lost following a disruptive event such as a cyberattack, natural disaster, or communications failure.
- Manual failback, because the database does not automatically shift back to the primary DC upon recovery.
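With log shipping to the secondary DC, the achievable recovery point depends on how often log backups are taken and how long they take to reach the secondary site. The sketch below is a rough worst-case estimate, assuming that any log not yet copied off the primary is lost with it; the interval and delay values in the usage note are hypothetical, and only the 30-minute RPO comes from this document.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta, copy_delay: timedelta) -> timedelta:
    """Upper bound on lost data if the primary fails just before a copy completes.

    At that moment the newest backup on the secondary is one full interval
    older than the backup still in transit, plus the time the copy has taken.
    """
    return backup_interval + copy_delay


def fits_rpo(backup_interval: timedelta, copy_delay: timedelta, rpo: timedelta) -> bool:
    """Check whether the worst-case data loss stays within the RPO."""
    return worst_case_data_loss(backup_interval, copy_delay) <= rpo
```

With a 15-minute backup interval and a 5-minute copy delay, the worst case is 20 minutes, which fits a 30-minute RPO; a 30-minute interval with the same delay would not.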
Disaster Recovery for Application Server
For Disaster Recovery, you must keep an Application Server and its database server at a secondary site. Ensure that there is enough distance between the primary site and the secondary site, so that the secondary site can continue operating if the primary site becomes completely inoperative due to the loss or damage of any infrastructure. Usually, the Application Server at the secondary site is inactive and is started only when there is a disaster at the primary site.
To ensure that clients configured to use the Application Server at the primary site are served by the secondary site, you must update the following configurations:
- Move the virtual IP address (VIP) from the Application Server at the primary site to the Application Server at the secondary site.
- Update the Domain Name System (DNS) entry for the VIP to point to the Application Server at the secondary site.
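After both updates, it is worth confirming that the service name resolves to the secondary-site address before directing users to it. The sketch below is one way to do that check, assuming an IPv4 VIP; the function name, host name, and IP address are hypothetical placeholders. The resolver is passed in as a parameter so the check can also be exercised without network access.

```python
import socket
from typing import Callable


def dns_points_to(
    hostname: str,
    expected_ip: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> bool:
    """Return True if hostname currently resolves to expected_ip."""
    try:
        return resolve(hostname) == expected_ip
    except OSError:
        # Resolution failures (socket.gaierror is a subclass of OSError)
        # are treated as "not pointing at the expected site".
        return False
```

For example, dns_points_to("summit.example.com", "203.0.113.10") returns True only once the DNS change has taken effect for the machine running the check; cached DNS entries elsewhere can lag until their time-to-live expires.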
Following are the benefits of a disaster recovery architecture for the Application Servers:
- No extended downtime during the loss of Application Server services at the primary site.
- Temporary failover to the secondary site to ensure the continuous availability of Application Server services.
Disaster Recovery for Database Server
A Database Server disaster recovery plan (DRP) is a process for getting the Database Server up and running again and limiting data loss after a disaster. A good disaster recovery plan must take numerous factors into account, such as the sensitivity of the data, data loss tolerance, and required availability. You can plan based on the following solutions:
Failover Clustering
Failover clustering is a setup in which a Database Server is installed on shared storage. It provides the infrastructure that supports High Availability and Disaster Recovery scenarios. If a cluster node fails, the services hosted on that node are transferred, automatically or manually, to another available node in a process known as failover. There is a short period of downtime while SQL Server fails over.
Database Mirroring
For SQL Database Servers, Database Mirroring is a solution for Disaster Recovery. Database mirroring increases the availability of the database after a disaster event. Database mirroring maintains a single standby or mirror database for a primary database. There can be two types of mirror database servers: hot and warm. A hot mirror server contains synchronized sessions with quick failover time without any data loss. A warm mirror server doesn’t have synchronized sessions, and there is a possibility of data loss.