
Achieving Intelligent High Availability for SAP HANA on AWS


In IT, few names are more significant than SAP and AWS. Countless organizations rely on SAP’s Enterprise Resource Planning (ERP) systems to run their mission-critical processes, while AWS holds an astonishing 33% share of the cloud services hosting market.

The widespread reliance on both means that any organization aiming for high availability for its essential systems (in the region of 99.99% uptime) must be familiar with how SAP ERP and AWS interact. It’s also vital to look to the long term, which means becoming familiar with SAP’s newer HANA environment ahead of the 2027 cut-off for affordable legacy support.

More than half the organizations already running SAP on the AWS platform have switched to HANA-based solutions. However, when developing the architecture for an SAP HANA system, it’s essential to plan for a variety of possible scenarios long before they occur.

Bugs, crashes, and countless external factors can take the most securely designed database offline. This unplanned downtime can significantly impact any business, with consequences ranging from lost productivity to reputational damage.

This article runs through three best practices for achieving maximum resiliency when running high-availability SAP HANA solutions on AWS.

Mitigate Single Points of Failure

As noted above, failure is virtually impossible to avoid entirely. The nature of servers and complex IT systems means that bugs and errors are nearly inevitable. That’s why it’s essential to plan for failure and make systems as reliable as possible, and why one of the first steps in setting up an HA operation is either eliminating or mitigating single points of failure.

These are components of your configuration where a single issue, whether that’s a simple human error, a crashing server, or a flash flood, is enough to take the entire system offline. The most practical way to eliminate them is to build redundancy into every component and operation, so there is something to switch to quickly if (and when) the worst happens.

Before we even begin to consider the technical aspect of keeping an HA service online, we should address the simple practicality of the servers’ physical locations. After all, if one of the concerns we’re looking to mitigate is an unexpected power outage or a localized natural disaster (a flooded or storm-damaged data center), having your secondary server sitting next to, and plugged into the same supply as, your primary server isn’t much help.

This location question is one area where working on the AWS platform has a major advantage. Some cloud providers may place HA servers just one rack over from the primary server, but AWS can locate them in different Availability Zones (AZs). These zones are linked by fiber connections but are physically separated, typically by more than a mile, and each has its own isolated power supply. This separation can help to mitigate risks that operate on a human scale.
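To make the placement idea concrete, here is a minimal sketch, assuming the boto3 SDK and placeholder AMI, subnet, and instance-type values, of how a primary and a standby node might be launched into subnets that sit in two different Availability Zones. It illustrates the separation principle only; a real SAP HANA deployment involves many more parameters.

# Minimal sketch (not production code): launch a primary and a standby EC2
# instance into subnets that sit in two different Availability Zones.
# The AMI ID, instance type, and subnet IDs below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def launch_node(subnet_id, role):
    """Launch one SAP HANA node into the given subnet (and therefore AZ)."""
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # placeholder HANA-certified AMI
        InstanceType="r5.8xlarge",         # example memory-optimized type
        MinCount=1,
        MaxCount=1,
        SubnetId=subnet_id,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "Role", "Value": role}],
        }],
    )
    return response["Instances"][0]["InstanceId"]

# Subnet A and subnet B live in different AZs, so the two nodes never share
# a power supply, cooling system, or physical facility.
primary_id = launch_node("subnet-aaaa1111", "hana-primary")
standby_id = launch_node("subnet-bbbb2222", "hana-standby")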

If the only points of failure in our architecture were the servers themselves, achieving high availability would be relatively easy. However, there are also several areas of risk within the architecture itself. To prevent these from acting as single points of failure, we need to ensure we’re using the right technology to replicate their data and restore application operation in the event of a crash.

Clustering technology can provide high availability protection for mission-critical applications running in AWS environments by eliminating single points of failure. In an AWS environment, two or more nodes can be configured in a failover cluster using clustering software that monitors the health of the application running on a primary node and, if an issue arises, orchestrates failover of application operation to a secondary node.
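As a rough illustration of what that orchestration involves, the sketch below, assuming boto3 and placeholder route table, overlay IP, and instance IDs, probes the primary node and, on failure, repoints an overlay IP route at the secondary node. This is only the bare idea; production clustering software adds quorum, fencing, and application-aware health checks.

# Bare-bones illustration of monitor-and-failover. All IDs are placeholders.
import time
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ROUTE_TABLE_ID = "rtb-0example"        # route table used by SAP clients
OVERLAY_IP_CIDR = "192.168.100.10/32"  # overlay IP that clients connect to
PRIMARY_ID = "i-0primary"
SECONDARY_ID = "i-0secondary"

def primary_is_healthy() -> bool:
    """Placeholder health probe; real cluster software checks the HANA
    processes, the OS, and the network path, not just the EC2 status."""
    status = ec2.describe_instance_status(InstanceIds=[PRIMARY_ID])
    checks = status.get("InstanceStatuses", [])
    return bool(checks) and checks[0]["InstanceStatus"]["Status"] == "ok"

def fail_over_to_secondary():
    """Point the overlay IP route at the secondary node so client traffic
    follows the healthy instance."""
    ec2.replace_route(
        RouteTableId=ROUTE_TABLE_ID,
        DestinationCidrBlock=OVERLAY_IP_CIDR,
        InstanceId=SECONDARY_ID,
    )

while True:
    if not primary_is_healthy():
        fail_over_to_secondary()
        break
    time.sleep(30)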

Select Ideal Data Replication Technology

Beyond the primary database, several other vital services need to be maintained to keep critical applications working as expected. These also need to be replicated so that, in the event of a failover, they are available in an application-ready state to continue operation.

The first aspect we need to consider is the ABAP SAP Central Services (ASCS) instance, which contains an enqueue server and a message server. The database lock tables are stored in the enqueue server, which plays a vital role in preventing unsynchronized writes to the database and ensuring that changes can be rolled back.

The way to protect these lock tables is to replicate them using an enqueue replication server (ERS). This maintains an up-to-date replica of the lock table, so the entries are safeguarded if the ASCS instance goes offline for any reason. Ideally, you don’t want to run the ASCS and ERS on the same cluster node, as that reintroduces a single point of failure. Therefore, any recovery kit must have the intelligence to ensure that those pieces are always running on different cluster nodes.
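The rule itself is simple enough to state in a few lines. The toy check below, with hypothetical resource and node names, captures the anti-colocation logic that clustering software enforces as a placement constraint: the ASCS and ERS resources must never land on the same node.

# Toy illustration of the "never co-locate ASCS and ERS" rule. Resource and
# node names are hypothetical.
def placement_is_valid(placement: dict) -> bool:
    """placement maps resource name -> cluster node currently hosting it."""
    return placement.get("ASCS") != placement.get("ERS")

# Healthy layout: lock table (ASCS) and its replica (ERS) on separate nodes,
# so losing one node never loses both copies.
assert placement_is_valid({"ASCS": "node-a", "ERS": "node-b"})

# Invalid layout: a single node failure would take out the lock table and
# its replica at the same time.
assert not placement_is_valid({"ASCS": "node-a", "ERS": "node-a"})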

Another single point of failure for the SAP system is the network file system (NFS). One of the simplest ways to ensure this is properly replicated is with Amazon Elastic File System (EFS), a fully managed NFS service handled by AWS. Because EFS is a regional service, it is automatically available across the Availability Zones in a Region.
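As a hedged example, the following sketch, assuming boto3 and placeholder subnet and security group IDs, creates an EFS file system and a mount target in each Availability Zone so that whichever cluster node is active can reach the shared SAP file systems.

# Minimal sketch: create an Amazon EFS file system and expose it through
# mount targets in two Availability Zones. IDs below are placeholders.
import time
import boto3

efs = boto3.client("efs", region_name="us-east-1")

# Create the shared file system (CreationToken makes the call idempotent).
fs = efs.create_file_system(
    CreationToken="sap-shared-fs",
    PerformanceMode="generalPurpose",
    Encrypted=True,
)
fs_id = fs["FileSystemId"]

# Wait until the file system is available before adding mount targets.
while efs.describe_file_systems(FileSystemId=fs_id)["FileSystems"][0][
        "LifeCycleState"] != "available":
    time.sleep(5)

# One mount target per AZ keeps the NFS share reachable even if a zone fails.
for subnet_id in ("subnet-aaaa1111", "subnet-bbbb2222"):
    efs.create_mount_target(
        FileSystemId=fs_id,
        SubnetId=subnet_id,
        SecurityGroups=["sg-0example"],
    )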
