Wednesday, 26 October 2016

High Availability for Amazon VPC NAT Instances: An Example

This article provides all the necessary resources, including an easy-to-use script and instructions, for using bidirectional monitoring between two NAT instances to implement a high availability (HA) failover solution for network address translation (NAT).


Learn how you can leverage bidirectional monitoring between two NAT instances to implement a high availability (HA) failover solution for network address translation (NAT). This article provides all the necessary resources, including an easy-to-use script and instructions, to create an architecture where two NAT instances monitor each other; if one instance fails, the functional instance handles Internet-bound traffic for both.

Overview

In Amazon Virtual Private Cloud (VPC), you can use private subnets for instances that you do not want to be directly addressable from the Internet. Instances in a private subnet can access the Internet without exposing their private IP address by routing their traffic through a Network Address Translation (NAT) instance in a public subnet. A NAT instance, however, can introduce a single point of failure to your VPC's outbound traffic. This situation is depicted in the diagram below.
Figure 1a: Internet-bound traffic through a NAT instance
Figure 1b: Internet-bound traffic interrupted during NAT failure
One approach to this situation is to use multiple NAT instances that can take over for each other if one of them fails. This walkthrough and the associated monitoring script (nat_monitor.sh) provide instructions for building an HA scenario where two NAT instances in separate Availability Zones continuously monitor each other. If one NAT instance fails, this script enables the working NAT instance to take over outbound traffic and attempts to fix the failed instance by stopping and restarting it. This script is a modification of the virtual IP monitor and takeover script demonstrated during the AWS re:Invent CPN207: Virtual Networking in the Cloud session.
This approach, where two instances independently monitor each other, has a known edge case that occurs in the unlikely event that network connectivity is broken between two AZs but both instances can still communicate with the EC2 API endpoints. Please see Appendix A for more information about this edge case.
To set up a pair of self-monitoring NAT instances, follow these steps:
  1. Create an Amazon Virtual Private Cloud.
  2. Create an Amazon Elastic Compute Cloud (EC2) role in AWS Identity and Access Management (IAM).
  3. Launch two Linux Amazon EC2 NAT instances, one into each of your VPC's public subnets.
  4. Configure Elastic IP addresses for your NAT instances.
  5. Create route tables for private subnet Internet-bound traffic.
  6. Download and configure the nat_monitor.sh script.
  7. Test your configuration.
This will launch a fully functional sample stack as shown in the diagrams below:
Figure 2a: Dual, active NAT instances
Figure 2b: Route takeover on NAT failure
Figure 2c: Active instance attempts to stop the failed instance
Figure 2d: Once stopped, the active instance will restart the failed instance
Figure 2e: Dual, active NAT instances and routing restored
For the purpose of this article, we will deploy the following:
  1. An Amazon VPC with two public subnets, two private subnets, and an Internet gateway.
  2. An Amazon EC2 role authorizing the EC2 NAT instances to take over Internet-bound traffic on a partner node failure.
  3. Two self-monitoring Linux Amazon EC2 NAT instances.
  4. Route tables to appropriately route Internet-bound traffic for all subnets.
  5. Two Elastic IP addresses, one for each NAT instance.

Step 1. Create an Amazon Virtual Private Cloud

Start by provisioning our networking infrastructure. To do this, navigate to the VPC console in the AWS Management Console and use the VPC Wizard to create a VPC with a Single Public Subnet Only. For detailed instructions on this step, see the VPC Getting Started Guide.
Click Subnets in the navigation pane and create three additional subnets: one subnet in the same AZ as the existing subnet (10.0.1.0/24 in this example) and two additional subnets in a different AZ (10.0.2.0/24 and 10.0.3.0/24 in this example). For additional information about creating subnets in your VPC, see the Amazon Virtual Private Cloud documentation.
Figure 3: Two subnets in US-EAST-1A AZ and two subnets in US-EAST-1C AZ
Click Security Groups in the navigation pane and create a VPC security group to control traffic in and out of your NAT instances. For additional information about creating VPC security groups, see the Amazon Virtual Private Cloud documentation. In Figure 4 below, a NAT_Monitor security group was created that allows SSH traffic for remote administration and ICMP between the members of the security group. It also allows ingress traffic from instances within the VPC (10.0.0.0/16 CIDR block in this example) to support Internet-bound traffic through each NAT instance.
Figure 4: VPC NAT_Monitor security group for NAT nodes
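The console steps in this section can also be approximated from the command line. The following is a minimal sketch, assuming the AWS CLI is installed and configured with credentials that may create VPC resources. The function names create_subnet and allow_nat_ingress are hypothetical, all IDs are supplied by the caller, and allowing every protocol from the VPC CIDR block is an assumption about what "ingress traffic from instances within the VPC" should cover.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical; IDs are supplied by the caller.

# create_subnet <vpc-id> <cidr-block> <availability-zone>
create_subnet() {
    aws ec2 create-subnet --vpc-id "$1" --cidr-block "$2" --availability-zone "$3"
}

# allow_nat_ingress <security-group-id> <vpc-cidr> <admin-cidr>
allow_nat_ingress() {
    # SSH for remote administration from the administrator's address range
    aws ec2 authorize-security-group-ingress --group-id "$1" \
        --ip-permissions "IpProtocol=tcp,FromPort=22,ToPort=22,IpRanges=[{CidrIp=$3}]" || return 1
    # ICMP between members of the security group (used by the health-check pings)
    aws ec2 authorize-security-group-ingress --group-id "$1" \
        --ip-permissions "IpProtocol=icmp,FromPort=-1,ToPort=-1,UserIdGroupPairs=[{GroupId=$1}]" || return 1
    # Assumption: all protocols from inside the VPC, so private instances can use the NAT
    aws ec2 authorize-security-group-ingress --group-id "$1" \
        --ip-permissions "IpProtocol=-1,IpRanges=[{CidrIp=$2}]"
}
```

For example, create_subnet vpc-xxxxxxxx 10.0.1.0/24 us-east-1a would request the first private subnet, where vpc-xxxxxxxx stands in for your own VPC ID.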

Step 2. Create an EC2 Role in IAM

Next, you need to create an EC2 role that grants the NAT instances permission to take over routing if the other NAT instance fails. Navigate to the IAM console in the AWS Management Console, click Roles in the navigation pane, and click Create New Role. Give the new role a descriptive name (NAT_Monitor in this example) and click Continue.
Figure 5: NAT_Monitor EC2 role in IAM
Select the AWS service role for Amazon EC2, click Custom Policy, and click Select. Provide a Policy Name (NAT_Monitor_Policy in this example) and enter the following Policy Document:
 {
   "Statement": [
     {
       "Action": [
         "ec2:DescribeInstances",
         "ec2:CreateRoute",
         "ec2:ReplaceRoute",
         "ec2:StartInstances",
         "ec2:StopInstances"
       ],
       "Effect": "Allow",
       "Resource": "*"
     }
   ]
 }
 
Figure 6: Role policy document for the NAT_Monitor_Policy
Click Create Role to finish creating the EC2 role.
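If you prefer to script this step, the role can also be created with the AWS CLI. The sketch below is one possible approach, not the article's method: it writes the policy document from this step to a file, adds the standard trust policy that lets EC2 assume the role, and defines a helper that issues the IAM calls. The file names and the function name create_nat_monitor_role are hypothetical. Note that the console creates an instance profile for you, while the CLI needs the two extra calls shown at the end.

```shell
#!/bin/sh
# Sketch only: file names and the function name are hypothetical.

# Trust policy that lets EC2 instances assume the role
cat > nat_monitor_trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Permissions policy from this step (NAT_Monitor_Policy)
cat > nat_monitor_policy.json <<'EOF'
{
  "Statement": [
    {
      "Action": [
        "ec2:DescribeInstances",
        "ec2:CreateRoute",
        "ec2:ReplaceRoute",
        "ec2:StartInstances",
        "ec2:StopInstances"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}
EOF

# Create the role, attach the inline policy, and wrap it in an instance profile
create_nat_monitor_role() {
    aws iam create-role --role-name NAT_Monitor \
        --assume-role-policy-document file://nat_monitor_trust.json || return 1
    aws iam put-role-policy --role-name NAT_Monitor \
        --policy-name NAT_Monitor_Policy \
        --policy-document file://nat_monitor_policy.json || return 1
    aws iam create-instance-profile --instance-profile-name NAT_Monitor || return 1
    aws iam add-role-to-instance-profile --instance-profile-name NAT_Monitor \
        --role-name NAT_Monitor
}
```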

Step 3. Launch Two Linux Amazon EC2 NAT Instances into Your VPC's Public Subnets

Now that you have created an Amazon VPC, it's time to launch two Linux Amazon EC2 instances into the VPC. These will serve as your NAT instances.
Navigate to the EC2 console in the AWS Management Console and click Launch Instance on the EC2 Dashboard. Select Quick Launch Wizard under Create New Instance. Provide a name for your instance (NAT Node #1 in this example). Select an existing AWS key pair (or create a new one). An AWS key pair is a public/private key pair that lets you securely connect to your instance after it launches. For a short tutorial on how to create a new key pair, watch the Amazon EC2 - Creating a Key Pair video. Select Amazon Linux AMI (64-bit option) and click Continue.
Figure 7: Launching NAT Node #1
Next click Edit Details, check Launch into a VPC, and select the public subnet initially created in your VPC.
Figure 8: Editing NAT Node #1 VPC Details
Click the Security Settings section and select the security group you created in the previous section (NAT_Monitor in this example).
Figure 9: Editing NAT Node #1 security settings
Click the Advanced Details section and, for IAM Role, select the role you created in the previous step (NAT_Monitor in this example). Then scroll down to add an IP address (10.0.0.11 in this example).
Figure 10: Editing NAT Node #1 Advanced Details - IAM Role
Figure 11: Editing NAT Node #1 Advanced Details - IP Address
Click Save details to save your configuration modifications. Now click Launch and then Close after the instance has been launched.
Launch a second NAT instance using the same settings as the first instance above, with the following changes:
  • Provide an appropriate name for the second instance (NAT Node #2 in this example).
  • Select your second public subnet (10.0.2.0/24 in this example) under Instance Details.
  • Under Advanced Details, set the IP address within the second public subnet CIDR block (10.0.2.11 in this example). You still need to check Launch into a VPC (under Instance Details), select your NAT_Monitor security group (under Security Settings), and select your NAT_Monitor IAM role (under Advanced Details).
Figure 12: EC2 console after NAT instances launched
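The launch settings above map fairly directly onto a single AWS CLI call. The following is a hedged sketch: launch_nat_node is a hypothetical helper, the instance type is an assumption (the article does not name one), and the instance profile is assumed to share the NAT_Monitor role name, which is what the console sets up by default. The AMI ID, key pair name, subnet ID, and security group ID are placeholders you must supply.

```shell
#!/bin/sh
# Sketch only: the helper name and the default instance type are assumptions.
INSTANCE_TYPE=${INSTANCE_TYPE:-t2.micro}

# launch_nat_node <ami-id> <key-name> <subnet-id> <private-ip> <security-group-id>
# Prints the ID of the launched instance.
launch_nat_node() {
    aws ec2 run-instances --image-id "$1" --instance-type "$INSTANCE_TYPE" \
        --key-name "$2" --subnet-id "$3" --private-ip-address "$4" \
        --security-group-ids "$5" --iam-instance-profile Name=NAT_Monitor \
        --query 'Instances[0].InstanceId' --output text
}
```

Calling the helper once with 10.0.0.11 and the first public subnet, and once with 10.0.2.11 and the second public subnet, would correspond to NAT Node #1 and NAT Node #2 in this example.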
You now have two NAT instances running in your VPC. For a NAT instance to perform network address translation, you must disable source/destination checking on each instance. By default, each EC2 instance performs source and destination checking, which means the instance must be the source or destination of any traffic it sends or receives. However, a NAT instance needs to be able to send and receive traffic where the eventual source or destination is not the NAT instance itself. Disabling source/destination checking on the NAT instance accomplishes this.
To do that, right-click each NAT instance in the Instances pane and select Change Source / Dest. Check. This attribute should be disabled for each NAT instance. Click Yes, Disable.
Figure 13: Change Source/Dest. Check dialog box
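The same attribute can be changed with the AWS CLI, which is convenient when both instances need the change. This is a minimal sketch, assuming the AWS CLI is configured; the function name disable_source_dest_check is hypothetical, and the instance IDs in the usage note are the ones that appear later in this example.

```shell
#!/bin/sh
# Sketch only: disable source/destination checking on every instance ID given.
disable_source_dest_check() {
    for id in "$@"; do
        aws ec2 modify-instance-attribute --instance-id "$id" --no-source-dest-check || return 1
    done
}
```

For this walkthrough, the call would be disable_source_dest_check i-f821b388 i-12990462.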

Step 4. Configure Elastic IP Addresses for Your NAT Instances

After the EC2 NAT instances have launched, you will need to create an Elastic IP address (EIP) for each instance. Each instance uses its own EIP to connect to the Internet through the Internet gateway.
Navigate to the Elastic IPs section of the EC2 console under Network & Security in the navigation pane, and allocate two new VPC EIPs by clicking Allocate New Address.
Figure 14: Allocate VPC EIP dialog box.
Now you must associate the first EIP to NAT Node #1. Select the Elastic IP address from the list, and then click Associate Address. In the Associate Address dialog box, select the NAT Node #1 instance. By default the association will map to NAT Node #1's primary private IP address (10.0.0.11 in this example; notice the asterisk next to the IP denoting this address as the primary private IP). Now click Yes, Associate to complete the EIP association. Repeat these steps to associate the second EIP to NAT Node #2 as depicted below.
Figure 15: Associating EIPs to primary private IPs
Figure 16: Associating EIPs to primary private IPs
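Allocating and associating an EIP can also be done in two AWS CLI calls per instance. The sketch below assumes the AWS CLI is configured; attach_new_eip is a hypothetical helper name. As in the console, the association targets the instance's primary private IP address unless told otherwise.

```shell
#!/bin/sh
# Sketch only: allocate a VPC Elastic IP and associate it with one instance.
# attach_new_eip <instance-id>
attach_new_eip() {
    # Allocate the address and keep only its allocation ID
    alloc_id=$(aws ec2 allocate-address --domain vpc \
        --query AllocationId --output text) || return 1
    # Associate the new address with the instance
    aws ec2 associate-address --instance-id "$1" --allocation-id "$alloc_id"
}
```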

Step 5. Create Route Tables for Private Subnet Internet-bound Traffic

Now that your Amazon EC2 NAT instances are configured with EIPs, you can create route tables and rules for the private subnets to send Internet-bound traffic through these NAT instances. Each subnet in your Amazon VPC must be associated with a route table; the table controls the routing for the subnet. You can associate multiple subnets with the same route table, but you can associate a subnet with only one route table.
Navigate to the VPC console in the AWS Management Console and click Route Tables in the navigation pane. You should currently have two route tables associated with the VPC you created above: one route table is designated as the Main route table, and the other is not. If you don't explicitly associate a subnet with a route table, the subnet is implicitly associated with the Main route table. Currently, your VPC routing table associations resemble the diagram below.
Figure 17: Default route table associations for your VPC
When you used the wizard to create your VPC, the two route tables above were created automatically. Route Table A (Main) and Route Table B both route local traffic between instances within the VPC. In addition, Route Table B routes Internet-bound traffic for associated subnets through the VPC Internet gateway. In order for your VPC private subnets to reach the Internet, you need to route their traffic through your NAT instances in the public subnets. To accomplish this, you will create two additional route tables and associate your subnets to resemble the diagram below.
Figure 18: Final route table associations for your VPC
The first step is to associate both public subnets with the route table that routes Internet-bound traffic to your VPC Internet gateway (Route Table B depicted above). Since the initial public subnet (10.0.0.0/24 in this example) in your VPC was automatically associated with "Route Table B" above, you only need to associate the other public subnet (10.0.2.0/24 in this example) with "Route Table B." To do this, select the route table that is not designated as Main (Route Table: rtb-3c2d8c51 in the example below). Notice in the picture below that Internet-bound traffic is routed through the VPC Internet gateway (igw-252d8c48).
Figure 19: Routing Internet-bound traffic through the VPC Internet gateway
Now click the Associations tab and select the second public subnet (10.0.2.0/24 in this example) from the dropdown list, as depicted below.
Figure 20: Select second public subnet to associate with route table
Now click Associate. In the Associate Route Table dialog box, click Yes, Associate. Your route tables should now look similar to the image below.
Figure 21: Route tables after associating the second public subnet
Next, you will create two new route tables that handle routing Internet-bound traffic from the private subnets (10.0.1.0/24 and 10.0.3.0/24 in this example) through the NAT instances in the public subnets. To begin, click Create Route Table. A dialog prompts you to select a VPC for your route table. Select the VPC you previously created in this example and click Yes, Create.
Figure 22: Create Route Table dialog
The next step is to set up the routing rule for Internet-bound traffic within the newly created route table. Select the route table you created above. From the Routes tab, enter 0.0.0.0/0 in the text box in the Destination column of the route table. In the Select a target dropdown, select Enter instance ID. In the Select an Instance dialog box, select the Instance ID for NAT Node #1 and click OK. Your route table should now look similar to the following image.
Figure 23: Adding new route rule to route table
Now click Add. The Create Route dialog box asks for confirmation. Click Yes, Create to finish adding the new route. You should now see the new route in your route table as depicted below.
Figure 24: New route rule added to route table
To complete this route table, you need to associate the private subnet (10.0.1.0/24 in this example) that will send Internet-bound traffic through NAT Node #1. As a reference, see Figure 2a above. As you did before, follow the steps to associate a subnet with a route table:
  1. Click the Associations tab of the routing table.
  2. Select the private subnet that will route Internet-bound traffic through NAT Node #1 (10.0.1.0/24 in this example) from the list in the dropdown box.
  3. Click Associate.
  4. In the Associate Route Table dialog box, click Yes, Associate to complete the association of the subnet to the route table.
Finally, use the steps you just completed to create a second route table that will route Internet-bound traffic for the other private subnet (10.0.3.0/24 in this example) through NAT Node #2. Use the steps below as a guide to create the second route table:
  1. From the Route Tables section of the VPC console, click Create Route Table.
  2. In the Create Route Table dialog, select your VPC and click Yes, Create.
  3. In the Route Tables list, select the route table you just created.
  4. On the Routes tab, enter 0.0.0.0/0 in the text box in the Destination column of the route table. In the Select a target dropdown, select Enter instance ID.
  5. From the Select an Instance dialog box, select the Instance ID for NAT Node #2 and click OK.
  6. Click Add for the route.
  7. In the Create Route dialog box, click Yes, Create to finish adding the new route.
  8. Click the Associations tab of the routing table.
  9. Select the private subnet that will route Internet-bound traffic through NAT Node #2 (10.0.3.0/24 in this example) from the list in the dropdown box.
  10. Click Associate.
  11. In the Associate Route Table dialog box, click Yes, Associate to complete the association of the subnet to the route table.
Your route table listing should look similar to the image below.
Figure 25: VPC route tables
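The three actions repeated for each private subnet (create a route table, add a default route that targets a NAT instance, associate the private subnet) can be sketched with the AWS CLI as follows. This assumes the AWS CLI is configured; create_nat_route_table is a hypothetical helper, and all IDs are supplied by the caller.

```shell
#!/bin/sh
# Sketch only: build one private route table that sends 0.0.0.0/0 to a NAT instance.
# create_nat_route_table <vpc-id> <nat-instance-id> <private-subnet-id>
# Prints the ID of the new route table.
create_nat_route_table() {
    rtb_id=$(aws ec2 create-route-table --vpc-id "$1" \
        --query RouteTable.RouteTableId --output text) || return 1
    # Default route for Internet-bound traffic through the NAT instance
    aws ec2 create-route --route-table-id "$rtb_id" \
        --destination-cidr-block 0.0.0.0/0 --instance-id "$2" >/dev/null || return 1
    # Associate the private subnet with the new table
    aws ec2 associate-route-table --route-table-id "$rtb_id" \
        --subnet-id "$3" >/dev/null || return 1
    echo "$rtb_id"
}
```

Running the helper once per private subnet yields the two route table IDs that Step 6 needs for NAT_RT_ID and My_RT_ID.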

Step 6. Download and Configure the nat_monitor.sh Script

Connect to NAT Node #1. Change to the root user, navigate to the root user's home directory, update the AWS API tools, and configure the instance to run as a port address translator with the following commands:
[ec2-user@ip-10-0-0-11 ~]$ sudo -s
[root@ip-10-0-0-11 ec2-user]# cd /root
[root@ip-10-0-0-11 ~]# yum update aws*
[root@ip-10-0-0-11 ~]# echo 1 > /proc/sys/net/ipv4/ip_forward
[root@ip-10-0-0-11 ~]# echo 0 > /proc/sys/net/ipv4/conf/eth0/send_redirects
[root@ip-10-0-0-11 ~]# /sbin/iptables -t nat -A POSTROUTING -o eth0 -s 0.0.0.0/0 -j MASQUERADE
[root@ip-10-0-0-11 ~]# /sbin/iptables-save > /etc/sysconfig/iptables
[root@ip-10-0-0-11 ~]# mkdir -p /etc/sysctl.d/
[root@ip-10-0-0-11 ~]# cat <<EOF > /etc/sysctl.d/nat.conf
net.ipv4.ip_forward = 1
net.ipv4.conf.eth0.send_redirects = 0
EOF
[root@ip-10-0-0-11 ~]# 
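Before moving on, it can be worth confirming that the kernel settings took effect. The following is a small sketch for that purpose; check_setting and report_nat_settings are hypothetical helper names, and the paths are the same /proc entries written by the commands above (eth0 is assumed to be the instance's network interface, as in this example).

```shell
#!/bin/sh
# Sketch only: verify the forwarding settings applied in this step.

# check_setting <file> <expected-value>: succeeds when the file holds the expected value
check_setting() {
    [ "$(cat "$1" 2>/dev/null)" = "$2" ]
}

# report_nat_settings: print the state of the two kernel settings from this step
report_nat_settings() {
    if check_setting /proc/sys/net/ipv4/ip_forward 1; then
        echo "ip_forward: enabled"
    else
        echo "ip_forward: NOT enabled"
    fi
    if check_setting /proc/sys/net/ipv4/conf/eth0/send_redirects 0; then
        echo "eth0 send_redirects: disabled"
    else
        echo "eth0 send_redirects: NOT disabled"
    fi
}
```

The MASQUERADE rule itself can be listed as root with /sbin/iptables -t nat -L POSTROUTING -n.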
Now download the nat_monitor.sh script and make it executable with the following commands:
[root@ip-10-0-0-11 ~]# wget https://media.amazonwebservices.com/articles/nat_monitor_files/nat_monitor.sh
[root@ip-10-0-0-11 ~]# chmod a+x nat_monitor.sh
Edit the following variables to match your settings for NAT Node #1:
  • NAT_ID - The instance ID of the NAT Node #2 instance that this script will be monitoring (i-12990462 in this example).
  • NAT_RT_ID - The ID of the route table routing Internet-bound traffic through the NAT Node #2 instance that this script will be monitoring. This is the route table to be updated to redirect NAT traffic when NAT Node #2 instance fails (rtb-969a23fb in this example).
  • My_RT_ID - The ID of the route table routing Internet-bound traffic through this instance, NAT Node #1. This is the route table that will be updated when NAT Node #1 instance is healthy (rtb-f8e35095 in this example).
  • EC2_URL - This should point to the EC2 URL of the region the NAT instances are running in (e.g., https://ec2.us-east-1.amazonaws.com for NAT instances running in the US East Region, as in this example).
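These variables can be edited by hand in any text editor. If you would rather script the edit, the sketch below shows one way to do it; set_monitor_var is a hypothetical helper that rewrites only the line beginning with the given variable name. It assumes the value contains no "|" or "&" characters, which holds for instance IDs, route table IDs, and the EC2 URL.

```shell
#!/bin/sh
# Sketch only: set NAME=value on the line that starts with NAME= in a script file.
# set_monitor_var <script-file> <NAME> <value>
set_monitor_var() {
    tmp="$1.tmp.$$"
    # Write the edited copy, then copy it back so the file keeps its permissions
    sed "s|^$2=.*|$2=$3|" "$1" > "$tmp" && cat "$tmp" > "$1" && rm -f "$tmp"
}
```

On NAT Node #1 in this example, the calls would be set_monitor_var nat_monitor.sh NAT_ID i-12990462, then NAT_RT_ID rtb-969a23fb, My_RT_ID rtb-f8e35095, and EC2_URL https://ec2.us-east-1.amazonaws.com.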
If desired, adjust the following health check variables:
  • Num_Pings - This is the number of times the health check will ping NAT Node #2. The default is 3 pings. NAT Node #2 will only be considered unhealthy if all pings fail.
  • Ping_Timeout - The number of seconds to wait for each ping response before determining that the ping has failed. The default is one second.
  • Wait_Between_Pings - The number of seconds to wait between health checks. The default is two seconds. Therefore, by default, the health check will perform 3 pings with 1-second timeouts and a 2-second break between checks, resulting in a total time of 5 seconds between each aggregate health check.
  • Wait_for_Instance_Stop - The number of seconds to wait for NAT Node #2 to stop before attempting to stop it again (if it hasn't stopped already). The default is 60 seconds.
  • Wait_for_Instance_Start - The number of seconds to wait for NAT Node #2 to restart before resuming health checks again. The default is 300 seconds.
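To see how these settings combine, the arithmetic described for Wait_Between_Pings can be reproduced in a few lines of shell. This is only an approximation that mirrors the article's own calculation (pings times timeout, plus the wait); actual timing depends on how ping behaves on your system. The variable cycle_seconds is introduced here for illustration.

```shell
#!/bin/sh
# Sketch only: approximate length of one health-check cycle with the default values.
Num_Pings=3
Ping_Timeout=1
Wait_Between_Pings=2

cycle_seconds=$((Num_Pings * Ping_Timeout + Wait_Between_Pings))
echo "Approximate seconds per health-check cycle: $cycle_seconds"
```

With the defaults this prints 5, matching the figure quoted above; changing any of the three values shows how the detection time grows.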
Configure nat_monitor.sh to be started by cron at boot and start nat_monitor.sh:
[root@ip-10-0-0-11 ~]# echo '@reboot /root/nat_monitor.sh >> /tmp/nat_monitor.log' | crontab
[root@ip-10-0-0-11 ~]# ./nat_monitor.sh >> /tmp/nat_monitor.log &
[root@ip-10-0-0-11 ~]#
Verify that the script is running by viewing the log file:
[root@ip-10-0-0-11 ~]# tail /tmp/nat_monitor.log 
Fri Feb 8 13:47:23 UTC 2013 -- Starting NAT monitor
Fri Feb 8 13:47:23 UTC 2013 -- Adding this instance to rtb-f8e35095 default route on start
ROUTE i-f821b388 0.0.0.0/0
[root@ip-10-0-0-11 ~]# 
Now connect to NAT Node #2 and issue the same commands as you did previously on NAT Node #1. However, in this case, configure nat_monitor.sh with the following settings:
  • NAT_ID - This should point to NAT Node #1 (i-f821b388 in this example).
  • NAT_RT_ID - This should point to the route table routing Internet-bound traffic through NAT Node #1 (rtb-f8e35095 in this example).
  • My_RT_ID - This should point to the route table routing Internet-bound traffic through NAT Node #2 (rtb-969a23fb in this example).
  • EC2_URL - This should point to the EC2 URL of the region the NAT instances are running in (e.g., https://ec2.us-east-1.amazonaws.com for NAT instances running in the US East Region, as in this example).
You may also adjust the health check variables to match NAT Node #1 or use different variables to perform the reverse health check at a different interval.

Step 7. Test Your Configuration

All done! Now it's time to test your configuration. Watch the nat_monitor.log file on NAT Node #1 while you stop NAT Node #2, and observe the script take over the routing for Internet-bound traffic of NAT Node #2.

NAT Node #1

[root@ip-10-0-0-11 ~]# tail /tmp/nat_monitor.log 
Fri Feb 8 13:47:23 UTC 2013 -- Starting NAT monitor
Fri Feb 8 13:47:23 UTC 2013 -- Adding this instance to rtb-f8e35095 default route on start
ROUTE i-f821b388 0.0.0.0/0
Fri Feb 8 14:01:08 UTC 2013 -- Other NAT heartbeat failed, taking over rtb-969a23fb default route
ROUTE i-f821b388 0.0.0.0/0
Fri Feb 8 14:01:14 UTC 2013 -- Other NAT instance running, attempting to stop for reboot
INSTANCE i-12990462 running stopping
Fri Feb 8 14:02:19 UTC 2013 -- Other NAT instance stopped, starting it back up
INSTANCE i-12990462 stopped pending
 

NAT Node #2 shutdown

[root@ip-10-0-2-11 ~]# shutdown now 
[root@ip-10-0-2-11 ~]# 
Broadcast message from ec2-user@ip-10-0-2-11
 (/dev/pts/0) at 14:28 ...

The system is going down for maintenance NOW!
 

NAT Node #2 back up and running

[root@ip-10-0-2-11 ~]# tail /tmp/nat_monitor.log 
Fri Feb 8 14:03:29 UTC 2013 -- Starting NAT monitor
Fri Feb 8 14:03:29 UTC 2013 -- Adding this instance to rtb-969a23fb default route on start
ROUTE i-12990462 0.0.0.0/0
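Besides watching the logs, you can confirm which instance currently holds each default route by querying the route tables directly. The sketch below assumes the AWS CLI is configured with permission to describe route tables (the NAT_Monitor policy in this article does not include that permission, so run it with your own credentials); default_route_target is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: print the instance ID that the 0.0.0.0/0 route currently targets.
# default_route_target <route-table-id> <region>
default_route_target() {
    aws ec2 describe-route-tables --route-table-ids "$1" --region "$2" \
        --query "RouteTables[0].Routes[?DestinationCidrBlock=='0.0.0.0/0'].InstanceId" \
        --output text
}
```

During the failover shown above, default_route_target rtb-969a23fb us-east-1 would be expected to report NAT Node #1's instance ID, and NAT Node #2's ID again once it has restarted and reclaimed its route.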
 

Script

Note: This sample script works well with EC2 API tools version 1.6.12.2 2013-10-15. If you are using a different version and your script is stuck at NAT_STATE, please modify the script to "print $5;" instead of "print $4;" in line 79.
#!/bin/sh
# This script will monitor another NAT instance and take over its routes
# if communication with the other instance fails
yum -y install aws-cli

# NAT instance variables
# Other instance's ID (its private IP is looked up below for pinging) and the route table to grab if the other node goes down
NAT_ID=
NAT_RT_ID=

# My route to grab when I come back up
My_RT_ID=

# Specify the EC2 region that this will be running in (e.g. https://ec2.us-east-1.amazonaws.com)
EC2_URL=
EC2_REGION=`echo $EC2_URL | sed "s/https:\/\/ec2\.//g" | sed "s/\.amazonaws\.com//g"`

# Health Check variables
Num_Pings=3
Ping_Timeout=1
Wait_Between_Pings=2
Wait_for_Instance_Stop=60
Wait_for_Instance_Start=300

# Run aws-apitools-common.sh to set up default environment variables and to
# leverage AWS security credentials provided by EC2 roles
. /etc/profile.d/aws-apitools-common.sh

# Determine the other NAT instance's private IP so we can ping it, take over its route,
# and stop/start it. Requires EC2 DescribeInstances, CreateRoute, ReplaceRoute, and
# Start/StopInstances permissions. The following example EC2 role policy will authorize these commands:
# {
#   "Statement": [
#     {
#       "Action": [
#         "ec2:DescribeInstances",
#         "ec2:CreateRoute",
#         "ec2:ReplaceRoute",
#         "ec2:StartInstances",
#         "ec2:StopInstances"
#       ],
#       "Effect": "Allow",
#       "Resource": "*"
#     }
#   ]
# }

# Get this instance's ID
Instance_ID=`/usr/bin/curl --silent http://169.254.169.254/latest/meta-data/instance-id`
# Get the other NAT instance's IP
NAT_IP=`/opt/aws/bin/ec2-describe-instances $NAT_ID -U $EC2_URL | grep PRIVATEIPADDRESS -m 1 | awk '{print $2;}'`

echo `date` "-- Starting NAT monitor"
echo `date` "-- Adding this instance to $My_RT_ID default route on start"
/opt/aws/bin/ec2-replace-route $My_RT_ID -r 0.0.0.0/0 -i $Instance_ID -U $EC2_URL
# If replace-route failed, then the route might not exist and may need to be created instead
if [ "$?" != "0" ]; then
 /opt/aws/bin/ec2-create-route $My_RT_ID -r 0.0.0.0/0 -i $Instance_ID -U $EC2_URL
fi

while [ . ]; do
  # Check health of other NAT instance
  pingresult=`ping -c $Num_Pings -W $Ping_Timeout $NAT_IP | grep time= | wc -l`
  # Check to see if any of the health checks succeeded, if not
  if [ "$pingresult" = "0" ]; then
    # Set HEALTHY variables to unhealthy (0)
    ROUTE_HEALTHY=0
    NAT_HEALTHY=0
    STOPPING_NAT=0
    while [ "$NAT_HEALTHY" = "0" ]; do
      # NAT instance is unhealthy, loop while we try to fix it
      if [ "$ROUTE_HEALTHY" = "0" ]; then
        echo `date` "-- Other NAT heartbeat failed, taking over $NAT_RT_ID default route"
        /opt/aws/bin/ec2-replace-route $NAT_RT_ID -r 0.0.0.0/0 -i $Instance_ID -U $EC2_URL
        ROUTE_HEALTHY=1
      fi
      # Check NAT state to see if we should stop it or start it again
      #NAT_STATE=`/opt/aws/bin/ec2-describe-instances $NAT_ID -U $EC2_URL | grep INSTANCE | awk '{print $4;}'`
      # The line below replaces the EC2 API tools with the AWS CLI to improve stability across EC2 API tool versions
      NAT_STATE=`aws ec2 describe-instances --instance-ids $NAT_ID --region $EC2_REGION --output text --query 'Reservations[*].Instances[*].State.Name'`
      if [ "$NAT_STATE" = "stopped" ]; then
        echo `date` "-- Other NAT instance stopped, starting it back up"
        /opt/aws/bin/ec2-start-instances $NAT_ID -U $EC2_URL
        NAT_HEALTHY=1
        sleep $Wait_for_Instance_Start
      else
        if [ "$STOPPING_NAT" = "0" ]; then
          echo `date` "-- Other NAT instance $NAT_STATE, attempting to stop for reboot"
          /opt/aws/bin/ec2-stop-instances $NAT_ID -U $EC2_URL
          STOPPING_NAT=1
        fi
        sleep $Wait_for_Instance_Stop
      fi
    done
  else
    sleep $Wait_Between_Pings
  fi
done
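The script above mixes the older EC2 API tools with one AWS CLI call. For readers who only have the AWS CLI installed, the two core pieces (counting ping replies and taking over a route) could be sketched as below. This is an illustration under that assumption, not a replacement supplied by the original authors; count_ping_replies and take_over_route are hypothetical names.

```shell
#!/bin/sh
# Sketch only: AWS CLI flavored versions of two pieces of the monitor script.

# count_ping_replies: read ping output on stdin and print the number of replies
count_ping_replies() {
    grep -c 'time='
}

# take_over_route <route-table-id> <instance-id> <region>
# Point the default route at the given instance, creating the route if it is missing.
take_over_route() {
    aws ec2 replace-route --route-table-id "$1" \
        --destination-cidr-block 0.0.0.0/0 --instance-id "$2" --region "$3" ||
    aws ec2 create-route --route-table-id "$1" \
        --destination-cidr-block 0.0.0.0/0 --instance-id "$2" --region "$3"
}
```

Inside the monitor loop, the reply count could be obtained with replies=$(ping -c "$Num_Pings" -W "$Ping_Timeout" "$NAT_IP" | count_ping_replies), and a count of 0 would trigger take_over_route "$NAT_RT_ID" "$Instance_ID" "$EC2_REGION".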

Appendix A: Preventing False Positives

When two instances independently monitor each other, there is a possibility that a network interruption between the two instances will result in a false positive, where both instances think that the other has failed and initiate recovery. If both instances can still communicate with the EC2 API endpoints, this scenario could lead to the instances shutting each other down. The likelihood of this edge case occurring is directly related to how frequently health checks are performed. This example uses the VIP monitor defaults of health checks every five seconds, which results in extremely quick NAT recovery (typically under 10 seconds). However, these defaults make this edge case more likely than if the health checks were performed less frequently (e.g., every 5, 10, or 15 minutes or so).
Three strategies can reduce the risks associated with this edge case:
  1. Increase the number of health checks (Num_Pings) or ping timeout (Ping_Timeout). This will mitigate the risk that temporary network congestion between the NAT instances will result in a false positive. The default is to perform three pings with a one-second timeout for each ping. Increasing the number of pings or timeout length increases the likelihood that at least one healthy response will be received during a particular health check.
  2. Increase the timeout between health checks (Wait_Between_Pings). Increasing the time between health checks can mitigate the risk that both instances could shut each other down. Ideally this timeout would be greater than the time it takes to stop an instance, and the two NAT instances would be configured to perform their health checks at different intervals. For example, increasing the health check timeout to 10 minutes with NAT Node #2 monitoring starting 5 minutes after NAT Node #1 results in each node performing alternating health checks every 5 minutes but never simultaneously.
  3. Install these scripts on a monitoring instance or "witness server" to perform the monitoring, route swapping, and NAT instance restarting. Additional quorum and recovery logic could also be incorporated into the script to reduce the risk of false positives.
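The staggered schedule described in the second strategy can be made concrete with a short calculation. The sketch below prints when each node would run its checks for a 10-minute interval and a 5-minute offset; print_schedule is a hypothetical helper, and the times ignore the few seconds each check itself takes.

```shell
#!/bin/sh
# Sketch only: show alternating health-check times for two staggered monitors.
# print_schedule <interval-seconds> <offset-seconds> <horizon-seconds>
print_schedule() {
    t=0
    while [ "$t" -lt "$3" ]; do
        echo "NAT Node #1 checks at ${t}s, NAT Node #2 checks at $((t + $2))s"
        t=$((t + $1))
    done
}

# 10-minute interval, Node #2 offset by 5 minutes, first 30 minutes
print_schedule 600 300 1800
```

In practice this corresponds to setting Wait_Between_Pings to roughly 600 on both nodes and starting the monitor on NAT Node #2 about 300 seconds after the one on NAT Node #1.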

Source : https://aws.amazon.com/articles/2781451301784570
