Zero-downtime upgrades with AWS Elastic Load Balancers (ELBs) and HAProxy
I have a Classic Load Balancer configured in my infrastructure with Terraform:
resource "aws_elb" "ingress" {
# (...)
# Regular traffic:
listener {
lb_port = 80
lb_protocol = "tcp"
instance_port = 8888
instance_protocol = "tcp"
}
listener {
lb_port = 443
lb_protocol = "tcp"
instance_port = 8889
instance_protocol = "tcp"
}
# Haproxy dashboard answers on / on port 8887
health_check {
healthy_threshold = 2
unhealthy_threshold = 2
timeout = 3
target = "HTTP:8887/healthy"
interval = 5
}
instances = flatten([
aws_instance.nomadclient-01.id,
aws_instance.nomadclient-02.id,
aws_instance.nomadclient-03.id
])
cross_zone_load_balancing = true
idle_timeout = 400
connection_draining = true
connection_draining_timeout = 400
}
Notice how the health_check is defined, but also how connection_draining is set up.
Connection Draining in AWS #
The AWS documentation describes connection draining as follows:
To ensure that a Classic Load Balancer stops sending requests to instances that are de-registering or unhealthy, while keeping the existing connections open, use connection draining. This enables the load balancer to complete in-flight requests made to instances that are de-registering or unhealthy.
So, how do we notify AWS ELBs when we’re replacing an endpoint?
Handling Draining in HAProxy #
In HAProxy, we use grace and monitor fail to handle this process. The documentation describes grace as follows:
grace <time>
Defines a delay between SIGUSR1 and real soft-stop. This is used for compatibility with legacy environments where the haproxy process needs to be stopped but some external components need to detect the status before listeners are unbound. The principle is that the internal “stopping” variable (which is reported by the “stopping” sample fetch function) will be turned to true, but listeners will continue to accept connections undisturbed, until the delay expires, after what the regular soft-stop will proceed. This must not be used with processes that are reloaded, or this will prevent the old process from unbinding, and may prevent the new one from starting, or simply cause trouble.
Our HAProxy configuration looks like this:
global
    grace 10s

listen fe_ingress_stats
    bind *:8887
    mode http
    stats enable
    stats show-legends
    stats show-node
    stats uri /
    monitor-uri /healthy
    monitor fail if { stopping }
Now, when stopping is true, the /healthy endpoint returns an unhealthy response, signaling the load balancer to stop sending traffic to the instance.
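To see what the ELB health check sees, you can probe the monitor endpoint on an instance directly. A quick check along these lines (the host name is just a placeholder) returns 200 while HAProxy is running normally, and 503 once stopping has been triggered:

# Probe HAProxy's monitor endpoint the same way the ELB health check does.
# "nomadclient-01" is a placeholder host name for one of the instances.
curl -i http://nomadclient-01:8887/healthy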
Triggering stopping #
To initiate draining, we send the SIGUSR1 signal to the HAProxy process.
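Since HAProxy runs as a Docker task here, you can also trigger a drain by hand for testing; the container name below is only a placeholder (for a process running directly on the host, kill -USR1 <pid> does the same thing):

# Start a graceful stop manually; "haproxy-ingress" is a placeholder container name.
# Nomad sends the same signal automatically when it stops the task (see below).
docker kill --signal=SIGUSR1 haproxy-ingress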
Implementing This in Nomad #
Within Nomad, we configure the task like this:
task "haproxy" {
driver = "docker"
# Defines the time between sending a termination signal and force-killing the task.
kill_timeout = "12s"
kill_signal = "SIGUSR1"
}
Conclusion #
With this setup, we can safely deploy new load balancer instances without dropping traffic. HAProxy’s grace period (10s) gives the ELB time to see two failed health checks (2 × 5s) and stop routing new requests, while Nomad’s kill_timeout (12s) keeps the task alive until that window has passed. The combination of AWS connection draining, HAProxy’s grace and monitor fail, and Nomad’s signal handling ensures smooth transitions between instances.