Zero-downtime upgrades with AWS Elastic Loadbalancers (ELBs) and Haproxy

Published
Updated

I have a Classic Loadbalancer configured in my infrastructure with Terraform:

bash
resource "aws_elb" "ingress" {
  # (...)

  # Regular traffic:
  listener {
    lb_port = 80
    lb_protocol = "tcp"
    instance_port = 8888
    instance_protocol = "tcp"
  }
  listener {
    lb_port = 443
    lb_protocol = "tcp"
    instance_port = 8889
    instance_protocol = "tcp"
  }

  # Haproxy dashboard answers on / on port 8887
  health_check {
    healthy_threshold = 2
    unhealthy_threshold = 2
    timeout = 3
    target = "HTTP:8887/healthy"
    interval = 5
  }

  instances = flatten([
    aws_instance.nomadclient-01.id,
    aws_instance.nomadclient-02.id,
    aws_instance.nomadclient-03.id
  ])

  cross_zone_load_balancing = true
  idle_timeout = 400
  connection_draining = true
  connection_draining_timeout = 400
}

Notice how the health_check is defined, but also how connection_draining is setup.

Connection Draining in AWS

The AWS documentation describes connection draining as follows:

To ensure that a Classic Load Balancer stops sending requests to instances that are de-registering or unhealthy, while keeping the existing connections open, use connection draining. This enables the load balancer to complete in-flight requests made to instances that are de-registering or unhealthy.

So, how do we notify AWS ELBs when we’re replacing an endpoint?

Handling Draining in HAProxy

In HAProxy, we use grace and monitor fail to handle this process. The documentation states:

grace <time>

Defines a delay between SIGUSR1 and real soft-stop.

This is used for compatibility with legacy environments where the haproxy

process needs to be stopped but some external components need to detect the status before listeners are unbound. The principle is that the internal “stopping” variable (which is reported by the “stopping” sample fetch function) will be turned to true, but listeners will continue to accept

connections undisturbed, until the delay expires, after what the regular

soft-stop will proceed. This must not be used with processes that are

reloaded, or this will prevent the old process from unbinding, and may

prevent the new one from starting, or simply cause trouble.

Our HAProxy configuration looks like this:

bash
global
  grace 10s

listen fe_ingress_stats
  bind *:8887
  mode http
  stats enable
  stats show-legends
  stats show-node
  stats uri /

  monitor-uri /healthy
  monitor fail if { stopping }

Now, when stopping is true, the /healthy endpoint returns an unhealthy response, signaling the load balancer to stop sending traffic to the instance.

Triggering stopping

To initiate draining, we send the SIGUSR1 signal to the HAProxy process.

Implementing This in Nomad

Within Nomad, we configure the task like this:

bash
task "haproxy" {
  driver = "docker"

  # Defines the time between sending a termination signal and force-killing the task.
  kill_timeout = "12s"
  kill_signal  = "SIGUSR1"
}

Conclusion

With this setup, we can safely deploy new load balancers without dropping traffic. The combination of AWS connection draining, HAProxy’s grace and monitor fail, and Nomad’s signal handling ensures smooth transitions between instances.

Kudos

Don't move!