March 10, 2025

Zero-downtime upgrades with AWS Elastic Load Balancers (ELBs) and HAProxy

I have a Classic Load Balancer configured in my infrastructure with Terraform:

resource "aws_elb" "ingress" {
  # (...)

  # Regular traffic:
  listener {
    lb_port = 80
    lb_protocol = "tcp"
    instance_port = 8888
    instance_protocol = "tcp"
  }
  listener {
    lb_port = 443
    lb_protocol = "tcp"
    instance_port = 8889
    instance_protocol = "tcp"
  }

  # HAProxy stats dashboard answers on / on port 8887;
  # the health check below targets /healthy on the same port.
  health_check {
    healthy_threshold = 2
    unhealthy_threshold = 2
    timeout = 3
    target = "HTTP:8887/healthy"
    interval = 5
  }

  instances = flatten([
    aws_instance.nomadclient-01.id,
    aws_instance.nomadclient-02.id,
    aws_instance.nomadclient-03.id
  ])

  cross_zone_load_balancing = true
  idle_timeout = 400
  connection_draining = true
  connection_draining_timeout = 400
}

Notice how the health_check is defined, and how connection_draining is set up.

Connection Draining in AWS #

The AWS documentation describes connection draining as follows:

To ensure that a Classic Load Balancer stops sending requests to instances that are de-registering or unhealthy, while keeping the existing connections open, use connection draining. This enables the load balancer to complete in-flight requests made to instances that are de-registering or unhealthy.

So, how do we notify AWS ELBs when we’re replacing an endpoint?

Handling Draining in HAProxy #

In HAProxy, we use the grace setting together with the monitor fail directive to handle this process. The documentation states:

grace <time>
Defines a delay between SIGUSR1 and real soft-stop.

This is used for compatibility with legacy environments where the haproxy process needs to be stopped but some external components need to detect the status before listeners are unbound. The principle is that the internal “stopping” variable (which is reported by the “stopping” sample fetch function) will be turned to true, but listeners will continue to accept connections undisturbed, until the delay expires, after what the regular soft-stop will proceed. This must not be used with processes that are reloaded, or this will prevent the old process from unbinding, and may prevent the new one from starting, or simply cause trouble.

Our HAProxy configuration looks like this:

global
  grace 10s

listen fe_ingress_stats
  bind *:8887
  mode http
  stats enable
  stats show-legends
  stats show-node
  stats uri /

  monitor-uri /healthy
  monitor fail if { stopping }

Now, when stopping is true, the /healthy endpoint returns HTTP 503 instead of 200. The ELB health check counts that as a failure, and after unhealthy_threshold consecutive failures the load balancer stops sending new traffic to the instance.
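To make the interaction between /healthy and the health_check thresholds concrete, here is a minimal Python sketch of how a threshold-based health checker could react to a sequence of responses. It is a simplified model, not the ELB's actual implementation: the function name `simulate_elb_state` and the "InService"/"OutOfService" labels are chosen for illustration, and it assumes that only a 200 response counts as a passing check.

```python
from typing import Iterable, List


def simulate_elb_state(
    status_codes: Iterable[int],
    healthy_threshold: int = 2,
    unhealthy_threshold: int = 2,
    initially_in_service: bool = True,
) -> List[str]:
    """Return the modelled instance state after each health check result.

    Simplified model (an assumption, not AWS code): the state flips only
    after N consecutive results that contradict the current state.
    """
    in_service = initially_in_service
    streak = 0  # consecutive checks that contradict the current state
    states = []
    for code in status_codes:
        check_passed = code == 200  # assumption: only 200 counts as healthy
        if check_passed == in_service:
            streak = 0  # result agrees with current state, reset the streak
        else:
            streak += 1
            needed = healthy_threshold if check_passed else unhealthy_threshold
            if streak >= needed:
                in_service = check_passed
                streak = 0
        states.append("InService" if in_service else "OutOfService")
    return states


if __name__ == "__main__":
    # Two healthy checks, then HAProxy starts answering 503 on /healthy.
    for state in simulate_elb_state([200, 200, 503, 503, 503]):
        print(state)
```

With the thresholds from the Terraform snippet (both set to 2), a single failed check does not take the instance out of rotation in this model; the second consecutive 503 does.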

Triggering `stopping` #

To initiate draining, we send the SIGUSR1 signal to the HAProxy process.
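For testing the behaviour by hand, outside of any scheduler, the signal can be sent from a small script. The following is a minimal sketch using only the Python standard library; the helper name `request_soft_stop` is hypothetical, and finding the right PID (for example from a pid file or from the container runtime) is left out because it depends on how HAProxy is started.

```python
import os
import signal


def request_soft_stop(pid: int) -> None:
    """Send SIGUSR1 to the process with the given PID (Unix only).

    For HAProxy this starts the grace period described above; for any
    other process the effect depends on that process's signal handling.
    """
    os.kill(pid, signal.SIGUSR1)


# Example usage (the PID is a placeholder, not a real value):
#   request_soft_stop(12345)
```

While the grace period runs, requesting /healthy on port 8887 should show the response switching from 200 to 503.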

Implementing This in Nomad #

Within Nomad, we configure the task like this:

task "haproxy" {
  driver = "docker"

  # Defines the time between sending a termination signal and force-killing the task.
  kill_timeout = "12s"
  kill_signal  = "SIGUSR1"
}
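The timing values in the three configurations appear to fit together, although the post does not spell out the arithmetic. As a rough sketch of that reasoning: the ELB needs about interval × unhealthy_threshold seconds to notice a failing /healthy endpoint, grace should be at least that long, and whatever remains of kill_timeout after grace is the time left before Nomad force-kills the task. The function names are hypothetical, and the detection estimate ignores network latency and check jitter.

```python
def worst_case_detection_seconds(interval: int, unhealthy_threshold: int) -> int:
    """Approximate upper bound on how long the ELB takes to mark an
    instance unhealthy once every check fails (an estimate that ignores
    latency and jitter)."""
    return interval * unhealthy_threshold


def check_shutdown_budget(
    interval: int, unhealthy_threshold: int, grace: int, kill_timeout: int
) -> dict:
    """Relate the ELB health check timing to HAProxy's grace period and
    Nomad's kill_timeout. All values are in seconds."""
    detection = worst_case_detection_seconds(interval, unhealthy_threshold)
    return {
        "detection_s": detection,
        # grace should last at least as long as the ELB needs to react
        "grace_covers_detection": grace >= detection,
        # time left between the end of grace and Nomad's force-kill
        "remaining_before_kill_s": kill_timeout - grace,
    }


if __name__ == "__main__":
    # Values from this post: interval = 5, unhealthy_threshold = 2,
    # grace 10s, kill_timeout = "12s".
    print(check_shutdown_budget(5, 2, 10, 12))
```

With the values used here, detection takes roughly 10 seconds, which the 10-second grace period just covers, and about 2 seconds remain before the force-kill. Long-lived connections may need a larger kill_timeout.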

Conclusion #

With this setup, we can safely deploy new HAProxy instances behind the ELB without dropping traffic. The combination of AWS connection draining, HAProxy’s grace and monitor fail, and Nomad’s signal handling ensures smooth transitions between instances.
