How Load Balancers Actually Work
A Deep Dive
A load balancer is one of the most foundational building blocks in distributed systems. It sits between clients and your backend servers and spreads incoming traffic across a pool of machines, so no single server becomes the bottleneck (or the single point of failure).
But the interesting questions start after the definition:
How does the load balancer decide which server should handle a request?
What’s the difference between L4 and L7 Load Balancers?
What happens when a server slows down or goes offline mid-traffic?
How can the load balancer ensure that requests from the same client always go to the same server?
And what happens if the load balancer itself goes down?
In this article, we’ll answer these questions and build an intuitive understanding of how load balancers work in real systems.
Let’s start with the basics: why we need load balancers in the first place.
1. Why Do We Need Load Balancers?
Imagine a web app with just one server. Every user request hits the same machine.
It works… until it doesn’t. This “single-server” setup has a few fundamental problems:
Single Point of Failure: If the server crashes, your entire application goes down.
Limited Scalability: A single server can only handle so many requests before it becomes overloaded.
Poor Performance: As traffic increases, response times degrade for all users.
No Redundancy: Hardware failures, software bugs, or maintenance windows cause complete outages.
A load balancer solves these problems by distributing traffic across multiple servers.
With this setup, you get:
High Availability: If one server fails, traffic is automatically routed to healthy servers.
Horizontal Scalability: You can add more servers to handle increased load.
Better Performance: Requests are distributed, so no single server is overwhelmed.
Zero-Downtime Deployments: You can take servers out of rotation for maintenance without affecting users.
But how does the load balancer decide which server should handle each request?
2. Load Balancing Algorithms
The load balancer uses algorithms to distribute incoming requests. Each algorithm has different characteristics and is suited for different scenarios.
Below are the most common ones you’ll see in real systems.
2.1 Round Robin
The simplest algorithm. Requests are distributed to servers in sequential order.
Request 1 → Server A
Request 2 → Server B
Request 3 → Server C
Request 4 → Server A (cycle repeats)
Request 5 → Server B
...
Pros:
Simple to implement
Works well when all servers have equal capacity
Predictable distribution
Cons:
Does not account for server load or capacity differences
A server that is slow or backed up still receives its full share of requests
Best for: Homogeneous server environments where all servers have similar specs and requests have similar processing times.
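To make this concrete, here is a minimal Python sketch of round-robin selection (the RoundRobinBalancer class and server names are purely illustrative, not taken from any particular product):

```python
from itertools import cycle

class RoundRobinBalancer:
    """Hand out servers in a fixed, repeating order: A, B, C, A, B, C, ..."""
    def __init__(self, servers):
        self._servers = cycle(servers)

    def next_server(self):
        return next(self._servers)

lb = RoundRobinBalancer(["server-a", "server-b", "server-c"])
print([lb.next_server() for _ in range(5)])
# ['server-a', 'server-b', 'server-c', 'server-a', 'server-b']
```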
2.2 Weighted Round Robin
An extension of Round Robin where servers are assigned weights based on their capacity.
Server A (weight=3): Handles 3 out of every 6 requests
Server B (weight=2): Handles 2 out of every 6 requests
Server C (weight=1): Handles 1 out of every 6 requests
Pros:
Still simple
Better for mixed instance sizes (e.g., 2 vCPU + 4 vCPU + 8 vCPU)
Cons:
Still not load-aware in real time
If one server becomes slow (GC pause, noisy neighbor, warm cache vs cold cache), it will still get its scheduled share
Best for: Heterogeneous environments where servers have different capacities (e.g., different CPU, memory, or network bandwidth).
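A naive way to sketch this in Python is to repeat each server according to its weight before cycling. Real load balancers typically use a "smooth" weighted round robin to avoid sending bursts to the heaviest server, but the idea is the same (class and server names are made up):

```python
from itertools import cycle

class WeightedRoundRobin:
    """Naive weighted round robin: a server with weight w appears w times per cycle."""
    def __init__(self, weighted_servers):
        expanded = [server for server, weight in weighted_servers for _ in range(weight)]
        self._servers = cycle(expanded)

    def next_server(self):
        return next(self._servers)

lb = WeightedRoundRobin([("server-a", 3), ("server-b", 2), ("server-c", 1)])
print([lb.next_server() for _ in range(6)])
# server-a gets 3 of every 6 requests, server-b gets 2, server-c gets 1
```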
2.3 Least Connections
Routes requests to the server with the fewest active connections.
This algorithm is dynamic: it considers the current state of each server rather than using a fixed rotation.
Server A: 10 active connections
Server B: 5 active connections ← Next request goes here
Server C: 8 active connections
Pros:
Adapts to varying request processing times
Naturally balances load when some requests take longer than others
Cons:
Requires tracking connection counts for each server
Slightly more overhead than Round Robin
Best for: Applications where request processing times vary significantly (e.g., database queries, file uploads).
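A rough sketch of the bookkeeping, assuming the load balancer is told when each request starts and finishes (the class and method names are hypothetical):

```python
class LeastConnectionsBalancer:
    """Route each new request to the server with the fewest in-flight requests."""
    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self):
        server = min(self.active, key=self.active.get)  # fewest active connections
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1  # call when the request/connection completes

lb = LeastConnectionsBalancer(["server-a", "server-b", "server-c"])
target = lb.acquire()   # send the request to `target`
lb.release(target)      # ...once it finishes
```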
2.4 Weighted Least Connections
Combines Least Connections with server weights. The algorithm considers both the number of active connections and the server’s capacity.
Score = Active Connections / Weight
Server A: 10 connections, weight 5 → Score = 2.0
Server B: 6 connections, weight 2 → Score = 3.0
Server C: 4 connections, weight 1 → Score = 4.0
Next request goes to Server A (lowest score)
Pros:
Works well for mixed instance sizes and mixed request durations
More robust than either “weighted” or “least connections” alone
Cons:
Needs reliable tracking + weight tuning
Still uses connections as a proxy for load (not always perfect)
Best for: Heterogeneous environments with varying request processing times.
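Building on the previous sketch, weighted least connections only changes the selection key to the connections-per-weight score. The numbers below mirror the example above; everything else is illustrative:

```python
class WeightedLeastConnections:
    """Route to the server with the lowest score = active_connections / weight."""
    def __init__(self, weights):
        self.weights = weights                         # e.g. {"server-a": 5, ...}
        self.active = {server: 0 for server in weights}

    def acquire(self):
        server = min(self.active, key=lambda s: self.active[s] / self.weights[s])
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1

lb = WeightedLeastConnections({"server-a": 5, "server-b": 2, "server-c": 1})
lb.active.update({"server-a": 10, "server-b": 6, "server-c": 4})  # state from the example
print(lb.acquire())  # 'server-a' (score 10/5 = 2.0 is the lowest)
```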
2.5 IP Hash
The client’s IP address is hashed to determine which server handles the request. The same client IP always goes to the same server.
hash(192.168.1.10) % 3 = 1 → Server B
hash(192.168.1.20) % 3 = 0 → Server A
hash(192.168.1.30) % 3 = 2 → Server C
Pros:
Simple session persistence without cookies
No additional state to track
Cons:
Uneven distribution if IP addresses are not uniformly distributed
Server additions/removals cause redistribution of clients
Best for: Applications requiring basic session persistence without cookie support.
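A minimal sketch of the hashing step (MD5 is used here only as a convenient stable hash; production systems often use consistent hashing so that adding or removing a server reshuffles far fewer clients):

```python
import hashlib

def pick_server(client_ip, servers):
    """Map a client IP to a stable index; the same IP always lands on the same
    server as long as the server list does not change."""
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

servers = ["server-a", "server-b", "server-c"]
print(pick_server("192.168.1.10", servers))  # same output every time for this IP
```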
2.6 Least Response Time
Routes requests to the server with the fastest response time and fewest active connections.
The load balancer continuously measures:
Average response time for each server
Number of active connections
Pros
Optimizes for perceived performance
Can avoid slow/unhealthy servers before they fully fail
Cons
Highest operational complexity (needs continuous measurement and smoothing)
Can “overreact” to noise without careful tuning (feedback loops)
Requires good metrics and stable observation windows
Best for: Latency-sensitive applications where response time is critical.
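One common way to implement the "continuous measurement and smoothing" part is an exponentially weighted moving average (EWMA) of observed latency. A sketch, with made-up class and parameter names:

```python
class LeastResponseTimeBalancer:
    """Pick the server with the lowest smoothed latency, breaking ties by
    active connections. EWMA smoothing damps out single noisy samples."""
    def __init__(self, servers, alpha=0.2):
        self.alpha = alpha                          # smoothing factor
        self.latency = {s: 0.0 for s in servers}    # smoothed response time (seconds)
        self.active = {s: 0 for s in servers}

    def pick(self):
        return min(self.latency, key=lambda s: (self.latency[s], self.active[s]))

    def record(self, server, observed_seconds):
        # new_avg = (1 - alpha) * old_avg + alpha * latest_sample
        self.latency[server] = ((1 - self.alpha) * self.latency[server]
                                + self.alpha * observed_seconds)
```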
3. Layer 4 vs Layer 7 Load Balancing
Load balancers operate at different layers of the OSI model, and this determines what information they can use to make routing decisions.
In practice, this usually comes down to two common modes:
Layer 4 (Transport): routes based on IPs/ports and the transport protocol (TCP/UDP)
Layer 7 (Application): routes based on HTTP/HTTPS request details (path, headers, cookies, etc.)
3.1 Layer 4 (Transport Layer)
A Layer 4 load balancer operates at the TCP/UDP level. It does not understand HTTP paths, headers, or payloads. It only sees network and transport metadata such as:
Source IP address
Destination IP address
Source port
Destination port
Protocol (TCP/UDP)
How it works (TCP example)
The client opens a TCP connection to the load balancer (e.g., on port 443).
The load balancer chooses a backend server (Server 1 or Server 2).
It forwards packets to that backend.
That connection stays pinned to the chosen backend for the lifetime of the TCP session.
Pros
Very fast: no request parsing, no payload inspection
Efficient: lower CPU/memory overhead
Protocol-agnostic: works for any TCP/UDP traffic (HTTP, TLS, gRPC, MQTT, custom protocols)
Cons
No content-based routing: can’t do /api/* vs /images/*
Limited app visibility: doesn’t know response codes, URL patterns, user sessions, etc.
Harder to do “smart” behaviors that require HTTP awareness
Examples: AWS Network Load Balancer (NLB), HAProxy (TCP mode)
3.2 Layer 7 (Application Layer)
A Layer 7 load balancer understands HTTP/HTTPS. It can inspect each request and route based on application-level information like:
HTTP method (GET, POST, …)
URL path and query parameters (/api/users?id=7)
HTTP headers (Host, Authorization, Cookie, User-Agent, …)
Sometimes the request body (with caveats)
Content-based routing examples
/api/* → API server pool
/images/* → Image server pool
/videos/* → Video streaming servers
Mobile clients → Mobile-optimized servers
Layer 7 is especially useful when your “backend” isn’t one pool of identical servers, but a set of specialized services.
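Conceptually, an L7 load balancer keeps a routing table like the one sketched below. The prefixes and pool names are made up, and real products express this as listener rules or config rather than Python:

```python
# Hypothetical L7 routing table: first matching path prefix wins.
ROUTES = [
    ("/api/",    ["api-1", "api-2"]),
    ("/images/", ["img-1", "img-2"]),
    ("/videos/", ["video-1", "video-2"]),
    ("/",        ["web-1", "web-2"]),   # default pool
]

def pick_pool(path):
    for prefix, pool in ROUTES:
        if path.startswith(prefix):
            return pool
    return None

print(pick_pool("/api/users?id=7"))    # ['api-1', 'api-2']
print(pick_pool("/images/logo.png"))   # ['img-1', 'img-2']
```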
Pros
Smart routing: based on path, headers, cookies, hostnames
Better visibility: can observe HTTP status codes, latency, retries, error rates
Can transform traffic: header injection, redirects, rewrites (depending on product)
Commonly supports TLS termination: decrypt at the LB, forward plain HTTP internally (or re-encrypt)
Cons
More overhead: must parse HTTP (and sometimes decrypt TLS first)
Higher latency: usually small, but measurable at high scale
Protocol-specific: primarily for HTTP/HTTPS traffic
Examples: AWS Application Load Balancer (ALB), NGINX, HAProxy (HTTP mode)
4. Health Checks and Failover
A load balancer is only useful if it can avoid sending traffic to broken servers. If it keeps routing requests to a dead instance, users will see timeouts, 5xx errors, and intermittent failures that are hard to debug.
That’s why every serious load balancer has two jobs:
Detect health (which servers are safe to send traffic to)
Fail over (stop routing to unhealthy servers, and bring them back safely once recovered)
4.1 Types of Health Checks
Health checks come in two broad flavors: passive (observe real traffic) and active (send probes). Most production setups use a combination.
Passive Health Checks
The load balancer monitors actual traffic to detect failures.
If Server A returns 5xx errors for 3 consecutive requests
→ Mark Server A as unhealthy
→ Stop sending traffic to Server A
Pros
No extra “probe” traffic
Detects failures that matter to users (real request paths)
Cons
You only detect issues after users are already impacted
Can be noisy: a few bad requests might be app-level bugs, not server death
Doesn’t help much when traffic is low (nothing to observe)
Best used for: fast detection of application-level failures in addition to active checks.
Active Health Checks
The load balancer periodically sends probe requests to each server, independent of user traffic.
Example:
GET /health every 10 seconds
If probes fail repeatedly, the server is removed from rotation
This catches failures before they hit users (or at least reduces blast radius quickly).
Health check types by depth
TCP Health Checks
The load balancer checks whether it can open a TCP connection to a port.
Can I connect to Server A on port 8080?
→ Yes: Server is healthy
→ No: Server is unhealthy
Good for: basic liveness (“is the process listening?”)
Blind spot: the app may accept TCP but still be broken internally (DB down, stuck threads).
HTTP Health Checks
The load balancer sends an HTTP request to a known endpoint and validates the response.
GET /health HTTP/1.1
Host: server-a.internal
Expected response:
- Status code: 200
- Body contains: "OK"
- Response time: < 500ms
Good for: verifying the app can actually serve HTTP and respond quickly.
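A toy version of such a probe in Python, checking status code, body, and latency against the expectations above (the URL and thresholds are illustrative):

```python
import time
import urllib.request

def is_healthy(url, timeout=0.5):
    """Return True only if the endpoint answers 200 with 'OK' in the body within 500ms."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode(errors="replace")
            elapsed = time.monotonic() - start
            return resp.status == 200 and "OK" in body and elapsed < 0.5
    except Exception:
        return False  # connection refused, timeout, DNS failure, etc. → unhealthy

print(is_healthy("http://server-a.internal:8080/health"))
```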
Application-Level Health Checks
A richer endpoint that checks internal dependencies and returns structured status.
GET /health
{
"status": "healthy",
"checks": {
"database": "healthy",
"cache": "healthy",
"external_api": "degraded"
},
"timestamp": "2024-01-15T10:30:00Z"
}
4.2 Failover Process
Failover is what happens after health checks decide a server is unhealthy.
Removing an unhealthy server (fail-out)
Once a server crosses the unhealthy threshold, the load balancer:
marks it unhealthy
stops sending it new traffic
continues sending traffic only to healthy servers
Example Timeline:
00:00 - Server C fails to respond to health check
00:10 - Second health check fails
00:20 - Third health check fails → Server C marked unhealthy
00:20 - All new traffic goes to Servers A and B only
Bringing a server back (recovery / fail-in)
After the server is fixed, it shouldn’t immediately receive full traffic (to avoid instant re-failure).
With a healthy threshold of 2:
...Server C is fixed...
05:00 - Health check succeeds
05:10 - Second health check succeeds → Server C marked healthy
05:10 - Server C rejoins the pool
Many systems also use slow start / ramp-up, gradually increasing traffic to a recovered server so it can warm caches and stabilize.
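The fail-out / fail-in logic above boils down to a small state machine driven by consecutive probe results. A sketch using the thresholds from the timelines (3 failures to mark unhealthy, 2 successes to mark healthy again); the class name is hypothetical:

```python
class HealthTracker:
    """Track one backend's health based on consecutive probe results."""
    def __init__(self, unhealthy_threshold=3, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.failures = 0
        self.successes = 0
        self.healthy = True

    def record_probe(self, probe_ok):
        if probe_ok:
            self.failures = 0
            self.successes += 1
            if not self.healthy and self.successes >= self.healthy_threshold:
                self.healthy = True    # fail-in: server rejoins the pool
        else:
            self.successes = 0
            self.failures += 1
            if self.healthy and self.failures >= self.unhealthy_threshold:
                self.healthy = False   # fail-out: stop sending new traffic
```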
5. Session Persistence (Sticky Sessions)
By default, a load balancer may send each request from the same user to a different backend server. That’s usually fine for stateless services, but it breaks quickly if your application stores session data in memory on the server.
Example failure mode:
Request 1: User logs in → Server A (session created)
Request 2: User views profile → Server B (no session found!)
Request 3: User logs in again → Server C (another session created)
To avoid this, systems use session persistence (aka sticky sessions): the load balancer keeps routing a given user’s requests to the same server.
5.1 Methods of Session Persistence
There are a few common ways to achieve stickiness, each with different trade-offs.
Cookie-Based Persistence
The load balancer sets a cookie identifying which server the client should use.
How it works:
On the first request, the load balancer selects a backend (say, Server A).
The response includes a cookie that encodes the “chosen backend”.
On subsequent requests, the client sends that cookie back.
The load balancer uses the cookie to route the request to the same server.
Pros
Works well for browsers and HTTP clients
Doesn’t require special networking assumptions
Very reliable as long as cookies are preserved
Cons
Ties users to a specific server, reducing flexibility
If the cookie is not handled carefully, it can become a security or operability risk (tampering, leakage, etc.)
Doesn’t help for non-HTTP protocols
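The routing decision itself is simple. Here is a sketch of what a load balancer might do with the cookie; the cookie name, server names, and random fallback choice are all placeholders:

```python
import random

SERVERS = ["server-a", "server-b", "server-c"]
COOKIE_NAME = "lb_backend"   # hypothetical cookie name

def choose_backend(request_cookies):
    """Honor an existing affinity cookie if it points at a known server;
    otherwise pick a backend and return a cookie for the response."""
    pinned = request_cookies.get(COOKIE_NAME)
    if pinned in SERVERS:
        return pinned, None                      # keep existing affinity
    chosen = random.choice(SERVERS)
    return chosen, (COOKIE_NAME, chosen)         # load balancer sets this cookie

backend, set_cookie = choose_backend({})                    # first request: no cookie yet
backend_again, _ = choose_backend({COOKIE_NAME: backend})   # later requests stick to it
```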
IP-Based Persistence
Use the client’s IP address to consistently route to the same server (same as IP Hash algorithm).
Pros
Simple
No cookies required
Cons
NAT and proxies: many users can appear from the same IP (corporate networks, mobile carriers)
Client IP may be masked by CDNs, proxies, or gateways unless forwarded correctly
Adding/removing servers can reshuffle mapping and break stickiness
Best for: basic stickiness in environments where client IP is stable and meaningful.
Application-Controlled Persistence
The application explicitly indicates the sticky target via a header or cookie, and the load balancer respects it.
Example:
Response header:
X-Sticky-Server: server-a-pod-xyz
This is less common in “standard” setups, but it can be useful when the application has stronger context about routing needs (multi-tenant affinity, shard affinity, warm-cache affinity, etc.).
5.2 Problems with Sticky Sessions
Sticky sessions solve one problem, but they create new ones.
Uneven load distribution
If a few users generate heavy traffic, and they’re pinned to the same backend, that server becomes hot while others stay underutilized.
Failover becomes user-visible
If Server A dies, every user pinned to A may lose their session and get forced to re-authenticate (unless the app has other recovery mechanisms).
Scaling helps less than you expect
When you add new servers, existing users remain pinned to old ones. So the new capacity primarily benefits new users, not the current load distribution.
5.3 Better Alternative: Externalized Sessions
Instead of keeping sessions in server memory, store session state in a shared external store. Then any server can handle any request.
With externalized sessions:
Any backend can serve any request (no affinity required)
Server failures don’t wipe sessions
Horizontal scaling becomes clean and predictable
Deployments become simpler (no special draining logic for “pinned” users)
Popular session stores: Redis (very common), Memcached, DynamoDB
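As a sketch of the idea, assuming a reachable Redis instance and the redis-py client (key names and TTL are arbitrary):

```python
import json
import redis  # pip install redis; assumes Redis is running locally

store = redis.Redis(host="localhost", port=6379)

def save_session(session_id, data, ttl_seconds=3600):
    # Any backend can write the session; it expires automatically after the TTL.
    store.setex(f"session:{session_id}", ttl_seconds, json.dumps(data))

def load_session(session_id):
    # Any backend can read it, so no stickiness is required.
    raw = store.get(f"session:{session_id}")
    return json.loads(raw) if raw else None

save_session("abc123", {"user_id": 42, "logged_in": True})
print(load_session("abc123"))
```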
Key point: Sticky sessions are often a tactical fix. Externalizing sessions is usually the strategic solution, especially once you care about reliability, autoscaling, and smooth deployments.
6. SSL/TLS Termination
HTTPS is non-negotiable on the modern internet, but TLS handshakes and encryption are not free. They cost CPU, add latency, and require careful certificate management.
Load balancers often take on this work so your application servers can focus on business logic.
There are three common patterns:
6.1 TLS Termination at Load Balancer
The client connects to the load balancer over HTTPS. The load balancer decrypts the traffic and forwards it to backends as plain HTTP.
Pros:
Reduces CPU load on application servers
Centralized certificate management
Easier to inspect and modify traffic
Cons:
Traffic between load balancer and servers is unencrypted
Requires trusting the internal network
6.2 TLS Passthrough
The load balancer does not decrypt TLS. It forwards the encrypted bytes to a backend, and the backend terminates TLS.
Pros:
End-to-end encryption
No certificate management at load balancer
Cons:
Cannot inspect or modify traffic (Layer 4 only)
Each server must handle SSL/TLS
6.3 TLS Re-encryption
The load balancer decrypts traffic, applies L7 routing/inspection, then re-encrypts before forwarding to backends.
Pros:
End-to-end encryption
Can still inspect and route based on content
Cons:
Double encryption/decryption overhead
More complex certificate management
7. High Availability for Load Balancers
A load balancer improves availability for your backend servers, but it also introduces a new risk: if you deploy only one load balancer and it fails, your entire service becomes unreachable.
So the question becomes: how do we make the load balancer itself highly available?
There are three common approaches:
7.1 Active-Passive (Failover)
Two load balancers are deployed: one active, one standby. If the active fails, the passive takes over.
The two load balancers share a Virtual IP (VIP). When the active fails, the standby claims the VIP.
How failover is typically implemented
VRRP (Virtual Router Redundancy Protocol)
Keepalived (common Linux tool that uses VRRP)
Heartbeats between LBs to detect failure quickly
Pros
Easier to reason about
Works well for on-prem / self-managed environments
Cons
The standby is mostly idle (wasted capacity)
Failover is not instant (there’s always a detection + takeover window)
You still need to design for state:
if the active LB held session affinity state, failover may break stickiness unless state is shared
7.2 Active-Active
In active-active mode, both load balancers handle traffic simultaneously. DNS or an upstream router distributes traffic between them.
Pros:
Better resource utilization (no idle standby)
Higher total capacity (you scale the LB tier horizontally)
Often smoother failure handling: if one LB dies, the other still serves traffic
Cons:
More moving parts: routing, failover behavior, monitoring, and debugging are harder
Session persistence gets trickier
If stickiness is implemented at the LB, users might bounce between LBs unless the mechanism works across both (or you keep LBs stateless)
7.3 Cloud Load Balancers
In most cloud environments, you don’t run your own HA pair of load balancers. You use a managed LB (e.g., AWS ALB/NLB, Google Cloud Load Balancing, Azure Load Balancer).
These are designed to be highly available by default, typically by:
running across multiple Availability Zones
automatically replacing unhealthy infrastructure
scaling the LB fleet as traffic increases
providing built-in health checks and failover mechanisms
Trade-off: less control over internals, and you must design within the feature set of the managed product.
8. Key Takeaways
Load balancers distribute traffic across multiple servers for better performance, availability, and scalability.
Algorithms matter. Round Robin is simple but does not account for server load. Least Connections adapts to varying request times.
Layer 4 vs Layer 7: Layer 4 is faster but less flexible. Layer 7 enables content-based routing and advanced features.
Health checks are critical. Without them, users get routed to dead servers.
Session persistence has trade-offs. Sticky sessions can cause uneven load. Externalized sessions are usually better.
SSL termination at the load balancer reduces backend server CPU usage.
Load balancers need HA too. Use active-passive or active-active setups to avoid single points of failure.
Thank you for reading!
If you found it valuable, hit a like ❤️ and consider subscribing for more such content every week.
If you have any questions/suggestions, feel free to leave a comment.