One of the things that make databases truly powerful is their ability to protect your data even in the face of unexpected failures.
Whether the database server crashes, restarts, or there’s a sudden power outage, you can trust that your committed data won’t simply disappear.
This promise is known as Durability — one of the four essential ACID properties of databases.
But what does it actually take to make a database durable?
In this article, we'll explore the key techniques databases use to ensure durability, including:
Write-Ahead Logging (WAL)
Checkpointing
Replication
1. Write-Ahead Logging (WAL)
One of the most fundamental techniques databases use to guarantee durability is called Write-Ahead Logging (WAL).
The idea is simple:
Always write changes to a log first, before updating the main data files.
This simple-looking step is powerful:
It gives the database a persistent, chronological record of every change, which can later be replayed to recover the exact state after a crash.
How WAL Ensures Durability (Step-by-Step)
1. Log the Change
Whenever the database needs to perform a change, such as an INSERT, UPDATE, or DELETE, it first creates a detailed record in a sequential append-only log file called the WAL (or sometimes a commit/redo log).
This write is typically done in memory first, making it fast and efficient.
Example WAL record:
LSN: 0/01000050
Record Type: INSERT
Table: public.users
Page: 2003
Tuple ID: (0, 5)
Inserted Values:
id = 101
name = 'Alice'
email = 'alice@example.com'
Transaction ID: 5001
LSN (Log Sequence Number) uniquely identifies the position of the record in the WAL.
Record Type describes the operation (in this case, an INSERT).
Other details specify exactly what changed, where it changed, and under which transaction.
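The record above can be sketched as a simple structure. The field names and the text serialization here are illustrative, modeled on the example, not any real engine's binary format:

```python
from dataclasses import dataclass

# A minimal sketch of a WAL record, loosely modeled on the example above.
# Real engines use a compact binary layout; text keeps the sketch readable.
@dataclass
class WalRecord:
    lsn: str           # Log Sequence Number: position in the WAL
    record_type: str   # e.g. "INSERT", "UPDATE", "DELETE"
    table: str
    page: int
    tuple_id: tuple
    values: dict
    xid: int           # transaction ID

    def serialize(self) -> bytes:
        lines = [
            f"LSN: {self.lsn}",
            f"Record Type: {self.record_type}",
            f"Table: {self.table}",
            f"Page: {self.page}",
            f"Tuple ID: {self.tuple_id}",
            f"Values: {self.values}",
            f"Transaction ID: {self.xid}",
        ]
        return ("\n".join(lines) + "\n").encode()

record = WalRecord("0/01000050", "INSERT", "public.users", 2003, (0, 5),
                   {"id": 101, "name": "Alice", "email": "alice@example.com"}, 5001)
```

Serializing the record into a flat byte string is what makes it cheap to append to a sequential log in the next step.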
2. Flush the Log to Disk
After creating the WAL record in memory, the database flushes it to durable disk storage.
This is done using system calls like fsync(), which force the operating system to physically persist the WAL entry, bypassing any in-memory caching layers.
Only after the WAL record is safely written to disk does the database consider the change durable.
Even if the database server crashes immediately after this step, the change is preserved on disk and can be used for recovery.
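The flush step can be sketched in a few lines (the file name `wal.log` and the record bytes are placeholders; real engines write binary records and batch their flushes):

```python
import os

def append_and_flush(wal_path: str, record: bytes) -> None:
    # Append the record, then force it to physical storage.
    # Only after os.fsync returns may the change be considered durable.
    with open(wal_path, "ab") as f:
        f.write(record)
        f.flush()             # push Python's userspace buffer to the OS
        os.fsync(f.fileno())  # ask the OS to persist past its page cache

append_and_flush("wal.log", b"INSERT public.users id=101 name='Alice'\n")
```

Note the two distinct layers: `flush()` only empties the process's buffer, while `fsync()` is what actually forces the bytes onto durable media.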
3. Acknowledge the Commit
Once the WAL record is durably stored on disk, the database sends a "Success" response back to the client.
At this point:
The transaction is officially considered committed and durable.
The client can safely move on, confident that the change will not be lost.
The main data files (such as tables or indexes) may not have been updated yet, but that is acceptable because the WAL contains everything needed to recover the change.
4. Update in-memory Data Pages
After acknowledging the commit, the database may also update in-memory versions of the affected data pages.
These updated pages, held in buffers, reflect the latest state of the database.
However, they do not need to be immediately flushed to disk, which allows the system to continue performing at high speed.
5. Apply Changes to Data Files (Later)
Writing the actual updated pages to the database’s main data files happens later, typically through background processes like:
Checkpointing
Lazy background flushes
This deferred approach provides important performance benefits:
Multiple updates to the same page can be combined, reducing redundant writes.
Disk writes become more sequential and efficient.
Meanwhile, if a crash occurs before the background processes complete, the WAL still contains all the information needed to restore the database to a consistent state.
6. Crash Recovery via WAL
If the database crashes at any point, here's what happens during restart:
The database reads the WAL starting from the last known checkpoint.
It replays any committed transactions that were recorded in the WAL but not yet fully applied to the data files.
It restores the database to the consistent, committed state it was in just before the crash.
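The restart sequence can be sketched as a replay loop. The record format here, `(lsn, xid, op)` tuples with explicit COMMIT markers, is invented for illustration:

```python
# A sketch of crash recovery: replay WAL records after the last checkpoint,
# applying only operations that belong to committed transactions.
def recover(wal_records, checkpoint_lsn):
    # First pass: find transactions whose COMMIT record made it into the WAL.
    committed = {xid for lsn, xid, op in wal_records
                 if lsn > checkpoint_lsn and op == "COMMIT"}
    # Second pass: re-apply only the changes of committed transactions.
    state = []
    for lsn, xid, op in wal_records:
        if lsn > checkpoint_lsn and xid in committed and op != "COMMIT":
            state.append(op)
    return state

wal = [
    (1, 100, "INSERT id=1"),
    (2, 100, "COMMIT"),
    (3, 101, "INSERT id=2"),   # never committed: must not be replayed
    (4, 102, "INSERT id=3"),
    (5, 102, "COMMIT"),
]
print(recover(wal, checkpoint_lsn=0))  # replays only transactions 100 and 102
```

Transaction 101 never logged a COMMIT, so its change is skipped; this is how replay restores a committed-only state.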
Why WAL Also Boosts Performance
WAL doesn’t just make databases more durable; it also boosts performance.
Sequential writes, which simply append to a log file, are much faster than random writes scattered across different data files.
Group commits allow multiple transactions to be flushed together, further reducing disk I/O overhead.
By writing to a fast, sequential log first and delaying slower updates to data files, databases achieve higher throughput without compromising durability.
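Group commit can be sketched as batching pending records into a single flush. This is a toy model; real engines coordinate the batching across concurrent backends with timeouts:

```python
import os

class GroupCommitLog:
    """Toy group commit: buffer records, flush many transactions in one fsync."""
    def __init__(self, path):
        self.path = path
        self.pending = []

    def commit(self, record: bytes):
        # Queue the record; durability comes at the next flush.
        self.pending.append(record)

    def flush(self) -> int:
        # One write + one fsync covers every pending transaction.
        n = len(self.pending)
        with open(self.path, "ab") as f:
            f.write(b"".join(self.pending))
            f.flush()
            os.fsync(f.fileno())
        self.pending.clear()
        return n

log = GroupCommitLog("group_wal.log")
for i in range(3):
    log.commit(f"txn-{i} committed\n".encode())
flushed = log.flush()  # three transactions amortize a single fsync
```

Since fsync is the expensive step, amortizing it over several transactions is where the throughput win comes from.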
2. Checkpointing
While WAL protects committed data against crashes, relying on WAL alone is not practical in the long run.
Imagine a database that runs for months without any cleanup, constantly appending to the WAL.
If the system crashes, it would need to replay millions or even billions of operations from the beginning just to recover.
That recovery could take hours or even days, making the system unusable for a long time.
This is where Checkpointing comes in.
What is a Checkpoint?
A checkpoint is a point-in-time snapshot where the database ensures that all changes up to that moment are fully written to the main data files.
In simple terms:
At a checkpoint, the database synchronizes its data files with its log files up to a specific position called the Log Sequence Number (LSN).
This creates a save point in the system, so that if a crash happens, the database can start recovery from the latest checkpoint instead of from the very beginning.
Checkpointing solves two major problems:
It reduces crash recovery time by limiting the amount of WAL that needs to be replayed after a failure.
It keeps WAL file size manageable by allowing the system to delete or archive old, unnecessary logs.
Without checkpoints, a database would continuously accumulate WAL data, leading to slow recovery and excessive storage usage.
How Checkpointing Works
In systems like PostgreSQL, checkpointing typically happens periodically based on one of two conditions:
After a certain amount of time passes (for example, every 5 minutes), or
After a certain amount of WAL data is generated.
During a checkpoint, the database performs the following steps:
1. Flush Dirty Pages
All modified (dirty) pages that are currently in memory are flushed to disk.
This ensures that the data files on disk reflect all changes made up to the current WAL position.
2. Record a Special Checkpoint Marker
The database writes a checkpoint record into the WAL, marking:
"Everything up to this LSN has been safely written to disk."
This marker serves as a recovery reference point in case of a crash.
3. Reuse or Archive Old WAL Segments
After completing a checkpoint, the database can safely recycle or archive older WAL segments, because the corresponding changes are already captured in the main data files.
These older WAL records are no longer needed for crash recovery, which keeps disk usage under control.
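The three checkpoint steps can be sketched together in one function. This is an in-memory toy: `buffer_pool`, `data_files`, and the WAL list are stand-ins for real storage structures:

```python
def checkpoint(buffer_pool, data_files, wal):
    """Toy checkpoint: flush dirty pages, log a marker, drop old WAL entries."""
    # 1. Flush dirty pages so the data files reflect every change so far.
    for page_id, page in buffer_pool.items():
        if page["dirty"]:
            data_files[page_id] = page["data"]
            page["dirty"] = False
    # 2. Record a checkpoint marker with the current end-of-WAL position.
    checkpoint_lsn = wal[-1][0] if wal else 0
    wal.append((checkpoint_lsn + 1, "CHECKPOINT", checkpoint_lsn))
    # 3. Older WAL records are no longer needed for crash recovery.
    wal[:] = [rec for rec in wal if rec[0] > checkpoint_lsn]
    return checkpoint_lsn

buffer_pool = {7: {"dirty": True, "data": "row v2"},
               8: {"dirty": False, "data": "row v1"}}
data_files = {7: "row v1", 8: "row v1"}
wal = [(1, "INSERT", None), (2, "COMMIT", None)]
lsn = checkpoint(buffer_pool, data_files, wal)
```

After the call, recovery would start from the CHECKPOINT record rather than the beginning of the log, which is exactly the recovery-time win described above.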
3. Replication
So far, we’ve focused on single-node durability: ensuring that a single database instance can survive server crashes and restarts.
But what happens if the entire machine or disk is destroyed by hardware failure, disk corruption, or even a natural disaster?
In such cases, local durability alone isn't enough.
For truly resilient systems, especially distributed databases and high-availability architectures, we need Replication.
Replication simply means:
Keeping extra copies of your data on separate machines or separate storage so that even if one copy is lost, another remains available.
By maintaining multiple independent copies, replication eliminates single points of failure and ensures that the system can survive major hardware failures.
Replication is crucial not just for durability but also for high availability, allowing systems to quickly fail over to healthy replicas when needed.
Replication in Relational Databases
In relational databases like PostgreSQL, replication often follows a primary-standby model:
The primary database is responsible for handling all write operations (such as INSERT, UPDATE, and DELETE) and actively processing client requests.
Standby servers (replicas) are configured to replicate data from the primary. They continuously receive a stream of Write-Ahead Log (WAL) records generated by the primary.
As the standby receives WAL entries, it replays and applies them in real time to update its own data files and maintain an almost exact copy of the primary database.
This setup means that if the primary server fails, a standby can be promoted quickly and has nearly all the latest transactions.
Asynchronous vs Synchronous Replication
Replication can be configured in two main flavors: asynchronous and synchronous.
In asynchronous replication, the primary sends WAL to the standby, but it does not wait for acknowledgment.
The primary considers a transaction committed once its own WAL is durably stored locally. This is fast, but if the primary fails, some recent commits might not have reached the standby (they’d be lost).
In synchronous replication, the primary waits for at least one (or a quorum of) replica(s) to confirm they have the WAL data before acknowledging the commit to the client. This ensures the transaction is stored on multiple machines’ disks before it’s marked committed.
Synchronous replication thus offers stronger durability across nodes. Even if the primary server explodes right after acknowledging a commit, a standby has the data. The downside is added latency: each transaction commit waits for network and disk I/O on another node.
Many production systems that cannot afford any data loss, such as banking systems, use synchronous replication to ensure durability across servers.
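The difference between the two modes can be sketched as two commit paths. These are toy in-memory replicas; real systems ship WAL over the network and fsync on each node:

```python
def commit_async(primary_wal, replicas, record):
    # Async: durable locally, replicas catch up later. A primary crash
    # before shipping can lose this commit on the replica side.
    primary_wal.append(record)
    return "committed"  # acknowledged without waiting for any replica

def commit_sync(primary_wal, replicas, record, quorum=1):
    # Sync: wait until at least `quorum` replicas hold the record
    # before acknowledging the commit to the client.
    primary_wal.append(record)
    acks = 0
    for replica_wal in replicas:
        replica_wal.append(record)  # stand-in for network ship + fsync + ack
        acks += 1
        if acks >= quorum:
            return "committed"
    raise RuntimeError("not enough replicas acknowledged")

primary, r1, r2 = [], [], []
commit_sync(primary, [r1, r2], "txn-1", quorum=1)
```

The `quorum` parameter captures the trade-off: higher values mean stronger cross-node durability but a longer wait before the client sees "Success".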
Replication in Distributed Databases
In distributed databases (like NoSQL or NewSQL systems), replication is often built into the core architecture.
One popular example of this is Apache Cassandra, a widely used distributed NoSQL database.
When you write data in Cassandra, the following process takes place:
Write to Memory (Memtable): The incoming data is first written into an in-memory structure called the memtable. This provides fast, low-latency ingestion of writes.
Append to Commit Log: Simultaneously, the write is appended to a durable commit log stored on disk. This commit log serves a role similar to the Write-Ahead Log (WAL) found in traditional relational databases.
Acknowledge Success: Once the data is recorded in both the memtable and the commit log, the system acknowledges the write as successful to the client.
This two-step logging model ensures that if a node crashes after a write is acknowledged, the node can recover its data by replaying the commit log upon restart.
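That write path can be sketched with a toy node. This models the memtable-plus-commit-log idea only; it is not Cassandra's actual code, and the log format is invented:

```python
import os

class ToyNode:
    """Toy Cassandra-style write path: memtable + commit log, then ack."""
    def __init__(self, commit_log_path):
        self.memtable = {}  # fast in-memory structure
        self.commit_log_path = commit_log_path

    def write(self, key, value):
        # Append to the durable commit log (the WAL-like safety net)...
        with open(self.commit_log_path, "a") as log:
            log.write(f"{key}={value}\n")
            log.flush()
            os.fsync(log.fileno())
        # ...and update the memtable for fast reads.
        self.memtable[key] = value
        return "ACK"  # acknowledged only after both steps

    def recover(self):
        # After a crash, rebuild the memtable by replaying the commit log.
        self.memtable = {}
        with open(self.commit_log_path) as log:
            for line in log:
                key, value = line.rstrip("\n").split("=", 1)
                self.memtable[key] = value

node = ToyNode("commit.log")
node.write("user:101", "Alice")
node.memtable = {}  # simulate a crash wiping memory
node.recover()      # the commit log restores the lost write
```

Wiping the memtable and replaying the log mirrors exactly the crash-recovery guarantee described above.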
Beyond local durability, Cassandra also replicates each piece of data across multiple nodes in the cluster.
The replication factor determines how many copies of the data exist.
For example, with a replication factor of 3, each piece of data is stored on three different nodes.
When a write operation occurs, you can configure the system to wait for acknowledgments from a certain number of replicas before considering the write fully successful.
This behavior is controlled by the consistency level setting.
Common consistency levels include:
ONE: Wait for acknowledgment from one replica.
QUORUM: Wait for a majority of replicas (more than half) to acknowledge.
ALL: Wait for all replicas to confirm.
Choosing the right consistency level allows you to balance trade-offs between latency, availability, and durability based on your application's needs.
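How many acknowledgments each level demands can be sketched as a small helper (a hypothetical function; only the level names mirror Cassandra's):

```python
def required_acks(level: str, replication_factor: int) -> int:
    # Number of replica acknowledgments each consistency level demands.
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1  # strict majority
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")

# With replication factor 3:
print(required_acks("ONE", 3))     # 1
print(required_acks("QUORUM", 3))  # 2
print(required_acks("ALL", 3))     # 3
```

With a replication factor of 3, QUORUM tolerates one slow or failed replica while still guaranteeing that a majority holds the write.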
By combining local WAL-style durability with network-wide replication, Cassandra achieves a highly resilient architecture capable of surviving both software crashes and hardware failures.
Conclusion
Let’s recap how modern databases ensure that once your data is committed, it stays safe and durable, even in the face of crashes or hardware failures:
Write-Ahead Logging (WAL): Acts as the foundation for durability by recording every change in a sequential, append-only log before updating the actual data files.
fsync (or similar techniques): Ensures that WAL entries are physically written to disk, not just cached in memory. Once flushed, the data is considered truly durable.
Checkpointing: Periodically syncs the changes recorded in WAL with the main database files. This reduces recovery time after a crash and prevents the log from growing indefinitely.
Replication: Creates redundant copies of your data across multiple machines or geographic locations. This protects against not just software or disk failures, but also full system or site-level outages.
Thank you for reading!
If you found it valuable, hit a like ❤️ and consider subscribing for more such content every week.
P.S. If you’re enjoying this newsletter and want to get even more value, consider becoming a paid subscriber.
As a paid subscriber, you'll receive an exclusive deep dive every Thursday, access to a structured system design resource, and other premium perks.
There are group discounts, gift options, and referral bonuses available.
Check out my YouTube channel for more in-depth content.
Follow me on LinkedIn and X to stay updated.
Check out my GitHub repositories for free interview preparation resources.
I hope you have a lovely day!
See you soon,
Ashish