In a single-node database, indexing is relatively simple. Databases use data structures such as B-Trees, Hash Maps, or Bitmap indexes to create shortcuts that allows them to quickly locate rows without scanning the entire table.
But what happens when your database is distributed across dozens or hundreds of machines?
Where should the index live?
Should it be local to each node or global across the cluster?
How do we keep it consistent when data is replicated?
What happens when queries span multiple machines?
In this article, we will explore these challenges and learn how indexing works in distributed databases.
1. Challenged with Distributed Indexing
Modern applications generate massive volumes of data, far beyond what a single machine can store or process efficiently. To scale horizontally, databases distribute data across multiple nodes in a cluster.
Two core techniques make this distribution possible:
Sharding: The dataset is split into slices, and each shard is stored on a different node. For example, one shard might store users with names starting from A–M, while another stores N–Z. Queries are then routed to the relevant shard based on the partitioning strategy.
Replication: Every piece of data is stored on multiple nodes to ensure fault tolerance and high availability. If one node fails, another replica can take over seamlessly.
While sharding and replication provide scalability and reliability, they create new challenges for indexing:
Where should indexes live?
Should each shard maintain its own local index, or should the system maintain a single global index that spans all shards?How do we keep indexes consistent?
When data is updated across multiple replicas, how do we ensure that indexes also reflect those changes correctly?How do we handle queries that touch multiple shards?
Range queries or aggregations may need to scan data spread across several nodes. Without careful indexing strategies, these queries can become bottlenecks.
These questions form the foundation of indexing in distributed databases and heavily influence the design choices made by systems like Cassandra, MongoDB, Google Spanner, and Elasticsearch.
Lets now explore two main strategies for distributed indexing: Local Indexing and Global Indexing.
2. Local Indexing
Local indexing is the simplest and most common strategy in distributed databases. In this approach, each shard maintains its own index for the subset of data it stores.
When a query arrives, the database does not know in advance which shard contains the target data. As a result, the query must be broadcast to all shards in a process known as scatter-gather:
The query is sent to every shard in the cluster.
Each shard uses its local index to search its own slice of data efficiently.
Partial results are collected from all shards.
The coordinator merges these results and returns the final answer to the client.
The key point to note is that while the query still fans out to multiple shards, each shard avoids scanning its entire dataset by leveraging its local index.
Pros
Simplicity: Each node is responsible only for its own data and index. There is no global coordination overhead.
Fast writes: Inserts and updates affect only the shard that owns the data. The index update happens locally, making it efficient for write-heavy workloads.
Scalability: As data volume grows, you simply add more shards, each with its own index.
Cons
Slow targeted reads: To fetch a single record, the system often queries all shards. Even if each shard responds quickly, the overall latency is higher due to network communication and result merging.
High network overhead: Scatter-gather queries can overwhelm the system as the number of nodes increases.
Local indexing optimizes lookups within a shard but does little to optimize queries across shards. It works well for workloads with frequent writes or broad queries (like analytics scans), but it struggles with highly selective queries where only one record is needed.
3. Global Indexing
Global indexing addresses the cross-shard problem directly. Instead of letting each shard operate independently, the system maintains a cluster-wide index that maps keys (such as user IDs or emails) to the shards where the corresponding data resides.
This global view allows the system to route queries to the exact shard that contains the data, eliminating the need for scatter-gather queries.
Suppose we want to look up:
SELECT * FROM Users WHERE userId = 102;
With a global index, the process works like this:
The query first checks the global index.
The global index responds: “That userId is stored on Shard 3.”
The system forwards the query directly to Shard 3.
Shard 3 (optionally using its own local index) fetches the data and returns the result.
This approach avoids broadcasting queries to all shards and makes targeted lookups extremely efficient.
Pros
Fast selective queries: Point lookups and range queries touch only the relevant shard, reducing latency and resource usage.
Lower network overhead: Queries no longer need to fan out across the cluster.
Better user experience: From an client’s perspective, queries behave as though the entire dataset is indexed as one.
Cons
Complexity: The global index is itself a distributed system that must be partitioned, replicated, and synchronized. This adds operational overhead and new failure points.
Slower writes: Each write requires at least two updates: one to the shard storing the data and another to the global index. This increases latency and can create write hotspots.
Consistency trade-offs: To keep the global index up to date, the system must choose between strong consistency (slower writes but accurate reads) and eventual consistency (faster writes but temporarily stale reads).
Global indexing optimizes search across nodes but complicates write operations. It is often used in systems where reads are frequent and selective (e.g., user lookups by ID or email).
4. Maintaining Index Consistency
One of the hardest problems in distributed indexing is keeping the index in sync with the underlying data.
This problem becomes especially tricky with global indexes.
For example, imagine inserting a new user into Shard 1. The write succeeds on the shard, but the update to the global index fails or is delayed. For a brief period, queries that rely on the global index will not be able to locate that user, even though the data is already stored.
To handle this, distributed databases typically adopt one of two consistency models:
Strong Consistency: Systems like Google Spanner use protocols such as Two-Phase Commit or Paxos to ensure that both the data write and the index write happen atomically. This guarantees correctness but increases write latency.
Eventual Consistency: Systems like Cassandra prioritize availability and performance. They update indexes asynchronously, meaning writes are acknowledged immediately, but indexes may lag behind. For many workloads, such as analytics or logging, this trade-off is acceptable.
In practice, databases often provide both options, leaving it up to developers to choose between guaranteed correctness with slower writes or faster writes with the possibility of temporary staleness, depending on the workload.
5. How Indexes Are Distributed
Indexes in distributed databases are themselves large-scale data structures that must be carefully managed across many nodes. Just like primary data, they need to be partitioned, replicated, and kept consistent.
The way partitioning and replication apply depends on the type of index:
Local indexes: Partitioning and replication happen implicitly as part of the shard. Each shard naturally maintains its own local index, and when the shard is replicated, the index is replicated too. No additional design is required.
Global indexes: Partitioning and replication must be explicitly designed and managed. Because a global index spans the entire dataset across all shards, it needs its own partitioning strategy and replication mechanism. This makes it significantly more complex to maintain.
Let’s look at how partitioning and replication work for global indexes in more detail.
Partitioned Indexes
Global indexes are often sharded in the same way as the data they reference. Instead of storing one massive index on a single machine, the index is divided into pieces and spread across the cluster.
A partitioning scheme (commonly hash-based or range-based) decides which node stores which portion of the index.
For example, if we build an index on
email
, the system might take the hash of the email value to determine which shard should hold that entry.This prevents any single node from becoming a bottleneck for indexing or querying.
Benefits:
Scales horizontally as data volume grows.
Avoids a single point of failure.
Each node handles only a fraction of the index, which reduces memory and CPU pressure.
However, partitioned indexes also mean that answering a query may involve multiple shards, especially if the query spans a range of values.
Replicated Indexes
Global Indexes can also be replicated across multiple nodes, much like how data replication works. Instead of a single copy of an index partition, the system maintains multiple replicas.
A query can be served by the nearest replica, improving read latency.
If one node fails, another replica can immediately take over, ensuring high availability.
Benefits:
Improved performance through load balancing (read from the nearest replica).
Improved fault tolerance, since replicas provide redundancy.
Challenges:
All replicas must stay consistent when data is updated.
Synchronization protocols introduce additional overhead.
In practice, most distributed databases use a combination of partitioning and replication for global indexes:
Indexes are partitioned to achieve scalability and distribute workload evenly.
Each partition is replicated to improve fault tolerance and query performance.
This hybrid strategy ensures that distributed indexes can scale with data growth while remaining resilient to failures and responsive to queries.
Thank you for reading!
If you found it valuable, hit a like ❤️ and consider subscribing for more such content.
If you have any questions or suggestions, leave a comment.
P.S. If you’re enjoying this newsletter and want to get even more value, consider becoming a paid subscriber.
As a paid subscriber, you’ll unlock all premium articles and gain full access to all premium courses on algomaster.io.
There are group discounts, gift options, and referral bonuses available.
Checkout my Youtube channel for more in-depth content.
Follow me on LinkedIn, X and Medium to stay updated.
Checkout my GitHub repositories for free interview preparation resources.
I hope you have a lovely day!
See you soon,
Ashish
Simple yet very effective insights into Local v/s Global Indexes. Thanks Ashish for writing and sharing this.. More power to your work !