When you try to sign up on a platform like Instagram and type in your username, the system almost instantly tells you whether it’s available or not. If it’s taken, it even suggests alternatives on the spot.
Great post, seeing some things come up again and again.
Awesome post!!
Very informative, thanks for sharing!
Insightful.
Nice 👍
Nice one
Wouldn’t the in-memory bloom filter have consequences? During scale-out or a server restart, the application needs to re-initialise the bloom filter. Plus, all the servers need to stay in sync whenever a new user registers.
Maybe have the bloom-filter in redis itself?
A good question indeed. Maybe the bloom filter is cached because it would be quite expensive to regenerate it.
From this it looks like bloom filters are not about giving quick responses for "is this username taken?", because they don't. We still hit our cache/DB in those cases, and I imagine those calls are already well optimised thanks to good indexing.
So bloom filters are about not making a DB call in the case where the username is available.
Which seems a bit counterintuitive. When you add a username it's a slower call either way: you PUT it in your DB.
When you look up a username that already exists and we learn that immediately --> that's what I'd associate with a fast call. But that does not happen. If anything, the bloom filter potentially slows us down because there's an extra check.
So why are bloom filters associated with speed? They should be associated with efficiency: not making unnecessary DB calls. It's a subtle difference.
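To make that distinction concrete, here's a minimal sketch of the check-before-DB pattern. Everything here is an illustrative stand-in, not the post's actual implementation: the `BloomFilter` class is a toy (an int used as a bit array, SHA-256 as the hash family), and the `db` set stands in for the real database.

```python
import hashlib

class BloomFilter:
    """Toy bloom filter: k hash functions over an m-bit array."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = 0  # a Python int doubles as an arbitrarily large bit array

    def _positions(self, item):
        # Derive k positions by salting the item with the hash index.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        # False -> definitely absent; True -> possibly present.
        return all(self.bits & (1 << p) for p in self._positions(item))

def is_username_taken(bf, username, db):
    if not bf.might_contain(username):
        return False            # fast path: skip the DB entirely
    return username in db       # slow path: confirm against the source of truth

db = {"alice", "bob"}           # stand-in for the real user store
bf = BloomFilter()
for u in db:
    bf.add(u)

assert is_username_taken(bf, "alice", db) is True
assert is_username_taken(bf, "new_user_123", db) is False
```

Note how a "taken" answer always still hits the DB; the filter only ever saves the DB round trip for (most) available names, which is exactly the efficiency-not-latency point above.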
Generally, how many hash functions do we need to use for handling billions of usernames?
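The standard sizing formulas answer this: for n items and a target false-positive rate p, the optimal bit count is m = -n·ln(p)/(ln 2)² and the optimal number of hash functions is k = (m/n)·ln 2. A quick sketch (the 1% false-positive target below is my assumption, not a figure from the post):

```python
import math

def bloom_params(n, p):
    """Optimal bloom filter sizing for n items at false-positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # bits in the array
    k = max(1, round((m / n) * math.log(2)))              # hash functions
    return m, k

m, k = bloom_params(1_000_000_000, 0.01)
# k works out to 7 hash functions; m is roughly 9.6 billion bits,
# i.e. on the order of 1.1 GiB of memory for a billion usernames at 1% FP.
```

So for billions of usernames the answer is typically single digits of hash functions; it's the bit-array size, not k, that grows with n.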
Concise and informative.
Hey chat, can anyone tell me which other accounts I should follow as a backend engineer?
Great post, I did not know about in-memory bloom filters or how useful some in-memory data structures could be for fast lookups. I have a few questions:
1. How do we warm up the bloom filter? For the very first request it is all initialized to 0, which means every username is available according to the filter.
2. We have multiple server instances (since we have a load balancer). Say one server's in-memory bloom filter is updated to record that the username "abc" is taken. How is that change synchronised to the other servers' in-memory bloom filters?
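On question 1, a common approach is to warm the filter at startup by streaming every existing username out of the DB once, in pages, and adding each to the filter. A hypothetical sketch (the `fetch_page` function and `ALL` list stand in for a paginated DB query, and `seen.add` stands in for the filter's add method):

```python
def warm_up(bloom_add, fetch_page):
    """Stream all usernames page by page, feeding each into the filter."""
    cursor = 0
    while True:
        names, cursor = fetch_page(cursor)
        for name in names:
            bloom_add(name)
        if cursor is None:   # no more pages
            break

# Stand-in "DB": pages of 2 usernames each, keyed by an integer cursor.
ALL = ["alice", "bob", "carol", "dave", "erin"]
def fetch_page(cursor):
    page = ALL[cursor:cursor + 2]
    nxt = cursor + 2 if cursor + 2 < len(ALL) else None
    return page, nxt

seen = set()                 # stand-in for the bloom filter
warm_up(seen.add, fetch_page)
assert seen == set(ALL)
```

For question 2, one way to sidestep per-server sync entirely is to keep a single shared filter in an external store (as another commenter suggests with Redis), at the cost of a network hop per check.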
Amazing 😍
Great post. However, I am not sure about level 4, where a trie is used for suggesting similar usernames. In my understanding, tries are not efficient at a billion-user scale. 🤔
What if the data is stored in a non-relational database such as MongoDB or DynamoDB? In that case, assuming username is the partition key, is the DB lookup faster, or the bloom filter? Also, if the bloom filter or trie is stored in memory, won't it be cleared on a server restart?
When you talk about fast lookups with a trie, how does it take less time than checking in the DB?
If the DB itself uses a trie data structure, then it makes sense.
If someone could explain it here, it'd be really helpful.
A trie is usually stored in memory while the DB fetches from disk; that's why the trie is usually faster. Also, a trie lookup runs in O(m), where m is just the length of the string, which is a lot faster. Many relational databases take O(log N) time for lookups even after indexing, where N is the number of records.
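A minimal sketch of that O(m) lookup: the cost of `contains` below depends only on the length of the word walked, never on how many usernames are stored. (Class and method names are mine, purely illustrative.)

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_end = False  # marks the end of a stored username

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        # O(m) in the word length: one dict hop per character.
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def contains(self, word):
        # Also O(m); total record count N never enters the loop.
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_end

t = Trie()
for name in ["alice", "alicia", "bob"]:
    t.insert(name)

assert t.contains("alice")
assert not t.contains("ali")  # a shared prefix, not a stored username
```

The shared-prefix structure is also what makes tries handy for the suggestion use case: once you've walked to the node for "ali", every stored name below it is a candidate.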
Thanks