When you try to sign up on a platform like Instagram and type in a username, the system almost instantly tells you whether it's available. If it's taken, it even suggests alternatives on the spot.
Great post, seeing some things come up again and again.
Very informative, thanks for sharing!
Insightful.
Nice 👍
Nice one
From this it looks like bloom filters are not about giving quick responses to "is this username taken?", because they don't. We still hit our cache/DB in those cases, and I imagine those calls are already optimised by good indexing.
So bloom filters are about not making a DB call in the case when the username is available.
Which seems a bit counterintuitive. When you add a username it's a slower call either way: you PUT it in your DB.
When you look up a username that already exists and we get to know that immediately, that's what I'd associate with a fast call. But that does not happen. If anything, the bloom filter potentially slows us down, because there's an extra check.
So why are bloom filters associated with speed? They should be associated with efficiency: not making unnecessary DB calls. It's a subtle difference.
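The trade-off discussed above can be sketched in a few lines. This is a minimal illustrative bloom filter; the bit-array size, the number of hashes, and the salted-SHA-256 scheme are assumptions for the example, not what any real platform uses:

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter sketch for username checks.
    A False answer means "definitely not present" (skip the DB);
    a True answer means "maybe present" (must still check the DB)."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k bit positions by hashing the item with k different salts.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice")
print(bf.might_contain("alice"))  # True: could be taken, verify in the DB
```

Note that the saving is exactly the one described above: only the "definitely not present" answer lets you skip the DB round trip; every "maybe" still costs a lookup.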
Wouldn't the in-memory bloom filter have consequences? For example, during scale-out or a server restart, the application needs to re-initialise the bloom filter. Plus, all the servers need to stay in sync whenever a new user registers.
Maybe have the bloom-filter in redis itself?
Generally, how many hash functions do we need to use for handling billions of usernames?
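For sizing, the standard formulas are m = -n·ln(p)/(ln 2)² bits and k = (m/n)·ln 2 hash functions, where n is the expected item count and p the target false-positive rate. The concrete numbers below (one billion usernames, 1% false-positive target) are an illustration, not a recommendation:

```python
import math

def bloom_params(n, p):
    """Standard bloom filter sizing: n = expected items,
    p = target false-positive rate. Returns (bits, hash functions)."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # bits needed
    k = round((m / n) * math.log(2))                      # optimal hash count
    return m, k

bits, hashes = bloom_params(1_000_000_000, 0.01)
print(hashes)            # 7 hash functions
print(bits / 8 / 2**30)  # roughly 1.1 GiB of memory
```

So at a billion usernames with a 1% false-positive rate, around 7 hash functions and a bit over a gigabyte of memory suffice.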
Amazing 😍
Great post. However, I am not sure about level 4 where trie is used for suggesting similar usernames. In my understanding, they are not efficient at a billion-users scale. 🤔
What if the data is stored in a non-relational database such as MongoDB or DynamoDB? In that case, assuming username is the partition key, is the DB lookup faster, or is using a bloom filter? Also, if the bloom filter or trie is stored in memory, won't it be cleared on a server restart?
When you talk about fast lookups with a trie, how does it take less time than checking in the DB?
If the DB itself uses a trie data structure, then it makes sense.
If someone explains it here, it'll be really helpful.
A trie is usually stored in memory, while the DB fetches from disk; that's why the trie is usually faster. Also, a trie lookup runs in O(m), where m is just the length of the string, which is a lot faster. Many relational databases take O(log N) time for lookups even after indexing, where N is the number of records.
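The O(m) walk described above can be sketched as follows. This is a bare-bones illustration; a real service at billion-user scale would shard the structure and worry about memory, which is exactly the concern raised earlier in the thread:

```python
class TrieNode:
    __slots__ = ("children", "terminal")

    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.terminal = False # True if a stored username ends here

class Trie:
    """Minimal trie: a lookup visits one node per character,
    so cost is O(m) in the key length, independent of how many
    usernames are stored."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.terminal

t = Trie()
t.insert("alice")
print(t.contains("alice"))  # True
print(t.contains("ali"))    # False: only a prefix, not a stored name
```

The same prefix walk is what makes tries a natural fit for suggesting similar usernames: once you reach the node for a prefix, every terminal node below it is a candidate suggestion.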