22 Comments

This was quite amazing and super super helpful. I'm an experienced coder, but I haven't had many opportunities to design high-scale systems, at work or otherwise, so this is going to help me a ton.

Really happy to hear this, thank you so much!

Nice one, Ashish 👍🏻

Excellent study case, Ashish!

Impressive! You are truly a gem. Thank you for dedicating your time and sharing such valuable content with such clarity.

Thank you 😊

Love to hear this.

Great content, brother

thank you!

Thanks for sharing.

you are welcome!

Ashish,

This is awesome content. Just a few points to add, if you agree:

1) Sync mechanisms between the cache and the User Preferences NoSQL DB to ensure consistency

2) What the cache payload looks like, along with the key

3) Cache TTL, eviction strategy, invalidation strategy, cache writing strategy, and handling data-level conflicts (discrepancies between the current cache and the NoSQL DB entry)

4) Deduplication of notifications so that users don’t get the same notification twice
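
On point 4, one common way to avoid repeat notifications is an idempotency check keyed on the user plus a dedup key derived from the event (for example an order ID plus event type). Below is a minimal Python sketch, assuming such a key exists; `NotificationDeduplicator` and `should_send` are hypothetical names, and the in-memory dict stands in for a shared cache where you would use an atomic set-if-absent with expiry (e.g. Redis `SET` with the `NX` and `EX` options).

```python
import time
from typing import Callable, Dict, Tuple


class NotificationDeduplicator:
    """Hypothetical helper: suppress a (user_id, dedup_key) pair seen within ttl_seconds.

    The dict is an in-memory stand-in for a shared cache with per-key expiry.
    Expired entries are only overwritten here, never purged; a real TTL store
    would evict them on its own.
    """

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._expiry: Dict[Tuple[str, str], float] = {}

    def should_send(self, user_id: str, dedup_key: str) -> bool:
        now = self.clock()
        key = (user_id, dedup_key)
        if self._expiry.get(key, float("-inf")) > now:
            return False  # already sent inside the TTL window: skip the duplicate
        self._expiry[key] = now + self.ttl_seconds  # remember it for ttl_seconds
        return True


dedup = NotificationDeduplicator(ttl_seconds=3600)
print(dedup.should_send("user-1", "order-42-shipped"))  # True
print(dedup.should_send("user-1", "order-42-shipped"))  # False (duplicate)
```

The check belongs right before the channel send, and with several service instances it only works if they all share the same store.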

Very nicely explained, especially the details of the request message and the types of DB usage, and the way the Notification Service crafts the message for each channel. I'm learning a lot from you.

How rate limiting based on user preferences is performed isn't clearly covered in the design above. The flow in which the service checks how many notifications have been sent to a user and compares that count with the user's preferences is missing, as are the bottlenecks related to it.
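
To make that flow concrete, here is a minimal sliding-window sketch in Python, assuming each user's preference is expressed as a maximum number of notifications per time window. `PreferenceRateLimiter`, `max_per_window`, and `default_limit` are hypothetical names. With several Notification Service instances, the per-user send history would have to live in a shared store (such as a cache), and that extra read/write on every send is where the bottleneck would likely show up.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class PreferenceRateLimiter:
    """Hypothetical sliding-window limiter driven by per-user preferences.

    `max_per_window` stands in for the User Preferences store, e.g.
    {"user-1": 3} meaning "at most 3 notifications per window".
    """

    def __init__(self, max_per_window: Dict[str, int], window_seconds: float,
                 default_limit: int = 10, clock: Callable[[], float] = time.monotonic):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self.default_limit = default_limit
        self.clock = clock
        self._sent: Dict[str, Deque[float]] = defaultdict(deque)  # user -> send times

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        sent = self._sent[user_id]
        # Forget sends that have slid out of the window.
        while sent and sent[0] <= now - self.window_seconds:
            sent.popleft()
        limit = self.max_per_window.get(user_id, self.default_limit)
        if len(sent) >= limit:
            return False  # over the user's preference: drop, defer, or batch it
        sent.append(now)
        return True
```

Whether a rejected notification is dropped, delayed, or folded into a digest is a product decision the limiter itself doesn't make.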

That's what I think. If a failure occurs, will we lose all the data/events that were supposed to be picked up by the Notification Service? Can anyone help here?
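
Whether events are lost mostly depends on when the consumer acknowledges them. A minimal sketch, assuming a durable log such as a Kafka topic: commit the offset only after an event has been fully processed, so a crash means the event is read again rather than dropped (in Kafka terms, turning off `enable.auto.commit` and committing manually). The in-memory `log` list, the `committed` dict, and the `consume` function are hypothetical stand-ins. The trade-off is at-least-once delivery, so the same event can be processed twice and downstream deduplication still matters.

```python
from typing import Callable, Dict, List


def consume(log: List[str], committed: Dict[str, int], group: str,
            handle: Callable[[str], None]) -> None:
    """Process events from the group's last committed offset; commit only after success.

    If `handle` raises (e.g. the Notification Service crashes mid-batch), the
    offset is not advanced, so that event is re-read on the next run instead
    of being lost.
    """
    offset = committed.get(group, 0)
    while offset < len(log):
        handle(log[offset])          # may raise; nothing has been committed for this event
        offset += 1
        committed[group] = offset    # commit only after successful processing
```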

I learnt a lot. One question: if a channel processor fails to deliver a notification for some reason, should it take responsibility for retrying, or can it just update the scheduler database, setting the scheduled time to the next attempt using an exponential backoff strategy, so that the scheduler picks it up for the next retry? What would the expected approach be?

Thanks.
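
The second option in the question (the channel processor records the failure and pushes the next run time out, and the scheduler picks the job up again) could look roughly like this minimal Python sketch. `DeliveryJob` and `reschedule_after_failure` are hypothetical, and the base delay, cap, and attempt limit are illustrative defaults, not values from the article. It uses exponential backoff with optional "full jitter".

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeliveryJob:
    notification_id: str
    attempt: int          # failed attempts so far
    next_run_at: float    # epoch seconds; the scheduler picks up rows whose time has passed


def reschedule_after_failure(job: DeliveryJob, now: float, base: float = 2.0,
                             cap: float = 3600.0, max_attempts: int = 5,
                             jitter: bool = True,
                             rng: Optional[random.Random] = None) -> Optional[DeliveryJob]:
    """Return the job with a later next_run_at, or None after max_attempts failures."""
    attempt = job.attempt + 1
    if attempt >= max_attempts:
        return None  # hand off to a dead-letter queue / mark as failed
    delay = min(cap, base * 2 ** (attempt - 1))  # base, 2*base, 4*base, ... up to cap
    if jitter:
        # "Full jitter": pick a random delay in [0, delay] so failed jobs don't all retry at once.
        delay = (rng or random.Random()).uniform(0, delay)
    return DeliveryJob(job.notification_id, attempt, now + delay)
```

One upside of this approach is that the channel processor stays stateless: the retry state lives in the scheduler database, so a restarted processor loses nothing.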

Thank you so much for this amazing explanation. Can this notification system design fit my situation? I have 30k IoT devices, and users can create custom conditions. For example, user1 has device1 and device2 and may create condition1 (e.g., notify me if the temperature > 89) for device1 and condition2 (temperature > 60) for device2.
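
One way such per-device rules could be modeled is sketched below in Python: conditions are indexed by device, so each incoming reading is checked only against that device's rules, and every match would become a notification event fed into the pipeline. `Condition`, `ConditionIndex`, and the operator table are hypothetical; a real deployment would probably also want to suppress repeat alerts while a reading stays above its threshold.

```python
import operator
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List

# Comparison operators a user may pick when defining a condition.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass(frozen=True)
class Condition:
    condition_id: str
    user_id: str
    device_id: str
    metric: str        # e.g. "temperature"
    op: str            # one of the keys in OPS
    threshold: float


class ConditionIndex:
    """Indexes user-defined conditions by device so a reading checks only its own rules."""

    def __init__(self, conditions: Iterable[Condition]):
        self._by_device: Dict[str, List[Condition]] = defaultdict(list)
        for c in conditions:
            self._by_device[c.device_id].append(c)

    def matches(self, device_id: str, metric: str, value: float) -> List[Condition]:
        # .get() avoids creating empty entries for unknown devices.
        return [c for c in self._by_device.get(device_id, [])
                if c.metric == metric and OPS[c.op](value, c.threshold)]
```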

Yes, this is definitely one way to implement this use case. Try starting with a prototype for a few clients, then scale up to 30k devices.

Let’s talk about implementing the notification service with the best design/architecture/patterns.

I would like to do it in Rust.

Anyone interested in tagging along?

I am thinking of open-sourcing such components so lots of others can benefit.

Great post, Ashish; the main components and flow are well articulated.

I noticed that message content generation is done by the Notification Service. Would it be more resilient and robust to put it behind an inbound queue to absorb load spikes on the Notification Service? Each request to the Notification Service to generate a message also holds a connection from the web server until the generated message is placed in the queue. What if an error occurs while generating the message content; will the request be dropped?
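
A minimal Python sketch of the idea in this comment, assuming the web server only enqueues the raw request (and can respond immediately) while workers do the content generation. `drain_inbound` and `render` are hypothetical, and `queue.Queue` stands in for a durable broker. A failed render is parked in a dead-letter queue together with its error instead of being dropped.

```python
import queue
from typing import Callable


def drain_inbound(inbound: "queue.Queue[dict]", outbound: "queue.Queue[str]",
                  dead_letter: "queue.Queue[tuple]",
                  render: Callable[[dict], str]) -> None:
    """Worker loop: render each queued request; park failures instead of dropping them."""
    while True:
        try:
            request = inbound.get_nowait()
        except queue.Empty:
            return  # nothing left to process in this sketch
        try:
            outbound.put(render(request))                 # generated message goes downstream
        except Exception as exc:
            dead_letter.put((request, repr(exc)))        # keep the request for inspection/replay
        finally:
            inbound.task_done()
```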

Great insights. This article shows how to think broadly about every aspect when designing a scalable and reliable system. Can you give some examples of scheduler services that are readily available to use? Also, could you provide a deeper explanation of how a scheduler service should be designed?
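
At its core, a scheduler keeps jobs ordered by due time and hands over whatever is due. A minimal in-memory Python sketch using a min-heap is below; `InMemoryScheduler` is a hypothetical name. A production scheduler would persist jobs (for example in a table indexed on the due time) and claim them atomically so that several scheduler instances don't pick up the same job.

```python
import heapq
import itertools
from typing import Any, List, Tuple


class InMemoryScheduler:
    """Min-heap of (run_at, seq, job); poll() returns every job that is due, earliest first."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, Any]] = []
        self._seq = itertools.count()  # tie-breaker: equal run_at keeps insertion order

    def schedule(self, run_at: float, job: Any) -> None:
        heapq.heappush(self._heap, (run_at, next(self._seq), job))

    def poll(self, now: float) -> List[Any]:
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[2])
        return due
```

A poller would call `poll(time.time())` on an interval and publish the returned jobs to the delivery queue.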

How would you partition the Kafka topics to support such a large workload? Also, we are sending large email attachments through the Kafka queue; won't this affect performance, and is there anything we can do to mitigate it?
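
Two common answers, sketched in Python under stated assumptions: key messages by `user_id`, so one user's notifications stay ordered within a partition while load spreads across partitions (the partition count also caps consumer parallelism, since each partition is read by at most one consumer in a group); and use the claim-check pattern for attachments, storing the file out of band and putting only a reference on the topic. Kafka's default partitioner hashes keys with murmur2; CRC32 is used here only as a stable stand-in, and `ClaimCheck`, `build_email_event`, and the dict "object store" are hypothetical. Keeping messages small matters because brokers cap message size (`message.max.bytes`, about 1 MB by default).

```python
import uuid
import zlib
from typing import Dict, Tuple


def partition_for(user_id: str, num_partitions: int) -> int:
    """Stable key -> partition mapping (CRC32 as a stand-in for Kafka's murmur2)."""
    return zlib.crc32(user_id.encode("utf-8")) % num_partitions


class ClaimCheck:
    """Stores large attachments out of band; only a reference travels on the topic."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}  # stand-in for an object store

    def check_in(self, payload: bytes) -> str:
        ref = f"attachments/{uuid.uuid4()}"
        self._store[ref] = payload
        return ref

    def check_out(self, ref: str) -> bytes:
        return self._store[ref]


def build_email_event(user_id: str, body: str, attachment: bytes,
                      store: ClaimCheck, num_partitions: int) -> Tuple[int, dict]:
    """Return (partition, small message) with the attachment replaced by a reference."""
    message = {"user_id": user_id, "body": body,
               "attachment_ref": store.check_in(attachment)}
    return partition_for(user_id, num_partitions), message
```

The email channel processor would call `check_out` with the reference just before sending.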
