System design is often misunderstood as a simple exercise of sketching boxes and arrows on a whiteboard. In reality, it’s much more than that.
It’s the art and science of building systems that can serve millions of users, survive failures, adapt to growth, and remain cost-effective over time.
At its core, system design is about asking the right questions.
Every large-scale system, whether it’s Netflix streaming billions of hours of video, WhatsApp handling billions of messages per day, or your startup’s web app scaling to thousands of users, faces a similar set of fundamental challenges.
In this article, we’ll explore the 10 big questions of system design that will guide your thinking and help you make better architectural decisions.
1. Scalability
“How will the system handle a large number of users or requests simultaneously?”
Scalability is about a system's ability to handle a growing number of users, requests, or data without a drop in performance.
This isn't just a matter of making things bigger; it's about smart growth. A well-designed system should be able to serve one user, one million users, or one hundred million users with only minimal architectural changes.
Think of scalability as preparing your system for success. If tomorrow your app suddenly goes viral, will it crash under the load, or will it handle the traffic as if nothing happened?
Things to Consider:
Horizontal vs Vertical Scaling: Do we add more machines (horizontal) or beef up a single one (vertical)?
Load Balancers: How do you ensure requests are evenly distributed across servers?
Sharding: Can you split your data intelligently across multiple databases?
Stateless Services: Can your services scale out without shared state?
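To make load balancing and sharding concrete, here is a minimal Python sketch. The names `RoundRobinBalancer` and `shard_for` are hypothetical, and both round-robin selection and modulo-based sharding are assumptions chosen for brevity, not the only (or necessarily the best) options.

```python
import hashlib
from typing import List


class RoundRobinBalancer:
    """Hands out servers in a fixed rotation (hypothetical helper)."""

    def __init__(self, servers: List[str]) -> None:
        if not servers:
            raise ValueError("at least one server is required")
        self._servers = list(servers)
        self._next = 0

    def pick(self) -> str:
        # Choose the next server in the rotation and advance the pointer.
        server = self._servers[self._next % len(self._servers)]
        self._next += 1
        return server


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a hash that is stable across runs."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
picks = [balancer.pick() for _ in range(4)]  # wraps back to app-1
shard = shard_for("user-42", num_shards=4)   # always the same shard for this key
```

One caveat with modulo sharding: changing the number of shards remaps most keys, which is why many systems reach for consistent hashing instead.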
2. Latency and Performance
“How can we reduce response time and ensure low-latency performance under load?”
Users expect instant results. Every millisecond counts.
Latency is the time it takes for a system to respond to a user’s request, while performance refers to how efficiently the system handles many such requests under load.
Think of latency as the difference between a delightful and a frustrating user experience.
Things to Consider:
Caching: Can you avoid recomputation by storing frequently used data in a cache (e.g., Redis, CDN)?
Asynchronous Processing: Can you offload slow tasks (e.g., email, image resizing) to queues?
Efficient Algorithms: Are you using optimal data structures and algorithms in your services?
Database Optimization: Are queries indexed properly? Are joins slowing you down?
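The caching idea can be sketched as a small in-process cache with a time-to-live (TTL). This is only an illustration: `TTLCache` and `get_or_load` are hypothetical names, and a production system would more likely use a shared cache such as Redis, as mentioned above.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Cache-aside helper: return a fresh cached value or load and store it."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the behavior can be tested
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[str], Any]) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: skip the slow path
        value = loader(key)  # cache miss or expired: do the slow work once
        self._store[key] = (now + self._ttl, value)
        return value
```

The TTL is the trade-off knob here: a longer TTL means fewer slow lookups but staler data.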
3. Communication
“How do different components of the system interact with each other?”
Modern systems are built from distributed components such as databases, services, APIs, queues, and caches that must work together seamlessly. For this to happen, they need clear and reliable ways to communicate.
Designing communication isn’t just about picking a protocol; it’s about ensuring that messages are delivered correctly, on time, and in a way that can evolve with the system.
Things to Consider:
REST vs RPC vs gRPC vs GraphQL: Which API style is best suited to your use case?
Synchronous vs Asynchronous Communication: Should components wait for a response, or use pub-sub/message queues?
Service Contracts: How are APIs versioned and documented?
Retries and Timeouts: What if a request fails? Do we retry? How many times?
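One common answer to the retry question is a bounded number of attempts with exponential backoff and random jitter. The sketch below assumes that `ConnectionError` and `TimeoutError` are the transient failures worth retrying; `call_with_retries` and its parameters are hypothetical names, not part of any library.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(operation: Callable[[], T],
                      max_attempts: int = 3,
                      base_delay: float = 0.1,
                      max_delay: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry transient failures, waiting longer (with jitter) after each one."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Double the cap each attempt, then sleep a random amount up to it.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

Retries are only safe when the operation is idempotent; retrying a non-idempotent call such as "charge this card" can apply the effect twice.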
4. Data Management
“How should we store, retrieve, and manage data efficiently?”
Data is the lifeblood of any system. Every request, transaction, or interaction eventually boils down to storing or retrieving data.
And while it’s tempting to think of databases as just a place to “dump” information, the truth is that a poorly designed data model can bring even the most sophisticated system to its knees.
How you manage data affects not just performance, but also scalability, reliability, and even business costs.
Things to Consider:
SQL vs NoSQL: Do you need relational consistency or flexible schema?
CAP Theorem Tradeoffs: During a network partition, can you tolerate inconsistent data, or would you rather reject some requests?
Data Partitioning and Replication: How do you scale and ensure high availability?
Data Lifecycle Management: What happens to old, stale, or deleted data?
5. Fault Tolerance and Reliability
“What happens if a part of the system crashes or becomes unreachable?”
No system is perfect. Hardware fails, networks go down, databases get overloaded, and bugs slip into production. Failures are inevitable. What matters is whether your system can survive them.
Fault tolerance is the ability of a system to continue operating despite the failure of one or more of its components.
Reliability is about ensuring the system performs its functions correctly and predictably over time.
The truth is, your system is only as strong as its weakest link. A single point of failure can bring down the entire application if it’s not anticipated and addressed.
The goal of good design is not to prevent all failures (that’s impossible), but to design for resilience so the system can recover gracefully when things inevitably go wrong.
Things to Consider:
Redundancy: Are critical services replicated across availability zones?
Graceful Degradation: Can you still serve partial functionality if a dependency fails?
Circuit Breakers and Timeouts: How do you avoid cascading failures?
Backup and Disaster Recovery: What happens if your database is corrupted, deleted, or compromised?
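A circuit breaker can be sketched in a few lines. This is a simplified, single-threaded illustration: the class name, the thresholds, and the choice to count every exception as a failure are all assumptions.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, operation: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                # Open: fail fast instead of piling load on a sick dependency.
                raise CircuitOpenError("circuit is open; failing fast")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get an immediate error they can turn into a fallback response, which is one way to achieve the graceful degradation described above.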
6. Security
“How do we protect the system against threats like unauthorized access or data breaches?”
Security shouldn’t be an afterthought; it must be built into the system from the start.
A single vulnerability can compromise not just your application but also user trust, brand reputation, and even regulatory compliance. Secure systems do more than protect users; they create confidence that data and interactions are safe.
Things to Consider:
Authentication and Authorization: Who are you, and what can you access?
Encryption: Is data encrypted in transit and at rest?
Input Validation and Sanitization: Are you protected against SQL injection or XSS?
Rate Limiting and Throttling: Can you defend against abuse or denial-of-service (DoS) attacks?
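Rate limiting is often implemented with a token bucket, which allows short bursts while capping the sustained rate. The sketch below is a minimal, single-process version; `TokenBucket` and its parameters are hypothetical, and a real deployment would typically keep one bucket per client in shared storage.

```python
import time
from typing import Callable


class TokenBucket:
    def __init__(self, rate_per_second: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_second
        self._capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True if the request may proceed, False if it should be rejected."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill in proportion to elapsed time, never beyond capacity.
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

Here `capacity` controls how large a burst is tolerated and `rate_per_second` controls the long-run average, so the two can be tuned independently.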
7. Maintainability and Extensibility
“How easy is it to maintain, monitor, debug, and evolve the system over time?”
Building a system is hard, but keeping it reliable, flexible, and future-proof is often harder. Many projects fail not because they couldn’t launch, but because they became too brittle to evolve.
A maintainable system allows engineers to fix bugs quickly, onboard new features without fear of breaking existing functionality, and adapt to changing business needs.
You don’t want a system that needs a team of superheroes to debug a 3 AM incident.
Things to Consider:
Modular Design: Is each component independent and loosely coupled?
Clear Interfaces and Contracts: Are integrations easy to test and change?
Observability: Can we trace a request through the system?
CI/CD Pipelines: Can we release code safely and quickly?
8. Cost Efficiency
“How can we balance performance with infrastructure cost?”
System design is not just an engineering problem; it’s a business one too.
A system can be perfectly performant, reliable, and secure, but if it costs a fortune to run, it's not a good design.
The best architectures are not the ones with unlimited resources, but the ones that achieve business goals with smart trade-offs and sustainable costs.
Cost efficiency is about finding the sweet spot between performance, reliability, and budget.
Things to Consider:
Right-sizing Instances: Are you using the right compute/storage resources?
Auto-Scaling: Do you scale up/down based on demand?
Cold vs Hot Data: Can you move infrequently accessed data to cheaper storage?
Third-party Services: Are you overpaying for managed solutions?
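Auto-scaling decisions often come down to simple arithmetic. The sketch below assumes a proportional rule: scale the replica count by the ratio of observed to target utilization, then clamp it to a budgeted range. The function name, the use of integer percentages, and the default limits are all hypothetical.

```python
import math


def desired_replicas(current_replicas: int,
                     observed_utilization_pct: int,
                     target_utilization_pct: int,
                     min_replicas: int = 1,
                     max_replicas: int = 20) -> int:
    """Proportional scaling: more load per replica means more replicas."""
    if target_utilization_pct <= 0:
        raise ValueError("target utilization must be positive")
    raw = math.ceil(
        current_replicas * observed_utilization_pct / target_utilization_pct
    )
    # The upper bound is the cost guardrail; the lower bound keeps the service up.
    return max(min_replicas, min(max_replicas, raw))


scale_out = desired_replicas(4, 90, 60)  # 4 replicas at 90% vs a 60% target
scale_in = desired_replicas(4, 30, 60)   # 4 replicas at 30% vs a 60% target
```

The `max_replicas` cap is where the business side shows up: it turns "scale with demand" into "scale with demand, up to what we are willing to pay for".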
9. Observability and Monitoring
“How do we monitor system health and diagnose issues in production?”
You can’t fix what you can’t see. Once a system is in production, you need to know how it's behaving.
Monitoring gives you metrics and alerts about the system's health, like CPU usage or disk space.
Observability goes a step further: it's the ability to understand the internal state of a system just by looking at its external outputs, such as logs, metrics, and traces. It helps you answer the "why" behind a problem, not just the "what."
Things to Consider:
Metrics, Logs, Traces: Are you collecting meaningful insights?
Dashboards and Alerts: Can you catch problems before users do?
Root Cause Analysis Tools: Can you replay and understand past incidents?
Health Checks and Heartbeats: Are services alive and well?
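Heartbeats are one of the simplest health signals to reason about: each service reports in periodically, and anything silent for too long is flagged. The sketch below is a minimal in-memory version; `HeartbeatMonitor` and the timeout value are hypothetical.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    def __init__(self, timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def beat(self, service: str) -> None:
        """Record that a service has just reported in."""
        self._last_seen[service] = self._clock()

    def unhealthy(self) -> List[str]:
        """Return services whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(
            name for name, seen in self._last_seen.items()
            if now - seen > self._timeout
        )
```

A heartbeat only tells you a process is alive, not that it is doing useful work, which is why it is usually paired with metrics, logs, and traces.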
10. Compliance and Privacy
“Are we complying with relevant laws and regulations (e.g., GDPR, HIPAA)?”
This is an increasingly critical question, especially in industries like finance and health. As systems handle more personal and sensitive data, they must adhere to various legal and regulatory requirements.
You need to design with privacy by default.
Things to Consider:
Data Retention Policies: Are you storing data only as long as needed?
Access Controls: Can only authorized personnel view sensitive data?
Anonymization and Masking: Are you protecting user identity?
Audit Trails: Can you track access and changes to sensitive information?
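Masking and pseudonymization can be sketched with Python's standard library. The function names and the masking format below are hypothetical, and whether this level of protection satisfies a given regulation is a legal question that code alone cannot answer.

```python
import hashlib
import hmac


def mask_email(email: str) -> str:
    """Hide most of the local part, e.g. for display in logs or support tools."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        return "***"  # not a recognizable address: reveal nothing
    return local[0] + "***@" + domain


def pseudonymize(user_id: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash that is stable for a given key."""
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


masked = mask_email("alice@example.com")
token = pseudonymize("user-42", secret_key=b"example-key-not-for-production")
```

Keyed hashing is pseudonymization rather than full anonymization: anyone holding the key can recompute the mapping, so the key needs the same access controls as the data it protects.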
Thank you for reading!
I hope you have a lovely day!
See you soon,
Ashish