Android Nomad #34 - System Design

Approaching a system design interview for mobile engineers.

Yes, system design is a vague topic for mobile engineers, who are mainly focused on the client (app) side. Folks early in their careers won't need to deal with any of this, as their tickets are mostly about implementing small features or fixing bugs. As a senior, you won't need much system design either; the focus is still very much on app design, which is quite different from designing a system. However, as industry expectations shift in the LLM era, I believe it's important for every mobile engineer to be equipped for the moment this task lands on their plate, especially in this job market. I also see more mobile engineers pivoting to, or contributing to, the backend in order to ship mobile-first features.

OK, let's go!

Problem:

Designing a Scalable Bulk Marketing Engine for Sending Notifications.

Before you dive into the problem, understand who you're building for and what you're building with. This is a very important step!

Step 0: Clarifying Questions:

  1. Is it designed for multiple time zones?
  2. Is it cloud-native or on-prem? (The high-level design, or HLD, depends on this.)
  3. What is the deployment mechanism?
  4. What happens if two events occur at the same time and cause a deadlock?

The interviewer will usually hand you the next step once you ask what the system is about and what is expected of it. Those expectations cover both functional and non-functional requirements, as the names suggest.

Step 1: Functional Requirements

  • Send emails to millions of subscribers within 5 days
  • Allow scheduling of email campaigns
  • Send email notifications for users nearing their birthday
  • Support different types of email templates
  • Track bounced emails and failed delivery attempts
  • Support A/B testing
  • Provide analytics on email opens and clicks
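The birthday requirement hides a couple of edge cases (year rollover, 29 February), so here is a minimal sketch of the selection logic. The function names, the 7-day window, and the choice to treat 29 February as 28 February in non-leap years are all assumptions made for illustration, not part of the original requirements.

```python
from datetime import date


def _birthday_in(birthdate: date, year: int) -> date:
    """Return the birthday as it falls in the given year."""
    try:
        return birthdate.replace(year=year)
    except ValueError:
        # 29 February in a non-leap year: fall back to 28 February (assumption).
        return date(year, 2, 28)


def next_birthday(birthdate: date, today: date) -> date:
    """Return the next occurrence of the birthday on or after `today`."""
    candidate = _birthday_in(birthdate, today.year)
    if candidate < today:
        candidate = _birthday_in(birthdate, today.year + 1)
    return candidate


def is_birthday_near(birthdate: date, today: date, window_days: int = 7) -> bool:
    """True when the next birthday is at most `window_days` days away."""
    return (next_birthday(birthdate, today) - today).days <= window_days


# A birthday on 2 January is "near" on 30 December of the previous year.
print(is_birthday_near(date(1990, 1, 2), date(2024, 12, 30)))  # True
```

In a real system this check would more likely be pushed into the database query, and "today" would depend on the subscriber's time zone.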

Step 2: Non-Functional Requirements:

  • High throughput: Able to send millions of emails in 5 days
  • Low latency: Minimal delay between trigger and sending
  • Scalability: Handle increasing load efficiently
  • Reliability: Ensure emails are sent successfully
  • Availability: System should be operational 24x7
  • Fault tolerance: Ability to handle and recover from unexpected errors

Once this is out of the way, you can focus on the numbers by estimating the volume the system has to handle.

Step 3: Capacity Estimation:

  • Let's assume we need to send emails to 5.1 million subscribers in 5 days
  • Throughput: 5,100,000 emails / (5 * 24 * 3,600 seconds) = 5,100,000 / 432,000 ≈ 11.8 emails/second on average
  • Assuming average email size of 100 KB
  • Storage: 100 KB * 5.1M = 510 GB per campaign
  • Bandwidth: 100 KB * 11.8 emails/second ≈ 1.18 MB/second
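These estimates are easy to sanity-check in a few lines. The sketch below assumes decimal units (1 KB = 1,000 bytes) and spreads the load evenly over the full 5-day window; real traffic is burstier, so treat the results as averages rather than peak figures.

```python
# Back-of-the-envelope capacity check, using decimal units (1 KB = 1,000 bytes).
SUBSCRIBERS = 5_100_000         # emails per campaign
WINDOW_DAYS = 5                 # delivery window
EMAIL_SIZE_BYTES = 100 * 1_000  # assumed average email size: 100 KB

window_seconds = WINDOW_DAYS * 24 * 60 * 60                   # 432,000 seconds
emails_per_second = SUBSCRIBERS / window_seconds              # average send rate
storage_gb = SUBSCRIBERS * EMAIL_SIZE_BYTES / 1e9             # stored per campaign
bandwidth_mb_s = emails_per_second * EMAIL_SIZE_BYTES / 1e6   # average outbound rate

print(f"{emails_per_second:.1f} emails/s")   # 11.8 emails/s
print(f"{storage_gb:.0f} GB per campaign")   # 510 GB per campaign
print(f"{bandwidth_mb_s:.2f} MB/s")          # 1.18 MB/s
```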

Step 4: High-Level Design:

The system consists of the following main components:

  1. Client: Interfaces with the Load Balancer
  2. Load Balancer: Distributes incoming requests
  3. API Gateway: Handles authentication and request routing
  4. Campaign Manager: Manages email campaign creation and scheduling
  5. Message Queue: Buffers email sending requests
  6. Email Workers: Process and send emails
  7. Subscriber Database: Stores subscriber information
  8. Campaign Database: Stores campaign details
  9. Monitoring & Logging: Tracks system performance and issues

Step 5: Detailed Design:

  1. Campaign Manager:

    • Handles creation and scheduling of email campaigns

    • Uses cron jobs or similar (e.g., Airflow) to trigger campaigns at the scheduled time

    • Manages campaign data and subscriber list/filter conditions

  2. Message Queue:

    • Implements a distributed queue system (e.g., Apache Kafka)

    • Buffers email sending requests to handle traffic spikes

    • Enables parallel processing of email sending tasks

  3. Email Workers:

    • Consume messages from the queue and send emails

    • Implement connection pooling for database and SMTP connections

    • Use asynchronous I/O to improve performance

    • Handle retries for failed email sending attempts

  4. Data Storage:

    • Use separate databases for subscriber data and campaign data

    • Implement data sharding and query optimization for each DB

    • Use a caching layer (e.g., Redis) to reduce database load

  5. Monitoring & Logging:

    • Implement real-time monitoring of queue size, worker performance, and email sending rates

    • Log all system events and errors for debugging

    • Set up alerts for abnormal system behavior
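To make the queue, worker, and retry pieces concrete, here is a minimal in-process sketch using Python's asyncio. It is not Kafka or a real SMTP client: asyncio.Queue stands in for the distributed queue, and send_email is a hypothetical placeholder that fails at random. The retry budget and backoff delays are assumed values.

```python
import asyncio
import random

MAX_ATTEMPTS = 3  # assumed retry budget per email


async def send_email(address: str) -> None:
    """Hypothetical stand-in for an SMTP/ESP call; fails randomly."""
    await asyncio.sleep(0)  # yield control, as real network I/O would
    if random.random() < 0.3:
        raise ConnectionError(f"temporary failure for {address}")


async def worker(queue, sent, failed, sender):
    """Consume addresses from the queue, retrying failed sends with backoff."""
    while True:
        address = await queue.get()
        try:
            for attempt in range(1, MAX_ATTEMPTS + 1):
                try:
                    await sender(address)
                    sent.append(address)
                    break
                except ConnectionError:
                    if attempt == MAX_ATTEMPTS:
                        failed.append(address)  # give up; record for follow-up
                    else:
                        await asyncio.sleep(0.01 * 2 ** attempt)  # exponential backoff
        finally:
            queue.task_done()


async def run_campaign(addresses, worker_count=4, sender=send_email):
    """Push addresses through a bounded queue drained by parallel workers."""
    queue = asyncio.Queue(maxsize=1000)  # bounded, so producers slow down under load
    sent, failed = [], []
    workers = [
        asyncio.create_task(worker(queue, sent, failed, sender))
        for _ in range(worker_count)
    ]
    for address in addresses:
        await queue.put(address)
    await queue.join()  # wait until every queued address has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return sent, failed
```

Calling asyncio.run(run_campaign(["a@example.com", "b@example.com"])) returns the lists of delivered and permanently failed addresses. In the full design, the failed list would feed the bounce tracking and the email_logs table.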

Step 6: Data Model

The system uses three main tables:

  1. subscribers

     CREATE TABLE subscribers (
       id INT PRIMARY KEY,
       email VARCHAR(255) UNIQUE NOT NULL,
       name VARCHAR(255),
       birthdate DATE,
       created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
     );

  2. campaigns

     CREATE TABLE campaigns (
       id INT PRIMARY KEY,
       name VARCHAR(255) NOT NULL,
       content TEXT,
       scheduled_at TIMESTAMP,
       status VARCHAR(20),
       created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
     );

  3. email_logs

     CREATE TABLE email_logs (
       id INT PRIMARY KEY,
       campaign_id INT REFERENCES campaigns(id),
       subscriber_id INT REFERENCES subscribers(id),
       status VARCHAR(20),
       sent_at TIMESTAMP,
       created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
     );
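To see how these tables work together, the sketch below loads the same schema into an in-memory SQLite database and counts delivery outcomes for one campaign. SQLite is used only because it ships with Python; the sample rows, the status values ('sending', 'sent', 'bounced'), and the status_counts helper are assumptions for illustration.

```python
import sqlite3

# The three tables from the data model, run against in-memory SQLite for
# illustration only; a production system would likely use a server database.
SCHEMA = """
CREATE TABLE subscribers (
    id INT PRIMARY KEY,
    email VARCHAR(255) UNIQUE NOT NULL,
    name VARCHAR(255),
    birthdate DATE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE campaigns (
    id INT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    content TEXT,
    scheduled_at TIMESTAMP,
    status VARCHAR(20),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE email_logs (
    id INT PRIMARY KEY,
    campaign_id INT REFERENCES campaigns(id),
    subscriber_id INT REFERENCES subscribers(id),
    status VARCHAR(20),
    sent_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces REFERENCES only when enabled
conn.executescript(SCHEMA)

# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO subscribers (id, email, name) VALUES (?, ?, ?)",
    [(1, "a@example.com", "Asha"), (2, "b@example.com", "Ben")],
)
conn.execute("INSERT INTO campaigns (id, name, status) VALUES (1, 'Spring sale', 'sending')")
conn.executemany(
    "INSERT INTO email_logs (id, campaign_id, subscriber_id, status) VALUES (?, ?, ?, ?)",
    [(1, 1, 1, "sent"), (2, 1, 2, "bounced")],
)
conn.commit()


def status_counts(campaign_id):
    """Return (status, count) pairs for one campaign, sorted by status."""
    return conn.execute(
        "SELECT status, COUNT(*) FROM email_logs "
        "WHERE campaign_id = ? GROUP BY status ORDER BY status",
        (campaign_id,),
    ).fetchall()


print(status_counts(1))  # [('bounced', 1), ('sent', 1)]
```

The same GROUP BY query is the starting point for the bounce tracking and analytics requirements, although at this volume email_logs would need partitioning or a separate analytics store.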

Step 7: Scalability and Performance:

To ensure the system can handle millions of subscribers:

  • Use a horizontally scalable architecture
  • Implement data partitioning for subscriber and campaign databases
  • Use caching to reduce database load
  • Use asynchronous processing through queues
  • Rate limit API endpoints to prevent system overload
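Rate limiting is often done with a token bucket, so here is a minimal single-process sketch of one. The class name, parameters, and injectable clock are assumptions for illustration; a real deployment would typically keep this state in the API gateway or a shared store such as Redis.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity      # start full
        self.clock = clock          # injectable so the bucket is easy to test
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated
        # Refill for the time that has passed, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Example: at most 2 requests per second, with bursts of up to 2.
bucket = TokenBucket(rate=2, capacity=2)
print(bucket.allow(), bucket.allow())  # True True
```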

Fault Tolerance and Reliability:

  • Implement retry mechanisms for failed email sending attempts
  • Use redundant servers and load balancers to eliminate single points of failure
  • Replicate data across multiple data centers for disaster recovery
  • Implement circuit breakers to prevent cascading failures
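Since circuit breakers are the least familiar item on this list for most mobile engineers, here is a minimal sketch of the idea: after a number of consecutive failures, stop calling the failing dependency for a while, then let a single trial call through. The class, thresholds, and timeout are assumed values for illustration, not a production-ready implementation.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the breaker is easy to test
        self.failures = 0           # consecutive failures seen so far
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; skipping call")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

An email worker could wrap each call to an email provider in breaker.call(...), and route traffic to another provider while CircuitOpenError is being raised.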

Step 8: Pros and Cons

Pros:

  • High scalability due to distributed architecture
  • Fault-tolerant design with multiple email service providers (ESPs) and worker nodes
  • Efficient use of resources through queue-based processing
  • Real-time monitoring allows quick response to issues

Cons:

  • Complex system with many components, potentially increasing maintenance overhead
  • Reliance on external ESPs could introduce unpredictable latencies
  • Potential for increased costs due to multiple ESPs and cloud resources
  • Compliance with varying email regulations across regions could be challenging

Conclusion:

This design provides a scalable and robust solution for sending emails to millions of subscribers within the required timeframe. The use of distributed systems and asynchronous processing allows for high throughput and fault tolerance, while the monitoring and logging systems ensure operational visibility and quick issue resolution.
