miles990

system-design

Scalability, availability, and distributed systems design


Install

npx skillscat add miles990/claude-software-skills/system-design

Install via the SkillsCat registry.

SKILL.md

System Design

Overview

Principles for designing systems that handle scale, remain available, and perform well under load.


Scalability Fundamentals

Vertical vs Horizontal Scaling

Vertical Scaling (Scale Up):
┌─────────────────────┐
│  Bigger Server      │
│  - More CPU         │
│  - More RAM         │
│  - Faster disk      │
└─────────────────────┘
Limit: Hardware ceiling

Horizontal Scaling (Scale Out):
┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐
│Server│ │Server│ │Server│ │Server│
└──────┘ └──────┘ └──────┘ └──────┘
         ↑
    Load Balancer
Limit: Coordination complexity

Stateless Services

// ❌ Stateful - stores session in memory
class BadService {
  private sessions = new Map();

  login(userId: string) {
    this.sessions.set(userId, { loggedIn: true });
  }
}

// ✅ Stateless - external session store
class GoodService {
  constructor(private sessionStore: Redis) {}

  async login(userId: string) {
    // Redis stores strings, so serialize the session object first
    await this.sessionStore.set(`session:${userId}`, JSON.stringify({ loggedIn: true }));
  }
}

Load Balancing

Strategies

| Strategy | Description | Use Case |
|---|---|---|
| Round Robin | Cycle through servers | Equal capacity servers |
| Weighted RR | Based on server capacity | Mixed capacity |
| Least Connections | Route to least busy | Long-lived connections |
| IP Hash | Same IP → same server | Session stickiness |
| URL Hash | Same URL → same server | Cache optimization |
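
The strategies in the table can be sketched as small selector functions. This is a minimal illustration, not a production balancer: the `Backend` shape and all function names are assumptions made up for this example, and the weighted variant simply repeats each backend `weight` times.

```typescript
// Hypothetical backend descriptor; field names are assumptions for illustration.
interface Backend {
  name: string;
  weight: number;            // relative capacity, used by weighted round robin
  activeConnections: number; // in-flight connections, used by least connections
}

// Round robin: cycle through the backends in order.
function makeRoundRobin(backends: Backend[]): () => Backend {
  if (backends.length === 0) throw new Error("no backends configured");
  let next = 0;
  return () => {
    const chosen = backends[next];
    next = (next + 1) % backends.length;
    return chosen;
  };
}

// Weighted round robin (naive expansion): weight 2 means two slots per cycle.
function makeWeightedRoundRobin(backends: Backend[]): () => Backend {
  const expanded: Backend[] = [];
  backends.forEach((b) => {
    for (let i = 0; i < b.weight; i++) expanded.push(b);
  });
  return makeRoundRobin(expanded);
}

// Least connections: pick the backend with the fewest in-flight connections.
function leastConnections(backends: Backend[]): Backend {
  return backends.reduce((best, b) =>
    b.activeConnections < best.activeConnections ? b : best
  );
}

// IP hash: the same client IP maps to the same backend
// for as long as the backend list does not change.
function ipHash(backends: Backend[], clientIp: string): Backend {
  let hash = 0;
  for (let i = 0; i < clientIp.length; i++) {
    hash = (hash * 31 + clientIp.charCodeAt(i)) >>> 0; // keep an unsigned 32-bit value
  }
  return backends[hash % backends.length];
}
```

URL hash works the same way as `ipHash` with the request path as the input string.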

Health Checks

# Kubernetes-style health checks
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 3
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

// Health check endpoints
app.get('/health/live', (req, res) => {
  // Am I running?
  res.status(200).json({ status: 'alive' });
});

app.get('/health/ready', async (req, res) => {
  // Can I serve traffic?
  const dbOk = await checkDatabase();
  const cacheOk = await checkCache();

  if (dbOk && cacheOk) {
    res.status(200).json({ status: 'ready' });
  } else {
    res.status(503).json({ status: 'not ready', db: dbOk, cache: cacheOk });
  }
});
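
The readiness handler above calls `checkDatabase()` and `checkCache()` without defining them. One possible shape, sketched here under assumptions, is a shared helper that runs a cheap ping and treats an error or a timeout as "not ready". The name `checkDependency` and the `ping` callback are hypothetical; a real service would pass in something like a `SELECT 1` query or a Redis `PING`.

```typescript
// Resolves to true if `ping` succeeds within `timeoutMs`,
// and to false if it throws or takes too long.
async function checkDependency(
  ping: () => Promise<unknown>,
  timeoutMs: number = 1000
): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error("health check timed out")), timeoutMs);
  });
  try {
    await Promise.race([ping(), timeout]);
    return true;
  } catch (error) {
    return false;
  } finally {
    // Always clear the timer so it does not keep the process alive
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Bounding the check with a timeout matters because a hung dependency would otherwise make the probe itself hang, and the orchestrator would only learn about the problem when its own probe timeout fires.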

Caching Strategies

Cache Patterns

Cache-Aside (Lazy Loading):
┌────────┐    miss     ┌────────┐
│  App   │ ──────────→ │  Cache │
│        │ ←────────── │        │
└────────┘    null     └────────┘
    │
    │  read
    ↓
┌────────┐
│   DB   │  ──── write ──→ Cache
└────────┘

Write-Through:
App → Cache → DB (synchronous)

Write-Behind (Write-Back):
App → Cache → (async) → DB

// Cache-aside implementation
class CachedUserService {
  constructor(
    private cache: Redis,
    private db: Database
  ) {}

  // Returns null when the user exists in neither the cache nor the DB
  async getUser(id: string): Promise<User | null> {
    // Try cache first
    const cached = await this.cache.get(`user:${id}`);
    if (cached) return JSON.parse(cached);

    // Cache miss - read from DB
    const user = await this.db.users.findById(id);
    if (user) {
      // Store in cache with TTL
      await this.cache.set(`user:${id}`, JSON.stringify(user), 'EX', 3600);
    }
    return user;
  }

  async updateUser(id: string, data: Partial<User>): Promise<User> {
    const user = await this.db.users.update(id, data);
    // Invalidate cache
    await this.cache.del(`user:${id}`);
    return user;
  }
}
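
The other two patterns in the diagram, write-through and write-behind, can be sketched in the same spirit. To keep the example short and traceable it is synchronous and uses an in-memory `Map` as the cache; the `Store` interface and both class names are assumptions for illustration, and a real store would be an async database client.

```typescript
// Hypothetical backing store; stands in for a database client.
interface Store {
  write(key: string, value: string): void;
}

// Write-through: a write returns only after both the store and the cache are updated.
class WriteThroughCache {
  private readonly cache = new Map<string, string>();
  private readonly store: Store;
  constructor(store: Store) {
    this.store = store;
  }
  set(key: string, value: string): void {
    this.store.write(key, value); // if this throws, the cache is left untouched
    this.cache.set(key, value);
  }
  get(key: string): string | undefined {
    return this.cache.get(key);
  }
}

// Write-behind: writes hit the cache immediately and reach the store on a later flush.
class WriteBehindCache {
  private readonly cache = new Map<string, string>();
  private readonly dirty = new Set<string>();
  private readonly store: Store;
  constructor(store: Store) {
    this.store = store;
  }
  set(key: string, value: string): void {
    this.cache.set(key, value);
    this.dirty.add(key);
  }
  get(key: string): string | undefined {
    return this.cache.get(key);
  }
  // Call from a timer or when the dirty set grows. Returns the number of keys written.
  flush(): number {
    let written = 0;
    this.dirty.forEach((key) => {
      const value = this.cache.get(key);
      if (value !== undefined) {
        this.store.write(key, value);
        written++;
      }
    });
    this.dirty.clear();
    return written;
  }
}
```

Write-behind coalesces repeated writes to the same key into one store write, at the cost of losing unflushed data if the process dies before `flush()` runs.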

Cache Invalidation

| Strategy | Description | Complexity |
|---|---|---|
| TTL | Expire after time | Simple |
| Event-based | Invalidate on write | Medium |
| Version-based | Key includes version | Medium |
| Tag-based | Group related keys | Complex |
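
TTL and event-based invalidation already appear in the cache-aside code above. The remaining two rows could look roughly like this. Both classes are in-memory sketches with made-up names (`VersionedKeys`, `TaggedCache`); with Redis, the version counter and the tag-to-keys index would themselves live in the cache.

```typescript
// Version-based: the key embeds a version number. Bumping the version makes
// every old key unreachable, and the stale entries age out through their TTL.
class VersionedKeys {
  private readonly versions = new Map<string, number>();

  key(entity: string, id: string): string {
    const version = this.versions.get(entity) ?? 1;
    return `${entity}:v${version}:${id}`;
  }

  invalidateAll(entity: string): void {
    this.versions.set(entity, (this.versions.get(entity) ?? 1) + 1);
  }
}

// Tag-based: remember which keys carry each tag, then delete them together.
class TaggedCache {
  private readonly data = new Map<string, string>();
  private readonly keysByTag = new Map<string, Set<string>>();

  set(key: string, value: string, tags: string[]): void {
    this.data.set(key, value);
    tags.forEach((tag) => {
      const keys = this.keysByTag.get(tag) ?? new Set<string>();
      keys.add(key);
      this.keysByTag.set(tag, keys);
    });
  }

  get(key: string): string | undefined {
    return this.data.get(key);
  }

  // Deletes every key that carries the tag and returns how many were tracked.
  invalidateTag(tag: string): number {
    const keys = this.keysByTag.get(tag);
    if (!keys) return 0;
    keys.forEach((key) => this.data.delete(key));
    this.keysByTag.delete(tag);
    return keys.size;
  }
}
```

The extra index is what earns tag-based invalidation its "Complex" rating: it has to stay consistent with the cached data, including when entries expire on their own.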

Database Scaling

Read Replicas

                    ┌─────────────────┐
     Writes ──────→ │   Primary DB    │
                    └────────┬────────┘
                             │ replication
           ┌─────────────────┼─────────────────┐
           ↓                 ↓                 ↓
    ┌──────────┐      ┌──────────┐      ┌──────────┐
    │ Replica 1│      │ Replica 2│      │ Replica 3│
    └──────────┘      └──────────┘      └──────────┘
           ↑                 ↑                 ↑
           └─────── Reads ───┴────────────────┘
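
In application code, the diagram above usually turns into a small router that sends writes to the primary and spreads reads across the replicas. The sketch below is one way to do that; `DbNode` and `ReplicaRouter` are hypothetical names, and a real node would wrap a connection pool rather than just a name.

```typescript
// Hypothetical handle for a database node.
interface DbNode {
  name: string;
}

class ReplicaRouter {
  private readonly primary: DbNode;
  private readonly replicas: DbNode[];
  private next = 0;

  constructor(primary: DbNode, replicas: DbNode[]) {
    this.primary = primary;
    this.replicas = replicas;
  }

  // All writes go to the primary.
  forWrite(): DbNode {
    return this.primary;
  }

  // Reads rotate across replicas; with no replicas, fall back to the primary.
  forRead(): DbNode {
    if (this.replicas.length === 0) return this.primary;
    const chosen = this.replicas[this.next];
    this.next = (this.next + 1) % this.replicas.length;
    return chosen;
  }
}
```

Because replication is typically asynchronous, a read served by a replica may not yet reflect the latest write to the primary.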

Sharding (Partitioning)

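import { createHash } from 'node:crypto';

// Hash-based sharding
function getShard(userId: string, numShards: number): number {
  // MD5 is used here only to spread keys evenly, not for security
  const hash = createHash('md5').update(userId).digest('hex');
  // Use the first 32 bits of the digest as an integer
  return parseInt(hash.slice(0, 8), 16) % numShards;
}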

// Range-based sharding
function getShardByDate(date: Date): string {
  const year = date.getFullYear();
  const month = date.getMonth() + 1;
  return `orders_${year}_${month.toString().padStart(2, '0')}`;
}

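// Consistent hashing for dynamic shards
class ConsistentHash {
  private ring: Map<number, string> = new Map();
  private sortedHashes: number[] = [];

  addNode(node: string) {
    for (let i = 0; i < 150; i++) { // Virtual nodes
      this.ring.set(this.hash(`${node}:${i}`), node);
    }
    // Keep ring positions in ascending numeric order for lookups
    this.sortedHashes = [...this.ring.keys()].sort((a, b) => a - b);
  }

  getNode(key: string): string {
    if (this.sortedHashes.length === 0) throw new Error('No nodes in ring');
    const hash = this.hash(key);
    // Find the first ring position at or after the key's hash,
    // wrapping around to the start of the ring if there is none
    const nodeHash = this.sortedHashes.find((h) => h >= hash) ?? this.sortedHashes[0];
    return this.ring.get(nodeHash)!;
  }

  // 32-bit FNV-1a, used as a stand-in because the original leaves hash() undefined
  private hash(key: string): number {
    let h = 0x811c9dc5;
    for (let i = 0; i < key.length; i++) {
      h ^= key.charCodeAt(i);
      h = Math.imul(h, 0x01000193);
    }
    return h >>> 0;
  }
}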
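
The reason to reach for consistent hashing at all shows up when the shard count changes under plain modulo placement. The sketch below counts how many keys land on a different shard when a cluster grows from 4 to 5 shards. It uses integer IDs directly as hash values so the arithmetic is easy to check by hand; `shardFor` and `countMoved` are names invented for this example.

```typescript
// Naive placement: shard = id mod shard count.
function shardFor(id: number, numShards: number): number {
  return id % numShards;
}

// Count how many of the ids 0..total-1 land on a different shard after resizing.
function countMoved(total: number, before: number, after: number): number {
  let moved = 0;
  for (let id = 0; id < total; id++) {
    if (shardFor(id, before) !== shardFor(id, after)) moved++;
  }
  return moved;
}

console.log(countMoved(1000, 4, 5)); // 800: four out of five keys change shard
```

A key stays put only when `id % 4 === id % 5`, which holds for 4 of every 20 consecutive IDs. With a consistent-hash ring, adding a fifth node would instead move only the keys that the new node takes over, roughly a fifth of them on average.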

Message Queues

Patterns

Point-to-Point:
Producer → Queue → Consumer

Pub/Sub:
              ┌─→ Subscriber 1
Publisher → Topic ─→ Subscriber 2
              └─→ Subscriber 3

Work Queue:
              ┌─→ Worker 1
Producer → Queue ─→ Worker 2  (competing consumers)
              └─→ Worker 3
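
The difference between pub/sub and a work queue is who receives each message: every subscriber, or exactly one worker. A toy in-memory version makes that concrete. `Topic` and `WorkQueue` are illustrative names, delivery is synchronous, and nothing is buffered or persisted, unlike a real broker.

```typescript
type Handler = (message: string) => void;

// Pub/Sub: every subscriber receives every message.
class Topic {
  private readonly subscribers: Handler[] = [];

  subscribe(handler: Handler): void {
    this.subscribers.push(handler);
  }

  publish(message: string): void {
    this.subscribers.forEach((handler) => handler(message));
  }
}

// Work queue: each message goes to exactly one worker (round robin here).
class WorkQueue {
  private readonly workers: Handler[] = [];
  private next = 0;

  addWorker(handler: Handler): void {
    this.workers.push(handler);
  }

  enqueue(message: string): void {
    if (this.workers.length === 0) throw new Error("no workers registered");
    this.workers[this.next](message);
    this.next = (this.next + 1) % this.workers.length;
  }
}
```

Point-to-point is the work queue with a single worker.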

Delivery Guarantees

| Guarantee | Description | Implementation |
|---|---|---|
| At-most-once | May lose messages | Fire and forget |
| At-least-once | May duplicate | Ack after process |
| Exactly-once | No loss, no dupe | Idempotency + dedup |

// Idempotent processing
async function processOrder(event: OrderEvent) {
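  // Check if already processed. This check-then-set is not atomic: two
  // concurrent deliveries of the same event can both pass it, so a unique
  // constraint on the order ID in the database is still needed as a backstop.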
  const processed = await redis.get(`processed:${event.id}`);
  if (processed) {
    console.log(`Already processed ${event.id}`);
    return;
  }

  // Process the order
  await db.orders.create(event.order);

  // Mark as processed (with TTL for cleanup)
  await redis.set(`processed:${event.id}`, '1', 'EX', 86400 * 7);
}

High Availability

Redundancy Patterns

Active-Active:
┌────────┐     ┌────────┐
│Server A│ ←─→ │Server B│  Both handle traffic
└────────┘     └────────┘

Active-Passive:
┌────────┐     ┌────────┐
│ Active │ ──→ │Standby │  Failover on failure
└────────┘     └────────┘
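
For the active-passive case, "failover on failure" needs some way to decide that the active node has failed. A common approach is a heartbeat with a timeout, sketched below. `FailoverMonitor` is a hypothetical name, time is passed in explicitly (milliseconds) so the behaviour is easy to trace, and this version never fails back automatically.

```typescript
type Role = "primary" | "standby";

class FailoverMonitor {
  private readonly timeoutMs: number;
  private lastHeartbeat: number;
  private active: Role = "primary";

  constructor(timeoutMs: number, now: number) {
    this.timeoutMs = timeoutMs;
    this.lastHeartbeat = now;
  }

  // Called whenever the primary reports in.
  heartbeat(now: number): void {
    this.lastHeartbeat = now;
  }

  // Promotes the standby once the primary has been silent for longer than the timeout.
  currentActive(now: number): Role {
    if (this.active === "primary" && now - this.lastHeartbeat > this.timeoutMs) {
      this.active = "standby";
    }
    return this.active;
  }
}
```

Staying on the standby after promotion is a deliberate simplification: automatic failback risks flapping between nodes when the primary is only intermittently healthy.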

Circuit Breaker

class CircuitBreaker {
  private failures = 0;
  private lastFailure: Date | null = null;
  private state: 'closed' | 'open' | 'half-open' = 'closed';

  constructor(
    private threshold: number = 5,
    private timeout: number = 30000
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.lastFailure!.getTime() > this.timeout) {
        this.state = 'half-open';
      } else {
        throw new Error('Circuit is open');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    this.failures = 0;
    this.state = 'closed';
  }

  private onFailure() {
    this.failures++;
    this.lastFailure = new Date();
    if (this.failures >= this.threshold) {
      this.state = 'open';
    }
  }
}

CAP Theorem

         Consistency
            /\
           /  \
          /    \
         /      \
        /   CA   \
       /──────────\
      /            \
     / CP        AP \
    /________________\
Partition          Availability
Tolerance

CA: Single-node RDBMS (no network partition to tolerate)
CP: MongoDB, HBase (may reject writes during a partition)
AP: Cassandra, DynamoDB (stay available during a partition; eventually consistent by default)

Consistency Models

| Model | Description | Example |
|---|---|---|
| Strong | Read sees latest write | RDBMS |
| Eventual | Eventually consistent | DNS, Cassandra |
| Causal | Respects causality | Chat apps |
| Read-your-writes | See your own writes | Social feeds |
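
Read-your-writes is often layered on top of an eventually consistent read path. One simple technique, sketched here under assumptions, is to send a user's reads to the primary for a short period after that user writes, and to a replica otherwise. `ReadYourWritesRouter` and `replicationLagMs` are invented names, and the lag bound is something you would have to measure for your own setup.

```typescript
type ReadTarget = "primary" | "replica";

class ReadYourWritesRouter {
  private readonly lastWriteAt = new Map<string, number>();
  private readonly replicationLagMs: number;

  constructor(replicationLagMs: number) {
    this.replicationLagMs = replicationLagMs;
  }

  // Record the time (in milliseconds) of a user's most recent write.
  recordWrite(userId: string, now: number): void {
    this.lastWriteAt.set(userId, now);
  }

  // Reads within the lag window after the user's own write go to the primary.
  targetForRead(userId: string, now: number): ReadTarget {
    const last = this.lastWriteAt.get(userId);
    if (last !== undefined && now - last < this.replicationLagMs) {
      return "primary";
    }
    return "replica";
  }
}
```

Other users may still see the old value for a moment, which is exactly what separates this model from strong consistency.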

Rate Limiting

Algorithms

// Token Bucket
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private capacity: number,
    private refillRate: number // tokens per second
  ) {
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  consume(tokens: number = 1): boolean {
    this.refill();
    if (this.tokens >= tokens) {
      this.tokens -= tokens;
      return true;
    }
    return false;
  }

  private refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsed * this.refillRate
    );
    this.lastRefill = now;
  }
}

// Sliding Window
class SlidingWindowRateLimiter {
  constructor(
    private redis: Redis,
    private limit: number,
    private window: number // seconds
  ) {}

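  async isAllowed(key: string): Promise<boolean> {
    const now = Date.now();
    const windowStart = now - this.window * 1000;
    // Unique member so two requests in the same millisecond are both counted
    const member = `${now}:${Math.random().toString(36).slice(2)}`;

    const pipe = this.redis.pipeline();
    pipe.zremrangebyscore(key, 0, windowStart); // Drop entries older than the window
    pipe.zadd(key, now, member);                // Record this request
    pipe.zcard(key);                            // Count requests in the window
    pipe.expire(key, this.window);              // Let idle keys expire

    const results = await pipe.exec();
    if (!results) throw new Error('Redis pipeline failed');
    const count = results[2][1] as number;

    // Rejected requests are recorded too, so they also count against the limit
    return count <= this.limit;
  }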
}
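
For contrast, here is a fixed-window counter, the simpler scheme that the sliding window improves on. It is a sketch with an invented class name, and time is passed in explicitly (milliseconds) so the trace below can be followed by hand.

```typescript
class FixedWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private windowStart = 0;
  private count = 0;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  isAllowed(now: number): boolean {
    // Align the window to a multiple of windowMs
    const currentWindow = Math.floor(now / this.windowMs) * this.windowMs;
    if (currentWindow !== this.windowStart) {
      this.windowStart = currentWindow;
      this.count = 0; // a new window starts with a fresh budget
    }
    if (this.count >= this.limit) return false;
    this.count++;
    return true;
  }
}

const limiter = new FixedWindowLimiter(2, 1000);
console.log(limiter.isAllowed(900));  // true
console.log(limiter.isAllowed(950));  // true
console.log(limiter.isAllowed(990));  // false (limit reached in window 0..999)
console.log(limiter.isAllowed(1000)); // true  (new window)
console.log(limiter.isAllowed(1010)); // true
```

Four requests get through within 110 ms even though the limit is 2 per second. The sliding window avoids this boundary burst by counting the requests in the last `window` seconds rather than in a fixed calendar slot.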

Common System Designs

| System | Key Components |
|---|---|
| URL Shortener | Hash function, Redis cache, DB |
| Twitter Feed | Fan-out, Redis timeline, Kafka |
| Chat App | WebSocket, Presence, Message queue |
| E-commerce | Cart service, Inventory, Payment |
| Video Streaming | CDN, Chunking, Adaptive bitrate |
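
As a taste of the first row, a URL shortener needs a way to turn a stored record into a short code. The table lists a hash function; a common alternative, shown here as an assumption rather than as this skill's recommended design, is to base62-encode an auto-incrementing numeric ID. The function names and the alphabet ordering are choices made for this sketch.

```typescript
const ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

// Encode a non-negative integer ID as a base62 string.
function encodeBase62(id: number): string {
  if (!Number.isSafeInteger(id) || id < 0) {
    throw new RangeError("id must be a non-negative safe integer");
  }
  if (id === 0) return ALPHABET.charAt(0);
  let code = "";
  let remaining = id;
  while (remaining > 0) {
    code = ALPHABET.charAt(remaining % 62) + code;
    remaining = Math.floor(remaining / 62);
  }
  return code;
}

// Decode a base62 string back to the numeric ID.
function decodeBase62(code: string): number {
  let id = 0;
  for (let i = 0; i < code.length; i++) {
    const digit = ALPHABET.indexOf(code.charAt(i));
    if (digit < 0) throw new RangeError("invalid base62 character: " + code.charAt(i));
    id = id * 62 + digit;
  }
  return id;
}

console.log(encodeBase62(125)); // "21" (2 * 62 + 1)
console.log(decodeBase62("21")); // 125
```

Sequential IDs make codes guessable, which is one reason some designs prefer a hash or a random token plus a collision check.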

Related Skills

  • [[architecture-patterns]] - Microservices, event-driven
  • [[database]] - Database optimization
  • [[caching-implementation]] - Cache strategies
  • [[reliability-engineering]] - SRE practices