Introduction
When building systems that serve millions of users, understanding the fundamental properties that define system behavior becomes critical. Scalability, availability, consistency, and latency are not isolated concerns—they form an interconnected web of trade-offs that shape every architectural decision. These four pillars determine whether your application gracefully handles growth, remains operational during failures, maintains data correctness, and delivers responsive user experiences. Every system designer eventually confronts the reality that optimizing for one dimension often means compromising another.
The challenge lies not in achieving perfection across all dimensions, but in making informed trade-offs aligned with business requirements. A payment processing system prioritizes consistency over availability—it's better to reject a transaction than to process it twice. Conversely, a social media feed can tolerate eventual consistency in exchange for high availability and low latency. This article explores each fundamental property in depth, examines their relationships, and provides practical patterns for navigating their inherent tensions in production systems.
Understanding Scalability: Growth Without Breaking
Scalability describes a system's ability to handle increased load by adding resources. When we discuss scalability, we're asking a fundamental question: as demand grows—whether measured in requests per second, data volume, or concurrent users—can the system maintain acceptable performance without architectural reinvention? The answer depends on how the system scales and where its bottlenecks lie.
Vertical scaling (scaling up) means adding more power to existing machines: faster CPUs, more RAM, larger disks. This approach offers simplicity—your application architecture remains unchanged, and you avoid the complexity of distributed coordination. However, vertical scaling hits hard limits. Physical hardware has ceilings, and costs grow non-linearly. More importantly, vertical scaling provides no redundancy; a single machine failure means total downtime. While vertical scaling works well for databases that benefit from single-machine consistency (PostgreSQL, MySQL), it eventually becomes a constraint rather than a solution.
Horizontal scaling (scaling out) distributes load across multiple machines. This approach theoretically provides unlimited capacity—just add more nodes. Modern cloud infrastructure makes horizontal scaling economically attractive with commodity hardware and elastic provisioning. However, horizontal scaling introduces distributed systems complexity: data must be partitioned, requests load-balanced, and state coordinated across machines. Not all workloads parallelize easily. Systems with shared mutable state, like traditional relational databases, require careful design to scale horizontally without sacrificing consistency.
Consider a web application serving user profiles. Initially, a single application server and database suffice. As traffic grows, you add read replicas (horizontal scaling for reads) while the primary database handles writes (still vertically scaled). Eventually, you partition users across multiple database shards, achieving horizontal scaling for both reads and writes. Each step introduces complexity: replica lag, cross-shard queries, and distributed transactions. The key insight is that scalability is not binary—systems scale horizontally in some dimensions while remaining vertically scaled in others, and the boundaries between these approaches shift as requirements evolve.
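The read/write split and user-ID sharding described above can be sketched as a routing function. This is an illustrative sketch, not a production router; the shard ranges and host names are hypothetical:

```python
import random

SHARDS = {  # user-ID ranges -> primary database for that shard (hypothetical hosts)
    range(0, 1_000_000): "users-shard-1",
    range(1_000_000, 2_000_000): "users-shard-2",
}
REPLICAS = {  # read replicas attached to each primary
    "users-shard-1": ["shard-1-replica-a", "shard-1-replica-b"],
    "users-shard-2": ["shard-2-replica-a"],
}

def route(user_id: int, is_write: bool) -> str:
    """Writes go to the shard primary; reads are spread across its replicas
    (which may lag the primary, so reads are eventually consistent)."""
    for id_range, primary in SHARDS.items():
        if user_id in id_range:
            return primary if is_write else random.choice(REPLICAS[primary])
    raise KeyError(f"no shard configured for user {user_id}")
```

Note that every complexity mentioned above is visible even in this toy: replica reads can be stale, and a query spanning both ranges has no single place to run.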
Availability and Fault Tolerance: Staying Online When Things Break
Availability measures the percentage of time a system remains operational and accessible. Typically expressed in "nines" (99.9% = "three nines"), availability directly translates to user experience and business impact. Three nines allows 8.76 hours of downtime per year; five nines permits only 5.26 minutes. But availability is not simply uptime—it encompasses the system's ability to serve requests correctly, not just respond with error messages. High availability requires redundancy, fault detection, and graceful degradation.
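The arithmetic behind the nines is worth internalizing; a quick sketch:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760

def downtime_per_year_hours(nines: int) -> float:
    """Maximum downtime per year permitted by an availability of N nines
    (e.g. 3 nines = 99.9% available = 0.1% unavailable)."""
    unavailability = 10 ** (-nines)
    return unavailability * HOURS_PER_YEAR

# Three nines: ~8.76 hours/year; five nines: ~5.26 minutes/year
for n in (2, 3, 4, 5):
    hours = downtime_per_year_hours(n)
    print(f"{n} nines -> {hours:.2f} h/year ({hours * 60:.2f} min/year)")
```

Each additional nine cuts the allowed downtime by a factor of ten, which is why the engineering cost between three and five nines is so dramatic.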
Redundancy is the foundation of availability. Every critical component must have backup capacity to handle failures. Application servers sit behind load balancers that route around failed instances. Databases use replication to provide standby replicas. Network paths have redundant switches and routers. Storage systems maintain multiple copies of data across failure domains. The principle is simple: eliminate single points of failure. However, redundancy alone is insufficient—you need mechanisms to detect failures quickly and automatically shift traffic to healthy components.
Fault detection and health checking determine how quickly systems recognize and respond to failures. Passive health checks monitor request success rates and timeouts, removing unhealthy backends when error thresholds are exceeded. Active health checks periodically probe endpoints, testing actual functionality rather than just network connectivity. The trade-off lies in detection speed versus false positives. Aggressive timeouts catch failures quickly but risk cascading failures when brief network hiccups trigger avalanche retries. Conservative timeouts tolerate transient issues but allow failures to affect users longer.
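One common way to balance detection speed against false positives is hysteresis: require several consecutive failed probes before marking a backend unhealthy, and several consecutive successes before restoring it. A minimal sketch, where the `probe` callable stands in for a real HTTP or TCP check:

```python
class HealthChecker:
    """Active health checking with hysteresis: status flips only after a run
    of consecutive failures (or recoveries), so a single transient glitch
    does not eject a healthy backend."""

    def __init__(self, probe, unhealthy_threshold: int = 3, healthy_threshold: int = 2):
        self.probe = probe  # callable returning True if the backend answered
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self._failures = 0
        self._successes = 0

    def check(self) -> bool:
        if self.probe():
            self._failures = 0
            self._successes += 1
            if not self.healthy and self._successes >= self.healthy_threshold:
                self.healthy = True
        else:
            self._successes = 0
            self._failures += 1
            if self.healthy and self._failures >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy
```

Lowering `unhealthy_threshold` detects failures faster at the cost of more false ejections; raising it does the opposite, which is exactly the trade-off described above.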
Graceful degradation acknowledges that partial functionality beats complete failure. When a recommendation engine fails, an e-commerce site can still display products—just without personalized suggestions. When a comments service is unavailable, a news site can still serve articles. This requires designing systems as loosely coupled services where failures are isolated and non-critical components can fail without bringing down core functionality. Circuit breakers prevent cascading failures by stopping requests to failing dependencies, allowing them time to recover.
Here's a practical example of implementing circuit breakers in TypeScript:
class CircuitBreaker {
  private failureCount = 0;
  private successCount = 0;
  private lastFailureTime: number | null = null;
  private state: 'CLOSED' | 'OPEN' | 'HALF_OPEN' = 'CLOSED';

  constructor(
    private failureThreshold: number = 5,
    private recoveryTimeout: number = 60000, // 60 seconds
    private successThreshold: number = 2
  ) {}

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      if (Date.now() - this.lastFailureTime! > this.recoveryTimeout) {
        this.state = 'HALF_OPEN';
        this.failureCount = 0;
        this.successCount = 0;
      } else {
        throw new Error('Circuit breaker is OPEN');
      }
    }
    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess(): void {
    if (this.state === 'HALF_OPEN') {
      this.successCount++;
      if (this.successCount >= this.successThreshold) {
        this.state = 'CLOSED';
        this.failureCount = 0;
      }
    }
  }

  private onFailure(): void {
    this.failureCount++;
    this.lastFailureTime = Date.now();
    // Any failure while probing in HALF_OPEN reopens the circuit
    if (this.state === 'HALF_OPEN' || this.failureCount >= this.failureThreshold) {
      this.state = 'OPEN';
    }
  }
}

// Usage
const breaker = new CircuitBreaker();

async function fetchUserData(userId: string) {
  return breaker.execute(async () => {
    const response = await fetch(`/api/users/${userId}`);
    if (!response.ok) throw new Error('Fetch failed');
    return response.json();
  });
}
This pattern shields a struggling dependency from being hammered with requests while allowing automatic recovery. When errors exceed the threshold, the circuit "opens," immediately rejecting requests without attempting the failing operation. After a timeout, it enters the "half-open" state, cautiously testing whether the service has recovered before fully closing the circuit again.
Consistency Models and Trade-offs: When Data Agrees (Or Doesn't)
Consistency defines what guarantees a system provides about data correctness, particularly in distributed environments with replication. Strong consistency ensures all readers see the same value at any given time—essentially, the system behaves as if there's a single copy of the data. Eventual consistency allows temporary disagreement between replicas, promising only that if updates stop, all replicas will eventually converge to the same value. Between these extremes lies a spectrum of consistency models, each with different trade-offs.
Strong consistency provides intuitive behavior: after a write completes, all subsequent reads return the updated value. This model simplifies application logic—you don't need to reason about stale data or conflicting updates. Traditional relational databases running on single nodes naturally provide strong consistency. However, in distributed systems, strong consistency requires coordination. Before acknowledging a write, the system must ensure replicas agree, typically using consensus protocols like Paxos or Raft. This coordination introduces latency—writes must wait for network round-trips and replica acknowledgments. Under network partitions, strongly consistent systems often sacrifice availability rather than risk inconsistency.
Eventual consistency optimizes for availability and performance by allowing replicas to temporarily diverge. Writes are acknowledged after updating a single replica, with background processes propagating changes to others. Reads might return stale data, but the system remains available even during network partitions. Eventually, all replicas converge. This model powers highly available systems like Amazon's DynamoDB, Cassandra, and DNS. The challenge is handling conflicting concurrent updates. When two users simultaneously modify the same data on different replicas, which version wins? Systems use strategies like last-write-wins (based on timestamps), conflict-free replicated data types (CRDTs), or application-level conflict resolution.
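The last-write-wins strategy is the simplest of these and fits in a few lines. Timestamps are passed explicitly here to make the concurrency visible; real systems must also contend with clock skew:

```python
import time

class LWWRegister:
    """Last-write-wins register: each replica stores (timestamp, value) and
    merging keeps the newer write. Note that one of two concurrent writes is
    silently discarded; CRDTs avoid that loss for specific data types such
    as counters and sets."""

    def __init__(self):
        self.timestamp = 0.0
        self.value = None

    def write(self, value, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        if ts > self.timestamp:
            self.timestamp, self.value = ts, value

    def merge(self, other: "LWWRegister") -> None:
        """Anti-entropy: adopt the other replica's state if it is newer."""
        if other.timestamp > self.timestamp:
            self.timestamp, self.value = other.timestamp, other.value

# Two replicas accept concurrent writes, then exchange state and converge
a, b = LWWRegister(), LWWRegister()
a.write("alice's edit", timestamp=100.0)
b.write("bob's edit", timestamp=101.0)
a.merge(b)
b.merge(a)
```

After the merges both replicas hold "bob's edit": they converge, which is the eventual-consistency guarantee, but Alice's concurrent write is lost, which is the price of last-write-wins.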
Session consistency and read-your-writes consistency offer middle ground. Session consistency guarantees that within a single user session, reads reflect that user's previous writes, even if global consistency is eventual. This provides intuitive behavior for individual users without requiring global coordination. For example, after a user updates their profile, they should immediately see their changes, even if other users might see the old version briefly. These models are particularly useful for user-facing applications where personal data consistency matters more than global consistency.
The CAP theorem, conjectured by Eric Brewer and later proven by Seth Gilbert and Nancy Lynch, states that distributed systems can guarantee at most two of three properties: Consistency (linearizability), Availability (every request receives a response), and Partition tolerance (system continues operating despite network partitions). Since network partitions are inevitable in distributed systems, the real choice is between consistency and availability during partitions. CP systems (like HBase, MongoDB with certain configurations) sacrifice availability to maintain consistency during partitions. AP systems (like Cassandra, DynamoDB) sacrifice consistency to maintain availability. Modern systems often provide tunable consistency, allowing applications to choose per-operation trade-offs.
Latency Optimization Strategies: Making Systems Feel Fast
Latency measures the time between initiating a request and receiving a response. While throughput describes how many requests a system handles per second, latency describes how long each individual request takes. Users perceive latency directly—a responsive system feels fast even at moderate throughput, while high-latency systems feel sluggish regardless of backend capacity. Reducing latency requires understanding where time is spent and applying appropriate optimization strategies.
Network latency dominates distributed system performance. Light travels through fiber optic cables at roughly 200,000 km/s (about 2/3 the speed of light in a vacuum), imposing fundamental physical limits. A round trip between San Francisco and New York (roughly 4,100 km of fiber each way) takes at least about 40ms before any processing. Cross-ocean requests add hundreds of milliseconds. Optimizing network latency means reducing round-trips, minimizing data transferred, and placing computation closer to users.
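The lower bound is easy to compute from the fiber speed; the distance used here is a rough, route-dependent figure:

```python
FIBER_SPEED_KM_S = 200_000  # ~2/3 the speed of light in a vacuum

def min_rtt_ms(distance_km: float) -> float:
    """Physical lower bound on round-trip time over fiber, ignoring
    routing detours, switching, and endpoint processing."""
    return 2 * distance_km / FIBER_SPEED_KM_S * 1000

san_francisco_to_new_york_km = 4_100  # rough terrestrial fiber route
print(f"{min_rtt_ms(san_francisco_to_new_york_km):.0f} ms")  # ~41 ms
```

Real-world RTTs run well above this bound because routes are not straight lines and every hop adds queueing and switching delay; the bound only tells you what no amount of optimization can beat.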
Content Delivery Networks (CDNs) and edge computing address geographic latency by distributing content and computation globally. CDNs cache static assets—images, JavaScript, CSS—at edge locations near users, eliminating long-distance requests for common resources. Modern edge platforms like Cloudflare Workers, AWS Lambda@Edge, and Fastly Compute@Edge run application logic at the edge, processing requests without backhaul to origin datacenters. For example, user authentication, A/B test bucketing, and personalization logic can execute at the edge, reducing latency from hundreds of milliseconds to single-digit milliseconds.
Caching is perhaps the most powerful latency reduction technique, applicable at multiple layers. Browser caches eliminate requests entirely. Reverse proxies like Varnish or nginx cache responses at the application tier. Application-level caches (Redis, Memcached) store computed results, avoiding expensive database queries or API calls. Database query caches reduce disk I/O. The challenge is cache invalidation—ensuring cached data remains fresh. As Phil Karlton famously noted, "There are only two hard things in Computer Science: cache invalidation and naming things."
Here's a practical caching implementation with time-based and tag-based invalidation in Python:
import time
from typing import Any, Optional, Set, Dict, Callable
from dataclasses import dataclass
from functools import wraps

@dataclass
class CacheEntry:
    value: Any
    expires_at: float
    tags: Set[str]

class TaggedCache:
    def __init__(self):
        self._cache: Dict[str, CacheEntry] = {}
        self._tag_to_keys: Dict[str, Set[str]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._cache.get(key)
        if entry is None:
            return None
        if time.time() > entry.expires_at:
            self._remove(key)
            return None
        return entry.value

    def set(self, key: str, value: Any, ttl: int, tags: Optional[Set[str]] = None):
        tags = tags or set()
        entry = CacheEntry(
            value=value,
            expires_at=time.time() + ttl,
            tags=tags
        )
        self._cache[key] = entry
        for tag in tags:
            self._tag_to_keys.setdefault(tag, set()).add(key)

    def invalidate_by_tag(self, tag: str):
        """Invalidate all cache entries with the given tag"""
        if tag not in self._tag_to_keys:
            return
        for key in self._tag_to_keys[tag].copy():
            self._remove(key)

    def _remove(self, key: str):
        entry = self._cache.pop(key, None)
        if entry:
            for tag in entry.tags:
                self._tag_to_keys[tag].discard(key)

# Decorator for caching function results
def cached(cache: TaggedCache, ttl: int, tags: Optional[Set[str]] = None):
    def decorator(func: Callable):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Simple cache key from function name and arguments
            # (note: a None result is treated as a miss and recomputed)
            cache_key = f"{func.__name__}:{args}:{kwargs}"
            result = cache.get(cache_key)
            if result is not None:
                return result
            result = func(*args, **kwargs)
            cache.set(cache_key, result, ttl, tags)
            return result
        return wrapper
    return decorator

# Usage example
cache = TaggedCache()

@cached(cache, ttl=300, tags={"user_data"})
def get_user_profile(user_id: str):
    # Expensive database query (parameterized to avoid SQL injection)
    return fetch_from_database("SELECT * FROM users WHERE id = %s", (user_id,))

def update_user_profile(user_id: str, data: dict):
    save_to_database(user_id, data)
    # Invalidate all cached user data
    cache.invalidate_by_tag("user_data")
This implementation demonstrates practical cache management: time-based expiration prevents stale data, while tag-based invalidation allows selective cache clearing when related data changes. When a user updates their profile, all cached user data is invalidated without clearing unrelated cached content.
Asynchronous processing and request batching reduce perceived latency by decoupling user-facing requests from expensive background work. When a user uploads a video, the application immediately returns success after storing the raw file, processing encoding jobs asynchronously. Email notifications, analytics logging, and machine learning inference can happen in the background without blocking user requests. Request batching amortizes fixed costs—instead of making 100 individual database queries, batch them into a single query, reducing network round-trips and database overhead.
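The batching side can be sketched as a loader that queues individual key lookups and resolves them with one backend call; `fetch_many` below is a hypothetical stand-in for a real batched query (for instance a single `WHERE id IN (...)` statement):

```python
class BatchLoader:
    """Collects individual key lookups and resolves all of them with one
    batched backend call, turning N round-trips into one."""

    def __init__(self, fetch_many):
        self.fetch_many = fetch_many  # callable: list of keys -> dict of results
        self._pending: set = set()

    def enqueue(self, key) -> None:
        self._pending.add(key)  # deduplicates repeated keys for free

    def flush(self) -> dict:
        """One backend round-trip for every pending key."""
        if not self._pending:
            return {}
        results = self.fetch_many(sorted(self._pending))
        self._pending.clear()
        return results

# Three lookups, one backend round-trip
loader = BatchLoader(lambda keys: {k: f"user-{k}" for k in keys})
for uid in (1, 2, 3):
    loader.enqueue(uid)
profiles = loader.flush()
```

Production loaders (DataLoader-style) add automatic flushing per event-loop tick and per-key promises, but the amortization principle is the same.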
The Interplay: CAP Theorem and Real-World Trade-offs
Understanding scalability, availability, consistency, and latency in isolation provides foundational knowledge, but production systems require navigating their complex interactions. The CAP theorem crystallizes one fundamental trade-off: during network partitions, distributed systems must choose between consistency and availability. This theoretical constraint manifests as practical architectural decisions that ripple through system design.
Consider a global social network serving billions of users. Strong consistency would require coordinating writes across datacenters on multiple continents, introducing hundreds of milliseconds of latency—unacceptable for interactive features like posting status updates or sending messages. Instead, these systems embrace eventual consistency: when you post an update, it's immediately visible to you (session consistency) and quickly propagates to followers worldwide. Temporary inconsistencies—where different users see posts in slightly different orders—are acceptable trade-offs for low latency and high availability. However, critical operations like financial transactions within the same platform might use strong consistency despite higher latency, demonstrating that consistency models can vary within a single system based on operation requirements.
The relationship between availability and latency is equally nuanced. High availability typically requires redundancy and failover mechanisms, which can introduce latency. Health checks, cross-datacenter replication, and consensus protocols all add milliseconds. Conversely, optimizing for minimal latency often means reducing redundancy layers—serving directly from cache without fallback queries, for instance. When the cache misses, latency spikes dramatically, effectively reducing availability for those requests. Production systems balance these concerns through tiered architectures: fast paths with minimal redundancy for common cases, slower fallback paths with strong guarantees for edge cases.
Scalability intersects with all three other properties. Horizontal scaling improves availability through redundancy but complicates consistency by distributing data across nodes. Vertical scaling maintains consistency easily but creates availability risks through single points of failure. As systems scale, latency characteristics change—larger clusters mean more complex coordination, cross-rack network hops, and longer data replication times. Yet scale also enables latency optimization through geographic distribution and caching layers that smaller systems couldn't justify economically. The art of system design lies in understanding these trade-offs and making deliberate choices aligned with specific workload requirements rather than pursuing universal optimization.
Practical Implementation Patterns for Production Systems
Translating theoretical understanding into production systems requires concrete patterns and practices. Modern architectures employ several proven strategies to balance competing demands while maintaining operational simplicity. These patterns aren't silver bullets—each introduces its own complexities—but they provide starting points for reasoning about system design decisions.
Database replication strategies fundamentally shape consistency and availability characteristics. Primary-replica replication (master-slave) directs writes to a single primary while distributing reads across replicas. This pattern scales read-heavy workloads and provides availability through replica promotion during primary failures. However, replica lag introduces eventual consistency for reads, and failover can lose recent writes if the primary fails before replicating them. Multi-primary replication allows writes to multiple nodes, improving write availability but requiring conflict resolution. Amazon Aurora uses a distributed storage layer where the database engine writes to multiple availability zones simultaneously, achieving strong consistency without traditional replication lag.
Sharding and partitioning enable horizontal scaling by dividing data across multiple independent databases. A common pattern is sharding by user ID—users 0-1M on shard 1, 1M-2M on shard 2, etc. This scales both reads and writes and provides failure isolation—a single shard failure affects only a subset of users. The challenges are well-known: cross-shard queries become expensive or impossible, resharding as data grows requires migrating data, and hotspots emerge when certain shards receive disproportionate traffic. Consistent hashing helps by distributing data more evenly and minimizing resharding impact, but adds complexity.
// Consistent hashing for distributed cache
import crypto from 'crypto';

class ConsistentHash {
  private ring: Map<number, string> = new Map();
  private sortedKeys: number[] = [];
  private virtualNodes: number;

  constructor(nodes: string[], virtualNodes = 150) {
    this.virtualNodes = virtualNodes;
    nodes.forEach(node => this.addNode(node));
  }

  private hash(key: string): number {
    return parseInt(
      crypto.createHash('md5').update(key).digest('hex').substring(0, 8),
      16
    );
  }

  addNode(node: string): void {
    for (let i = 0; i < this.virtualNodes; i++) {
      const hash = this.hash(`${node}:${i}`);
      this.ring.set(hash, node);
      this.sortedKeys.push(hash);
    }
    this.sortedKeys.sort((a, b) => a - b);
  }

  removeNode(node: string): void {
    for (let i = 0; i < this.virtualNodes; i++) {
      const hash = this.hash(`${node}:${i}`);
      this.ring.delete(hash);
      this.sortedKeys = this.sortedKeys.filter(k => k !== hash);
    }
  }

  getNode(key: string): string | undefined {
    if (this.ring.size === 0) return undefined;
    const hash = this.hash(key);
    // Find the first virtual node clockwise from the hash
    // (a linear scan; binary search pays off for large rings)
    let idx = this.sortedKeys.findIndex(k => k >= hash);
    // Wrap around if we're past the last node
    if (idx === -1) idx = 0;
    return this.ring.get(this.sortedKeys[idx]);
  }

  // Get multiple distinct physical nodes for replication
  getNodes(key: string, count: number): string[] {
    if (this.ring.size === 0) return [];
    const hash = this.hash(key);
    const nodes: Set<string> = new Set();
    let idx = this.sortedKeys.findIndex(k => k >= hash);
    if (idx === -1) idx = 0;
    // Walk the ring at most one full revolution, stopping once `count`
    // distinct physical nodes are collected (there may be fewer than
    // `count` physical nodes in total, so the step bound matters)
    for (let steps = 0; steps < this.sortedKeys.length && nodes.size < count; steps++) {
      const node = this.ring.get(this.sortedKeys[idx]);
      if (node) nodes.add(node);
      idx = (idx + 1) % this.sortedKeys.length;
    }
    return Array.from(nodes);
  }
}

// Usage
const cacheCluster = new ConsistentHash([
  'cache-1.example.com',
  'cache-2.example.com',
  'cache-3.example.com'
]);

const key = 'user:12345:profile';
const primaryNode = cacheCluster.getNode(key);
const replicaNodes = cacheCluster.getNodes(key, 3);

console.log(`Primary: ${primaryNode}`);
console.log(`Replicas: ${replicaNodes.join(', ')}`);
Consistent hashing distributes keys uniformly across nodes and minimizes redistribution when nodes are added or removed—adding a node only requires moving approximately 1/N of the keys, where N is the number of nodes. Virtual nodes (multiple hash positions per physical node) improve distribution uniformity.
Read-through and write-through caching patterns provide structured approaches to cache management. Read-through caches intercept read requests: on cache hit, return the cached value; on miss, fetch from the database, populate the cache, then return the value. Write-through caches intercept writes, updating both cache and database synchronously. Write-behind (write-back) caches acknowledge writes after updating the cache, asynchronously persisting to the database. This improves write latency but risks data loss if the cache fails before persistence. The choice depends on whether you optimize for read latency, write latency, or consistency guarantees.
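A minimal sketch of read-through and write-through behavior, with a plain dict standing in for the database:

```python
class ReadThroughCache:
    """Read-through on get: a miss loads from the backing store and
    populates the cache. Write-through on put: cache and store are updated
    together, synchronously, so the cache never holds data the store lacks."""

    def __init__(self, store: dict):
        self.store = store   # stand-in for the database
        self.cache: dict = {}
        self.misses = 0

    def get(self, key):
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.store[key]  # populate on miss
        return self.cache[key]

    def put(self, key, value):
        self.store[key] = value  # write-through: persist first...
        self.cache[key] = value  # ...then cache, keeping both in sync
```

A write-behind variant would update only `self.cache` in `put` and persist asynchronously, improving write latency at the risk of losing acknowledged writes if the cache dies first.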
Event-driven architectures and message queues decouple system components, improving scalability and availability. When a user places an order, the order service publishes an event to a message broker (RabbitMQ, Kafka, AWS SQS). Inventory, payment, and notification services consume these events independently. If the notification service fails, orders still process—notifications queue up and deliver once the service recovers. This pattern enables independent scaling (you can add more inventory service instances without touching payment services) and failure isolation. The trade-off is eventual consistency—there's a delay between placing an order and receiving a confirmation email—and increased system complexity from distributed state management.
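The failure-isolation property can be sketched with an in-memory queue: a crashing consumer leaves its messages queued for redelivery once it recovers. Real brokers add persistence, acknowledgements, and dead-letter handling on top of the same idea:

```python
from collections import deque

class MessageQueue:
    """Minimal in-memory queue: if a handler raises, the message is requeued
    and redelivered on a later consume pass instead of being lost."""

    def __init__(self):
        self._messages = deque()

    def publish(self, message) -> None:
        self._messages.append(message)

    def consume(self, handler) -> int:
        """Deliver each currently queued message once; return delivery count."""
        delivered = 0
        for _ in range(len(self._messages)):
            message = self._messages.popleft()
            try:
                handler(message)
                delivered += 1
            except Exception:
                self._messages.append(message)  # requeue for retry
        return delivered
```

Orders keep flowing into the queue while the notification handler is down; nothing upstream blocks, which is exactly the decoupling the pattern promises.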
Common Pitfalls and Anti-Patterns to Avoid
Despite well-understood principles, production systems frequently suffer from avoidable mistakes that stem from insufficient consideration of trade-offs or premature optimization. Recognizing these anti-patterns helps teams avoid costly architectural mistakes that are difficult to correct once systems reach scale.
Premature sharding is perhaps the most common pitfall. Teams anticipate scaling needs and implement complex sharding schemes early, before actually reaching single-database limits. Modern databases like PostgreSQL handle millions of rows efficiently with proper indexing. Premature sharding introduces complexity—cross-shard queries, data migration challenges, increased operational burden—without corresponding benefits. The better approach is vertical scaling and read replicas until you definitively hit limits, then sharding becomes a clear necessity rather than speculative optimization. Instagram famously ran on a single PostgreSQL instance for years before eventually sharding, demonstrating that well-optimized single-database systems scale further than many architects assume.
Ignoring the fallacies of distributed computing leads to fragile systems. The eight fallacies—the network is reliable, latency is zero, bandwidth is infinite, the network is secure, topology doesn't change, there is one administrator, transport cost is zero, and the network is homogeneous—represent assumptions that seem reasonable but fail in production. Systems that don't handle network failures, timeouts, and partial failures become brittle at scale. Every network call needs timeouts, retries with exponential backoff, circuit breakers, and fallback behavior. Every distributed transaction should have compensating transactions or idempotency to handle partial failures. Failing to plan for these realities causes cascading failures and mysterious bugs that only appear under load.
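Retries with exponential backoff are usually combined with jitter so that clients which failed together do not retry in lockstep and re-trigger the overload. A sketch of a full-jitter schedule:

```python
import random

def backoff_delays(base: float = 0.1, factor: float = 2.0,
                   max_delay: float = 10.0, attempts: int = 5,
                   jitter: bool = True) -> list:
    """Full-jitter exponential backoff: attempt n waits for a value drawn
    uniformly from [0, min(max_delay, base * factor**n)]."""
    delays = []
    for attempt in range(attempts):
        cap = min(max_delay, base * factor ** attempt)
        delays.append(random.uniform(0, cap) if jitter else cap)
    return delays
```

Without jitter the caps are 0.1, 0.2, 0.4, 0.8, 1.6 seconds; with jitter each retry lands somewhere below its cap, spreading load across time. The `max_delay` ceiling keeps long outages from producing multi-minute waits.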
Best Practices for Modern Distributed Systems
Building production-grade distributed systems requires adhering to established best practices that have emerged from decades of industry experience. These practices don't eliminate complexity but provide frameworks for managing it systematically. Following these guidelines significantly improves system reliability and maintainability.
Design for failure from day one. Assume every component will fail and architect accordingly. Use health checks, circuit breakers, and automatic failover. Implement graceful degradation so partial failures don't cascade into total outages. Practice chaos engineering—deliberately introduce failures in controlled environments to validate that redundancy and recovery mechanisms work as designed. Netflix's Chaos Monkey randomly terminates production instances, forcing teams to build resilient systems that handle failures gracefully. This proactive approach reveals weaknesses before they cause customer-impacting incidents.
Measure everything and set clear SLOs (Service Level Objectives). You cannot optimize what you don't measure. Instrument systems to track latency percentiles (p50, p95, p99), error rates, throughput, and saturation for every critical component. Set explicit SLOs that define acceptable performance—for example, "99% of requests complete within 200ms" or "99.9% availability measured monthly." These objectives drive architectural decisions: if you're missing latency SLOs, you know where to invest optimization effort. They also inform trade-offs: is eventual consistency acceptable if it gets you from p99 latency of 500ms to 50ms? The SLO provides quantitative answers.
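Percentiles can be computed with a simple nearest-rank method. The example below shows why the SLO language above insists on p99: a small tail of slow requests vanishes in the mean but dominates the high percentiles:

```python
import math

def percentile(samples, p: float):
    """Nearest-rank percentile: the smallest value with at least p% of
    samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# 95 fast requests and 5 slow ones: the mean looks fine, the tail does not
latencies_ms = [10] * 95 + [500] * 5
mean = sum(latencies_ms) / len(latencies_ms)  # 34.5 ms
p50 = percentile(latencies_ms, 50)            # 10 ms
p99 = percentile(latencies_ms, 99)            # 500 ms
```

Production systems typically use streaming sketches (t-digest, HDRHistogram) rather than sorting raw samples, but the interpretation is the same.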
Choose consistency models per-operation, not per-system. Modern databases increasingly offer tunable consistency, allowing applications to specify requirements per-request. DynamoDB and Cassandra let you choose consistency levels (eventual, quorum, strong) for each read and write. Critical operations like financial transactions use strong consistency; less critical operations like view counts use eventual consistency. This fine-grained control optimizes the consistency-performance trade-off at the operation level rather than accepting system-wide constraints. Design your data models and access patterns to support this flexibility—separate critical strongly-consistent data from eventually-consistent data when possible.
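The quorum rule behind tunable consistency fits in one line: with N replicas, reading R and writing W copies guarantees that every read overlaps the latest write whenever R + W > N.

```python
def quorum_is_strong(n: int, r: int, w: int) -> bool:
    """True when a read of R replicas must intersect a write of W replicas
    out of N total, i.e. the read is guaranteed to see the latest write."""
    return r + w > n

# Cassandra-style levels for a replication factor of N = 3:
assert quorum_is_strong(3, r=2, w=2)      # QUORUM reads + QUORUM writes: strong
assert not quorum_is_strong(3, r=1, w=1)  # ONE + ONE: fast, but only eventual
```

Sliding R and W per request is exactly the per-operation dial described above: R=1, W=1 buys latency and availability, R=2, W=2 buys linearizable behavior for the operations that need it.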
Conclusion
The four fundamentals—scalability, availability, consistency, and latency—form the foundation of all distributed systems design. Understanding these properties individually provides necessary knowledge, but mastering their interactions and trade-offs separates competent engineers from exceptional architects. There is no universal "right" architecture; every system makes deliberate compromises based on specific requirements, constraints, and acceptable trade-offs.
The path forward lies in embracing complexity thoughtfully rather than avoiding it. Start simple—a monolithic architecture with vertical scaling and strong consistency—and evolve based on actual requirements rather than anticipated ones. Measure relentlessly to understand current system behavior and identify real bottlenecks. When scaling becomes necessary, apply patterns like replication, caching, and partitioning incrementally, validating each change against defined service level objectives. Accept that some problems have no perfect solution, only trade-offs to be managed. A payment system will always prioritize consistency over availability; a content feed will always choose availability over consistency. The skill is recognizing which trade-offs align with your system's core purpose and implementing them deliberately with full understanding of the consequences.
Key Takeaways
- Start simple and scale based on measurement, not speculation. Avoid premature optimization and complex architectures until actual metrics demonstrate the need. Many systems never outgrow well-optimized single-server setups.
- Design every component with failure as the default expectation. Implement timeouts, retries with exponential backoff, circuit breakers, and health checks from the beginning. Graceful degradation prevents cascading failures.
- Set explicit Service Level Objectives (SLOs) for latency, availability, and consistency. These quantifiable targets guide architectural decisions and trade-off evaluations. Measure p99 latency, not just averages.
- Choose consistency models per-operation, not per-system. Critical operations warrant strong consistency despite higher latency; non-critical operations benefit from eventual consistency. Modern databases support tunable per-request consistency.
- Embrace caching at multiple layers but invest in cache invalidation strategies. Browser caching, CDNs, application caches, and query caches dramatically reduce latency. Tag-based invalidation and time-based expiration prevent stale data issues.
References
- Brewer, Eric. "CAP Twelve Years Later: How the 'Rules' Have Changed." IEEE Computer, vol. 45, no. 2, 2012, pp. 23-29.
- Gilbert, Seth, and Nancy Lynch. "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services." ACM SIGACT News, vol. 33, no. 2, 2002, pp. 51-59.
- DeCandia, Giuseppe, et al. "Dynamo: Amazon's Highly Available Key-Value Store." ACM SIGOPS Operating Systems Review, vol. 41, no. 6, 2007, pp. 205-220.
- Vogels, Werner. "Eventually Consistent." Communications of the ACM, vol. 52, no. 1, 2009, pp. 40-44.
- Kleppmann, Martin. Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O'Reilly Media, 2017.
- Tanenbaum, Andrew S., and Maarten van Steen. Distributed Systems: Principles and Paradigms. 3rd ed., CreateSpace Independent Publishing Platform, 2017.
- Ongaro, Diego, and John Ousterhout. "In Search of an Understandable Consensus Algorithm." USENIX Annual Technical Conference, 2014, pp. 305-319. (Raft consensus protocol)
- Lamport, Leslie. "The Part-Time Parliament." ACM Transactions on Computer Systems, vol. 16, no. 2, 1998, pp. 133-169. (Paxos consensus protocol)
- Bailis, Peter, et al. "Highly Available Transactions: Virtues and Limitations." Proceedings of the VLDB Endowment, vol. 7, no. 3, 2013, pp. 181-192.
- Newman, Sam. Building Microservices: Designing Fine-Grained Systems. 2nd ed., O'Reilly Media, 2021.
- Nygard, Michael T. Release It!: Design and Deploy Production-Ready Software. 2nd ed., Pragmatic Bookshelf, 2018.
- Helland, Pat. "Life Beyond Distributed Transactions: An Apostate's Opinion." CIDR Conference, 2007.