Scaling & Performance Optimization

Optimize MCP server performance with connection pooling, caching strategies, load balancing, monitoring, and horizontal scaling patterns for high-traffic production deployments.


title: "Scaling & Performance Optimization"
description: "Optimize MCP server performance with connection pooling, caching strategies, load balancing, monitoring, and horizontal scaling patterns for high-traffic production deployments."
order: 19
level: "advanced"
duration: "30 min"
keywords:
  - "MCP scaling"
  - "MCP performance"
  - "MCP connection pooling"
  - "MCP caching"
  - "MCP load balancing"
  - "MCP monitoring"
  - "MCP server optimization"
  - "@modelcontextprotocol/sdk performance"
  - "mcp-framework scaling"
date: "2026-04-01"

Quick Summary

As MCP servers handle more clients and heavier workloads, performance becomes critical. This lesson covers connection pooling for databases and external services, multi-layer caching strategies, load balancing MCP servers behind reverse proxies, monitoring and observability, horizontal scaling patterns, and memory optimization. These patterns apply to servers built with both the official TypeScript SDK and mcp-framework.

Understanding MCP Performance Bottlenecks

Before optimizing, identify where time is actually spent:

| Bottleneck | Impact | Typical Latency | Solution |
| --- | --- | --- | --- |
| Database queries | High | 10-500ms | Connection pooling, query optimization, caching |
| External API calls | High | 100-5000ms | Caching, circuit breakers, timeouts |
| JSON serialization | Low-Medium | 1-50ms | Streaming responses, selective fields |
| Transport overhead | Low | 1-10ms | Transport selection, compression |
| Tool handler logic | Varies | 1-1000ms | Profiling, algorithm optimization |
80% of MCP server latency typically comes from database queries and external API calls
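The breakdown above is easiest to act on when each stage is measured per request. A minimal sketch of a stage-timing helper — the stage names and the `handleRequest` shape are illustrative:

```typescript
// Minimal helper to attribute latency to stages inside a tool handler.
async function timed<T>(
  label: string,
  timings: Record<string, number>,
  fn: () => Promise<T>
): Promise<T> {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    // Accumulate so repeated stages (e.g. several queries) sum up
    timings[label] = (timings[label] ?? 0) + (Date.now() - start);
  }
}

// Usage: wrap each stage, then log the breakdown once per request.
async function handleRequest(): Promise<string> {
  const timings: Record<string, number> = {};
  const rows = await timed("db", timings, async () => [1, 2, 3]);
  const body = await timed("serialize", timings, async () =>
    JSON.stringify(rows)
  );
  console.error("timings:", timings); // e.g. { db: 12, serialize: 0 }
  return body;
}
```

Logging a per-stage breakdown like this usually identifies the dominant bottleneck faster than end-to-end latency alone.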

Connection Pooling

Database Connection Pool

Never create a new database connection per request. Use a connection pool:

// src/db.ts
import { Pool } from "pg";

// Create a shared pool — reuse across all tool handlers
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 20,                    // Maximum pool size
  min: 5,                     // Minimum idle connections
  idleTimeoutMillis: 30000,   // Close idle connections after 30s
  connectionTimeoutMillis: 5000, // Fail if connection takes >5s
  maxUses: 7500,              // Recycle connection after N uses
});

// Monitor pool health
pool.on("error", (err) => {
  console.error("Unexpected pool error:", err);
});

export async function query(sql: string, params?: unknown[]) {
  const client = await pool.connect();
  try {
    const start = Date.now();
    const result = await client.query(sql, params);
    const duration = Date.now() - start;

    if (duration > 1000) {
      console.error(`Slow query (${duration}ms): ${sql.substring(0, 100)}`);
    }

    return result;
  } finally {
    client.release(); // Always release back to pool
  }
}

export async function getPoolStats() {
  return {
    total: pool.totalCount,
    idle: pool.idleCount,
    waiting: pool.waitingCount,
  };
}
Connection Pool Sizing

Set max to roughly 2-4x your expected concurrent tool executions. Too few connections cause queuing; too many waste memory and overwhelm the database. Monitor waitingCount — if it is consistently above zero, increase the pool size.
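The `getPoolStats()` counters above can drive a simple watchdog. A sketch, assuming an object that exposes pg-style `totalCount`/`idleCount`/`waitingCount` counters; the three-sample streak threshold is an arbitrary choice:

```typescript
// A pool exposing the same counters as pg's Pool.
interface PoolStats {
  totalCount: number;
  idleCount: number;
  waitingCount: number;
}

// Pure check: returns the updated streak of samples with waiters.
export function contentionStreak(stats: PoolStats, streak: number): number {
  return stats.waitingCount > 0 ? streak + 1 : 0;
}

// Wire it to a timer: warn after three consecutive samples with waiters.
export function watchPool(pool: PoolStats, intervalMs = 10_000) {
  let streak = 0;
  const timer = setInterval(() => {
    streak = contentionStreak(pool, streak);
    if (streak >= 3) {
      console.error(
        `Pool contention: ${pool.waitingCount} waiting ` +
        `(${pool.idleCount} idle of ${pool.totalCount}) — consider raising max`
      );
    }
  }, intervalMs);
  timer.unref(); // don't keep the process alive just for monitoring
  return timer;
}
```

Requiring a streak rather than a single sample avoids alerting on momentary bursts that the pool absorbs normally.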

HTTP Client Connection Pool

For tools that call external APIs, reuse HTTP connections:

// src/http-client.ts

// Node.js fetch reuses connections by default via the global agent,
// but for heavy use, configure an explicit agent
import { Agent } from "undici";

const httpAgent = new Agent({
  keepAliveTimeout: 30000,     // Keep connections alive for 30s
  keepAliveMaxTimeout: 60000,  // Max keepalive time
  connections: 50,             // Max connections per origin
  pipelining: 1,               // HTTP pipelining
});

export async function fetchWithPool(
  url: string,
  options: RequestInit & { timeoutMs?: number } = {}
): Promise<Response> {
  const { timeoutMs = 10000, ...rest } = options;
  return fetch(url, {
    ...rest,
    // @ts-ignore — dispatcher is supported by undici-backed fetch
    dispatcher: httpAgent,
    signal: AbortSignal.timeout(timeoutMs), // Per-request timeout, default 10s
  });
}

Caching Strategies

In-Memory Cache with TTL

// src/cache.ts
interface CacheEntry<T> {
  value: T;
  expiresAt: number;
  accessCount: number;
}

export class TTLCache<T> {
  private store = new Map<string, CacheEntry<T>>();
  private maxSize: number;

  constructor(options: { maxSize?: number } = {}) {
    this.maxSize = options.maxSize || 1000;

    // Periodic cleanup of expired entries; unref so the timer
    // doesn't keep the process alive on its own
    setInterval(() => this.cleanup(), 60000).unref();
  }

  get(key: string): T | undefined {
    const entry = this.store.get(key);

    if (!entry) return undefined;

    if (entry.expiresAt < Date.now()) {
      this.store.delete(key);
      return undefined;
    }

    entry.accessCount++;
    return entry.value;
  }

  set(key: string, value: T, ttlMs: number): void {
    // Evict if at capacity
    if (this.store.size >= this.maxSize) {
      this.evictLeastUsed();
    }

    this.store.set(key, {
      value,
      expiresAt: Date.now() + ttlMs,
      accessCount: 0,
    });
  }

  invalidate(key: string): void {
    this.store.delete(key);
  }

  invalidatePattern(pattern: RegExp): void {
    for (const key of this.store.keys()) {
      if (pattern.test(key)) {
        this.store.delete(key);
      }
    }
  }

  get stats() {
    return {
      size: this.store.size,
      maxSize: this.maxSize,
    };
  }

  private cleanup(): void {
    const now = Date.now();
    for (const [key, entry] of this.store) {
      if (entry.expiresAt < now) {
        this.store.delete(key);
      }
    }
  }

  private evictLeastUsed(): void {
    let minKey = "";
    let minAccess = Infinity;

    for (const [key, entry] of this.store) {
      if (entry.accessCount < minAccess) {
        minAccess = entry.accessCount;
        minKey = key;
      }
    }

    if (minKey) this.store.delete(minKey);
  }
}
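The `stats` getter above reports only size, but the hit rate is what tells you whether caching is paying off. One way to layer hit/miss counters onto the same idea — a simplified sketch, not a drop-in replacement for `TTLCache`:

```typescript
// Hit/miss counters layered onto a TTL cache, so a hit rate can be
// exported alongside size (class and field names are illustrative).
export class CountingTTLCache<T> {
  private store = new Map<string, { value: T; expiresAt: number }>();
  private hits = 0;
  private misses = 0;

  get(key: string): T | undefined {
    const entry = this.store.get(key);
    if (!entry || entry.expiresAt < Date.now()) {
      if (entry) this.store.delete(key); // expired — drop it
      this.misses++;
      return undefined;
    }
    this.hits++;
    return entry.value;
  }

  set(key: string, value: T, ttlMs: number): void {
    this.store.set(key, { value, expiresAt: Date.now() + ttlMs });
  }

  get stats() {
    const total = this.hits + this.misses;
    return {
      size: this.store.size,
      hits: this.hits,
      misses: this.misses,
      hitRate: total === 0 ? 0 : this.hits / total,
    };
  }
}
```

Counting expired lookups as misses is deliberate: a cache whose entries expire before reuse is not helping, and the hit rate should reflect that.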

Using Cache in Tool Handlers

import { TTLCache } from "./cache.js";

const cache = new TTLCache<string>({ maxSize: 500 });

server.tool(
  "get-weather",
  "Get current weather for a location",
  {
    location: z.string().describe("City name or coordinates"),
  },
  async ({ location }) => {
    const cacheKey = `weather:${location.toLowerCase()}`;

    // Check cache first
    const cached = cache.get(cacheKey);
    if (cached) {
      return {
        content: [{
          type: "text",
          text: cached,
        }],
      };
    }

    // Fetch from API
    const response = await fetch(
      `https://api.weather.example/v1/current?q=${encodeURIComponent(location)}`,
      { signal: AbortSignal.timeout(5000) }
    );
    const data = await response.json();
    const result = JSON.stringify(data, null, 2);

    // Cache for 5 minutes
    cache.set(cacheKey, result, 5 * 60 * 1000);

    return {
      content: [{ type: "text", text: result }],
    };
  }
);

Redis Cache for Distributed Deployments

When running multiple server instances, use a shared cache:

import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL);

async function cachedQuery<T>(
  key: string,
  ttlSeconds: number,
  fetchFn: () => Promise<T>
): Promise<T> {
  // Try cache
  const cached = await redis.get(key);
  if (cached) {
    return JSON.parse(cached);
  }

  // Fetch and cache
  const result = await fetchFn();
  await redis.setex(key, ttlSeconds, JSON.stringify(result));
  return result;
}

// Usage in tools
server.tool(
  "get-user-stats",
  "Get user statistics",
  { userId: z.string() },
  async ({ userId }) => {
    const stats = await cachedQuery(
      `user-stats:${userId}`,
      300, // 5 minute TTL
      () => computeUserStats(userId)
    );

    return {
      content: [{ type: "text", text: JSON.stringify(stats, null, 2) }],
    };
  }
);
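One failure mode a shared cache doesn't solve by itself is the cache stampede: when a hot key expires, every concurrent caller misses and hits the backend at once. Coalescing in-flight fetches per key is a cheap mitigation — a sketch (`coalesce` is a hypothetical helper, composable with the `cachedQuery` pattern above):

```typescript
// Coalesce concurrent fetches: callers for the same key share one promise,
// so the backend sees at most one request per key at a time.
const inFlight = new Map<string, Promise<unknown>>();

export function coalesce<T>(
  key: string,
  fetchFn: () => Promise<T>
): Promise<T> {
  const pending = inFlight.get(key);
  if (pending) return pending as Promise<T>;

  // Remove the entry once settled so later misses trigger a fresh fetch
  const promise = fetchFn().finally(() => inFlight.delete(key));
  inFlight.set(key, promise);
  return promise;
}
```

Note this dedupes only within one process; across instances, a Redis lock or probabilistic early refresh would be needed for the same effect.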
| Cache Type | Speed | Shared | Persistence | Best For |
| --- | --- | --- | --- | --- |
| In-memory (Map) | Fastest (~0.01ms) | No | No | Single instance, hot data |
| Redis | Fast (~1-5ms) | Yes | Optional | Multi-instance, shared state |
| CDN/Edge | Variable (~10-50ms) | Yes | Yes | Static resources, public data |

Load Balancing

Nginx Configuration for MCP SSE Servers

upstream mcp_servers {
    # Sticky sessions for SSE (required)
    ip_hash;

    server mcp-server-1:3001;
    server mcp-server-2:3001;
    server mcp-server-3:3001;
}

server {
    listen 443 ssl;
    server_name mcp.example.com;

    ssl_certificate /etc/ssl/certs/mcp.example.com.pem;
    ssl_certificate_key /etc/ssl/private/mcp.example.com.key;

    # SSE endpoint — long-lived connections
    location /sse {
        proxy_pass http://mcp_servers;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # SSE specific
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 86400s; # 24h for long-lived SSE
        proxy_send_timeout 86400s;

        # Chunked transfer
        chunked_transfer_encoding on;
    }

    # Message endpoint — short-lived requests
    location /messages {
        proxy_pass http://mcp_servers;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }

    # Health check
    location /health {
        proxy_pass http://mcp_servers;
    }
}
Sticky Sessions for SSE

SSE transport requires sticky sessions because the SSE stream and POST messages must reach the same server instance. Use ip_hash, cookie-based affinity, or a session-aware load balancer. Streamable HTTP with stateless session management avoids this limitation.

Nginx for Streamable HTTP (No Sticky Sessions Needed)

upstream mcp_servers {
    # Round-robin — no sticky sessions needed
    server mcp-server-1:3001;
    server mcp-server-2:3001;
    server mcp-server-3:3001;
}

server {
    listen 443 ssl;
    server_name mcp.example.com;

    location /mcp {
        proxy_pass http://mcp_servers;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header mcp-session-id $http_mcp_session_id;

        # Allow streaming responses
        proxy_buffering off;
    }
}

Monitoring and Observability

Metrics Collection

// src/metrics.ts
class Metrics {
  private counters = new Map<string, number>();
  private histograms = new Map<string, number[]>();

  increment(name: string, value = 1): void {
    this.counters.set(name, (this.counters.get(name) || 0) + value);
  }

  recordDuration(name: string, durationMs: number): void {
    if (!this.histograms.has(name)) {
      this.histograms.set(name, []);
    }
    const values = this.histograms.get(name)!;
    values.push(durationMs);

    // Keep last 1000 values
    if (values.length > 1000) {
      values.splice(0, values.length - 1000);
    }
  }

  getSnapshot(): Record<string, unknown> {
    const snapshot: Record<string, unknown> = {};

    for (const [name, count] of this.counters) {
      snapshot[name] = count;
    }

    for (const [name, values] of this.histograms) {
      if (values.length === 0) continue;
      const sorted = [...values].sort((a, b) => a - b);
      snapshot[`${name}_p50`] = sorted[Math.floor(sorted.length * 0.5)];
      snapshot[`${name}_p95`] = sorted[Math.floor(sorted.length * 0.95)];
      snapshot[`${name}_p99`] = sorted[Math.floor(sorted.length * 0.99)];
      snapshot[`${name}_count`] = values.length;
    }

    return snapshot;
  }
}

export const metrics = new Metrics();

Instrumented Tool Handlers

import { metrics } from "./metrics.js";

function instrumentedTool(
  server: McpServer,
  name: string,
  description: string,
  schema: Record<string, z.ZodType>,
  handler: (args: any) => Promise<any>
) {
  server.tool(name, description, schema, async (args) => {
    const start = Date.now();
    metrics.increment(`tool.${name}.calls`);

    try {
      const result = await handler(args);
      metrics.recordDuration(`tool.${name}.duration`, Date.now() - start);

      if (result.isError) {
        metrics.increment(`tool.${name}.errors`);
      } else {
        metrics.increment(`tool.${name}.success`);
      }

      return result;
    } catch (error) {
      metrics.increment(`tool.${name}.exceptions`);
      metrics.recordDuration(`tool.${name}.duration`, Date.now() - start);
      throw error;
    }
  });
}

// Expose metrics endpoint
app.get("/metrics", (req, res) => {
  res.json({
    ...metrics.getSnapshot(),
    poolStats: getPoolStats(),
    cacheStats: cache.stats,
    uptime: process.uptime(),
    memory: process.memoryUsage(),
  });
});

Key Metrics to Monitor

| Metric | What It Tells You | Alert Threshold |
| --- | --- | --- |
| tool.*.duration_p95 | 95th percentile tool latency | >5 seconds |
| tool.*.errors | Tool error count | Error rate >5% |
| pool.waitingCount | Database connection contention | >0 sustained |
| memory.heapUsed | Memory consumption | >80% of limit |
| cache.hitRate | Cache effectiveness | <50% hit rate |
| active_connections | Current SSE/WS connections | Near max capacity |
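These thresholds can also be checked programmatically against the `/metrics` snapshot. A sketch — the metric names and threshold shape are assumptions, matching the `Metrics` class above only loosely:

```typescript
// Evaluate a metrics snapshot against alert thresholds.
// Returns the list of triggered alerts (names are illustrative).
interface Thresholds {
  maxP95Ms: number;
  maxErrorRate: number;
  maxHeapFraction: number;
}

export function checkThresholds(
  snapshot: Record<string, number>,
  t: Thresholds
): string[] {
  const alerts: string[] = [];

  // Latency: flag any p95 histogram entry over the budget
  for (const [name, value] of Object.entries(snapshot)) {
    if (name.endsWith("_p95") && value > t.maxP95Ms) {
      alerts.push(`${name}=${value}ms exceeds ${t.maxP95Ms}ms`);
    }
  }

  // Error rate across all tool calls
  const errors = snapshot["tool.errors"] ?? 0;
  const calls = snapshot["tool.calls"] ?? 0;
  if (calls > 0 && errors / calls > t.maxErrorRate) {
    alerts.push(`error rate ${((errors / calls) * 100).toFixed(1)}% too high`);
  }

  // Memory: fraction of the configured heap limit
  const heapUsed = snapshot["memory.heapUsed"] ?? 0;
  const heapLimit = snapshot["memory.heapLimit"] ?? Infinity;
  if (heapUsed / heapLimit > t.maxHeapFraction) {
    alerts.push("heap usage above limit fraction");
  }

  return alerts;
}
```

Running a check like this on a timer (or from an external prober hitting `/metrics`) gives basic alerting without a full monitoring stack.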

Horizontal Scaling

Scaling with Kubernetes

# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: mcp-server
  template:
    metadata:
      labels:
        app: mcp-server
    spec:
      containers:
        - name: mcp-server
          image: your-registry/mcp-server:latest
          ports:
            - containerPort: 3001
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /healthz
              port: 3001
            initialDelaySeconds: 10
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /readyz
              port: 3001
            initialDelaySeconds: 5
            periodSeconds: 10
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcp-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
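When the HPA scales down, Kubernetes sends SIGTERM and waits out the pod's termination grace period, so long-lived SSE connections need an explicit drain path. A sketch of a graceful-shutdown hook, assuming a Node `http.Server` and a pool-closing callback (both names are placeholders for whatever your server exposes):

```typescript
import http from "node:http";

// Stop accepting new connections, let in-flight requests drain, then exit.
export function setupGracefulShutdown(
  server: http.Server,
  closePool: () => Promise<void>,
  drainTimeoutMs = 15_000
) {
  let shuttingDown = false;

  process.on("SIGTERM", () => {
    if (shuttingDown) return;
    shuttingDown = true;
    console.error("SIGTERM received — draining connections");

    // Refuse new connections; existing requests finish naturally
    server.close(async () => {
      await closePool(); // e.g. pool.end()
      process.exit(0);
    });

    // Hard deadline in case long-lived SSE streams never close
    setTimeout(() => {
      console.error("Drain timeout exceeded — forcing exit");
      process.exit(1);
    }, drainTimeoutMs).unref();
  });

  // Expose drain state so the /readyz probe can start failing early
  return () => shuttingDown;
}
```

Keep `drainTimeoutMs` below the pod's `terminationGracePeriodSeconds`, otherwise Kubernetes will SIGKILL the process before the deadline fires.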

Memory Optimization

Streaming Large Responses

For tools that return large datasets, process data in chunks rather than loading everything into memory:

server.tool(
  "export-data",
  "Export large dataset",
  {
    table: z.string().describe("Table to export"),
    format: z.enum(["json", "csv"]).default("json"),
  },
  async ({ table, format }) => {
    // NOTE: validate `table` against an allowlist first — interpolating
    // user input into SQL is an injection risk.
    // Don't load all rows at once
    const cursor = db.cursor(`SELECT * FROM ${table}`);
    const chunks: string[] = [];
    let rowCount = 0;

    for await (const batch of cursor.batches(100)) {
      chunks.push(
        format === "csv"
          ? batchToCsv(batch)
          : JSON.stringify(batch)
      );
      rowCount += batch.length;

      // Limit output size
      if (rowCount >= 10000) {
        chunks.push(`\n... truncated at 10,000 rows`);
        break;
      }
    }

    return {
      content: [{
        type: "text",
        text: chunks.join("\n"),
      }],
    };
  }
);
Set Response Size Limits

Always cap the amount of data returned by tools. AI models have context limits, and sending a 10MB JSON response is wasteful. Implement pagination or truncation with a clear indicator that more data is available.
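A truncation helper makes that policy easy to apply uniformly across tools. A sketch — the 64 KB budget and the indicator text are arbitrary choices:

```typescript
// Truncate tool output at a byte budget with an explicit indicator,
// so the model knows more data exists.
export function truncateOutput(
  text: string,
  maxBytes = 64 * 1024
): { text: string; truncated: boolean } {
  const encoded = Buffer.from(text, "utf8");
  if (encoded.byteLength <= maxBytes) {
    return { text, truncated: false };
  }

  // Cut at the byte budget, then drop any split multi-byte character
  // that decoded as a replacement char at the boundary
  const slice = encoded.subarray(0, maxBytes).toString("utf8");
  const clean = slice.replace(/\uFFFD+$/, "");
  return {
    text: `${clean}\n... output truncated — narrow your query for more`,
    truncated: true,
  };
}
```

Measuring in bytes rather than characters matters for multi-byte text; slicing the string directly can both blow the budget and split a character.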

Frequently Asked Questions