Scaling & Performance Optimization

Optimize MCP server performance with connection pooling, caching strategies, load balancing, monitoring, and horizontal scaling patterns for high-traffic production deployments.


title: "Scaling & Performance Optimization"
description: "Optimize MCP server performance with connection pooling, caching strategies, load balancing, monitoring, and horizontal scaling patterns for high-traffic production deployments."
order: 19
level: "advanced"
duration: "30 min"
keywords:
  - "MCP scaling"
  - "MCP performance"
  - "MCP connection pooling"
  - "MCP caching"
  - "MCP load balancing"
  - "MCP monitoring"
  - "MCP server optimization"
  - "@modelcontextprotocol/sdk performance"
  - "mcp-framework scaling"
date: "2026-04-01"

Quick Summary

As MCP servers handle more clients and heavier workloads, performance becomes critical. This lesson covers connection pooling for databases and external services, multi-layer caching strategies, load balancing MCP servers behind reverse proxies, monitoring and observability, horizontal scaling patterns, and memory optimization. These patterns apply to servers built with both the official TypeScript SDK and mcp-framework.

Understanding MCP Performance Bottlenecks

Before optimizing, identify where time is actually spent:

| Bottleneck | Impact | Typical Latency | Solution |
| --- | --- | --- | --- |
| Database queries | High | 10-500ms | Connection pooling, query optimization, caching |
| External API calls | High | 100-5000ms | Caching, circuit breakers, timeouts |
| JSON serialization | Low-Medium | 1-50ms | Streaming responses, selective fields |
| Transport overhead | Low | 1-10ms | Transport selection, compression |
| Tool handler logic | Varies | 1-1000ms | Profiling, algorithm optimization |
80% of MCP server latency typically comes from database queries and external API calls
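The breakdown above is easiest to act on when each stage is measured per request. A minimal sketch of a stage-timing helper — the stage names and the `handleRequest` shape are illustrative:

```typescript
// Minimal helper to attribute latency to stages inside a tool handler.
async function timed<T>(
  label: string,
  timings: Record<string, number>,
  fn: () => Promise<T>
): Promise<T> {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    // Accumulate so repeated stages (e.g. several queries) sum up
    timings[label] = (timings[label] ?? 0) + (Date.now() - start);
  }
}

// Usage: wrap each stage, then log the breakdown once per request.
async function handleRequest(): Promise<string> {
  const timings: Record<string, number> = {};
  const rows = await timed("db", timings, async () => [1, 2, 3]);
  const body = await timed("serialize", timings, async () =>
    JSON.stringify(rows)
  );
  console.error("timings:", timings); // e.g. { db: 12, serialize: 0 }
  return body;
}
```

Logging a per-stage breakdown like this usually identifies the dominant bottleneck faster than end-to-end latency alone.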

Connection Pooling

Database Connection Pool

Never create a new database connection per request. Use a connection pool:

// src/db.ts
import { Pool } from "pg";

// Create a shared pool — reuse across all tool handlers
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 20,                    // Maximum pool size
  min: 5,                     // Minimum idle connections
  idleTimeoutMillis: 30000,   // Close idle connections after 30s
  connectionTimeoutMillis: 5000, // Fail if connection takes >5s
  maxUses: 7500,              // Recycle connection after N uses
});

// Monitor pool health
pool.on("error", (err) => {
  console.error("Unexpected pool error:", err);
});

export async function query(sql: string, params?: unknown[]) {
  const client = await pool.connect();
  try {
    const start = Date.now();
    const result = await client.query(sql, params);
    const duration = Date.now() - start;

    if (duration > 1000) {
      console.error(`Slow query (${duration}ms): ${sql.substring(0, 100)}`);
    }

    return result;
  } finally {
    client.release(); // Always release back to pool
  }
}

export async function getPoolStats() {
  return {
    total: pool.totalCount,
    idle: pool.idleCount,
    waiting: pool.waitingCount,
  };
}
Connection Pool Sizing

Set max to roughly 2-4x your expected concurrent tool executions. Too few connections cause queuing; too many waste memory and overwhelm the database. Monitor waitingCount — if it is consistently above zero, increase the pool size.
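The `getPoolStats()` counters above can drive a simple watchdog. A sketch, assuming an object that exposes pg-style `totalCount`/`idleCount`/`waitingCount` counters; the three-sample streak threshold is an arbitrary choice:

```typescript
// A pool exposing the same counters as pg's Pool.
interface PoolStats {
  totalCount: number;
  idleCount: number;
  waitingCount: number;
}

// Pure check: returns the updated streak of samples with waiters.
export function contentionStreak(stats: PoolStats, streak: number): number {
  return stats.waitingCount > 0 ? streak + 1 : 0;
}

// Wire it to a timer: warn after three consecutive samples with waiters.
export function watchPool(pool: PoolStats, intervalMs = 10_000) {
  let streak = 0;
  const timer = setInterval(() => {
    streak = contentionStreak(pool, streak);
    if (streak >= 3) {
      console.error(
        `Pool contention: ${pool.waitingCount} waiting ` +
        `(${pool.idleCount} idle of ${pool.totalCount}) — consider raising max`
      );
    }
  }, intervalMs);
  timer.unref(); // don't keep the process alive just for monitoring
  return timer;
}
```

Requiring a streak rather than a single sample avoids alerting on momentary bursts that the pool absorbs normally.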

HTTP Client Connection Pool

For tools that call external APIs, reuse HTTP connections:

// src/http-client.ts

// Node.js fetch reuses connections by default via the global agent,
// but for heavy use, configure an explicit agent
import { Agent } from "undici";

const httpAgent = new Agent({
  keepAliveTimeout: 30000,     // Keep connections alive for 30s
  keepAliveMaxTimeout: 60000,  // Max keepalive time
  connections: 50,             // Max connections per origin
  pipelining: 1,               // HTTP pipelining
});

export async function fetchWithPool(
  url: string,
  options: RequestInit & { timeoutMs?: number } = {}
): Promise<Response> {
  const { timeoutMs = 10000, ...rest } = options;
  return fetch(url, {
    ...rest,
    // @ts-ignore — dispatcher is supported by undici-backed fetch
    dispatcher: httpAgent,
    signal: AbortSignal.timeout(timeoutMs), // Per-request timeout, default 10s
  });
}

Caching Strategies

In-Memory Cache with TTL

// src/cache.ts
interface CacheEntry<T> {
  value: T;
  expiresAt: number;
  accessCount: number;
}

export class TTLCache<T> {
  private store = new Map<string, CacheEntry<T>>();
  private maxSize: number;

  constructor(options: { maxSize?: number } = {}) {
    this.maxSize = options.maxSize || 1000;

    // Periodic cleanup of expired entries; unref so the timer
    // doesn't keep the process alive on its own
    setInterval(() => this.cleanup(), 60000).unref();
  }

  get(key: string): T | undefined {
    const entry = this.store.get(key);

    if (!entry) return undefined;

    if (entry.expiresAt < Date.now()) {
      this.store.delete(key);
      return undefined;
    }

    entry.accessCount++;
    return entry.value;
  }

  set(key: string, value: T, ttlMs: number): void {
    // Evict if at capacity
    if (this.store.size >= this.maxSize) {
      this.evictLeastUsed();
    }

    this.store.set(key, {
      value,
      expiresAt: Date.now() + ttlMs,
      accessCount: 0,
    });
  }

  invalidate(key: string): void {
    this.store.delete(key);
  }

  invalidatePattern(pattern: RegExp): void {
    for (const key of this.store.keys()) {
      if (pattern.test(key)) {
        this.store.delete(key);
      }
    }
  }

  get stats() {
    return {
      size: this.store.size,
      maxSize: this.maxSize,
    };
  }

  private cleanup(): void {
    const now = Date.now();
    for (const [key, entry] of this.store) {
      if (entry.expiresAt < now) {
        this.store.delete(key);
      }
    }
  }

  private evictLeastUsed(): void {
    let minKey = "";
    let minAccess = Infinity;

    for (const [key, entry] of this.store) {
      if (entry.accessCount < minAccess) {
        minAccess = entry.accessCount;
        minKey = key;
      }
    }

    if (minKey) this.store.delete(minKey);
  }
}
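The `stats` getter above reports only size, but the hit rate is what tells you whether caching is paying off. One way to layer hit/miss counters onto the same idea — a simplified sketch, not a drop-in replacement for `TTLCache`:

```typescript
// Hit/miss counters layered onto a TTL cache, so a hit rate can be
// exported alongside size (class and field names are illustrative).
export class CountingTTLCache<T> {
  private store = new Map<string, { value: T; expiresAt: number }>();
  private hits = 0;
  private misses = 0;

  get(key: string): T | undefined {
    const entry = this.store.get(key);
    if (!entry || entry.expiresAt < Date.now()) {
      if (entry) this.store.delete(key); // expired — drop it
      this.misses++;
      return undefined;
    }
    this.hits++;
    return entry.value;
  }

  set(key: string, value: T, ttlMs: number): void {
    this.store.set(key, { value, expiresAt: Date.now() + ttlMs });
  }

  get stats() {
    const total = this.hits + this.misses;
    return {
      size: this.store.size,
      hits: this.hits,
      misses: this.misses,
      hitRate: total === 0 ? 0 : this.hits / total,
    };
  }
}
```

Counting expired lookups as misses is deliberate: a cache whose entries expire before reuse is not helping, and the hit rate should reflect that.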

Using Cache in Tool Handlers

import { TTLCache } from "./cache.js";

const cache = new TTLCache<string>({ maxSize: 500 });

server.tool(
  "get-weather",
  "Get current weather for a location",
  {
    location: z.string().describe("City name or coordinates"),
  },
  async ({ location }) => {
    const cacheKey = `weather:${location.toLowerCase()}`;

    // Check cache first
    const cached = cache.get(cacheKey);
    if (cached) {
      return {
        content: [{
          type: "text",
          text: cached,
        }],
      };
    }

    // Fetch from API
    const response = await fetch(
      `https://api.weather.example/v1/current?q=${encodeURIComponent(location)}`,
      { signal: AbortSignal.timeout(5000) }
    );
    const data = await response.json();
    const result = JSON.stringify(data, null, 2);

    // Cache for 5 minutes
    cache.set(cacheKey, result, 5 * 60 * 1000);

    return {
      content: [{ type: "text", text: result }],
    };
  }
);

Redis Cache for Distributed Deployments

When running multiple server instances, use a shared cache:

import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL);

async function cachedQuery<T>(
  key: string,
  ttlSeconds: number,
  fetchFn: () => Promise<T>
): Promise<T> {
  // Try cache
  const cached = await redis.get(key);
  if (cached) {
    return JSON.parse(cached);
  }

  // Fetch and cache
  const result = await fetchFn();
  await redis.setex(key, ttlSeconds, JSON.stringify(result));
  return result;
}

// Usage in tools
server.tool(
  "get-user-stats",
  "Get user statistics",
  { userId: z.string() },
  async ({ userId }) => {
    const stats = await cachedQuery(
      `user-stats:${userId}`,
      300, // 5 minute TTL
      () => computeUserStats(userId)
    );

    return {
      content: [{ type: "text", text: JSON.stringify(stats, null, 2) }],
    };
  }
);
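One failure mode a shared cache doesn't solve by itself is the cache stampede: when a hot key expires, every concurrent caller misses and hits the backend at once. Coalescing in-flight fetches per key is a cheap mitigation — a sketch (`coalesce` is a hypothetical helper, composable with the `cachedQuery` pattern above):

```typescript
// Coalesce concurrent fetches: callers for the same key share one promise,
// so the backend sees at most one request per key at a time.
const inFlight = new Map<string, Promise<unknown>>();

export function coalesce<T>(
  key: string,
  fetchFn: () => Promise<T>
): Promise<T> {
  const pending = inFlight.get(key);
  if (pending) return pending as Promise<T>;

  // Remove the entry once settled so later misses trigger a fresh fetch
  const promise = fetchFn().finally(() => inFlight.delete(key));
  inFlight.set(key, promise);
  return promise;
}
```

Note this dedupes only within one process; across instances, a Redis lock or probabilistic early refresh would be needed for the same effect.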
| Cache Type | Speed | Shared | Persistence | Best For |
| --- | --- | --- | --- | --- |
| In-memory (Map) | Fastest (~0.01ms) | No | No | Single instance, hot data |
| Redis | Fast (~1-5ms) | Yes | Optional | Multi-instance, shared state |
| CDN/Edge | Variable (~10-50ms) | Yes | Yes | Static resources, public data |

Load Balancing

Nginx Configuration for MCP SSE Servers

upstream mcp_servers {
    # Sticky sessions for SSE (required)
    ip_hash;

    server mcp-server-1:3001;
    server mcp-server-2:3001;
    server mcp-server-3:3001;
}

server {
    listen 443 ssl;
    server_name mcp.example.com;

    ssl_certificate /etc/ssl/certs/mcp.example.com.pem;
    ssl_certificate_key /etc/ssl/private/mcp.example.com.key;

    # SSE endpoint — long-lived connections
    location /sse {
        proxy_pass http://mcp_servers;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # SSE specific
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 86400s; # 24h for long-lived SSE
        proxy_send_timeout 86400s;

        # Chunked transfer
        chunked_transfer_encoding on;
    }

    # Message endpoint — short-lived requests
    location /messages {
        proxy_pass http://mcp_servers;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }

    # Health check
    location /health {
        proxy_pass http://mcp_servers;
    }
}
Sticky Sessions for SSE

SSE transport requires sticky sessions because the SSE stream and POST messages must reach the same server instance. Use ip_hash, cookie-based affinity, or a session-aware load balancer. Streamable HTTP with stateless session management avoids this limitation.

Nginx for Streamable HTTP (No Sticky Sessions Needed)

upstream mcp_servers {
    # Round-robin — no sticky sessions needed
    server mcp-server-1:3001;
    server mcp-server-2:3001;
    server mcp-server-3:3001;
}

server {
    listen 443 ssl;
    server_name mcp.example.com;

    location /mcp {
        proxy_pass http://mcp_servers;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header mcp-session-id $http_mcp_session_id;

        # Allow streaming responses
        proxy_buffering off;
    }
}

Monitoring and Observability

Metrics Collection

// src/metrics.ts
class Metrics {
  private counters = new Map<string, number>();
  private histograms = new Map<string, number[]>();

  increment(name: string, value = 1): void {
    this.counters.set(name, (this.counters.get(name) || 0) + value);
  }

  recordDuration(name: string, durationMs: number): void {
    if (!this.histograms.has(name)) {
      this.histograms.set(name, []);
    }
    const values = this.histograms.get(name)!;
    values.push(durationMs);

    // Keep last 1000 values
    if (values.length > 1000) {
      values.splice(0, values.length - 1000);
    }
  }

  getSnapshot(): Record<string, unknown> {
    const snapshot: Record<string, unknown> = {};

    for (const [name, count] of this.counters) {
      snapshot[name] = count;
    }

    for (const [name, values] of this.histograms) {
      if (values.length === 0) continue;
      const sorted = [...values].sort((a, b) => a - b);
      snapshot[`${name}_p50`] = sorted[Math.floor(sorted.length * 0.5)];
      snapshot[`${name}_p95`] = sorted[Math.floor(sorted.length * 0.95)];
      snapshot[`${name}_p99`] = sorted[Math.floor(sorted.length * 0.99)];
      snapshot[`${name}_count`] = values.length;
    }

    return snapshot;
  }
}

export const metrics = new Metrics();

Instrumented Tool Handlers

import { metrics } from "./metrics.js";

function instrumentedTool(
  server: McpServer,
  name: string,
  description: string,
  schema: Record<string, z.ZodType>,
  handler: (args: any) => Promise<any>
) {
  server.tool(name, description, schema, async (args) => {
    const start = Date.now();
    metrics.increment(`tool.${name}.calls`);

    try {
      const result = await handler(args);
      metrics.recordDuration(`tool.${name}.duration`, Date.now() - start);

      if (result.isError) {
        metrics.increment(`tool.${name}.errors`);
      } else {
        metrics.increment(`tool.${name}.success`);
      }

      return result;
    } catch (error) {
      metrics.increment(`tool.${name}.exceptions`);
      metrics.recordDuration(`tool.${name}.duration`, Date.now() - start);
      throw error;
    }
  });
}

// Expose metrics endpoint
app.get("/metrics", (req, res) => {
  res.json({
    ...metrics.getSnapshot(),
    poolStats: getPoolStats(),
    cacheStats: cache.stats,
    uptime: process.uptime(),
    memory: process.memoryUsage(),
  });
});

Key Metrics to Monitor

| Metric | What It Tells You | Alert Threshold |
| --- | --- | --- |
| tool.*.duration_p95 | 95th percentile tool latency | >5 seconds |
| tool.*.errors | Tool error count | Error rate >5% |
| pool.waitingCount | Database connection contention | >0 sustained |
| memory.heapUsed | Memory consumption | >80% of limit |
| cache.hitRate | Cache effectiveness | <50% hit rate |
| active_connections | Current SSE/WS connections | Near max capacity |
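These thresholds can also be checked programmatically against the `/metrics` snapshot. A sketch — the metric names and threshold shape are assumptions, matching the `Metrics` class above only loosely:

```typescript
// Evaluate a metrics snapshot against alert thresholds.
// Returns the list of triggered alerts (names are illustrative).
interface Thresholds {
  maxP95Ms: number;
  maxErrorRate: number;
  maxHeapFraction: number;
}

export function checkThresholds(
  snapshot: Record<string, number>,
  t: Thresholds
): string[] {
  const alerts: string[] = [];

  // Latency: flag any p95 histogram entry over the budget
  for (const [name, value] of Object.entries(snapshot)) {
    if (name.endsWith("_p95") && value > t.maxP95Ms) {
      alerts.push(`${name}=${value}ms exceeds ${t.maxP95Ms}ms`);
    }
  }

  // Error rate across all tool calls
  const errors = snapshot["tool.errors"] ?? 0;
  const calls = snapshot["tool.calls"] ?? 0;
  if (calls > 0 && errors / calls > t.maxErrorRate) {
    alerts.push(`error rate ${((errors / calls) * 100).toFixed(1)}% too high`);
  }

  // Memory: fraction of the configured heap limit
  const heapUsed = snapshot["memory.heapUsed"] ?? 0;
  const heapLimit = snapshot["memory.heapLimit"] ?? Infinity;
  if (heapUsed / heapLimit > t.maxHeapFraction) {
    alerts.push("heap usage above limit fraction");
  }

  return alerts;
}
```

Running a check like this on a timer (or from an external prober hitting `/metrics`) gives basic alerting without a full monitoring stack.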

Horizontal Scaling

Scaling with Kubernetes

# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: mcp-server
  template:
    metadata:
      labels:
        app: mcp-server
    spec:
      containers:
        - name: mcp-server
          image: your-registry/mcp-server:latest
          ports:
            - containerPort: 3001
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /healthz
              port: 3001
            initialDelaySeconds: 10
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /readyz
              port: 3001
            initialDelaySeconds: 5
            periodSeconds: 10
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mcp-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mcp-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
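When the HPA scales down, Kubernetes sends SIGTERM and waits out the pod's termination grace period, so long-lived SSE connections need an explicit drain path. A sketch of a graceful-shutdown hook, assuming a Node `http.Server` and a pool-closing callback (both names are placeholders for whatever your server exposes):

```typescript
import http from "node:http";

// Stop accepting new connections, let in-flight requests drain, then exit.
export function setupGracefulShutdown(
  server: http.Server,
  closePool: () => Promise<void>,
  drainTimeoutMs = 15_000
) {
  let shuttingDown = false;

  process.on("SIGTERM", () => {
    if (shuttingDown) return;
    shuttingDown = true;
    console.error("SIGTERM received — draining connections");

    // Refuse new connections; existing requests finish naturally
    server.close(async () => {
      await closePool(); // e.g. pool.end()
      process.exit(0);
    });

    // Hard deadline in case long-lived SSE streams never close
    setTimeout(() => {
      console.error("Drain timeout exceeded — forcing exit");
      process.exit(1);
    }, drainTimeoutMs).unref();
  });

  // Expose drain state so the /readyz probe can start failing early
  return () => shuttingDown;
}
```

Keep `drainTimeoutMs` below the pod's `terminationGracePeriodSeconds`, otherwise Kubernetes will SIGKILL the process before the deadline fires.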

Memory Optimization

Streaming Large Responses

For tools that return large datasets, process data in chunks rather than loading everything into memory:

server.tool(
  "export-data",
  "Export large dataset",
  {
    table: z.string().describe("Table to export"),
    format: z.enum(["json", "csv"]).default("json"),
  },
  async ({ table, format }) => {
    // NOTE: validate `table` against an allowlist first — interpolating
    // user input into SQL is an injection risk.
    // Don't load all rows at once
    const cursor = db.cursor(`SELECT * FROM ${table}`);
    const chunks: string[] = [];
    let rowCount = 0;

    for await (const batch of cursor.batches(100)) {
      chunks.push(
        format === "csv"
          ? batchToCsv(batch)
          : JSON.stringify(batch)
      );
      rowCount += batch.length;

      // Limit output size
      if (rowCount >= 10000) {
        chunks.push(`\n... truncated at 10,000 rows`);
        break;
      }
    }

    return {
      content: [{
        type: "text",
        text: chunks.join("\n"),
      }],
    };
  }
);
Set Response Size Limits

Always cap the amount of data returned by tools. AI models have context limits, and sending a 10MB JSON response is wasteful. Implement pagination or truncation with a clear indicator that more data is available.
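A truncation helper makes that policy easy to apply uniformly across tools. A sketch — the 64 KB budget and the indicator text are arbitrary choices:

```typescript
// Truncate tool output at a byte budget with an explicit indicator,
// so the model knows more data exists.
export function truncateOutput(
  text: string,
  maxBytes = 64 * 1024
): { text: string; truncated: boolean } {
  const encoded = Buffer.from(text, "utf8");
  if (encoded.byteLength <= maxBytes) {
    return { text, truncated: false };
  }

  // Cut at the byte budget, then drop any split multi-byte character
  // that decoded as a replacement char at the boundary
  const slice = encoded.subarray(0, maxBytes).toString("utf8");
  const clean = slice.replace(/\uFFFD+$/, "");
  return {
    text: `${clean}\n... output truncated — narrow your query for more`,
    truncated: true,
  };
}
```

Measuring in bytes rather than characters matters for multi-byte text; slicing the string directly can both blow the budget and split a character.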

Frequently Asked Questions