Scaling & Performance Optimization
Optimize MCP server performance with connection pooling, caching strategies, load balancing, monitoring, and horizontal scaling patterns for high-traffic production deployments.
title: "Scaling & Performance Optimization"
description: "Optimize MCP server performance with connection pooling, caching strategies, load balancing, monitoring, and horizontal scaling patterns for high-traffic production deployments."
order: 19
level: "advanced"
duration: "30 min"
keywords:
- "MCP scaling"
- "MCP performance"
- "MCP connection pooling"
- "MCP caching"
- "MCP load balancing"
- "MCP monitoring"
- "MCP server optimization"
- "@modelcontextprotocol/sdk performance"
- "mcp-framework scaling"
date: "2026-04-01"
As MCP servers handle more clients and heavier workloads, performance becomes critical. This lesson covers connection pooling for databases and external services, multi-layer caching strategies, load balancing MCP servers behind reverse proxies, monitoring and observability, horizontal scaling patterns, and memory optimization. These patterns apply to servers built with both the official TypeScript SDK and mcp-framework.
Understanding MCP Performance Bottlenecks
Before optimizing, identify where time is actually spent:
| Bottleneck | Impact | Typical Latency | Solution |
|---|---|---|---|
| Database queries | High | 10-500ms | Connection pooling, query optimization, caching |
| External API calls | High | 100-5000ms | Caching, circuit breakers, timeouts |
| JSON serialization | Low-Medium | 1-50ms | Streaming responses, selective fields |
| Transport overhead | Low | 1-10ms | Transport selection, compression |
| Tool handler logic | Varies | 1-1000ms | Profiling, algorithm optimization |
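A lightweight way to find out where time actually goes is to time each step. A minimal sketch (the `timed` helper and its threshold are illustrative, not part of any MCP SDK):

```typescript
// Wrap any async step to measure where time goes. Logs to stderr so
// stdio-transport servers keep stdout clean for protocol messages.
async function timed<T>(
  label: string,
  fn: () => Promise<T>,
  thresholdMs = 100
): Promise<T> {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    const ms = performance.now() - start;
    if (ms > thresholdMs) {
      console.error(`[slow] ${label}: ${ms.toFixed(1)}ms`);
    }
  }
}
```

Wrap suspect steps, e.g. `await timed("db.query", () => query(sql))`, and compare the results against the table above before deciding what to optimize.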
Connection Pooling
Database Connection Pool
Never create a new database connection per request. Use a connection pool:
// src/db.ts
import { Pool } from "pg";
// Create a shared pool — reuse across all tool handlers
const pool = new Pool({
connectionString: process.env.DATABASE_URL,
max: 20, // Maximum pool size
min: 5, // Minimum idle connections
idleTimeoutMillis: 30000, // Close idle connections after 30s
connectionTimeoutMillis: 5000, // Fail if connection takes >5s
maxUses: 7500, // Recycle connection after N uses
});
// Monitor pool health
pool.on("error", (err) => {
console.error("Unexpected pool error:", err);
});
export async function query(sql: string, params?: unknown[]) {
const client = await pool.connect();
try {
const start = Date.now();
const result = await client.query(sql, params);
const duration = Date.now() - start;
if (duration > 1000) {
console.error(`Slow query (${duration}ms): ${sql.substring(0, 100)}`);
}
return result;
} finally {
client.release(); // Always release back to pool
}
}
export async function getPoolStats() {
return {
total: pool.totalCount,
idle: pool.idleCount,
waiting: pool.waitingCount,
};
}
Set max to roughly 2-4x your expected concurrent tool executions. Too few connections cause queuing; too many waste memory and overwhelm the database. Monitor waitingCount — if it is consistently above zero, increase the pool size.
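The same back-pressure idea applies in-process: capping concurrent tool executions keeps the pool's waiting queue short. A minimal semaphore sketch (all names here are illustrative):

```typescript
// Cap concurrent async work; excess callers wait in FIFO order.
class Semaphore {
  private queue: Array<() => void> = [];
  private active = 0;

  constructor(private readonly limit: number) {}

  async acquire(): Promise<void> {
    if (this.active < this.limit) {
      this.active++;
      return;
    }
    // At capacity: wait until a release() hands us the slot
    await new Promise<void>((resolve) => this.queue.push(resolve));
  }

  release(): void {
    const next = this.queue.shift();
    if (next) next(); // transfer the slot to a waiter
    else this.active--;
  }

  get waiting(): number {
    return this.queue.length;
  }
}
```

Acquire before expensive work and release in a `finally` block; exporting `waiting` alongside `getPoolStats()` makes contention visible early.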
HTTP Client Connection Pool
For tools that call external APIs, reuse HTTP connections:
// src/http-client.ts
// Node.js fetch reuses connections by default via the global agent,
// but for heavy use, configure an explicit agent
import { Agent } from "undici";
const httpAgent = new Agent({
keepAliveTimeout: 30000, // Keep connections alive for 30s
keepAliveMaxTimeout: 60000, // Max keepalive time
connections: 50, // Max connections per origin
pipelining: 1, // HTTP pipelining
});
export async function fetchWithPool(
url: string,
options: RequestInit = {},
timeoutMs = 10_000
): Promise<Response> {
return fetch(url, {
...options,
// @ts-ignore — dispatcher is supported by undici-backed fetch
dispatcher: httpAgent,
// Respect a caller-provided signal; otherwise apply a default timeout
signal: options.signal ?? AbortSignal.timeout(timeoutMs),
});
}
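The bottleneck table above pairs external API calls with circuit breakers. A minimal count-based breaker sketch (thresholds and names are illustrative; production code would more likely use a library such as opossum):

```typescript
// Stop calling an upstream after repeated failures; retry one probe
// after a cooldown. Thresholds here are illustrative defaults.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly maxFailures = 5,
    private readonly resetMs = 30_000
  ) {}

  async exec<T>(fn: () => Promise<T>): Promise<T> {
    if (this.failures >= this.maxFailures) {
      if (Date.now() - this.openedAt < this.resetMs) {
        throw new Error("Circuit open: upstream marked unhealthy");
      }
      this.failures = this.maxFailures - 1; // half-open: allow one probe
    }
    try {
      const result = await fn();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures >= this.maxFailures) this.openedAt = Date.now();
      throw err;
    }
  }
}
```

Wrapping `fetchWithPool` calls in `breaker.exec(...)` turns a slow, failing upstream into a fast, explicit error instead of a pile-up of timed-out requests.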
Caching Strategies
In-Memory Cache with TTL
// src/cache.ts
interface CacheEntry<T> {
value: T;
expiresAt: number;
accessCount: number;
}
export class TTLCache<T> {
private store = new Map<string, CacheEntry<T>>();
private maxSize: number;
constructor(options: { maxSize?: number } = {}) {
this.maxSize = options.maxSize ?? 1000;
// Periodic cleanup; unref() so the timer doesn't keep the process alive
setInterval(() => this.cleanup(), 60000).unref();
}
get(key: string): T | undefined {
const entry = this.store.get(key);
if (!entry) return undefined;
if (entry.expiresAt < Date.now()) {
this.store.delete(key);
return undefined;
}
entry.accessCount++;
return entry.value;
}
set(key: string, value: T, ttlMs: number): void {
// Evict if at capacity
if (this.store.size >= this.maxSize) {
this.evictLeastUsed();
}
this.store.set(key, {
value,
expiresAt: Date.now() + ttlMs,
accessCount: 0,
});
}
invalidate(key: string): void {
this.store.delete(key);
}
invalidatePattern(pattern: RegExp): void {
for (const key of this.store.keys()) {
if (pattern.test(key)) {
this.store.delete(key);
}
}
}
get stats() {
return {
size: this.store.size,
maxSize: this.maxSize,
};
}
private cleanup(): void {
const now = Date.now();
for (const [key, entry] of this.store) {
if (entry.expiresAt < now) {
this.store.delete(key);
}
}
}
private evictLeastUsed(): void {
let minKey = "";
let minAccess = Infinity;
for (const [key, entry] of this.store) {
if (entry.accessCount < minAccess) {
minAccess = entry.accessCount;
minKey = key;
}
}
if (minKey) this.store.delete(minKey);
}
}
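The monitoring section later in this lesson tracks a cache hit rate, which the class above does not record. A thin wrapper can add it (a sketch; `CacheLike` is a hypothetical minimal interface, not part of TTLCache):

```typescript
// Wrap any cache-like get() to track hit rate without touching
// the cache implementation itself.
interface CacheLike<T> {
  get(key: string): T | undefined;
}

function withHitRate<T>(cache: CacheLike<T>) {
  let hits = 0;
  let misses = 0;
  return {
    get(key: string): T | undefined {
      const value = cache.get(key);
      if (value === undefined) misses++;
      else hits++;
      return value;
    },
    get hitRate(): number {
      const total = hits + misses;
      return total === 0 ? 0 : hits / total;
    },
  };
}
```

Expose `hitRate` on the metrics endpoint; a low rate usually means TTLs are too short or keys are too granular.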
Using Cache in Tool Handlers
import { TTLCache } from "./cache.js";
const cache = new TTLCache<string>({ maxSize: 500 });
server.tool(
"get-weather",
"Get current weather for a location",
{
location: z.string().describe("City name or coordinates"),
},
async ({ location }) => {
const cacheKey = `weather:${location.toLowerCase()}`;
// Check cache first
const cached = cache.get(cacheKey);
if (cached) {
return {
content: [{
type: "text",
text: cached,
}],
};
}
// Fetch from API
const response = await fetch(
`https://api.weather.example/v1/current?q=${encodeURIComponent(location)}`,
{ signal: AbortSignal.timeout(5000) }
);
const data = await response.json();
const result = JSON.stringify(data, null, 2);
// Cache for 5 minutes
cache.set(cacheKey, result, 5 * 60 * 1000);
return {
content: [{ type: "text", text: result }],
};
}
);
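One gap in the pattern above: when a popular entry expires, every concurrent request misses the cache and hits the upstream API at once. Coalescing in-flight fetches avoids that stampede (a sketch; names are illustrative):

```typescript
// Deduplicate concurrent cache misses for the same key so only one
// upstream fetch runs; later callers await the same promise.
const inflight = new Map<string, Promise<unknown>>();

async function coalesce<T>(
  key: string,
  fetchFn: () => Promise<T>
): Promise<T> {
  const existing = inflight.get(key);
  if (existing) return existing as Promise<T>;
  const p = fetchFn().finally(() => inflight.delete(key));
  inflight.set(key, p);
  return p;
}
```

In the weather tool, the fetch-and-cache step would become `await coalesce(cacheKey, () => fetchWeather(location))`, where `fetchWeather` stands in for the fetch logic shown above.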
Redis Cache for Distributed Deployments
When running multiple server instances, use a shared cache:
import Redis from "ioredis";
const redis = new Redis(process.env.REDIS_URL);
async function cachedQuery<T>(
key: string,
ttlSeconds: number,
fetchFn: () => Promise<T>
): Promise<T> {
// Try cache
const cached = await redis.get(key);
if (cached) {
return JSON.parse(cached);
}
// Fetch and cache
const result = await fetchFn();
await redis.setex(key, ttlSeconds, JSON.stringify(result));
return result;
}
// Usage in tools
server.tool(
"get-user-stats",
"Get user statistics",
{ userId: z.string() },
async ({ userId }) => {
const stats = await cachedQuery(
`user-stats:${userId}`,
300, // 5 minute TTL
() => computeUserStats(userId)
);
return {
content: [{ type: "text", text: JSON.stringify(stats, null, 2) }],
};
}
);
| Cache Type | Speed | Shared | Persistence | Best For |
|---|---|---|---|---|
| In-memory (Map) | Fastest (~0.01ms) | No | No | Single instance, hot data |
| Redis | Fast (~1-5ms) | Yes | Optional | Multi-instance, shared state |
| CDN/Edge | Variable (~10-50ms) | Yes | Yes | Static resources, public data |
Load Balancing
Nginx Configuration for MCP SSE Servers
upstream mcp_servers {
# Sticky sessions for SSE (required)
ip_hash;
server mcp-server-1:3001;
server mcp-server-2:3001;
server mcp-server-3:3001;
}
server {
listen 443 ssl;
server_name mcp.example.com;
ssl_certificate /etc/ssl/certs/mcp.example.com.pem;
ssl_certificate_key /etc/ssl/private/mcp.example.com.key;
# SSE endpoint — long-lived connections
location /sse {
proxy_pass http://mcp_servers;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
# SSE specific
proxy_buffering off;
proxy_cache off;
proxy_read_timeout 86400s; # 24h for long-lived SSE
proxy_send_timeout 86400s;
# Chunked transfer
chunked_transfer_encoding on;
}
# Message endpoint — short-lived requests
location /messages {
proxy_pass http://mcp_servers;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
}
# Health check
location /health {
proxy_pass http://mcp_servers;
}
}
SSE transport requires sticky sessions because the SSE stream and POST messages must reach the same server instance. Use ip_hash, cookie-based affinity, or a session-aware load balancer. Streamable HTTP with stateless session management avoids this limitation.
Nginx for Streamable HTTP (No Sticky Sessions Needed)
upstream mcp_servers {
# Round-robin — no sticky sessions needed
server mcp-server-1:3001;
server mcp-server-2:3001;
server mcp-server-3:3001;
}
server {
listen 443 ssl;
server_name mcp.example.com;
location /mcp {
proxy_pass http://mcp_servers;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header mcp-session-id $http_mcp_session_id;
# Allow streaming responses
proxy_buffering off;
}
}
Monitoring and Observability
Metrics Collection
// src/metrics.ts
class Metrics {
private counters = new Map<string, number>();
private histograms = new Map<string, number[]>();
increment(name: string, value = 1): void {
this.counters.set(name, (this.counters.get(name) || 0) + value);
}
recordDuration(name: string, durationMs: number): void {
if (!this.histograms.has(name)) {
this.histograms.set(name, []);
}
const values = this.histograms.get(name)!;
values.push(durationMs);
// Keep last 1000 values
if (values.length > 1000) {
values.splice(0, values.length - 1000);
}
}
getSnapshot(): Record<string, unknown> {
const snapshot: Record<string, unknown> = {};
for (const [name, count] of this.counters) {
snapshot[name] = count;
}
for (const [name, values] of this.histograms) {
if (values.length === 0) continue;
const sorted = [...values].sort((a, b) => a - b);
snapshot[`${name}_p50`] = sorted[Math.floor(sorted.length * 0.5)];
snapshot[`${name}_p95`] = sorted[Math.floor(sorted.length * 0.95)];
snapshot[`${name}_p99`] = sorted[Math.floor(sorted.length * 0.99)];
snapshot[`${name}_count`] = values.length;
}
return snapshot;
}
}
export const metrics = new Metrics();
Instrumented Tool Handlers
import { metrics } from "./metrics.js";
function instrumentedTool(
server: McpServer,
name: string,
description: string,
schema: Record<string, z.ZodType>,
handler: (args: any) => Promise<any>
) {
server.tool(name, description, schema, async (args) => {
const start = Date.now();
metrics.increment(`tool.${name}.calls`);
try {
const result = await handler(args);
metrics.recordDuration(`tool.${name}.duration`, Date.now() - start);
if (result.isError) {
metrics.increment(`tool.${name}.errors`);
} else {
metrics.increment(`tool.${name}.success`);
}
return result;
} catch (error) {
metrics.increment(`tool.${name}.exceptions`);
metrics.recordDuration(`tool.${name}.duration`, Date.now() - start);
throw error;
}
});
}
// Expose metrics endpoint
app.get("/metrics", async (req, res) => {
res.json({
...metrics.getSnapshot(),
poolStats: await getPoolStats(), // getPoolStats is async; await it
cacheStats: cache.stats,
uptime: process.uptime(),
memory: process.memoryUsage(),
});
});
Key Metrics to Monitor
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| tool.*.duration_p95 | 95th percentile tool latency | >5 seconds |
| tool.*.errors | Tool error count | Error rate >5% |
| pool.waitingCount | Database connection contention | >0 sustained |
| memory.heapUsed | Memory consumption | >80% of limit |
| cache.hitRate | Cache effectiveness | <50% hit rate |
| active_connections | Current SSE/WS connections | Near max capacity |
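Thresholds like these can be evaluated in code against the snapshot from the metrics endpoint. A sketch (metric keys and limits are illustrative; it checks upper bounds only, so rate floors like a minimum hit rate need the inverse comparison):

```typescript
// Compare a flat metrics snapshot against per-metric upper limits
// and collect any breaches for alerting.
interface Alert {
  metric: string;
  value: number;
  limit: number;
}

function checkThresholds(
  snapshot: Record<string, number>,
  limits: Record<string, number>
): Alert[] {
  const alerts: Alert[] = [];
  for (const [metric, limit] of Object.entries(limits)) {
    const value = snapshot[metric];
    if (value !== undefined && value > limit) {
      alerts.push({ metric, value, limit });
    }
  }
  return alerts;
}
```

Run this on a timer or from an external poller and page only on sustained breaches, not single spikes.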
Horizontal Scaling
Scaling with Kubernetes
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: mcp-server
spec:
replicas: 3
selector:
matchLabels:
app: mcp-server
template:
metadata:
labels:
app: mcp-server
spec:
containers:
- name: mcp-server
image: your-registry/mcp-server:latest
ports:
- containerPort: 3001
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
livenessProbe:
httpGet:
path: /healthz
port: 3001
initialDelaySeconds: 10
periodSeconds: 30
readinessProbe:
httpGet:
path: /readyz
port: 3001
initialDelaySeconds: 5
periodSeconds: 10
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: mcp-server-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: mcp-server
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
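Autoscaling means pods are terminated routinely: Kubernetes sends SIGTERM before killing a container, so the server should fail its readiness probe and drain in-flight work first. A minimal sketch of the bookkeeping (names are illustrative; wiring it to `process.on("SIGTERM", ...)` and the `/readyz` handler is left out):

```typescript
// Track in-flight requests; on shutdown, stop reporting ready and
// wait for active work to finish, up to a deadline.
class ShutdownCoordinator {
  private inFlight = 0;
  private shuttingDown = false;

  begin(): void { this.inFlight++; }   // call at request start
  end(): void { this.inFlight--; }     // call in a finally block

  get ready(): boolean {
    return !this.shuttingDown;         // /readyz returns 503 when false
  }

  async drain(timeoutMs = 10_000): Promise<void> {
    this.shuttingDown = true;
    const deadline = Date.now() + timeoutMs;
    while (this.inFlight > 0 && Date.now() < deadline) {
      await new Promise((r) => setTimeout(r, 50));
    }
  }
}
```

On SIGTERM, call `drain()`, then close the pool and exit; keep the drain deadline below the pod's `terminationGracePeriodSeconds`.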
Memory Optimization
Streaming Large Responses
For tools that return large datasets, process data in chunks rather than loading everything into memory:
server.tool(
"export-data",
"Export large dataset",
{
table: z.string().describe("Table to export"),
format: z.enum(["json", "csv"]).default("json"),
},
async ({ table, format }) => {
// Validate `table` against an allowlist first; never interpolate
// raw user input into SQL
// Stream rows with a cursor instead of loading everything at once
const cursor = db.cursor(`SELECT * FROM ${table}`);
const chunks: string[] = [];
let rowCount = 0;
for await (const batch of cursor.batches(100)) {
chunks.push(
format === "csv"
? batchToCsv(batch)
: JSON.stringify(batch)
);
rowCount += batch.length;
// Limit output size
if (rowCount >= 10000) {
chunks.push(`\n... truncated at 10,000 rows`);
break;
}
}
return {
content: [{
type: "text",
text: chunks.join("\n"),
}],
};
}
);
Always cap the amount of data returned by tools. AI models have context limits, and sending a 10MB JSON response is wasteful. Implement pagination or truncation with a clear indicator that more data is available.
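A simple way to enforce this everywhere is a shared truncation helper applied to each tool's text output before returning it (a sketch; the default limit is illustrative):

```typescript
// Cap tool output and tell the model that more data exists,
// so it can narrow the query or paginate.
function capOutput(text: string, maxChars = 50_000): string {
  if (text.length <= maxChars) return text;
  return (
    text.slice(0, maxChars) +
    `\n... truncated (${text.length - maxChars} more characters). ` +
    `Narrow the query or request the next page.`
  );
}
```

Calling `capOutput(result)` in every handler gives consistent behavior, and the trailing message is what lets the model recover gracefully instead of silently working from partial data.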