Gormes

Agent Infrastructure

Infrastructure layer: context management, model serving, state, observability, security, scaling.

Agent Infrastructure — The Boring But Critical Layer

1. Context Window Management

Context Compression

ResourceApproachKey Result
LLMLingua (Microsoft)Budget controller + iterative token compression2-4x compression, minimal degradation
Selective ContextPerplexity-based low-value token removal40% memory reduction, 2x content capacity
ICAELoRA-adapted encoder + frozen decoder4x compression into memory slots, ~1% extra params
SCAGreedy KV cache compressionRetains distinctive vectors over similar ones

Context Caching

ProviderMechanismSavings
Anthropiccache_control marker, 5-min TTL (extendable 1hr)90% cost reduction
Google GeminiImplicit (auto) + Explicit (manual) caching90% discount
Vertex AIRegional cache, TTL expiration, CMEK encryptionTTL-based
LiteLLMUnified API across providersAuto-translation of cache markers

Infinite Context via Retrieval

ResourceApproachScale
Unlimiformer (NeurIPS 2023)kNN index over hidden states; sublinear queryUnlimited input
MemWalkerTree-based navigation of long contextsQuery-driven
RAPTORHierarchical clustering + recursive summarizationFlattened tree retrieval
Infini-attentionCompressive memory + local attention1M token passkey on 1B/8B models
EM-LLMBayesian surprise + graph boundary detection10M token retrieval

2. Model Serving for Agents

Inference Optimization

ResourceApproachResult
NVIDIA Dynamoagent_hints for latency sensitivity, priority KV eviction, speculative prefill3x TTFT on turn 2+
StreamServeDisaggregated prefill-decode + metric routing + adaptive speculation6.8x throughput
EAGLE-3SuffixDecoding, KV reuse, speculative tool executionAgent loop acceleration

Batching & Scheduling

ResourceApproachResult
ConcurAIMD-based cache-aware admission at agent granularityPrevents middle-phase thrashing
HaloDAG-based query plan optimization, adaptive batching, KV-cache sharing18.6x speedup
AdaServeSLO-customized speculation depth, latency targets4.3x SLO violation reduction

3. State Management

Durable Execution Patterns

PatternMechanismProvider
Checkpoint/ReplayAtomic state snapshots, replay from checkpointAWS Lambda, Temporal
SagasCompensation-based rollback for multi-step workflowsAWS Step Functions
Event SourcingAppend-only event log, rebuild state from eventsCustom implementations

Checkpointing Systems

SystemKey Feature
LangGraphPostgreSQL/SQLite checkpointing with time-travel debugging
TemporalDurable execution SDK: steps, waits, callbacks, 1-year max
CRIUOS-level process checkpointing
DaprKEDA queue-depth autoscaling, actors for stateful agents

4. Observability

Tracing Standards

StandardScopeStatus
OpenTelemetry GenAIinvoke_agent, execute_tool span typesStabilizing
Agent SpecOpen standard for agent traces, framework-agnosticEmerging

Tool Comparison

ToolStrengthsBest For
LangSmithDeep LangChain integration, prompt playgroundLangGraph-heavy teams
LangfuseSelf-hosted, MIT license, framework-agnosticGeneral-purpose, self-hosting
Arize PhoenixRAG evals, UMAP drift detection, local-firstRAG-first teams, ML monitoring
HeliconeSimple integration, cost trackingLightweight observability

5. Security Infrastructure

Prompt Injection Defense

ResourceApproachResult
LlamaFirewallBERT classifier + CoT auditing + static analysisMulti-layer defense
AgentSentinelReal-time system-level tracing, event dependency analysisCUA defense
Multi-Agent Defense PipelineChain-of-agents + coordinator100% ASR mitigation (0% from 30%)
Prompt FencingCryptographically signed prompt segmentsVerifiable trust boundaries
OpenClaw IsolationTwo-agent privilege separation + JSON formatting323x improvement over baseline
MeshAI DetectionProxy-layer detection, 15+ patterns<2ms overhead

Sandboxing

ResourceApproach
ceLLMateBrowser-level HTTP interception; agent-agnostic Chrome extension
AgentBoundAndroid-style permission model for MCP servers; 80.9% auto-policy accuracy
ProgentProgrammable privilege control DSL; 0% attack success
ClawLessBPF syscall interception + formal security model

Multi-Tenant Security

PatternMechanism
Row-Level SecurityTenant ID on every row, query filters
Virtual KeysPer-tenant API keys, AES-256-GCM encryption, hash-based lookup
Hierarchical Budgetsorg → team → user → key budget chains
MicroVM IsolationPer-tenant Firecracker VMs, default-deny egress

6. Scaling

Architecture Patterns

PatternWhen to UseExample
Work queuesAsync task processing, priority orderingRedis/Kafka-backed queues
Agent poolsWarm workers, scale-to-NDocker Compose, K8s deployments
Slot-basedFixed capacity, backpressurePostgreSQL queue, Consul discovery
Actor modelStateful agents, message-passingDapr actors for stateful agents

Cold Start Mitigation

TechniqueResult
Warm containers (warm_containers=N, min_containers=N)Prevents scale-to-zero latency
Speculative prefill (NVIDIA Dynamo)Pre-warm KV cache for predicted next-turn prefix; 3x TTFT
Shared KV cache (Halo)Cache sharing across operators in same workflow

7. Replay & Debugging

ToolFeature
AgentOpticsFork at failure, replay with fix, AI diagnosis
CogitatorCheckpoint replay, deterministic vs live, trace comparison
agent-replaySQLite-powered CLI, hallucination detection, AI root cause
AgentReplayLocal-only, time-scrub slider, DAG-based trace storage

Quick Reference: Key Trade-offs

ProblemSolutionTrade-off
Context window limitsRetrieval (Unlimiformer, RAPTOR)Latency vs compression quality
Prefix redundancyPrompt cachingCache invalidation complexity
Decode latencySpeculative decodingExtra compute on rejection
Agent isolationPrivilege separationCross-agent communication overhead
Multi-tenant costVirtual keys + budgetsCache namespace collision
KV cache evictionPriority-based + agent hintsTuning complexity
Cold startsWarm containers / speculative prefillIdle resource cost