Agent Infrastructure
Infrastructure layer: context management, model serving, state, observability, security, scaling.
Agent Infrastructure — The Boring But Critical Layer
1. Context Window Management
Context Compression
| Resource | Approach | Key Result |
|---|
| LLMLingua (Microsoft) | Budget controller + iterative token compression | 2-4x compression, minimal degradation |
| Selective Context | Perplexity-based low-value token removal | 40% memory reduction, 2x content capacity |
| ICAE | LoRA-adapted encoder + frozen decoder | 4x compression into memory slots, ~1% extra params |
| SCA | Greedy KV cache compression | Retains distinctive vectors over similar ones |
Context Caching
| Provider | Mechanism | Savings |
|---|
| Anthropic | cache_control marker, 5-min TTL (extendable 1hr) | 90% cost reduction |
| Google Gemini | Implicit (auto) + Explicit (manual) caching | 90% discount |
| Vertex AI | Regional cache, TTL expiration, CMEK encryption | TTL-based |
| LiteLLM | Unified API across providers | Auto-translation of cache markers |
Infinite Context via Retrieval
| Resource | Approach | Scale |
|---|
| Unlimiformer (NeurIPS 2023) | kNN index over hidden states; sublinear query | Unlimited input |
| MemWalker | Tree-based navigation of long contexts | Query-driven |
| RAPTOR | Hierarchical clustering + recursive summarization | Flattened tree retrieval |
| Infini-attention | Compressive memory + local attention | 1M token passkey on 1B/8B models |
| EM-LLM | Bayesian surprise + graph boundary detection | 10M token retrieval |
2. Model Serving for Agents
Inference Optimization
| Resource | Approach | Result |
|---|
| NVIDIA Dynamo | agent_hints for latency sensitivity, priority KV eviction, speculative prefill | 3x TTFT on turn 2+ |
| StreamServe | Disaggregated prefill-decode + metric routing + adaptive speculation | 6.8x throughput |
| EAGLE-3 | SuffixDecoding, KV reuse, speculative tool execution | Agent loop acceleration |
Batching & Scheduling
| Resource | Approach | Result |
|---|
| Concur | AIMD-based cache-aware admission at agent granularity | Prevents middle-phase thrashing |
| Halo | DAG-based query plan optimization, adaptive batching, KV-cache sharing | 18.6x speedup |
| AdaServe | SLO-customized speculation depth, latency targets | 4.3x SLO violation reduction |
3. State Management
Durable Execution Patterns
| Pattern | Mechanism | Provider |
|---|
| Checkpoint/Replay | Atomic state snapshots, replay from checkpoint | AWS Lambda, Temporal |
| Sagas | Compensation-based rollback for multi-step workflows | AWS Step Functions |
| Event Sourcing | Append-only event log, rebuild state from events | Custom implementations |
Checkpointing Systems
| System | Key Feature |
|---|
| LangGraph | PostgreSQL/SQLite checkpointing with time-travel debugging |
| Temporal | Durable execution SDK: steps, waits, callbacks, 1-year max |
| CRIU | OS-level process checkpointing |
| Dapr | KEDA queue-depth autoscaling, actors for stateful agents |
4. Observability
Tracing Standards
| Standard | Scope | Status |
|---|
| OpenTelemetry GenAI | invoke_agent, execute_tool span types | Stabilizing |
| Agent Spec | Open standard for agent traces, framework-agnostic | Emerging |
| Tool | Strengths | Best For |
|---|
| LangSmith | Deep LangChain integration, prompt playground | LangGraph-heavy teams |
| Langfuse | Self-hosted, MIT license, framework-agnostic | General-purpose, self-hosting |
| Arize Phoenix | RAG evals, UMAP drift detection, local-first | RAG-first teams, ML monitoring |
| Helicone | Simple integration, cost tracking | Lightweight observability |
5. Security Infrastructure
Prompt Injection Defense
| Resource | Approach | Result |
|---|
| LlamaFirewall | BERT classifier + CoT auditing + static analysis | Multi-layer defense |
| AgentSentinel | Real-time system-level tracing, event dependency analysis | CUA defense |
| Multi-Agent Defense Pipeline | Chain-of-agents + coordinator | 100% ASR mitigation (0% from 30%) |
| Prompt Fencing | Cryptographically signed prompt segments | Verifiable trust boundaries |
| OpenClaw Isolation | Two-agent privilege separation + JSON formatting | 323x improvement over baseline |
| MeshAI Detection | Proxy-layer detection, 15+ patterns | <2ms overhead |
Sandboxing
| Resource | Approach |
|---|
| ceLLMate | Browser-level HTTP interception; agent-agnostic Chrome extension |
| AgentBound | Android-style permission model for MCP servers; 80.9% auto-policy accuracy |
| Progent | Programmable privilege control DSL; 0% attack success |
| ClawLess | BPF syscall interception + formal security model |
Multi-Tenant Security
| Pattern | Mechanism |
|---|
| Row-Level Security | Tenant ID on every row, query filters |
| Virtual Keys | Per-tenant API keys, AES-256-GCM encryption, hash-based lookup |
| Hierarchical Budgets | org → team → user → key budget chains |
| MicroVM Isolation | Per-tenant Firecracker VMs, default-deny egress |
6. Scaling
Architecture Patterns
| Pattern | When to Use | Example |
|---|
| Work queues | Async task processing, priority ordering | Redis/Kafka-backed queues |
| Agent pools | Warm workers, scale-to-N | Docker Compose, K8s deployments |
| Slot-based | Fixed capacity, backpressure | PostgreSQL queue, Consul discovery |
| Actor model | Stateful agents, message-passing | Dapr actors for stateful agents |
Cold Start Mitigation
| Technique | Result |
|---|
Warm containers (warm_containers=N, min_containers=N) | Prevents scale-to-zero latency |
| Speculative prefill (NVIDIA Dynamo) | Pre-warm KV cache for predicted next-turn prefix; 3x TTFT |
| Shared KV cache (Halo) | Cache sharing across operators in same workflow |
7. Replay & Debugging
| Tool | Feature |
|---|
| AgentOptics | Fork at failure, replay with fix, AI diagnosis |
| Cogitator | Checkpoint replay, deterministic vs live, trace comparison |
| agent-replay | SQLite-powered CLI, hallucination detection, AI root cause |
| AgentReplay | Local-only, time-scrub slider, DAG-based trace storage |
Quick Reference: Key Trade-offs
| Problem | Solution | Trade-off |
|---|
| Context window limits | Retrieval (Unlimiformer, RAPTOR) | Latency vs compression quality |
| Prefix redundancy | Prompt caching | Cache invalidation complexity |
| Decode latency | Speculative decoding | Extra compute on rejection |
| Agent isolation | Privilege separation | Cross-agent communication overhead |
| Multi-tenant cost | Virtual keys + budgets | Cache namespace collision |
| KV cache eviction | Priority-based + agent hints | Tuning complexity |
| Cold starts | Warm containers / speculative prefill | Idle resource cost |