1. The “Cost Cliff” of Pure Paradigms
In the previous section, we distinguished between MCP (tool-based execution) and Code-as-MCP along two key dimensions:
(1) how actions are represented, and
(2) when validation occurs within the system.
These differences are often negligible at small scale. However, as system complexity increases, forcing all tasks into a single paradigm causes problems to surface in a distinctly non-linear fashion.
In practice, we repeatedly encountered two forms of cost cliffs.
The first occurs on the tool side.
When an agent is asked to perform tasks such as “list all anomalies in the dataset”, a standard API-based tool often returns thousands of entries in a single response.
If such results are serialized verbatim and injected into the conversational context, they can easily consume hundreds of thousands of tokens.
The consequences extend well beyond higher token costs:
- End-to-end latency increases sharply
- The context window becomes saturated
- The model begins to lose track of its initial constraints and instructions
Tools are fundamentally designed for precise control. Yet under high data density, they paradoxically become the primary source of contextual pressure.
The second cliff appears on the code side.
At the other extreme, using generated scripts to execute tasks that are essentially atomic—such as “select object #42”—constitutes severe overengineering.
Such tasks require little to no business logic, yet code-based execution introduces additional overhead:
- Code generation latency
- Runtime and dependency risks
- Increased likelihood of execution failures
As a result, the system sacrifices determinism and responsiveness in exchange for expressiveness it does not actually need.
The issue is therefore not which paradigm is more advanced, but rather that:
When applied outside their appropriate regimes, both paradigms exhibit abrupt cost blow-ups.
Based on this observation, we adopted a dual-track execution structure.
2. Architecture: A Dual-Track System with a Context Off-Ramp
In implementation, the system does not attempt to interleave the two paradigms.
Instead, execution paths are deliberately separated.
Track A: Deterministic Control Loop (Standard MCP)
This path is dedicated to:
- UI interactions
- State mutations
- Single-entity queries with small return sizes
It is characterized by strong constraints and a finite action set. All actions can be validated against schemas prior to execution.
The objective of this track is simple: fast, deterministic, and verifiable execution.
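The shape of Track A can be sketched in a few lines: a finite tool registry with per-tool schemas, and validation that runs before any execution. The tool names and schema format below are illustrative assumptions, not the system's actual API.

```python
# Sketch of Track A: every action must name a tool from a finite set, and
# its arguments are validated against a declared schema before execution.
# Tool names and the minimal type-based schemas are hypothetical.

TOOL_SCHEMAS = {
    "select_entity": {"entity_id": int},
    "set_state": {"entity_id": int, "state": str},
}

def validate_action(tool: str, args: dict) -> None:
    """Reject unknown tools, missing/extra arguments, and type mismatches."""
    schema = TOOL_SCHEMAS.get(tool)
    if schema is None:
        raise ValueError(f"unknown tool: {tool}")
    for key, expected_type in schema.items():
        if key not in args:
            raise ValueError(f"missing argument: {key}")
        if not isinstance(args[key], expected_type):
            raise TypeError(f"{key} must be {expected_type.__name__}")
    extra = set(args) - set(schema)
    if extra:
        raise ValueError(f"unexpected arguments: {extra}")

def execute_tool(tool: str, args: dict) -> dict:
    validate_action(tool, args)  # validation happens *before* execution
    # A real implementation would dispatch to the underlying API here.
    return {"tool": tool, "status": "ok", "args": args}

result = execute_tool("select_entity", {"entity_id": 42})
```

Because the action set is closed and the schemas are static, every failure mode here is caught deterministically before any state is touched.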
Track B: Generative Analysis Loop (Code-as-MCP)
The second path is reserved for:
- Large-scale data processing
- Aggregation, statistics, and visualization
- Multi-step analytical logic
Here, strong constraints are intentionally relaxed in favor of expressiveness.
Generated programs run inside a sandboxed environment and operate directly on data sources rather than the conversational context.
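A minimal sketch of Track B's execution model, under the assumption that the sandbox is an isolated subprocess: the generated script receives only a file path, so the raw rows never pass through the conversational context. The data layout and the "generated" script here are illustrative.

```python
# Sketch of Track B: a generated analysis script runs in a subprocess and
# reads an offloaded data file directly; only its small summary output
# returns to the agent. Data layout and script are illustrative assumptions.
import csv
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
data_path = workdir / "entities.csv"
with data_path.open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["entity_id", "value"])
    for i in range(100):
        writer.writerow([i, i * 1.5])

# A stand-in for model-generated code: it gets a path, never raw rows.
script = textwrap.dedent("""
    import csv, sys
    with open(sys.argv[1]) as f:
        rows = list(csv.DictReader(f))
    anomalies = [r for r in rows if float(r["value"]) > 140]
    print(len(anomalies))
""")
script_path = workdir / "analysis.py"
script_path.write_text(script)

proc = subprocess.run(
    [sys.executable, str(script_path), str(data_path)],
    capture_output=True, text=True, timeout=30, cwd=workdir,
)
print(proc.stdout.strip())
```

A production sandbox would add resource limits and filesystem isolation; the point of the sketch is only the data flow: file in, small result out.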
3. The Context Off-Ramp: When to Switch Tracks
The critical component of the architecture is not the dual-track structure itself, but the switching mechanism.
The system continuously monitors the data density of each observation.
When the output of a standard tool call approaches or exceeds a predefined token threshold, the orchestrator refrains from injecting the full result into the context.
Instead, it performs three actions:
- Abort injection: prevent large JSON payloads from entering the context
- Data offloading: write the result to shared storage (e.g., CSV or Parquet)
- Return a meta-observation: explicitly inform the agent of the data location and require subsequent processing via the code path
This step is not an optimization—it is a forced state transition.
At this point, the agent can no longer continue along the tool path and must switch to the code track.
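The three actions above can be sketched as a single guard in the orchestrator. The token estimator, the threshold value, and the meta-observation shape are illustrative assumptions rather than the system's exact implementation.

```python
# Sketch of the context off-ramp: small results are injected as-is; large
# results are written to shared storage and replaced by a meta-observation.
# TOKEN_LIMIT, the 4-chars-per-token heuristic, and the dict shapes are
# illustrative assumptions.
import json
import tempfile
from pathlib import Path

TOKEN_LIMIT = 2_000  # hypothetical budget for what may enter the context

def estimate_tokens(payload: str) -> int:
    """Crude heuristic: roughly four characters per token."""
    return len(payload) // 4

def offload_if_oversized(tool_name: str, result: list, storage_dir: str) -> dict:
    """Inject small results directly; replace large ones with a pointer."""
    payload = json.dumps(result)
    if estimate_tokens(payload) <= TOKEN_LIMIT:
        return {"type": "observation", "data": result}
    # Forced state transition: abort injection, offload to shared storage,
    # and hand the agent only a location plus an instruction.
    path = Path(storage_dir) / f"{tool_name}.json"
    path.write_text(payload)
    return {
        "type": "meta_observation",
        "data_path": str(path),
        "rows": len(result),
        "instruction": "Result exceeds the context budget; "
                       "process it via the code track.",
    }

store = tempfile.mkdtemp()
small = offload_if_oversized("entity_lookup", [{"id": 1}], store)
large = offload_if_oversized(
    "anomaly_scan",
    [{"id": i, "note": "x" * 60} for i in range(2_000)],
    store,
)
```

Note that the agent receives no partial data in the oversized case: the meta-observation leaves it no option but to continue on the code track.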
4. Experiments: Comparing Three Architectures on the Same Task
We evaluated the three approaches on a real digital twin workload:
Perform full anomaly detection across 10,000 entities and generate a PDF report.
We implemented three variants for comparison:
(A) pure Tool-based execution,
(B) pure Code-as-MCP execution, and
(C) the proposed hybrid architecture with a context off-ramp.
A. Pure Tool (Pure MCP)
- Total runtime: ~13 minutes
- ~11 minutes spent in AI analysis dialogue
- 7 tool calls, ~3 minutes total
- Token usage: a full analysis would require ~510,000 tokens without truncation; to avoid explosion, all API responses were truncated
- Result quality: due to severe context compression, anomaly detection accuracy was only ~23%
This confirms the earlier diagnosis: under high data density, tools become the dominant source of contextual pressure rather than an advantage for precise control.
B. Pure Code-as-MCP (Code as the Sole Paradigm)
- Total runtime: ~23 minutes
- Tool invocations: 15 rounds (mostly for controlling and recovering code execution)
- Result quality: anomaly detection accuracy ~85%
While the business outcome is substantially better, overall latency nearly doubles. A significant portion of time is spent on code generation and correction rather than I/O or tool execution.
C. Hybrid Architecture (Tool + Code + Context Off-Ramp)
- Total runtime: ~4.5 minutes (04:36:26 – 04:40:58), with a clear Tool → Code transition
- Key rounds:
  - R2: entities_keyword_search returned ~108,634 tokens → flagged as oversized, offloaded to CSV with only 100 pointer rows retained
  - R3: entities_browse refined output still reached ~512,675 tokens → offloaded again, storing 500 detailed rows
  - R4+: transition to execute_data_analysis, processing data inside the Python sandbox for ~3.5 minutes
- Token perspective:
- Raw output: 512,675 tokens
- System limit: 2,000 tokens
- Effective context load: compressed to a file-path string plus a short instruction, reducing the ~512K-token raw output to tens of tokens in context (the 2,000-token system limit alone already implies a ≈256× compression)
Latency breakdown:
- Data retrieval + I/O: ~5 seconds (read/write + CSV offload)
- Analysis and generation: ~3.5 minutes, entirely within the sandboxed code environment rather than the chat thread
These results align with expectations. In the hybrid architecture, I/O costs become negligible, and the dominant cost shifts to code generation and iteration—confined entirely to the data plane.
On the same 10,000-entity anomaly detection + report generation task:
- Pure Tool: 13 minutes, ~23% accuracy, forced truncation
- Pure Code: 23 minutes, ~85% accuracy, excessive latency
- Hybrid: ~4.5 minutes, ~90% accuracy, avoiding a 500K-token context explosion via a single off-ramp
5. Beyond Digital Twins: Where Else Does This Structure Apply?
Although this work is motivated by digital twins, the combination of Tool-based control plane, Code-based data plane, and a minimal switching rule represents a structural pattern rather than a domain-specific solution.
Similar task spectra appear in other domains:
Financial analysis
- High-frequency trading and risk controls resemble emergency shutdowns: they require strictly validated tools (trading APIs)
- Strategy backtesting and portfolio rebalancing resemble full anomaly scans: they require Code-as-MCP for large-scale computation
- A typical hybrid flow is “analyze holdings with Code, then issue a small number of verifiable Tool commands”
DevOps / SRE
- “Find errors in a 1GB log file” naturally fits Code (grep, scripts, pandas)
- “Restart a pod or scale a deployment” must use Tools (Kubernetes APIs) for safety and auditability
- The off-ramp pattern is analogous: log queries exceed context → offload to file → process via code → map results back to minimal control actions
General data-driven systems
Any system that simultaneously exhibits:
- Open, mutable state
- Decisions dependent on multi-round feedback
- Both low-data/high-risk operations and high-data/low-risk analysis
is likely to benefit from a Tool control plane + Code data plane + simple off-ramp structure, rather than relying exclusively on either paradigm.
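The routing rule implied by this pattern is deliberately simple. A minimal sketch, assuming the only inputs are expected data volume and whether the action mutates state (the threshold value and function signature are illustrative, not taken from the system described):

```python
# Sketch of the generic dispatch rule: low-data or state-mutating actions
# go to the Tool control plane; high-volume, read-only analysis goes to
# the Code data plane. The 500-row threshold is an illustrative assumption.

def choose_track(expected_rows: int, mutates_state: bool,
                 row_threshold: int = 500) -> str:
    """Return "tool" or "code" for a proposed action."""
    if mutates_state:
        return "tool"   # high-risk operations need validated, auditable tools
    if expected_rows > row_threshold:
        return "code"   # high data density belongs on the data plane
    return "tool"       # small, precise reads stay on the control loop
```

In practice a system may not know `expected_rows` up front, which is exactly why the off-ramp exists: the rule can be applied after the fact, once a tool result reveals its true size.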
Conclusion: Returning to System Structure
Across the three parts of this series, we describe an evolutionary process:
- We first abstract intuitive issues such as multi-round interaction and Tool vs. Code trade-offs into structural, action-level, and cost-level variables.
- Building on this abstraction, we present a concrete, executable hybrid architecture—dual-track execution with a context off-ramp—turning “when to use Tools vs. Code” into an explicit state transition.
- Finally, through empirical comparison on a 10,000-entity workload, we show that this architecture is not merely reasonable, but necessary:
- Pure Tool: 13 minutes, ~23% accuracy, forced truncation
- Pure Code: 23 minutes, ~85% accuracy, unacceptable latency
- Hybrid: ~4.5 minutes, ~90% accuracy, avoiding a 500K-token context explosion
The central message of this work is therefore not that Tools are better than Code, or vice versa, but that:
In open, feedback-driven systems with large variability in data scale, a single paradigm cannot simultaneously sustain both the control plane and the data plane. The viable approach is to acknowledge their structural differences, assign them to appropriate roles, and allow the agent to switch tracks at the precise moment when it becomes necessary.
Digital twins represent one of the most intuitive and high-impact instances of this structural necessity.