Connecting to the Paper Theory
Our empirical findings validate the core theorems and complexity claims made in the paper “A Trace-Language Theory of Agents”. Here is how the mathematical constructs in the paper directly correspond to the behavior observed in our experiments.
1. Theorem 2: Regular Conformance Closure
The central formal claim of the paper is Theorem 2:
Let $\mathcal{G}$ be an agent with $L(\mathcal{G})$ in Chomsky class $\mathcal{L}_i$ for some $i \in {0, 1, 2, 3}$. Let $\mathcal{V}$ be an agent with $L(\mathcal{V}) \in \mathcal{L}_3$ (a regular language). Define the validated composition $\mathcal{G} \parallel \mathcal{V}$ as the trace filter whose accepted traces are exactly those emitted by $\mathcal{G}$ and accepted by $\mathcal{V}$’s DFA. Then:
\[L(\mathcal{G} \parallel \mathcal{V}) = L(\mathcal{G}) \cap L(\mathcal{V}) \in \mathcal{L}_i\]
Empirical Validation
In our main experiment:
- The generator $\mathcal{G}$ is an unrestricted Turing-complete machine (emulating a Type-0 LLM agent), prone to executing arbitrary steps and introducing code bugs.
- The verifier $\mathcal{V}$ is implemented as a simple DFA transition matrix (
trace_matrix.csv) and deterministic data validators, making it strictly Type-3 (regular). - Checking verifier conformance runs in linear time $O(n)$ relative to the trace length and consumes negligible resources ($O(1)$ space).
Our run metrics confirmed this:
- DFA Verification Latency: $67.5 \text{ ms}$ average overhead.
- LLM Judge Verification Latency: $14,650.0 \text{ ms}$ average.
- Token Cost: The DFA consumed 0 tokens, while the LLM Judge consumed 11,720 tokens per trial.
This shows that constraining the generator using a Type-3 verifier does not elevate the complexity of the consumer side, preserving a bounded, linear-time verification budget.
2. Bounding the Expressive Power
The paper argues that when a Type-0 generator is bound by a Type-3 verifier, it prevents the generator’s execution trace from drifting into “pathological corners of $\Sigma^*$ that the consumer was not designed to cover.”
In our Ablation 1 (Verifier Components), when structural DFA checks were disabled (representing the generator running without regular language bounds):
- Under a 15% transition error rate, the agent got stuck or generated out-of-order calls (e.g. attempting to train a model before datasets were loaded or preprocessed).
- This led to a 0.0% completion/success rate for both the Control and Data-Only groups.
- In contrast, when the regular language constraints were active (Full Verifier), the agent was successfully bound, yielding a 49.5% average score.
This shows that the regular-language envelope is not just a theoretical wrapper, but a necessary operational boundary to keep Type-0 generators convergent.
3. Auditable Verification vs. Black-Box Judges
Section VI-D of the paper explains why an Aho-Corasick or DFA verifier is superior to a black-box LLM Judge:
- Class Containment: An LLM-as-judge is not a transparent regular recogniser, losing the $O(n)$ complexity guarantee.
- Auditability: A DFA is mechanically derived, finite, and version-controlled.
- Robustness: A DFA “cannot be talked into anything; it either matches or it does not.”
Our experiments directly illustrated this:
- The LLM Judge suffered from false negatives, failing to catch buggy predictions or malformed CSV lengths, leading to 1.2 undetected errors per run and dropping overall accuracy to 35.2%.
- The DFA Verifier caught 100% of errors (0.00 undetected errors), guiding the agent to repair every mistake.
4. Experiment 6: Trace-Language in Practice — NeuroGolf 2026
Experiment 6 bridges the gap between controlled simulation and real-world application. While Experiments 1–5 validate the theory in synthetic environments, Experiment 6 applies the same principles to a live Kaggle competition with 400+ participants.
Theorem 2 in the Wild
The formal claim of Theorem 2 — that intersecting a Type-0 generator’s language with a Type-3 verifier’s language preserves decidability — was tested under real conditions:
- The Generator ($G$): An LLM writing Python code for a Kaggle notebook. This is the unrestricted Type-0 machine, capable of emitting arbitrary code, paths, and data transformations.
- The Verifier ($V$): A 13-state DFA with 76 transitions, checking that the notebook’s trace follows the correct pipeline ordering and contains essential keywords. This is Type-3 (regular) — it cannot understand ONNX graphs, cost formulas, or task patterns.
- The Composition $G \parallel V$: The notebook must pass DFA checks at three checkpoints. If any fails, the generator must repair the trace.
Empirical Outcome
| Metric | Pre-Verifier (v5) | With DFA (v10) | Improvement |
|---|---|---|---|
| Competition Score | 2739.27 | 4127.12 | +50% |
| Tasks with Bundle Coverage | 169/400 | 398/400 | +135% |
| Datasets Discovered | 1 (partial) | 4 (all) | +300% |
| Submission Size | 0.1 MB | 0.74 MB | Within limit |
| DFA Checks Passing | N/A | All 3/3 | ✓ |
How the Theory Applied
The DFA verifier enforced three critical patterns derived from log analysis of top notebooks:
-
Pipeline Ordering: Discover bundles before building, build before verifying, verify before blending, blend before packaging. The notebook generator could not reorder steps without the DFA rejecting the trace.
-
Keyword-Aware Phases: The 29-keyword coverage set ensured each pipeline phase contained the right vocabulary — e.g., the optimization phase required
dim_scrub,graph_rewrite, orfp16keywords matching real notebook patterns. -
Accepting States: The DFA defined
SUBMITTEDandPACKAGEDas accepting states. Any trace that reached these states corresponded to a structurally valid submission — without the DFA ever needing to understand cost calculations or ONNX op semantics.
Significance for the Paper
Experiment 6 demonstrates that the trace-language framework is not just a theoretical construct but a practical tool for agent verification. The same DFA mechanisms that caught drop_table sequences in Experiment 5 and enforced pipeline ordering in Experiment 1 also guided a competition submission to a 50% improvement. This confirms that the Type-3 regular language envelope is a sufficient constraint for real-world agent tasks, not just synthetic benchmarks.