Connecting to the Paper Theory

Our empirical findings validate the core theorems and complexity claims made in the paper “A Trace-Language Theory of Agents”. Here is how the mathematical constructs in the paper directly correspond to the behavior observed in our experiments.

1. Theorem 2: Regular Conformance Closure

The central formal claim of the paper is Theorem 2:

Let $\mathcal{G}$ be an agent with $L(\mathcal{G})$ in Chomsky class $\mathcal{L}_i$ for some $i \in {0, 1, 2, 3}$. Let $\mathcal{V}$ be an agent with $L(\mathcal{V}) \in \mathcal{L}_3$ (a regular language). Define the validated composition $\mathcal{G} \parallel \mathcal{V}$ as the trace filter whose accepted traces are exactly those emitted by $\mathcal{G}$ and accepted by $\mathcal{V}$’s DFA. Then:
\[L(\mathcal{G} \parallel \mathcal{V}) = L(\mathcal{G}) \cap L(\mathcal{V}) \in \mathcal{L}_i\]

Empirical Validation

In our main experiment:

The generator $\mathcal{G}$ is an unrestricted Turing-complete machine (emulating a Type-0 LLM agent), prone to executing arbitrary steps and introducing code bugs.
The verifier $\mathcal{V}$ is implemented as a simple DFA transition matrix (trace_matrix.csv) and deterministic data validators, making it strictly Type-3 (regular).
Checking verifier conformance runs in linear time $O(n)$ relative to the trace length and consumes negligible resources ($O(1)$ space).

Our run metrics confirmed this:

DFA Verification Latency: $67.5 \text{ ms}$ average overhead.
LLM Judge Verification Latency: $14,650.0 \text{ ms}$ average.
Token Cost: The DFA consumed 0 tokens, while the LLM Judge consumed 11,720 tokens per trial.

This shows that constraining the generator using a Type-3 verifier does not elevate the complexity of the consumer side, preserving a bounded, linear-time verification budget.

2. Bounding the Expressive Power

The paper argues that when a Type-0 generator is bound by a Type-3 verifier, it prevents the generator’s execution trace from drifting into “pathological corners of $\Sigma^*$ that the consumer was not designed to cover.”

In our Ablation 1 (Verifier Components), when structural DFA checks were disabled (representing the generator running without regular language bounds):

Under a 15% transition error rate, the agent got stuck or generated out-of-order calls (e.g. attempting to train a model before datasets were loaded or preprocessed).
This led to a 0.0% completion/success rate for both the Control and Data-Only groups.
In contrast, when the regular language constraints were active (Full Verifier), the agent was successfully bound, yielding a 49.5% average score.

This shows that the regular-language envelope is not just a theoretical wrapper, but a necessary operational boundary to keep Type-0 generators convergent.

3. Auditable Verification vs. Black-Box Judges

Section VI-D of the paper explains why an Aho-Corasick or DFA verifier is superior to a black-box LLM Judge:

Class Containment: An LLM-as-judge is not a transparent regular recogniser, losing the $O(n)$ complexity guarantee.
Auditability: A DFA is mechanically derived, finite, and version-controlled.
Robustness: A DFA “cannot be talked into anything; it either matches or it does not.”

Our experiments directly illustrated this:

The LLM Judge suffered from false negatives, failing to catch buggy predictions or malformed CSV lengths, leading to 1.2 undetected errors per run and dropping overall accuracy to 35.2%.
The DFA Verifier caught 100% of errors (0.00 undetected errors), guiding the agent to repair every mistake.

4. Experiment 6: Trace-Language in Practice — NeuroGolf 2026

Experiment 6 bridges the gap between controlled simulation and real-world application. While Experiments 1–5 validate the theory in synthetic environments, Experiment 6 applies the same principles to a live Kaggle competition with 400+ participants.

Theorem 2 in the Wild

The formal claim of Theorem 2 — that intersecting a Type-0 generator’s language with a Type-3 verifier’s language preserves decidability — was tested under real conditions:

The Generator ($G$): An LLM writing Python code for a Kaggle notebook. This is the unrestricted Type-0 machine, capable of emitting arbitrary code, paths, and data transformations.
The Verifier ($V$): A 13-state DFA with 76 transitions, checking that the notebook’s trace follows the correct pipeline ordering and contains essential keywords. This is Type-3 (regular) — it cannot understand ONNX graphs, cost formulas, or task patterns.
The Composition $G \parallel V$: The notebook must pass DFA checks at three checkpoints. If any fails, the generator must repair the trace.

Empirical Outcome

Metric	Pre-Verifier (v5)	With DFA (v10)	Improvement
Competition Score	2739.27	4127.12	+50%
Tasks with Bundle Coverage	169/400	398/400	+135%
Datasets Discovered	1 (partial)	4 (all)	+300%
Submission Size	0.1 MB	0.74 MB	Within limit
DFA Checks Passing	N/A	All 3/3	✓

How the Theory Applied

The DFA verifier enforced three critical patterns derived from log analysis of top notebooks:

Pipeline Ordering: Discover bundles before building, build before verifying, verify before blending, blend before packaging. The notebook generator could not reorder steps without the DFA rejecting the trace.
Keyword-Aware Phases: The 29-keyword coverage set ensured each pipeline phase contained the right vocabulary — e.g., the optimization phase required dim_scrub, graph_rewrite, or fp16 keywords matching real notebook patterns.
Accepting States: The DFA defined SUBMITTED and PACKAGED as accepting states. Any trace that reached these states corresponded to a structurally valid submission — without the DFA ever needing to understand cost calculations or ONNX op semantics.

Significance for the Paper

Experiment 6 demonstrates that the trace-language framework is not just a theoretical construct but a practical tool for agent verification. The same DFA mechanisms that caught drop_table sequences in Experiment 5 and enforced pipeline ordering in Experiment 1 also guided a competition submission to a 50% improvement. This confirms that the Type-3 regular language envelope is a sufficient constraint for real-world agent tasks, not just synthetic benchmarks.