MOEvo: Multi-Objective Pareto Evolution of Coding Agent Harnesses

Anonymous authors

The harness around an LLM (its prompts, tools, error handling, and control flow) determines what the agent actually does. As agents are deployed on professional tasks, harnesses must balance competing objectives: task completion, safety, cost, and latency. These objectives often pull in different directions.

MOEvo evolves agent harnesses on all objectives simultaneously. NSGA-II Pareto selection maintains a diverse front of non-dominated variants rather than converging to a single winner. Mutations that improve any objective without degrading others survive. We evaluate on capability (GDPval professional tasks) and safety (ToolEmu). Starting from a minimal seed agent, MOEvo reaches 82.0% on GDPval, surpassing Codex CLI (75.3%) and Claude Code (70.3%), with safety preserved at 52.2%. A scalar baseline with twice the compute reaches only 61.7%.
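To make the selection rule concrete, the following is a minimal sketch of Pareto dominance and non-dominated filtering over two objectives, capability and safety, both treated as higher-is-better. The `Variant` type, the variant names, and the scores are hypothetical and chosen only for illustration. Full NSGA-II additionally ranks successive fronts and uses crowding distance to preserve diversity, which this sketch omits.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    """Hypothetical harness variant with two higher-is-better scores."""
    name: str
    capability: float
    safety: float


def dominates(a: Variant, b: Variant) -> bool:
    """True if `a` is no worse than `b` on both objectives and better on one."""
    no_worse = a.capability >= b.capability and a.safety >= b.safety
    strictly_better = a.capability > b.capability or a.safety > b.safety
    return no_worse and strictly_better


def pareto_front(population: List[Variant]) -> List[Variant]:
    """Keep every variant that no other variant dominates."""
    return [
        v for v in population
        if not any(dominates(other, v) for other in population)
    ]


# Illustrative scores only; these are not results from the paper.
population = [
    Variant("seed", 5.0, 50.0),
    Variant("capable", 80.0, 48.0),
    Variant("safe", 60.0, 55.0),
    Variant("dominated", 55.0, 49.0),
]
front = pareto_front(population)
print([v.name for v in front])
```

In this toy population, "capable" and "safe" both survive because neither beats the other on both objectives, which is the sense in which the front stays diverse instead of collapsing to a single winner.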

Evolution Across 8 Task Slices

The eight slices are non-overlapping, each with completely different professional tasks. The seed agent scores 4.9% on S1. By S5 the evolved MOEvo pro agent reaches 88.8%.

Figure: GDPval score per slice, shown against the seed score (before evolution) and the Codex CLI baseline (75.3%).

Leaderboard (Avg GDPval)

| Method | Avg GDPval (%) |
|---|---|
| MOEvo pro | 82.0 |
| MOEvo flash | 77.9 |
| Codex CLI | 75.3 |
| Claude Code | 70.3 |
| SkyD. flash | 62.6 |
| SkyD. pro | 61.7 |

How One Iteration Works

One iteration can be viewed as the population evolving on a capability-safety scatter plot, where each step adds or removes agents.

MOEvo evolution pipeline.
Full pipeline. The loop repeats across multiple iterations and task slices. The most balanced agent carries forward to each new slice, forcing generalization across different task distributions.
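The loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `evolve`, `mutate`, and `evaluate` are hypothetical names, the population is simply the current non-dominated set (NSGA-II's fixed population size and crowding-distance truncation are omitted), and "most balanced" is assumed here to mean the front member with the highest worst-case objective, a definition the source does not specify.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Agent = TypeVar("Agent")
Scores = Tuple[float, float]  # (capability, safety), both higher-is-better


def dominates(a: Scores, b: Scores) -> bool:
    """True if `a` is no worse everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def evolve(
    seed: Agent,
    slices: Sequence[str],
    mutate: Callable[[Agent], Agent],
    evaluate: Callable[[Agent, str], Scores],
    iterations: int = 5,
) -> Agent:
    """Evolve across task slices, carrying one agent into each new slice."""
    carried = seed
    for task_slice in slices:
        front: List[Tuple[Agent, Scores]] = [(carried, evaluate(carried, task_slice))]
        for _ in range(iterations):
            parents = [agent for agent, _ in front]
            candidates = parents + [mutate(agent) for agent in parents]
            scored = [(agent, evaluate(agent, task_slice)) for agent in candidates]
            # Keep only candidates that no other candidate dominates.
            front = [
                (agent, s) for agent, s in scored
                if not any(dominates(other, s) for _, other in scored)
            ]
        # Assumption: "most balanced" = best worst-case objective on the front.
        carried = max(front, key=lambda pair: min(pair[1]))[0]
    return carried


# Toy usage: an "agent" is just its own score pair, and each mutation
# adds one point of capability. Real mutations would edit the harness.
result = evolve(
    seed=(0.0, 50.0),
    slices=["S1", "S2"],
    mutate=lambda a: (a[0] + 1.0, a[1]),
    evaluate=lambda a, _slice: a,
)
print(result)
```

Re-evaluating the carried agent on each new slice is what exposes it to a different task distribution before any further mutation is accepted.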

Results

Development Slices

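All evolved methods share the same seed, the same LLM, and the same mutation model. MOEvo runs 5 iterations per slice; SkyDiscover, the scalar baseline, runs 10, twice the budget. Averaged over the dev slices, MOEvo pro reaches 82.0% and SkyDiscover pro 61.7%.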
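| Slice | Claude Code | Codex CLI | MOEvo flash | MOEvo pro | SkyD. flash | SkyD. pro |
|---|---|---|---|---|---|---|
| S1 | 69.4 | 73.9 | 82.7 | **87.2** | 79.1 | 74.4 |
| S2 | 73.2 | 74.9 | 80.5 | **86.4** | 66.1 | 74.5 |
| S3 | 70.5 | **76.8** | 76.7 | 73.9 | 62.8 | 58.5 |
| S4 | 75.7 | 73.4 | 78.3 | **79.0** | 69.2 | 53.9 |
| S5 | 67.2 | 77.1 | 85.9 | **88.8** | 56.8 | 54.3 |
| S6 | 65.8 | 75.8 | **79.8** | 79.1 | 64.3 | 69.8 |
| S7 | 75.6 | 78.4 | **80.0** | 76.5 | 48.4 | 52.3 |
| S8 | 64.7 | 73.0 | 58.9 | **85.0** | 53.9 | 55.9 |
| Avg | 70.3 | 75.3 | 77.9 | 82.0 | 62.6 | 61.7 |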

Table 2. GDPval task completion (%) on dev slices S1-S8. Bold = best per slice.

Pareto vs. Scalar: Controlled Ablation

Everything is identical except the selection operator. Pareto selection (MOEvo pro, 77.8%) gains 20.4pp over scalar selection (SkyDiscover pro, 57.4%) on GDPval, with half the compute. Safety is comparable across all methods (50-52%).
| Method | GDPval, all slices (%) | Safety, full ToolEmu (%) |
|---|---|---|
| Claude Code (unevolved) | 70.1 | 50.4 |
| Codex CLI (unevolved) | 74.6 | 50.5 |
| MOEvo pro (Pareto, 5 iter) | 77.8 | 52.2 |
| MOEvo flash (Pareto, 5 iter) | 74.4 | 50.2 |
| SkyDiscover pro (w=0.5, 10 iter) | 57.4 | 51.2 |
| SkyDiscover flash (w=0.5, 10 iter) | 53.5 | 51.0 |

Table 3. Ablation: GDPval averaged across all slices (S1-S8 + E1/E2). Safety = full 144-task ToolEmu.
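The difference between the two selection operators can be illustrated with a small sketch. It assumes the scalar baseline ranks variants by a weighted sum `w * capability + (1 - w) * safety`; the value w=0.5 comes from Table 3, but the weighted-sum form, the function names, and the toy scores are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Scores = Tuple[float, float]  # (capability, safety), both higher-is-better


def scalar_fitness(scores: Scores, w: float = 0.5) -> float:
    """Assumed scalarization: weighted sum of the two objectives."""
    capability, safety = scores
    return w * capability + (1.0 - w) * safety


def scalar_select(population: Dict[str, Scores], w: float = 0.5) -> str:
    """Scalar selection keeps the single variant with the highest fitness."""
    return max(population, key=lambda name: scalar_fitness(population[name], w))


def dominates(a: Scores, b: Scores) -> bool:
    """True if `a` is no worse everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_select(population: Dict[str, Scores]) -> List[str]:
    """Pareto selection keeps every variant that nothing else dominates."""
    return [
        name for name, s in population.items()
        if not any(dominates(other, s) for other in population.values())
    ]


# Illustrative scores only; these are not results from the paper.
population = {
    "A": (80.0, 40.0),  # strongest on capability
    "B": (60.0, 55.0),  # strongest on safety
    "C": (70.0, 52.0),  # compromise
    "D": (50.0, 45.0),  # worse than B and C on both objectives
}
print(scalar_select(population))
print(pareto_select(population))
```

Under these toy scores the scalar rule keeps only the compromise variant C, while Pareto selection also retains the capability specialist A and the safety specialist B as parents for later mutations.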

Held-Out Evaluation

The held-out slices contain fresh tasks never seen during evolution. The Pareto advantage persists: MOEvo pro 61.1% vs. SkyDiscover pro 40.0% (+21.1pp). SkyDiscover flash collapses to 17.0%. MOEvo still trails the commercial baselines on these tasks (Codex CLI 71.3%, Claude Code 69.4%), but it starts from a minimal seed and uses a small evolution budget (5 iterations per slice); scaling the budget is a direct path to closing the gap with these heavily engineered baselines.

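| Method | E1 (%) | E2 (%) | Avg (%) | ToolEmu (%) |
|---|---|---|---|---|
| Claude Code | 59.4 | 79.5 | 69.4 | 50.4 |
| Codex CLI | 64.5 | 78.1 | 71.3 | 50.5 |
| MOEvo pro | 52.7 | 69.5 | 61.1 | 52.2 |
| MOEvo flash | 50.5 | 71.1 | 60.8 | 50.2 |
| SkyD. pro (w=0.5) | 55.4 | 24.6 | 40.0 | 51.2 |
| SkyD. flash (w=0.5) | 14.0 | 20.0 | 17.0 | 51.0 |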

Table 4. Held-out evaluation. Pareto selection advantage persists: MOEvo pro outperforms SkyDiscover pro by +21.1pp. SkyDiscover flash collapses entirely.
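As a consistency check on the reported numbers, the all-slices GDPval column of Table 3 can be reproduced from the per-slice scores in Tables 2 and 4. The sketch below assumes that figure is an unweighted mean over S1-S8, E1, and E2; the source does not state the weighting, but the rounded values agree under that assumption. Dictionary and variable names are illustrative.

```python
from statistics import mean

# Per-slice GDPval scores (%) for S1-S8, copied from Table 2.
dev = {
    "Claude Code": [69.4, 73.2, 70.5, 75.7, 67.2, 65.8, 75.6, 64.7],
    "Codex CLI": [73.9, 74.9, 76.8, 73.4, 77.1, 75.8, 78.4, 73.0],
    "MOEvo flash": [82.7, 80.5, 76.7, 78.3, 85.9, 79.8, 80.0, 58.9],
    "MOEvo pro": [87.2, 86.4, 73.9, 79.0, 88.8, 79.1, 76.5, 85.0],
    "SkyD. flash": [79.1, 66.1, 62.8, 69.2, 56.8, 64.3, 48.4, 53.9],
    "SkyD. pro": [74.4, 74.5, 58.5, 53.9, 54.3, 69.8, 52.3, 55.9],
}

# Held-out GDPval scores (%) for E1 and E2, copied from Table 4.
held_out = {
    "Claude Code": [59.4, 79.5],
    "Codex CLI": [64.5, 78.1],
    "MOEvo flash": [50.5, 71.1],
    "MOEvo pro": [52.7, 69.5],
    "SkyD. flash": [14.0, 20.0],
    "SkyD. pro": [55.4, 24.6],
}

# Assumption: "all slices" is the unweighted mean over all ten slices.
all_slices = {m: round(mean(dev[m] + held_out[m]), 1) for m in dev}

# Pareto-vs-scalar gaps quoted in the text, in percentage points.
ablation_gap = round(all_slices["MOEvo pro"] - all_slices["SkyD. pro"], 1)
held_out_gap = round(mean(held_out["MOEvo pro"]) - mean(held_out["SkyD. pro"]), 1)

for method, score in all_slices.items():
    print(f"{method}: {score}")
print(f"ablation gap: {ablation_gap}pp, held-out gap: {held_out_gap}pp")
```

Under this assumption the script recovers the Table 3 GDPval column as well as the 20.4pp ablation gap and the 21.1pp held-out gap quoted in the text.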