When the tool isn't the story: four coding agents took the same Motion.dev prompts, and their behavior diverged before their speed did.
Codex was the fastest agent and the heaviest Motion.dev MCP user. Gemini CLI logged zero Motion.dev MCP calls and still won half the prompts. Availability and behavior, this benchmark argues, are not the same thing.
The cleanest way to break the comfortable story about Model Context Protocol tooling is to put two agents next to each other. On a battery of four Motion.dev animation prompts (a magnetic dock, an interactive SVG blob, a scroll-driven cinematic timeline, and a liquid menu), Codex invoked Motion.dev-specific tools 31 times, with calls on every prompt, and finished the entire benchmark first, at 26m 50s. Gemini CLI did not invoke them on any prompt and still won two of the four sprints outright.
That contrast is the spine of this report. The benchmark was designed to measure how four coding agents — claude-code, codex, gemini-cli, and opencode-minimax-2.5 — handle Motion.dev-heavy tasks where the Motion.dev MCP server is available and named in the prompt. The presumed punchline was that MCP availability would translate into MCP usage, and MCP usage would translate into better outcomes. Neither half of that chain held cleanly.
The headline numbers do favor the MCP-enthusiast. Codex finished fastest overall and was the only agent to record Motion.dev MCP calls on all 4 of 4 prompts. Claude Code, second on tool-use breadth, invoked Motion.dev MCP on 3 of 4 prompts and finished third in total time. But the cleanest one-line summary — more MCP, more speed — falls apart the moment you look at Gemini CLI, which sat at the top of half the leaderboard with a tool ledger that reads as a column of zeros.
The timeline prompt is where the field cracked open
Of the four prompts, the scroll-driven cinematic timeline produced the most dramatic spread. Codex finished it in 3m 24s. Claude Code took 20m 25s on the same instruction, six times as long, and called the Motion.dev MCP more often than Codex did along the way (6 calls versus 5). This is the single clearest moment in the benchmark where MCP call count moves in the opposite direction from speed. The same prompt, the same target library, the same access to the same toolkit, and a 17-minute gap.
It is also the moment that should warn anyone tempted to write "MCP makes agents faster" on a slide. What the data supports is narrower: when Codex invoked Motion.dev tooling, it tended to do so efficiently. When Claude Code invoked it, the relationship between calls and wall time was less stable. The benchmark does not let us see why. It does let us see that the relationship is not linear.
The dock prompt was the real stress test
The MagneticDock — the context-rich prompt of the four — was the most expensive in the field. Its average completion time across agents was 13m 37s, the highest of the four. Yet it produced the most unconventional winner: claude-code took it in 7m 25s while logging zero Motion.dev MCP calls. Codex, in second on the dock, made 13 Motion.dev MCP calls — its single heaviest tool-use prompt of the benchmark — and still finished a minute behind. The hardest prompt was won by selective restraint, not by tool enthusiasm.
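The per-prompt average quoted above can be rechecked from the recorded completion times. What follows is a minimal sketch, assuming the times shown in the article's outputs table; the helper names (`to_seconds`, `fmt`, `average_by_prompt`) are illustrative and not part of the benchmark harness. Because the published times are rounded to whole seconds, the dock average comes out at 13m 36.5s, which rounds to the 13m 37s cited.

```python
# Completion times copied from the article's outputs table.
# Column order: claude-code, codex, gemini-cli, opencode-minimax-2.5.
TIMES = {
    "dock": ["7m 25s", "8m 25s", "14m 09s", "24m 27s"],
    "blob": ["7m 27s", "9m 01s", "2m 47s", "3m 49s"],
    "timeline": ["20m 25s", "3m 24s", "10m 44s", "14m 21s"],
    "menu": ["7m 33s", "6m 01s", "5m 14s", "19m 13s"],
}


def to_seconds(label: str) -> int:
    """Convert a label such as '7m 25s' into whole seconds."""
    minutes, seconds = label.split()
    return int(minutes.rstrip("m")) * 60 + int(seconds.rstrip("s"))


def fmt(total_seconds: float) -> str:
    """Format seconds as 'Xm YYs', rounding half up to the nearest second."""
    whole = int(total_seconds + 0.5)
    return f"{whole // 60}m {whole % 60:02d}s"


def average_by_prompt(times: dict) -> dict:
    """Mean completion time in seconds for each prompt."""
    return {
        prompt: sum(to_seconds(label) for label in labels) / len(labels)
        for prompt, labels in times.items()
    }


averages = average_by_prompt(TIMES)
# Print prompts from slowest to fastest average.
for prompt, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{prompt}: {fmt(avg)}")
```

Sorting by average puts the dock first, consistent with the claim that it was the most expensive prompt in the field.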
Tool-use behavior diverged before performance did
The benchmark prompts were identical across agents. The tool-use ledgers were not even close. Codex logged Motion.dev calls on every prompt; Claude Code on three of four; OpenCode with MiniMax 2.5 on one of four; Gemini CLI on none. This is the part of the result most worth dwelling on, because it is the part that survives any concern about completion time. Even before we ask which agent was fastest, we have to register that the same instruction produced four meaningfully different tool strategies.
Across the four prompts, Codex took the timeline outright and was the fastest agent in aggregate. Gemini CLI won the blob (2m 47s, the fastest single run in the benchmark) and the liquid menu (5m 14s). Claude Code won the dock. OpenCode running MiniMax 2.5 finished last on two of the four prompts (the dock and the menu) and last overall by a wide margin: 61m 49s against Codex's 26m 50s, more than double the total.
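The aggregate ranking and the per-prompt winners can be recomputed from the same recorded times. This is a sketch under stated assumptions: the values are the whole-second times from the article's outputs table, so the sums can land a second away from the quoted totals (presumably because the originals carried sub-second precision), and the function names are illustrative.

```python
AGENTS = ("claude-code", "codex", "gemini-cli", "opencode-minimax-2.5")

# (minutes, seconds) per agent, in AGENTS order, from the outputs table.
RUNS = {
    "dock": [(7, 25), (8, 25), (14, 9), (24, 27)],
    "blob": [(7, 27), (9, 1), (2, 47), (3, 49)],
    "timeline": [(20, 25), (3, 24), (10, 44), (14, 21)],
    "menu": [(7, 33), (6, 1), (5, 14), (19, 13)],
}


def seconds(run: tuple) -> int:
    """Convert a (minutes, seconds) pair into whole seconds."""
    minutes, secs = run
    return minutes * 60 + secs


def totals(runs: dict) -> dict:
    """Total seconds per agent across all prompts."""
    out = dict.fromkeys(AGENTS, 0)
    for per_agent in runs.values():
        for agent, run in zip(AGENTS, per_agent):
            out[agent] += seconds(run)
    return out


def winners(runs: dict) -> dict:
    """Fastest agent for each prompt."""
    return {
        prompt: min(zip(AGENTS, per_agent), key=lambda pair: seconds(pair[1]))[0]
        for prompt, per_agent in runs.items()
    }


agent_totals = totals(RUNS)
for agent, total in sorted(agent_totals.items(), key=lambda kv: kv[1]):
    print(f"{agent}: {total // 60}m {total % 60:02d}s")
print(winners(RUNS))

# Ratio of the slowest total to the fastest total.
ratio = agent_totals["opencode-minimax-2.5"] / agent_totals["codex"]
print(f"slowest / fastest = {ratio:.2f}")
```

With these inputs the ordering is codex, gemini-cli, claude-code, opencode-minimax-2.5, and the slowest total is roughly 2.3 times the fastest.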
That spread suggests two separate axes are in play. One is integration depth: does the agent know to reach for Motion.dev tooling, and does it use the toolkit in a way that compresses work into fewer turns? Codex's profile fits that description. The other is something closer to baseline implementation throughput on familiar web-animation tasks. Gemini CLI's profile — fastest twice, zero recorded MCP usage, lowest total tool count at 38 — fits that one. The benchmark cannot tell us whether the resulting code is equivalently good. It can tell us that two very different approaches produced top-half results.
What the benchmark does not say
This is a history-based benchmark. It compares what these four agents actually did, on this machine, in these working directories, on this day. It measures completion speed and tool-use patterns. It does not evaluate animation fidelity, code quality, visual polish, or how the produced components behave at runtime. A faster finish is not a better outcome by itself, and a heavier tool ledger is not a worse outcome by itself. The honest framing is that we now know how four agents chose to attack the same four prompts, and what that cost them in wall-clock time. The question of which dock actually feels good to use is unanswered here.
There are also normalization caveats. Codex names its Motion.dev tools without the conventional mcp__ prefix, so counts there come from a curated list of Motion-specific tool names. Claude Code's Motion.dev calls are namespaced cleanly. OpenCode's session timings come from session metadata rather than per-turn fields. Some runs had follow-up turns; only the first completed benchmark turn was counted. None of these shift the directional finding, but readers should know the rules of the count.
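The counting rules above could be expressed roughly as follows. This is only a sketch of the described normalization, not the benchmark's actual script: the server segment in `MCP_PREFIX`, the entries in `CURATED_MOTION_TOOLS`, the `ToolCall` record, and the sample log are all hypothetical, since the article does not publish the tool names or the log format.

```python
from dataclasses import dataclass

# Hypothetical: the article does not state the server name used in the prefix.
MCP_PREFIX = "mcp__motion__"

# Hypothetical stand-ins for the curated list of unprefixed Motion tool names.
CURATED_MOTION_TOOLS = frozenset({"motion_docs_lookup", "motion_spring_preview"})


@dataclass(frozen=True)
class ToolCall:
    turn: int  # 1-based index of the turn that issued the call
    name: str  # tool name as it appears in the agent's log


def is_motion_call(name: str) -> bool:
    """True for namespaced Motion MCP tools or curated unprefixed names."""
    return name.startswith(MCP_PREFIX) or name in CURATED_MOTION_TOOLS


def count_motion_calls(calls, benchmark_turn: int = 1) -> int:
    """Count Motion calls in the counted turn only; follow-up turns are ignored."""
    return sum(
        1 for call in calls
        if call.turn == benchmark_turn and is_motion_call(call.name)
    )


# Hypothetical log: two Motion calls in turn 1, one unrelated call, one follow-up.
sample_log = [
    ToolCall(1, "mcp__motion__motion_docs_lookup"),
    ToolCall(1, "motion_spring_preview"),
    ToolCall(1, "shell"),
    ToolCall(2, "motion_spring_preview"),
]
counted = count_motion_calls(sample_log)
print(counted)
```

Matching on either a prefix or a curated set keeps the two naming styles comparable, at the cost of having to maintain the curated list by hand.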
The takeaway worth keeping is the one the data refuses to simplify: MCP availability and MCP behavior are not the same thing. Putting the toolkit in the room does not guarantee an agent will pick it up. The strongest single result here — Codex's 31 Motion.dev calls coexisting with the fastest total time — is real, and it points toward integration quality mattering. The strongest counterweight — Gemini CLI's two wins with a clean tool sheet — is also real, and it points toward the limits of judging an agent on its tool ledger alone.
Key findings
- Codex finished first overall and used Motion.dev MCP the most. Total time 26m 50s, 31 Motion.dev MCP calls across 4 of 4 prompts. It is the only agent in the field with both attributes.
- Gemini CLI logged zero Motion.dev MCP calls and still won two of four prompts. It took the blob in 2m 47s and the menu in 5m 14s, the fastest single runs in their prompt brackets.
- The timeline prompt produced the largest gap in the benchmark. Codex: 3m 24s. Claude Code: 20m 25s. A roughly sixfold difference on identical instructions.
- The MagneticDock was the hardest prompt on average. Average completion time 13m 37s. Notably, it was won by Claude Code in 7m 25s with zero Motion.dev MCP calls.
- Tool-use behavior diverged sharply under identical prompts. Motion.dev MCP was invoked on 4/4 prompts by Codex, 3/4 by Claude Code, 1/4 by OpenCode/MiniMax 2.5, and 0/4 by Gemini CLI.
- OpenCode running MiniMax 2.5 was the slowest profile. Total 61m 49s, more than double Codex's total, and Motion.dev MCP usage was limited to a single prompt (the dock).
- Tool call count is not a proxy for speed. Codex's 97 total tool calls produced the fastest total time; OpenCode's 78 total tool calls produced the slowest.
- This benchmark measures speed and tool use, not quality. It does not score animation fidelity, visual correctness, or runtime behavior of the output.
See the actual outputs
Every cell below links to the static export of that agent's generated output for that prompt. Cell label = recorded completion time. Prompt winners are starred. OpenCode's dock is published at the agent root rather than in a sub-route, and its timeline run completed without producing a static export.
| Prompt | claude-code | codex | gemini-cli | opencode-minimax-2.5 |
|---|---|---|---|---|
| MagneticDock (dock · context-rich) | ★ 7m 25s | 8m 25s | 14m 09s | 24m 27s |
| Interactive SVG Blob (blob · goal-oriented) | 7m 27s | 9m 01s | ★ 2m 47s | 3m 49s |
| Cinematic Timeline (timeline · spec-driven) | 20m 25s | ★ 3m 24s | 10m 44s | 14m 21s (no static export) |
| Liquid Menu (menu · visual-first) | 7m 33s | 6m 01s | ★ 5m 14s | 19m 13s |