Motion.dev MCP Benchmark Issue 01 · 2026-05-15
Benchmark · Coding Agents · Tool Use

When the tool isn't the story: four coding agents took the same Motion.dev prompts, and their behavior diverged before their speed did.

Codex was the fastest agent and the heaviest Motion.dev MCP user. Gemini CLI logged zero Motion.dev MCP calls and still won half the prompts. Availability and behavior, this benchmark argues, are not the same thing.

codex · 26:50 · GPT-5.4 (medium) · 31 calls
gemini-cli · 32:54 · Gemini 3.1 Pro Preview · 0 calls
claude-code · 42:51 · Sonnet 4.6 (high) · 16 calls
opencode-minimax-2.5 · 61:49 · MiniMax M2.5 · 4 calls

The cleanest way to break the comfortable story about Model Context Protocol tooling is to put two agents next to each other. On a battery of four Motion.dev animation prompts (a magnetic dock, an interactive SVG blob, a scroll-driven cinematic timeline, and a liquid menu), codex invoked Motion.dev-specific tools 31 times in total, with calls on every prompt, and finished the entire benchmark first, at 26m 50s. gemini-cli invoked them zero times on any prompt and still won two of the four sprints outright.

That contrast is the spine of this report. The benchmark was designed to measure how four coding agents — claude-code, codex, gemini-cli, and opencode-minimax-2.5 — handle Motion.dev-heavy tasks where the Motion.dev MCP server is available and named in the prompt. The presumed punchline was that MCP availability would translate into MCP usage, and MCP usage would translate into better outcomes. Neither half of that chain held cleanly.

The headline numbers do favor the MCP-enthusiast. Codex finished fastest overall and was the only agent to record Motion.dev MCP calls on all 4 of 4 prompts. Claude Code, second on tool-use breadth, invoked Motion.dev MCP on 3 of 4 prompts and finished third in total time. But the cleanest one-line summary — more MCP, more speed — falls apart the moment you look at Gemini CLI, which sat at the top of half the leaderboard with a tool ledger that reads as a column of zeros.

Figure 01 · Per-prompt completion time
The same prompt, different outcomes
Each prompt was sent to each agent. Bars show minutes to the first completed benchmark turn. Shorter is faster.
Source: motion-dev-mcp-benchmark.json · duration_ms (first completed turn)

The timeline prompt is where the field cracked open

Of the four prompts, the scroll-driven cinematic timeline produced the most dramatic spread. Codex finished it in 3m 24s. Claude Code took 20m 25s on the same instruction — six times longer — and called the Motion.dev MCP more than Codex did along the way (6 versus 5). This is the single clearest moment in the benchmark where MCP call count moves in the opposite direction of speed. The same prompt, the same target library, the same access to the same toolkit, and a 17-minute gap.
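The ratio and the gap quoted above follow directly from the two recorded times. A minimal check, using only figures printed in this report (the helper name `to_seconds` is ours for illustration, not a field in the benchmark file):

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert an 'Xm Ys' reading from this report into seconds."""
    return minutes * 60 + seconds


codex_timeline = to_seconds(3, 24)     # Codex on the timeline prompt
claude_timeline = to_seconds(20, 25)   # Claude Code on the same prompt

ratio = claude_timeline / codex_timeline
gap_seconds = claude_timeline - codex_timeline

print(f"ratio: {ratio:.1f}x")                                 # ratio: 6.0x
print(f"gap: {gap_seconds // 60}m {gap_seconds % 60:02d}s")   # gap: 17m 01s
```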

It is also the moment that should warn anyone tempted to write "MCP makes agents faster" on a slide. What the data supports is narrower: when Codex invoked Motion.dev tooling, it tended to do so efficiently. When Claude Code invoked it, the relationship between calls and wall time was less stable. The benchmark does not let us see why. It does let us see that the relationship is not linear.

"Saying 'use the Motion.dev MCP' in a prompt does not produce uniform agent behavior. On identical instructions, one agent invoked Motion.dev tools 31 times, another never invoked them."

The dock prompt was the real stress test

The MagneticDock — the context-rich prompt of the four — was the most expensive in the field. Its average completion time across agents was 13m 37s, the highest of the four. Yet it produced the most unconventional winner: claude-code took it in 7m 25s while logging zero Motion.dev MCP calls. Codex, in second on the dock, made 13 Motion.dev MCP calls — its single heaviest tool-use prompt of the benchmark — and still finished a minute behind. The hardest prompt was won by selective restraint, not by tool enthusiasm.
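The dock average and the one-minute margin can be reproduced from the per-agent dock times published in the results table of this report. This is a sketch over those rounded times, so the mean can differ from the report's figure by up to a second; the source file presumably carries millisecond precision.

```python
from statistics import mean

# Dock completion times (minutes, seconds) as printed in this report.
dock_times = {
    "claude-code": (7, 25),
    "codex": (8, 25),
    "gemini-cli": (14, 9),
    "opencode-minimax-2.5": (24, 27),
}

seconds = {agent: m * 60 + s for agent, (m, s) in dock_times.items()}

average = mean(list(seconds.values()))              # mean across the four agents
winner = min(seconds, key=seconds.get)              # fastest agent on the dock
margin = seconds["codex"] - seconds["claude-code"]  # second place minus first

print(f"average: {average / 60:.2f} min")   # average: 13.61 min (about 13m 37s)
print(f"winner: {winner}, margin over codex: {margin}s")
```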

Figure 02 · Motion.dev MCP usage
Same prompts. Wildly different tool ledgers.
Number of Motion.dev-specific tool calls recorded per agent per prompt. Cell intensity scales with call count; pale cells are zero.
Rows: prompts. Columns: claude-code, codex, gemini-cli, opencode-minimax-2.5.
Source: motion-dev-mcp-benchmark.json · motion_dev_mcp_calls

Tool-use behavior diverged before performance did

The benchmark prompts were identical across agents. The tool-use ledgers were not even close. codex logged Motion.dev calls on every prompt; claude-code on three of four; opencode-minimax-2.5 on one of four; gemini-cli on none. This is the part of the result most worth dwelling on, because it is the part that survives any concern about completion time. Even before we ask which agent was fastest, we have to register that the same instruction produced four meaningfully different tool strategies.

Figure 03 · Calls vs. clock
There is no clean line between tool use and speed
Each point is one prompt run. Horizontal axis: Motion.dev MCP calls. Vertical axis: completion time in minutes.
Series: codex, gemini-cli, claude-code, opencode-minimax-2.5.
Source: motion-dev-mcp-benchmark.json · per-record duration_ms × motion_dev_mcp_calls

Figure 04 · Leaderboard
Codex first, OpenCode last — by more than double
Total time per agent to finish all four prompts. OpenCode's run was 2.3× Codex's despite using Motion.dev MCP on only one prompt; Gemini CLI placed second with no recorded Motion.dev MCP usage at all.
Source: motion-dev-mcp-benchmark.json · aggregate_by_agent.total_duration_ms

Across the four prompts, Codex took the timeline outright and was the fastest agent in aggregate. Gemini CLI won the blob (2m 47s, the fastest single run in the benchmark) and the liquid menu (5m 14s). Claude Code won the dock. OpenCode running MiniMax M2.5 finished last on two of the four prompts (the dock and the liquid menu) and last overall by a wide margin: 61m 49s against Codex's 26m 50s, more than double the total.
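The per-prompt winners and the aggregate ordering can be recomputed from the completion times published in the results table of this report. The sketch below uses those rounded times, so totals may differ from the headline figures by a second; the dictionary layout and helper name are ours, not the schema of motion-dev-mcp-benchmark.json.

```python
# Completion times as "minutes:seconds", copied from this report's results table.
TIMES = {
    "dock":     {"claude-code": "7:25",  "codex": "8:25", "gemini-cli": "14:09", "opencode-minimax-2.5": "24:27"},
    "blob":     {"claude-code": "7:27",  "codex": "9:01", "gemini-cli": "2:47",  "opencode-minimax-2.5": "3:49"},
    "timeline": {"claude-code": "20:25", "codex": "3:24", "gemini-cli": "10:44", "opencode-minimax-2.5": "14:21"},
    "menu":     {"claude-code": "7:33",  "codex": "6:01", "gemini-cli": "5:14",  "opencode-minimax-2.5": "19:13"},
}


def to_seconds(mmss: str) -> int:
    """Parse 'M:SS' into seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# Fastest agent on each prompt.
winners = {
    prompt: min(row, key=lambda agent: to_seconds(row[agent]))
    for prompt, row in TIMES.items()
}

# Total time per agent across all four prompts.
totals: dict[str, int] = {}
for row in TIMES.values():
    for agent, mmss in row.items():
        totals[agent] = totals.get(agent, 0) + to_seconds(mmss)

leaderboard = sorted(totals, key=totals.get)  # fastest first
slowest_to_fastest = totals[leaderboard[-1]] / totals[leaderboard[0]]

print(winners)
for agent in leaderboard:
    print(f"{agent}: {totals[agent] // 60}m {totals[agent] % 60:02d}s")
print(f"slowest / fastest: {slowest_to_fastest:.1f}x")
```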

Figure 05 · Tool ledger
Gemini CLI runs with the smallest ledger overall
Total tool calls per agent across the four prompts, split into Motion.dev MCP calls and everything else (shell, file, task tools). Gemini's 38 total invocations are under half of every other agent's footprint — a leaner operational style, not just a different one.
Source: motion-dev-mcp-benchmark.json · aggregate_by_agent.{motion_dev_mcp_calls, total_tool_calls}

That spread suggests two separate axes are in play. One is integration depth: does the agent know to reach for Motion.dev tooling, and does it use the toolkit in a way that compresses work into fewer turns? Codex's profile fits that description. The other is something closer to baseline implementation throughput on familiar web-animation tasks. Gemini CLI's profile — fastest twice, zero recorded MCP usage, lowest total tool count at 38 — fits that one. The benchmark cannot tell us whether the resulting code is equivalently good. It can tell us that two very different approaches produced top-half results.

Figure 06 · Prompt difficulty
The dock was the heaviest lift
Average completion time per prompt across all four agents. The MagneticDock was the most expensive prompt in the benchmark; the blob was the lightest.
Source: motion-dev-mcp-benchmark.json · aggregate_by_prompt

What the benchmark does not say

This is a history-based benchmark. It compares what these four agents actually did, on this machine, in these working directories, on this day. It measures completion speed and tool-use patterns. It does not evaluate animation fidelity, code quality, visual polish, or how the produced components behave at runtime. A faster finish is not a better outcome by itself, and a heavier tool ledger is not a worse outcome by itself. The honest framing is that we now know how four agents chose to attack the same four prompts, and what that cost them in wall-clock time. The question of which dock actually feels good to use is unanswered here.

There are also normalization caveats. Codex names its Motion.dev tools without the conventional mcp__ prefix, so counts there come from a curated list of Motion-specific tool names. Claude Code's Motion.dev calls are namespaced cleanly. OpenCode's session timings come from session metadata rather than per-turn fields. Some runs had follow-up turns; only the first completed benchmark turn was counted. None of these shift the directional finding, but readers should know the rules of the count.
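The two timing rules described above can be sketched as follows. Only the field names `duration_ms`, `time_created` and `time_updated` appear in this report; the record shapes, the `status` field, the millisecond timestamp unit, and the sample values are assumptions made for illustration, not the benchmark's actual history format.

```python
from typing import Optional


def first_completed_duration_ms(turns: list[dict]) -> Optional[int]:
    """Return the duration of the first completed turn, ignoring follow-ups.

    Assumes each turn is a dict with a hypothetical 'status' field and a
    'duration_ms' field, listed in chronological order.
    """
    for turn in turns:
        if turn.get("status") == "completed":
            return turn["duration_ms"]
    return None  # no completed turn recorded


def session_duration_ms(session: dict) -> int:
    """Derive a duration from session metadata (the OpenCode case).

    Assumes both timestamps are epoch milliseconds.
    """
    return session["time_updated"] - session["time_created"]


# Made-up sample records, for illustration only.
example_turns = [
    {"status": "aborted", "duration_ms": 1_000},
    {"status": "completed", "duration_ms": 120_000},   # counted
    {"status": "completed", "duration_ms": 30_000},    # follow-up, ignored
]
example_session = {"time_created": 1_000_000, "time_updated": 1_229_000}

print(first_completed_duration_ms(example_turns))  # 120000
print(session_duration_ms(example_session))        # 229000
```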

The takeaway worth keeping is the one the data refuses to simplify: MCP availability and MCP behavior are not the same thing. Putting the toolkit in the room does not guarantee an agent will pick it up. The strongest single result here — Codex's 31 Motion.dev calls coexisting with the fastest total time — is real, and it points toward integration quality mattering. The strongest counterweight — Gemini CLI's two wins with a clean tool sheet — is also real, and it points toward the limits of judging an agent on its tool ledger alone.

See the actual outputs

Every cell below links to the static export of that agent's generated output for that prompt. Cell label = recorded completion time. Prompt winners are starred. OpenCode's dock is published at the agent root rather than in a sub-route, and its timeline run completed without producing a static export.

Prompt                                        claude-code   codex      gemini-cli   opencode-minimax-2.5
MagneticDock (dock · context-rich)            7m 25s ★      8m 25s     14m 09s      24m 27s
Interactive SVG Blob (blob · goal-oriented)   7m 27s        9m 01s     2m 47s ★     3m 49s
Cinematic Timeline (timeline · spec-driven)   20m 25s       3m 24s ★   10m 44s      14m 21s (no static export)
Liquid Menu (menu · visual-first)             7m 33s        6m 01s     5m 14s ★     19m 13s
★ Prompt winner on completion time.

Methodology and caveats

  • Models behind the agents. codex was running GPT-5.4 (medium reasoning); claude-code was running Claude Sonnet 4.6 (high reasoning); gemini-cli was running Gemini 3.1 Pro Preview; opencode-minimax-2.5 was running MiniMax M2.5. Both the agent harness and its underlying model shape what we measured here; "agent" is shorthand for the combination.
  • History-based, not lab-replay. Durations come from real session histories on a single machine, normalized to the first completed benchmark turn for each prompt.
  • Motion.dev MCP call counts are tool-name based. For Claude Code, only mcp__motion-dev__* calls. For OpenCode, only motion-dev_* calls. For Codex, Motion.dev tool names (e.g. search_motion_docs, get_component_api) are counted explicitly because Codex does not use an mcp__ prefix. For Gemini CLI, no Motion.dev-specific calls were recorded.
  • Generic discovery calls excluded. Tools like list_mcp_resources were not counted as Motion.dev MCP usage.
  • OpenCode durations are derived from session time_created → time_updated rather than per-turn fields, due to history format differences.
  • Output quality is not measured. This benchmark does not score animation fidelity, runtime behavior, or visual polish of the generated components.
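The tool-name counting rule in the list above can be sketched as a small classifier. The prefixes, the two Codex tool names and the excluded discovery tool are taken from this report; the Codex set here contains only those two examples rather than the full curated list, and the sample tool names in the usage lines are invented for illustration.

```python
CLAUDE_PREFIX = "mcp__motion-dev__"   # Claude Code namespacing, per this report
OPENCODE_PREFIX = "motion-dev_"       # OpenCode namespacing, per this report

# Only the two examples named in this report; the real curated list is longer.
CODEX_MOTION_TOOLS = {"search_motion_docs", "get_component_api"}

# Generic discovery calls are not counted as Motion.dev MCP usage.
EXCLUDED_TOOLS = {"list_mcp_resources"}


def is_motion_call(agent: str, tool_name: str) -> bool:
    """Decide whether one recorded tool call counts as Motion.dev MCP usage."""
    if tool_name in EXCLUDED_TOOLS:
        return False
    if agent == "claude-code":
        return tool_name.startswith(CLAUDE_PREFIX)
    if agent == "opencode-minimax-2.5":
        return tool_name.startswith(OPENCODE_PREFIX)
    if agent == "codex":
        return tool_name in CODEX_MOTION_TOOLS
    return False  # gemini-cli: no Motion.dev-specific calls were recorded


def count_motion_calls(agent: str, tool_names: list[str]) -> int:
    """Count Motion.dev MCP calls in a list of recorded tool names."""
    return sum(is_motion_call(agent, name) for name in tool_names)


# Invented sample histories, for illustration only.
print(count_motion_calls("claude-code", ["mcp__motion-dev__example", "Bash", "Read"]))            # 1
print(count_motion_calls("codex", ["search_motion_docs", "list_mcp_resources", "shell"]))         # 1
print(count_motion_calls("opencode-minimax-2.5", ["motion-dev_example", "motion-dev_example"]))   # 2
```

Keeping the Codex names in an explicit set mirrors the caveat above: without a shared prefix, any tool missing from the curated list would go uncounted.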

Source files: motion-dev-mcp-benchmark.json, motion-dev-mcp-benchmark.csv, motion-dev-mcp-data-story-report.md, motion-dev-mcp-executive-summary.md.

Written by Claude Opus 4.7 (1M context) via Claude Code, from the prompt and iteration notes recorded at PROMPT.md.