Motion.dev MCP Benchmark · Issue 02 · 2026-05-26
Field Study · Coding Agents · MCP Productivity

The Iteration Premium

Three coding agents were asked to build the same multi-day animated portfolio. The story isn't who finished — it's how MCP servers compress the cost of every loop after the first draft.

  • Active work: 5h 49m (12h 48m of idle wait removed from raw spans)
  • Feedback turns: 35/38 (92% of all prompts were follow-ups)
  • Tool calls: 1,165 (logged across the iterative window)
  • Browser MCP: 341 (verification dominated the late loop)

A coding agent's first answer is the part everyone benchmarks. The part that decides whether the work ships is everything that happens after — the layout that's almost right, the scroll handler that jitters on the third reload, the SVG path that drifts a few pixels off its label. This study tracks three agents through that second half on the same multi-day build, and the story it tells is small and useful: where MCP servers were available and used, each loop got cheaper; where they weren't, the loop just kept costing what it cost.

The brief was the same in every session: build an <ImmersivePortfolio /> React component using Motion.dev, built up over multiple turns (hero, displaced background, project grid, scroll-driven process timeline, draggable clients carousel, and a final integration pass). Across codex, claude-code, and gemini-cli, the same prompts produced 38 turns, 35 of which were follow-ups: bug reports, alignment fixes, scroll fidelity complaints, screenshot-driven debugging. That is the shape this work has in real product engineering, and it's the shape MCP servers either help with or don't.

End-of-session builds

Each link opens that agent's deployed build at the point where its iterative session ended. These aren't finished products. They show what the agents were capable of converging on inside the prompt set used here — every further refinement, polish pass, or feature addition is bounded by the creativity of the prompts a user is willing to write. The static prompts story is at agents-comparison.html; the build prompts themselves are recorded in PROMPT.md, and the verbatim iterative prompt archive is in iterative-prompts.md.

Agent                                             Active work                   Prompt turns  Final build
codex 019e5d97 · GPT-5.4 (medium)                 1h 1m 30s (+ 8h 27m idle)     14            Open build →
claude-code 0fbef8f3 · Sonnet 4.6 (high)          2h 41m 42s (+ 1h 48m idle)    12            Open build →
gemini-cli 410aac00 · Gemini 3.1 Pro Preview      2h 5m 59s (+ 2h 35m idle)     12            Open build →

The work was never one-shot

Of 38 prompt turns across the three sessions, only 3 were initial build requests — one per agent. The remaining 35 turns were either feature extensions ("now add the carousel"), corrective bug reports ("the timeline is misaligned with the cards"), or screenshot-attached debugging ("the text is overlapping"). Two turns explicitly asked the agent to verify a fix using Playwright MCP. One referenced a PNG attachment of a broken layout.
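The ratio above is plain arithmetic on the per-agent prompt-turn counts listed under "End-of-session builds". A minimal sketch, assuming exactly one initial build request per agent (as stated above) and treating every later turn as feedback. The variable names are illustrative and are not fields from the metrics file:

```python
# Prompt turns per agent, as listed under "End-of-session builds".
PROMPT_TURNS = {"codex": 14, "claude-code": 12, "gemini-cli": 12}

# Assumption taken from the text: one initial build request per agent.
INITIAL_BUILDS_PER_AGENT = 1

total_turns = sum(PROMPT_TURNS.values())
feedback_turns = sum(n - INITIAL_BUILDS_PER_AGENT for n in PROMPT_TURNS.values())
feedback_share = feedback_turns / total_turns

print(f"{feedback_turns}/{total_turns} turns were follow-ups ({feedback_share:.0%})")
```

Under that assumption the sketch reproduces the 35/38 headline figure.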

That ratio is the most important fact in the dataset. The productivity question for a coding agent isn't "can it generate a working hero on the first try" — all three did. The productivity question is what happens on turns 6 through 12, when the user has seen the running interface, found something wrong, and needs the agent to absorb that feedback without a human walking it through the diff.

Figure 01 · The iterative tail
Tool activity doesn't taper — it spikes during corrective work
Tool calls recorded against each prompt turn. The build phase (turns 1–5) is moderate; the corrective tail (turns 6+) is where the heaviest tool spend happens. Gemini's session ends on a single 258-call turn fighting a sticky contact-section bug.
Source: motion-dev-mcp-iterative-session-metrics.json · prompts[].tool_calls

Active work vs wall clock

The raw session spans run anywhere from four and a half to nine and a half hours, but that is not how long the agents were actually working. Most of the difference is human idle time — the user walked away after a feature landed, came back forty minutes later, looked at it on their phone, sent the next prompt. After subtracting those waits, the combined active work time across all three sessions is 5h 49m, against 12h 48m of removed idle.

Figure 02 · Active vs idle
Most of the wall clock was the user not being there
Per-agent active work time (user present, agent responding) vs the idle gaps removed from the raw session span. Codex's nine-and-a-half-hour session was about an hour of real work and the rest waiting; Claude Code was the most consistently engaged.
Source: motion-dev-mcp-iterative-session-metrics.json · active_elapsed_ms, total_idle_removed_ms

That reframes how to read everything else. Codex's 9h 29m raw span was 1h 1m of active work and 8h 27m of waiting for the next user prompt. Gemini CLI's 4h 40m ran about evenly — 2h 6m active, 2h 35m idle. Claude Code had the most active work time at 2h 41m and the least idle removed. The productivity question is what each agent gets done during the active part.

The tool-call density inside that active time varies widely. Codex executed 350 tool calls in roughly 61 active minutes, about 5.7 calls per minute, the densest tool-use rate in the field. Gemini CLI ran at a similar density (~4.9 calls/min), but its calls were skewed toward shell rebuilds and file edits. Claude Code's profile is the opposite: more active minutes, more reading before patching, ~1.2 calls per minute.
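The density figures can be rechecked from numbers already in this story. The sketch below is illustrative only: the codex and gemini-cli call totals are quoted in the text, while the claude-code total of 198 is an assumption derived here as 1,165 minus 350 minus 617. The helper names are hypothetical.

```python
def to_minutes(hours: int, minutes: int, seconds: int) -> float:
    """Convert an h/m/s duration into fractional minutes."""
    return hours * 60 + minutes + seconds / 60


# (tool calls, active minutes) per agent.
# 198 for claude-code is derived (1165 - 350 - 617), not quoted in the text.
SESSIONS = {
    "codex": (350, to_minutes(1, 1, 30)),
    "claude-code": (198, to_minutes(2, 41, 42)),
    "gemini-cli": (617, to_minutes(2, 5, 59)),
}


def calls_per_minute(tool_calls: int, active_minutes: float) -> float:
    """Tool-call density over active (idle-removed) time."""
    return tool_calls / active_minutes


density = {
    agent: round(calls_per_minute(calls, minutes), 1)
    for agent, (calls, minutes) in SESSIONS.items()
}

for agent, rate in density.items():
    print(f"{agent}: {rate} calls/min")
```

Dividing by active time rather than raw wall clock matters here: against its 9h 29m raw span, codex's density would look more than nine times lower.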

The idle column has a second meaning worth keeping. Those gaps aren't wasted time for the human — they're time recovered. The async shape of an MCP-equipped coding agent means the user can hand off a feature, step into a meeting, ship something else, then come back and send the next prompt. Even the active span isn't fully claimed: while the agent is patching code, running builds, and verifying in the browser, the human is free to context-switch. The 5h 49m of active work tracked here was the agent's clock, not the user's. The compounding productivity story is that an MCP-equipped agent extends the user's working day without requiring more hours from the user.

"The 5 hours 49 minutes of active work was the agent's clock, not the user's. The 12 hours 48 minutes of idle was time the user got back."

Where MCP shows up in the loop

Across the three sessions, tool activity falls into six recognizable categories. The breakdown matters more than the total, because it tells you what kind of work each loop was doing.

Figure 03 · Workflow shape
Same task, three different operational mixes
Tool calls per agent, split by category. Motion.dev MCP and Browser MCP are the two specialized surfaces; shell, file edits, file reads, and planning are the generic ones. Higher share of specialized tooling = more leverage per call.
Source: motion-dev-mcp-iterative-session-metrics.json · tool_calls_by_category

The codex session is roughly balanced: a third of its calls go to browser verification, a third to shell, and the rest split between file edits, Motion.dev lookups, and planning. The claude-code session is read-heavy — the largest single category is Read, which means the agent spent a lot of its tool budget gathering context before patching. The gemini-cli session is the heaviest in raw volume and the most generic: a lot of shell, a lot of file edits, almost a hundred planning/topic-update calls, and not a single Motion.dev MCP call.

That last fact is worth pausing on. Every agent received the same instruction to use the Motion.dev MCP. The three sessions converted that instruction into 17, 6, and 0 Motion.dev tool calls respectively. The data isn't telling a story about which is "right" — it's telling a story about how unevenly MCP availability becomes MCP utilization, and what happens to the rest of the workflow when it doesn't.
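One way to put a number on "availability versus utilization" is the share of each session's tool calls that went to the Motion.dev MCP. This is a rough sketch, not a metric the study itself reports. It assumes "respectively" follows the order codex, claude-code, gemini-cli, and it uses a derived claude-code total of 198 calls (1,165 minus 350 minus 617).

```python
# Total tool calls per session; 198 is derived, not quoted in the text.
TOTAL_CALLS = {"codex": 350, "claude-code": 198, "gemini-cli": 617}

# Motion.dev MCP calls per session (17, 6, 0), assuming the order
# codex, claude-code, gemini-cli.
MOTION_MCP_CALLS = {"codex": 17, "claude-code": 6, "gemini-cli": 0}


def utilization(mcp_calls: int, total_calls: int) -> float:
    """Fraction of a session's tool calls that hit a given MCP server."""
    if total_calls == 0:
        return 0.0
    return mcp_calls / total_calls


motion_share_pct = {
    agent: round(100 * utilization(MOTION_MCP_CALLS[agent], total), 1)
    for agent, total in TOTAL_CALLS.items()
}

for agent, pct in motion_share_pct.items():
    print(f"{agent}: {pct}% of tool calls used the Motion.dev MCP")
```

Even in the session that used it most, the domain MCP accounts for only a few percent of all calls, which fits the later observation that it is a front-loaded tool.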

"Putting an MCP server in the room doesn't change how the agent works. The agents that picked it up traded shell turns for tool turns; the ones that didn't, paid for it in shell."

Figure 04 · Two MCP layers, two stories
Domain MCP diverged. Verification MCP converged.
Motion.dev MCP calls (the domain-specific layer) ranged from 17 in one session to 6 in another and none in the third. Browser/Playwright MCP calls (the verification layer) were heavy in all three sessions.
Source: motion-dev-mcp-iterative-session-metrics.json · motion_mcp_calls, browser_mcp_calls

The split is the cleanest single result in this study. Domain MCP (the Motion.dev-specific surface for component lookup, animation patterns, and API docs, served by motion-dev-mcp) was used most by one agent (17 calls), lightly by another (6), and not at all by the third. Browser MCP (Playwright navigation, screenshots, console reads, and evaluated DOM queries, served by playwright-mcp) was used by all three, and used more as the sessions wore on. By turn 8 in every session, verification had become the dominant tool category.

That suggests the two MCP layers do different jobs. Domain MCP is upstream: it shapes what code gets written. Browser MCP is downstream: it shapes how quickly the agent learns whether the code worked. The first is easier to opt out of; the second isn't, because without it the only feedback loop is the user looking at a screen and typing.

Figure 05 · Where the domain MCP showed up
Motion.dev MCP is a front-loaded tool
Per-turn Motion.dev MCP calls. In sessions that used it at all, usage concentrated in the early feature-build turns and trailed off as the work shifted to corrective debugging. By turn 6 onward, virtually all domain-MCP activity has stopped.
Source: motion-dev-mcp-iterative-session-metrics.json · prompts[].tool_calls_by_category.motion_mcp

The verification surface is the real productivity layer

If domain MCP belongs to the build phase, browser MCP belongs to the long tail. The cumulative Playwright/browser calls across the three sessions tell the story:

Figure 06 · Verification compounds
Browser MCP calls accumulate fastest exactly when the user gets picky
Cumulative Browser/Playwright MCP calls across each session's prompt turns. The slope steepens at the same turn in every session — the moment the spec shifts from "build the section" to "fix the section."
Source: motion-dev-mcp-iterative-session-metrics.json · prompts[].tool_calls_by_category.browser_mcp

The shape is the same in all three agents, even though their totals differ. There is a roughly flat early section while features are being added, and a steep climb starting around the first bug-report turn. The verification surface is what lets the agent close the loop on a visual problem without the user having to inspect, diff, and explain. It is, in a real sense, the part of the toolchain that lets an agent iterate.
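The "flat, then steep" shape can be located mechanically from per-turn counts. The sketch below shows one simple way to do that, assuming a list of browser-MCP calls per prompt turn. The sample series is invented for illustration and is not taken from any of the three sessions, and `steepening_turn` is a hypothetical helper, not part of the study's tooling.

```python
from itertools import accumulate


def cumulative(per_turn: list[int]) -> list[int]:
    """Running total of calls across prompt turns."""
    return list(accumulate(per_turn))


def steepening_turn(per_turn: list[int]) -> int:
    """1-based turn whose call count jumps most over the previous turn."""
    if len(per_turn) < 2:
        raise ValueError("need at least two turns to compare")
    jumps = [later - earlier for earlier, later in zip(per_turn, per_turn[1:])]
    # jumps[i] compares turn i+1 with turn i+2 (1-based), hence the +2.
    return jumps.index(max(jumps)) + 2


# Invented example: quiet feature-build turns, then a corrective tail.
example_per_turn = [0, 2, 1, 3, 2, 18, 25, 30]

print(cumulative(example_per_turn))
print(steepening_turn(example_per_turn))
```

Run against the real prompts[].tool_calls_by_category.browser_mcp series, the same calculation would show whether the inflection coincides with the first bug-report turn in each session.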

Walkthrough · codex iterative build
The portfolio as it actually behaves at the end of a session
A short capture of codex's final iterative output, plus two stills from the sections that took the most corrective turns: the scroll-driven process timeline and the contact flow. claude-code and gemini-cli builds are available via the links above; the walkthrough is intentionally limited to a single agent so the comparison stays about workflow, not visual polish.
Process section — the scroll-driven timeline that took the most corrective turns across all three sessions
Contact section — where Gemini CLI's session ended on a 258-call corrective turn
All three assets from ./codex/iterative/. Process section (left): the alignment / scroll-animation surface that drew repeated bug reports. Contact section (right): the sticky-scroll bug that dominated the late corrective tail.

The corrective phase is where MCP earns its keep

Reading the prompt logs in order makes the pattern obvious. The first five turns in every session are feature requests: build a hero, add a displaced image, add a project grid, add the scroll-driven process timeline, add the carousel. From turn six onward, the prompts shift in character entirely. They become things like "Fix process section, since the text isn't aligning with the cards. Use playwright mcp to verify the fix", or "The scroll jitter is still there", or screenshots attached with two-line captions.

Figure 07 · What the user actually asked for
Most of the work was not feature-building
Feedback turn breakdown per agent. "Initial build" is the opening prompt; everything else is iterative work. Bugfix turns outnumber feature-extension turns in two of three sessions, and total feedback turns sit between 11 and 13 per agent.
Source: motion-dev-mcp-iterative-session-metrics.json · feedback_kind_counts

That is the part of the work where the cost of each loop matters most. A feature can be specified in a paragraph; a bug usually can't. The user describes a symptom, and the agent has to find the cause. Without browser MCP, that means asking the user to confirm what's on screen. With browser MCP, the agent can navigate to the running build, read the DOM, take a screenshot, execute a probe, and converge on the fix in the same turn.

The error rate column in the dataset is the closest thing to a direct cost signal. It's heuristic, because it spans three different log formats, but the direction is consistent: codex ran at a normalized 4.29% tool-error rate, claude-code at 8.08%, and gemini-cli at 12.32%. The agent that leaned most on specialized MCP tooling also had the cleanest error profile per call.

Figure 08 · The friction tax
Specialized tooling, lower normalized error rate
Heuristic tool-error rate — failed tool calls divided by total tool calls. Different agents log failures differently, so this is directional rather than precise. The agent that leaned most on specialized MCP surfaces also had the lowest error rate per call.
Source: motion-dev-mcp-iterative-session-metrics.json · tool_error_rate
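The rate itself is a simple ratio, and the reported percentages can be sanity-checked against the call totals. In the sketch below, the failure counts are not taken from the source: they are back-computed from the reported rates and are shown only to confirm the numbers are mutually consistent. The claude-code total of 198 is likewise derived (1,165 minus 350 minus 617), and the function names are hypothetical.

```python
# Total tool calls per session; 198 is derived, not quoted in the text.
TOTAL_CALLS = {"codex": 350, "claude-code": 198, "gemini-cli": 617}

# Normalized tool-error rates as reported in the text, in percent.
REPORTED_RATE_PCT = {"codex": 4.29, "claude-code": 8.08, "gemini-cli": 12.32}


def tool_error_rate(failed_calls: int, total_calls: int) -> float:
    """Failed tool calls divided by total tool calls, as a percentage."""
    if total_calls == 0:
        return 0.0
    return 100 * failed_calls / total_calls


def implied_failures(rate_pct: float, total_calls: int) -> int:
    """Whole number of failures that a reported rate implies (an estimate)."""
    return round(rate_pct / 100 * total_calls)


implied = {
    agent: implied_failures(REPORTED_RATE_PCT[agent], total)
    for agent, total in TOTAL_CALLS.items()
}

for agent, failed in implied.items():
    rate = round(tool_error_rate(failed, TOTAL_CALLS[agent]), 2)
    print(f"{agent}: ~{failed} failed calls -> {rate}%")
```

Each implied count round-trips to the reported rate, so the percentages and totals are at least internally consistent. The caveat about differing log formats still applies to any comparison between agents.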

An MCP is only as good as its fit

One detail in the Claude Code session is worth keeping. Two of its Motion.dev MCP calls failed early: generate_motion_component returned "Unknown animation pattern: stagger", and get_component_api returned "No API documentation found for component: motion.span". That's not a problem with MCP in general — it's a problem with the specific Motion.dev MCP's coverage on the specific patterns the agent reached for. After those failures, Claude Code's domain-MCP usage thinned out and the session leaned more on file reads and shell builds.

That's the kind of detail that should land in any productivity story about MCP servers. Availability isn't enough; alignment matters too. A domain MCP that doesn't recognize the patterns or components an agent needs becomes a brief detour, not a productivity lever. The good news is the recovery pattern was fast — Claude Code didn't get stuck — but the leverage from that MCP was real only on the calls where the documentation matched the request.

"The strongest version of this story isn't 'MCP makes agents faster.' It's: when the MCP fits the task, every loop after the first one gets shorter."

Reading this as a productivity case

If you take this dataset as a small but real field study, three things hold up:

1. The iteration loop is the unit of productivity. Initial generation is solved well enough that the variance between agents on turn 1 is small. The variance shows up on turns 6 through 14, where the user is feeding back live impressions of a running UI. That's where MCP servers either help or don't.

2. The two MCP layers do different jobs. A domain MCP (Motion.dev here) compresses the cost of knowing — what components exist, what props they take, what patterns are idiomatic. A verification MCP (Playwright/browser here) compresses the cost of seeing — what the build actually looks like, whether the fix landed, whether the regression is real. Together they cover the two failure modes of iterative work: writing code from outdated guesses, and committing fixes without checking them.

3. The absence of a domain MCP doesn't stop the work — it shifts it. The Gemini CLI session finished the same six-section build the other two did. It just spent 617 tool calls doing it, more than the other two combined, with most of the extra going into shell rebuilds, file edits, and a much larger corrective tail on the contact section. That's the cost the domain MCP would have lowered — not zeroed out, but lowered.

Method and caveats

  • History-derived, not lab-replay. Numbers are extracted from each agent's local session log on a single machine, normalized into a shared schema. Source paths are in motion-dev-mcp-iterative-session-metrics.json.
  • MCP servers under test. The domain MCP was Abhishekrajpurohit/motion-dev-mcp (Motion.dev component lookup, animation patterns, API docs). The verification MCP was microsoft/playwright-mcp (browser navigation, screenshots, console reads, evaluated DOM queries). Both were available to every agent; how each agent reached for them is what this study measures.
  • Active work time excludes idle gaps. Wherever the last agent event in a turn was logged and the session sat waiting for the next user prompt, that gap was subtracted from the raw session span. active_elapsed_ms is the productivity unit in this story; the raw elapsed_ms is reported as wall-clock context but not used to compare agents.
  • Models behind the agents. codex ran GPT-5.4 (medium); claude-code ran Claude Sonnet 4.6 (high); gemini-cli ran Gemini 3.1 Pro Preview. "Agent" is shorthand for the harness + model combination.
  • Tool categorization is heuristic. Each agent logs tool calls in its own format; categories (motion_mcp, browser_mcp, shell, file_edit, file_read, planning) were normalized post-hoc.
  • Tool-error rate is approximate. Non-zero shell exits with explicit error markers were counted as failures. Comparing absolute rates across agents is directional, not exact.
  • Output quality is not measured. This study is about workflow and instrumentation, not visual polish or animation fidelity. Final builds are linked above so readers can judge for themselves.
  • The end-of-session builds are not finished products. They show the capability ceiling reached inside this specific prompt set. Anything beyond — additional sections, deeper polish, accessibility, performance work, content — is bounded only by the creativity of the prompts the user is willing to write. The study measures what these agents do with the prompts they were given, not what they could do with a different set.
  • Repeated corrective turns are part of the story. Several sessions included multiple prompts on the same bug (process-section alignment, contact-section scroll lock). The iterative tail is the productivity story, not an artifact to discount.
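The idle-gap rule in the caveats above can be sketched in a few lines. This is an illustration of the described rule, not the study's actual extraction code. The `Turn` structure and its timestamp fields are hypothetical, and only the output names (elapsed_ms, active_elapsed_ms, total_idle_removed_ms) mirror fields cited from the metrics file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    prompt_ms: int      # hypothetical: when the user prompt arrived
    last_event_ms: int  # hypothetical: last agent event logged in the turn


def split_active_idle(turns: list[Turn]) -> tuple[int, int, int]:
    """Return (elapsed_ms, active_elapsed_ms, total_idle_removed_ms).

    An idle gap is the wait between the last agent event of one turn
    and the arrival of the next user prompt.
    """
    if not turns:
        return 0, 0, 0
    elapsed = turns[-1].last_event_ms - turns[0].prompt_ms
    idle = sum(
        max(0, nxt.prompt_ms - cur.last_event_ms)
        for cur, nxt in zip(turns, turns[1:])
    )
    return elapsed, elapsed - idle, idle


# Invented timestamps: 10, 15, and ~8 minutes of work with waits between.
example_turns = [
    Turn(prompt_ms=0, last_event_ms=600_000),
    Turn(prompt_ms=3_000_000, last_event_ms=3_900_000),
    Turn(prompt_ms=4_000_000, last_event_ms=4_500_000),
]

elapsed_ms, active_elapsed_ms, total_idle_removed_ms = split_active_idle(example_turns)
print(elapsed_ms, active_elapsed_ms, total_idle_removed_ms)
```

Note that this rule attributes the whole of each turn to the agent, including any time the user spent watching it work, which is why the story treats active time as the agent's clock rather than the user's.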

Source files: motion-dev-mcp-iterative-session-metrics.json, motion-dev-mcp-iterative-story-handoff.md, iterative-prompts.md. Companion study: agents-comparison.html (static one-shot prompts).

Written by Claude Opus 4.7 (1M context) via Claude Code.