95 Tests in 3 Minutes with Parallel AI Agents
Zero backend coverage. I dispatched four subagents simultaneously. Three minutes later: 95 new tests, all green.
The test suite for MeetCal had 52 tests. They covered shared types, slot generation, validation, a few UI components. The backend — the part that actually moves money and blocks calendar time — had zero coverage. Three route files, one bot handler, a cron scheduler. Thousands of lines. No tests.
I knew what needed to happen. I did not want to sit down and write tests one file at a time for three hours.
Intent: What Zero Backend Coverage Actually Means
A scheduling app with no backend tests is not “mostly tested.” It’s a bet that nothing will break in the dark. Booking routes handle cancel, reschedule, RSVP, instant booking. Schedule routes manage share link generation and free tier limits. The bot handler fires notifications. Cron jobs clean up drafts and sync calendars.
These are the failure points users care about. Not “does the slot picker render” — “did my meeting actually get cancelled when I cancelled it.”
I’d been writing tests the normal way: open the file, understand the structure, write mocks, write assertions, run them, fix the mocks, repeat. For a module I wrote myself that’s 30-60 minutes. For a backend file I haven’t touched in a month it’s longer. Multiply by four modules and I’m looking at a half-day that produces no features.
The alternative I wanted to test: dispatch all four simultaneously and see what comes back.
Setup: Worktrees Are the Prerequisite
The thing that makes parallel agents non-catastrophic is worktree isolation. Without it, two agents editing the same directory will write to the same files, overwrite each other’s work, and corrupt the git state. With it, each agent gets its own filesystem view of the repo.
Claude Code’s --isolation worktree flag handles this automatically. Each subagent spins up in its own git worktree — a separate working directory linked to the same repository. Agent 1 writes bookings.test.ts to its worktree. Agent 2 writes schedules.test.ts to a different directory. No conflicts. When they finish, their changes get merged back.
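Under the hood, worktree isolation is plain git. A rough sketch of what the flag automates — repo location, branch names, and the demo identity below are all hypothetical, not MeetCal's actual setup:

```typescript
import { execSync } from "node:child_process";
import { mkdtempSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical throwaway repo standing in for the project.
const repo = mkdtempSync(join(tmpdir(), "meetcal-demo-"));
execSync("git init .", { cwd: repo });
execSync(
  'git -c user.name=demo -c user.email=demo@example.com commit --allow-empty -m "init"',
  { cwd: repo },
);

// One worktree per agent: separate working directories, one repository.
// Each `worktree add -b` gives an agent its own branch and checkout.
const agent1 = join(tmpdir(), `agent1-${process.pid}-${Date.now()}`);
const agent2 = join(tmpdir(), `agent2-${process.pid}-${Date.now()}`);
execSync(`git worktree add -b agent-1 "${agent1}"`, { cwd: repo });
execSync(`git worktree add -b agent-2 "${agent2}"`, { cwd: repo });
// Writes under agent1 never touch files under agent2; coordination
// only happens later, at merge time.
```

Each worktree directory contains a `.git` file pointing back at the shared repository, which is what lets the agents' branches merge cleanly afterward.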
The other prerequisite: existing test patterns. Agents don’t write mocks from first principles — they copy the style from whatever you point them at. MeetCal already had a bookings.test.ts skeleton with Supabase mocked via vi.fn() and a real Express server spun up on an ephemeral port. That file became the reference for every agent I dispatched.
Execution: Four Prompts, One Dispatch
I wrote four prompts in parallel and submitted them as simultaneous subagent tasks:
Agent 1 — booking routes (~949 lines of source): mock Supabase client, spin up Express on an ephemeral port, test the full CRUD surface plus cancel, reschedule, RSVP, and instant booking. Use the existing bookings.test.ts for mocking patterns. Minimum 30 tests. Run them and verify.
Agent 2 — schedule routes (~382 lines): same Supabase mock approach, test CRUD for schedules, share link generation, and the free tier limit check. Minimum 20 tests.
Agent 3 — bot handlers (~1579 lines): grammY is hard to unit test end-to-end, so focus on exported helper functions — notification builders, message formatters, anything that doesn’t require a live bot context. 15 tests minimum.
Agent 4 — cron jobs: mock the Supabase client directly, test reminder dispatch, calendar sync logic, and draft cleanup. The jobs are function calls, not HTTP routes, so no Express needed. 15 tests minimum.
The key elements in each prompt: name the exact file to test, point to an existing test file for style reference, list the specific behaviors to cover, give a minimum count, and tell it to run the tests before reporting back.
I submitted all four. Then I watched the timers.
Results: Three Minutes
Each agent worked independently. No coordination, no waiting.
bookings.test.ts → 37 tests ✓ all passing
schedules.test.ts → 24 tests ✓ all passing
bot-handlers.test.ts → 18 tests ✓ all passing
cron-jobs.test.ts → 16 tests ✓ all passing
95 new tests. Combined with the original 52, the suite sits at 147 total. Wall-clock time from dispatch to all-green: roughly 3 minutes.
One needed a fix after the fact — the cron jobs agent had missed mocking getUserMeetingSettings, a helper that the reminders job calls internally. That wasn’t in any of the existing test files, so the agent had no pattern to copy. It took about two minutes to identify and patch. Still: 94 tests correct on first run.
What the Agents Actually Did Well
The Supabase mocking was consistent across all four. Every agent followed the same pattern: vi.mock('@/lib/supabase', ...) with chainable .select().eq().single() return values. Without the reference file, I’d have gotten four different mocking approaches that I’d then have to normalize. With it, the style was uniform.
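The chainable shape the agents reproduced looks roughly like this — a standalone sketch with no vitest, since the real tests wire the same idea up through `vi.mock('@/lib/supabase', ...)`; the table name and row below are hypothetical:

```typescript
// Standalone sketch of the chainable mock style the agents copied.
type Row = Record<string, unknown>;
type QueryResult = { data: Row | null; error: Error | null };

function mockQuery(result: Row) {
  // Every intermediate method returns the builder itself, so any
  // .select().eq()... chain works; the terminal .single() resolves
  // to the canned row, mimicking the Supabase client API shape.
  const builder = {
    select: (_columns?: string) => builder,
    eq: (_column: string, _value: unknown) => builder,
    single: async (): Promise<QueryResult> => ({ data: result, error: null }),
  };
  return builder;
}

// Stand-in for the mocked client module.
const supabase = {
  from: (_table: string) => mockQuery({ id: "bk_1", status: "cancelled" }),
};
```

A test can then await `supabase.from("bookings").select("*").eq("id", "bk_1").single()` and assert against the canned row, which is why a single reference file was enough to keep all four agents' mocks uniform.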
The Express server setup in the bookings and schedules agents was reusable — both spun up a real server on port 0 (letting the OS assign a free port) and tore it down in afterAll. That pattern is correct and I wouldn’t have written it differently.
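The same pattern, sketched with Node's built-in http module instead of Express so it runs with no dependencies — the handler here is a stand-in, not the app the agents tested:

```typescript
import http from "node:http";
import type { AddressInfo } from "node:net";

// Hypothetical stand-in for the Express app under test.
const server = http.createServer((_req, res) => res.end("ok"));

// Listening on port 0 asks the OS for a free ephemeral port, so
// parallel test runs never collide on a hard-coded port number.
const port = await new Promise<number>((resolve) =>
  server.listen(0, () => resolve((server.address() as AddressInfo).port)),
);

// In the generated tests, the matching teardown lives in afterAll().
server.close();
```

The payoff is exactly the parallelism this post is about: four agents can each boot a real server in their own worktree without fighting over a port.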
The bot handler agent made the right call about grammY. Testing the full bot context requires a live Telegram connection or a deep mock of the grammY framework — neither is practical. The agent identified the exported helpers that were independently testable and focused there. It didn’t try to test what it couldn’t test cleanly.
What Didn’t Work as Advertised
The 95-tests-in-3-minutes framing is accurate but incomplete. What it doesn’t include:
Review time. I spent 20 minutes reading through the generated tests. Some assertions were shallow — testing that a function returns a value without checking what the value is. I tightened several. Others tested implementation details that would break on any refactor. I rewrote those to test behavior instead.
The one failure. The missing getUserMeetingSettings mock didn’t surface until I ran the full suite in CI, not during agent verification. The agent ran its own tests (which passed), but CI runs the full suite. The mock gap caused one test file to error on import. Fix was small; the miss was real.
Coverage vs. quality. 147 tests is not the same as 147 good tests. Some of what the agents generated is scaffolding — it proves the code runs, not that it behaves correctly under edge conditions. The critical path coverage (cancel, reschedule, conflict detection) is solid. The edge cases are thinner than I’d write by hand.
The Prompt Pattern
After running this a few times, the prompt structure that produces reliable output:
Test [specific file path].
Reference [existing test file] for mocking patterns.
Cover: [list of specific routes or functions].
Mock: [list of external dependencies — Supabase, external APIs, etc.].
Minimum [N] tests.
Run the tests before reporting back.
The “run before reporting” instruction is load-bearing. Without it, agents sometimes return plausible-looking test files that don’t actually pass. Making them execute and show output closes the feedback loop.
Pointing to existing patterns is also not optional. Without a reference, agents get creative — and creative mocking strategies work in isolation but don’t compose when you run the full suite together.
The Wider Pattern
I’m not the only one running these experiments. Airwallex cut integration testing from two weeks to two hours using Claude Code subagents. OpenObserve grew from 380 to 700+ tests with AI-assisted generation. The pattern is the same in both cases: parallel agents working in isolated environments on well-scoped modules.
The isolation is what makes this safe. Without worktrees, parallelism creates conflicts. With worktrees, each agent has a clean workspace and the only coordination happens at merge time.
The module boundaries matter too. “Test the backend” is not a workable scope for a single agent. “Test the booking routes using this mocking pattern with a minimum of 30 cases” is. The more precisely you can define the surface, the better the output.
Takeaway
Sequential test writing is a throughput problem. One developer, one module at a time, blocked on understanding the mocking setup before writing a single assertion. Parallel agents with worktree isolation break that constraint. You pay upfront with good prompts and existing patterns to reference. You get back time proportional to how many agents you can dispatch.
95 tests is not the ceiling. It’s what happened with four agents, three minutes, and a codebase with decent existing examples. The ceiling is how many isolated modules you have and how clearly you can describe what each one should do.
MeetCal still needs edge case tests I wouldn’t trust an agent to write without heavy review. Conflict detection when two users book the same slot simultaneously. Token refresh during an active booking flow. Calendar sync failures that happen mid-transaction. Those need careful thought about what the correct behavior is before you can test it.
But the routine surface area — routes, handlers, jobs — that’s a prompt problem now, not a time problem.