Test Plans

A test plan chains several small tests into one larger scenario. Each test still does one focused thing in one browser; the plan decides the order they run in, which browser session each one uses, and how data flows from one test to the next.

You reach for a plan when a flow does not fit inside a single prompt. Take an admin inviting a teammate: the admin sends the invite, the teammate accepts it from their own inbox in their own browser, then the admin confirms the new seat appeared. That is three actions, by two people, in two browsers, in a fixed order. You could force all of that into one giant prompt, but such a prompt turns brittle fast. Splitting the flow into small tests and letting the plan orchestrate them avoids that.

A plan gives you three primitives to compose with:

Step dependencies: which steps must finish before another can start. This sets the order of work, and by extension what can run in parallel.
Browser profiles: which browser session (its cookies, storage, and logged-in state) each test gets access to. Steps can share one session or run in fully isolated ones.
Step outputs: named values a step records during its run. Every value an ancestor recorded is handed to the agent automatically, so later prompts simply refer to those values in plain language.
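As a rough mental model, the three primitives can be pictured as three fields on each step. The sketch below models the admin-invites-a-teammate flow from the introduction. The class, the field names, the step names, and the invite_email output are all made up for illustration; they are not the actual plan schema, which lives in the Test Plans API reference.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One step of a plan. Field names are illustrative, not the real schema."""
    name: str
    depends_on: list[str] = field(default_factory=list)  # step dependencies
    profile: str = "default"                             # browser profile (session)
    outputs: list[str] = field(default_factory=list)     # named values the step records


# Three actions, two people, two browsers, in a fixed order.
plan = [
    Step("send_invite", profile="admin", outputs=["invite_email"]),
    Step("accept_invite", depends_on=["send_invite"], profile="teammate"),
    Step("confirm_seat", depends_on=["accept_invite"], profile="admin"),
]

# Two distinct profiles means two browser sessions.
profiles = sorted({step.profile for step in plan})
print(profiles)  # ['admin', 'teammate']
```

The first and last steps name the same profile, which is how the admin would return to their own session after the teammate has acted in a separate one.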

The clearest way to see how they fit together is to build one. If you would rather have the exact JSON payloads, addressing rules, and endpoint reference, head to the Test Plans API reference.

A worked example

Suppose you are testing an events app. A user registers, logs in, creates an event, and invites a guest who accepts. We will build that as a five-step plan and introduce each primitive exactly when the flow needs it. Here is the whole thing first; we then go through it step by step.

Registration ─▶ Login ─▶ Create Event ─▶ Accept Invite ─▶ Cleanup

It is a straight chain: each step depends on the one before it, and Cleanup at the end runs even if an earlier step fails. The steps differ in who runs them and what data they pass along, which is where the three primitives come in. You build it on a canvas, adding one step at a time and pointing each at a saved test:

An empty test plan on the canvas, ready for its first step.

1. Registration: producing outputs

The first step runs a test that registers a brand-new account. The account it creates is not known ahead of time, and later steps need to log in with it. That is what step outputs are for: a step declares named values its test will produce during the run, and any later step can read them. Registration declares two outputs, username and password; the test records the credentials it generated as it goes.

The step also needs to know where to register. We do not want to hard-code https://staging.example.com into the test, because the same plan should run against staging and production. So we declare a plan variable called host, a value the caller supplies when they start the run. The test's prompt references it with double-brace syntax:

Go to {{ plan.variables.host }}/register and create a brand-new account with a unique username and password. Record the username and password you chose as outputs. Verify: the dashboard loads and shows the newly created account.

Start the run with host = https://staging.example.com and the whole plan points at staging; start it with the production host and nothing else changes. That is the everyday use for plan variables: one value, declared once, threaded into every step that needs it.
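To make the substitution concrete, here is a minimal sketch of how a double-brace reference could be resolved before the prompt reaches the agent. The function name render_prompt and the choice to raise an error for a missing variable are assumptions made for this example, not a description of how the product implements it.

```python
import re

# Matches {{ plan.variables.NAME }} with optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*plan\.variables\.(\w+)\s*\}\}")


def render_prompt(template: str, variables: dict[str, str]) -> str:
    """Replace each plan-variable placeholder with the value supplied for the run."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            # Assumed behavior for this sketch: fail loudly on a missing variable.
            raise KeyError(f"plan variable {name!r} was not supplied")
        return variables[name]

    return PLACEHOLDER.sub(substitute, template)


template = "Go to {{ plan.variables.host }}/register and create a brand-new account."
print(render_prompt(template, {"host": "https://staging.example.com"}))
# Go to https://staging.example.com/register and create a brand-new account.
```

Calling render_prompt with a different host value changes the rendered URL and nothing else, which is the property the plan relies on.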

The Registration step wired from Start. Its panel shows the prompt referencing the host plan variable, the browser-session toggle, and the two declared outputs.

2. Login: using an earlier step's outputs

Login depends on Registration, since it cannot run until the account exists, so we connect Login to Registration. That dependency is also what makes Registration's outputs available here: every value recorded by a step's ancestors is handed to the agent automatically. The prompt just refers to them by name, with no special syntax:

Go to {{ plan.variables.host }}/login. Sign in with the username and password recorded by the registration step. Verify: you land on the dashboard, signed in.

The agent receives every ancestor output (name, value, and description) alongside its instructions, so “the username recorded by the registration step” resolves naturally. A step only sees outputs from steps it depends on, directly or through a chain of dependencies; that dependency is what guarantees the value exists by the time the step runs.
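The visibility rule can be sketched as a walk backwards over the dependency edges. The dictionaries, step identifiers, and recorded values below are invented example data, and the functions are only an illustration of the rule, not the runner's actual code.

```python
# Hypothetical view of the chain so far: step -> steps it depends on.
depends_on = {
    "registration": [],
    "login": ["registration"],
    "create_event": ["login"],
}

# Values each step recorded during the run (example data only).
recorded = {
    "registration": {"username": "user_1234", "password": "s3cret"},
    "create_event": {"invite_url": "https://staging.example.com/invite/abc"},
}


def ancestors(step: str) -> set[str]:
    """Every step reachable by following dependency edges backwards."""
    seen: set[str] = set()
    stack = list(depends_on[step])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(depends_on[current])
    return seen


def visible_outputs(step: str) -> dict[str, dict[str, str]]:
    """Outputs the agent for `step` would receive: those of its ancestors only."""
    return {name: recorded[name] for name in sorted(ancestors(step)) if name in recorded}


print(visible_outputs("login"))
# {'registration': {'username': 'user_1234', 'password': 's3cret'}}
```

In this sketch create_event also sees the registration credentials, because registration is an ancestor through login, but it never sees its own invite_url as an input.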

Login connected after Registration. The prompt refers to the registration step's username and password outputs by name, and the browser session is set to reuse the organizer profile.

3. Create Event: reusing the browser session

Creating an event has to happen as the logged-in user. We could make this step log in again, but that is wasteful and brittle. Instead we give Login and Create Event the same browser profile, call it organizer. Steps that share a profile share a browser session: the same cookies, the same storage, the same logged-in identity. So Create Event starts exactly where Login left off, already authenticated, and its prompt just gets on with the work:

You should already be signed in. If you are not, go to {{ plan.variables.host }}/login and sign in with the credentials recorded by the registration step. Create a new public event called "Launch Party". Verify: the event appears on your events list.

Notice the fallback in the opening sentences. The shared session is the happy path, but browser sessions can time out between steps. A robust step prompt assumes the logged-in state and also spells out how to recover if the session is gone. The registration credentials are already in the agent's context, so the fallback costs one sentence. The session reuse saves the work; the fallback keeps a stale session from failing the run. Create Event also declares its own invite_url output, the public link the guest will use in the next step.

Create Event reusing the organizer session. The prompt opens with a login fallback, and the step declares an invite_url output for the guest to consume.

4. Accept Invite: a second actor as a fresh session

The guest is a different person. We do not want them sharing the organizer's logged-in browser; they should arrive as their own client. Modeling that takes nothing special: we just give Accept Invite a different profile, say guest. A profile the organizer never touched is simply a clean, isolated browser session with no cookies and no logged-in identity, which is precisely what a guest user is.
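One way to picture profiles is as keys into a table of browser sessions: the same key returns the same session, and a new key creates a clean one. The BrowserSession class and session_for function below are stand-ins written for this sketch; real sessions carry far more state than a cookie dictionary.

```python
class BrowserSession:
    """Stand-in for a real browser session; it only tracks cookies."""

    def __init__(self) -> None:
        self.cookies: dict[str, str] = {}


sessions: dict[str, BrowserSession] = {}


def session_for(profile: str) -> BrowserSession:
    """Return the session for a profile, creating a clean one on first use."""
    if profile not in sessions:
        sessions[profile] = BrowserSession()
    return sessions[profile]


# Login signs in on the organizer profile.
session_for("organizer").cookies["auth"] = "organizer-token"

# Create Event uses the same profile, so it starts already signed in.
print(session_for("organizer").cookies)  # {'auth': 'organizer-token'}

# Accept Invite uses a different profile and gets a clean session.
print(session_for("guest").cookies)  # {}
```

Nothing marks the guest as special in this model. Isolation falls out of using a profile name that no other step has used.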

Accept Invite also depends on Create Event, since there is no invite to accept until the event exists, so we connect it to Create Event. This is the step dependency doing real work: it both enforces the order and makes Create Event's invite_url output available to the guest's prompt:

You are a guest visitor with no account. Open the invite URL recorded by the create_event step. Accept the invitation to the event. Verify: a confirmation that you are attending "Launch Party" is shown.

Accept Invite on its own guest profile. The canvas colours it differently from the organizer steps, and its panel shows a fresh browser session, a clean client with no logged-in identity.

5. Cleanup: a step that always runs

Registration created a real account, and we do not want it lingering after the run. So we add a Cleanup step that deletes it, reading the same username output to know which account to remove.

There is a catch. By default, when a step fails, the runner skips its descendants. That is usually what you want, since a later step is rarely meaningful once an earlier one broke. Cleanup is the exception: if Create Event or Accept Invite fails, the account still exists and still needs deleting. So Cleanup opts out of that default and is marked to run regardless of whether its ancestors succeeded. That is the right tool for teardown, cleanup, and any verification that should happen win or lose.
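The skip rule and its exception can be simulated in a few lines. This is a simplified model of the behavior described above, not the runner itself: run_anyway is a hypothetical name for the run-regardless setting, and the sketch assumes that a skipped dependency blocks its descendants the same way a failed one does.

```python
# (name, depends_on, run_anyway), listed so that dependencies come first.
# `run_anyway` is a hypothetical name for the "run regardless" setting.
steps = [
    ("registration", [], False),
    ("login", ["registration"], False),
    ("create_event", ["login"], False),
    ("accept_invite", ["create_event"], False),
    ("cleanup", ["accept_invite"], True),
]


def simulate(failing: set[str]) -> dict[str, str]:
    """Return each step's status given the set of steps whose test fails."""
    status: dict[str, str] = {}
    for name, depends_on, run_anyway in steps:
        # A step is blocked if any dependency did not pass (failed or skipped).
        blocked = any(status[dep] != "passed" for dep in depends_on)
        if blocked and not run_anyway:
            status[name] = "skipped"
        else:
            status[name] = "failed" if name in failing else "passed"
    return status


print(simulate({"create_event"}))
# {'registration': 'passed', 'login': 'passed', 'create_event': 'failed',
#  'accept_invite': 'skipped', 'cleanup': 'passed'}
```

With Create Event failing, Accept Invite is skipped while Cleanup still runs. Setting Cleanup's flag to False in the list would turn its status into skipped as well.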

The finished five-step plan. Cleanup is selected, showing the "If a previous step fails" control set to Run anyway so the account is always deleted.

That is the whole plan. Five small tests, each short enough that an agent completes it reliably, composed with dependencies, profiles, outputs, and plan variables into a flow no single prompt could express cleanly. For the JSON shapes behind variables, outputs, profiles, and the run-regardless flag, see the Test Plans API reference.

Designing your own plans

A useful way to approach a new plan is to write down the actors first: who is involved in the scenario, and which of them needs their own browser session. That gives you the set of profiles. Then write down what each actor does, in order, and which of those actions only make sense after some other action has happened. That gives you the steps and the edges between them. Anything that is not connected by an edge is a hint that those steps can run in parallel; if that is not what you want, add the edge.
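To check which steps your edges leave free to run in parallel, you can group the steps into waves, where every step in a wave has all of its dependencies satisfied by earlier waves. The sketch below uses Python's standard graphlib module on a made-up variant of the events plan in which the guest registers their own account; the step names and the waves function are illustrative assumptions, and the real runner may schedule differently.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: step -> set of steps it depends on.
graph = {
    "register_organizer": set(),
    "register_guest": set(),
    "create_event": {"register_organizer"},
    "accept_invite": {"create_event", "register_guest"},
}


def waves(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group steps into waves; steps in the same wave have no edges between them."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the edges form a cycle
    result: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result


print(waves(graph))
# [['register_guest', 'register_organizer'], ['create_event'], ['accept_invite']]
```

The two registration steps land in the same wave because no edge connects them. Adding register_organizer to register_guest's dependencies would force them into sequence.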

For long, deep flows like a full e-commerce checkout, a multi-day approval workflow compressed into a single run, or an onboarding sequence that touches half a dozen screens, the payoff is that you stop writing one giant brittle test and start composing small ones. Each test stays short enough that an agent can complete it reliably. The test plan owns the orchestration, the parallelism, and the actor switching, so the tests themselves do not need to know they are part of something larger. When a step fails, you get a precise failure on a small surface, not a forty-step prompt that died somewhere in the middle.

Where to next

The Test Plans API reference covers JSON payloads, addressing rules, and run semantics.
The Tests API reference describes the building block every step points at.
The Schedules API reference shows how to run a plan on a recurring cadence.