Devin (by Cognition) can reproduce each Tusk workflow. Because Devin is a general, customizable agent platform, you can extend each one well past Tusk’s capabilities. This guide maps every Tusk product to its Devin equivalent and gives you a copy-paste prompt for each automation. Devin is more powerful than Tusk’s purpose-built pipelines, but it is usage-priced, so the goal shifts from running it on every action to triggering it intentionally.

At a glance

| Tusk product | Devin equivalent | Migration effort |
| --- | --- | --- |
| Tusk Review | Devin Review | Lowest: product swap, mostly config |
| Tusk Tester (PR check) | An Automation on a GitHub PR trigger running a test-gen Playbook + Knowledge | Medium: assemble and tune |
| Tusk Tester (CoverBot) | A scheduled Automation running a coverage Playbook | Medium: assemble and tune |
| Tusk Drift | A scheduled Automation generating API tests from real usage (with observability MCP) | Highest: most custom |

Devin features you’ll use

Automations

A trigger (schedule, GitHub PR/push/comment, Slack, Linear, webhook), an optional condition, then a Devin session running your prompt, with the triggering event’s payload auto-appended for context.

Playbooks

A reusable system prompt for a repeated task. Include it in an automation’s prompt with @playbook-name.

Knowledge

Org and repo guidance Devin auto-recalls based on triggers. It composes with Playbooks: a Knowledge item can’t invoke a Playbook, but both stay active in the same run.

Devin Review

A PR code-review platform that auto-reviews pull requests, flags bugs with Bug Catcher, and can push fixes with Auto-Fix. The drop-in Tusk Review replacement.

1. Tusk Review → Devin Review

This is the easiest migration since it is a near 1:1 product swap. Devin Review auto-reviews PRs, posts inline comments, and can fix what it finds. Auto-Fix proposes the change inline and can push a fix commit, going beyond Tusk Review’s comment-only model. Bug Catcher categorizes findings (Severe / Non-severe / Investigate / Informational) with smart diff grouping.

Step 1: Enable Devin Review

Admins enroll repositories under Settings → Review; each user picks a trigger mode (Auto review, On PR creation, or Manual) under Settings → Preferences. GitHub has full support; GitLab is in preview.

Step 2: Configure

Set the trigger mode and bot-comment handling per the docs. Optionally point Auto-Fix at recurring classes of bug, and add repo Knowledge so review comments respect your conventions and known non-issues.

2. Tusk Tester (PR check) → PR-triggered test generation

Tusk Tester watched your PRs, generated unit tests scoped to the diff, and let you incorporate them on the PR branch. Reproduce it in Devin as an Automation on a GitHub PR trigger running the prompt below, with Knowledge ensuring testable changes get covered.

Step 1: Save the prompt as a Playbook

Copy the PR Test-Gen Agent Playbook prompt below into a Devin playbook; resolve its [USER_INPUT_MODE] (regression-safety vs bug-hunting vs both) and [USER_INPUT_DELIVERY] (inline on the PR branch vs separate PR) blocks.
pr-test-gen-agent.md
# PR Test-Gen Agent — Diff-Scoped Unit Test Generation

You are an autonomous engineering agent triggered on a **pull request**. Your job is to **generate high-quality unit tests for the code that changed in this PR** — nothing else — and deliver them back to the developer. You have access to the repository, the PR diff, and a working environment where you can run the test suite.

This is the PR-triggered counterpart to a full-repo coverage run: tightly scoped to the diff, fast, and focused on protecting *the change* the developer just made.

> **⚙️ Before running this prompt, resolve every `[USER_INPUT_*]` decision block.** Each is a choice only the person configuring this can make: keep one option and delete the rest. This prompt contains: `[USER_INPUT_MODE]` (Stage 5) and `[USER_INPUT_DELIVERY]` (Stage 6). Options marked **(recommended/default)** are sensible defaults.

---

## Prime directive: quality over coverage numbers

A test is only worth shipping if it would **catch a real regression in the changed code.** A hollow test that pads a number is worse than no test — it rots and erodes trust. These principles are non-negotiable:

1. **Exercise the real code under test.** Import and call the actual symbol. Never assert only on locally-constructed or hardcoded values, and never re-implement or copy the symbol's logic into the test.
2. **Mock only dependencies and collaborators — never the symbol under test.** If a symbol isn't directly reachable, use module-level mocking (`jest.mock`, `proxyquire`, `rewire`, `esmock`, `unittest.mock.patch`, monkey-patching) to stub its dependencies while still running the real symbol. Try this before giving up.
3. **A test must have algorithmic distance** — its assertions must follow from real computation in the code (a branch, a transformation, a computed result, an error path), not restate constants or mock return values. A test that only confirms a mock was called has zero value.
4. **Better to skip a symbol than write a hollow test.** Never produce a low-value test to hit a number. Report what you skipped and why.
5. **Match the repo's conventions.** Use the same framework, directory layout, fixtures, naming, and assertion style as the existing tests. Add to existing test files where they exist.
6. **Never hardcode real secrets, tokens, or PII** into tests or fixtures; synthesize fake-but-realistic values.

---

## Stage 1 — Understand the change

1. **Read the diff.** Get the PR's changed files and the exact added/modified lines (`git diff` against the base). Read the PR title and description to understand *intent* — what the change is supposed to do.
2. **Identify the modified symbols.** Map the changed lines to the functions/methods/classes that contain them. These — and only these — are your candidates. A file being touched does not mean every symbol in it is in scope; target the symbols whose behavior the diff actually changed.
3. **Gather context per symbol:** its full definition, the diff applied to it, its dependencies and collaborators, its callers (to understand how it's used), and the existing tests for it or its neighbors (to mirror structure and reuse fixtures/mocks).
4. **Confirm you can run the suite.** If the environment can't be made runnable (deps won't install, missing services/secrets/env), **stop and report the blocker** rather than generating tests you can't verify. In a monorepo, scope to the package(s) the diff touches and use each package's own framework.

---

## Stage 2 — Decide what's worth testing on this diff

Scope ruthlessly. You are testing the change, not auditing the file.

**Skip:**
- Symbols with no real logic (trivial getters/passthroughs, pure delegation, constants, type-only changes).
- Pure formatting/rename/comment/config diffs with no behavioral change.
- Generated code, migrations, test files themselves.
- Symbols that only forward data to an external dependency with no transformation or decision.
- Anything where the change is purely cosmetic and a test would assert nothing meaningful.

**Test:**
- Changed symbols that contain real logic — branching, transformation, validation, error handling, computation, state changes.
- The specific behavior the diff introduced or altered.

Keep it focused: a **handful of tests per symbol**, adding another only when it protects a genuinely distinct risk (cap by distinct risk, not a fixed count). Prefer a few high-signal symbols over blanketing the diff. If the PR is trivial and there's genuinely nothing meaningful to test, say so and stop — an empty result with a clear explanation is a correct outcome.

---

## Stage 3 — Design scenarios for the change

For each in-scope symbol, design test scenarios **related to the diff** — do not test pre-existing behavior the change didn't touch, and do not duplicate scenarios already covered by the existing suite. "Happy path" here means *the changed code's* intended behavior, not the whole feature.

**Happy path (the intended new behavior):**
- Capture the behavior the change is *supposed* to have, derived from the diff, the symbol's signature/types, and the PR intent. These lock in current intended behavior so the team can refactor with confidence.
- Keep them plausible and minimal (Occam's razor). Pay attention to input/output types. Don't fold edge cases in here.

**Edge cases (only the impactful ones):**
- Surface only edge cases that, if uncaught, would cause a real incident or critical failure — boundary conditions, error/exception paths, null/empty handling, off-by-one, validation gaps, unexpected input combinations.
- **Respect author intent.** If a comment, docstring, type system, or DB constraint shows a case is already handled or intentionally out of contract (e.g. a field is non-nullable, or an error is deliberately raised), do **not** invent a test that contradicts it. Where the code is meant to raise, assert the correct error is raised.
- Skip trivial or obvious edge cases, and anything the existing tests already cover. If there are no meaningful edge cases, return none — don't pad.

---

## Stage 4 — Generate the tests

For each scenario, write a test that runs the real symbol, mocking only its dependencies (Principles 1–2, 6). Include all necessary imports as if writing a standalone file, then integrate into the existing test file/structure so a future engineer can extend it naturally. Each test must have at least one meaningful assertion that checks the scenario's intended behavior. In regression-safety mode that assertion should pass; in bug-hunting (or both) mode it encodes intended behavior and may fail against buggy code, which is exactly the point (see Stage 5). Don't assert an error is thrown unless the scenario or source explicitly expects it.

---

## Stage 5 — Verify (per the chosen mode)

> **⚙️ DECISION — `[USER_INPUT_MODE]`** · what kind of tests this run produces. Keep one, delete the rest.
> - **(A) Regression-safety (default)** — All tests must **pass** and faithfully encode the changed code's intended behavior. Debug genuine failures; if you can't get a meaningful test to pass after ~3–4 distinct debugging hypotheses, drop that symbol rather than forcing it. Never weaken assertions just to go green.
> - **(B) Bug-hunting** — Goal is to prove bugs exist in the changed code. Write tests for the **intended** behavior (from docstrings/specs/PR intent), prioritizing edge cases, error paths, off-by-one, null handling, validation gaps, wrong assumptions. **If a test fails because the implementation is wrong, keep it — that's a bug.** Only fix tests with syntax/import/setup errors; never change the source to make a test pass, and never rewrite a test to match buggy behavior. Surface each suspected bug clearly.
> - **(C) Both** — Produce passing regression tests AND surface bug-exposing failing tests, clearly separated in the summary so the developer can tell "safety net" from "suspected bug."

### Quality gate (all modes) — discard a test if it:
- Mocks/re-implements the **entire symbol under test** (false positive — proves nothing).
- Only asserts a mocked dependency was called, with no logic exercised (no algorithmic distance).
- Is flaky — run anything async/timing/ordering/randomness-dependent multiple times; fix the nondeterminism or drop it.
- Duplicates another test's coverage (dedupe to the most comprehensive one).
- Breaks unrelated existing tests in the same file.

Then run the **full** suite (not just new tests) to confirm no interaction breakage, and run the linter/formatter/type-checker over the new files.

---

## Stage 6 — Deliver

> **⚙️ DECISION — `[USER_INPUT_DELIVERY]`** · how the tests reach the developer. Keep one, delete the rest.
> - **(A) Inline on the PR branch (recommended)** — Commit the generated tests onto the PR's **own branch**, so they ride along and merge with the PR. (Requires the bot to be allowed to push to PR branches.)
> - **(B) Separate PR** — Open a **new PR** containing the tests, targeting the same base branch, linked back to the original PR. Use this when pushing to the source branch isn't permitted (branch protections) or the team prefers tests reviewed separately.

Post a concise summary the developer can act on:
- Which changed symbols you tested and the scenarios covered (happy path + which edge cases).
- In bug-hunting / "Both" mode: the failing tests and the suspected bug each one documents — **flag these prominently**; do not bury a real regression.
- What you deliberately skipped and why (kept it honest rather than padding).

---

## Definition of done

The PR's *changed* logic has tests that genuinely exercise it and would catch a regression, the suite is green (or failing only where a real bug is documented, per the chosen mode), and nothing hollow, flaky, or redundant shipped. If the change had nothing meaningful to test, say so plainly instead of generating filler.
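
To make the Playbook's "algorithmic distance" principle concrete, here is a minimal Python sketch contrasting a hollow test with a real one. `Checkout`, `RateClient`, and their methods are hypothetical names invented for this illustration; only `unittest` and `unittest.mock.patch.object` are real library calls.

```python
import unittest
from unittest.mock import patch


class RateClient:
    """Hypothetical dependency: stands in for a network call."""

    def fetch_rate(self, region: str) -> float:
        raise RuntimeError("no network access in unit tests")


class Checkout:
    """Hypothetical symbol under test, with a real branch and a real computation."""

    def __init__(self, rates: RateClient) -> None:
        self.rates = rates

    def price_with_tax(self, amount: float, region: str) -> float:
        if amount < 0:
            raise ValueError("amount must be non-negative")
        return round(amount * (1 + self.rates.fetch_rate(region)), 2)


class HollowTest(unittest.TestCase):
    def test_only_restates_the_mock(self):
        # Hollow: the symbol under test is replaced, so no real logic runs.
        with patch.object(Checkout, "price_with_tax", return_value=12.5):
            self.assertEqual(Checkout(RateClient()).price_with_tax(10.0, "EU"), 12.5)


class RealTest(unittest.TestCase):
    def test_applies_rate_and_rounds(self):
        # Only the dependency is stubbed; the real method computes the result.
        with patch.object(RateClient, "fetch_rate", return_value=0.25):
            self.assertEqual(Checkout(RateClient()).price_with_tax(10.0, "EU"), 12.5)

    def test_rejects_negative_amount(self):
        # Error path: raised before the dependency is ever called.
        with self.assertRaises(ValueError):
            Checkout(RateClient()).price_with_tax(-1.0, "EU")
```

Both classes pass as written, but only `RealTest` would start failing if the rounding or the negative-amount guard were broken, which is the property the quality gate is checking for.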

Step 2: Create the Automation

Trigger on pull_request (opened / ready-for-review); action = a session running @pr-test-gen. The PR payload is auto-appended, so the agent gets diff context natively.
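
Stage 1 of the Playbook asks the agent to map changed lines to the symbols that contain them. The sketch below shows one way that mapping could work for Python sources, assuming a single-file unified diff; `added_lines` and `changed_functions` are hypothetical helper names, and this is not a description of how Devin does it internally.

```python
import ast
import re

# Matches a unified-diff hunk header and captures the new-file start line.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def added_lines(unified_diff: str) -> set[int]:
    """Return new-file line numbers of '+' lines in a single-file unified diff."""
    added: set[int] = set()
    new_lineno = None  # stays None until the first hunk header is seen
    for line in unified_diff.splitlines():
        header = HUNK_HEADER.match(line)
        if header:
            new_lineno = int(header.group(1))
        elif new_lineno is None or line.startswith("\\"):
            continue  # file headers, or a "\ No newline at end of file" marker
        elif line.startswith("+"):
            added.add(new_lineno)
            new_lineno += 1
        elif line.startswith("-"):
            continue  # removed lines do not exist in the new file
        else:
            new_lineno += 1  # context line
    return added


def changed_functions(source: str, lines: set[int]) -> list[str]:
    """Names of functions in `source` whose line span contains an added line."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if any(node.lineno <= n <= node.end_lineno for n in lines):
                names.append(node.name)
    return sorted(names)


NEW_SOURCE = """\
def untouched(x):
    return x


def clamp(value, low, high):
    if value < low:
        return low
    return min(value, high)
"""

DIFF = """\
--- a/util.py
+++ b/util.py
@@ -5,4 +5,4 @@
 def clamp(value, low, high):
     if value < low:
         return low
-    return value
+    return min(value, high)
"""

print(changed_functions(NEW_SOURCE, added_lines(DIFF)))  # ['clamp']
```

Only `clamp` is reported even though `untouched` lives in the same file, matching the rule that a touched file does not put every symbol in scope.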

Step 3: Add Knowledge

Knowledge is “a collection of tips, advice, and instructions that Devin can reference in all sessions,” automatically recalled when a relevant trigger fires. Add an entry like “When working on a PR, ensure any changed code that can be meaningfully unit-tested is tested, following our conventions.” It’s also where you encode domain context (e.g. “our X webhook only ever returns 200, so don’t test that error path”).

Mind the cost. A session per PR event adds up, so avoid triggering on every commit. Common patterns: trigger when the developer adds a GitHub label, run once when a PR is ready for review, or skip per-PR entirely and let the scheduled coverage automation below pick up recently merged code.
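
The gating patterns above boil down to a small predicate over the webhook payload. The following is a plain-Python sketch of that logic, not Devin's Automation condition syntax; the `action`, `label.name`, and `pull_request.draft` fields come from GitHub's `pull_request` webhook payload, while `should_start_session` and the `generate-tests` label are hypothetical names.

```python
TRIGGER_LABEL = "generate-tests"  # hypothetical label name; pick your own


def should_start_session(event: dict) -> bool:
    """Decide whether a GitHub `pull_request` webhook event is worth a paid session."""
    action = event.get("action")
    pr = event.get("pull_request", {})
    if action == "labeled":
        # Opt-in: a developer explicitly asked for tests by adding the label.
        return event.get("label", {}).get("name") == TRIGGER_LABEL
    if action == "ready_for_review":
        return True
    if action == "opened":
        # Skip drafts; they emit ready_for_review once they leave draft state.
        return not pr.get("draft", False)
    # Everything else, including "synchronize" (a new commit was pushed), is ignored.
    return False


print(should_start_session({"action": "synchronize", "pull_request": {"draft": False}}))  # False
print(should_start_session({"action": "labeled", "label": {"name": "generate-tests"}}))   # True
```

Ignoring `synchronize` is what keeps the cost bounded: pushes to an open PR no longer start a session on their own.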

3. Tusk Tester (CoverBot) → scheduled coverage

CoverBot ran on a schedule, found under-covered code, generated tests, and opened a PR. Reproduce it as an Automation on a Schedule trigger running the coverage prompt below.

Step 1: Save the prompt as a Playbook

Copy the Coverage Agent Playbook prompt below into a Devin playbook; resolve its [USER_INPUT_PR_SIZE] (saturate vs breadth vs fixed scope), [USER_INPUT_SOURCE_FILES] (test-only vs allow source changes), and [USER_INPUT_REVIEWERS] blocks.
coverbot-agent.md
# Coverage Agent — Autonomous Unit Test Generation

You are an autonomous engineering agent whose job is to **raise meaningful unit-test coverage in a repository by writing high-quality tests and opening a pull request with them.** You can read the code, run commands in a working environment, execute the test suite, and measure coverage. Work the loop below methodically and autonomously — do not ask for confirmation between stages.

> **⚙️ Before running this prompt, resolve every `[USER_INPUT_*]` decision block.** Each is a choice only the person configuring this can make: keep one option and delete the rest. This prompt contains: `[USER_INPUT_PR_SIZE]` (Stage 2), `[USER_INPUT_SOURCE_FILES]` (Stage 2), and `[USER_INPUT_REVIEWERS]` (Stage 7). Options marked **(recommended)** are sensible defaults.

---

## Prime directive: quality over coverage numbers

Coverage percentage is a *proxy*, not the goal — the goal is **tests that would catch a real regression.** A high number made of hollow tests is worse than fewer real tests, because hollow tests rot and erode reviewer trust. These principles govern every decision below:

1. **Exercise the real code under test.** Import and call the actual symbol. Never assert only on locally-constructed or hardcoded values, and never re-implement or copy the symbol's logic into the test — run the real implementation, not a copy.
2. **Mock only dependencies and collaborators — never the symbol under test.** If a symbol isn't directly reachable, use module-level mocking (e.g. `jest.mock`, `proxyquire`, `rewire`, `esmock`, `unittest.mock.patch`, monkey-patching) to stub its *dependencies* while still running the real symbol. Try this before giving up on a symbol.
3. **A test must have algorithmic distance** — its assertions must follow from real computation in the code (a branch, a transformation, a computed return value, an error path), not restate constants or mock return values. A test that only confirms a mocked dependency was called, or that re-states the implementation as assertions, has **zero value**.
4. **Better to skip a symbol than write a hollow test.** Never produce a low-value test to move a coverage number. If a symbol is a trivial passthrough, pure delegation, or you can't get the real implementation to run even with module-level mocking — skip it and report why.
5. **Match the repository's conventions** — same framework, directory layout, fixtures, naming, assertion style, and mocking approach as the existing tests.
6. **Never hardcode real secrets, tokens, or PII** into tests or fixtures; synthesize fake-but-realistic values.

---

## Budget & stop conditions

You run on metered, paid compute — be deliberate, not exhaustive.
- Work the highest-value targets first (Stage 2). Stop when the **marginal value of the next test drops below the effort to write it** — never grind toward a coverage number.
- Respect any run budget you're given (max targets / iterations / wall-clock). If you hit it, **ship what already cleared the quality bar and report the rest as un-attempted** rather than lowering the bar.
- "Attempts" below means *distinct debugging hypotheses* about why something fails — not cosmetic retries of the same approach.

---

## Stage 1 — Establish the coverage baseline

1. **Detect the stack and test tooling.** Identify the language(s), the test framework, the coverage tool, and how the suite is actually invoked — read CI config, `package.json`/`pyproject.toml`/`Makefile`/`go.mod`, and existing tests; don't guess.
2. **Run the full suite with coverage and confirm a green baseline.** Capture overall, per-file, and — critically — **per-line** coverage (the exact uncovered lines per file). This per-line map drives selection and iteration. **If you cannot establish a green, runnable baseline** after reasonable setup effort (deps won't install, missing services/secrets/env vars, unbuildable native code), **stop and report the blocker** ("Blocked: <reason>") — do not generate tests against an environment you can't run. That is a correct outcome.
3. **In a monorepo, scope this run to a single package/workspace** and its own framework; don't span packages.

---

## Stage 2 — Select what is worth testing

Don't try to test everything. Spend effort where tests matter most.

### Exclude outright (not worth testing)
- Test files, helpers/utilities, fixtures, mocks
- Pure type/interface definitions, constants/enums with no logic
- Re-export barrels, generated code, migrations, build/CI scripts, Storybook stories, config, pure data
- Symbols that only fetch/forward data from an external source with no transformation, validation, or decision
- Anything already at (or effectively at) full meaningful coverage

### Prioritize (highest value first)
The strongest candidates combine several signals:
- **Low coverage + high churn** — frequently changed, under-tested code is the highest-value target.
- **Low coverage + recently modified** — current, in-flight logic.
- **Low coverage + a history of bug fixes** — bug-prone code lacking a safety net.
- **High import fan-in + low coverage** — widely-depended-on code, large blast radius.
- **Genuine business logic** — transformation, algorithms, branching, error handling, validation, core domain.
- **Deprioritize code tested very recently** by a prior run of this agent — avoid redundant work.

### Per-symbol judgment
Keep only symbols with real logic worth exercising. Existing tests do **not** disqualify a symbol — select it if you have a concrete hypothesis of an untested edge case, branch, or likely bug. Produce a working list of (file, symbol) targets, ordered by priority.

> **⚙️ DECISION — `[USER_INPUT_PR_SIZE]`** · how to size/scope the PR. Keep one, delete the rest.
> - **(A) Saturate, depth-first (recommended)** — Take a focused set of files in one coherent area and drive each one to the point where **every remaining uncovered line is genuinely not worth testing, then stop.** Depth on a few files beats breadth across many: a PR that takes a few important files from low coverage to thoroughly covered is worth more than one that nudges twenty files up a little. Don't leave a chosen file half-covered to start another. One PR = one coherent module/area; split if it spans unrelated areas.
> - **(B) Breadth, spread-thin** — Add a few high-value tests across many under-covered files in the area to lift a broadly low baseline, without saturating any single file.
> - **(C) Fixed scope** — Limit to a specific target you define here (e.g. "only files under `src/<path>`", or "at most N files / N tests per PR").

In all cases: size for **reviewability** (one focused, coherent diff a reviewer can read in one sitting); if good tests exceed that, split into multiple coherent PRs. Never pad a PR to look bigger, and never drop a genuinely valuable test just to stay small — split instead.

> **⚙️ DECISION — `[USER_INPUT_SOURCE_FILES]`** · may the agent modify non-test (source) files? Keep one, delete the rest.
> - **(A) Test files only (recommended)** — Do not modify source/production files. If a source change is needed to make code testable, **describe it in the PR instead of making it.**
> - **(B) Source changes allowed when necessary** — You may make minimal source changes required for testability (e.g. exposing a seam), each clearly called out and justified in the PR.

---

## Stage 3 — Generate tests

For each target symbol, in priority order:

1. **Gather context first.** Read the symbol's full definition, its dependencies/collaborators, its types/inputs/outputs, any docstrings or comments describing intended behavior, and the existing tests for that file or nearby files (to mirror structure and reuse fixtures).
2. **Design scenarios with intent.** Cover the representative happy path, then the risk-bearing branches and edge cases: boundary conditions (empty/null/zero/negative/max), error/exception paths, unusual input combinations, and behavior the docstring/spec promises. Prefer a few high-signal scenarios over many redundant ones.
3. **Write the test exercising the real symbol**, mocking only its dependencies (Principles 1–2, 6). Keep each test focused and deterministic.
4. **Run the test. It must pass and be honest.** Debug genuine failures. If, after ~3–4 distinct debugging hypotheses, you can't get a meaningful test to pass for a symbol, skip it — don't force it. Do **not** weaken assertions or rewrite a test to match whatever the code happens to do just to go green.

> **Note on intent vs. implementation:** if you become convinced the implementation is genuinely wrong (it contradicts its own docstring/spec), surface the suspected bug in the PR description rather than encoding buggy behavior as "expected." Do not modify source to make a test pass (subject to `[USER_INPUT_SOURCE_FILES]`).

---

## Stage 4 — The quality bar: keep vs. discard

Every test must clear this bar before it ships. **Discard any that fails a check.**

**Keep a test only if it:**
- Imports and runs the **real** symbol under test (verified, not assumed).
- Has algorithmic distance — exercises real logic, not just mock-call verification.
- Has meaningful, specific assertions (not `expect(true).toBe(true)`-style vacuity; not merely "didn't throw" unless that's genuinely the contract).
- Passes reliably — run anything async/timing/randomness/ordering/IO-dependent **multiple times**; if results vary, fix the nondeterminism or discard it.
- Adds coverage or risk-protection not already provided by another test (dedupe to the single most comprehensive variant).

**Discard a test if it:**
- Mocks, stubs, or re-implements the **entire symbol under test** (a false positive that proves nothing).
- Only asserts a mocked dependency was called, with no computation/decision exercised.
- Restates the implementation as assertions (no algorithmic distance).
- Is flaky, or passes only because assertions are trivial/vacuous.
- Breaks unrelated existing tests in the same file.

Treat "skip" as a perfectly good outcome. Quantity never justifies lowering this bar.

---

## Stage 5 — Coverage-guided iteration loop

Close the loop repeatedly on the files you've chosen, following your `[USER_INPUT_PR_SIZE]` choice. In **saturate (depth-first)** mode, iterate on each file until it's **saturated** (every remaining uncovered line is genuinely not worth testing) and don't abandon a half-covered file to start another. In **breadth** mode, add the highest-value tests for a file and move on rather than exhaustively saturating it. In **fixed-scope** mode, stay within the caps you set (files or tests per PR) and stop once you hit them, even if coverage gaps remain. In every mode, don't declare victory after a single pass.

Repeat:
1. **Re-run coverage** with accepted tests applied; confirm each batch moved line/branch coverage.
2. **Recompute remaining uncovered lines** for your targeted symbols. For each with meaningful gaps: analyze the control flow/conditions needed to reach those lines, add scenarios that hit *different* uncovered paths, and put each new test through the **Stage 4 bar**.
3. **Skip uncovered lines that aren't worth it** — pure logging, comments, unreachable defensive branches, trivial passthroughs. Contorting tests to cover these produces exactly the hollow tests Principle 4 forbids.

**Stop iterating on a file when** its remaining uncovered lines aren't worth testing, OR more coverage would require lowering the quality bar, OR per-iteration gains have flattened to negligible. The loop improves coverage **without ever reducing quality** — if the only way to add coverage is a hollow test, stop.

---

## Stage 6 — Final verification

- Run the **entire** suite (not just new tests), confirm green with the new tests included; resolve interaction failures or discard the offending tests.
- Re-run anything nondeterministic a few more times to confirm stability.
- Run the repo's linter/formatter/type-checker over the new files.
- Confirm the final coverage delta and that every shipped test still clears the Stage 4 bar.

---

## Stage 7 — Open the pull request

1. **Branch** off the appropriate base (the repo's default unless context dictates otherwise), clearly named (e.g. `coverage-agent/<area>`).
2. **Commit only what's permitted** by `[USER_INPUT_SOURCE_FILES]` — always the new/modified test files and necessary test-only auxiliaries (fixtures, factories, setup).
3. **Write a reviewer-friendly PR** with: what & why (which symbols, and the signals that made them worth it); coverage impact (before → after, overall and per touched file); what was tested (scenarios/edge cases at a glance); deliberately skipped (and why); and flagged for human attention (suspected bugs, ambiguous intent, any source change you recommend but didn't make).

> **⚙️ DECISION — `[USER_INPUT_REVIEWERS]`** · who to request review from. Keep one, delete the rest.
> - **(A) Auto-assign from code ownership (recommended)** — Recent authors (git blame on touched files) and `CODEOWNERS`/maintainers; prefer reviewers who own *multiple* touched files (also a coherence check — if none span the files, your batch may be incoherent). Exclude bots and the PR author; keep the list small; assign on the PR.
> - **(B) Fixed reviewers** — Always request review from a set you specify here: `<names/handles>`.
> - **(C) No reviewers** — Open the PR unassigned.

4. **Keep the PR coherent** — one logical area per the `[USER_INPUT_PR_SIZE]` decision; split unrelated work into separate PRs.

---

## Definition of done

A successful run ends with an open PR that adds tests which genuinely exercise real code and would catch real regressions, measurably improves meaningful coverage on high-value code, contains **no** hollow/flaky/redundant/false-positive tests, leaves the full suite green, and is documented well enough to trust and merge quickly.

If there's no high-value coverage to add that clears the bar, **say so and don't open a low-value PR.** A clearly-explained no-op beats a PR full of hollow tests.

Step 2: Create the Automation

Schedule trigger (e.g. weekly, off-hours) → session running @coverage → opens a PR. Set a per-session ACU cap so each run is bounded.

Step 3: Tune it

Point file selection at the modules that matter (high churn × low coverage, bug-prone, high fan-in), match the cadence to your merge velocity, and aim runs at recently-merged code if you also want to cover the per-PR case cheaply.
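
As a rough illustration of combining those selection signals, the sketch below ranks files by uncovered fraction multiplied by a weight built from churn, bug-fix history, and fan-in. The `FileSignal` fields, the `priority` formula, and its weights are all assumptions made for this example, not something Tusk or Devin prescribes; tune or replace them for your repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileSignal:
    path: str
    coverage: float      # fraction of lines covered, 0.0 to 1.0
    commits_90d: int     # churn: commits touching the file in the last 90 days
    bugfix_commits: int  # commits whose message marks a bug fix
    fan_in: int          # number of modules that import this file


def priority(s: FileSignal) -> float:
    """Heuristic score (illustrative weights): uncovered fraction times a risk weight."""
    uncovered = 1.0 - s.coverage
    weight = 1 + s.commits_90d + 2 * s.bugfix_commits + s.fan_in
    return uncovered * weight


def rank(signals: list[FileSignal], limit: int = 5) -> list[str]:
    """Paths ordered by descending priority, dropping fully covered files."""
    ordered = sorted(signals, key=priority, reverse=True)
    return [s.path for s in ordered if priority(s) > 0][:limit]


signals = [
    FileSignal("billing/invoice.py", coverage=0.30, commits_90d=12, bugfix_commits=3, fan_in=8),
    FileSignal("utils/strings.py", coverage=0.95, commits_90d=1, bugfix_commits=0, fan_in=20),
    FileSignal("legacy/report.py", coverage=0.10, commits_90d=0, bugfix_commits=0, fan_in=1),
    FileSignal("api/auth.py", coverage=1.0, commits_90d=30, bugfix_commits=5, fan_in=15),
]
print(rank(signals))
# ['billing/invoice.py', 'legacy/report.py', 'utils/strings.py']
```

A ranked list like this can be pasted into the Playbook's fixed-scope option or supplied as Knowledge, so the scheduled run spends its budget on the files you care about.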

4. Tusk Drift → real-usage API tests

This is the least direct swap of the four. Tusk Drift recorded real traffic with a bespoke SDK and deterministically replayed it. In its place, use Devin to set up an API/integration test framework and generate smart test cases by inferring real usage from observability and application logs. The result is traditional tests with better-targeted inputs, run on a schedule like CoverBot.

Step 1: Connect a usage signal

Connect via MCP: the Sentry or Datadog MCP servers expose real requests, errors, and payload samples; application and access logs work too. The more real usage you can feed it, the better the tests, so connect as many sources as you reasonably can. The signal is usually partial, so the agent infers realistic cases and fills gaps by reading the code.
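
If access logs are your usage signal, the first pass is usually a tally of which endpoints are hit and with what status codes, with dynamic path segments collapsed so route variants group together. The sketch below assumes a simplified `<METHOD> <path> <status>` log line, which is an invented format for illustration; adapt the regex to whatever your logs contain. `normalize` and `endpoint_counts` are hypothetical helper names.

```python
import re
from collections import Counter

# Assumed log shape: "<METHOD> <path> <status>" per line (illustrative only).
LOG_LINE = re.compile(r"^(?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3})$")
UUID = re.compile(r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I)


def normalize(path: str) -> str:
    """Drop the query string and replace numeric or UUID segments with ':id'."""
    path = path.split("?", 1)[0]
    parts = [":id" if p.isdigit() or UUID.match(p) else p for p in path.split("/")]
    return "/".join(parts)


def endpoint_counts(lines) -> Counter:
    """Count (method, normalized path, status) triples; unparseable lines are skipped."""
    counts: Counter = Counter()
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m:
            counts[(m["method"], normalize(m["path"]), int(m["status"]))] += 1
    return counts


sample = [
    "GET /users/42 200",
    "GET /users/7 200",
    "GET /users/9999 404",
    "POST /orders?dry_run=1 201",
    "not a request line",
]
for key, n in endpoint_counts(sample).most_common():
    print(n, key)
```

The resulting counts show where traffic concentrates and which failure statuses occur in practice, which is the input the Playbook uses to pick test targets.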

Step 2: Save the prompt as a Playbook

Copy the API Test Agent Playbook prompt below into a Devin playbook; resolve its [USER_INPUT_USAGE_SOURCE] and [USER_INPUT_REVIEWERS] blocks. If there’s no API test framework yet, the prompt scaffolds a minimal one first.
api-traffic-test-agent.md
# API Test Agent — Integration Tests Informed by Real Usage

You are an autonomous engineering agent that writes and maintains **API / integration tests** for a service, using the project's **standard test framework** — but with test cases grounded in **how the API is actually used**, inferred from observability and application logs rather than guessed.

These are normal integration tests (the kind you'd write with the team's existing test library), just with smarter inputs and better-targeted cases. Most hand-written API tests cover whatever the author imagined; by mining real usage signals you instead cover the endpoints, payload shapes, and failure modes that genuinely occur in production.

> **This is not record-and-replay.** You are **inferring** realistic test cases from whatever usage data is available (which is usually partial) and writing ordinary tests against the service. The value is real-usage-informed coverage using a conventional framework — not deterministic replay of captured traces.

This agent runs **proactively** (e.g. on a schedule) — like a coverage run, not on a pull request. It generates tests and opens its own PR.

> **⚙️ Before running this prompt, resolve the `[USER_INPUT_USAGE_SOURCE]` (Stage 0) and `[USER_INPUT_REVIEWERS]` (Stage 5) decision blocks** — for the usage source keep one or more options, for reviewers keep one; delete the rest. Options marked **(recommended)** are sensible defaults.

> **Prerequisite — a test framework.** This needs a normal API/integration test framework that can drive the service and stub its dependencies. Use the existing one; if there isn't one, **scaffold a minimal one first** and get a single endpoint test running end-to-end before scaling up.

---

## Operating principles

1. **Let real usage pick the targets.** Which endpoints get hit, with what shapes, returning what statuses, failing in what ways — that decides what to test. Don't spread effort evenly; follow the traffic.
2. **Infer, don't replay.** Logs/observability rarely capture a full request + response + every dependency. Use them to understand realistic inputs and real failure modes, then write standard tests — reading the route handlers and schemas to fill in what the signal doesn't show.
3. **Use the team's framework and conventions.** Same test library, structure, fixtures, and mocking approach as the existing tests. Mock external dependencies the normal way a hand-written integration test would.
4. **Never commit real PII or secrets.** Real payloads contain sensitive data. Synthesize realistic-but-fake inputs modeled on the observed shapes; redact anything sensitive.
5. **Cover what actually breaks.** Prioritize the error cases and edge payloads that real traffic shows happening — not just the happy path, and not contrived edge cases that never occur.
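As an illustration of principle 4, here is a minimal sketch of turning an observed payload into a same-shaped fake one. The key list, the placeholder values, and the function name `synthesize` are assumptions for illustration; a real schema will need its own list of sensitive fields, and free-text fields such as names still need manual review.

```python
import re

# Hypothetical set of sensitive field names; extend it for your own schema.
SENSITIVE_KEYS = {"password", "token", "ssn", "phone", "authorization"}
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def synthesize(value, key=""):
    """Return a copy of an observed payload with the same shape but fake values."""
    if isinstance(value, dict):
        return {k: synthesize(v, k) for k, v in value.items()}
    if isinstance(value, list):
        return [synthesize(v, key) for v in value]
    # Replace anything that is, or looks like, an email address.
    if key.lower() == "email" or (isinstance(value, str) and EMAIL_RE.fullmatch(value)):
        return "user@example.com"
    # Replace values stored under known-sensitive keys.
    if key.lower() in SENSITIVE_KEYS:
        return "REDACTED"
    # Everything else is kept as-is, so it still needs a human or agent review.
    return value
```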

---

## Stage 0 — Establish the framework and the usage signal

Confirm the test framework (or scaffold one, per the prerequisite above), then connect a usage signal.

> **⚙️ DECISION — `[USER_INPUT_USAGE_SOURCE]`** · where real usage data comes from. Keep one (or more), delete the rest, and fill in specifics.
> - **(A) Observability tool** — pull traced requests, errors, breadcrumbs, and payload samples from `<Sentry / Datadog / other>`. Lowest-effort; partial fidelity (often no full bodies or dependency calls).
> - **(B) Application / access logs** — mine request/response logging at `<location/source>`. Fidelity depends on what's logged.
> - **(C) Lightweight recording middleware** — add interception at the service boundary to capture inbound request + response + outbound dependency calls for a sampled fraction of traffic. Highest fidelity; requires a small instrumentation step.

Whatever the source, **note its fidelity**: what it captures (endpoints, params, status codes, error messages) and what it doesn't (full bodies, dependency calls) — and plan to fill gaps by reading the code.
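For option (C), a minimal sketch of sampled recording at the service boundary, assuming the service is a Python WSGI app. It records request metadata and the status code only; capturing bodies and outbound dependency calls needs further instrumentation plus redaction. The class name `RecordingMiddleware` and the `sink` callable are hypothetical.

```python
import random


class RecordingMiddleware:
    """Wrap a WSGI app and record metadata for a sampled fraction of requests."""

    def __init__(self, app, sink, sample_rate=0.01, rng=random.random):
        self.app = app
        self.sink = sink                # any callable that accepts one dict per request
        self.sample_rate = sample_rate  # fraction of requests to record, 0.0 to 1.0
        self.rng = rng                  # injectable so sampling can be tested

    def __call__(self, environ, start_response):
        # Unsampled requests pass straight through with no overhead.
        if self.rng() >= self.sample_rate:
            return self.app(environ, start_response)

        record = {
            "method": environ.get("REQUEST_METHOD"),
            "path": environ.get("PATH_INFO"),
            "query": environ.get("QUERY_STRING", ""),
        }

        def recording_start_response(status, headers, exc_info=None):
            # The status line looks like "201 Created"; keep only the numeric code.
            record["status"] = int(status.split(" ", 1)[0])
            self.sink(record)
            return start_response(status, headers, exc_info)

        return self.app(environ, recording_start_response)
```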

---

## Stage 1 — Mine real usage

From the usage signal, build a picture of the live API surface:
- **Endpoints & methods** actually exercised (normalize dynamic path segments so route variants group together).
- **Realistic inputs:** the param/body shapes that show up — required vs optional fields, common value ranges, payload variants.
- **Outcomes:** which status codes each endpoint returns in practice, and the **real error cases** (validation failures, auth errors, downstream failures) that actually occur.
- **Auth and header patterns** the endpoints expect.

Where the signal is thin, **read the code** — route handlers, request/response schemas, validation — to ground the inference. The goal is an accurate sense of what real, in-scope requests look like, not a perfect recording.
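The path normalization and grouping described in this stage can be sketched as follows. This is an illustration under stated assumptions: the record fields (`method`, `path`, `status`) are a hypothetical log shape, and only numeric and UUID segments are treated as dynamic.

```python
import re
from collections import Counter

UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I
)


def normalize_path(path):
    """Collapse dynamic segments so /users/41 and /users/42 group together."""
    segments = []
    for seg in path.split("?")[0].strip("/").split("/"):
        if seg.isdigit() or UUID_RE.match(seg):
            segments.append("{id}")
        else:
            segments.append(seg)
    return "/" + "/".join(segments)


def summarize(records):
    """Count observed (method, route, status) combinations."""
    counts = Counter()
    for r in records:
        counts[(r["method"], normalize_path(r["path"]), r["status"])] += 1
    return counts
```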

---

## Stage 2 — Prioritize what to test

First, **check what's already tested.** Inspect the existing test suite (and coverage data if available) to see which endpoints/handlers already have solid integration tests — don't re-test those. Then rank what remains:
- **High-traffic AND under-tested first** — the sharpest signal: endpoints real users hit hard that have no existing safety net. This is the top priority.
- **High-traffic and business-critical** routes generally — the ones real users depend on.
- **Real error/edge cases** observed in the signal (a malformed payload that actually showed up, an auth path real clients hit) — higher value than imagined edge cases.
- **Payload variants seen in the wild** — the optional-field-present case, the alternate shape, the large input — when they exercise different handler logic.
- Skip static/health endpoints, routes with no real logic, and **endpoints already well-covered by existing tests**.
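One way to express the ranking above in code, as a rough sketch: drop covered and trivial routes, then order what remains by traffic, using observed error count as a tie-breaker. The input shapes and the default `skip` routes are assumptions; business criticality is not modeled and would need a separate signal.

```python
def rank_endpoints(traffic, covered, skip=("/health", "/metrics")):
    """Order untested endpoints by observed traffic, then by observed errors.

    traffic: {(method, route): {"hits": int, "errors": int}}
    covered: set of (method, route) pairs that already have solid tests
    """
    candidates = [
        key for key in traffic
        if key not in covered and key[1] not in skip
    ]
    return sorted(
        candidates,
        key=lambda k: (-traffic[k]["hits"], -traffic[k]["errors"]),
    )
```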

---

## Stage 3 — Write the tests

For each prioritized case, write a standard integration test that:
- **Drives the endpoint** through the team's test client (real HTTP-level test against the running handler — exercise the real handler logic, never a mock of it).
- **Uses realistic inputs** modeled on the observed shapes: synthesized values, never real PII/secrets copied from the signal.
- **Mocks external dependencies** (DB, downstream services) the conventional way the existing tests do, with return values informed by what the signal suggests those dependencies returned.
- **Asserts on the meaningful response**: the status code and the behaviorally significant response fields. For volatile values (timestamps, generated IDs), assert on shape or stable properties rather than exact values.
- Covers both the real happy path and the real failure modes for that endpoint.

Keep tests deterministic and integrate them into the existing suite structure.
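To make the shape of such a test concrete, here is a self-contained sketch using only the Python standard library. Everything in it is hypothetical: the `/orders` service, the `payments` client, and the `post_json` helper, which stands in for whatever test client the team's framework provides. In a real repo, use the existing client and mocking conventions instead.

```python
import io
import json
from unittest import mock
from wsgiref.util import setup_testing_defaults


def make_app(payments):
    """Hypothetical service under test: POST /orders charges via a payments client."""
    def app(environ, start_response):
        size = int(environ.get("CONTENT_LENGTH") or 0)
        payload = json.loads(environ["wsgi.input"].read(size) or b"{}")
        if not isinstance(payload.get("qty"), int) or payload["qty"] < 1:
            status, body = "422 Unprocessable Entity", {"error": "qty must be a positive integer"}
        else:
            charge = payments.charge(sku=payload["sku"], qty=payload["qty"])
            status, body = "201 Created", {"order_id": charge["id"], "qty": payload["qty"]}
        start_response(status, [("Content-Type", "application/json")])
        return [json.dumps(body).encode()]
    return app


def post_json(app, path, payload):
    """Minimal stand-in for a framework test client: returns (status, parsed body)."""
    raw = json.dumps(payload).encode()
    environ = {"REQUEST_METHOD": "POST", "PATH_INFO": path,
               "CONTENT_LENGTH": str(len(raw)), "wsgi.input": io.BytesIO(raw)}
    setup_testing_defaults(environ)  # fills in the remaining required WSGI keys
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = int(status.split()[0])

    chunks = app(environ, start_response)
    return captured["status"], json.loads(b"".join(chunks))


def test_create_order_happy_path():
    payments = mock.Mock()                            # external dependency is mocked
    payments.charge.return_value = {"id": "ch_123"}   # synthesized, not a real ID
    status, body = post_json(make_app(payments), "/orders", {"sku": "A1", "qty": 2})
    assert status == 201
    assert body["qty"] == 2
    assert isinstance(body["order_id"], str)          # generated ID: assert shape only
    payments.charge.assert_called_once_with(sku="A1", qty=2)


def test_create_order_rejects_bad_qty():
    payments = mock.Mock()
    status, body = post_json(make_app(payments), "/orders", {"sku": "A1", "qty": 0})
    assert status == 422
    assert "qty" in body["error"]
    payments.charge.assert_not_called()               # validation fails before the charge
```

The handler itself runs unmocked in both tests; only the downstream `payments` client is replaced, which is the distinction Stage 4 checks for.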

---

## Stage 4 — Quality bar

Keep a test only if it:
- **Exercises the real handler logic.** Discard any test that mocks or re-implements the whole endpoint under test; that is a false positive that proves nothing.
- Is **deterministic.** Run anything order-, time-, or IO-sensitive a few times; fix the nondeterminism or drop the test.
- Asserts something **meaningful**.
- Contains **no real secrets/PII**.
- Follows repo conventions.
- Reflects a **real-usage** case, not a contrived one.

Discard flaky, trivial, or redundant tests: a focused set covering real cases beats volume.

Before delivering, run the **full** integration suite (not just your new tests), confirm it's green with the new tests included, and run the project's linter/formatter over the new files.
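The "run it a few times" determinism check can be approximated with a small helper like the one below. This is a sketch only: `passes_consistently` is a hypothetical name, and re-running in-process will not surface order dependencies between tests, so prefer the project's own test runner where it supports repeated or shuffled runs.

```python
def passes_consistently(test_fn, runs=5):
    """Run a zero-argument test callable several times; True only if every run passes."""
    for attempt in range(1, runs + 1):
        try:
            test_fn()
        except Exception as exc:  # any failure on any run disqualifies the test
            print(f"run {attempt}/{runs} failed: {exc!r}")
            return False
    return True
```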

---

## Stage 5 — Deliver

Like a coverage run, **open your own PR** with the new tests: branch off the base, commit the new test files (plus any necessary test-only fixtures), and open a PR. If you added recording middleware (`[USER_INPUT_USAGE_SOURCE]` option C), include those instrumentation/source changes in the PR and call them out separately from the tests, or propose them as their own PR — they are not test-only.

> **⚙️ DECISION — `[USER_INPUT_REVIEWERS]`** · who to request review from. Keep one, delete the rest.
> - **(A) Auto-assign from handler ownership (recommended)** — the recent authors (git blame) and `CODEOWNERS`/maintainers of the **endpoint handler/route files** your tests exercise; prefer owners who span *multiple* tested endpoints (the service's maintainers). Exclude bots and yourself; keep the list small; assign on the PR.
> - **(B) Fixed reviewers** — always request review from a set you specify here: `<names/handles>`.
> - **(C) No reviewers** — open the PR unassigned.

Include a short summary: which endpoints are now covered, which real-usage patterns informed the cases (e.g. "covers the validation-error path seen frequently in logs"), where the usage signal was too thin to be confident (so a reviewer knows the gaps), and a note confirming no real secrets/PII were used. If usage will keep evolving, note in the summary which endpoints/patterns are worth re-checking on a future run.
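For reviewer option (A), the selection logic can be sketched as below, assuming you have already built a mapping from each tested handler file to its owners (from git blame and `CODEOWNERS`). The mapping shape, the `[bot]` suffix check, and the default limit are assumptions for illustration.

```python
from collections import Counter


def pick_reviewers(owners_by_file, exclude=(), limit=2):
    """Prefer owners who span the most tested handler files; skip bots and excluded users."""
    counts = Counter()
    for owners in owners_by_file.values():
        for owner in set(owners):  # count each owner once per file
            if owner in exclude or owner.endswith("[bot]"):
                continue
            counts[owner] += 1
    # Most files first; alphabetical order keeps ties deterministic.
    ranked = sorted(counts, key=lambda o: (-counts[o], o))
    return ranked[:limit]
```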

**Definition of done:** the service's important, real-world-exercised endpoints have standard, deterministic integration tests covering their actual happy paths and failure modes, no sensitive data shipped, and the gaps are stated rather than hidden.
3

Create the Automation

Run it on a Schedule trigger, like CoverBot. It opens its own PR with the new tests.

Best practices

  • Devin is usage-priced. On self-serve plans, that means included quota plus on-demand credits at API pricing; enterprise plans use ACUs. Keep runs predictable by setting a per-session ACU cap, limiting when automations run, and favoring scheduled jobs over per-commit PR triggers.
  • The prompts above are meant to be edited. Use them as a starting point for your Playbooks, add repo-specific context, connect the tools you need through MCP, and use trigger conditions to decide when Devin should run. You can also prompt Devin to spin up multiple Devins to test and review separate units of work in parallel.
  • Start with Devin Review and the scheduled coverage automation. Add PR test generation and Drift-style API testing once you have solidified the test automation approach for your SDLC.