How We Test · Sales AI Prompts

If it didn’t close, it didn’t ship.

A look at exactly how every prompt in Promptifi gets graded, run, and re-run before it lands in your library — and what happens when it falls below the line.

Most prompt libraries are generated. Ours is tested.

The internet is drowning in prompts no one ever ran in a real deal. We built the testing protocol because we were tired of trying those prompts ourselves and finding out, mid-call, that the output was unusable.

Three things we don’t do: ship from theory, score with an AI judge, or publish anything we wouldn’t run on our own pipeline.

i

Tested in real deals, not test data

Every prompt is run against an actual prospect, an actual call transcript, an actual CRM record. Synthetic test cases are too forgiving.

ii

Graded by the rep, not the model

Output gets a five-point score from the rep who ran it. AI-on-AI evaluation is where prompt quality goes to die.

iii

Three runs minimum, five preferred

One opinion isn’t a signal. Three across different verticals starts to be. We average and weight by deal-stage match.

iv

Re-tested every quarter

Models drift, sales tactics evolve, and what worked in Q1 may flop in Q3. Every published prompt cycles back through the protocol.
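To make "average and weight by deal-stage match" concrete, here is a minimal Python sketch. The weighting scheme is an assumption for illustration: each run carries a hypothetical `stage_match` weight between 0 and 1 describing how closely the tester's deal stage matched the prompt's stage tag, and the rep's score is assumed to run from 1 to 5. The names `Run`, `stage_match`, and `weighted_score` are ours, not Promptifi's internals.

```python
from dataclasses import dataclass

MIN_RUNS = 3  # "three runs minimum" from the principle above


@dataclass
class Run:
    tester: str
    vertical: str
    score: float        # rep's five-point score (assumed 1 to 5)
    stage_match: float  # hypothetical weight in (0, 1]: 1.0 = exact stage match


def weighted_score(runs: list[Run]) -> float:
    """Average rep scores, weighting each run by its deal-stage match."""
    if len(runs) < MIN_RUNS:
        raise ValueError(f"need at least {MIN_RUNS} runs, got {len(runs)}")
    total_weight = sum(r.stage_match for r in runs)
    if total_weight <= 0:
        raise ValueError("stage_match weights must sum to a positive number")
    return sum(r.score * r.stage_match for r in runs) / total_weight


# Illustrative data only: three runs across different verticals.
runs = [
    Run("rep_a", "saas", score=5, stage_match=1.0),
    Run("rep_b", "fintech", score=4, stage_match=1.0),
    Run("rep_c", "logistics", score=3, stage_match=0.5),
]
print(round(weighted_score(runs), 2))  # 4.2
```

In this sketch the off-stage run counts half as much as the other two, so it pulls the average down less than a plain mean would (4.2 versus 4.0).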

The five-step protocol

From draft to library.

Same five gates for every prompt — discovery question, cold email, MEDDPICC builder, renewal play. No shortcuts, no special cases.

Step 01

Draft

An in-house operator or beta tester writes V1 against a specific deal scenario. Stage-tagged from day one. Time-boxed to 20 minutes.

∼ 20 min

Step 02

Dry run

Output is reviewed against three reference deals. Obvious failures get killed here — hallucinated company facts, generic copy, wrong stage tone.

internal

Step 03

Field test

Three to five testers run the prompt in live pipeline. Output goes into actual emails, calls, CRM notes. They log time saved and outcome.

5 to 14 days

Step 04

Score & revise

Five-criterion rubric averaged across testers. Below 4.0 goes back to revision. At 4.0 or above, it earns the “Tested” flag.

≥ 4.0 to ship

Step 05

Publish & re-run

Goes live in the library with tester names and scores. Auto-flagged for re-test 90 days later. If it scores below 3.5 on re-test, it gets pulled.

quarterly
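
The scoring gates in steps 04 and 05 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production pipeline: the thresholds (4.0 to ship, 3.5 to stay, 90 days to re-test) come from the steps above, while the function names and the choice to average each tester's five criteria first and then average across testers are ours.

```python
from datetime import date, timedelta
from statistics import mean

SHIP_THRESHOLD = 4.0                  # step 04: at or above this, the prompt ships
RETEST_FLOOR = 3.5                    # step 05: below this on re-test, it is pulled
RETEST_INTERVAL = timedelta(days=90)  # step 05: auto-flag for re-test


def rubric_average(tester_scores: list[list[float]]) -> float:
    """Average a five-criterion rubric per tester, then across testers (assumed order)."""
    for scores in tester_scores:
        if len(scores) != 5:
            raise ValueError("each tester must score exactly five criteria")
    return mean(mean(scores) for scores in tester_scores)


def field_test_decision(avg: float) -> str:
    """Return 'ship' at or above the threshold, otherwise 'revise'."""
    return "ship" if avg >= SHIP_THRESHOLD else "revise"


def retest_due(published_on: date) -> date:
    """Date on which a published prompt is flagged for re-test."""
    return published_on + RETEST_INTERVAL


def retest_decision(avg: float) -> str:
    """Return 'pull' below the re-test floor, otherwise 'keep'."""
    return "pull" if avg < RETEST_FLOOR else "keep"


# Illustrative scores from three testers, five criteria each.
avg = rubric_average([[4, 4, 5, 4, 4], [5, 4, 4, 4, 5], [4, 4, 4, 4, 4]])
print(round(avg, 2), field_test_decision(avg))  # 4.2 ship
print(retest_due(date(2024, 1, 1)))             # 2024-03-31
print(retest_decision(3.2))                     # pull
```

Note the two thresholds differ on purpose in this sketch: a prompt needs 4.0 to get in, but only loses its place if it drops below 3.5 on re-test.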

The protocol, by the numbers.

Cumulative numbers since we started recording the protocol.

3,420 · Prompts entered field testing
2,000+ · Shipped to the library
~29% · Pass rate (of field-tested)
4.3 · Average score at publish

What gets killed

The prompts we didn’t ship.

Killed at field test

“Aggressive renewal nudge”

Output sounded like a debt collector. Two of three CSMs said they’d never send it. We tried three rewrites — none cleared 3.0 on tone fit.

Final score · 2.4 / 5.0

Killed at dry run

“LinkedIn post generator”

Hallucinated stats, fabricated quotes, sounded like every other AI thread. Wasn’t a sales workflow — bad fit for the library.

Final score · n/a

Pulled at retest

“Cold email opener — generic”

Shipped at 4.1 last quarter. Re-test scored 3.2 — reply rates collapsed once the pattern got common across the wider market.

Pulled · superseded

Want to see the tested prompts?

Or apply to be one of the reps who tests them. Both take less than three minutes.