A look at exactly how every prompt in Promptifi gets graded, run, and re-run before it lands in your library — and what happens when it falls below the line.
The internet is drowning in prompts no one ever ran in a real deal. We built the testing protocol because we hated trying them ourselves and finding out, mid-call, that nothing came out the other end.
Three things we don’t do: ship from theory, score with an AI judge, or publish anything we wouldn’t run on our own pipeline.
Every prompt is run against an actual prospect, an actual call transcript, an actual CRM record. Synthetic test cases are too forgiving.
Output gets a five-point score from the rep who ran it. AI-on-AI evaluation is where prompt quality goes to die.
One opinion isn’t a signal. Three across different verticals starts to be. We average and weight by deal-stage match.
Models drift, sales tactics evolve, what worked in Q1 may flop in Q3. Every published prompt cycles back through the protocol.
Same five gates for every prompt — discovery question, cold email, MEDDPICC builder, renewal play. No shortcuts, no special cases.
An in-house operator or beta tester writes V1 against a specific deal scenario. Stage-tagged from day one. Time-boxed to 20 minutes.
∼ 20 minOutput is reviewed against three reference deals. Obvious failures get killed here — hallucinated company facts, generic copy, wrong stage tone.
internalThree to five testers run the prompt in live pipeline. Output goes into actual emails, calls, CRM notes. They log time saved and outcome.
5 to 14 daysFive-criterion rubric averaged across testers. Below 4.0 goes back to revision. Above and it earns the “Tested” flag.
≥ 4.0 to shipGoes live in the library with tester names and scores. Auto-flagged for re-test 90 days later. Score below 3.5 on retest — pulled.
quarterlyCumulative numbers since we started recording the protocol.
Output sounded like a debt collector. Two of three CSMs said they’d never send it. We tried three rewrites — none cleared 3.0 on tone fit.
Hallucinated stats, fabricated quotes, sounded like every other AI thread. Wasn’t a sales workflow — bad fit for the library.
Shipped at 4.1 last quarter. Re-test scored 3.2 — reply rates collapsed once the pattern got common across the wider market.
Or apply to be one of the reps who tests them. Both take less than three minutes.