Nudge

Testing and Improving Prompts with Evalite

A/B test Nudge prompt variants and iteratively improve them using Evalite.

Nudge and Evalite play complementary roles in this workflow:

  • Nudge defines prompt structure and variants
  • Evalite executes those variants on real data and scores behavior

A/B Testing Prompt Variants

Variants in Nudge represent controlled changes on top of a shared base prompt.

import { prompt } from "@nudge-ai/core";

export const summarizerPrompt = prompt("summarizer", (p) =>
  p
    .persona("expert summarizer")
    .input("text to summarize")
    .output("concise summary")
    .variant("short", (v) =>
      v.constraint("limit to 1–2 sentences")
    )
    .variant("detailed", (v) =>
      v
        .do("explain context")
        .do("include examples")
    )
);

Each variant produces a distinct system prompt while keeping intent consistent.
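To make the layering idea concrete, here is a small stand-alone sketch. It does not use or reproduce Nudge's internals: `PromptSpec`, `applyVariant`, and `render` are hypothetical names chosen for illustration, and the rendered text format is an assumption.

```typescript
// Illustrative model only: NOT Nudge's implementation or API.
type PromptSpec = {
  persona: string;
  input: string;
  output: string;
  rules: string[]; // constraints and "do" instructions
};

// The shared base that every variant starts from.
const base: PromptSpec = {
  persona: "expert summarizer",
  input: "text to summarize",
  output: "concise summary",
  rules: [],
};

// Each variant contributes extra rules on top of the base.
const variants: Record<string, string[]> = {
  short: ["limit to 1-2 sentences"],
  detailed: ["explain context", "include examples"],
};

// Copy the base and append the variant's rules, leaving the base untouched.
function applyVariant(spec: PromptSpec, name: string): PromptSpec {
  const extra = variants[name];
  if (extra === undefined) throw new Error(`Unknown variant: ${name}`);
  return { ...spec, rules: [...spec.rules, ...extra] };
}

// Flatten a spec into system-prompt text.
function render(spec: PromptSpec): string {
  return [
    `Persona: ${spec.persona}`,
    `Input: ${spec.input}`,
    `Output: ${spec.output}`,
    ...spec.rules.map((rule) => `- ${rule}`),
  ].join("\n");
}
```

In this model, both rendered prompts share the persona, input, and output lines and differ only in the appended rules, which is what makes a comparison between them a controlled one.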

Comparing Variants with Evalite

Evalite’s evalite.each() API accepts a list of named variants, so each Nudge variant name maps directly to one Evalite variant.

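import { evalite } from "evalite";
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
import { exactMatch } from "evalite/scorers";
import { summarizerPrompt } from "./summarizer.prompt";
import "./prompts.gen"; // side-effect import of the generated prompts

// One Evalite variant per Nudge variant name.
evalite.each(
  summarizerPrompt.variantNames.map((name) => ({
    name,
    input: { variant: name },
  }))
)("Summarizer Variants", {
  data: async () => [
    {
      input: "Climate change refers to long-term shifts...",
      expected: "Climate change involves long-term shifts...",
    },
  ],

  // `input` is the data row's input string; `variant` is the object passed
  // as the variant's `input` above.
  task: async (input, variant) => {
    const result = await generateText({
      model: openai("gpt-4o-mini"),
      system: summarizerPrompt.toString({ variant: variant.variant }),
      prompt: input,
    });
    // Return the generated text so the scorer compares strings.
    return result.text;
  },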

  scorers: [
    {
      scorer: ({ output, expected }) =>
        exactMatch({ actual: output, expected }),
    },
  ],
});

All variants are evaluated on the same inputs, making comparisons repeatable.
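As its name suggests, exactMatch only gives credit when the output reproduces the reference verbatim, which free-form summaries rarely do, so every variant may score the same. A softer scorer can separate variants more usefully. The sketch below is one possible approach and is not part of Evalite: `tokenize` and `tokenOverlapF1` are hypothetical helpers that compute a word-overlap F1 score between 0 and 1.

```typescript
// Hypothetical helpers, not part of Evalite.

// Lowercase the text and split it into word tokens.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9']+/g) ?? [];
}

// Word-overlap F1 between the model output and the reference, in [0, 1].
function tokenOverlapF1(output: string, expected: string): number {
  const out = tokenize(output);
  const ref = tokenize(expected);
  if (out.length === 0 || ref.length === 0) return 0;

  // Count reference tokens so a repeated word is matched at most as often
  // as it occurs in the reference.
  const remaining = new Map<string, number>();
  for (const token of ref) {
    remaining.set(token, (remaining.get(token) ?? 0) + 1);
  }

  let overlap = 0;
  for (const token of out) {
    const left = remaining.get(token) ?? 0;
    if (left > 0) {
      overlap++;
      remaining.set(token, left - 1);
    }
  }
  if (overlap === 0) return 0;

  const precision = overlap / out.length;
  const recall = overlap / ref.length;
  return (2 * precision * recall) / (precision + recall);
}
```

Inside the scorer callback, the exactMatch call could be swapped for `tokenOverlapF1(output, expected ?? "")`, assuming Evalite accepts a plain number between 0 and 1 from an inline scorer.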

From A/B Results to Prompt Contracts

Evalite answers what performs better in practice. Nudge enforces what must never regress.

Once Evalite exposes stable failure modes (length, tone, missing facts), encode them directly:

.test(
  "Climate change refers to long-term shifts...",
  (output) => output.split(/\s+/).length <= 150,
  "Must stay under 150 words"
);
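The same pattern extends to the other failure modes listed above. The helpers below are hypothetical and not part of Nudge; they assume only what the snippet shows, namely that the second argument to .test() is a function from the output string to a boolean. The tone check is a crude phrase-based proxy, not a real tone classifier.

```typescript
// Hypothetical predicate builders; not part of Nudge.
type Check = (output: string) => boolean;

// Length: at most `limit` whitespace-separated words.
// trim() and filter(Boolean) keep empty tokens from inflating the count.
const maxWords = (limit: number): Check => (output) =>
  output.trim().split(/\s+/).filter(Boolean).length <= limit;

// Missing facts: every required term appears (case-insensitive).
const mentionsAll = (terms: string[]): Check => (output) => {
  const lower = output.toLowerCase();
  return terms.every((term) => lower.includes(term.toLowerCase()));
};

// Tone (crude proxy): none of the banned phrases appear.
const avoids = (phrases: string[]): Check => (output) => {
  const lower = output.toLowerCase();
  return !phrases.some((phrase) => lower.includes(phrase.toLowerCase()));
};

// Combine checks so a single test can assert several properties at once.
const allOf = (...checks: Check[]): Check => (output) =>
  checks.every((check) => check(output));
```

For example, `allOf(maxWords(150), mentionsAll(["climate"]))` could be passed as the predicate in place of the inline word-count function.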

Self-Improvement

Nudge can automatically refine prompts based on failing tests:

npx @nudge-ai/cli improve --judge
