Benchmarking prompts using Sand-Bob

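You can use Sand-Bob to benchmark different prompts: run each prompt several times and measure how often the generated code returns the expected result. This might be useful for automated prompt engineering.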

prompts = [
    "Count the number of b's in blueberry.",
    "Count the number of b's in blueberry and print the number only.",
    "I would like to know the number of B in blueberry."
]
ground_truth = 2

model = "gemma3:4b"

from sand_bob import generate_code, config_llms

prompt_performance = {}
for prompt in prompts:
    config_llms(model=model)

    results = generate_code(prompt,
                  n_parallel=3,
                  n_iterative=2,
                  n_codefix_attempts=2,
                  n_feedback_iterations=0, # do not use a vision-model
                  final_touch=False, # do not beautify the final notebook, we just need the result
    )

    print(prompt, "results:", [str(r.final_result)[:10] for r in results])

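    # count the runs whose final result can be read as the expected integer
    correct = 0
    for r in results:
        try:
            if int(r.final_result) == ground_truth:
                correct += 1
        except (TypeError, ValueError, OverflowError):
            # the result was not a plain number (e.g. free text), count it as a failure
            pass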

    prompt_performance[prompt] = correct / len(results)
    print("Success rate:", prompt_performance[prompt])

Count the number of b's in blueberry. results: ['Final resu', 'Final resu', 'Final Resu', 'Final resu', 'Final resu', 'Final Resu']
Success rate: 0.0

Count the number of b's in blueberry and print the number only. results: ['2.0', '2.0', '2.0', '2.0', '2.0', '2.0']
Success rate: 1.0

I would like to know the number of B in blueberry. results: ['Final resu', 'Final resu', 'Final resu', "2 'B's", 'Final resu', '0.0']
Success rate: 0.0
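
The strict `int(...)` check above counts every result that is not a bare number as a failure, which is why an answer such as "2 'B's" does not score. If you would rather measure whether the correct number appears at all, a more lenient scorer could look like the following sketch. Note that `extract_number` is a hypothetical helper written for this illustration and is not part of Sand-Bob, and the string "Final result: 2" is a made-up example, since the printed results above are truncated to 10 characters.

```python
import re

def extract_number(value):
    """Return the first whole number found in value, or None if there is none.

    Hypothetical helper for illustration only, not part of Sand-Bob.
    """
    if isinstance(value, bool):
        return None  # True/False are not meaningful counts
    if isinstance(value, (int, float)):
        # accept 2 and 2.0, reject 2.5, NaN and infinity
        return int(value) if float(value).is_integer() else None
    # search free text for the first integer or decimal number
    match = re.search(r"-?\d+(?:\.\d+)?", str(value))
    if match is None:
        return None
    number = float(match.group())
    return int(number) if number.is_integer() else None

ground_truth = 2
# 2.0 and "2 'B's" appear in the output above; "Final result: 2" is a made-up example
for candidate in [2.0, "2 'B's", "Final result: 2", "no number here"]:
    print(repr(candidate), "->", extract_number(candidate) == ground_truth)
```

This takes the first number it finds, so a text that mentions another number before the answer would be scored incorrectly. The strict check remains the better choice when the goal is to reward prompts that make the model return the number only.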

The prompt which performed best was:

max(prompt_performance, key=prompt_performance.get)

"Count the number of b's in blueberry and print the number only."
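
`max` returns a single prompt only, and if several prompts share the top score it returns the first one it encounters. To see how all prompts compare, you could sort the dictionary by success rate, as in this small sketch. The dictionary is filled in by hand with the success rates printed above so that the snippet runs on its own.

```python
# success rates copied from the benchmark output above
prompt_performance = {
    "Count the number of b's in blueberry.": 0.0,
    "Count the number of b's in blueberry and print the number only.": 1.0,
    "I would like to know the number of B in blueberry.": 0.0,
}

# sort prompts from best to worst; sorted() is stable, so ties keep their original order
ranking = sorted(prompt_performance.items(), key=lambda item: item[1], reverse=True)

for prompt, rate in ranking:
    print(f"{rate:.0%}  {prompt}")
```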