Benchmarking LLMs using Sand-Bob

You can use Sand-Bob to benchmark how well different language models solve coding tasks. This notebook demonstrates how. First, we define the task and the expected result.

prompt = "Count the number of b's in blueberry and print the number only."
ground_truth = 2
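
As a quick sanity check, the expected result can be derived with plain Python rather than counted by hand. This is a minimal sketch; the names `word`, `letter` and `expected_count` are illustrative and not part of the notebook.

```python
# Derive the expected answer directly from the word in the prompt.
word = "blueberry"
letter = "b"

expected_count = word.count(letter)  # number of non-overlapping occurrences
print(expected_count)
```

The value matches the `ground_truth` defined above.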

Next, we query the local OpenAI-compatible server for the models it offers.

import openai

# Connect to the local endpoint; the API key is only a placeholder value.
client = openai.OpenAI(base_url="http://localhost:11434/v1", api_key="none")

print("\n".join([model.id for model in client.models.list().data]))
gemma3:4b
mistral:latest
embeddinggemma:latest
mxbai-embed-large:latest
qwen:0.5b
phi3:instruct
phi3:latest
llama3.1:latest
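
The listing contains embedding models, which are not candidates for a code-generation benchmark. The sketch below filters them out by name. It assumes, as a heuristic only, that embedding models carry "embed" in their id; `is_probably_generative` is a hypothetical helper, and the ids are copied from the listing above.

```python
# Model ids copied from the listing above.
available_models = [
    "gemma3:4b",
    "mistral:latest",
    "embeddinggemma:latest",
    "mxbai-embed-large:latest",
    "qwen:0.5b",
    "phi3:instruct",
    "phi3:latest",
    "llama3.1:latest",
]


def is_probably_generative(model_id: str) -> bool:
    """Heuristic (assumption): embedding models have 'embed' in their name."""
    return "embed" not in model_id.lower()


# Keep only the models that are likely able to generate text.
candidates = [m for m in available_models if is_probably_generative(m)]
print("\n".join(candidates))
```

A name-based filter can misclassify models, so the resulting list is best treated as a starting point for a manual selection.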
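
From this list, we select three models to compare.

llms_to_test = ["gemma3:4b", "mistral", "qwen:0.5b"]

For each model, we generate code for the prompt and count how often the final result matches the ground truth. With n_parallel=3 and n_iterative=2, each model yields six results here.

from sand_bob import generate_code, config_llms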

model_performance = {}
for model in llms_to_test:
    config_llms(model=model)

    results = generate_code(
        prompt,
        n_parallel=3,
        n_iterative=2,
        n_codefix_attempts=2,
        n_feedback_iterations=0,  # do not use a vision model
        final_touch=False,  # do not beautify the final notebook; we only need the result
    )

    print(model, "results:", [str(r.final_result)[:10] for r in results])

    correct = 0
    for r in results:
        try:
            if int(r.final_result) == ground_truth:
                correct += 1
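        except (TypeError, ValueError, OverflowError):
            # The result is missing (None) or not a number, e.g. a warning message.
            pass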

    model_performance[model] = correct / len(results)
    print("Success rate:", model_performance[model])
gemma3:4b results: ['2.0', '2.0', '2.0', '2.0', '2.0', '2.0']
Success rate: 1.0
mistral results: ['2', 'None', '2', '2', 'Warning: T', 'None']
Success rate: 0.5
qwen:0.5b results: ['None', 'None', 'None', 'None', 'None', 'None']
Success rate: 0.0
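
The `int(...)` comparison used above has two edge cases worth knowing: `int(2.7)` truncates to 2 and would count as correct, while `int("2.0")` raises a `ValueError` and would count as wrong. A stricter check could look like the following sketch. `is_correct` is a hypothetical helper, not part of Sand-Bob, and the example values only mimic the kinds of results printed above.

```python
import math


def is_correct(final_result, expected, tol=1e-9):
    """Return True if final_result can be read as a number equal to expected."""
    try:
        value = float(final_result)  # accepts ints, floats and numeric strings
    except (TypeError, ValueError, OverflowError):
        return False  # None, free text, or a number too large for a float
    # Reject NaN and infinity, then compare within a small tolerance.
    return math.isfinite(value) and abs(value - expected) <= tol


# Illustrative inputs: numbers, numeric strings, a missing result, free text.
examples = [2, 2.0, "2", "2.0", None, "Warning: T", 2.7]
print([is_correct(x, 2) for x in examples])
```

Parsing with `float` accepts both `"2"` and `"2.0"`, and the tolerance comparison rejects values such as 2.7 that truncation would let through.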

The model with the highest success rate is:

max(model_performance, key=model_performance.get)
'gemma3:4b'
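
Note that `max` returns only the first key with the highest value, so a tie between models would go unnoticed. The sketch below prints a full ranking and collects all top-scoring models. The dictionary repeats the success rates printed above; `ranking` and `best_models` are illustrative names.

```python
# Success rates copied from the output above.
model_performance = {"gemma3:4b": 1.0, "mistral": 0.5, "qwen:0.5b": 0.0}

# Sort models from highest to lowest success rate.
ranking = sorted(model_performance.items(), key=lambda item: item[1], reverse=True)
for name, rate in ranking:
    print(f"{name:<12} {rate:.0%}")

# Collect every model that shares the top success rate.
best_rate = ranking[0][1]
best_models = [name for name, rate in ranking if rate == best_rate]
print("Best:", best_models)
```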