Visualizing result consistency during iterative code improvement


Sand-Bob lets you inspect the results of iterative code-improvement steps. After generating and executing code, it stores the execution result. It then improves the code, executes it again, and stores those results as well. By running the entire process multiple times, we can get an idea of how consistent the results are.

Note: Consistent results are a necessary but not sufficient condition for functional correctness. If 10 executions lead to 10 different results, one can conclude that at least 9 of the results must be wrong. But if 10 executions lead to equal results, this does not necessarily mean that those results are correct.
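The reasoning in this note can be written down as a small helper. The following is a minimal sketch, not part of Sand-Bob: `min_wrong_results` is a hypothetical function name, and it assumes that the task has exactly one correct answer and that results can be compared for equality.

```python
from collections import Counter

def min_wrong_results(results):
    """Return a lower bound on how many results must be wrong.

    Assumption: there is exactly one correct answer, so at most one
    distinct value can be right. Even in the best case (the most
    frequent value is the correct one), all other results are wrong.
    """
    if not results:
        return 0
    most_common_count = Counter(results).most_common(1)[0][1]
    return len(results) - most_common_count

# 10 different results: at least 9 of them must be wrong
print(min_wrong_results(list(range(10))))  # 9

# 10 equal results: the bound is 0, which does not prove correctness
print(min_wrong_results([46.0] * 10))  # 0
```

A bound of 0 only says that the results do not contradict each other. All runs may still share the same mistake.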

from sand_bob import generate_code, config_kiara, config_scadsai_llm

#config_kiara()
config_scadsai_llm()

First, we demonstrate consistency between results, and across code-improvement iterations, using a code-generation prompt in which the goal of the data analysis is precisely defined.

results = generate_code("""
I would like to segment the bright blobs in input_data/image.tif and count them.
For segmentation, use otsu-thresholding, connected component labeling and remove the objects sitting on the image border.
The final result should be a number.
""",
    input_host_path="input_data",
    dependencies=["numpy", "scikit-image", "tifffile", "pandas", "matplotlib"],
    n_parallel=3,
    n_iterative=3,
    n_codefix_attempts=2,
    n_feedback_iterations=1,
    final_touch=False,
)
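For orientation, code that satisfies this prompt might look roughly like the sketch below. This is not the code Sand-Bob generated. `count_blobs` is a hypothetical helper name, and loading the file with `tifffile.imread` is an assumption. The three steps named in the prompt map onto `threshold_otsu`, `label` and `clear_border` from scikit-image.

```python
import numpy as np
from skimage.filters import threshold_otsu
from skimage.measure import label
from skimage.segmentation import clear_border

def count_blobs(image):
    """Count bright blobs that do not touch the image border."""
    # Otsu thresholding: separate bright foreground from dark background
    binary = image > threshold_otsu(image)
    # Connected component labeling: one integer label per blob
    labels = label(binary)
    # Remove the objects sitting on the image border
    labels = clear_border(labels)
    # Count the remaining labels, excluding the background label 0
    return int(np.count_nonzero(np.unique(labels)))

# Assumed usage with the file from the prompt:
# import tifffile
# image = tifffile.imread("input_data/image.tif")
# print(count_blobs(image))
```

Because the prompt pins down every step, independently generated programs have little freedom to differ, which is one plausible reason why their results agree.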

We can visualize a summary of the final results:

results.display_result_summary()

9 results: Numeric: 9, String: 0, Image: 0, Dataframe: 0, Other: 0

And we can also visualize how these results changed from one code-improvement iteration to another. The final results are shown on the right:

results.display_result_history()
Process 1: TypeError → 46 → 46.0
Process 2: TypeError → 46 → 46.0
Process 3: 60 → 46.0
Process 4: 46 → 46.0
Process 5: 46 → 46.0
Process 6: 46 → 46.0
Process 7: 46 → 46.0
Process 8: 46 → 46.0
Process 9: TypeError → 43 → 46.0

Note that in the example above, the iterations end earlier than in the example below.

Imprecise prompting

Second, we execute a vaguer prompt in which the goal is less clearly defined.

results = generate_code("""
I would like to segment the bright blobs in input_data/image.tif and count them.
""",
    input_host_path="input_data",
    dependencies=["numpy", "scikit-image", "tifffile", "pandas", "matplotlib"],
    n_parallel=3,
    n_iterative=3,
    n_codefix_attempts=2,
    n_feedback_iterations=1,
    final_touch=False,
)

results.display_result_summary()
results.display_result_history()

9 results: Numeric: 9, String: 0, Image: 0, Dataframe: 0, Other: 0

Process 1: 61 → NameError → 61
Process 2: 61 → AttributeError → TypeError → 120.0
Process 3: 60 → AttributeError → 114.0
Process 4: 61 → AttributeError → 81
Process 5: 61 → AttributeError → TypeError → 123.0
Process 6: 43 → TypeError → 174
Process 7: 60 → TypeError → AttributeError → 120
Process 8: 59 → AttributeError → 115
Process 9: TypeError → 60 → AttributeError → 115.0
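To put a number on the difference between the two prompts, the final values can be compared directly. The sketch below is an illustration, not a Sand-Bob feature: `agreement` is a hypothetical helper, and the two lists were copied by hand from the right-most entries of the two result histories shown above.

```python
from collections import Counter

# Final results copied by hand from the two result histories (assumption:
# the right-most entry of each row is the final result of that process)
precise_finals = [46.0] * 9
vague_finals = [61, 120.0, 114.0, 81, 123.0, 174, 120, 115, 115.0]

def agreement(finals):
    """Return (number of distinct values, share of the most common value)."""
    # Convert to float so that 120 and 120.0 are counted as the same result
    counts = Counter(float(value) for value in finals)
    most_common_count = counts.most_common(1)[0][1]
    return len(counts), most_common_count / len(finals)

print(agreement(precise_finals))  # (1, 1.0): all nine processes agree
distinct, share = agreement(vague_finals)
print(distinct)  # 7 distinct final values among nine processes
```

With the precise prompt, all nine processes end on the same count. With the vague prompt, no value is shared by more than two processes, so at most two of the nine final results can be correct.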