Large language model fine tuning

In this notebook we fine-tune an OpenAI ChatGPT model (`gpt-3.5-turbo`) to act as a bio-image analysis expert that responds with up-to-date Python code for processing images.

Todo: We could submit training and validation data separately. This notebook does not cover this yet due to limited training data. As soon as we have a larger pool of training data, we can give this a try.
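Once more examples are available, the split mentioned above could look roughly like the following. This is a minimal sketch, not part of the notebook's workflow: `split_examples`, the 20 % validation fraction and the fixed seed are all illustrative assumptions.

```python
import random

def split_examples(examples, validation_fraction=0.2, seed=42):
    """Shuffle a copy of the examples and split it into (training, validation)."""
    shuffled = list(examples)
    # a seeded generator keeps the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    # keep at least one example for validation
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]

# hypothetical stand-in for a larger pool of chat examples
examples = [{"id": i} for i in range(20)]
train, validation = split_examples(examples)
print(len(train), len(validation))
```

Both parts would then be written to separate JSON Lines files and uploaded individually.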

import json
import openai
import time

We first set up training data, a list of chat examples. Most code examples used below are taken from the BioImageAnalysisNotebooks.

training_data = []

def add_example(input, output):
    training_data.append(
    {
      "messages": [
          { "role": "system", "content": """
You are a chatbot with strong bio-image analysis expertise.
You are an expert python programmer. 
The code you produce will be executed from a Jupyter notebook.
"""},
          {"role": "user", "content": input},
          {"role": "assistant", "content": output}
      ]       
    })
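Before collecting many examples, it can help to check that every record has the shape used above: a `messages` list whose entries carry a known `role` and a non-empty `content` string, ending with the assistant's answer. The following is a minimal sketch; `validate_example` is a hypothetical helper and not part of the `openai` package, and the set of roles is limited to the three used in this notebook.

```python
# roles used in this notebook (assumption: no other roles are needed here)
ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_example(example):
    """Return a list of problems found in one chat training example."""
    problems = []
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    for i, message in enumerate(messages):
        role = message.get("role")
        content = message.get("content")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {role!r}")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: content must be a non-empty string")
    # the model is trained to produce the final assistant message
    if messages[-1].get("role") != "assistant":
        problems.append("last message should come from the assistant")
    return problems

good = {"messages": [
    {"role": "system", "content": "You are a chatbot."},
    {"role": "user", "content": "How can I show an image?"},
    {"role": "assistant", "content": "Use stackview.insight(image)."},
]}
bad = {"messages": [{"role": "user", "content": ""}]}

print(validate_example(good))
print(validate_example(bad))
```

Running such a check over `training_data` before uploading can catch empty answers or misspelled roles early.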

add_example(input="""
How can one open CZI or LIF files using Python?
""", output="""
To open CZI or LIF files, you can use the AICSImageIO package.
In the following code the file `filename` will be loaded and
the image data will be stored in `image`.

```python
from aicsimageio import AICSImage

# open the file given in `filename`
aics_image = AICSImage(filename)

# retrieve the pixel data as a numpy array with axes Z, Y and X
image = aics_image.get_image_data("ZYX")
```
""")

add_example(input="""
How can one show an image, its histogram and some basic statistics in a Jupyter notebook?
""", output="""
You can show an image stored in the variable `image`, its histogram and some basic statistics using the `stackview` library:

```python
import stackview

stackview.insight(image)
```
""")

add_example(input="""
How can one show a label image with every label in a different colour in a Jupyter notebook?
""", output="""
You can show a label image stored in the variable `label_image` using the `stackview` library. Label images with an integer data type are displayed with every label in a different colour:

```python
import stackview

stackview.insight(label_image)
```
""")

add_example(input="""
How can I show an image in Napari?
""", output="""
Napari is a Python-based image viewer. You can use it for showing images and label images as demonstrated in the following:
```python
import napari

# create a viewer
viewer = napari.Viewer()

# show an image in the viewer
viewer.add_image(image)

# show a label image in the viewer
viewer.add_labels(label_image)
```
""")

add_example(input="""
How can I remove the background in an image?
""", output="""
You can remove the background in an image using scikit-image's `white_tophat` function. 
In the following example the `input_image` will be background-subtracted and 
the result is stored in `output_image`. 
The variable `radius` lets you specify how large objects may be and still be kept while the background is removed.

```python
from skimage.morphology import disk
from skimage.morphology import white_tophat

radius = 25

output_image = white_tophat(input_image, disk(radius))
```
""")

add_example(input="""
How can I remove noise in an image?
""", output="""
You can denoise an image using a median filter, e.g. from scikit-image.
In the following example the noise from `input_image` will be removed and
the result is stored in `output_image`.

```python
from skimage import filters, morphology

output_image = filters.median(input_image, morphology.disk(1))
```
""")

add_example(input="""
How can I segment small roundish objects, such as nuclei or granules, in an image?
""", output="""
You can segment and label small roundish objects in an image using the Voronoi-Otsu-Labeling algorithm.
In the following example the objects in `input_image` will be segmented and 
the resulting objects will be labeled and stored in the `label_image`.
The algorithm has two parameters:
* `spot_sigma` allows you to configure how large the objects are supposed to be.
* `outline_sigma` allows you to tune how fine the outlines should be.

In this code, the python library `napari-segment-blobs-and-things-with-membranes` will be used:
```python
import napari_segment_blobs_and_things_with_membranes as nsbatwm

label_image = nsbatwm.voronoi_otsu_labeling(input_image, spot_sigma=2, outline_sigma=2)
```

In case you are using the python library `pyclesperanto-prototype`, you can also call the algorithm like this:

```python
import pyclesperanto_prototype as cle

label_image = cle.voronoi_otsu_labeling(input_image, spot_sigma=2, outline_sigma=2)
```
""")

add_example(input="""
How can I segment small roundish objects, such as nuclei, in an image in case they are very densely packed?
""", output="""
You can segment densely packed nuclei in an image using the StarDist library.
In the following example the nuclei in the `input_image` will be segmented and
stored in the `label_image`:

```python
from stardist.models import StarDist2D
from csbdeep.utils import normalize

# creates a pretrained model
model = StarDist2D.from_pretrained('2D_versatile_fluo')

# normalize the image
axis_norm = (0,1)
normalized_image = normalize(input_image, 1, 99.8, axis=axis_norm)

# segment the normalized image
label_image, details = model.predict_instances(normalized_image)
```
""")

add_example(input="""
How can I segment cells in an image?
""", output="""
You can use the Cellpose library for that.
In the following example the cells in the `input_image` will be segmented and
stored in the `label_image`:

```python
from cellpose import models, io

model = models.Cellpose(gpu=False, model_type='cyto')

channels = [0,0] # This means we are processing single-channel greyscale images.

label_image, flows, styles, diams = model.eval(input_image, diameter=None, channels=channels)
```
""")

add_example(input="""
How can I re-label labels sequentially in a label image?
""", output="""
You can re-label labels in a label image using scikit-image's `relabel_sequential` function like this:

```python
from skimage.segmentation import relabel_sequential

relabeled_image, _, _ = relabel_sequential(label_image)
```
""")

add_example(input="""
How can I expand labels, e.g. with a radius of 10 pixels, in a label image?
""", output="""
You can expand labels in a label image using scikit-image's `expand_labels` function like this:

```python
from skimage.segmentation import expand_labels

expanded_labels = expand_labels(label_image, distance=10)
```
""")

Preparing training data

We need to convert the data into the JSON Lines format, where every line of the file holds one training example as a JSON object.

def prepare_data(dictionary_data, final_file_name):
    
    with open(final_file_name, 'w') as outfile:
        for entry in dictionary_data:
            json.dump(entry, outfile)
            outfile.write('\n')

prepare_data(training_data, "training_data.jsonl")
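To confirm that a file written this way really contains one JSON object per line, it can be read back and compared with the original data. The sketch below is self-contained and works on a small temporary file; `load_jsonl` and the sample record are illustrative assumptions, not part of the notebook.

```python
import json
import os
import tempfile

def load_jsonl(path):
    """Read a JSON Lines file and return its records as a list."""
    records = []
    with open(path, encoding="utf-8") as infile:
        for line_number, line in enumerate(infile, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as error:
                raise ValueError(f"line {line_number} is not valid JSON: {error}")
    return records

# the newline inside the content is escaped by json.dumps,
# so the record still occupies a single line in the file
sample = [{"messages": [{"role": "user", "content": "line one\nline two"}]}]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.jsonl")
    with open(path, "w", encoding="utf-8") as outfile:
        for entry in sample:
            outfile.write(json.dumps(entry) + "\n")
    loaded = load_jsonl(path)

print(loaded == sample)
```

The same function could be pointed at `training_data.jsonl` to verify the file before uploading it.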

Uploading training data

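We can upload the training data like this. Note that `openai.File`, `openai.FineTuningJob` and `openai.ChatCompletion`, as used in this notebook, belong to versions of the `openai` Python package before 1.0.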

training_file = openai.File.create(
  file=open('training_data.jsonl', 'rb'),
  purpose='fine-tune',
)
training_file

<File file id=file-xiweY94J061YDcLr7gJZAiWz at 0x1b3f63d2810> JSON: {
  "object": "file",
  "id": "file-xiweY94J061YDcLr7gJZAiWz",
  "purpose": "fine-tune",
  "filename": "file",
  "bytes": 8156,
  "created_at": 1698937116,
  "status": "processed",
  "status_details": null
}

Afterwards, we need to wait until the file is done processing.

while openai.File.retrieve(training_file.id).status != "processed":
    print("Still processing")
    time.sleep(30)

Model training

Once the training data is uploaded and processed, we can use it for training.

job = openai.FineTuningJob.create(training_file=training_file.id, model="gpt-3.5-turbo")
job

<FineTuningJob fine_tuning.job id=ftjob-3ZhPTB0pT3yJENRfutRUQju5 at 0x1b3f4b501d0> JSON: {
  "object": "fine_tuning.job",
  "id": "ftjob-3ZhPTB0pT3yJENRfutRUQju5",
  "model": "gpt-3.5-turbo-0613",
  "created_at": 1698937123,
  "finished_at": null,
  "fine_tuned_model": null,
  "organization_id": "org-1cjvymf23JI7jxFOJCdnbsE2",
  "result_files": [],
  "status": "validating_files",
  "validation_file": null,
  "training_file": "file-xiweY94J061YDcLr7gJZAiWz",
  "hyperparameters": {
    "n_epochs": "auto"
  },
  "trained_tokens": null,
  "error": null
}

Training also takes a while, so we need to wait until it has succeeded.

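while True:
    status = openai.FineTuningJob.retrieve(job.id).status
    # stop waiting once the job has finished, also if it failed or was cancelled
    if status in ("succeeded", "failed", "cancelled"):
        break
    print("Still training")
    time.sleep(30)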

Still training
Still training
Still training
Still training
Still training
Still training
Still training
Still training
Still training
Still training
Still training
Still training
Still training
Still training
Still training
Still training
Still training
Still training
Still training
Still training
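Both waiting loops in this notebook follow the same pattern: ask for a status, stop when it reaches the desired value, otherwise sleep and try again. A reusable version with a timeout and a set of failure states might look like the sketch below. `wait_for_status` is a hypothetical helper, the failure states `"failed"` and `"cancelled"` are assumptions about what the service reports, and the status source is faked here so that the example runs without network access.

```python
import time

def wait_for_status(get_status, done, failed=(), interval=30, timeout=3600,
                    sleep=time.sleep):
    """Poll get_status() until it returns `done`.

    Raises RuntimeError if a status listed in `failed` shows up and
    TimeoutError once roughly `timeout` seconds have been spent sleeping.
    """
    waited = 0
    while True:
        status = get_status()
        if status == done:
            return status
        if status in failed:
            raise RuntimeError(f"stopped with status {status!r}")
        if waited >= timeout:
            raise TimeoutError(f"still {status!r} after {waited} seconds")
        sleep(interval)
        waited += interval

# fake status source standing in for the remote service
statuses = iter(["validating_files", "running", "succeeded"])
result = wait_for_status(
    lambda: next(statuses),
    done="succeeded",
    failed=("failed", "cancelled"),
    interval=0,
    sleep=lambda seconds: None,  # do not actually wait in this demonstration
)
print(result)
```

In the notebook, `get_status` could be a small function returning `openai.FineTuningJob.retrieve(job.id).status`, so that a failed job raises an error instead of being polled forever.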

We can then retrieve the model name and execute the model.

job_status = openai.FineTuningJob.retrieve(job.id)

model_name = job_status.fine_tuned_model
model_name

'ft:gpt-3.5-turbo-0613:personal::8GTbD6B7'

completion = openai.ChatCompletion.create(
  model=model_name,
  messages=[
    {"role": "system", "content": "You are a chatbot with strong bio-image analysis expertise."},
    {"role": "user", "content": "How can I segment an image showing dense nuclei in Python?"}
  ]
)
print(completion.choices[0].message["content"])

You can segment an image showing dense nuclei in Python using the scikit-image library. Here's an example using the Otsu thresholding method:

```python
from skimage.filters import threshold_otsu
from skimage.segmentation import clear_border
from skimage.measure import label
from skimage.segmentation import watershed

# Assume your input image is stored in the `input_image` variable

# Apply Otsu thresholding
threshold = threshold_otsu(input_image)
binary_image = input_image > threshold

# Remove border artifacts
cleared = clear_border(binary_image)

# Label the image
label_image = label(cleared)

# Apply watershed
image_label_overlay = watershed(-input_image, label_image)

# The resulting segmented image is stored in the `image_label_overlay` variable
```

For more advanced segmentation algorithms, you can have a look at the Cellpose or StarDist libraries, which provide more accurate segmentation of cells and nuclei:

- Cellpose: https://cellpose.readthedocs.io
- StarDist: https://github.com/stardist/stardist

Let me know if you need more help!
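Because the model answers with Markdown that mixes text and fenced code, it may be useful to pull out only the code before running it in a notebook. The following is a minimal sketch using a regular expression; `extract_code_blocks` is a hypothetical helper, and the reply used here is a short made-up string rather than real model output.

```python
import re

# three backticks, built programmatically so this example stays readable
FENCE = "`" * 3
CODE_BLOCK = re.compile(FENCE + r"(?:python)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)

def extract_code_blocks(reply):
    """Return the contents of all fenced code blocks in a Markdown reply."""
    return [block.strip() for block in CODE_BLOCK.findall(reply)]

# made-up reply in the style of the answers above
reply = (
    "You can expand labels like this:\n\n"
    + FENCE + "python\n"
    + "from skimage.segmentation import expand_labels\n"
    + "expanded = expand_labels(label_image, distance=10)\n"
    + FENCE + "\n\nLet me know if you need more help!"
)

blocks = extract_code_blocks(reply)
print(blocks[0])
```

Generated code should still be read before it is executed, as the answer above shows that a fine-tuned model does not necessarily reproduce the training examples.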
