{ "cells": [ { "cell_type": "markdown", "id": "fa63f581-f411-4e3c-8a36-255c1c986389", "metadata": {}, "source": [ "# Combining tables\n", "\n", "When workging with many tables in a relational database, it often makes sense to combine tables in summary tables and store them before further processing. Just as a reminder, we aquired `images` during the `experiments` and corresponding `measurements`. We now would like to extract an `overview` table that contains the most important information from `image`, `experiments` and `measurements`." ] }, { "cell_type": "code", "execution_count": 1, "id": "91c81c85-dce7-4b20-9654-ad52d051743d", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from pandasql import sqldf" ] }, { "cell_type": "markdown", "id": "0cc6f21c-83ac-4265-b78a-96fa48992420", "metadata": {}, "source": [ "We again start by loading our tables from disc." ] }, { "cell_type": "code", "execution_count": 2, "id": "6a98379e-c60d-4ce2-81e7-2ad47c5225e4", "metadata": {}, "outputs": [], "source": [ "experiment = pd.read_csv(\"../../data/experiments.csv\")\n", "image = pd.read_csv(\"../../data/images.csv\")\n", "measurement = pd.read_csv(\"../../data/measurements.csv\")" ] }, { "cell_type": "markdown", "id": "3dfa47bc-7f85-49fb-8a62-57b6b65054ae", "metadata": {}, "source": [ "Again, we define a function that can query the database." ] }, { "cell_type": "code", "execution_count": 3, "id": "bdec6f91-22c5-404f-a436-b23a29506e9a", "metadata": {}, "outputs": [], "source": [ "query_database = lambda q: sqldf(q, globals())" ] }, { "cell_type": "markdown", "id": "6ac03be8-a9b5-484b-94ac-691fae805a66", "metadata": {}, "source": [ "## Inner Join\n", "SQL also allows combining tables using the `JOIN` statement. Most common is the `INNER JOIN` which allows to combine two tables in a way that rows are matched according to a defined connection `ON` which the tables are combined. The `INNER JOIN` makes sure that only rows are shown to which both tables have entries. That's why experiment `ID 4` is not shown in the following table. There were no images stored during experiment 4." ] }, { "cell_type": "code", "execution_count": 4, "id": "085814f5-0cfc-4df5-9661-12e99077c55a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0IDCommentConcentrationUnnamed: 0IDFilenameexperiment_ID
001High dose5001image0.tif1
101High dose5012image1.tif1
201High dose5023image2.tif1
312Medium dose2034image3.tif2
412Medium dose2045image4.tif2
512Medium dose2056image5.tif2
623Control067image6.tif3
723Control078image7.tif3
823Control089image8.tif3
923Control0910image9.tif3
\n", "
" ], "text/plain": [ " Unnamed: 0 ID Comment Concentration Unnamed: 0 ID Filename \\\n", "0 0 1 High dose 50 0 1 image0.tif \n", "1 0 1 High dose 50 1 2 image1.tif \n", "2 0 1 High dose 50 2 3 image2.tif \n", "3 1 2 Medium dose 20 3 4 image3.tif \n", "4 1 2 Medium dose 20 4 5 image4.tif \n", "5 1 2 Medium dose 20 5 6 image5.tif \n", "6 2 3 Control 0 6 7 image6.tif \n", "7 2 3 Control 0 7 8 image7.tif \n", "8 2 3 Control 0 8 9 image8.tif \n", "9 2 3 Control 0 9 10 image9.tif \n", "\n", " experiment_ID \n", "0 1 \n", "1 1 \n", "2 1 \n", "3 2 \n", "4 2 \n", "5 2 \n", "6 3 \n", "7 3 \n", "8 3 \n", "9 3 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \"\"\"\n", "SELECT * \n", "FROM experiment\n", "INNER JOIN image ON experiment.ID = image.experiment_ID\n", "\"\"\"\n", "\n", "query_database(query)" ] }, { "cell_type": "markdown", "id": "3720de6f-5400-44da-b80a-3490e153f81b", "metadata": {}, "source": [ "## Outer Join\n", "\n", "If one wanted to combine the tables in a way that also experiment 4 is listed, we need to use an `OUTER JOIN`. In this case, as the experiments table is on the left of our SQL statement, we use the `LEFT OUTER JOIN`. Experiment 4 is then listed in our table with some `NaNs` in the columns of the images." ] }, { "cell_type": "code", "execution_count": 5, "id": "81fc2293-82f2-4e21-9c9b-180e0de8e0da", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0IDCommentConcentrationUnnamed: 0IDFilenameexperiment_ID
001High dose500.01.0image0.tif1.0
101High dose501.02.0image1.tif1.0
201High dose502.03.0image2.tif1.0
312Medium dose203.04.0image3.tif2.0
412Medium dose204.05.0image4.tif2.0
512Medium dose205.06.0image5.tif2.0
623Control06.07.0image6.tif3.0
723Control07.08.0image7.tif3.0
823Control08.09.0image8.tif3.0
923Control09.010.0image9.tif3.0
1034Super high dose1000NaNNaNNoneNaN
\n", "
" ], "text/plain": [ " Unnamed: 0 ID Comment Concentration Unnamed: 0 ID \\\n", "0 0 1 High dose 50 0.0 1.0 \n", "1 0 1 High dose 50 1.0 2.0 \n", "2 0 1 High dose 50 2.0 3.0 \n", "3 1 2 Medium dose 20 3.0 4.0 \n", "4 1 2 Medium dose 20 4.0 5.0 \n", "5 1 2 Medium dose 20 5.0 6.0 \n", "6 2 3 Control 0 6.0 7.0 \n", "7 2 3 Control 0 7.0 8.0 \n", "8 2 3 Control 0 8.0 9.0 \n", "9 2 3 Control 0 9.0 10.0 \n", "10 3 4 Super high dose 1000 NaN NaN \n", "\n", " Filename experiment_ID \n", "0 image0.tif 1.0 \n", "1 image1.tif 1.0 \n", "2 image2.tif 1.0 \n", "3 image3.tif 2.0 \n", "4 image4.tif 2.0 \n", "5 image5.tif 2.0 \n", "6 image6.tif 3.0 \n", "7 image7.tif 3.0 \n", "8 image8.tif 3.0 \n", "9 image9.tif 3.0 \n", "10 None NaN " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \"\"\"\n", "SELECT * \n", "FROM experiment\n", "LEFT OUTER JOIN image ON experiment.ID = image.experiment_ID\n", "\"\"\"\n", "\n", "query_database(query)" ] }, { "cell_type": "markdown", "id": "358d911d-a0f9-49af-8a79-e37265900588", "metadata": {}, "source": [ "## Combining many tables\n", "\n", "The `JOIN` statements above can also be used in chains to combine more than two tables. For example, we might be interested in the number of cells over time for every image in every experiment" ] }, { "cell_type": "code", "execution_count": 6, "id": "abccf546-bff8-4fdb-9767-caf87ed0ac78", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0IDCommentConcentrationUnnamed: 0IDFilenameexperiment_IDUnnamed: 0image_IDtime_after_exposure_in_snumber_of_cells
001High dose5001image0.tif101096
101High dose5001image0.tif111147
201High dose5001image0.tif121232
301High dose5001image0.tif131324
401High dose5001image0.tif141419
.......................................
9523Control0910image9.tif395105142
9623Control0910image9.tif396106148
9723Control0910image9.tif397107144
9823Control0910image9.tif398108141
9923Control0910image9.tif399109140
\n", "

100 rows × 12 columns

\n", "
" ], "text/plain": [ " Unnamed: 0 ID Comment Concentration Unnamed: 0 ID Filename \\\n", "0 0 1 High dose 50 0 1 image0.tif \n", "1 0 1 High dose 50 0 1 image0.tif \n", "2 0 1 High dose 50 0 1 image0.tif \n", "3 0 1 High dose 50 0 1 image0.tif \n", "4 0 1 High dose 50 0 1 image0.tif \n", ".. ... .. ... ... ... .. ... \n", "95 2 3 Control 0 9 10 image9.tif \n", "96 2 3 Control 0 9 10 image9.tif \n", "97 2 3 Control 0 9 10 image9.tif \n", "98 2 3 Control 0 9 10 image9.tif \n", "99 2 3 Control 0 9 10 image9.tif \n", "\n", " experiment_ID Unnamed: 0 image_ID time_after_exposure_in_s \\\n", "0 1 0 1 0 \n", "1 1 1 1 1 \n", "2 1 2 1 2 \n", "3 1 3 1 3 \n", "4 1 4 1 4 \n", ".. ... ... ... ... \n", "95 3 95 10 5 \n", "96 3 96 10 6 \n", "97 3 97 10 7 \n", "98 3 98 10 8 \n", "99 3 99 10 9 \n", "\n", " number_of_cells \n", "0 96 \n", "1 47 \n", "2 32 \n", "3 24 \n", "4 19 \n", ".. ... \n", "95 142 \n", "96 148 \n", "97 144 \n", "98 141 \n", "99 140 \n", "\n", "[100 rows x 12 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \"\"\"\n", "SELECT * \n", "FROM experiment\n", "LEFT OUTER JOIN image ON experiment.ID = image.experiment_ID\n", "INNER JOIN measurement ON image.ID = measurement.image_ID\n", "\"\"\"\n", "\n", "query_database(query)" ] }, { "cell_type": "markdown", "id": "c4e3c84f-ce67-4371-8cbb-23b249e4db6b", "metadata": {}, "source": [ "## Selecting columns\n", "When a table becomes too wide and has columns we do not really need for our analysis, it is recommended to replace the `*` in the `SELECT` statement with column names." ] }, { "cell_type": "code", "execution_count": 7, "id": "bdd1174b-8da6-43ee-8e82-7ecca2af9bb9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CommentFilenametime_after_exposure_in_snumber_of_cells
0High doseimage0.tif096
1High doseimage0.tif147
2High doseimage0.tif232
3High doseimage0.tif324
4High doseimage0.tif419
...............
95Controlimage9.tif5142
96Controlimage9.tif6148
97Controlimage9.tif7144
98Controlimage9.tif8141
99Controlimage9.tif9140
\n", "

100 rows × 4 columns

\n", "
" ], "text/plain": [ " Comment Filename time_after_exposure_in_s number_of_cells\n", "0 High dose image0.tif 0 96\n", "1 High dose image0.tif 1 47\n", "2 High dose image0.tif 2 32\n", "3 High dose image0.tif 3 24\n", "4 High dose image0.tif 4 19\n", ".. ... ... ... ...\n", "95 Control image9.tif 5 142\n", "96 Control image9.tif 6 148\n", "97 Control image9.tif 7 144\n", "98 Control image9.tif 8 141\n", "99 Control image9.tif 9 140\n", "\n", "[100 rows x 4 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \"\"\"\n", "SELECT experiment.Comment, image.Filename, measurement.time_after_exposure_in_s, measurement.number_of_cells\n", "FROM experiment\n", "LEFT OUTER JOIN image ON experiment.ID = image.experiment_ID\n", "INNER JOIN measurement ON image.ID = measurement.image_ID\n", "\"\"\"\n", "\n", "query_database(query)" ] }, { "cell_type": "markdown", "id": "60bf8445-ebec-4d4a-b027-420e316c66c8", "metadata": {}, "source": [ "## Renaming columns\n", "\n", "Using the `AS` statement, we can also rename columns and make our new table easier to read and understand." ] }, { "cell_type": "code", "execution_count": 8, "id": "6e230a1f-bcdf-4c37-a6f5-3941d1cc1abe", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
experiment_nameFilenametime_after_exposure_in_snumber_of_cells
0High doseimage0.tif096
1High doseimage0.tif147
2High doseimage0.tif232
3High doseimage0.tif324
4High doseimage0.tif419
...............
95Controlimage9.tif5142
96Controlimage9.tif6148
97Controlimage9.tif7144
98Controlimage9.tif8141
99Controlimage9.tif9140
\n", "

100 rows × 4 columns

\n", "
" ], "text/plain": [ " experiment_name Filename time_after_exposure_in_s number_of_cells\n", "0 High dose image0.tif 0 96\n", "1 High dose image0.tif 1 47\n", "2 High dose image0.tif 2 32\n", "3 High dose image0.tif 3 24\n", "4 High dose image0.tif 4 19\n", ".. ... ... ... ...\n", "95 Control image9.tif 5 142\n", "96 Control image9.tif 6 148\n", "97 Control image9.tif 7 144\n", "98 Control image9.tif 8 141\n", "99 Control image9.tif 9 140\n", "\n", "[100 rows x 4 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \"\"\"\n", "SELECT experiment.Comment AS experiment_name, image.Filename, measurement.time_after_exposure_in_s, measurement.number_of_cells\n", "FROM experiment\n", "LEFT OUTER JOIN image ON experiment.ID = image.experiment_ID\n", "INNER JOIN measurement ON image.ID = measurement.image_ID\n", "\"\"\"\n", "\n", "query_database(query)" ] }, { "cell_type": "markdown", "id": "65dda887-e902-4eb8-822d-84cdcc0671c6", "metadata": {}, "source": [ "As this statement is quite complicated and we should not copy past it over and over, we can store it in a pandas DataFrame `overview`." ] }, { "cell_type": "code", "execution_count": 9, "id": "9d8cc542-24f4-4b66-9bc9-294c80b021eb", "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", "SELECT experiment.Comment AS experiment_name, image.Filename, measurement.time_after_exposure_in_s, measurement.number_of_cells\n", "FROM experiment\n", "LEFT OUTER JOIN image ON experiment.ID = image.experiment_ID\n", "INNER JOIN measurement ON image.ID = measurement.image_ID\n", "\"\"\"\n", "\n", "overview = query_database(query)" ] }, { "cell_type": "markdown", "id": "24bfdcd8-b925-4560-96b3-5ff44b2b979f", "metadata": {}, "source": [ "The `overview` table is part of our database. Note: The overview was not saved to disc yet." ] }, { "cell_type": "code", "execution_count": 10, "id": "933543b3-5bab-4cdc-b37e-48c7cf7c52e3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
experiment_nameFilenametime_after_exposure_in_snumber_of_cells
0High doseimage0.tif096
1High doseimage0.tif147
2High doseimage0.tif232
3High doseimage0.tif324
4High doseimage0.tif419
...............
95Controlimage9.tif5142
96Controlimage9.tif6148
97Controlimage9.tif7144
98Controlimage9.tif8141
99Controlimage9.tif9140
\n", "

100 rows × 4 columns

\n", "
" ], "text/plain": [ " experiment_name Filename time_after_exposure_in_s number_of_cells\n", "0 High dose image0.tif 0 96\n", "1 High dose image0.tif 1 47\n", "2 High dose image0.tif 2 32\n", "3 High dose image0.tif 3 24\n", "4 High dose image0.tif 4 19\n", ".. ... ... ... ...\n", "95 Control image9.tif 5 142\n", "96 Control image9.tif 6 148\n", "97 Control image9.tif 7 144\n", "98 Control image9.tif 8 141\n", "99 Control image9.tif 9 140\n", "\n", "[100 rows x 4 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \"\"\"\n", "SELECT * \n", "FROM overview\n", "\"\"\"\n", "\n", "query_database(query)" ] }, { "cell_type": "markdown", "id": "62b464c1-c67b-4b1b-8353-a96da5e26761", "metadata": {}, "source": [ "For reuse in later lessons, we save the summary to disc." ] }, { "cell_type": "code", "execution_count": 11, "id": "e34a8b8d-08f1-4d15-90ed-cdf8b98cf389", "metadata": {}, "outputs": [], "source": [ "overview.to_csv(\"../../data/overview.csv\")" ] }, { "cell_type": "code", "execution_count": null, "id": "60c4b936-d4e6-4076-8c76-fd47826eca63", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.15" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }