{
"cells": [
{
"cell_type": "markdown",
"id": "fa63f581-f411-4e3c-8a36-255c1c986389",
"metadata": {},
"source": [
"# Querying databases using SQL\n",
"\n",
"The first step is to load and explore data. We use pandas for loading the data and all we do is store it in variables."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "91c81c85-dce7-4b20-9654-ad52d051743d",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from pandasql import sqldf"
]
},
{
"cell_type": "markdown",
"id": "6cd8dc6b-b640-4c2c-9f0d-82978a08db82",
"metadata": {},
"source": [
"When working with SQL, common practice is to call tables (a.k.a. DataFrames) as singluar. This allows to read SQL code better. For example in a database containing experiments, we can later use `experiment.concentration` to describe the concentration of a drug that was used to expose cells."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "6a98379e-c60d-4ce2-81e7-2ad47c5225e4",
"metadata": {},
"outputs": [],
"source": [
"experiment = pd.read_csv(\"../../data/experiments.csv\")\n",
"measurement = pd.read_csv(\"../../data/measurements.csv\")"
]
},
{
"cell_type": "markdown",
"id": "3dfa47bc-7f85-49fb-8a62-57b6b65054ae",
"metadata": {},
"source": [
"Next we combine these tables in a database. Therefore, we define a function that can query the database."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "bdec6f91-22c5-404f-a436-b23a29506e9a",
"metadata": {},
"outputs": [],
"source": [
"query_database = lambda q: sqldf(q, globals())"
]
},
{
"cell_type": "markdown",
"id": "ebe7d09c-fe64-4d17-a6f7-00b910980899",
"metadata": {},
"source": [
"# Select From\n",
"Next, we define a query in SQL. SQL has a quite nicely human readable format. Typically we start by defining what we want to read (`SELECT`) and from where (`FROM`)."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "6fb1244e-70a6-4c6c-bb1f-8bfe7e550f2c",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Unnamed: 0 | \n",
" image_ID | \n",
" time_after_exposure_in_s | \n",
" number_of_cells | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 99 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 52 | \n",
"
\n",
" \n",
" 2 | \n",
" 2 | \n",
" 1 | \n",
" 2 | \n",
" 33 | \n",
"
\n",
" \n",
" 3 | \n",
" 3 | \n",
" 1 | \n",
" 3 | \n",
" 25 | \n",
"
\n",
" \n",
" 4 | \n",
" 4 | \n",
" 1 | \n",
" 4 | \n",
" 21 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 95 | \n",
" 95 | \n",
" 10 | \n",
" 5 | \n",
" 52 | \n",
"
\n",
" \n",
" 96 | \n",
" 96 | \n",
" 10 | \n",
" 6 | \n",
" 54 | \n",
"
\n",
" \n",
" 97 | \n",
" 97 | \n",
" 10 | \n",
" 7 | \n",
" 54 | \n",
"
\n",
" \n",
" 98 | \n",
" 98 | \n",
" 10 | \n",
" 8 | \n",
" 54 | \n",
"
\n",
" \n",
" 99 | \n",
" 99 | \n",
" 10 | \n",
" 9 | \n",
" 51 | \n",
"
\n",
" \n",
"
\n",
"
100 rows × 4 columns
\n",
"
"
],
"text/plain": [
" Unnamed: 0 image_ID time_after_exposure_in_s number_of_cells\n",
"0 0 1 0 99\n",
"1 1 1 1 52\n",
"2 2 1 2 33\n",
"3 3 1 3 25\n",
"4 4 1 4 21\n",
".. ... ... ... ...\n",
"95 95 10 5 52\n",
"96 96 10 6 54\n",
"97 97 10 7 54\n",
"98 98 10 8 54\n",
"99 99 10 9 51\n",
"\n",
"[100 rows x 4 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = \"\"\"\n",
"SELECT * \n",
"FROM measurement\n",
"\"\"\"\n",
"\n",
"query_database(query)"
]
},
{
"cell_type": "markdown",
"id": "37e59421-2e1e-4ae8-9380-bbed2f0127d7",
"metadata": {},
"source": [
"By the way, SQL is not case-sensitive per-se but you typically find SQL command words in capital letters."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "5d7d7f3a-b3c4-403d-8de1-86d9f9ca9750",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Unnamed: 0 | \n",
" image_ID | \n",
" time_after_exposure_in_s | \n",
" number_of_cells | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 99 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 52 | \n",
"
\n",
" \n",
" 2 | \n",
" 2 | \n",
" 1 | \n",
" 2 | \n",
" 33 | \n",
"
\n",
" \n",
" 3 | \n",
" 3 | \n",
" 1 | \n",
" 3 | \n",
" 25 | \n",
"
\n",
" \n",
" 4 | \n",
" 4 | \n",
" 1 | \n",
" 4 | \n",
" 21 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 95 | \n",
" 95 | \n",
" 10 | \n",
" 5 | \n",
" 52 | \n",
"
\n",
" \n",
" 96 | \n",
" 96 | \n",
" 10 | \n",
" 6 | \n",
" 54 | \n",
"
\n",
" \n",
" 97 | \n",
" 97 | \n",
" 10 | \n",
" 7 | \n",
" 54 | \n",
"
\n",
" \n",
" 98 | \n",
" 98 | \n",
" 10 | \n",
" 8 | \n",
" 54 | \n",
"
\n",
" \n",
" 99 | \n",
" 99 | \n",
" 10 | \n",
" 9 | \n",
" 51 | \n",
"
\n",
" \n",
"
\n",
"
100 rows × 4 columns
\n",
"
"
],
"text/plain": [
" Unnamed: 0 image_ID time_after_exposure_in_s number_of_cells\n",
"0 0 1 0 99\n",
"1 1 1 1 52\n",
"2 2 1 2 33\n",
"3 3 1 3 25\n",
"4 4 1 4 21\n",
".. ... ... ... ...\n",
"95 95 10 5 52\n",
"96 96 10 6 54\n",
"97 97 10 7 54\n",
"98 98 10 8 54\n",
"99 99 10 9 51\n",
"\n",
"[100 rows x 4 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = \"\"\"\n",
"select * \n",
"from measurement\n",
"\"\"\"\n",
"\n",
"query_database(query)"
]
},
{
"cell_type": "markdown",
"id": "091108ac-f5fe-474d-b3b1-1a63492f3608",
"metadata": {},
"source": [
"You can also select specific columns by name. Also here, SQL is not case-sensitive. The selected column `concentration` shows up as `Concentration` because that's its name in the database."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "c93f5a11-53e7-4318-987c-66cd39b25139",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Comment | \n",
" Concentration | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" High dose | \n",
" 50 | \n",
"
\n",
" \n",
" 1 | \n",
" Medium dose | \n",
" 20 | \n",
"
\n",
" \n",
" 2 | \n",
" Control | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" Super high dose | \n",
" 1000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Comment Concentration\n",
"0 High dose 50\n",
"1 Medium dose 20\n",
"2 Control 0\n",
"3 Super high dose 1000"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = \"\"\"\n",
"select Comment, concentration\n",
"from experiment\n",
"\"\"\"\n",
"\n",
"query_database(query)"
]
},
{
"cell_type": "markdown",
"id": "7f18fe8a-1273-40d5-a68c-bd74c02d2eed",
"metadata": {},
"source": [
"## Sorting tables\n",
"\n",
"This table can also be sorted using the `ORDER BY` statement, for example to see the highest number of cells. In this case we sort the column descending using the `DESC` statement. The ascenting order `ASC` would be default."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "9faf21e7-745d-49d7-bf9f-d11420cf5e81",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Unnamed: 0 | \n",
" image_ID | \n",
" time_after_exposure_in_s | \n",
" number_of_cells | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 10 | \n",
" 2 | \n",
" 0 | \n",
" 201 | \n",
"
\n",
" \n",
" 1 | \n",
" 75 | \n",
" 8 | \n",
" 5 | \n",
" 161 | \n",
"
\n",
" \n",
" 2 | \n",
" 50 | \n",
" 6 | \n",
" 0 | \n",
" 158 | \n",
"
\n",
" \n",
" 3 | \n",
" 78 | \n",
" 8 | \n",
" 8 | \n",
" 158 | \n",
"
\n",
" \n",
" 4 | \n",
" 70 | \n",
" 8 | \n",
" 0 | \n",
" 157 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 95 | \n",
" 47 | \n",
" 5 | \n",
" 7 | \n",
" 8 | \n",
"
\n",
" \n",
" 96 | \n",
" 38 | \n",
" 4 | \n",
" 8 | \n",
" 7 | \n",
"
\n",
" \n",
" 97 | \n",
" 48 | \n",
" 5 | \n",
" 8 | \n",
" 7 | \n",
"
\n",
" \n",
" 98 | \n",
" 39 | \n",
" 4 | \n",
" 9 | \n",
" 6 | \n",
"
\n",
" \n",
" 99 | \n",
" 49 | \n",
" 5 | \n",
" 9 | \n",
" 6 | \n",
"
\n",
" \n",
"
\n",
"
100 rows × 4 columns
\n",
"
"
],
"text/plain": [
" Unnamed: 0 image_ID time_after_exposure_in_s number_of_cells\n",
"0 10 2 0 201\n",
"1 75 8 5 161\n",
"2 50 6 0 158\n",
"3 78 8 8 158\n",
"4 70 8 0 157\n",
".. ... ... ... ...\n",
"95 47 5 7 8\n",
"96 38 4 8 7\n",
"97 48 5 8 7\n",
"98 39 4 9 6\n",
"99 49 5 9 6\n",
"\n",
"[100 rows x 4 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = \"\"\"\n",
"SELECT * \n",
"FROM measurement\n",
"ORDER BY number_of_cells DESC\n",
"\"\"\"\n",
"\n",
"query_database(query)"
]
},
{
"cell_type": "markdown",
"id": "70c7789a-de21-4b14-8126-b256e4f4a4df",
"metadata": {},
"source": [
"## Specifying the number of row to query\n",
"\n",
"In case a table is very large, retrieving all rows may take a long time. To get insights into the data anyway, it may make sense to `LIMIT` the table by the top 10 rows."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "4c586b76-44a2-4207-a987-b2b4cb0260e6",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Unnamed: 0 | \n",
" image_ID | \n",
" time_after_exposure_in_s | \n",
" number_of_cells | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 10 | \n",
" 2 | \n",
" 0 | \n",
" 201 | \n",
"
\n",
" \n",
" 1 | \n",
" 75 | \n",
" 8 | \n",
" 5 | \n",
" 161 | \n",
"
\n",
" \n",
" 2 | \n",
" 50 | \n",
" 6 | \n",
" 0 | \n",
" 158 | \n",
"
\n",
" \n",
" 3 | \n",
" 78 | \n",
" 8 | \n",
" 8 | \n",
" 158 | \n",
"
\n",
" \n",
" 4 | \n",
" 70 | \n",
" 8 | \n",
" 0 | \n",
" 157 | \n",
"
\n",
" \n",
" 5 | \n",
" 77 | \n",
" 8 | \n",
" 7 | \n",
" 157 | \n",
"
\n",
" \n",
" 6 | \n",
" 76 | \n",
" 8 | \n",
" 6 | \n",
" 154 | \n",
"
\n",
" \n",
" 7 | \n",
" 60 | \n",
" 7 | \n",
" 0 | \n",
" 153 | \n",
"
\n",
" \n",
" 8 | \n",
" 64 | \n",
" 7 | \n",
" 4 | \n",
" 153 | \n",
"
\n",
" \n",
" 9 | \n",
" 66 | \n",
" 7 | \n",
" 6 | \n",
" 153 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Unnamed: 0 image_ID time_after_exposure_in_s number_of_cells\n",
"0 10 2 0 201\n",
"1 75 8 5 161\n",
"2 50 6 0 158\n",
"3 78 8 8 158\n",
"4 70 8 0 157\n",
"5 77 8 7 157\n",
"6 76 8 6 154\n",
"7 60 7 0 153\n",
"8 64 7 4 153\n",
"9 66 7 6 153"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"query = \"\"\"\n",
"SELECT * \n",
"FROM measurement\n",
"ORDER BY number_of_cells DESC\n",
"LIMIT 10;\n",
"\"\"\"\n",
"\n",
"query_database(query)"
]
},
{
"cell_type": "markdown",
"id": "4e71d016-b617-47c2-997f-d8ad9ddeefde",
"metadata": {},
"source": [
"## Exercise\n",
"Determine the longest `time_after_exposure_in_s` applied in any experient."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d9d6d5e3-b630-4a7b-9741-f9f9f289cb4f",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.15"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}