{ "cells": [ { "cell_type": "markdown", "id": "fa63f581-f411-4e3c-8a36-255c1c986389", "metadata": {}, "source": [ "# Querying databases using SQL\n", "\n", "The first step is to load and explore data. We use pandas for loading the data and all we do is store it in variables." ] }, { "cell_type": "code", "execution_count": 1, "id": "91c81c85-dce7-4b20-9654-ad52d051743d", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from pandasql import sqldf" ] }, { "cell_type": "markdown", "id": "6cd8dc6b-b640-4c2c-9f0d-82978a08db82", "metadata": {}, "source": [ "When working with SQL, common practice is to call tables (a.k.a. DataFrames) as singluar. This allows to read SQL code better. For example in a database containing experiments, we can later use `experiment.concentration` to describe the concentration of a drug that was used to expose cells." ] }, { "cell_type": "code", "execution_count": 2, "id": "6a98379e-c60d-4ce2-81e7-2ad47c5225e4", "metadata": {}, "outputs": [], "source": [ "experiment = pd.read_csv(\"../../data/experiments.csv\")\n", "measurement = pd.read_csv(\"../../data/measurements.csv\")" ] }, { "cell_type": "markdown", "id": "3dfa47bc-7f85-49fb-8a62-57b6b65054ae", "metadata": {}, "source": [ "Next we combine these tables in a database. Therefore, we define a function that can query the database." ] }, { "cell_type": "code", "execution_count": 3, "id": "bdec6f91-22c5-404f-a436-b23a29506e9a", "metadata": {}, "outputs": [], "source": [ "query_database = lambda q: sqldf(q, globals())" ] }, { "cell_type": "markdown", "id": "ebe7d09c-fe64-4d17-a6f7-00b910980899", "metadata": {}, "source": [ "# Select From\n", "Next, we define a query in SQL. SQL has a quite nicely human readable format. Typically we start by defining what we want to read (`SELECT`) and from where (`FROM`)." ] }, { "cell_type": "code", "execution_count": 4, "id": "6fb1244e-70a6-4c6c-bb1f-8bfe7e550f2c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0image_IDtime_after_exposure_in_snumber_of_cells
001099
111152
221233
331325
441421
...............
959510552
969610654
979710754
989810854
999910951
\n", "

100 rows × 4 columns

\n", "
" ], "text/plain": [ " Unnamed: 0 image_ID time_after_exposure_in_s number_of_cells\n", "0 0 1 0 99\n", "1 1 1 1 52\n", "2 2 1 2 33\n", "3 3 1 3 25\n", "4 4 1 4 21\n", ".. ... ... ... ...\n", "95 95 10 5 52\n", "96 96 10 6 54\n", "97 97 10 7 54\n", "98 98 10 8 54\n", "99 99 10 9 51\n", "\n", "[100 rows x 4 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \"\"\"\n", "SELECT * \n", "FROM measurement\n", "\"\"\"\n", "\n", "query_database(query)" ] }, { "cell_type": "markdown", "id": "37e59421-2e1e-4ae8-9380-bbed2f0127d7", "metadata": {}, "source": [ "By the way, SQL is not case-sensitive per-se but you typically find SQL command words in capital letters." ] }, { "cell_type": "code", "execution_count": 5, "id": "5d7d7f3a-b3c4-403d-8de1-86d9f9ca9750", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0image_IDtime_after_exposure_in_snumber_of_cells
001099
111152
221233
331325
441421
...............
959510552
969610654
979710754
989810854
999910951
\n", "

100 rows × 4 columns

\n", "
" ], "text/plain": [ " Unnamed: 0 image_ID time_after_exposure_in_s number_of_cells\n", "0 0 1 0 99\n", "1 1 1 1 52\n", "2 2 1 2 33\n", "3 3 1 3 25\n", "4 4 1 4 21\n", ".. ... ... ... ...\n", "95 95 10 5 52\n", "96 96 10 6 54\n", "97 97 10 7 54\n", "98 98 10 8 54\n", "99 99 10 9 51\n", "\n", "[100 rows x 4 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \"\"\"\n", "select * \n", "from measurement\n", "\"\"\"\n", "\n", "query_database(query)" ] }, { "cell_type": "markdown", "id": "091108ac-f5fe-474d-b3b1-1a63492f3608", "metadata": {}, "source": [ "You can also select specific columns by name. Also here, SQL is not case-sensitive. The selected column `concentration` shows up as `Concentration` because that's its name in the database." ] }, { "cell_type": "code", "execution_count": 6, "id": "c93f5a11-53e7-4318-987c-66cd39b25139", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CommentConcentration
0High dose50
1Medium dose20
2Control0
3Super high dose1000
\n", "
" ], "text/plain": [ " Comment Concentration\n", "0 High dose 50\n", "1 Medium dose 20\n", "2 Control 0\n", "3 Super high dose 1000" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \"\"\"\n", "select Comment, concentration\n", "from experiment\n", "\"\"\"\n", "\n", "query_database(query)" ] }, { "cell_type": "markdown", "id": "7f18fe8a-1273-40d5-a68c-bd74c02d2eed", "metadata": {}, "source": [ "## Sorting tables\n", "\n", "This table can also be sorted using the `ORDER BY` statement, for example to see the highest number of cells. In this case we sort the column descending using the `DESC` statement. The ascenting order `ASC` would be default." ] }, { "cell_type": "code", "execution_count": 7, "id": "9faf21e7-745d-49d7-bf9f-d11420cf5e81", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0image_IDtime_after_exposure_in_snumber_of_cells
01020201
17585161
25060158
37888158
47080157
...............
9547578
9638487
9748587
9839496
9949596
\n", "

100 rows × 4 columns

\n", "
" ], "text/plain": [ " Unnamed: 0 image_ID time_after_exposure_in_s number_of_cells\n", "0 10 2 0 201\n", "1 75 8 5 161\n", "2 50 6 0 158\n", "3 78 8 8 158\n", "4 70 8 0 157\n", ".. ... ... ... ...\n", "95 47 5 7 8\n", "96 38 4 8 7\n", "97 48 5 8 7\n", "98 39 4 9 6\n", "99 49 5 9 6\n", "\n", "[100 rows x 4 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \"\"\"\n", "SELECT * \n", "FROM measurement\n", "ORDER BY number_of_cells DESC\n", "\"\"\"\n", "\n", "query_database(query)" ] }, { "cell_type": "markdown", "id": "70c7789a-de21-4b14-8126-b256e4f4a4df", "metadata": {}, "source": [ "## Specifying the number of row to query\n", "\n", "In case a table is very large, retrieving all rows may take a long time. To get insights into the data anyway, it may make sense to `LIMIT` the table by the top 10 rows." ] }, { "cell_type": "code", "execution_count": 8, "id": "4c586b76-44a2-4207-a987-b2b4cb0260e6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0image_IDtime_after_exposure_in_snumber_of_cells
01020201
17585161
25060158
37888158
47080157
57787157
67686154
76070153
86474153
96676153
\n", "
" ], "text/plain": [ " Unnamed: 0 image_ID time_after_exposure_in_s number_of_cells\n", "0 10 2 0 201\n", "1 75 8 5 161\n", "2 50 6 0 158\n", "3 78 8 8 158\n", "4 70 8 0 157\n", "5 77 8 7 157\n", "6 76 8 6 154\n", "7 60 7 0 153\n", "8 64 7 4 153\n", "9 66 7 6 153" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \"\"\"\n", "SELECT * \n", "FROM measurement\n", "ORDER BY number_of_cells DESC\n", "LIMIT 10;\n", "\"\"\"\n", "\n", "query_database(query)" ] }, { "cell_type": "markdown", "id": "4e71d016-b617-47c2-997f-d8ad9ddeefde", "metadata": {}, "source": [ "## Exercise\n", "Determine the longest `time_after_exposure_in_s` applied in any experient." ] }, { "cell_type": "code", "execution_count": null, "id": "d9d6d5e3-b630-4a7b-9741-f9f9f289cb4f", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.15" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }