Querying databases using SQL#

The first step is to load and explore data. We use pandas for loading the data and all we do is store it in variables.

import pandas as pd
from pandasql import sqldf

When working with SQL, common practice is to call tables (a.k.a. DataFrames) as singluar. This allows to read SQL code better. For example in a database containing experiments, we can later use experiment.concentration to describe the concentration of a drug that was used to expose cells.

experiment = pd.read_csv("../../data/experiments.csv")
measurement = pd.read_csv("../../data/measurements.csv")

Next we combine these tables in a database. Therefore, we define a function that can query the database.

query_database = lambda q: sqldf(q, globals())

Select From#

Next, we define a query in SQL. SQL has a quite nicely human readable format. Typically we start by defining what we want to read (SELECT) and from where (FROM).

query = """
SELECT * 
FROM measurement
"""

query_database(query)
Unnamed: 0 image_ID time_after_exposure_in_s number_of_cells
0 0 1 0 99
1 1 1 1 52
2 2 1 2 33
3 3 1 3 25
4 4 1 4 21
... ... ... ... ...
95 95 10 5 52
96 96 10 6 54
97 97 10 7 54
98 98 10 8 54
99 99 10 9 51

100 rows × 4 columns

By the way, SQL is not case-sensitive per-se but you typically find SQL command words in capital letters.

query = """
select * 
from measurement
"""

query_database(query)
Unnamed: 0 image_ID time_after_exposure_in_s number_of_cells
0 0 1 0 99
1 1 1 1 52
2 2 1 2 33
3 3 1 3 25
4 4 1 4 21
... ... ... ... ...
95 95 10 5 52
96 96 10 6 54
97 97 10 7 54
98 98 10 8 54
99 99 10 9 51

100 rows × 4 columns

You can also select specific columns by name. Also here, SQL is not case-sensitive. The selected column concentration shows up as Concentration because that’s its name in the database.

query = """
select Comment, concentration
from experiment
"""

query_database(query)
Comment Concentration
0 High dose 50
1 Medium dose 20
2 Control 0
3 Super high dose 1000

Sorting tables#

This table can also be sorted using the ORDER BY statement, for example to see the highest number of cells. In this case we sort the column descending using the DESC statement. The ascending order ASC would be default.

query = """
SELECT * 
FROM measurement
ORDER BY number_of_cells DESC
"""

query_database(query)
Unnamed: 0 image_ID time_after_exposure_in_s number_of_cells
0 10 2 0 201
1 75 8 5 161
2 50 6 0 158
3 78 8 8 158
4 70 8 0 157
... ... ... ... ...
95 47 5 7 8
96 38 4 8 7
97 48 5 8 7
98 39 4 9 6
99 49 5 9 6

100 rows × 4 columns

Specifying the number of rows to query#

In case a table is very large, retrieving all rows may take a long time. To get insights into the data anyway, it may make sense to LIMIT the table by the top 10 rows.

query = """
SELECT * 
FROM measurement
ORDER BY number_of_cells DESC
LIMIT 10;
"""

query_database(query)
Unnamed: 0 image_ID time_after_exposure_in_s number_of_cells
0 10 2 0 201
1 75 8 5 161
2 50 6 0 158
3 78 8 8 158
4 70 8 0 157
5 77 8 7 157
6 76 8 6 154
7 60 7 0 153
8 64 7 4 153
9 66 7 6 153

Exercise#

Determine the longest time_after_exposure_in_s applied in any experient.