Querying databases using SQL#
The first step is to load and explore data. We use pandas for loading the data and all we do is store it in variables.
import pandas as pd
from pandasql import sqldf
When working with SQL, common practice is to call tables (a.k.a. DataFrames) as singluar. This allows to read SQL code better. For example in a database containing experiments, we can later use experiment.concentration
to describe the concentration of a drug that was used to expose cells.
experiment = pd.read_csv("../../data/experiments.csv")
measurement = pd.read_csv("../../data/measurements.csv")
Next we combine these tables in a database. Therefore, we define a function that can query the database.
query_database = lambda q: sqldf(q, globals())
Select From#
Next, we define a query in SQL. SQL has a quite nicely human readable format. Typically we start by defining what we want to read (SELECT
) and from where (FROM
).
query = """
SELECT *
FROM measurement
"""
query_database(query)
Unnamed: 0 | image_ID | time_after_exposure_in_s | number_of_cells | |
---|---|---|---|---|
0 | 0 | 1 | 0 | 99 |
1 | 1 | 1 | 1 | 52 |
2 | 2 | 1 | 2 | 33 |
3 | 3 | 1 | 3 | 25 |
4 | 4 | 1 | 4 | 21 |
... | ... | ... | ... | ... |
95 | 95 | 10 | 5 | 52 |
96 | 96 | 10 | 6 | 54 |
97 | 97 | 10 | 7 | 54 |
98 | 98 | 10 | 8 | 54 |
99 | 99 | 10 | 9 | 51 |
100 rows × 4 columns
By the way, SQL is not case-sensitive per-se but you typically find SQL command words in capital letters.
query = """
select *
from measurement
"""
query_database(query)
Unnamed: 0 | image_ID | time_after_exposure_in_s | number_of_cells | |
---|---|---|---|---|
0 | 0 | 1 | 0 | 99 |
1 | 1 | 1 | 1 | 52 |
2 | 2 | 1 | 2 | 33 |
3 | 3 | 1 | 3 | 25 |
4 | 4 | 1 | 4 | 21 |
... | ... | ... | ... | ... |
95 | 95 | 10 | 5 | 52 |
96 | 96 | 10 | 6 | 54 |
97 | 97 | 10 | 7 | 54 |
98 | 98 | 10 | 8 | 54 |
99 | 99 | 10 | 9 | 51 |
100 rows × 4 columns
You can also select specific columns by name. Also here, SQL is not case-sensitive. The selected column concentration
shows up as Concentration
because that’s its name in the database.
query = """
select Comment, concentration
from experiment
"""
query_database(query)
Comment | Concentration | |
---|---|---|
0 | High dose | 50 |
1 | Medium dose | 20 |
2 | Control | 0 |
3 | Super high dose | 1000 |
Sorting tables#
This table can also be sorted using the ORDER BY
statement, for example to see the highest number of cells. In this case we sort the column descending using the DESC
statement. The ascending order ASC
would be default.
query = """
SELECT *
FROM measurement
ORDER BY number_of_cells DESC
"""
query_database(query)
Unnamed: 0 | image_ID | time_after_exposure_in_s | number_of_cells | |
---|---|---|---|---|
0 | 10 | 2 | 0 | 201 |
1 | 75 | 8 | 5 | 161 |
2 | 50 | 6 | 0 | 158 |
3 | 78 | 8 | 8 | 158 |
4 | 70 | 8 | 0 | 157 |
... | ... | ... | ... | ... |
95 | 47 | 5 | 7 | 8 |
96 | 38 | 4 | 8 | 7 |
97 | 48 | 5 | 8 | 7 |
98 | 39 | 4 | 9 | 6 |
99 | 49 | 5 | 9 | 6 |
100 rows × 4 columns
Specifying the number of rows to query#
In case a table is very large, retrieving all rows may take a long time. To get insights into the data anyway, it may make sense to LIMIT
the table by the top 10 rows.
query = """
SELECT *
FROM measurement
ORDER BY number_of_cells DESC
LIMIT 10;
"""
query_database(query)
Unnamed: 0 | image_ID | time_after_exposure_in_s | number_of_cells | |
---|---|---|---|---|
0 | 10 | 2 | 0 | 201 |
1 | 75 | 8 | 5 | 161 |
2 | 50 | 6 | 0 | 158 |
3 | 78 | 8 | 8 | 158 |
4 | 70 | 8 | 0 | 157 |
5 | 77 | 8 | 7 | 157 |
6 | 76 | 8 | 6 | 154 |
7 | 60 | 7 | 0 | 153 |
8 | 64 | 7 | 4 | 153 |
9 | 66 | 7 | 6 | 153 |
Exercise#
Determine the longest time_after_exposure_in_s
applied in any experient.