---
title: "Week 3: `GROUP BY`/`HAVING`"
---
## Connecting to our database
Let's connect to our database.
```{r}
#| context: setup
library(duckdb)
library(DBI)
library(tidyverse)
con <- DBI::dbConnect(duckdb::duckdb(),
"data/GiBleed_5.3_1.1.duckdb")
```
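If you want to confirm the connection works, you can list the tables in the database (an optional check using DBI's `dbListTables()`):

```{r}
#| eval: false
# List the tables available through this connection
DBI::dbListTables(con)
```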
## `GROUP BY`
Say we want to count rows, or calculate totals or averages, for a particular column by a grouping variable. For example, suppose we want to group the `person` table by the `gender_source_value` column and count the number of `person_id`s for each value of `gender_source_value`. We can use a `SELECT`/`GROUP BY` pattern to do this.
There are some requirements to using `SELECT`/`GROUP BY`:
- The grouping variable should be categorical (such as `gender_source_value`)
- Every non-aggregated column in the `SELECT` clause must appear in the `GROUP BY` clause (you'll usually also want to `SELECT` the grouping variable so you can tell the groups apart)
Count the number of persons by `gender_source_value`:
```{sql}
#| connection: "con"
SELECT gender_source_value, COUNT(person_id) AS person_count
FROM person
GROUP BY gender_source_value
```
Notice that we use the `AS` keyword to alias `COUNT(person_id)` as `person_count` in the output.
We can summarize our column in other ways besides `COUNT`:
- `AVG` (the standard SQL name; DuckDB also accepts `MEAN` as an alias)
- `MIN`
- `MAX`
- `MEDIAN`
For example, we can look at the minimum `year_of_birth` for each gender in the `person` table:
```{sql}
#| connection: "con"
SELECT gender_source_value, MIN(year_of_birth)
FROM person
GROUP BY gender_source_value
```
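You can also compute several of these aggregates in one query. Here is a quick sketch that summarizes `year_of_birth` a few ways at once (the column aliases are just illustrative names):

```{sql}
#| connection: "con"
SELECT gender_source_value,
       MIN(year_of_birth) AS min_year,
       MAX(year_of_birth) AS max_year,
       AVG(year_of_birth) AS avg_year
FROM person
GROUP BY gender_source_value
```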
### Check on Learning
If we look at `concept` table, we notice that there are groups of concepts organized by the `domain_id` column:
```{sql}
#| connection: "con"
SELECT concept_id, concept_name, domain_id, vocabulary_id
FROM concept
LIMIT 10
```
`COUNT` the number of `concept_id`s grouped by `domain_id` in the `concept` table:
```{sql}
#| connection: "con"
#| eval: false
SELECT domain_id, COUNT(------) AS count_domain
FROM concept
GROUP BY -------
ORDER BY count_domain DESC
```
You can also group by multiple variables. What happens if you group by `domain_id` *and* `vocabulary_id`?
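As a syntax sketch (left unevaluated so you can try it yourself), grouping by multiple variables just means listing the columns after `GROUP BY`, separated by commas:

```{sql}
#| connection: "con"
#| eval: false
SELECT domain_id, vocabulary_id, COUNT(concept_id) AS count_domain
FROM concept
GROUP BY domain_id, vocabulary_id
ORDER BY count_domain DESC
```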
## `GROUP BY` with `JOIN`s
Recall that the `procedure_occurrence` table records each person's procedures. Suppose we do a `GROUP BY` on `procedure_concept_id` and count the number of `person_id`s to understand how many people were treated with each procedure (strictly, this counts procedure records; a person who had the same procedure twice is counted twice, which `COUNT(DISTINCT person_id)` would avoid):
```{sql}
#| connection: "con"
SELECT procedure_concept_id, COUNT(person_id) AS person_count
FROM procedure_occurrence
GROUP BY procedure_concept_id
ORDER BY person_count DESC
```
We wish we knew what each `procedure_concept_id` referred to. For that, we need to join with the `concept` table.
Here, we're combining `SELECT`/`GROUP BY` with an `INNER JOIN`:
```{sql}
#| connection: "con"
SELECT c.concept_name AS procedure, COUNT(person_id) AS person_count
FROM procedure_occurrence AS po
INNER JOIN concept AS c
ON po.procedure_concept_id = c.concept_id
GROUP BY c.concept_name
ORDER BY person_count DESC
```
Even more complicated: we can group by multiple variables. Here is a three-table join where we count by both `gender_source_value` and `concept_name`:
```{sql}
#| connection: "con"
SELECT c.concept_name AS procedure, p.gender_source_value, COUNT(p.person_id) AS person_count
FROM procedure_occurrence AS po
INNER JOIN person AS p
ON p.person_id = po.person_id
INNER JOIN concept AS c
ON po.procedure_concept_id = c.concept_id
GROUP BY c.concept_name, p.gender_source_value
ORDER BY person_count DESC
```
## `HAVING`
We can filter by these aggregated variables, but we can't use them in a `WHERE` clause. Instead, SQL has an additional clause, `HAVING`:
```{sql}
#| connection: "con"
SELECT c.concept_name AS procedure, COUNT(person_id) AS person_count
FROM procedure_occurrence AS po
INNER JOIN concept AS c
ON po.procedure_concept_id = c.concept_id
GROUP BY c.concept_name
HAVING person_count > 500
ORDER BY person_count DESC
```
Why can't we use `WHERE`?
It turns out that SQL clauses have a fixed evaluation order (a *priority*), which tells the engine how to execute the clauses as your queries grow. The `WHERE` clause has *higher priority* than the `GROUP BY` clause, so if you had written `WHERE person_count > 500`, it would be evaluated before `GROUP BY`; at that point the engine has no idea `person_count` exists and throws an error. Here is the full list of SQL clause priorities:
| Priority | Clause | Purpose |
|-----------------|-----------------|--------------------------------------|
| 1 | `FROM` | Choose tables to query and specify how to `JOIN` them together |
| 2 | `WHERE` | Filter tables based on criteria |
| 3 | `GROUP BY` | Aggregates the data |
| 4 | `HAVING` | Filters the aggregated data |
| 5 | `SELECT` | Selects columns and calculates new columns |
| 6 | `ORDER BY` | Sorts by a database field |
| 7 | `LIMIT` | Limits the number of records returned |
In general, you put `WHERE` to do any filtering before `GROUP BY` runs. Then, after the data is grouped and aggregated, you can do additional filtering on the aggregated data via `HAVING`. Your SQL statement will not work if you put `WHERE` after `GROUP BY`/`HAVING`.
Here is an example of using both `WHERE` and `HAVING`:
```{sql}
#| connection: "con"
SELECT domain_id, COUNT(concept_id) AS count_domain
FROM concept
WHERE domain_id != 'Drug'
GROUP BY domain_id
HAVING count_domain > 40
ORDER BY count_domain DESC
```
Here's what happens when you put `WHERE` after `GROUP BY`/`HAVING`. Can you fix it?
```{sql}
#| connection: "con"
#| eval: false
SELECT domain_id, COUNT(concept_id) AS count_domain
FROM concept
GROUP BY domain_id
HAVING count_domain > 40
WHERE domain_id != 'Drug'
ORDER BY count_domain DESC
```
Here is `WHERE`/`GROUP BY`/`HAVING` combined with an `INNER JOIN`:
```{sql}
#| connection: "con"
SELECT c.concept_name AS procedure, COUNT(person_id) AS person_count
FROM procedure_occurrence AS po
INNER JOIN concept AS c
ON po.procedure_concept_id = c.concept_id
WHERE date_part('YEAR', po.procedure_datetime) > 2000
GROUP BY c.concept_name
HAVING person_count > 500
ORDER BY person_count DESC
```
### Check on Learning
Suppose we were given this join, with the column `year` extracted from `procedure_datetime`:
```{sql}
#| connection: "con"
SELECT date_part('YEAR', po.procedure_datetime) AS year, person_id, procedure_occurrence_id
FROM procedure_occurrence AS po
INNER JOIN concept AS c
ON po.procedure_concept_id = c.concept_id
```
Build on top of this query: Group by `year`, and then aggregate by the `COUNT` of `person_id`. Finally, filter it so that the `year` is higher than 1990. Should you be using `WHERE` or `HAVING`?
```{sql}
#| connection: "con"
#| eval: false
SELECT date_part('YEAR', po.procedure_datetime) AS year, person_id
FROM procedure_occurrence AS po
INNER JOIN concept AS c
ON po.procedure_concept_id = c.concept_id
```
## `IN`/`LIKE`
Here are a couple of twists on `WHERE`. We can use `IN` to match against multiple values at once, putting the values in parentheses `()` separated by commas:
```{sql}
#| connection: "con"
SELECT concept_name, domain_id
FROM concept
WHERE domain_id IN ('Drug', 'Condition')
```
We can use `NOT` with `IN` to exclude a list of values:
```{sql}
#| connection: "con"
SELECT concept_name, domain_id
FROM concept
WHERE domain_id NOT IN ('Drug', 'Condition')
```
One note: for long lists of values, it is usually faster to make a temporary table holding them and join on that table. We'll talk more about temporary tables below.
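As a preview of that pattern, here is a sketch using a hypothetical `wanted_domains` temporary table in place of the `IN` list (left unevaluated):

```{sql}
#| connection: "con"
#| eval: false
-- Hypothetical table holding the values we want to match
CREATE OR REPLACE TEMP TABLE wanted_domains(domain_id VARCHAR);
INSERT INTO wanted_domains VALUES ('Drug'), ('Condition');
```

```{sql}
#| connection: "con"
#| eval: false
-- The join does the same filtering as the IN list
SELECT c.concept_name, c.domain_id
FROM concept AS c
INNER JOIN wanted_domains AS w
  ON c.domain_id = w.domain_id
```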
`LIKE` is one way to do wildcard searches; `%` matches any run of characters:
```{sql}
#| connection: "con"
SELECT concept_name, domain_id
FROM concept
WHERE domain_id LIKE 'Dru%'
```
You can find more information about pattern matching [here](https://duckdb.org/docs/stable/sql/functions/pattern_matching).
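DuckDB also supports `ILIKE`, a case-insensitive version of `LIKE`; a quick sketch (the search term is just an example):

```{sql}
#| connection: "con"
SELECT concept_name, domain_id
FROM concept
WHERE concept_name ILIKE '%aspirin%'
LIMIT 10
```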
## Creating Temporary Tables
Temporary tables can be very useful for storing intermediate results.
Temporary tables only last for the session: they disappear after you disconnect, so don't use them for permanent storage.
You can use the `CREATE OR REPLACE TEMPORARY TABLE` clause, followed by your temporary table name and `AS`, and then a query of your choice:
```{sql}
#| connection: "con"
CREATE OR REPLACE TEMPORARY TABLE temp_person AS
SELECT person_id, gender_source_value
FROM person
```
You can also use the `CREATE TEMPORARY TABLE` clause, but it will give you an error if the temporary table has already been created.
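If you prefer plain `CREATE TEMPORARY TABLE`, one common sketch is to drop any existing table first (left unevaluated because it runs two statements):

```{sql}
#| connection: "con"
#| eval: false
-- Remove the table if it already exists, then create it fresh
DROP TABLE IF EXISTS temp_person;
CREATE TEMPORARY TABLE temp_person AS
SELECT person_id, gender_source_value
FROM person
```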
We can also load a spreadsheet in as a temporary table. Suppose we want to load in the following CSV file:
```{r}
read_csv("data/temp_cost.csv")
```
Then we can use `COPY` from DuckDB to load it in:
```{sql}
#| connection: "con"
CREATE OR REPLACE TEMP TABLE cost(
concept_name VARCHAR,
procedure_concept_id INT,
cost INT
);
COPY cost FROM 'data/temp_cost.csv'
```
DuckDB is also smart enough to infer the column names and types from the data:
```{sql}
#| connection: "con"
CREATE OR REPLACE TEMP TABLE cost AS
SELECT * FROM read_csv('data/temp_cost.csv')
```
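To check what DuckDB inferred, you can inspect the table's schema with `DESCRIBE`:

```{sql}
#| connection: "con"
DESCRIBE cost
```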
Now our table exists in our database, and we can work with it.
```{sql}
#| connection: "con"
SELECT * FROM cost
```
Now we can merge our temporary `cost` table with `procedure_occurrence` and calculate the total cost per year:
```{sql}
#| connection: "con"
SELECT date_part('YEAR', po.procedure_datetime) AS year, SUM(cost) AS sum_cost_year
FROM procedure_occurrence AS po
INNER JOIN cost AS c
ON po.procedure_concept_id = c.procedure_concept_id
GROUP BY year
ORDER BY year DESC
```
We'll talk more about subqueries and views next time; they are other options for splitting queries up.
## Data Integrity
We talked a little bit last week about database constraints, such as `FOREIGN KEY` constraints, where we can't add a row that refers to a foreign key value that doesn't exist in the parent table.
These constraints exist to ensure the *data integrity* of a database. For example, we don't want rows in `procedure_occurrence` whose `procedure_concept_id` doesn't exist in the `concept` table.
Another way to keep data integrity is to have all operations be **ACID**-compliant transactions. That is, each set of operations (such as inserting and removing rows) needs to be done in full before the next set of transactions (which could come from another user) is applied to the database. A minimal transaction sketch follows the list below.
ACID is short for:
- **Atomicity** - the operation must be all or none
- **Consistency** - the operation must leave the database in a valid state, with all constraints still satisfied
- **Isolation** - concurrent operations must not interfere with each other; each behaves as if it were run in series, not in parallel
- **Durability** - the operation must be robust to disruptions (like power outages). If an update is interrupted, there must be a *rollback* mechanism to restore the previous version of the data.
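In SQL, atomicity is exposed through *transactions*: statements between `BEGIN TRANSACTION` and `COMMIT` succeed or fail as a unit, and `ROLLBACK` undoes the batch. A minimal sketch, left unevaluated, with hypothetical values for our `cost` table:

```{sql}
#| connection: "con"
#| eval: false
BEGIN TRANSACTION;
-- Hypothetical rows: both inserts apply together, or neither does
INSERT INTO cost VALUES ('Example procedure A', 1234, 100);
INSERT INTO cost VALUES ('Example procedure B', 5678, 250);
COMMIT;
-- If something went wrong before COMMIT, we could run: ROLLBACK;
```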
Finally, the design of the tables, the information they contain, and how they relate to each other are also important to data integrity. The process of deciding which columns belong to which tables is called *normalization*.
## Database Design
Database design can be difficult because:
1. You need to understand the requirements of the data and how it is collected
- For example, when is procedure information collected?
- Do patients have multiple procedures? (Cardinality)
2. You need to group like data with like (normalization)
- Data that is dependent on a primary key should stay together
- For example, `person` should contain information about a patient, such as demographics, but not individual `procedure_concept_id`s.
3. You need to have an automated process to add data to the database (Extract, Transform, Load, or ETL).
4. Search processes must be optimized for common operations (indexing)
Of these, steps 1 and 2 are the most difficult and take the most time. They require the designer to interview the users of the data and those who collect it, so that the database reflects the business processes. These two steps are called the **Data Modeling** steps.
These processes are essential if you are designing a **transactional database** that collects data from multiple sources (such as clinicians at the time of care) and is updated many times a second.
If you want to read more about the data model we're using, I've written up a short bit here: [OMOP Data Model](miscellaneous.html#the-omop-data-model).
## Database Administration
Maintaining a database is also known as **database administration**. Database Admins are responsible for the following:
1. Making sure that the data maintains its integrity
2. Ensuring that common queries are optimized for fast loading
3. General upkeep and optimization. Oftentimes, if multiple people are accessing the data at once, the data may be distributed among multiple machines (load balancing).
4. Security. We don't want the wrong people accessing the data.
Being a good admin doesn't start from scratch; you can't be a top-tier admin straight out of school. DB admins learn a lot of things, and much of the optimization skill comes from experience managing the data.
Respect your DB admin, and know that they know a lot about how to optimize your queries.
## Always close the connection
When we're done, it's best to close the connection with `dbDisconnect()`.
```{r}
dbDisconnect(con)
```