
Commit 1674993

Added a new exercise and a paragraph about join.
1 parent 85348f6 commit 1674993

3 files changed: 329 additions & 17 deletions


.gitignore

Lines changed: 1 addition & 0 deletions
@@ -4,3 +4,4 @@
 .jupyter_cache
 jupyter_execute
 uv.lock
+nyc_yellow_taxi_2025-01.parquet

content/tabular-data.md

Lines changed: 64 additions & 17 deletions
@@ -529,7 +529,7 @@ shape: (3_475_226, 1)
 ```
 
 ```python
-df.with_columns((pl.col('trip_distance')/pl.col('trip_duration_sec')*3600).alias("avg_sp\
+df.with_columns((pl.col('trip_distance')/pl.col('trip_duration_sec')*3600).alias("avg_sp
 eed_mph"))
 shape: (3_475_226, 22)
 ┌──────────┬──────────┬──────────┬──────────┬───┬──────────┬──────────┬──────────┬──────────┐
@@ -624,8 +624,68 @@ shape: (104_410, 22)
 
 The `group_by` context behaves like its Pandas counterpart.
 
+### Transformations
+
+A `join` operation combines columns from one or more dataframes into a new
+dataframe. There are different joining strategies, which influence how columns
+are combined and which rows are included in the final result. A common type is
+the *equi* join, where rows are matched on a key expression. Let us clarify
+this with an example. The `df` dataframe does not include specific coordinates
+for each pickup and drop-off, but only a `PULocationID` and a `DOLocationID`.
+There is a `taxy_zones_xy.csv` file that contains, for each `LocationID`, the
+longitude (X) and latitude (Y) of the location, as well as the names of the
+zone and borough:
+
+```python
+lookup_df = pl.read_csv('taxy_zones_xy.csv', has_header=True)
+lookup_df.head()
+┌────────────┬────────────┬───────────┬─────────────────────────┬───────────────┐
+│ LocationID ┆ X          ┆ Y         ┆ zone                    ┆ borough       │
+│ ---        ┆ ---        ┆ ---       ┆ ---                     ┆ ---           │
+│ i64        ┆ f64        ┆ f64       ┆ str                     ┆ str           │
+╞════════════╪════════════╪═══════════╪═════════════════════════╪═══════════════╡
+│ 1          ┆ -74.176786 ┆ 40.689516 ┆ Newark Airport          ┆ EWR           │
+│ 2          ┆ -73.826126 ┆ 40.625724 ┆ Jamaica Bay             ┆ Queens        │
+│ 3          ┆ -73.849479 ┆ 40.865888 ┆ Allerton/Pelham Gardens ┆ Bronx         │
+│ 4          ┆ -73.977023 ┆ 40.724152 ┆ Alphabet City           ┆ Manhattan     │
+│ 5          ┆ -74.18993  ┆ 40.55034  ┆ Arden Heights           ┆ Staten Island │
+└────────────┴────────────┴───────────┴─────────────────────────┴───────────────┘
+```
+
+These columns can be appended to the original `df` to obtain some form of
+geographical data as follows (e.g. for the `PULocationID`):
+
+```python
+df = df.join(lookup_df, left_on='PULocationID', right_on='LocationID',
+             how='left', suffix='_pickup')
+```
+
+In the line above, `left_on` indicates the *key* in the original dataframe,
+`right_on` specifies the *key* in the `lookup_df` dataframe, `how='left'`
+means that the columns from the second dataframe are added to the first (and
+not the other way around), and `suffix` is appended to the names of the joined
+columns (i.e., `df` will contain columns called `X_pickup`, `Y_pickup`,
+`zone_pickup` and `borough_pickup`). More information on join operations can
+be found [here](https://docs.pola.rs/user-guide/transformations/joins/).
+
 ## Exercises
 
+:::{exercise} Joining geographical data
+We have already seen how to add actual latitude and longitude for the pickups.
+Now do the same for the drop-offs!
+
+:::
+
+:::{solution}
+
+```python
+df = df.join(lookup_df, left_on='DOLocationID', right_on='LocationID',
+             how='left', suffix='_dropoff')
+```
+
+:::
+
 :::{exercise} Feature engineering: enriching the dataset
 We want to understand a bit more of the traffic in the city by creating
 new features (i.e. columns), in particular:
@@ -653,18 +713,12 @@ df = raw_df.with_columns([
     .alias("trip_duration_sec"),
 ])
 
-# ------------------------------------------------------------
-# 4️⃣ Speed feature (mph)
-# ------------------------------------------------------------
 df = df.with_columns(
     # TODO: add expression for average velocity here
     .replace_nan(None)  # protect against div-by-zero
     .alias("avg_speed_mph")
 )
 
-# ------------------------------------------------------------
-# 5️⃣ Zone-level contextual aggregates
-# ------------------------------------------------------------
 # Compute per-pickup-zone statistics once
 zone_stats = (
     df.group_by("PULocationID")
@@ -682,7 +736,9 @@ df = df.join(zone_stats, left_on="PULocationID", right_on="pickup_zone_id", how=
 
 While we haven't covered the `join` instruction earlier, its main role
 is to "spread" the `zone_stats` over all the rides in the original dataframe
-(i.e. write the `zone_avg_fare` on each ride in `df`).
+(i.e. write the `zone_avg_fare` on each ride in `df`). `join` has its roots
+in relational databases, where different tables can be merged based on a
+common column.
 :::
 
 :::{solution}
@@ -701,9 +757,6 @@ df = raw_df.with_columns([
     .alias("trip_duration_sec"),
 ])
 
-# ------------------------------------------------------------
-# 4️⃣ Speed feature (mph)
-# ------------------------------------------------------------
 df = df.with_columns(
     (
         pl.col("trip_distance") /
@@ -713,9 +766,6 @@ df = df.with_columns(
     .alias("avg_speed_mph")
 )
 
-# ------------------------------------------------------------
-# 5️⃣ Zone-level contextual aggregates
-# ------------------------------------------------------------
 # Compute per-pickup-zone statistics once
 zone_stats = (
     df.group_by("PULocationID")
@@ -795,9 +845,6 @@ df = raw_df.with_columns([
     .alias("dist_per_passenger"),
 ])
 
-# ------------------------------------------------------------
-# 4️⃣ Drop-off-zone contextual aggregates
-# ------------------------------------------------------------
 dropoff_stats = (
     df.group_by("DOLocationID")
     .agg([
