@@ -529,7 +529,7 @@ shape: (3_475_226, 1)
529529```
530530
531531``` python
532- df.with_columns((pl.col('trip_distance') / pl.col('trip_duration_sec') * 3600).alias("avg_sp\
532+ df.with_columns((pl.col('trip_distance') / pl.col('trip_duration_sec') * 3600).alias("avg_sp
533533eed_mph"))
534534shape: (3_475_226, 22)
535535┌──────────┬──────────┬──────────┬──────────┬───┬──────────┬──────────┬──────────┬──────────┐
@@ -624,8 +624,68 @@ shape: (104_410, 22)
624624
625625The `group_by` context behaves like its Pandas counterpart.
626626
627+ ### Transformations
628+
629+ A `join` operation combines columns from one or more dataframes into a new
630+ dataframe. There are different joining strategies, which influence how columns
631+ are combined and which rows end up in the final set. A common type is the
632+ *equi*-join, where rows are matched on a key expression. Let us clarify this
633+ with an example. The `df` dataframe does not include specific coordinates for
634+ each pickup and drop-off, but only a `PULocationID` and a `DOLocationID`.
635+ The `taxy_zones_xy.csv` file contains, for each `LocationID`, the longitude
636+ (X) and latitude (Y) of the location, as well as the name of its zone
637+ and borough:
638+
639+ ```python
640+
641+ lookup_df = pl.read_csv('taxy_zones_xy.csv', has_header=True)
642+ lookup_df.head()
643+ ┌────────────┬────────────┬───────────┬─────────────────────────┬───────────────┐
644+ │ LocationID ┆ X          ┆ Y         ┆ zone                    ┆ borough       │
645+ │ ---        ┆ ---        ┆ ---       ┆ ---                     ┆ ---           │
646+ │ i64        ┆ f64        ┆ f64       ┆ str                     ┆ str           │
647+ ╞════════════╪════════════╪═══════════╪═════════════════════════╪═══════════════╡
648+ │ 1          ┆ -74.176786 ┆ 40.689516 ┆ Newark Airport          ┆ EWR           │
649+ │ 2          ┆ -73.826126 ┆ 40.625724 ┆ Jamaica Bay             ┆ Queens        │
650+ │ 3          ┆ -73.849479 ┆ 40.865888 ┆ Allerton/Pelham Gardens ┆ Bronx         │
651+ │ 4          ┆ -73.977023 ┆ 40.724152 ┆ Alphabet City           ┆ Manhattan     │
652+ │ 5          ┆ -74.18993  ┆ 40.55034  ┆ Arden Heights           ┆ Staten Island │
653+ └────────────┴────────────┴───────────┴─────────────────────────┴───────────────┘
654+ ```
655+
656+ This can be used to append these columns to the original `df`, giving it some
657+ form of geographical data, as follows (e.g. for the `PULocationID`):
658+
659+ ```python
660+ df = df.join(lookup_df, left_on='PULocationID', right_on='LocationID',
661+              how='left', suffix='_pickup')
662+ ```
663+
664+ In the line above, `left_on` indicates the *key* in the original dataframe,
665+ `right_on` specifies the *key* in the `lookup_df` dataframe, `how='left'`
666+ means that the columns from the second dataframe are added to the first
667+ (and not the other way around), and `suffix` is appended to the names of
668+ joined columns that clash with columns already present in `df` (without a
669+ clash, as in this first join, the new columns simply keep the names `X`, `Y`,
670+ `zone` and `borough`). More information on join operations can be found [here](https://docs.pola.rs/user-guide/transformations/joins/).
671+
627672## Exercises
628673
674+ :::{exercise} Joining geographical data
675+ We have already seen how to add actual latitude and longitude for the pickups.
676+ Now do the same for the drop-offs!
677+
678+ :::
679+
680+ :::{solution}
681+
682+ ```python
683+ df = df.join(lookup_df, left_on='DOLocationID', right_on='LocationID',
684+              how='left', suffix='_dropoff')
685+ ```
686+
687+ :::
688+
629689:::{exercise} Feature engineering: enriching the dataset
630690We want to understand a bit more about the traffic in the city by creating
631691new features (i.e. columns), in particular:
@@ -653,18 +713,12 @@ df = raw_df.with_columns([
653713 .alias(" trip_duration_sec" ),
654714])
655715
656- # ------------------------------------------------------------
657- # 4️⃣ Speed feature (mph)
658- # ------------------------------------------------------------
659716df = df.with_columns(
660717    # TODO: add expression for average velocity here
661718    .replace_nan(None)  # protect against division by zero (0/0 gives NaN)
662719    .alias("avg_speed_mph")
663720)
664721
665- # ------------------------------------------------------------
666- # 5️⃣ Zone‑level contextual aggregates
667- # ------------------------------------------------------------
668722# Compute per‑pickup‑zone statistics once
669723zone_stats = (
670724 df.groupby(" PULocationID" )
@@ -682,7 +736,9 @@ df = df.join(zone_stats, left_on="PULocationID", right_on="pickup_zone_id", how=
682736
683737As seen in the Transformations section, the main role of `join` here
684738is to "spread" the `zone_stats` over all the rides in the original dataframe
685- (i.e. write the ` zone_avg_fare ` on each ride in ` df ` ).
739+ (i.e. write the `zone_avg_fare` on each ride in `df` ). `join` has its roots
740+ in relational databases, where different tables can be merged based on a
741+ common column.
686742:::
687743
688744:::{solution}
@@ -701,9 +757,6 @@ df = raw_df.with_columns([
701757 .alias(" trip_duration_sec" ),
702758])
703759
704- # ------------------------------------------------------------
705- # 4️⃣ Speed feature (mph)
706- # ------------------------------------------------------------
707760df = df.with_columns(
708761 (
709762 pl.col(" trip_distance" ) /
@@ -713,9 +766,6 @@ df = df.with_column(
713766 .alias(" avg_speed_mph" )
714767)
715768
716- # ------------------------------------------------------------
717- # 5️⃣ Zone‑level contextual aggregates
718- # ------------------------------------------------------------
719769# Compute per‑pickup‑zone statistics once
720770zone_stats = (
721771 df.groupby(" PULocationID" )
@@ -795,9 +845,6 @@ df = raw_df.with_columns([
795845 .alias(" dist_per_passenger" ),
796846])
797847
798- # ------------------------------------------------------------
799- # 4️⃣ Drop‑off‑zone contextual aggregates
800- # ------------------------------------------------------------
801848dropoff_stats = (
802849 df.groupby(" DOLocationID" )
803850 .agg([