Commit b67de1c

Improve polars lecture: pandas comparison, split code blocks, add prose
- Update benchmark link to official Polars TPC-H benchmarks
- Add pandas vs Polars timing comparison for small and large datasets
- Split monolithic code cells into focused cells with connecting prose
- Add connecting prose between all adjacent code cells
- Clean heading: use index directive instead of role syntax
- Remove redundant standalone index entry
1 parent e28cf1a commit b67de1c

File tree

1 file changed: +111 −24 lines changed


lectures/polars.md

Lines changed: 111 additions & 24 deletions
@@ -20,7 +20,7 @@ kernelspec:
 </div>
 ```
 
-# {index}`Polars <single: Polars>`
+# Polars
 
 ```{index} single: Python; Polars
 ```
@@ -51,7 +51,7 @@ Polars is designed with performance and memory efficiency in mind, leveraging:
 
 * **Memory**: pandas typically needs 5--10x your dataset size in RAM; Polars needs only 2--4x
 * **Speed**: Polars is 10--100x faster for many common operations
-* **See**: [Polars vs pandas comparison](https://blog.jetbrains.com/pycharm/2024/07/polars-vs-pandas/) for detailed benchmarks
+* **See**: [Polars TPC-H benchmarks](https://www.pola.rs/benchmarks/) for up-to-date performance comparisons
 ```
 
 Throughout the lecture, we will assume that the following imports have taken place
@@ -95,10 +95,14 @@ Polars `Series` are built on top of [Apache Arrow](https://arrow.apache.org/) ar
 s * 100
 ```
 
+Absolute values are available as a method
+
 ```{code-cell} ipython3
 s.abs()
 ```
 
+We can also get quick summary statistics
+
 ```{code-cell} ipython3
 s.describe()
 ```
@@ -135,6 +139,8 @@ df = df.with_columns(
 df
 ```
 
+We can also check membership
+
 ```{code-cell} ipython3
 'AAPL' in df['company']
 ```
@@ -166,10 +172,14 @@ We can select rows by slicing and columns by name
 df[2:5]
 ```
 
+To select specific columns, pass a list of names to `select`
+
 ```{code-cell} ipython3
 df.select(['country', 'tcgdp'])
 ```
 
+These can be combined
+
 ```{code-cell} ipython3
 df[2:5].select(['country', 'tcgdp'])
 ```
@@ -387,57 +397,130 @@ print("Optimized plan:")
 print(optimized.explain())
 ```
 
+Executing the plan gives us the final result
+
 ```{code-cell} ipython3
 optimized.collect()
 ```
 
 ### Performance comparison
 
-Let's compare eager vs lazy on a larger synthetic dataset
+Let's compare pandas, Polars eager, and Polars lazy on the same task.
+
+We start with a small dataset (the Penn World Tables we used above) to show
+that for small data the differences are negligible
 
 ```{code-cell} ipython3
+import pandas as pd
 import time
 
+# Small dataset -- Penn World Tables (~8 rows)
+url = ('https://raw.githubusercontent.com/QuantEcon/'
+       'lecture-python-programming/main/lectures/_static/'
+       'lecture_specific/pandas/data/test_pwt.csv')
+small_pd = pd.read_csv(url)
+small_pl = pl.read_csv(url)
+```
+
+Now we time the same filter-select-sort operation in each library
+
+```{code-cell} ipython3
+# pandas
+start = time.perf_counter()
+_ = (small_pd
+     .query('tcgdp > 500')
+     [['country', 'year', 'tcgdp', 'POP']]
+     .assign(gdp_pc=lambda d: d['tcgdp'] / d['POP'])
+     .sort_values('gdp_pc', ascending=False))
+pd_small = time.perf_counter() - start
+
+# Polars eager
+start = time.perf_counter()
+_ = (small_pl
+     .filter(pl.col('tcgdp') > 500)
+     .select(['country', 'year', 'tcgdp', 'POP'])
+     .with_columns((pl.col('tcgdp') / pl.col('POP')).alias('gdp_pc'))
+     .sort('gdp_pc', descending=True))
+pl_small = time.perf_counter() - start
+
+print(f"Small data -- pandas: {pd_small:.4f}s | Polars eager: {pl_small:.4f}s")
+```
+
+On a handful of rows the speed difference is immaterial --- use whichever
+API you find more convenient.
+
+Now let's scale up to 5 million rows where the difference becomes clear
+
+```{code-cell} ipython3
 n = 5_000_000
-big_df = pl.DataFrame({
-    'group': np.random.choice(['A', 'B', 'C', 'D'], n),
-    'value': np.random.randn(n),
-    'weight': np.random.rand(n),
-    'extra1': np.random.randn(n),
-    'extra2': np.random.randn(n),
+np.random.seed(42)
+
+groups = np.random.choice(['A', 'B', 'C', 'D'], n)
+values = np.random.randn(n)
+weights = np.random.rand(n)
+extra1 = np.random.randn(n)
+extra2 = np.random.randn(n)
+
+big_pd = pd.DataFrame({
+    'group': groups, 'value': values,
+    'weight': weights, 'extra1': extra1, 'extra2': extra2
+})
+big_pl = pl.DataFrame({
+    'group': groups, 'value': values,
+    'weight': weights, 'extra1': extra1, 'extra2': extra2
 })
+```
+
+First, the pandas baseline
+
+```{code-cell} ipython3
+start = time.perf_counter()
+tmp = big_pd[big_pd['value'] > 0][['group', 'value', 'weight']].copy()
+tmp['weighted'] = tmp['value'] * tmp['weight']
+_ = tmp.groupby('group')['weighted'].mean()
+pd_time = time.perf_counter() - start
+print(f"pandas: {pd_time:.4f}s")
+```
+
+Next, Polars in eager mode
 
-# Eager
+```{code-cell} ipython3
 start = time.perf_counter()
-result_e = (big_df
+_ = (big_pl
     .filter(pl.col('value') > 0)
     .select(['group', 'value', 'weight'])
    .with_columns(
-        (pl.col('value') * pl.col('weight')).alias('weighted')
-    )
+        (pl.col('value') * pl.col('weight')).alias('weighted'))
     .group_by('group')
-    .agg(pl.col('weighted').mean())
-)
+    .agg(pl.col('weighted').mean()))
 eager_time = time.perf_counter() - start
+print(f"Polars eager: {eager_time:.4f}s")
+```
+
+And finally, Polars in lazy mode
 
-# Lazy
+```{code-cell} ipython3
 start = time.perf_counter()
-result_l = (big_df.lazy()
+_ = (big_pl.lazy()
     .filter(pl.col('value') > 0)
    .select(['group', 'value', 'weight'])
    .with_columns(
-        (pl.col('value') * pl.col('weight')).alias('weighted')
-    )
+        (pl.col('value') * pl.col('weight')).alias('weighted'))
    .group_by('group')
    .agg(pl.col('weighted').mean())
-    .collect()
-)
+    .collect())
 lazy_time = time.perf_counter() - start
-
-print(f"Eager: {eager_time:.4f}s")
-print(f"Lazy: {lazy_time:.4f}s")
+print(f"Polars lazy: {lazy_time:.4f}s")
 ```
 
+The take-away:
+
+* For **small data** (thousands of rows), pandas and Polars perform
+  similarly --- choose based on API preference and ecosystem fit.
+* For **medium to large data** (hundreds of thousands of rows and above),
+  Polars can be significantly faster thanks to its Rust engine, parallel
+  execution, and (in lazy mode) query optimization.
+
 The lazy API is particularly powerful when reading from disk --- `scan_csv` returns a `LazyFrame` directly, so filters and projections are pushed down to the file reader.
 
 ```{tip}
@@ -479,10 +562,14 @@ fred_url = ('https://fred.stlouisfed.org/graph/fredgraph.csv?'
 data = pl.read_csv(fred_url, try_parse_dates=True)
 ```
 
+Let's inspect the first few rows
+
 ```{code-cell} ipython3
 data.head()
 ```
 
+And get summary statistics
+
 ```{code-cell} ipython3
 data.describe()
 ```
