@@ -20,7 +20,7 @@ kernelspec:
</div>
```

- # {index}`Polars <single: Polars>`
+ # Polars

```{index} single: Python; Polars
```
@@ -51,7 +51,7 @@ Polars is designed with performance and memory efficiency in mind, leveraging:

* **Memory**: pandas typically needs 5--10x your dataset size in RAM; Polars needs only 2--4x
* **Speed**: Polars is 10--100x faster for many common operations
- * **See**: [Polars vs pandas comparison](https://blog.jetbrains.com/pycharm/2024/07/polars-vs-pandas/) for detailed benchmarks
+ * **See**: [Polars TPC-H benchmarks](https://www.pola.rs/benchmarks/) for up-to-date performance comparisons
```

Throughout the lecture, we will assume that the following imports have taken place
@@ -95,10 +95,14 @@ Polars `Series` are built on top of [Apache Arrow](https://arrow.apache.org/) ar
s * 100
```

+ Absolute values are available as a method
+
```{code-cell} ipython3
s.abs()
```

+ We can also get quick summary statistics
+
```{code-cell} ipython3
s.describe()
```
@@ -135,6 +139,8 @@ df = df.with_columns(
df
```

+ We can also check membership
+
```{code-cell} ipython3
'AAPL' in df['company']
```
@@ -166,10 +172,14 @@ We can select rows by slicing and columns by name
df[2:5]
```

+ To select specific columns, pass a list of names to `select`
+
```{code-cell} ipython3
df.select(['country', 'tcgdp'])
```

+ These can be combined
+
```{code-cell} ipython3
df[2:5].select(['country', 'tcgdp'])
```
@@ -387,57 +397,130 @@ print("Optimized plan:")
print(optimized.explain())
```

+ Executing the plan gives us the final result
+
```{code-cell} ipython3
optimized.collect()
```

### Performance comparison

- Let's compare eager vs lazy on a larger synthetic dataset
+ Let's compare pandas, Polars eager, and Polars lazy on the same task.
+
+ We start with a small dataset (the Penn World Tables we used above) to show
+ that for small data the differences are negligible

```{code-cell} ipython3
+ import pandas as pd
import time

+ # Small dataset -- Penn World Tables (~8 rows)
+ url = ('https://raw.githubusercontent.com/QuantEcon/'
+        'lecture-python-programming/main/lectures/_static/'
+        'lecture_specific/pandas/data/test_pwt.csv')
+ small_pd = pd.read_csv(url)
+ small_pl = pl.read_csv(url)
+ ```
+
+ Now we time the same filter-select-sort operation in each library
+
+ ```{code-cell} ipython3
+ # pandas
+ start = time.perf_counter()
+ _ = (small_pd
+      .query('tcgdp > 500')
+      [['country', 'year', 'tcgdp', 'POP']]
+      .assign(gdp_pc=lambda d: d['tcgdp'] / d['POP'])
+      .sort_values('gdp_pc', ascending=False))
+ pd_small = time.perf_counter() - start
+
+ # Polars eager
+ start = time.perf_counter()
+ _ = (small_pl
+      .filter(pl.col('tcgdp') > 500)
+      .select(['country', 'year', 'tcgdp', 'POP'])
+      .with_columns((pl.col('tcgdp') / pl.col('POP')).alias('gdp_pc'))
+      .sort('gdp_pc', descending=True))
+ pl_small = time.perf_counter() - start
+
+ print(f"Small data -- pandas: {pd_small:.4f}s | Polars eager: {pl_small:.4f}s")
+ ```
+
+ On a handful of rows the speed difference is immaterial --- use whichever
+ API you find more convenient.
+
+ Now let's scale up to 5 million rows where the difference becomes clear
+
+ ```{code-cell} ipython3
n = 5_000_000
- big_df = pl.DataFrame({
-     'group': np.random.choice(['A', 'B', 'C', 'D'], n),
-     'value': np.random.randn(n),
-     'weight': np.random.rand(n),
-     'extra1': np.random.randn(n),
-     'extra2': np.random.randn(n),
+ np.random.seed(42)
+
+ groups = np.random.choice(['A', 'B', 'C', 'D'], n)
+ values = np.random.randn(n)
+ weights = np.random.rand(n)
+ extra1 = np.random.randn(n)
+ extra2 = np.random.randn(n)
+
+ big_pd = pd.DataFrame({
+     'group': groups, 'value': values,
+     'weight': weights, 'extra1': extra1, 'extra2': extra2
+ })
+ big_pl = pl.DataFrame({
+     'group': groups, 'value': values,
+     'weight': weights, 'extra1': extra1, 'extra2': extra2
})
+ ```
+
+ First, the pandas baseline
+
+ ```{code-cell} ipython3
+ start = time.perf_counter()
+ tmp = big_pd[big_pd['value'] > 0][['group', 'value', 'weight']].copy()
+ tmp['weighted'] = tmp['value'] * tmp['weight']
+ _ = tmp.groupby('group')['weighted'].mean()
+ pd_time = time.perf_counter() - start
+ print(f"pandas: {pd_time:.4f}s")
+ ```
+
+ Next, Polars in eager mode

- # Eager
+ ```{code-cell} ipython3
start = time.perf_counter()
- result_e = (big_df
+ _ = (big_pl
    .filter(pl.col('value') > 0)
    .select(['group', 'value', 'weight'])
    .with_columns(
-         (pl.col('value') * pl.col('weight')).alias('weighted')
-     )
+         (pl.col('value') * pl.col('weight')).alias('weighted'))
    .group_by('group')
-     .agg(pl.col('weighted').mean())
- )
+     .agg(pl.col('weighted').mean()))
eager_time = time.perf_counter() - start
+ print(f"Polars eager: {eager_time:.4f}s")
+ ```
+
+ And finally, Polars in lazy mode

- # Lazy
+ ```{code-cell} ipython3
start = time.perf_counter()
- result_l = (big_df.lazy()
+ _ = (big_pl.lazy()
    .filter(pl.col('value') > 0)
    .select(['group', 'value', 'weight'])
    .with_columns(
-         (pl.col('value') * pl.col('weight')).alias('weighted')
-     )
+         (pl.col('value') * pl.col('weight')).alias('weighted'))
    .group_by('group')
    .agg(pl.col('weighted').mean())
-     .collect()
- )
+     .collect())
lazy_time = time.perf_counter() - start
-
- print(f"Eager: {eager_time:.4f}s")
- print(f"Lazy: {lazy_time:.4f}s")
+ print(f"Polars lazy: {lazy_time:.4f}s")
```

+ The take-away:
+
+ * For **small data** (thousands of rows), pandas and Polars perform
+   similarly --- choose based on API preference and ecosystem fit.
+ * For **medium to large data** (hundreds of thousands of rows and above),
+   Polars can be significantly faster thanks to its Rust engine, parallel
+   execution, and (in lazy mode) query optimization.
+
The lazy API is particularly powerful when reading from disk --- `scan_csv` returns a `LazyFrame` directly, so filters and projections are pushed down to the file reader.

```{tip}
@@ -479,10 +562,14 @@ fred_url = ('https://fred.stlouisfed.org/graph/fredgraph.csv?'
data = pl.read_csv(fred_url, try_parse_dates=True)
```

+ Let's inspect the first few rows
+
```{code-cell} ipython3
data.head()
```

+ And get summary statistics
+
```{code-cell} ipython3
data.describe()
```