Description
What happens?
It seems that the relation `limit()` method is much slower than an equivalent SQL `LIMIT` clause.
To Reproduce
If I prepare a 100-million-row parquet file:

```python
import duckdb as ddb

ddb.sql(
    """
    copy (
        select i from generate_series(1, 100_000_000) s(i)
    ) to '100million.parquet'
    """
)
```

and then try to preview it with a SQL limit, it's very fast:
```python
ddb.sql("select * from '100million.parquet' limit 5").show()  # fast
ddb.sql("with tbl as (select * from '100million.parquet') select * from tbl limit 5").show()  # fast
```

However, if I use the `limit()` method on the relation instead, things are very slow (about a minute on my midrange laptop):
```python
ddb.sql("select * from '100million.parquet'").limit(5).show()  # slow
ddb.sql("select * from '100million.parquet'").limit(5).show(max_rows=5)  # also slow
```

I would have expected the performance of both approaches to be identical.
Note: I haven't dug into where the slowness actually lies (`.sql()`, `.limit()`, or `.show()`).
Thanks for continually working on such great software!
OS:
Linux x86_64 - Pop OS 24.04
DuckDB Version:
1.4.3
DuckDB Client:
Python
Hardware:
No response
Full Name:
Jarrad Whitaker
Affiliation:
personal
Did you include all relevant configuration (e.g., CPU architecture, Linux distribution) to reproduce the issue?
- Yes, I have
Did you include all code required to reproduce the issue?
- Yes, I have
Did you include all relevant data sets for reproducing the issue?
Not applicable - the reproduction does not require a data set