
Python Relation .limit() does not give expected speedups #247

@akdor1154

Description


What happens?

It seems like the relation's .limit() method doesn't make queries as fast as it should/could.

To Reproduce

If I prepare a 100-million-row Parquet file:

import duckdb as ddb

ddb.sql(  # sql
    """
    copy (
        select i from generate_series(1, 100_000_000) s(i)
    ) to '100million.parquet'
    """
)
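
As a quick sanity check that the file came out as expected (just a count, so it should be cheap):

# optional check: should report 100_000_000 rows
ddb.sql("select count(*) from '100million.parquet'").show()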

and then try to preview it with a SQL limit, it's very fast:

ddb.sql("select * from '100million.parquet' limit 5").show() # fast
ddb.sql("with tbl as (select * from '100million.parquet') select * from tbl limit 5").show() # fast

However, if I use the .limit() method on the relation instead, things are very slow (about a minute on my mid-range laptop):

ddb.sql("select * from '100million.parquet'").limit(5).show()  # slow
ddb.sql("select * from '100million.parquet'").limit(5).show(max_rows=5) # also slow

I would have expected the performance behaviour of both approaches to be identical.
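
One way to compare the two paths would be to look at the query plans; a minimal sketch, assuming the relation API's .explain() returns the plan as a string:

# does the LIMIT get pushed into the Parquet scan in both cases?
print(ddb.sql("select * from '100million.parquet' limit 5").explain())
print(ddb.sql("select * from '100million.parquet'").limit(5).explain())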

Note I haven't dived into where the slowness actually is: .sql(), .limit(), or .show().
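
For what it's worth, a minimal sketch of how the three steps could be timed separately (using time.perf_counter(); variable names are just for illustration):

import time
import duckdb as ddb

t0 = time.perf_counter()
rel = ddb.sql("select * from '100million.parquet'")  # build the relation
t1 = time.perf_counter()
limited = rel.limit(5)                               # apply the limit
t2 = time.perf_counter()
limited.show()                                       # print the preview
t3 = time.perf_counter()

print(f".sql():   {t1 - t0:.3f}s")
print(f".limit(): {t2 - t1:.3f}s")
print(f".show():  {t3 - t2:.3f}s")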

Thanks for continually working on such great software!

OS:

Linux x86_64 - Pop OS 24.04

DuckDB Version:

1.4.3

DuckDB Client:

Python

Hardware:

No response

Full Name:

Jarrad Whitaker

Affiliation:

personal

Did you include all relevant configuration (e.g., CPU architecture, Linux distribution) to reproduce the issue?

  • Yes, I have

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant data sets for reproducing the issue?

Not applicable - the reproduction does not require a data set
