This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
DataFusion-DuckLake is a DataFusion extension that provides read-only access to DuckLake catalogs. DuckLake is an integrated data lake and catalog format that stores:
- Metadata: In SQL databases (DuckDB, SQLite, PostgreSQL, MySQL) as structured catalog tables
- Data: As Apache Parquet files on disk or object storage (S3, MinIO)
The extension integrates DuckLake with Apache DataFusion by implementing DataFusion's catalog and table provider interfaces.
```bash
# Build the project
cargo build

# Run all tests
cargo test

# Run a specific test
cargo test test_name

# Build and run the basic query example
cargo run --example basic_query -- <catalog.db> <sql>
```

The codebase follows a layered architecture with clear separation of concerns:
- MetadataProvider Layer (`src/metadata_provider.rs`, `src/metadata_provider_duckdb.rs`)
  - Abstraction for querying DuckLake catalog metadata
  - `MetadataProvider` trait defines the interface for listing schemas, tables, columns, and data files (a rough sketch of this interface appears after this list)
  - Also provides individual lookup methods: `get_schema_by_name()`, `get_table_by_name()`, and `table_exists()`
  - `DuckdbMetadataProvider` implements the trait using DuckDB as the catalog backend
  - Executes SQL queries against the standard DuckLake catalog tables (`ducklake_snapshot`, `ducklake_schema`, `ducklake_table`, `ducklake_column`, `ducklake_data_file`, `ducklake_delete_file`, `ducklake_metadata`)
  - Thread-safe: uses a single shared connection (protected by a Mutex) for efficiency
  - Supports delete files: `get_table_files_for_select()` returns data files with their associated delete files
- DataFusion Integration Layer (`src/catalog.rs`, `src/schema.rs`, `src/table.rs`)
  - Bridges DuckLake concepts to DataFusion's catalog system
  - `DuckLakeCatalog`: implements `CatalogProvider`, uses dynamic metadata lookup (queries on every call to `schema()` and `schema_names()`)
  - `DuckLakeSchema`: implements `SchemaProvider`, uses dynamic metadata lookup (queries on every call to `table()` and `table_names()`)
  - `DuckLakeTable`: implements `TableProvider`, caches table structure and file lists at creation time
  - No HashMaps: catalog and schema providers query metadata on demand rather than caching
- Path Resolution (`src/path_resolver.rs`)
  - Centralized utilities for parsing object store URLs and resolving hierarchical paths
  - `parse_object_store_url()`: parses S3, file://, or local paths into an ObjectStoreUrl and path component
  - `resolve_path()`: resolves relative or absolute paths in the catalog hierarchy
  - `PathResolver`: maintains a base URL and path for hierarchical resolution (catalog -> schema -> table -> file)
  - Handles S3, MinIO, and local filesystem paths uniformly
- Delete File Filtering (`src/delete_filter.rs`)
  - `DeleteFilterExec`: custom execution plan that wraps Parquet scans and filters out deleted rows
  - Implements the MOR (Merge-On-Read) pattern for row-level deletes
  - Delete files use a `(file_path: VARCHAR, pos: INT64)` schema
  - Efficiently filters rows by position during query execution
  - Supports the COUNT(*) optimization (zero-column batches)
- Type Mapping (`src/types.rs`)
  - Converts DuckLake type strings to Arrow DataTypes
  - Handles basic types (integers, floats, strings, dates, timestamps)
  - Supports decimals with precision/scale parsing
  - Complex types (lists, structs, maps) return proper errors instead of silently failing
  - `build_arrow_schema()` constructs Arrow schemas from DuckLake column metadata
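A rough sketch of the `MetadataProvider` abstraction is shown below. The method names come from this document, but the row types and signatures are assumptions for illustration, not the crate's actual definitions:

```rust
use std::error::Error;

// Convenience alias for the sketch; the real trait may use DataFusion's Result.
type MetaResult<T> = Result<T, Box<dyn Error + Send + Sync>>;

// Hypothetical row structs, for illustration only.
pub struct SchemaInfo { pub id: i64, pub name: String, pub path: String }
pub struct TableInfo  { pub id: i64, pub name: String, pub path: String }
pub struct DataFileInfo {
    pub path: String,
    pub footer_size: Option<u64>,         // used for the Parquet metadata size hint
    pub delete_file_path: Option<String>, // present when the file has row-level deletes
}

pub trait MetadataProvider: Send + Sync {
    fn list_schemas(&self, snapshot_id: i64) -> MetaResult<Vec<SchemaInfo>>;
    fn list_tables(&self, snapshot_id: i64, schema_id: i64) -> MetaResult<Vec<TableInfo>>;
    fn get_schema_by_name(&self, snapshot_id: i64, name: &str) -> MetaResult<Option<SchemaInfo>>;
    fn get_table_by_name(&self, snapshot_id: i64, schema_id: i64, name: &str) -> MetaResult<Option<TableInfo>>;
    fn table_exists(&self, snapshot_id: i64, schema_id: i64, name: &str) -> MetaResult<bool>;
    fn get_table_files_for_select(&self, snapshot_id: i64, table_id: i64) -> MetaResult<Vec<DataFileInfo>>;
}
```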
The catalog uses a pure dynamic lookup approach with no caching at the catalog/schema level:
- DuckLakeCatalog (`catalog.rs`):
  - `schema_names()`: queries `list_schemas()` on every call
  - `schema()`: queries `get_schema_by_name()` on every call
  - `new()`: O(1), only fetches the snapshot ID and `data_path`
- DuckLakeSchema (`schema.rs`):
  - `table_names()`: queries `list_tables()` on every call
  - `table()`: queries `get_table_by_name()` on every call
  - `table_exist()`: queries `table_exists()` on every call
  - `new()`: O(1), just stores IDs and paths
- DuckLakeTable (`table.rs`):
  - Still caches table structure and file lists at creation time
  - This caching is necessary for query planning and execution
Benefits:
- O(1) memory usage regardless of catalog size
- Fast catalog startup (no upfront schema/table listing)
- Always fresh metadata (no stale cache issues)
- Simple implementation (no cache invalidation logic)
Trade-offs:
- Small query overhead per metadata lookup (acceptable for read-only DuckDB connections)
- Future optimization: Add optional caching layer via wrapper implementation
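As a minimal conceptual sketch (not the crate's actual code), the no-cache pattern amounts to holding only the metadata provider and snapshot ID and hitting the backend on every lookup:

```rust
use std::sync::Arc;

// Minimal stand-in for the real MetadataProvider trait (see the sketch above).
trait MetadataProvider: Send + Sync {
    fn list_schemas(&self, snapshot_id: i64) -> Vec<String>;
}

struct DynamicCatalog {
    metadata: Arc<dyn MetadataProvider>,
    snapshot_id: i64,
}

impl DynamicCatalog {
    // O(1) constructor: no upfront listing, nothing to invalidate later.
    fn new(metadata: Arc<dyn MetadataProvider>, snapshot_id: i64) -> Self {
        Self { metadata, snapshot_id }
    }

    // Backs CatalogProvider::schema_names(): queries the backend on every call,
    // so results always reflect the current catalog contents.
    fn schema_names(&self) -> Vec<String> {
        self.metadata.list_schemas(self.snapshot_id)
    }
}
```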
When querying a DuckLake table:
- User creates a `SessionContext` with a `RuntimeEnv` and registers a `DuckLakeCatalog`
- User registers the required object stores (S3, MinIO, etc.) with the `RuntimeEnv`
- The SQL query references the table as `catalog.schema.table`
- DataFusion resolves the path catalog -> schema -> table (querying metadata on demand)
- `DuckLakeTable` queries the metadata provider for table structure and data files (cached at table creation)
- Paths are resolved hierarchically using `path_resolver` utilities:
  - Global `data_path` from the `ducklake_metadata` table
  - Schema path (relative to `data_path` or absolute)
  - Table path (relative to the schema path or absolute)
  - File paths (relative to the table path or absolute)
- `DuckLakeTable` resolves file paths to an ObjectStoreUrl plus relative paths
- For each file, the provider checks whether a delete file exists (from the metadata join)
- Files without deletes are grouped into a single efficient `ParquetExec`
- Files with deletes get individual `ParquetExec`s wrapped in `DeleteFilterExec`
- All execution plans are combined with a `UnionExec` if multiple plans exist
- DataFusion scans the Parquet files using the registered object stores
- Delete filters apply row-position filtering during streaming execution
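A hedged end-to-end sketch of this flow follows. The catalog constructor (`DuckLakeCatalog::try_new`), the crate path, the catalog name, and the `ducklake.main.events` table are assumed for illustration, and DataFusion API details vary by version:

```rust
use std::sync::Arc;

use datafusion::error::Result;
use datafusion::execution::runtime_env::RuntimeEnv;
use datafusion::prelude::{SessionConfig, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    // Runtime environment; S3/MinIO object stores would be registered on it
    // (see the MinIO example later in this document).
    let runtime = Arc::new(RuntimeEnv::default());
    let ctx = SessionContext::new_with_config_rt(SessionConfig::new(), runtime);

    // Hypothetical constructor and crate path: open the DuckLake catalog
    // database and register it with DataFusion under the name "ducklake".
    let catalog = datafusion_ducklake::DuckLakeCatalog::try_new("catalog.db")?;
    ctx.register_catalog("ducklake", Arc::new(catalog));

    // catalog.schema.table resolution happens dynamically at planning time;
    // the schema and table names here are placeholders.
    let df = ctx.sql("SELECT count(*) FROM ducklake.main.events").await?;
    df.show().await?;
    Ok(())
}
```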
DuckLake supports hierarchical path resolution with relative and absolute paths:
- `data_path` (from the `ducklake_metadata` table): root path for all data
- `schema.path`: may be relative to `data_path` or absolute
- `table.path`: may be relative to the resolved schema path or absolute
- `file.path`: may be relative to the resolved table path or absolute

See `path_resolver.rs` for the centralized path resolution logic, particularly:
- `parse_object_store_url()`: converts paths to an ObjectStoreUrl plus key path
- `resolve_path()`: handles relative/absolute path resolution
- `PathResolver`: hierarchical resolver with `child_resolver()` for multi-level paths
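The resolution rule can be illustrated with a small standalone sketch (the bucket and path names are made up; the real logic in `path_resolver.rs` handles more cases):

```rust
// Treat a path as absolute if it carries a URL scheme (s3://, file://) or
// starts with '/'; otherwise join it onto the already-resolved parent path.
fn resolve_child(parent: &str, child: &str) -> String {
    let is_absolute = child.contains("://") || child.starts_with('/');
    if is_absolute {
        child.to_string()
    } else {
        format!("{}/{}", parent.trim_end_matches('/'), child)
    }
}

fn main() {
    // data_path -> schema -> table -> file, each level relative to its parent.
    let data_path = "s3://ducklake-data/warehouse";
    let schema = resolve_child(data_path, "main/");
    let table = resolve_child(&schema, "events/");
    let file = resolve_child(&table, "part-0001.parquet");
    assert_eq!(file, "s3://ducklake-data/warehouse/main/events/part-0001.parquet");
}
```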
Object stores must be registered with DataFusion's RuntimeEnv before querying:
- Local filesystem: Automatically available via DataFusion's default object store
- S3/MinIO: must be explicitly registered using `AmazonS3Builder` and `RuntimeEnv::register_object_store()`
- Object stores are registered per bucket (S3) or globally (local filesystem)
- See `examples/basic_query.rs` for S3/MinIO configuration examples
The `DuckLakeTable` provider handles URL resolution by:
- Using `path_resolver::resolve_path()` to resolve file paths hierarchically
- Passing resolved absolute paths to DataFusion's `PartitionedFile`
- Leveraging the `ObjectStoreUrl` from catalog initialization for all file operations
- DuckLake uses snapshot IDs for temporal consistency
- Current implementation queries latest snapshot on catalog creation
- Tables and schemas are filtered by snapshot validity ranges
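For example, the latest snapshot can be read straight from the catalog database. This is a hedged sketch using the `duckdb` crate against the standard `ducklake_snapshot` table, not the crate's actual query:

```rust
use duckdb::Connection;

// Returns the highest snapshot_id in the DuckLake catalog, i.e. the latest snapshot.
fn latest_snapshot_id(catalog_path: &str) -> duckdb::Result<i64> {
    let conn = Connection::open(catalog_path)?;
    conn.query_row("SELECT max(snapshot_id) FROM ducklake_snapshot", [], |row| row.get(0))
}
```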
- Uses DataFusion's `FileScanConfigBuilder` and `ParquetFormat`
- Files are organized into `FileGroup`s for parallel scanning
- Footer size optimization: Parquet footer sizes are stored in metadata and passed via `with_metadata_size_hint()`
  - Reduces I/O from 2 reads to 1 per file (especially beneficial for S3/MinIO)
  - Applied to both data files and delete files
- Files without delete files are grouped into a single `ParquetExec` for efficiency
- Files with delete files get individual `ParquetExec`s wrapped in `DeleteFilterExec`
- Delete files contain row positions to exclude: `(file_path: VARCHAR, pos: INT64)`
- A metadata join in `SQL_GET_DATA_FILES` associates delete files with data files
- `DeleteFilterExec` wraps Parquet scans and filters rows by global position
- Supports the MOR (Merge-On-Read) pattern for efficient row-level deletes
- Handles edge cases: COUNT(*) optimization, empty batches, all rows deleted
- See `delete_filter.rs` and `tests/delete_filter_tests.rs` for the implementation and tests
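The core filtering step can be sketched in a few lines of Arrow. This illustrates the idea, not `DeleteFilterExec` itself; reading the delete file and tracking the batch offset are assumed to happen elsewhere:

```rust
use std::collections::HashSet;

use datafusion::arrow::array::BooleanArray;
use datafusion::arrow::compute::filter_record_batch;
use datafusion::arrow::error::ArrowError;
use datafusion::arrow::record_batch::RecordBatch;

// Keep only rows whose global position (batch_offset + row index) does not
// appear in the delete file for this data file.
fn apply_position_deletes(
    batch: &RecordBatch,
    batch_offset: u64,
    deleted_positions: &HashSet<u64>,
) -> Result<RecordBatch, ArrowError> {
    let keep: BooleanArray = (0..batch.num_rows() as u64)
        .map(|i| !deleted_positions.contains(&(batch_offset + i)))
        .collect();
    filter_record_batch(batch, &keep)
}
```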
- Implements `supports_filters_pushdown()`, returning `Inexact` for all filters
- Allows DataFusion to push filters down to Parquet for:
  - Row group pruning via statistics
  - Page-level filtering with late materialization
  - Bloom filter lookups (if available)
- Marks filters as `Inexact` because delete filtering happens after the Parquet scan
- DataFusion automatically reapplies filters after `DeleteFilterExec` for correctness
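A fragment-level sketch of that hook, shown here as a free function; in the crate it lives inside the `TableProvider` impl, and the exact signature can shift between DataFusion versions:

```rust
use datafusion::error::Result;
use datafusion::logical_expr::{Expr, TableProviderFilterPushDown};

// Report every pushed-down filter as Inexact: Parquet can still use it for
// row-group and page pruning, but DataFusion keeps the filter above the scan,
// so correctness does not depend on pruning or on DeleteFilterExec.
fn supports_filters_pushdown(filters: &[&Expr]) -> Result<Vec<TableProviderFilterPushDown>> {
    Ok(vec![TableProviderFilterPushDown::Inexact; filters.len()])
}
```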
- DuckLake types are stored as strings in the catalog
- Type mapping handles SQL type aliases (e.g., "bigint" -> Int64, "text" -> Utf8)
- Geometry types are mapped to Binary (WKB format)
- Complex types (nested lists, structs, maps) return descriptive errors instead of silently failing
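A simplified sketch of this mapping follows; the real `src/types.rs` covers more aliases, parses decimal precision/scale, and returns descriptive errors for nested types, and the alias set below is partly assumed:

```rust
use datafusion::arrow::datatypes::{DataType, TimeUnit};

// Map a DuckLake/DuckDB type string to an Arrow DataType; None stands in for
// the descriptive error the real implementation returns on unsupported types.
fn map_ducklake_type(name: &str) -> Option<DataType> {
    match name.to_ascii_lowercase().as_str() {
        "boolean" | "bool" => Some(DataType::Boolean),
        "integer" | "int" | "int4" => Some(DataType::Int32),
        "bigint" | "int8" => Some(DataType::Int64),
        "float" | "real" => Some(DataType::Float32),
        "double" => Some(DataType::Float64),
        "varchar" | "text" | "string" => Some(DataType::Utf8),
        "date" => Some(DataType::Date32),
        "timestamp" => Some(DataType::Timestamp(TimeUnit::Microsecond, None)),
        _ => None, // lists, structs, maps, etc.
    }
}
```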
The example in `examples/basic_query.rs` shows object store registration for MinIO:

```rust
let runtime = Arc::new(RuntimeEnv::default());
let s3: Arc<dyn ObjectStore> = Arc::new(
    AmazonS3Builder::new()
        .with_endpoint("http://localhost:9000")
        .with_bucket_name("ducklake-data")
        .with_access_key_id("minioadmin")
        .with_secret_access_key("minioadmin")
        .with_region("us-west-2")
        .with_allow_http(true)
        .build()?,
);
runtime.register_object_store(&Url::parse("s3://ducklake-data/")?, s3);
```

Known limitations:
- Read-only access (no writes to DuckLake catalogs)
- Complex types (nested lists, structs, maps) return errors (not yet supported)
- No partition-based file pruning (TODO: add to the `MetadataProvider` trait)
- Single metadata provider implementation (DuckDB only)
- No optional metadata caching layer (all lookups are dynamic)
The project includes comprehensive tests:
- Unit tests: `src/delete_filter.rs` - delete file schema and position extraction
- Integration tests:
  - `tests/delete_filter_tests.rs` - end-to-end delete filtering scenarios
  - `tests/concurrent_tests.rs` - thread-safety and concurrent query handling
  - `tests/object_store_integration_test.rs` - S3/MinIO integration
- Test data generation: all test data is generated in Rust using `tests/common/mod.rs` helpers
  - No external shell scripts required
- Tests use temporary directories for isolation
- Each test generates its own DuckLake catalog on-the-fly
Run tests with:
```bash
cargo test                   # All tests (no setup required)
cargo test delete_filter     # Delete file tests only
cargo test concurrent        # Concurrency tests only
cargo test -- --ignored      # Performance benchmarks
```