Motivation
In time-series / IoT / observability workloads, a common pattern is storing free-schema fields in a MAP<STRING, T> column (e.g. metrics MAP<STRING, DOUBLE>). The default MAP storage (two KV arrays) has three limitations:
- No per-key columnar access
- No per-key statistics
- No predicate pushdown on individual keys
This makes queries like SELECT ext_map['usage'] FROM metrics WHERE ext_map['usage'] > 30 scan the entire MAP column — extremely inefficient when only 1–3 keys out of thousands are needed per query.
The PIP: Columnar-Extend Storage Layout for MAP Columns proposes a new extend storage layout that stores MAP values in K reusable physical columns within a Struct, achieving near-full columnar access with per-key statistics and predicate pushdown — without changing the logical type (MAP<STRING, T>).
Solution
Physical Layout
Each MAP<STRING, T> column marked with map-storage-layout = extend is physically stored as:
STRUCT<
  __field_mapping: FixedSizeList<Int32, K>,      -- per-row: which field_id each col holds
  __col_0: T, __col_1: T, ..., __col_{K-1}: T,   -- reusable typed columns
  __overflow: MAP<INT32, T>                      -- rare fallback for rows with > K fields
>
File metadata (footer) stores: field name↔id dictionary, field_id→physical column set S, overflow set O, K, and max row width.
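To make the schema rewriting concrete, here is a minimal sketch that expands a value type T and a column count K into the physical field list shown above. The function name physical_schema and the use of plain strings for types are assumptions made for illustration; the actual conversion utilities would operate on the project's own schema types.

```python
def physical_schema(value_type: str, k: int) -> list[tuple[str, str]]:
    """Return the (name, type) pairs of the physical Struct for MAP<STRING, value_type>.

    Hypothetical helper: types are plain strings so the sketch stays self-contained.
    """
    fields = [("__field_mapping", f"FixedSizeList<Int32, {k}>")]
    # K reusable typed columns; which logical field each one holds is decided per row.
    fields += [(f"__col_{i}", value_type) for i in range(k)]
    # Fallback for rows that carry more than K fields.
    fields.append(("__overflow", f"MAP<INT32, {value_type}>"))
    return fields


if __name__ == "__main__":
    for name, type_name in physical_schema("DOUBLE", 2):
        print(f"{name}: {type_name}")
```

Note that only the physical schema depends on K; the logical type stays MAP<STRING, T>, which is why files written with different K values can coexist.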
Write Path
- Schema conversion utilities: Logical MAP → physical Struct schema rewriting; metadata serialization/deserialization; EXTEND column detection via field metadata marker.
- FormatWriter::AddMetadata: New virtual method (default no-op) for writing key-value metadata to file footer before Finish(). Parquet implementation calls AddKeyValueMetadata.
- Column allocator: Streaming per-row slot allocator (Hit/Evict/Retain/Overflow) maintaining K physical column assignments across batches within a file. LRU-based eviction. Accumulates file-level statistics (S, O, max row width).
- Logical→physical batch converter: Parses logical MAP, encodes field names to integer IDs (file-level dictionary), invokes allocator per row, assembles physical Struct array.
- Writer integration: Extended DataFileWriter that performs conversion before writing + injects metadata on close. AppendOnlyWriter detects EXTEND columns and routes accordingly. Cross-file K adaptation (P99 of recent max row widths, capped by K_max).
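The column allocator and the cross-file K adaptation can be sketched as follows. This is one possible reading of Hit/Evict/Retain/Overflow, not the proposed implementation: a field already holding a slot is a Hit; a new field takes a free slot or evicts the least recently used field that is absent from the current row; fields present in the current row are retained (never evicted by that row); and a new field with no evictable slot goes to overflow. The names SlotAllocator, next_k and the EMPTY sentinel (-1) are hypothetical, and next_k assumes a nearest-rank P99 over a non-empty window.

```python
from collections import OrderedDict

EMPTY = -1  # assumed sentinel: the slot holds no field in this row


class SlotAllocator:
    """Illustrative per-row slot allocator for one file."""

    def __init__(self, k):
        self.k = k
        self.slot_of = OrderedDict()  # field_id -> slot, least recently used first
        self.free = list(range(k))    # slots not yet assigned to any field
        self.columns_used = {}        # S: field_id -> slots it ever occupied
        self.overflowed = set()       # O: field_ids that ever went to __overflow
        self.max_row_width = 0

    def allocate_row(self, field_ids):
        """Return (__field_mapping, overflow field_ids) for one row of unique ids."""
        row = set(field_ids)
        self.max_row_width = max(self.max_row_width, len(row))
        mapping = [EMPTY] * self.k
        overflow, misses = [], []
        for fid in field_ids:
            if fid in self.slot_of:   # Hit: reuse the slot and refresh recency
                self.slot_of.move_to_end(fid)
                mapping[self.slot_of[fid]] = fid
            else:
                misses.append(fid)
        for fid in misses:
            slot = self._take_slot(row)
            if slot is None:          # Overflow: every slot is held by this row
                overflow.append(fid)
                self.overflowed.add(fid)
                continue
            self.slot_of[fid] = slot
            mapping[slot] = fid
            self.columns_used.setdefault(fid, set()).add(slot)
        return mapping, overflow

    def _take_slot(self, row):
        if self.free:
            return self.free.pop(0)
        # Evict the LRU field that is not in the current row; fields in the
        # row are retained.
        victim = next((f for f in self.slot_of if f not in row), None)
        return None if victim is None else self.slot_of.pop(victim)


def next_k(recent_max_widths, k_max):
    """Nearest-rank P99 of recent per-file max row widths, capped by k_max."""
    ordered = sorted(recent_max_widths)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return min(ordered[rank - 1], k_max)
```

With K = 2, feeding the rows [10, 11], [11, 12], [10, 11, 12] in order yields the mappings [10, 11], [12, 11], [12, 11]: field 12 evicts field 10 from slot 0 in the second row, and field 10 overflows in the third row because both slots are held by fields of that row. The file-level statistics end up as S = {10: {0}, 11: {1}, 12: {0}}, O = {10}, and max row width 3.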
Read Path
- File metadata parsing: Parse EXTEND metadata from file footer (dictionary, S, O, K). New GetFileKeyValueMetadata() method on FileBatchReader with Parquet implementation.
- Predicate translation: Translate logical predicates on MAP keys into conservative OR predicates over physical sub-columns. Requires extending LeafPredicate to support nested field paths and updating PredicateConverter to emit nested FieldRef.
- Read planning: At SetReadSchema time: look up which physical columns to read (from S), decide whether __overflow is needed (from O), translate predicates, and pass the physical schema + physical predicate down to the inner FileBatchReader unchanged.
- Batch reconstruction: After NextBatch: read __field_mapping per row to identify which column holds which field (fine-grained filter), gather values into logical MAP<STRING, T>. Merge overflow when needed. Correctness relies on per-row __field_mapping, not on pushdown precision.
- Reader integration: A wrapper reader (implements FileBatchReader) sits between the upper layer and the format-level reader. Per-file instance. Compatible with varying K across files. Orthogonal to DataEvolutionFileReader (schema evolution).
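The read planning, predicate translation and batch reconstruction steps above can be sketched as follows. All names here (plan_read, translate_gt, reconstruct_row, the EMPTY sentinel, the tuple encoding of predicates) are hypothetical and chosen for illustration. The sketch also assumes that a key which ever overflowed in the file cannot be pushed down at all, since a matching value might sit in __overflow where the physical column statistics do not see it.

```python
EMPTY = -1  # assumed sentinel for an unused slot in __field_mapping


def plan_read(key, name_to_id, columns_used, overflowed):
    """Return (physical columns to read, need_overflow), or None if the
    key was never written to this file."""
    fid = name_to_id.get(key)
    if fid is None:
        return None
    cols = [f"__col_{c}" for c in sorted(columns_used.get(fid, ()))]
    return cols, fid in overflowed


def translate_gt(key, threshold, name_to_id, columns_used, overflowed):
    """Translate map[key] > threshold into a conservative OR over the
    candidate physical columns; None means nothing can be pushed down."""
    plan = plan_read(key, name_to_id, columns_used, overflowed)
    if plan is None or plan[1] or not plan[0]:
        return None
    return ("or", [(">", col, threshold) for col in plan[0]])


def reconstruct_row(field_mapping, cols, overflow, id_to_name, wanted_ids=None):
    """Rebuild the logical map for one row.

    field_mapping: per-row list of field_ids, one per slot.
    cols: per-row values of __col_0..__col_{K-1} (None for unread columns).
    overflow: dict field_id -> value read from __overflow (may be empty).
    wanted_ids: field_ids requested by the query, or None for all.
    """
    out = {}
    for slot, fid in enumerate(field_mapping):
        # The per-row mapping decides which field a slot holds, so a value
        # that passed a coarse pushdown filter is dropped here if the slot
        # belongs to a different field in this row.
        if fid != EMPTY and (wanted_ids is None or fid in wanted_ids):
            out[id_to_name[fid]] = cols[slot]
    for fid, value in overflow.items():
        if wanted_ids is None or fid in wanted_ids:
            out[id_to_name[fid]] = value
    return out
```

For example, if __col_0 holds 'usage' in some rows and 'disk' in others, the pushed-down predicate __col_0 > 30 may let through a row where __col_0 is actually 'disk'. reconstruct_row then returns an empty map for the requested key in that row, which is why correctness does not depend on pushdown precision.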