[core] Introduce vector-store for data-evolution table #7240
ColdL wants to merge 2 commits into apache:master
            .noDefaultValue()
            .withDescription("Specify the vector store fields.");

    public static final ConfigOption<MemorySize> VECTOR_STORE_TARGET_FILE_SIZE =
Fixed. Now the config names in code are consistent with the public configuration keys.
LGTM
Thanks, @ColdL. Can you rebase onto the latest master to resolve the conflict?
I think the PR statement should clearly state a few things:
You can also create a separate doc in
@JingsongLi Thanks for the review! I've updated the PR description. After confirmation, I will continue to update and add the corresponding docs.
@ColdL How about:
Got it. Do you think we should retain the
`vector-field` is for Flink and Spark engines, which have no vector type, so I think it should be retained.
Purpose
Linked issue: update #7011
The goal of this PR is to optimize storage layout for vector scenarios in the Data Evolution table, specifically by storing vector columns and potentially associated columns in specified file formats.
For example, scalar columns can be stored using Parquet format, while vector columns and columns that may require point lookups during vector search can be stored using file formats like Lance.
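As a rough illustration of that split, the sketch below builds a table-options map from the three option keys this PR proposes. The column names (`embedding`, `doc_id`), the values `lance` and `256 mb`, and the comma-separated form of `vector-field` are assumptions made for illustration; the PR does not confirm them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class VectorStoreOptionsSketch {

    /**
     * Builds a hypothetical options map: scalar columns stay in Parquet,
     * while the listed vector-store columns go to a separate format.
     * All values here are illustrative assumptions, not defaults from the PR.
     */
    static Map<String, String> exampleOptions() {
        Map<String, String> options = new LinkedHashMap<>();
        // Format for the regular (scalar) column files.
        options.put("file.format", "parquet");
        // Columns to store separately (assumed comma-separated list).
        options.put("vector-field", "embedding,doc_id");
        // Format for the separately stored column group (assumed value).
        options.put("vector.file.format", "lance");
        // Rolling threshold for vector-store files (assumed value).
        options.put("vector.target-file-size", "256 mb");
        return options;
    }

    public static void main(String[] args) {
        exampleOptions().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

In practice these keys would be passed as table properties when the table is created; how each engine exposes them is outside the scope of this sketch.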
1. Configuration
This PR introduces three new configuration options:
- `vector-field`: defines the column names for separate storage
- `vector.file.format`: defines the file format
- `vector.target-file-size`: specifies the file size threshold for rolling

2. Storage Layout
When this feature is enabled, the set of columns specified by `vector-field` is stored separately in the file format specified by `vector.file.format`, marked by `.vector-store.` in the data file path.

File Path Pattern: `data-xxx-{count}.vector-store.{file-format}`

This design serves two purposes:
- The `.vector-store.` segment identifies these files as separately stored column groups.
- `{file-format}` follows current conventions, using the file format as the suffix.

Note: Perhaps `.vector.` is better than `.vector-store.`; if confirmed, I will update this accordingly. Please see the discussion below for details.

The final storage layout might be:
These vector-store files are associated with regular columns through Row-tracking / Data Evolution.
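The naming pattern and column split described above can be sketched as follows. This is a minimal illustration, not the PR's implementation: the class and method names are hypothetical, and the real writer logic in Paimon may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class VectorStoreLayoutSketch {

    /** Formats a name following data-xxx-{count}.vector-store.{file-format}. */
    static String vectorStoreFileName(String prefix, int count, String fileFormat) {
        return prefix + "-" + count + ".vector-store." + fileFormat;
    }

    /** Returns true if the file name carries the .vector-store. marker segment. */
    static boolean isVectorStoreFile(String fileName) {
        return fileName.contains(".vector-store.");
    }

    /**
     * Splits columns into two groups, preserving schema order:
     * index 0 holds regular columns, index 1 holds vector-store columns.
     */
    static List<List<String>> splitColumns(List<String> allColumns, Set<String> vectorFields) {
        List<String> regular = new ArrayList<>();
        List<String> vectorStore = new ArrayList<>();
        for (String column : allColumns) {
            if (vectorFields.contains(column)) {
                vectorStore.add(column);
            } else {
                regular.add(column);
            }
        }
        return Arrays.asList(regular, vectorStore);
    }

    public static void main(String[] args) {
        // Hypothetical schema and vector-field setting.
        List<String> columns = Arrays.asList("id", "text", "embedding");
        Set<String> vectorFields = new HashSet<>(Arrays.asList("embedding"));

        System.out.println(splitColumns(columns, vectorFields));
        System.out.println(vectorStoreFileName("data-xxx", 0, "lance"));
        System.out.println(isVectorStoreFile("data-xxx-0.parquet"));
    }
}
```

Keeping the format as the final suffix means existing code that picks a reader from the file extension can keep working, while the marker segment lets a reader tell the two column groups apart.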
Tests
API and Format
Documentation