feat: integrate OMEGA adaptive early termination into zvec #301
driPyf wants to merge 138 commits into
Integrate OMEGALib repository as a submodule to provide OMEGA adaptive search functionality. The submodule includes GBDT inference, feature extraction, model management, and search context components.
Add OMEGA index components that wrap HNSW with adaptive search capability:
- OmegaSearcher: Wraps HnswSearcher with OMEGA model integration and automatic fallback
- OmegaBuilder: Wraps HnswBuilder for index construction
- OmegaStreamer: Wraps HnswStreamer for streaming operations
- Factory registration for all components
- CMakeLists.txt integration with omega library dependency

OMEGA mode activates when vector count >= threshold and model is loaded, otherwise falls back to standard HNSW transparently.
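The activation rule in this commit message can be sketched as follows. This is a minimal illustration in Python, not the PR's C++ code: the function name `choose_search_mode` and the example threshold value are hypothetical, and only the rule itself (vector count >= threshold and model loaded, otherwise fall back to HNSW) comes from the commit text.

```python
def choose_search_mode(vector_count: int, min_vector_threshold: int, model_loaded: bool) -> str:
    """Return "omega" only when both activation conditions hold, else "hnsw".

    Hypothetical helper: mirrors the rule stated in the commit message.
    """
    if model_loaded and vector_count >= min_vector_threshold:
        return "omega"
    # Any unmet condition falls back to plain HNSW search.
    return "hnsw"


if __name__ == "__main__":
    # The threshold of 100_000 is an illustrative value, not taken from the PR.
    print(choose_search_mode(200_000, 100_000, model_loaded=True))
    print(choose_search_mode(50_000, 100_000, model_loaded=True))
    print(choose_search_mode(200_000, 100_000, model_loaded=False))
```

Keeping the decision in one pure function makes the fallback easy to unit-test independently of the index itself.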
- Add OMEGA index type to zvec type system
- Implement OmegaIndexParams class for index configuration
- Add Python bindings for OmegaIndexParam
- Integrate OMEGA searcher with HNSW fallback mechanism
- Add comprehensive Python unit tests for OMEGA functionality
- Update schema validation to support OMEGA index type

Tests verify that OMEGA index correctly falls back to HNSW behavior when OMEGA-specific features are not enabled, ensuring full compatibility.
…y recall
- Implement OmegaIndex with ITrainingCapable interface for training support
- Create OmegaStreamer with training mode for feature collection during search
- Add OmegaSearcher adaptive search with OMEGA early stopping prediction
- Implement training data export and collection APIs
- Add OmegaQueryParams and OmegaContext for per-query target_recall specification
- Create omega_params.h and omega_context.h for parameter management
- Update engine_helper to convert and extract OMEGA query parameters
- Integrate training mode with Collection API (enable/disable/export methods)
- Add training data collector, query generator, and model trainer components
- Add Python training API with OmegaTrainer class
- Add debug logging for OMEGA index creation and merge operations
- Adjust HnswSearcher member access modifiers for OMEGA inheritance
- Remove test_omega_fallback.py (replaced by test_collection.py tests)
- Fix memory explosion in training data collection by clearing records after copy
- Add omega_model directory creation before training to fix CSV write failure
- Remove all debug fprintf/fflush statements and empty code blocks
… params
- Parallelize ground truth computation and training searches with std::thread
- Add training_query_id support for thread-safe parallel training
- Add num_training_queries param to OmegaIndexParams (default: 1000)
- Use ef_construction as training search ef instead of hardcoded 1000
Build System Changes:
- Add ZVEC_ENABLE_OMEGA option for conditional OMEGA compilation (default: OFF)
- Add -DZVEC_ENABLE_OMEGA definition when enabled
- Update thirdparty/CMakeLists.txt to conditionally build omega library
- Update src/core/CMakeLists.txt to conditionally compile omega sources
- Update omega submodule to version with LightGBM C API support

Training System Refactor:
- Replace Python subprocess training with native LightGBM C API
  * Remove CSV export and Python _omega_training.py invocation
  * Add direct omega::OmegaTrainer integration via C++ API
  * Remove ExportToCSV, ExportGtCmpsToCSV, InvokePythonTrainer methods
- Add configurable training parameters to OmegaModelTrainerOptions:
  * num_iterations (default: 100)
  * num_leaves (default: 31)
  * learning_rate (default: 0.1)
  * num_threads (default: 8)
- Add type conversion helpers (ConvertRecord, ConvertGtCmpsData)
- Improve training performance

Training Data Collection Improvements:
- Move training record storage from OmegaStreamer to OmegaContext
  * Remove shared collected_records_ vector and training_mutex_ from OmegaStreamer
  * Store records per-query in OmegaContext via add_training_record()
  * Eliminate lock contention during parallel training searches
- Remove legacy GetTrainingRecords/ClearTrainingRecords from OmegaStreamer
- Simplify OmegaIndex training interface (return empty vectors)
- Update omega_streamer.cc to use context-based record collection

Code Cleanup:
- Wrap all OMEGA-dependent code with #ifdef ZVEC_ENABLE_OMEGA guards
- Update OmegaModelTrainerOptions documentation
- Add detailed logging for training record collection
- Improve error handling for missing OmegaContext
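The move from a shared, mutex-protected record vector to per-query storage can be illustrated with a small Python sketch. It is an assumption-laden analogy, not the PR's C++ code: `QueryContext`, `run_training_search`, and `collect_parallel` are hypothetical names, and the "search" only emits placeholder records. The idea it shows is the one named in the commit: each query owns its records, so worker threads never write to a shared container.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class QueryContext:
    """Hypothetical stand-in for a per-query context that owns its records."""
    training_query_id: int
    records: list = field(default_factory=list)

    def add_training_record(self, record: dict) -> None:
        # Only the thread running this query touches this list, so no lock is needed.
        self.records.append(record)


def run_training_search(ctx: QueryContext, hops: int) -> None:
    # Placeholder for a real search: emit one fake record per traversal step.
    for hop in range(hops):
        ctx.add_training_record({"query_id": ctx.training_query_id, "hop": hop})


def collect_parallel(num_queries: int, hops: int) -> list:
    contexts = [QueryContext(training_query_id=i) for i in range(num_queries)]
    threads = [
        threading.Thread(target=run_training_search, args=(ctx, hops))
        for ctx in contexts
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Records are merged once, single-threaded, after every worker has finished.
    return [record for ctx in contexts for record in ctx.records]


if __name__ == "__main__":
    print(len(collect_parallel(num_queries=4, hops=3)))
```

Merging after `join()` keeps the only cross-thread step outside the hot path, which is the contention the commit says it removes.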
- Expose target_recall parameter for OMEGA adaptive early stopping
- Update OMEGA tests with 100k docs and recall validation
- Remove deprecated _omega_training.py
Major optimization:
- Move training data collection before Flush() to use in-memory graph
- Eliminate ~2 minute disk reload delay for 1M vectors
- Fix GT computation to use correct indexers (was using empty flushed ones)

Training improvements:
- Add ef_groundtruth parameter for faster GT computation using HNSW
- Support parallel training searches with per-query ground truth
- Add window_size parameter for early stopping control
- Expose all OMEGA params through Python API (OmegaIndexParam, OmegaQueryParam)

Code quality:
- Add TIMING logs for performance debugging
- Refactor TrainingDataCollector to use passed indexers instead of segment's
- Clean up training flow in merge_vector_indexer()
…query-side search path

OMEGA integration updates:
- wire the updated omega training and search behavior into zvec index build, load and query execution paths
- expose and propagate OMEGA training/query parameters through the Python API, index params and engine helper conversions
- update omega builder, searcher, streamer and context handling to match the reference behavior more closely

Training and validation updates:
- update training data collection and model training integration for the reference-aligned OMEGA workflow

Performance and debugging updates:
- add an OMEGA prediction microbenchmark for query-side inference analysis
- improve storage/index plumbing needed by the OMEGA workflow
- add query-side diagnostics to investigate early-stop calibration and repeated prediction overhead
…g hooks, and add query-side profiling
OmegaStreamer &operator=(const OmegaStreamer &streamer) = delete;

// Training-mode configuration forwarded into per-search contexts.
void EnableTrainingMode(bool enable) {

Use a lower-case function name.
std::vector<uint64_t> query_doc_ids;
{
  ScopedTimer timer("Step1: GenerateHeldOutQueries");
  auto sampled = TrainingQueryGenerator::GenerateHeldOutQueries(

This should tell whether the result was loaded successfully.
Hi @driPyf, thanks for the PR! Could you please resolve the merge conflicts with the main branch? Once that's done, I'll continue with the review. Thanks!

Also, could you please submit a PR to zvec-web beforehand? That PR would give a more intuitive view of Omega's features and help users better understand this capability.
# Conflicts:
#   .github/workflows/04-android-build.yml
#   examples/c++/CMakeLists.txt
#   examples/c/CMakeLists.txt
#   python/tests/test_collection.py
#   python/tests/test_params.py
#   python/zvec/__init__.py
#   python/zvec/model/param/query.py
#   python/zvec/model/schema/field_schema.py
#   src/binding/c/c_api.cc
#   src/binding/python/typing/python_type.cc
#   src/core/algorithm/CMakeLists.txt
#   src/core/algorithm/hnsw/hnsw_algorithm.cc
#   src/core/algorithm/hnsw/hnsw_algorithm.h
#   src/core/algorithm/hnsw/hnsw_dist_calculator.h
#   src/core/algorithm/hnsw/hnsw_searcher.cc
#   src/core/algorithm/hnsw/hnsw_searcher.h
#   src/core/algorithm/hnsw/hnsw_streamer.cc
#   src/core/algorithm/hnsw/hnsw_streamer.h
#   src/db/collection.cc
#   src/db/index/column/vector_column/vector_column_indexer.h
#   src/db/index/common/proto_converter.cc
#   src/db/index/common/proto_converter.h
#   src/db/index/common/schema.cc
#   src/db/proto/zvec.proto
#   src/include/zvec/c_api.h
#   src/include/zvec/core/framework/index_storage.h
#   src/include/zvec/core/interface/index_param.h
#   src/include/zvec/db/index_params.h
#   src/include/zvec/db/query_params.h
#   tests/core/interface/CMakeLists.txt
#   tools/core/CMakeLists.txt
Apply ruff format and clang-format to resolve CI lint violations.
- Add missing ZVEC_DEPENDENCY_LIB_DIR definition (was used but never set)
- Change FATAL_ERROR to WARNING since omega-example links libzvec.so which already bundles OMEGA internally
- Fix omega-example to link zvec-lib instead of non-existent zvec-core
- Add omega-example to CI workflow

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The Cache dependencies step in Android CI was using actions/cache@v5 without the required 'key' and 'path' parameters, causing CI failure.
Python's setup-python action sets LD_LIBRARY_PATH to its own lib directory, which can contain a libgomp that conflicts with the OpenMP runtime (libomp) used by Clang builds. This causes heap corruption during test teardown in omega_index_integration_test. Unsetting LD_LIBRARY_PATH ensures the system OpenMP library is loaded. C++ test binaries find their own shared libraries via RPATH and do not need Python's library path.
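The same idea, launching a test binary with `LD_LIBRARY_PATH` removed from its environment, can be sketched in Python for local reproduction. This is only an illustration of the workaround described in the commit message; the CI workflow itself uses `unset LD_LIBRARY_PATH` in a shell step, and the helper names below are hypothetical.

```python
import os
import subprocess


def env_without_ld_library_path(base_env=None) -> dict:
    """Return a copy of the environment with LD_LIBRARY_PATH removed."""
    env = dict(os.environ if base_env is None else base_env)
    # pop() with a default is a no-op when the variable is not set.
    env.pop("LD_LIBRARY_PATH", None)
    return env


def run_test_binary(cmd: list) -> int:
    """Run a command (for example a C++ test binary) without LD_LIBRARY_PATH."""
    completed = subprocess.run(cmd, env=env_without_ld_library_path(), check=False)
    return completed.returncode
```

Because the binaries locate their own shared libraries through RPATH, as the commit message notes, dropping the variable should not prevent them from starting.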
Replace old type names (ZVecErrorCode, ZVecCollection, etc.) with the correct _t suffixed types (zvec_error_code_t, zvec_collection_t, etc.) to match the C API header definitions.
CMake Warning at thirdparty/omega/OMEGALib/CMakeLists.txt:30 (message):
OMEGA: OpenMP not found, building LightGBM without OpenMP
CMake Error at thirdparty/omega/OMEGALib/lightgbm/CMakeLists.txt:27 (cmake_minimum_required):
CMake 3.28 or higher is required. You are running version 3.26.5
Does the CMake version need to be updated to 3.28?

After updating CMake to 3.28, the local build on macOS still fails. env: error msg: It seems that C++11/17 was not specified when compiling Eigen.
// Invert Index
IT_INVERT = 10;
// OMEGA Index (HNSW with learned early stopping)
IT_OMEGA = 11;

IT_INVERT is a scalar index type; suggest IT_OMEGA = 6.
IVFIndexParams ivf = 4;
HnswRabitqIndexParams hnsw_rabitq = 5;
VamanaIndexParams vamana = 6;
OmegaIndexParams omega = 5;

-> OmegaIndexParams omega=7
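The reviewer's point is that `omega = 5` reuses the field number already taken by `hnsw_rabitq = 5`, and protobuf requires field numbers to be unique within a message. A quick line-based check for this kind of collision might look like the sketch below. It is a hypothetical helper, not part of the PR or of any protobuf tooling, and it assumes all the lines passed in belong to the same message.

```python
import re
from collections import defaultdict

# Matches simple field declarations such as "OmegaIndexParams omega = 5;".
# Enum entries like "IT_OMEGA = 11;" have no type token and are not matched.
FIELD_RE = re.compile(r"^\s*([\w.]+)\s+(\w+)\s*=\s*(\d+)\s*;")


def find_duplicate_field_numbers(proto_lines):
    """Return {field_number: [field_names]} for numbers used more than once."""
    names_by_number = defaultdict(list)
    for line in proto_lines:
        match = FIELD_RE.match(line)
        if match:
            names_by_number[int(match.group(3))].append(match.group(2))
    return {num: names for num, names in names_by_number.items() if len(names) > 1}


if __name__ == "__main__":
    lines = [
        "IVFIndexParams ivf = 4;",
        "HnswRabitqIndexParams hnsw_rabitq = 5;",
        "VamanaIndexParams vamana = 6;",
        "OmegaIndexParams omega = 5;",
    ]
    print(find_duplicate_field_numbers(lines))
```

With the suggested `omega = 7`, the same check returns an empty dictionary.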
HNSW_RABITQ = 4,
VAMANA = 5,
INVERT = 10,
OMEGA = 11,
// limitations under the License.
#pragma once

#include <cstddef>

There is no need to include cstddef/string.
@@ -0,0 +1,127 @@
// Copyright 2025-present the zvec project

The functions under src/db/training are all unused, aren't they?
- name: Run C++ Tests
  run: |
    cd "$GITHUB_WORKSPACE/build"
    unset LD_LIBRARY_PATH

export CCACHE_BASEDIR="$GITHUB_WORKSPACE"
export CCACHE_NOHASHDIR=1
export CCACHE_SLOPPINESS=clang_index_store,file_stat_matches,include_file_mtime,locale,time_macros
  set(CMAKE_MSVC_RUNTIME_LIBRARY "MultiThreaded$<$<CONFIG:Debug>:Debug>")
endif()

if(ZVEC_ENABLE_OMEGA)

Is this check necessary? If omega really isn't compiled in, the program will fail at runtime anyway, even if linking succeeds, right?

If zvec is built with ZVEC_ENABLE_OMEGA=OFF, the example fails at runtime with an error that the omega resource does not exist; inside core, components are registered through the register mechanism.
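The behavior described in this reply, where a component that was never registered only surfaces as an error when it is looked up at runtime, can be illustrated with a tiny registry sketch. All names here (`FactoryRegistry`, `build_registry`, the string component names) are hypothetical and stand in for zvec's actual C++ registration mechanism, which is not shown in this conversation.

```python
class FactoryRegistry:
    """Minimal name-to-creator registry (illustrative only)."""

    def __init__(self):
        self._creators = {}

    def register(self, name, creator):
        self._creators[name] = creator

    def create(self, name):
        try:
            creator = self._creators[name]
        except KeyError:
            # The failure appears only here, at lookup time, not at link time.
            raise LookupError(f"component not registered: {name}") from None
        return creator()


def build_registry(enable_omega: bool) -> FactoryRegistry:
    registry = FactoryRegistry()
    registry.register("HnswSearcher", lambda: "hnsw-searcher")
    if enable_omega:
        # Stands in for code compiled only when the OMEGA build option is on.
        registry.register("OmegaSearcher", lambda: "omega-searcher")
    return registry


if __name__ == "__main__":
    print(build_registry(True).create("OmegaSearcher"))
    try:
        build_registry(False).create("OmegaSearcher")
    except LookupError as err:
        print(err)
```

This is why a build-time check and a runtime error are not equivalent: the former reports the problem before any query is issued.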
set(ZVEC_INCLUDE_DIR ${CMAKE_BINARY_DIR}/../../../src/include)
set(ZVEC_GENERATED_INCLUDE_DIR ${CMAKE_BINARY_DIR}/../../../${HOST_BUILD_DIR}/src/generated)
set(ZVEC_LIB_DIR ${CMAKE_BINARY_DIR}/../../../${HOST_BUILD_DIR}/lib)
set(ZVEC_DEPENDENCY_LIB_DIR ${CMAKE_BINARY_DIR}/../../../${HOST_BUILD_DIR}/external/usr/local/lib)

There is no need to add ZVEC_DEPENDENCY_LIB_DIR, is there?
@@ -0,0 +1,126 @@
#include <cmath>

}

Status SegmentImpl::load_vector_index_blocks() {
  int block_index = 0;

../training/*.cc)

if(NOT ZVEC_ENABLE_OMEGA)
  list(FILTER ZVEC_INDEX_SRCS EXCLUDE REGEX ".*/training/omega_model_trainer\\.cc$")
@@ -43,6 +43,8 @@
#include "db/index/segment/segment_helper.h"

If ENABLE_OMEGA=OFF, an error should ideally be returned as early as collection creation (the schema validation stage).
  return Status::OK();
}

Status retrain_omega_model() override {

collection_test.cc contains interface tests for create_and_open, ddl, dml, query, optimize, and so on; all of them should take omega as a param and test it alongside the others.
@@ -0,0 +1 @@
add_subdirectory(OMEGALib)

You can add the following below it:
if(TARGET omega)
  set_target_properties(omega PROPERTIES
    CXX_STANDARD 17
    CXX_STANDARD_REQUIRED ON
    CXX_EXTENSIONS OFF
  )
endif()
This resolves the problem of Eigen being compiled without C++11/17.
Closes #300
Greptile Summary
This PR introduces a new OMEGA index type to zvec, integrating adaptive early termination on top of HNSW. Instead of adding a separate search engine, the implementation keeps HNSW as the underlying graph traversal path and adds a learned query-time stopping policy that decides whether the current search state is already sufficient for a target recall. The implementation spans a new omega algorithm directory, OMEGA-aware searcher/streamer/index support, DB-layer training orchestration, Python bindings, benchmark integration, and Python workflow tests.

Key changes:
- OMEGA is now exposed as a first-class index type with dedicated index params and query params in both the core interfaces and Python bindings.
- OmegaSearcher and OmegaStreamer integrate OMEGA with the existing HNSW search loop through hook callbacks, so online search remains HNSW-based while query-time stop decisions come from OMEGALib.
- … omega_model/ artifacts such as model.txt, threshold_table.txt, interval_table.txt, gt_collected_table.txt, and gt_cmps_all_table.txt.
- … min_vector_threshold.

Issues found:

Confidence Score: 4/5
src/core/algorithm/omega/omega_searcher.cc, src/core/algorithm/omega/omega_streamer.cc, and src/db/training/omega_training_coordinator.cc: these files define the runtime activation/fallback logic, the HNSW hook integration, and the offline training lifecycle that make the feature work end to end.

Important Files Changed
- src/core/algorithm/omega/omega_searcher.cc
- src/core/algorithm/omega/omega_streamer.cc
- src/core/algorithm/omega/omega_context.h: … HnswContext with OMEGA-specific per-query state such as target_recall, training_query_id, and collected training outputs.
- src/core/interface/indexes/omega_index.cc
- src/db/training/omega_training_coordinator.cc: … omega_model/.
- src/db/training/omega_model_trainer.cc
- src/db/index/segment/segment.cc
- src/db/index/column/vector_column/engine_helper.hpp
- src/binding/python/model/param/python_param.cc: … OmegaIndexParam, OmegaQueryParam, and optimize-time retraining options.
- python/tests/test_collection.py
- thirdparty/CMakeLists.txt
- CMakeLists.txt

Sequence Diagram
sequenceDiagram
    participant User
    participant Collection
    participant Segment
    participant OmegaTrainingCoordinator
    participant OMEGALib
    participant OmegaSearcher
    participant HNSW

    Note over User,Collection: Offline build / optimize
    User->>Collection: insert(docs)
    Collection->>Segment: build / persist vector index
    User->>Collection: optimize()
    Collection->>Segment: collect held-out queries and traces
    Segment->>OmegaTrainingCoordinator: training records + gt data
    OmegaTrainingCoordinator->>OMEGALib: train model
    OMEGALib-->>OmegaTrainingCoordinator: model + auxiliary tables
    OmegaTrainingCoordinator-->>Segment: persist omega_model/

    Note over User,Collection: Online query
    User->>Collection: query(VectorQuery, OmegaQueryParam)
    Collection->>OmegaSearcher: search(...)
    OmegaSearcher->>OmegaSearcher: should_use_omega()
    alt model available and threshold satisfied
        OmegaSearcher->>HNSW: search with OMEGA hooks
        HNSW->>OMEGALib: update SearchContext during traversal
        OMEGALib-->>HNSW: stop / continue
        HNSW-->>OmegaSearcher: results
    else fallback
        OmegaSearcher->>HNSW: plain HNSW search
        HNSW-->>OmegaSearcher: results
    end
    OmegaSearcher-->>Collection: results
    Collection-->>User: query results
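The "search with OMEGA hooks" branch of the diagram, where the traversal asks an external policy whether to stop or continue, can be sketched with a toy best-first graph search. Everything in this example is an assumption made for illustration: the 1-D points, the chain graph, the `stop_hook` signature, and the fixed distance threshold that stands in for the learned model. It is not the HNSW or OMEGALib implementation.

```python
import heapq

# Toy data: six 1-D points connected in a chain (hypothetical example graph).
POINTS = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
GRAPH = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}


def graph_search(graph, points, query, entry, max_expansions, stop_hook=None):
    """Best-first search; returns (sorted (distance, node) list, expansions used).

    stop_hook(expansions, best_distance) -> bool is called after each expansion
    and may end the search early. With no hook, the search runs until the
    frontier is empty or max_expansions is reached.
    """
    def dist(node):
        return abs(points[node] - query)

    visited = {entry}
    frontier = [(dist(entry), entry)]  # min-heap of nodes still to expand
    scored = [(dist(entry), entry)]    # every node whose distance was computed
    expansions = 0
    while frontier and expansions < max_expansions:
        _, node = heapq.heappop(frontier)
        expansions += 1
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                item = (dist(neighbor), neighbor)
                heapq.heappush(frontier, item)
                scored.append(item)
        best_distance = min(scored)[0]
        if stop_hook is not None and stop_hook(expansions, best_distance):
            break  # the policy judged the current state good enough
    return sorted(scored), expansions


if __name__ == "__main__":
    full, n_full = graph_search(GRAPH, POINTS, 2.2, 0, max_expansions=10)
    early, n_early = graph_search(
        GRAPH, POINTS, 2.2, 0, max_expansions=10,
        stop_hook=lambda expansions, best: best < 0.5,  # stand-in for a model
    )
    print(n_full, n_early, full[0][1], early[0][1])
```

Passing the policy as a callback leaves the traversal code unchanged when no hook is supplied, which matches the fallback branch in the diagram.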