Conversation
3c02fb5 to
3381152
Compare
3381152 to
0e5727a
Compare
|
Depends on VectorDB-NTU/RaBitQ-Library#36 |
|
@greptile |
Greptile SummaryThis PR adds comprehensive HNSW-RaBitQ support to the zvec project, implementing a vector search index that combines Hierarchical Navigable Small World (HNSW) graph structure with RaBitQ quantization for memory-efficient approximate nearest neighbor search. Major changes:
Issues found:
Confidence Score: 3/5
Important Files Changed
Class Diagram%%{init: {'theme': 'neutral'}}%%
classDiagram
class HnswRabitqIndex {
+build()
+search()
+stream()
}
class HnswRabitqBuilder {
+init()
+train()
+build()
-rabitq_converter_
}
class HnswRabitqSearcher {
+init()
+search()
-entity_
-reformer_
}
class HnswRabitqStreamer {
+init()
+add()
+search()
-entity_
-reformer_
}
class HnswRabitqAlgorithm {
+add_node()
+search()
-entity_
}
class HnswRabitqEntity {
+get_neighbors()
+update_neighbors()
-graph_structure_
}
class RabitqConverter {
+train()
+transform()
+to_reformer()
-rotator_
-centroids_
}
class RabitqReformer {
+reform()
+get_quantized()
-rotator_
-centroids_
}
class HnswRabitqContext {
+dist_calculator()
+visit_filter()
}
HnswRabitqIndex --> HnswRabitqBuilder
HnswRabitqIndex --> HnswRabitqSearcher
HnswRabitqIndex --> HnswRabitqStreamer
HnswRabitqBuilder --> RabitqConverter
HnswRabitqBuilder --> HnswRabitqAlgorithm
HnswRabitqBuilder --> HnswRabitqEntity
HnswRabitqSearcher --> HnswRabitqAlgorithm
HnswRabitqSearcher --> HnswRabitqEntity
HnswRabitqSearcher --> RabitqReformer
HnswRabitqStreamer --> HnswRabitqAlgorithm
HnswRabitqStreamer --> HnswRabitqEntity
HnswRabitqStreamer --> RabitqReformer
HnswRabitqAlgorithm --> HnswRabitqEntity
HnswRabitqAlgorithm --> HnswRabitqContext
RabitqConverter --> RabitqReformer
Last reviewed commit: 42a30d6 |
| path = thirdparty/magic_enum/magic_enum-0.9.7 | ||
| url = https://github.com/Neargye/magic_enum.git | ||
| ignore = all | ||
| [submodule "thirdparty/thirdparty/RaBitQ-Library/RaBitQ-Library-0.1"] |
There was a problem hiding this comment.
Submodule name has duplicate thirdparty/ prefix
| [submodule "thirdparty/thirdparty/RaBitQ-Library/RaBitQ-Library-0.1"] | |
| [submodule "thirdparty/RaBitQ-Library/RaBitQ-Library-0.1"] |
There was a problem hiding this comment.
please check and handle this.
There was a problem hiding this comment.
You're right, the submodule name has a duplicate thirdparty/ prefix. The path is correct but the submodule name itself is wrong. Here's the fix:
| [submodule "thirdparty/thirdparty/RaBitQ-Library/RaBitQ-Library-0.1"] | |
| [submodule "thirdparty/RaBitQ-Library/RaBitQ-Library-0.1"] |
This will match the pattern used by all other submodules in the file and correctly remove the duplicate prefix.
| float dists[size]; | ||
| const void *neighbor_vecs[size]; |
There was a problem hiding this comment.
Variable-length arrays (VLAs) are not standard C++ and won't compile with MSVC. Use std::vector instead
| float dists[size]; | |
| const void *neighbor_vecs[size]; | |
| std::vector<float> dists(size); | |
| std::vector<const void*> neighbor_vecs(size); |
There was a problem hiding this comment.
please check and handle this.
There was a problem hiding this comment.
Based on the provided context, the issue is clear. Lines 139-140 use variable-length arrays (VLAs) which aren't standard C++ and won't compile on MSVC. Here's the fix:
| float dists[size]; | |
| const void *neighbor_vecs[size]; | |
| std::vector<float> dists(size); | |
| std::vector<const void*> neighbor_vecs(size); |
This replaces the VLAs with std::vector, making the code portable across all C++ compilers including MSVC.
| (*ctx->mutable_stats_get_neighbors())++; | ||
| } | ||
|
|
||
| node_id_t neighbor_ids[neighbors.size()]; |
There was a problem hiding this comment.
VLA not standard C++, won't compile with MSVC
| node_id_t neighbor_ids[neighbors.size()]; | |
| std::vector<node_id_t> neighbor_ids(neighbors.size()); |
There was a problem hiding this comment.
please check and handle this.
There was a problem hiding this comment.
You're right, this is a VLA which isn't standard C++. The fix is to use std::vector:
| node_id_t neighbor_ids[neighbors.size()]; | |
| std::vector<node_id_t> neighbor_ids(neighbors.size()); |
This will compile on MSVC and all standard C++ compilers while maintaining the same functionality.
| float dists[size]; | ||
| const void *neighbor_vecs[size]; |
There was a problem hiding this comment.
VLAs not standard C++, won't compile with MSVC
| float dists[size]; | |
| const void *neighbor_vecs[size]; | |
| std::vector<float> dists(size); | |
| std::vector<const void*> neighbor_vecs(size); |
There was a problem hiding this comment.
please check and handle this.
There was a problem hiding this comment.
You're right, those are VLAs which won't compile on MSVC. Replace them with std::vector:
| float dists[size]; | |
| const void *neighbor_vecs[size]; | |
| std::vector<float> dists(size); | |
| std::vector<const void*> neighbor_vecs(size); |
This is standard C++ and will work across all compilers. You'll also need to update the references to use .data() when passing to batch_dist:
dc.batch_dist(neighbor_vecs.data(), size, dists.data());| (*ctx->mutable_stats_get_neighbors())++; | ||
| } | ||
|
|
||
| node_id_t neighbor_ids[neighbors.size()]; |
There was a problem hiding this comment.
VLA not standard C++, won't compile with MSVC
| node_id_t neighbor_ids[neighbors.size()]; | |
| std::vector<node_id_t> neighbor_ids(neighbors.size()); |
There was a problem hiding this comment.
need to pay attention to these VLA issues.
| @@ -0,0 +1,4 @@ | |||
| add_library(rabitqlib INTERFACE) | |||
| target_include_directories( | |||
| sparsehash INTERFACE "${CMAKE_CURRENT_SOURCE_DIR}/RaBitQ-Library-0.1/include" | |||
There was a problem hiding this comment.
sparsehash -> rabitqlib?
There was a problem hiding this comment.
this class can be moved to quantizer folder with rabitq_reformer
|
|
||
| void HnswRabitqQueryAlgorithm::expand_neighbors_by_group( | ||
| TopkHeap &topk, HnswRabitqContext *ctx) const { | ||
| // if (!ctx->group_by().is_valid()) { |
There was a problem hiding this comment.
to return with some message when group's setup?
| continue; | ||
| } | ||
| } else { | ||
| // Candidate cand{ResultRecord(candest.est_dist, candest.low_dist), |
There was a problem hiding this comment.
no need to check ex bits here? can it be checked earlier?
|
|
||
| // TODO: check loop type | ||
|
|
||
| // if ((!topk.full()) || cur_dist < topk[0].second) { |
There was a problem hiding this comment.
lagacy codes here and elsewhere can be cleaned up.
| query_wrapper.set_g_add(q_to_centroids[cluster_id]); | ||
| float est_dist = rabitqlib::split_distance_boosting( | ||
| ex_data, ip_func_, query_wrapper, padded_dim_, ex_bits_, res.ip_x0_qr); | ||
| float low_dist = est_dist - (res.est_dist - res.low_dist) / (1 << ex_bits_); |
There was a problem hiding this comment.
can predefine param to convert to multiply manipulation:
double inv_divisor = 1.0 / (1 << ex_bits_);
auto result = (res.est_dist - res.low_dist) * inv_divisor;
resolve #42