feat: add diskann index #369
… feat/diskann_index
| const std::vector<float> &b) const { | ||
| if (a.size() != b.size()) return false; | ||
| for (size_t i = 0; i < a.size(); ++i) | ||
| if (std::fabs(a[i] - b[i]) >= 1e-4f) return false; |
I relaxed the tolerance while tracking down a bug in the test; I have changed it back.
| -Wl,--whole-archive | ||
| $<TARGET_FILE:core_knn_flat_static> | ||
| $<TARGET_FILE:core_knn_flat_sparse_static> | ||
| $<TARGET_FILE:core_knn_hnsw_static> |
| run: | | ||
| sudo apt-get update | ||
| sudo apt-get install -y --no-install-recommends \ | ||
| libaio-dev |
What happens if libaio-dev is not installed in the user's environment?
Right now libaio has to be installed for the default build. We could separate the two cases through configuration. Qwen's suggestion is to install the libaio library through the Linux package manager:
Installation
zvec requires the libaio system library on linux platform.
On Ubuntu/Debian:
sudo apt-get install libaio1 libaio-dev
pip install zvec
What happens if it is not installed? The expected behavior is that a user who does not install libaio can still use every feature other than diskann.
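One way to get the behavior asked for here is to probe for the library at runtime instead of linking against it unconditionally. The sketch below is only an illustration under stated assumptions: `shared_library_available`, `IndexKind` and `index_kind_usable` are hypothetical names, not zvec APIs, and the soname to probe (for example `libaio.so.1`) is an assumption that can differ between distributions. It relies only on POSIX `dlopen`/`dlclose`.

```cpp
#include <cassert>
#include <dlfcn.h>

// Returns true if the shared library named `soname` can be loaded at runtime.
// The handle is released immediately; this is only an availability probe.
inline bool shared_library_available(const char *soname) {
  void *handle = dlopen(soname, RTLD_LAZY | RTLD_LOCAL);
  if (handle == nullptr) return false;
  dlclose(handle);
  return true;
}

// Hypothetical index kinds, used only for this illustration.
enum class IndexKind { kFlat, kHnsw, kDiskAnn };

// Only the disk-based index depends on asynchronous I/O support;
// every other kind stays usable when the library is missing.
inline bool index_kind_usable(IndexKind kind, bool aio_available) {
  return kind != IndexKind::kDiskAnn || aio_available;
}
```

A caller could evaluate `shared_library_available("libaio.so.1")` once at startup and return an "unsupported" error only when a diskann index is requested without it. On older glibc versions this may need `-ldl` at link time.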
| pytest \ | ||
| scikit-build-core \ | ||
| setuptools_scm | ||
| shell: bash |
Please add bash back so this stays consistent. Also, if this step later becomes a multi-line command, it could break in environments where bash is not the default shell.
| } | ||
|
|
||
| auto &pool = ctx->expanded_nodes(); | ||
| for (uint32_t i = 0; i < pool.size(); i++) { |
You could use std::remove_if + erase here; it is somewhat more efficient.
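For reference, the erase-remove idiom mentioned above looks roughly like this. The loop body is not visible in the diff context, so the element type `ExpandedNode` and the removal predicate (drop entries with a given id) are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an entry of ctx->expanded_nodes().
struct ExpandedNode {
  uint32_t id;
  float distance;
};

// Removes every node whose id equals `target` in a single linear pass.
// std::remove_if moves the kept elements to the front and returns the new
// logical end; erase() then trims the leftover tail in one call.
inline void drop_node(std::vector<ExpandedNode> &pool, uint32_t target) {
  pool.erase(std::remove_if(pool.begin(), pool.end(),
                            [target](const ExpandedNode &node) {
                              return node.id == target;
                            }),
             pool.end());
}
```

Compared with erasing inside an index loop, this avoids shifting the tail once per removed element and keeps the relative order of the remaining nodes.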
|
|
||
| virtual ~DiskAnnQueryParams() = default; | ||
|
|
||
| int list_size() const { |
| } | ||
|
|
||
| for (size_t i = 0; i < dimension; i++) { | ||
| centroid_data_ptr[i] /= entity_.doc_cnt(); |
Can entity_.doc_cnt() be 0?
Added an up-front check:
if (ailego_unlikely(holder->count() == 0)) {
LOG_ERROR("Holder is empty");
return IndexError_Runtime;
}
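To make the failure mode concrete, here is a small standalone sketch of a centroid computation that refuses empty input before dividing. It uses plain `std::vector` instead of the holder and entity types from the PR, and `compute_centroid` is a hypothetical name, so treat it as an illustration of the guard rather than the actual builder code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Averages `count` row-major vectors of length `dimension`.
// Returns false for empty or malformed input so the caller can report an
// error instead of dividing by zero.
inline bool compute_centroid(const std::vector<float> &vectors,
                             size_t dimension, std::vector<float> *centroid) {
  assert(centroid != nullptr);
  if (dimension == 0 || vectors.empty() || vectors.size() % dimension != 0) {
    return false;
  }
  const size_t count = vectors.size() / dimension;
  centroid->assign(dimension, 0.0f);
  for (size_t row = 0; row < count; ++row) {
    for (size_t i = 0; i < dimension; ++i) {
      (*centroid)[i] += vectors[row * dimension + i];
    }
  }
  for (size_t i = 0; i < dimension; ++i) {
    (*centroid)[i] /= static_cast<float>(count);
  }
  return true;
}
```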
|
|
||
| (*entity_.mutable_medoid()) = medoid_id; | ||
|
|
||
| LOG_INFO("Medroid Calculation Done. ID: %zu", (size_t)medoid_id); |
|
|
||
| sector_internal_id_++; | ||
| if (sector_internal_id_ >= sector_vec_num_) { | ||
| std::vector<uint8_t> padding_(padding_size_, 0); |
Is there any need to allocate a temporary std::vector here?
std::memset(data_ptr + data_size_, 0, padding_size_);
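A self-contained version of the same idea, for illustration: zero the unused tail of the sector buffer in place rather than building a zero-filled vector and copying it. `pad_sector` and its parameters are hypothetical names; the real writer presumably tracks `data_size_` and `padding_size_` itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Zero-fills the bytes of `sector` that follow the first `used` payload
// bytes. No temporary buffer is allocated.
inline void pad_sector(std::vector<uint8_t> &sector, size_t used) {
  if (used >= sector.size()) return;  // sector is already full
  std::memset(sector.data() + used, 0, sector.size() - used);
}
```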
| float *centroid_data_{nullptr}; | ||
|
|
||
| diskann_id_t medoid_; | ||
| std::vector<diskann_id_t> entrypints_; |
| ## | ||
| ## Copyright (C) The Software Authors. All rights reserved. | ||
| ## | ||
| ## \file CMakeLists.txt |
Please remove this and replace it with the zvec copyright.
|
|
||
| int list_size() const { | ||
| return list_size_; | ||
| } |
The parameters exposed here (query_params/index_params) look different from those of similar products. What is the reasoning?
This stays consistent with diskann and uses list size.
… feat/diskann_index
…ec into feat/diskann_dynamic_load
| ContextPointer create_context() const override; | ||
|
|
||
| //! Create a new iterator | ||
| IndexSearcher::Provider::Pointer create_provider(void) const override { |
If this returns a null pointer, iterating over the data in MixedStreamerReducer::read_vec() may core dump. The current implementation appears to have verified the flat index -> diskann index path, but merging several small diskann indexes into one large diskann index is not implemented.
This can be done in
zvec/tests/db/collection_test.cc
Line 2684 in 0cb2d88
… feat/diskann_index
| "proxima.stratified.trainer.cluster_params_in_level_"; | ||
|
|
||
| static const std::string MULTI_CHUNK_CLUSTER_COUNT = | ||
| "proxima.cluster.multi_chunk_cluster.count"; |
|
|
||
| //! Init | ||
| int init() { | ||
| // file_.open(path, std::ios::in | std::ios::out); |
| typedef std::shared_ptr<MultiChunkClusterAlgorithm> Pointer; | ||
|
|
||
| //! Constructor | ||
| MultiChunkClusterAlgorithm(void) {} |
Use `= default`; an empty `{}` body can get in the way of compiler optimizations. The same applies to the destructor.
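The difference can be observed with type traits. In the sketch below, `UserProvided` and `Defaulted` are made-up types; the real class may not be trivial for other reasons (for example virtual functions), so this only illustrates why `= default` is the safer habit.

```cpp
#include <cassert>
#include <type_traits>

// An empty body still counts as a user-provided constructor, so the type
// is no longer trivially default constructible.
struct UserProvided {
  UserProvided() {}
  int value;
};

// A defaulted constructor keeps the type trivial.
struct Defaulted {
  Defaulted() = default;
  int value;
};

static_assert(!std::is_trivially_default_constructible<UserProvided>::value,
              "an empty {} body makes the constructor non-trivial");
static_assert(std::is_trivially_default_constructible<Defaulted>::value,
              "= default keeps the constructor trivial");
```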
| "ValueType must be arithmetic"); | ||
|
|
||
| //! Constructor | ||
| MultiChunkNumericalAlgorithm(void) {} |
|
|
||
| bool result = algorithm.cluster_once(*local_threads, &cost); | ||
| if (result != true) { | ||
| LOG_ERROR("(%u) Failed to cluster.", i + 1); |
Printing with %u triggers a compiler warning; I suggest %zu with a (size_t) cast.
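As a sketch of the suggested format specifier, assuming the loop index is a `size_t` (the declaration is not visible in the diff context): `%zu` is the standard printf conversion for `size_t`, whereas `%u` expects `unsigned int`. `format_cluster_error` is a hypothetical helper that stands in for the `LOG_ERROR` call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Formats the 1-based round number with %zu, which matches size_t on every
// platform, so -Wformat has nothing to warn about.
inline std::string format_cluster_error(size_t round) {
  char buffer[64];
  std::snprintf(buffer, sizeof(buffer), "(%zu) Failed to cluster.", round + 1);
  return std::string(buffer);
}
```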
| (*finished)++; | ||
| } | ||
|
|
||
| return; |
| int cleanup(void); | ||
|
|
||
| //! Reset Cluster | ||
| int reset(void); |
The void can be dropped; in C++, reset() is enough.
| } | ||
| } | ||
|
|
||
| (*out)[id * chunk_count_ + chunk] = static_cast<uint32_t>(sel_index); |
No cast is needed; sel_index is already a uint32.
| entity_.set_neighbors(id, pruned_list); | ||
| lock_pool_[lock_idx].unlock(); | ||
|
|
||
| ret = inter_insert(id, pruned_list, ctx); |
Use return inter_insert...; otherwise the error is lost.
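The pattern being flagged, reduced to a standalone sketch. `inter_insert_stub` is a hypothetical stand-in that fails on an empty neighbor list; the surrounding function in the PR is not fully visible, so the "dropping" variant is an assumption about its shape.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for inter_insert(): reports an error (-1) when the
// pruned neighbor list is empty, and success (0) otherwise.
inline int inter_insert_stub(const std::vector<unsigned> &pruned_list) {
  return pruned_list.empty() ? -1 : 0;
}

// Shape of the bug: the result is stored but the function still returns 0.
inline int add_node_dropping_error(const std::vector<unsigned> &pruned_list) {
  int ret = inter_insert_stub(pruned_list);
  (void)ret;  // error code is silently discarded
  return 0;
}

// Suggested shape: hand the callee's status straight back to the caller.
inline int add_node(const std::vector<unsigned> &pruned_list) {
  return inter_insert_stub(pruned_list);
}
```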
| return core::IndexError_Unsupported; | ||
| } | ||
|
|
||
| param_ = dynamic_cast<const DiskAnnIndexParam &>(param); |
| return pq_chunk_num_; | ||
| } | ||
|
|
||
| void pq_chunk_num(int pq_chunk_num) { |
| 20 | ||
| )pbdoc"); | ||
| diskann_params | ||
| .def(py::init<int>(), py::arg("list_size") = 10, R"pbdoc( |
The C++ default is 300 and the comment also says the default is 300, but here it is 10.
| Default is ``MetricType.IP`` (inner product). | ||
| max_degree (int):. | ||
| list_size (int): . | ||
| pq_chunk_num (bool): . |
pq_chunk_num (bool): the type is int, not bool.
The attributes referenced in the example, >>> print(params.n_list) / >>> print(params.nprobe), do not exist on the DiskAnn class at all.
Add diskann index into Zvec to lower memory usage in vector search as per the description: #325