MDEV-37009 bulk insert into vector index by shabbann · Pull Request #5150 · MariaDB/server

shabbann · 2026-05-31T04:08:20Z

Add support for bulk insertion into MHNSW vector indexes during ALTER TABLE

gemini-code-assist

Code Review

This pull request introduces bulk insert support for hierarchical navigable small world (mhnsw) indexes during table copy operations. The reviewer identified several critical issues, including compilation errors from accessing bulk_insert_active through the wrong pointer, an uninitialized member variable, and an ignored return value during bulk insert initialization. Suggestions were also made to reset the active flag on failure and correct formatting to match the repository's style.

gemini-code-assist · 2026-05-31T04:10:12Z

+    DBUG_ASSERT(s->hlindexes() == (hlindex != NULL));
+    if (hlindex && hlindex->in_use)
+    {
+        if (hlindex->bulk_insert_active)
+            return mhnsw_bulk_insert_row(this, key_info + s->keys);
+        else
+            return mhnsw_insert(this, key_info + s->keys);
+    }
+    return 0;


The member bulk_insert_active was added to the TABLE struct in sql/table.h, but here it is accessed via hlindex->bulk_insert_active. Since hlindex is not of type TABLE*, this will cause a compilation error. It should be accessed directly as bulk_insert_active (or this->bulk_insert_active).

Additionally, please update the indentation to 2 spaces to match the repository's implicit style guide.

DBUG_ASSERT(s->hlindexes() == (hlindex != NULL)); if (hlindex && hlindex->in_use) { if (bulk_insert_active) return mhnsw_bulk_insert_row(this, key_info + s->keys); else return mhnsw_insert(this, key_info + s->keys); } return 0;

gemini-code-assist · 2026-05-31T04:10:12Z

+int TABLE::hlindexes_bulk_insert_begin(ha_rows rows)
+{
+    if (hlindex && hlindex->in_use)
+    {
+        hlindex->bulk_insert_active= true;
+        return mhnsw_bulk_insert_begin(this, key_info + s->keys, rows);
+    }
+    return 0;
+}
+
+int TABLE::hlindexes_bulk_insert_end()
+{
+    if (hlindex && hlindex->in_use)
+    {
+        hlindex->bulk_insert_active= false;
+        return mhnsw_bulk_insert_end(this, key_info + s->keys);
+    }
+    return 0;
+}


There are three issues here:

bulk_insert_active is a member of TABLE, not hlindex, so accessing it via hlindex->bulk_insert_active will cause a compilation error.

If mhnsw_bulk_insert_begin fails (returns a non-zero error code), bulk_insert_active should be reset to false so that subsequent inserts do not incorrectly attempt bulk insertion.

The indentation should be updated to 2 spaces to match the repository's implicit style guide.

int TABLE::hlindexes_bulk_insert_begin(ha_rows rows) { if (hlindex && hlindex->in_use) { bulk_insert_active= true; int err= mhnsw_bulk_insert_begin(this, key_info + s->keys, rows); if (err) bulk_insert_active= false; return err; } return 0; } int TABLE::hlindexes_bulk_insert_end() { if (hlindex && hlindex->in_use) { bulk_insert_active= false; return mhnsw_bulk_insert_end(this, key_info + s->keys); } return 0; }

gemini-code-assist · 2026-05-31T04:10:13Z

+  if (!to->s->long_unique_table)
  {
-    to->file->ha_start_bulk_insert(from->file->stats.records,
-                                   ignore ? 0 : HA_CREATE_UNIQUE_INDEX_BY_SORT);
-    bulk_insert_started= 1;
+      to->file->ha_start_bulk_insert(from->file->stats.records,
+              ignore ? 0 : HA_CREATE_UNIQUE_INDEX_BY_SORT);
+      bulk_insert_started= 1;
+
+      if (to->s->hlindexes())
+      {
+          to->hlindexes_bulk_insert_begin(from->file->stats.records);
+          hlindex_bulk_started= 1;
+      }
  }


The return value of to->hlindexes_bulk_insert_begin is currently ignored. If the bulk insert initialization fails, we should handle the error, print it, and avoid setting hlindex_bulk_started to 1 so that we do not proceed with the bulk insert or attempt to end it later.

Additionally, please update the indentation to 2 spaces to match the repository's implicit style guide.

if (!to->s->long_unique_table) { to->file->ha_start_bulk_insert(from->file->stats.records, ignore ? 0 : HA_CREATE_UNIQUE_INDEX_BY_SORT); bulk_insert_started= 1; if (to->s->hlindexes()) { if ((error= to->hlindexes_bulk_insert_begin(from->file->stats.records))) { to->file->print_error(error, MYF(0)); } else { hlindex_bulk_started= 1; } } }

gemini-code-assist · 2026-05-31T04:10:13Z

+  if (hlindex_bulk_started && to->hlindexes_bulk_insert_end() && error <= 0)
+  {
+      if (!thd->is_error())
+          to->file->print_error(my_errno, MYF(0));
+      error= 1;
+  }
+  hlindex_bulk_started=0;


Please update the indentation to 2 spaces and add proper spacing around operators (e.g., hlindex_bulk_started = 0;) to match the repository's implicit style guide.

if (hlindex_bulk_started && to->hlindexes_bulk_insert_end() && error <= 0) { if (!thd->is_error()) to->file->print_error(my_errno, MYF(0)); error= 1; } hlindex_bulk_started= 0;

gemini-code-assist · 2026-05-31T04:10:13Z

  */
  bool alias_name_used;              /* true if table_name is alias */
  bool get_fields_in_item_tree;      /* Signal to fix_field */
+  bool bulk_insert_active;           /* mhnsw bulk_insert_started flag */


The new member variable bulk_insert_active is not initialized. To prevent potential uninitialized memory bugs, it is highly recommended to initialize it to false directly in the struct definition using C++ member initialization.

bool bulk_insert_active= false; /* mhnsw bulk_insert_started flag */

…orrectly set the grefs already faster by ~15%

iMineLink

Thank you for sharing the first draft @shabbann, I did a preliminary review and found minor things and a missing initialization of MHNSW_Share::bulk_active worth checking.
Possibly out of scope, but it may be interesting to enable falling back to normal insert if some memory budget is exceeded for the bulk insert, to avoid unexpected OOM issues.

iMineLink · 2026-06-05T10:16:03Z

@@ -510,7 +510,7 @@ class MHNSW_Share : public Sql_alloc
  const uint M;
  metric_type metric;
  bool use_subdist;
-
+  bool bulk_active;


Can you explicitely initialize this to false?

iMineLink · 2026-06-05T10:16:38Z

+
+      if (to->s->hlindexes())
+      {
+          to->hlindexes_bulk_insert_begin(from->file->stats.records);


I think error handling is needed here.

iMineLink · 2026-06-05T10:18:31Z

+      return err;
+  }
+
+  if (int err= graph->file->ha_end_bulk_insert())


I think this needs to be placed inside a SCOPE_EXIT or similar, to avoid premature exit on error if node->save() fails in the loop above.

iMineLink · 2026-06-05T10:21:34Z

+  if (my_init_dynamic_array(PSI_INSTRUMENT_MEM, &bulk->nodes, sizeof(FVectorNode*),
+                            rows + rows * 0.1, rows, MYF(0)))
+  {
+    delete bulk;


I think this delete bulk statements and the one few lines below can be removed.

iMineLink · 2026-06-05T10:24:20Z

@@ -1012,6 +1012,8 @@ int FVectorNode::load_from_record(TABLE *graph)
  FVector *vec_ptr= FVector::align_ptr(tref() + tref_len());
  memcpy(vec_ptr->data(), v->ptr(), v->length());
  vec_ptr->postprocess(ctx->use_subdist, ctx->vec_len);
+  if (ctx->metric == COSINE)
+    vec_ptr->abs2= 0.5f;


I see you already prepared #5184 to fix this separately, good.

iMineLink · 2026-06-05T11:08:53Z

+    delete bulk;
+    return err;
+  }
+


Here you may add DBUG_ASSERT(!bulk->ctx->start); to document that the bulk build assumes an empty target graph

shabbann · 2026-06-08T12:04:16Z

Thank you for sharing the first draft @shabbann, I did a preliminary review and found minor things and a missing initialization of MHNSW_Share::bulk_active worth checking.

@iMineLink
Sorry somehow I didn't get any notifications with the comments I just saw it now when I opened to check the PR. Thank you for the quick review. I will work on it.

Possibly out of scope, but it may be interesting to enable falling back to normal insert if some memory budget is exceeded for the bulk insert, to avoid unexpected OOM issues.

Yes good idea thank you for bringing it up. I will work on a way to implement it.

shabbann · 2026-06-10T19:58:39Z

+  bulk->ctx= ctx;
+  bulk->estimated_rows= rows;
+  if (my_init_dynamic_array(PSI_INSTRUMENT_MEM, &bulk->nodes, sizeof(FVectorNode*),
+                            rows + rows / 10, rows, MYF(0)))


@iMineLink I forgot to ask you about this when I pushed the commit last week

so here I know that rows is an estimate so it can be less than actual rows by 50 or something, and then we will have to reallocate the whole array and make its size 2*rows

So I added rows/10 to make sure this doesn't happen. Is there is a better way to do it?

It's OK like this, I couldn't find myself a better way. FYI, it's also not approximate for all engines (for InnoDB it is), check HA_STATS_RECORDS_IS_EXACT. Leaving a comment here would help future readers on the why 10% margin is added to rows.

iMineLink

Hi! Thanks for checking my previous review, I have a few more comments on what's already here in the draft. Keep up the good work!

iMineLink · 2026-06-11T08:14:21Z

+  MHNSW_Share *ctx;
+  int err= MHNSW_Share::acquire(&ctx, table, true);
+  if (err && err != HA_ERR_END_OF_FILE && err != HA_ERR_KEY_NOT_FOUND)
+    return err;


On error here, you need to release the context as well (check mhnsw_insert(), but here we release only on error)

iMineLink · 2026-06-11T08:20:58Z

+  }
+
+  bulk->ctx= ctx;
+  bulk->estimated_rows= rows;


If this is now write-only, it can be removed

iMineLink · 2026-06-11T08:25:22Z

+Level	Code	Message
+Note	1105	MHNSW: Bulk insert disabled because estimated memory usage (1335571) exceeds mhnsw_max_cache_size (1048576). Falling back to normal insert.
+# Test search using the fallback-built index to ensure it is healthy and correct
+select id, vec_distance_euclidean(v, concat(repeat(x'00', 3996), x'0000803f')) d from t1 order by d limit 3;


To make this result deterministic, you can set limit 1 to filter only d=0

iMineLink · 2026-06-11T08:26:00Z

+# Adding index with small cache size should trigger fallback and show a warning
+alter table t1 add vector index (v) m=200;
+Warnings:
+Note	1105	MHNSW: Bulk insert disabled because estimated memory usage (1335571) exceeds mhnsw_max_cache_size (1048576). Falling back to normal insert.


Exact memory usage will depend on the used engine I guess, maybe substitute this value with a placeholder to make the test pass on all engines

iMineLink · 2026-06-11T08:27:55Z

+id	d
+2	0
+1	1
+3	1.41421


Here I think both id=3 and id=4 have the same distance to the target, you can limit 2 to get deterministic results

iMineLink · 2026-06-11T08:38:59Z

+                        "exceeds mhnsw_max_cache_size (%llu). Falling back to normal insert.",
+                        (ulonglong)estimated_mem, (ulonglong)mhnsw_max_cache_size);
+    ctx->release(table);
+    return 1;


Here maybe instead of using returning error 1, you can for example reset the context to null and return 0, to signal to the caller that the condition is "fallback to normal insert" rather than a "true error". This requires a bit of restructuring and is minor though (the caller should set bulk_insert_active only when the context is not-null).

iMineLink · 2026-06-11T08:44:43Z

+  bulk->ctx= ctx;
+  bulk->estimated_rows= rows;
+  if (my_init_dynamic_array(PSI_INSTRUMENT_MEM, &bulk->nodes, sizeof(FVectorNode*),
+                            rows + rows / 10, rows, MYF(0)))


It's OK like this, I couldn't find myself a better way. FYI, it's also not approximate for all engines (for InnoDB it is), check HA_STATS_RECORDS_IS_EXACT. Leaving a comment here would help future readers on the why 10% margin is added to rows.

- removed error return (1) and refactored hlindex_bulk_insert_begin to use context instead

shabbann · 2026-06-19T18:33:45Z

 {
  mysql_mutex_t cache_lock;     // for node_cache and stats
-  mysql_mutex_t node_lock[8];
+  mysql_mutex_t node_lock[32];


My CPU(AMD Ryzen 7 5800H) got 16 threads so currently 32 here works without any issues. I don't know how we can decide what value to choose though.

shabbann · 2026-06-19T18:39:12Z

-  neighbors.num= 0;

-  while (pq.elements() && neighbors.num < max_neighbor_connections)
+  size_t temp_num = 0;


I know in normal insert this code is redundant and could slow it a bit (seconds). Should we use if(bulk) do this or create a new function select_neighbors_bulk to keep the code clean?

shabbann · 2026-06-19T18:44:05Z

-    SCOPE_EXIT([memroot_sv](){ root_free_to_savepoint(&memroot_sv); });
+  uint N= std::thread::hardware_concurrency();
+  uint total_nodes= bulk->nodes.elements - 1;
+  uint workers= std::min(N, total_nodes);


Currently, I use all available threads.

shabbann · 2026-06-19T19:09:06Z

+    args[i].start_idx = current_start;
+    args[i].end_idx = current_start + count;
+    args[i].error= 0;
+    current_start += count;


The problem with this approach is that we can't guarantee that graph layers is updated correctly, and we may end up jumping directly from max_layer=L0 to L5. This could slightly degrade recall, but I can't find a solution to this

shabbann marked this pull request as draft May 31, 2026 04:09

gemini-code-assist Bot reviewed May 31, 2026

View reviewed changes

vuvova requested review from iMineLink and vuvova May 31, 2026 08:35

vuvova added the GSoC label May 31, 2026

gkodinov added the External Contribution All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements. label Jun 1, 2026

shabbann added 2 commits June 4, 2026 09:38

init

a0e7cf3

with one thread we fully build the graph in memory and then save to c…

ffaa6a1

…orrectly set the grefs already faster by ~15%

shabbann force-pushed the MDEV-37009-bulk-insert-into-vector-index branch from 2c44b07 to ffaa6a1 Compare June 4, 2026 09:16

iMineLink reviewed Jun 5, 2026

View reviewed changes

shabbann added 2 commits June 10, 2026 22:43

Fix index corruption and empty index bulk initialization

79be8c4

Fall back to normal insert if bulk insert exceeds memory budget

d1e45cc

shabbann commented Jun 10, 2026

View reviewed changes

iMineLink requested changes Jun 11, 2026

View reviewed changes

shabbann added 2 commits June 11, 2026 15:56

- fixed bulk test issues

db869c7

- removed error return (1) and refactored hlindex_bulk_insert_begin to use context instead

implemented parallel build

fcdc5f3

shabbann force-pushed the MDEV-37009-bulk-insert-into-vector-index branch from d51f90b to fcdc5f3 Compare June 19, 2026 18:26

shabbann commented Jun 19, 2026

View reviewed changes

+              id	d
+	0
+	1
+	1.41421

Uh oh!

Conversation

shabbann commented May 31, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 31, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 31, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 31, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 31, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 31, 2026

Choose a reason for hiding this comment

Uh oh!

iMineLink left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

shabbann commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

shabbann Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

iMineLink left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

4 participants

shabbann commented Jun 8, 2026 •

edited

Loading

shabbann Jun 10, 2026 •

edited

Loading