Skip to content

MDEV-37009 bulk insert into vector index#5150

Draft
shabbann wants to merge 6 commits into
MariaDB:mainfrom
shabbann:MDEV-37009-bulk-insert-into-vector-index
Draft

MDEV-37009 bulk insert into vector index#5150
shabbann wants to merge 6 commits into
MariaDB:mainfrom
shabbann:MDEV-37009-bulk-insert-into-vector-index

Conversation

@shabbann

Copy link
Copy Markdown
Contributor

Add support for bulk insertion into MHNSW vector indexes during ALTER TABLE

@shabbann shabbann marked this pull request as draft May 31, 2026 04:09

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces bulk insert support for hierarchical navigable small world (mhnsw) indexes during table copy operations. The reviewer identified several critical issues, including compilation errors from accessing bulk_insert_active through the wrong pointer, an uninitialized member variable, and an ignored return value during bulk insert initialization. Suggestions were also made to reset the active flag on failure and correct formatting to match the repository's style.

Comment thread sql/sql_base.cc
Comment on lines +10148 to +10156
DBUG_ASSERT(s->hlindexes() == (hlindex != NULL));
if (hlindex && hlindex->in_use)
{
if (hlindex->bulk_insert_active)
return mhnsw_bulk_insert_row(this, key_info + s->keys);
else
return mhnsw_insert(this, key_info + s->keys);
}
return 0;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The member bulk_insert_active was added to the TABLE struct in sql/table.h, but here it is accessed via hlindex->bulk_insert_active. Since hlindex is not of type TABLE*, this will cause a compilation error. It should be accessed directly as bulk_insert_active (or this->bulk_insert_active).

Additionally, please update the indentation to 2 spaces to match the repository's implicit style guide.

  DBUG_ASSERT(s->hlindexes() == (hlindex != NULL));
  if (hlindex && hlindex->in_use)
  {
    if (bulk_insert_active)
      return mhnsw_bulk_insert_row(this, key_info + s->keys);
    else
      return mhnsw_insert(this, key_info + s->keys);
  }
  return 0;

Comment thread sql/sql_base.cc
Comment on lines +10216 to +10234
int TABLE::hlindexes_bulk_insert_begin(ha_rows rows)
{
if (hlindex && hlindex->in_use)
{
hlindex->bulk_insert_active= true;
return mhnsw_bulk_insert_begin(this, key_info + s->keys, rows);
}
return 0;
}

int TABLE::hlindexes_bulk_insert_end()
{
if (hlindex && hlindex->in_use)
{
hlindex->bulk_insert_active= false;
return mhnsw_bulk_insert_end(this, key_info + s->keys);
}
return 0;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There are three issues here:

  1. bulk_insert_active is a member of TABLE, not hlindex, so accessing it via hlindex->bulk_insert_active will cause a compilation error.
  2. If mhnsw_bulk_insert_begin fails (returns a non-zero error code), bulk_insert_active should be reset to false so that subsequent inserts do not incorrectly attempt bulk insertion.
  3. The indentation should be updated to 2 spaces to match the repository's implicit style guide.
int TABLE::hlindexes_bulk_insert_begin(ha_rows rows)
{
  if (hlindex && hlindex->in_use)
  {
    bulk_insert_active= true;
    int err= mhnsw_bulk_insert_begin(this, key_info + s->keys, rows);
    if (err)
      bulk_insert_active= false;
    return err;
  }
  return 0;
}

int TABLE::hlindexes_bulk_insert_end()
{
  if (hlindex && hlindex->in_use)
  {
    bulk_insert_active= false;
    return mhnsw_bulk_insert_end(this, key_info + s->keys);
  }
  return 0;
}

Comment thread sql/sql_table.cc
Comment on lines +12666 to 12677
if (!to->s->long_unique_table)
{
to->file->ha_start_bulk_insert(from->file->stats.records,
ignore ? 0 : HA_CREATE_UNIQUE_INDEX_BY_SORT);
bulk_insert_started= 1;
to->file->ha_start_bulk_insert(from->file->stats.records,
ignore ? 0 : HA_CREATE_UNIQUE_INDEX_BY_SORT);
bulk_insert_started= 1;

if (to->s->hlindexes())
{
to->hlindexes_bulk_insert_begin(from->file->stats.records);
hlindex_bulk_started= 1;
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The return value of to->hlindexes_bulk_insert_begin is currently ignored. If the bulk insert initialization fails, we should handle the error, print it, and avoid setting hlindex_bulk_started to 1 so that we do not proceed with the bulk insert or attempt to end it later.

Additionally, please update the indentation to 2 spaces to match the repository's implicit style guide.

  if (!to->s->long_unique_table)
  {
    to->file->ha_start_bulk_insert(from->file->stats.records,
                                   ignore ? 0 : HA_CREATE_UNIQUE_INDEX_BY_SORT);
    bulk_insert_started= 1;

    if (to->s->hlindexes())
    {
      if ((error= to->hlindexes_bulk_insert_begin(from->file->stats.records)))
      {
        to->file->print_error(error, MYF(0));
      }
      else
      {
        hlindex_bulk_started= 1;
      }
    }
  }

Comment thread sql/sql_table.cc
Comment on lines +13009 to +13015
if (hlindex_bulk_started && to->hlindexes_bulk_insert_end() && error <= 0)
{
if (!thd->is_error())
to->file->print_error(my_errno, MYF(0));
error= 1;
}
hlindex_bulk_started=0;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Please update the indentation to 2 spaces and add proper spacing around operators (e.g., hlindex_bulk_started = 0;) to match the repository's implicit style guide.

  if (hlindex_bulk_started && to->hlindexes_bulk_insert_end() && error <= 0)
  {
    if (!thd->is_error())
      to->file->print_error(my_errno, MYF(0));
    error= 1;
  }
  hlindex_bulk_started= 0;

Comment thread sql/table.h Outdated
*/
bool alias_name_used; /* true if table_name is alias */
bool get_fields_in_item_tree; /* Signal to fix_field */
bool bulk_insert_active; /* mhnsw bulk_insert_started flag */

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The new member variable bulk_insert_active is not initialized. To prevent potential uninitialized memory bugs, it is highly recommended to initialize it to false directly in the struct definition using C++ member initialization.

  bool bulk_insert_active= false;    /* mhnsw bulk_insert_started flag */

@vuvova vuvova requested review from iMineLink and vuvova May 31, 2026 08:35
@vuvova vuvova added the GSoC label May 31, 2026
@gkodinov gkodinov added the External Contribution All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements. label Jun 1, 2026
@shabbann shabbann force-pushed the MDEV-37009-bulk-insert-into-vector-index branch from 2c44b07 to ffaa6a1 Compare June 4, 2026 09:16

@iMineLink iMineLink left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for sharing the first draft @shabbann, I did a preliminary review and found minor things and a missing initialization of MHNSW_Share::bulk_active worth checking.
Possibly out of scope, but it may be interesting to enable falling back to normal insert if some memory budget is exceeded for the bulk insert, to avoid unexpected OOM issues.

Comment thread sql/vector_mhnsw.cc Outdated
@@ -510,7 +510,7 @@ class MHNSW_Share : public Sql_alloc
const uint M;
metric_type metric;
bool use_subdist;

bool bulk_active;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explicitely initialize this to false?

Comment thread sql/sql_table.cc Outdated

if (to->s->hlindexes())
{
to->hlindexes_bulk_insert_begin(from->file->stats.records);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think error handling is needed here.

Comment thread sql/vector_mhnsw.cc
return err;
}

if (int err= graph->file->ha_end_bulk_insert())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to be placed inside a SCOPE_EXIT or similar, to avoid premature exit on error if node->save() fails in the loop above.

Comment thread sql/vector_mhnsw.cc Outdated
if (my_init_dynamic_array(PSI_INSTRUMENT_MEM, &bulk->nodes, sizeof(FVectorNode*),
rows + rows * 0.1, rows, MYF(0)))
{
delete bulk;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this delete bulk statements and the one few lines below can be removed.

Comment thread sql/vector_mhnsw.cc
@@ -1012,6 +1012,8 @@ int FVectorNode::load_from_record(TABLE *graph)
FVector *vec_ptr= FVector::align_ptr(tref() + tref_len());
memcpy(vec_ptr->data(), v->ptr(), v->length());
vec_ptr->postprocess(ctx->use_subdist, ctx->vec_len);
if (ctx->metric == COSINE)
vec_ptr->abs2= 0.5f;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you already prepared #5184 to fix this separately, good.

Comment thread sql/vector_mhnsw.cc
delete bulk;
return err;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you may add DBUG_ASSERT(!bulk->ctx->start); to document that the bulk build assumes an empty target graph

@shabbann

shabbann commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

Thank you for sharing the first draft @shabbann, I did a preliminary review and found minor things and a missing initialization of MHNSW_Share::bulk_active worth checking.

@iMineLink
Sorry somehow I didn't get any notifications with the comments I just saw it now when I opened to check the PR. Thank you for the quick review. I will work on it.

Possibly out of scope, but it may be interesting to enable falling back to normal insert if some memory budget is exceeded for the bulk insert, to avoid unexpected OOM issues.

Yes good idea thank you for bringing it up. I will work on a way to implement it.

Comment thread sql/vector_mhnsw.cc
bulk->ctx= ctx;
bulk->estimated_rows= rows;
if (my_init_dynamic_array(PSI_INSTRUMENT_MEM, &bulk->nodes, sizeof(FVectorNode*),
rows + rows / 10, rows, MYF(0)))

@shabbann shabbann Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iMineLink I forgot to ask you about this when I pushed the commit last week

so here I know that rows is an estimate so it can be less than actual rows by 50 or something, and then we will have to reallocate the whole array and make its size 2*rows

So I added rows/10 to make sure this doesn't happen. Is there is a better way to do it?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's OK like this, I couldn't find myself a better way. FYI, it's also not approximate for all engines (for InnoDB it is), check HA_STATS_RECORDS_IS_EXACT. Leaving a comment here would help future readers on the why 10% margin is added to rows.

@iMineLink iMineLink left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! Thanks for checking my previous review, I have a few more comments on what's already here in the draft. Keep up the good work!

Comment thread sql/vector_mhnsw.cc
MHNSW_Share *ctx;
int err= MHNSW_Share::acquire(&ctx, table, true);
if (err && err != HA_ERR_END_OF_FILE && err != HA_ERR_KEY_NOT_FOUND)
return err;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On error here, you need to release the context as well (check mhnsw_insert(), but here we release only on error)

Comment thread sql/vector_mhnsw.cc Outdated
}

bulk->ctx= ctx;
bulk->estimated_rows= rows;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is now write-only, it can be removed

Comment thread mysql-test/main/vector_bulk.result Outdated
Level Code Message
Note 1105 MHNSW: Bulk insert disabled because estimated memory usage (1335571) exceeds mhnsw_max_cache_size (1048576). Falling back to normal insert.
# Test search using the fallback-built index to ensure it is healthy and correct
select id, vec_distance_euclidean(v, concat(repeat(x'00', 3996), x'0000803f')) d from t1 order by d limit 3;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make this result deterministic, you can set limit 1 to filter only d=0

Comment thread mysql-test/main/vector_bulk.result Outdated
# Adding index with small cache size should trigger fallback and show a warning
alter table t1 add vector index (v) m=200;
Warnings:
Note 1105 MHNSW: Bulk insert disabled because estimated memory usage (1335571) exceeds mhnsw_max_cache_size (1048576). Falling back to normal insert.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exact memory usage will depend on the used engine I guess, maybe substitute this value with a placeholder to make the test pass on all engines

Comment thread mysql-test/main/vector_bulk.result Outdated
id d
2 0
1 1
3 1.41421

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I think both id=3 and id=4 have the same distance to the target, you can limit 2 to get deterministic results

Comment thread sql/vector_mhnsw.cc Outdated
"exceeds mhnsw_max_cache_size (%llu). Falling back to normal insert.",
(ulonglong)estimated_mem, (ulonglong)mhnsw_max_cache_size);
ctx->release(table);
return 1;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here maybe instead of using returning error 1, you can for example reset the context to null and return 0, to signal to the caller that the condition is "fallback to normal insert" rather than a "true error". This requires a bit of restructuring and is minor though (the caller should set bulk_insert_active only when the context is not-null).

Comment thread sql/vector_mhnsw.cc
bulk->ctx= ctx;
bulk->estimated_rows= rows;
if (my_init_dynamic_array(PSI_INSTRUMENT_MEM, &bulk->nodes, sizeof(FVectorNode*),
rows + rows / 10, rows, MYF(0)))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's OK like this, I couldn't find myself a better way. FYI, it's also not approximate for all engines (for InnoDB it is), check HA_STATS_RECORDS_IS_EXACT. Leaving a comment here would help future readers on the why 10% margin is added to rows.

shabbann added 2 commits June 11, 2026 15:56
- removed error return (1) and refactored hlindex_bulk_insert_begin to use
context instead
@shabbann shabbann force-pushed the MDEV-37009-bulk-insert-into-vector-index branch from d51f90b to fcdc5f3 Compare June 19, 2026 18:26
Comment thread sql/vector_mhnsw.cc
{
mysql_mutex_t cache_lock; // for node_cache and stats
mysql_mutex_t node_lock[8];
mysql_mutex_t node_lock[32];

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My CPU(AMD Ryzen 7 5800H) got 16 threads so currently 32 here works without any issues. I don't know how we can decide what value to choose though.

Comment thread sql/vector_mhnsw.cc
neighbors.num= 0;

while (pq.elements() && neighbors.num < max_neighbor_connections)
size_t temp_num = 0;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know in normal insert this code is redundant and could slow it a bit (seconds). Should we use if(bulk) do this or create a new function select_neighbors_bulk to keep the code clean?

Comment thread sql/vector_mhnsw.cc
SCOPE_EXIT([memroot_sv](){ root_free_to_savepoint(&memroot_sv); });
uint N= std::thread::hardware_concurrency();
uint total_nodes= bulk->nodes.elements - 1;
uint workers= std::min(N, total_nodes);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, I use all available threads.

Comment thread sql/vector_mhnsw.cc
args[i].start_idx = current_start;
args[i].end_idx = current_start + count;
args[i].error= 0;
current_start += count;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with this approach is that we can't guarantee that graph layers is updated correctly, and we may end up jumping directly from max_layer=L0 to L5. This could slightly degrade recall, but I can't find a solution to this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

External Contribution All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements. GSoC

Development

Successfully merging this pull request may close these issues.

4 participants