Skip to content

MDEV-10526: Add binary string support to bitwise operators#5190

Draft
kjarir wants to merge 12 commits into
MariaDB:mainfrom
kjarir:feature/MDEV-10526-bitwise-binary
Draft

MDEV-10526: Add binary string support to bitwise operators#5190
kjarir wants to merge 12 commits into
MariaDB:mainfrom
kjarir:feature/MDEV-10526-bitwise-binary

Conversation

@kjarir

@kjarir kjarir commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Draft for review. Implements byte-by-byte binary string mode for all scalar bitwise operators. Aggregate function
support (BIT_AND/BIT_OR/BIT_XOR) to follow.

Tested:

  • All 6 operators on VARBINARY columns
  • INET6 subnet masking (primary use case)
  • NULL handling
  • Mismatched length error
  • Integer mode backward compatibility
  • _binary introducer triggers binary mode correctly

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for binary-to-binary bitwise operations (AND, OR, XOR, NOT, and bitwise shifts) on binary strings by implementing new handler classes and adding corresponding error messages for mismatched operand sizes. The review feedback highlights several critical issues: a potential integer overflow/wrap-around in the left-shift index calculation on 32-bit systems, incorrect ASCII string conversion of numeric arguments in mixed-type bitwise operations, and potential undefined behavior when handling empty binary strings due to passing a null pointer to memset.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread sql/item_func.cc Outdated
Comment on lines +2266 to +2268
size_t src_idx= i + byte_shift;
if (src_idx < len)
{

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The calculation size_t src_idx = i + byte_shift is susceptible to integer overflow/wrap-around on 32-bit systems if len is very large (e.g., close to size_t max). If i + byte_shift overflows, src_idx wraps around and can evaluate to a value less than len, leading to incorrect data copying from wrapped-around indices instead of zero-filling. To prevent this, perform the bounds check using subtraction (byte_shift < len - i) before computing src_idx.

        if (byte_shift < len - i)
        {
          size_t src_idx= i + byte_shift;

Comment thread sql/item_func.cc
Comment on lines +7875 to +7876
StringBuffer<128> b_buf;
String *b= item->arguments()[1]->val_str(&b_buf);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

In mixed-type bitwise operations (e.g., binary_string & numeric), calling val_str on the numeric argument converts it to its decimal string representation (e.g., 255 becomes the 3-byte ASCII string "255" / 0x323535), rather than its binary representation (e.g., 0xFF). This causes mismatched length errors (e.g., VARBINARY(1) & 255 compares length 1 with length 3) or incorrect bitwise operations if the lengths happen to match. To align with standard bitwise behavior (and MySQL 8.0 compatibility), numeric arguments should be converted to a binary string representation of their integer value, padded or truncated to match the length of the other binary string operand.

Comment thread sql/item_func.cc
Comment on lines +2243 to +2245
size_t len= a->length();

if (to->realloc(len))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When the input binary string a is empty (len == 0), calling to->realloc(0) can return a null pointer or do nothing. Subsequently, passing a null pointer to memset (e.g., memset(out_ptr, 0, len)) is technically undefined behavior in C/C++, even if the length is 0. Adding an early exit for len == 0 avoids this potential undefined behavior and improves efficiency by bypassing unnecessary allocation and loop overhead.

    size_t len= a->length();
    if (len == 0)
    { 
      to->length(0);
      to->set_charset(&my_charset_bin);
      item->null_value= false;
      return to;
    }

    if (to->realloc(len))

@kjarir kjarir force-pushed the feature/MDEV-10526-bitwise-binary branch from fa5cd7c to 862a606 Compare June 6, 2026 19:44
@gkodinov gkodinov added the External Contribution All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements. label Jun 8, 2026

@grooverdan grooverdan left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be off the main branch as a new feature.

from @abarkov

"Would be nice to add tests that uuid, inet6, inet4, geometry are not allowed for bit operations. Later we can probably implement bit operations for uuid, inet6, inet4. But to avoid compatibility problems we need to make sure they return error now."

Comment thread sql/item_func.cc Outdated
Comment thread sql/item_func.cc

bool fix_length_and_dec(Item_handled_func *item) const override
{
item->max_length= item->arguments()[0]->max_length;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from @abarkov
"check that max_length is calculated correctly in all functions"

This will show up in mtr --cursor-protocol on test that have truncated test result output on fields.

Comment thread sql/item_func.cc Outdated
Comment thread sql/item_func.cc Outdated
kjarir added 5 commits June 11, 2026 11:05
Bitwise operators (&, |, ^, ~, <<, >>) previously cast all
arguments to BIGINT, silently truncating values wider than
64 bits. This broke operations on BINARY, VARBINARY, BLOB,
INET6, and UUID columns.

Introduces binary_mode detection in fix_length_and_dec().
When any non-literal argument has STRING_RESULT with binary
charset, operators switch to byte-by-byte processing via
a new Handler_str subclass, returning LONGBLOB of the same
length as the input.

Bare hex literals (x'FF', 0xFF) and bit literals (b'1010')
retain integer mode for backward compatibility.

Existing int/decimal handler classes for Item_func_bit_or
and Item_func_bit_and are moved from item_cmpfunc.cc to
item_func.cc for consistency.

New error codes:
  ER_INVALID_BITWISE_OPERANDS_SIZE
  ER_INVALID_BITWISE_AGGREGATE_OPERANDS_SIZE

Aggregate function support (BIT_AND/BIT_OR/BIT_XOR) to
follow in a subsequent commit.

Closes: MDEV-10526
- Change DERIVATION_IMPLICIT to DERIVATION_COERCIBLE in all
  binary handler fix_length_and_dec() methods
- Update return_type_handler() to vary by max_length using
  blob_type_handler/type_handler_varchar/type_handler_string
  pattern, matching Item_char_typecast_func_handler_fbt_to_binary
- Fix max_length calculation for two-operand operators (&, |, ^)
  to use MY_MAX of both operand lengths
- Remove redundant null_value checks in val_str() — nullptr
  check on val_str() return value is sufficient
- Add early exit for empty string (len == 0) in all handlers
- Fix potential size_t overflow in shift left bounds check:
  use (byte_shift < len - i) instead of (i + byte_shift < len)
- Remove duplicate size_t len variable in xor/and/or handlers
…ates

Item_sum_bit::reset_field() and update_field() previously assumed
integer mode, using int8store/uint8korr to read/write 8 raw bytes
to result_field. In binary mode result_field is a string-typed
field, so this corrupted temp table memory and crashed the server
(SIGSEGV in ha_maria::write_block_record) during GROUP BY queries.

Fix uses Field::store()/val_str() for binary mode, matching the
pattern used by Item_sum_min_max for string aggregates.

Add formal test file mysql-test/main/func_bitops_binary.test
covering:
  - All 6 scalar operators on VARBINARY
  - INET6_ATON subnet masking (primary use case)
  - NULL handling
  - Mismatched length errors
  - Current hex/binary literal behavior (pending mentor decision
    on x'FF' semantics)
  - INET6/UUID CAST restrictions preserved (per Alexander's request)
  - BIT_AND/BIT_OR/BIT_XOR aggregates including GROUP BY
  - Empty result set returns neutral elements (not NULL)
  - 512-byte aggregate size guard
  - Integer mode backward compatibility
  - CREATE TABLE AS SELECT type preservation
Per Alexander Barkov's request, Section 6 verified that uuid and
inet6 cast types continue to error on bitwise operations. This
extends coverage to inet4 and geometry types as well, confirming
all four types (uuid, inet6, inet4, geometry) correctly raise
ER_ILLEGAL_PARAMETER_DATA_TYPE_FOR_OPERATION (4079) and remain
out of scope for this patch.
@kjarir kjarir force-pushed the feature/MDEV-10526-bitwise-binary branch from 221664d to 6b9dd9e Compare June 11, 2026 07:24
@CLAassistant

CLAassistant commented Jun 11, 2026

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

@grooverdan grooverdan changed the base branch from 10.11 to main June 11, 2026 07:28
kjarir added 7 commits June 14, 2026 19:28
Two regressions were introduced by earlier commits:

1. main.gis: BIT_AND/BIT_OR/BIT_XOR on GEOMETRY columns now
   incorrectly raised ER_INVALID_BITWISE_AGGREGATE_OPERANDS_SIZE
   (4266) instead of the correct ER_ILLEGAL_PARAMETER_DATA_TYPE_
   FOR_OPERATION (4079). GEOMETRY has STRING_RESULT with
   my_charset_bin, so it matched our binary_mode detection
   before the existing type-rejection check could run. Fixed
   by excluding MYSQL_TYPE_GEOMETRY from binary_mode detection
   in Item_sum_bit::fix_length_and_dec(), letting it fall
   through to check_type_can_return_int() as before.

2. main.func_json: '1 ^ json_objectagg(NULL, 'a')' now raised
   ER_ILLEGAL_PARAMETER_DATA_TYPE_FOR_OPERATION (4079) instead
   of returning NULL. Our earlier commit changed
   Item_func_bit_operator::check_arguments() from
   check_argument_types_can_return_int() to
   check_type_traditional_scalar(). The latter is stricter:
   Type_handler_blob_json inherits can_return_int()=true from
   its BLOB base but has type_collection()=Type_collection_json
   != type_collection_std, so is_traditional_scalar_type()
   returns false and the stricter check rejected it.

   This change was unnecessary for our binary-mode feature -
   VARBINARY/BLOB/INET6_ATON all pass can_return_int() via the
   base Type_handler default (only GEOMETRY overrides it to
   false), so binary mode detection in fix_length_and_dec()
   (STRING_RESULT + my_charset_bin) works correctly without
   the stricter check_arguments() override. Reverted
   check_arguments() to check_argument_types_can_return_int(),
   restoring pre-patch JSON behavior.

Both fixes verified against main.gis, main.func_json,
func_bitops_binary, and func_bit - all pass.
Per Daniel Black's review feedback:

- Section 11: CREATE...SELECT length-boundary tests verifying
  return_type_handler() returns the correct type at each
  threshold - type_handler_string (<=255 bytes), type_handler_
  varchar (256-65532 bytes), and blob_type_handler (>65532
  bytes via BLOB columns, since VARBINARY max is
  MAX_FIELD_VARCHARLENGTH=65532).

- Section 12: Larger literal tests using SET @var=REPEAT(...)
  with CAST(... AS BINARY(n)) to verify byte-loop and
  max_length logic scale correctly to 100+ byte operands,
  including at the 512-byte aggregate size guard boundary.

- Section 13: COERCIBILITY() validation confirming all 6
  scalar binary-mode operators return coercibility 6
  (DERIVATION_COERCIBLE), validating the earlier review
  fix from DERIVATION_IMPLICIT.

Verified passing under both normal and --cursor-protocol
execution.
Per Daniel Black's reference to commit b5fae7f,
add two standalone CREATE...SELECT LIMIT 0 tests confirming
blob_type_handler() correctly escalates the result type based
on source column size:

  MEDIUMBLOB & MEDIUMBLOB -> mediumblob
  LONGBLOB & LONGBLOB     -> longblob

This complements the existing string/varchar/blob tier tests,
fully covering the return_type_handler boundary behavior shown
in the reference commit.
…ction

Per Alexander Barkov's review, replace the dynamic_cast<Item_hex_constant*>
on real_item() approach with a check on the resolved type_handler()
of each argument:

  longstr_count: count of args whose type_handler() is
    Type_handler_longstr (true string/binary types)
  hybrid_count: count of args whose type_handler() is
    Type_handler_hex_hybrid (0x.. style numeric-string hybrids)
  binary_count: count of args with my_charset_bin collation

Binary mode now activates only when ALL operands are true
non-hybrid binary strings (longstr_count == N && binary_count == N
&& hybrid_count == 0), matching exactly what Alexander and Sergei
decided: only two true binary strings switch to string mode,
preserving backward compatibility for hex/numeric literal usage.

This is also more robust than the previous approach - it correctly
sees through wrapper expressions like COALESCE(0x30), which
previously entered binary mode incorrectly since real_item()
doesn't resolve through wrapper functions. Verified:
  HEX(COALESCE(0x30) & 1) now correctly returns 0 (integer mode)
  instead of 30 (binary mode).

Applied to all 6 scalar operators (&, |, ^, ~, <<, >>) and
Item_sum_bit (BIT_AND/BIT_OR/BIT_XOR). The explicit GEOMETRY
exclusion in Item_sum_bit is no longer needed since GEOMETRY's
type_handler() is not Type_handler_longstr.

Verified against Alexander's exact test case:
  a & 51 = 0, a & 0x33 = 0, a & x'33' = (binary),
  a & _latin1'3' = 0, a & _binary'3' = (binary), 0x33 & 0x33 = 51

All passing under func_bitops_binary, func_bit, func_json, gis,
and both normal/cursor-protocol execution.
…nd lengths

Per Alexander Barkov's review, replaced my_error() with
push_warning_printf() (WARN_LEVEL_WARN) for length mismatches
in the & | ^ binary string handlers. This matches how MariaDB
handles similar truncation/conversion issues elsewhere -
THD::raise_condition() automatically escalates WARN_LEVEL_WARN
to WARN_LEVEL_ERROR when really_abort_on_warning() is true
(i.e. strict SQL mode on INSERT/UPDATE/DELETE).

Behavior:
  SELECT with mismatched lengths (non-strict) -> NULL + warning
  4265, statement continues
  INSERT with mismatched lengths (strict mode) -> warning
  escalates to ER_INVALID_BITWISE_OPERANDS_SIZE error,
  statement aborts

Updated Section 4 to test the strict-mode INSERT error path,
and added Section 14 to test the non-strict SELECT warning
path for all three operators (&, |, ^).

Note: ER_INVALID_BITWISE_AGGREGATE_OPERANDS_SIZE (the 512-byte
aggregate size guard) is unchanged and still uses my_error() -
this is a separate hard limit, not a length-mismatch comparison,
and was not part of Alexander's review comment.
Per Daniel Black's request, implement window function support for
BIT_AND/BIT_OR/BIT_XOR in binary mode, mirroring the existing
integer-mode sliding-window technique.

Adds a dynamically-sized uint32 counter array
(m_binary_bit_counters, sized m_binary_length * 8) allocated via
thd->calloc() at setup_window_func() time. Each counter tracks
how many active rows in the current window frame have a 1 bit
at that position, enabling correct row removal as the frame
slides without re-scanning the whole window - the same technique
the existing 64-bit integer bit_counters[] array uses, extended
to arbitrary byte lengths.

add_binary_as_window()/remove_binary_as_window() increment/
decrement per-bit counters. set_bits_from_counters() in each
subclass reconstructs the result byte-by-byte:
  OR:  bit set if counter > 0
  AND: bit set if counter == num_values_added
  XOR: bit set if counter is odd

Length mismatches in window mode produce the same warning
(ER_INVALID_BITWISE_OPERANDS_SIZE) as the non-window path.

Verified with both full-partition aggregation and sliding-frame
queries - the sliding frame test confirms remove_binary_as_window
correctly evicts expired rows from the counters rather than just
accumulating monotonically.

Closes the window-function gap Daniel Black flagged as Priority 1
in his post-review feedback.
Per Daniel Black's request, implement Spider pushdown for
BIT_AND/BIT_OR/BIT_XOR, allowing each remote shard to compute
its own partial aggregate, which is then merged locally via
direct_add() rather than fetching and re-aggregating all rows
on the coordinator. This mirrors the existing pattern used by
SUM/COUNT/MIN/MAX.

Correctness rationale: BIT_AND/OR/XOR are associative and
commutative, so BIT_OP(all_rows) == BIT_OP(BIT_OP(shard1),
BIT_OP(shard2), ...) - identical to how SUM pushdown is valid.

sql/item_sum.h, sql/item_sum.cc:
  Added direct_added/direct_reseted_field/direct_sum_is_null
  state and direct_bits (integer)/direct_str_value (binary)
  storage to Item_sum_bit, following the Item_sum_sum pattern.
  Two direct_add() overloads handle both modes. add() in each
  subclass (Or/And/Xor) checks direct_added first and merges
  with the operator-specific logic (|=, &=, ^=) for both
  integer and binary mode. reset_field()/update_field() updated
  to consume direct values the same way Item_sum_sum does.
  DBUG_ASSERT(!as_window_function) guards direct_add(), since
  Spider has no window function pushdown awareness (confirmed
  via code search - zero references in storage/spider/).

storage/spider/spd_db_mysql.cc:
  SUM_BIT_FUNC added to the pushdown serialization case in
  open_item_sum_func(), removed from the skip list. Reuses the
  existing func_name_cstring() ('bit_or(', 'bit_and(', 'bit_xor(')
  serialization path shared with COUNT/SUM/MIN/MAX.

storage/spider/spd_db_conn.cc:
  SUM_BIT_FUNC case added to spider_db_fetch_for_item_sum_func().
  Binary mode (result_type()==STRING_RESULT) reads the raw row
  via append_to_str() and calls direct_add(String*, bool).
  Integer mode required a fix: row->val_int() uses atoi(), a
  32-bit signed parser with silent overflow on values beyond
  INT_MAX - found via testing that pushdown ON vs OFF returned
  different results for a BIGINT UNSIGNED column with high bits
  set. Fixed by parsing the raw row string with my_strtoll10()
  instead (MariaDB's proper 64-bit parser, already used elsewhere
  in Spider for SHOW TABLE STATUS fields), preserving the full
  unsigned 64-bit value via bit-pattern-preserving signed/unsigned
  cast.

Verified with a new test (direct_aggregate_bit.test) using a
2-shard Spider table with BIGINT UNSIGNED and VARBINARY columns:
  - Confirmed via Spider_direct_aggregate status counter that
    pushdown genuinely occurs (not silently falling back)
  - Confirmed via filtered mysql.general_log inspection that
    bit_or(/bit_and(/bit_xor( are the actual SQL sent to each
    remote shard
  - Confirmed direct_aggregate=1 and direct_aggregate=0 produce
    identical results for both binary and integer mode (proves
    correctness independent of pushdown)
  - Full spider suite (94 tests) and core regression suite
    (func_bitops_binary, func_bit, func_json, gis) pass with
    zero regressions
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

External Contribution All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements. GSoC

Development

Successfully merging this pull request may close these issues.

4 participants