[Refactor] IL-generated kernels for UnmanagedMemoryBlock #585

@Nucs

Problem

UnmanagedMemoryBlock.Casting.cs contains 2,228 lines of repetitive type-dispatch code: 144 nested switch cases (12 input types × 12 output types), each wrapping a nearly identical for-loop:

case NPTypeCode.Boolean:
{
    var src = (bool*)source.Address;
    switch (InfoOf<TOut>.NPTypeCode)
    {
        case NPTypeCode.Int32:
            var dst = (int*)ret.Address;
            for (int i = 0; i < len; i++)
                *(dst + i) = Converts.ToInt32(*(src + i));
            break;
        // ... 11 more output types
    }
    break;
}
// ... 11 more input types (144 total combinations)

Issues

| Problem | Impact |
|---|---|
| Code bloat | 2,228 lines for a simple operation |
| Maintenance burden | Changes must be replicated across 144 branches |
| Regen dependency | Uses #if _REGEN template generation |
| No SIMD | Scalar loops where vectorization is possible |
| Cache pollution | 144 code paths = poor instruction cache utilization |

Proposed Solution

Replace the nested switch with IL-generated kernels, following the established ILKernelGenerator pattern:

New API (~20 lines)

public static partial class UnmanagedMemoryBlock
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static IMemoryBlock CastTo(this IMemoryBlock source, NPTypeCode to)
    {
        if (source.TypeCode == to)
            return source.Clone();
        return CastKernelGenerator.Execute(source, to);
    }
}

Kernel Generator (~300 lines)

public static class CastKernelGenerator
{
    private delegate void CastKernel(IntPtr src, IntPtr dst, int count);
    private static readonly ConcurrentDictionary<(NPTypeCode, NPTypeCode), CastKernel> _cache = new();

    public static IMemoryBlock Execute(IMemoryBlock source, NPTypeCode dstType)
    {
        var kernel = _cache.GetOrAdd(
            (source.TypeCode, dstType), 
            key => GenerateKernel(key.Item1, key.Item2));
        
        var dst = AllocateBlock(dstType, source.Count);
        kernel((IntPtr)source.Address, (IntPtr)dst.Address, source.Count);
        return dst;
    }

    private static CastKernel GenerateKernel(NPTypeCode srcType, NPTypeCode dstType)
    {
        var method = new DynamicMethod($"Cast_{srcType}_{dstType}", ...);
        var il = method.GetILGenerator();
        
        // Try SIMD for compatible types (widening, float<->double)
        if (TryEmitSimdCast(il, srcType, dstType))
            return (CastKernel)method.CreateDelegate(typeof(CastKernel));
        
        // Fallback: scalar loop with IL conversion opcodes
        EmitScalarCast(il, srcType, dstType);
        return (CastKernel)method.CreateDelegate(typeof(CastKernel));
    }
}
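
To make the scalar fallback concrete, here is a minimal, self-contained sketch of what an emitted kernel for one type pair (int32 → double) could look like. It is not the proposed EmitScalarCast. The class name ScalarCastSketch and the helper Run are hypothetical, and the DynamicMethod arguments are one plausible choice, not the ones elided in the snippet above.

```csharp
using System;
using System.Reflection.Emit;
using System.Runtime.InteropServices;

public static class ScalarCastSketch
{
    // Same shape as the CastKernel delegate in the proposal.
    public delegate void CastKernel(IntPtr src, IntPtr dst, int count);

    // Emits the equivalent of:
    //   for (int i = 0; i < count; i++) ((double*)dst)[i] = (double)((int*)src)[i];
    public static CastKernel BuildInt32ToDouble()
    {
        var method = new DynamicMethod(
            "Cast_Int32_Double",
            typeof(void),
            new[] { typeof(IntPtr), typeof(IntPtr), typeof(int) },
            typeof(ScalarCastSketch).Module,
            skipVisibility: true);

        ILGenerator il = method.GetILGenerator();
        LocalBuilder i = il.DeclareLocal(typeof(int));
        Label check = il.DefineLabel();
        Label body = il.DefineLabel();

        // i = 0; goto check
        il.Emit(OpCodes.Ldc_I4_0);
        il.Emit(OpCodes.Stloc, i);
        il.Emit(OpCodes.Br, check);

        il.MarkLabel(body);
        // Push destination address: dst + i * sizeof(double)
        il.Emit(OpCodes.Ldarg_1);
        il.Emit(OpCodes.Ldloc, i);
        il.Emit(OpCodes.Conv_I);
        il.Emit(OpCodes.Ldc_I4_8);
        il.Emit(OpCodes.Mul);
        il.Emit(OpCodes.Add);
        // Push source value: *(int*)(src + i * sizeof(int))
        il.Emit(OpCodes.Ldarg_0);
        il.Emit(OpCodes.Ldloc, i);
        il.Emit(OpCodes.Conv_I);
        il.Emit(OpCodes.Ldc_I4_4);
        il.Emit(OpCodes.Mul);
        il.Emit(OpCodes.Add);
        il.Emit(OpCodes.Ldind_I4);
        // Convert and store
        il.Emit(OpCodes.Conv_R8);
        il.Emit(OpCodes.Stind_R8);
        // i++
        il.Emit(OpCodes.Ldloc, i);
        il.Emit(OpCodes.Ldc_I4_1);
        il.Emit(OpCodes.Add);
        il.Emit(OpCodes.Stloc, i);

        // check: if (i < count) goto body
        il.MarkLabel(check);
        il.Emit(OpCodes.Ldloc, i);
        il.Emit(OpCodes.Ldarg_2);
        il.Emit(OpCodes.Blt, body);
        il.Emit(OpCodes.Ret);

        return (CastKernel)method.CreateDelegate(typeof(CastKernel));
    }

    // Hypothetical convenience wrapper: pins managed arrays so the kernel can
    // be exercised without unsafe code.
    public static double[] Run(int[] input)
    {
        var output = new double[input.Length];
        GCHandle hSrc = GCHandle.Alloc(input, GCHandleType.Pinned);
        GCHandle hDst = GCHandle.Alloc(output, GCHandleType.Pinned);
        try
        {
            BuildInt32ToDouble()(hSrc.AddrOfPinnedObject(), hDst.AddrOfPinnedObject(), input.Length);
        }
        finally
        {
            hSrc.Free();
            hDst.Free();
        }
        return output;
    }
}
```

A generic generator would presumably parameterize the element sizes, the Ldind_*/Stind_* opcodes, and the conversion opcode by (srcType, dstType) instead of hard-coding them as this sketch does.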

IL Emission (uses native conversion opcodes)

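// Emits the IL conversion opcode that matches the destination type.
// Note: srcType is not used in this excerpt. Unsigned integer sources need
// Conv_R_Un before Conv_R4/Conv_R8, because those opcodes treat an integer
// on the evaluation stack as signed.
private static void EmitConversion(ILGenerator il, NPTypeCode srcType, NPTypeCode dstType)
{
    switch (dstType)
    {
        case NPTypeCode.Byte:    il.Emit(OpCodes.Conv_U1); break;
        case NPTypeCode.Int16:   il.Emit(OpCodes.Conv_I2); break;
        case NPTypeCode.Int32:   il.Emit(OpCodes.Conv_I4); break;
        case NPTypeCode.Int64:   il.Emit(OpCodes.Conv_I8); break;
        case NPTypeCode.Single:  il.Emit(OpCodes.Conv_R4); break;
        case NPTypeCode.Double:  il.Emit(OpCodes.Conv_R8); break;
        // ... etc
    }
}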
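
One detail the destination-only switch glosses over is source signedness. The sketch below builds the same uint → double conversion twice, with and without Conv_R_Un, to show why the emitter may need to look at srcType as well. The class name ConversionOpcodeSketch and the flag honorUnsignedSource are hypothetical and exist only for illustration.

```csharp
using System;
using System.Reflection.Emit;

public static class ConversionOpcodeSketch
{
    // Builds a uint -> double converter.
    // honorUnsignedSource = true emits Conv_R_Un before Conv_R8, which is
    // what the C# compiler emits for (double)someUInt.
    // honorUnsignedSource = false emits only Conv_R8, which reads the
    // 32-bit value on the stack as a signed integer.
    public static Func<uint, double> BuildUInt32ToDouble(bool honorUnsignedSource)
    {
        var method = new DynamicMethod(
            "UInt32ToDouble", typeof(double), new[] { typeof(uint) });

        ILGenerator il = method.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);
        if (honorUnsignedSource)
            il.Emit(OpCodes.Conv_R_Un);
        il.Emit(OpCodes.Conv_R8);
        il.Emit(OpCodes.Ret);

        return (Func<uint, double>)method.CreateDelegate(typeof(Func<uint, double>));
    }
}
```

Calling BuildUInt32ToDouble(true)(uint.MaxValue) should yield 4294967295, while the variant without Conv_R_Un should yield -1 for the same input, so a 144-pair generator likely needs a small source-type check for unsigned inputs converting to Single or Double.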

Expected Outcome

| Metric | Before | After | Change |
|---|---|---|---|
| Lines of code | 2,228 | ~320 | -86% |
| Type switches | 144 | 2 | -99% |
| For-loops in source | 291 | 0 | -100% |
| SIMD support | None | Yes | New |
| Regen dependency | Yes | No | Removed |

SIMD Opportunities

| Conversion | SIMD Method |
|---|---|
| int32 → int64 | Avx2.ConvertToVector256Int64(Vector128<int>) |
| float → double | Avx.ConvertToVector256Double(Vector128<float>) |
| byte → int32 | Avx2.ConvertToVector256Int32(Vector128<byte>) |
| Same-size reinterpret | Buffer.MemoryCopy |
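
As a rough illustration of the int32 → int64 row, the sketch below widens four elements per iteration with Avx2.ConvertToVector256Int64 and finishes with a scalar tail. It is written in plain C# over spans rather than emitted IL, and the class and method names (SimdWideningSketch, WidenInt32ToInt64) are hypothetical. An IL-emitted kernel would call the same intrinsic on raw pointers.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class SimdWideningSketch
{
    public static void WidenInt32ToInt64(ReadOnlySpan<int> src, Span<long> dst)
    {
        if (dst.Length < src.Length)
            throw new ArgumentException("Destination is shorter than source.", nameof(dst));

        int i = 0;
        if (Avx2.IsSupported)
        {
            // Reinterpret the buffers as whole vectors: 4 ints in, 4 longs out.
            ReadOnlySpan<Vector128<int>> srcVecs = MemoryMarshal.Cast<int, Vector128<int>>(src);
            Span<Vector256<long>> dstVecs = MemoryMarshal.Cast<long, Vector256<long>>(dst);

            for (int v = 0; v < srcVecs.Length; v++)
                dstVecs[v] = Avx2.ConvertToVector256Int64(srcVecs[v]); // sign-extending widen

            i = srcVecs.Length * Vector128<int>.Count;
        }

        // Scalar tail, and the whole range when AVX2 is unavailable.
        for (; i < src.Length; i++)
            dst[i] = src[i];
    }
}
```

The Avx2.IsSupported check keeps the routine correct on hardware without AVX2, which is the same role the scalar fallback plays in GenerateKernel.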

Implementation Plan

  • Create ILKernelGenerator.Cast.cs with scalar conversion loop
  • Add kernel caching with (srcType, dstType) key
  • Implement SIMD paths for widening conversions
  • Implement SIMD paths for float↔double
  • Update UnmanagedMemoryBlock.CastTo to use new generator
  • Add unit tests for all 144 type pairs
  • Remove old UnmanagedMemoryBlock.Casting.cs
  • Update ArrayConvert.cs to reuse cast kernels
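
The caching and pair-coverage steps above can be sketched independently of IL emission. In the sketch below, System.TypeCode stands in for NPTypeCode, the generated kernel is a no-op placeholder, and all names (KernelCacheSketch, WarmAllPairs, GeneratedCount) are hypothetical. It wraps each entry in Lazy<T> because ConcurrentDictionary.GetOrAdd may invoke its value factory more than once under contention. Whether that matters for the real generator is a judgment call, since duplicate kernels would be functionally identical.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public static class KernelCacheSketch
{
    public delegate void CastKernel(IntPtr src, IntPtr dst, int count);

    private static readonly ConcurrentDictionary<(TypeCode Src, TypeCode Dst), Lazy<CastKernel>> Cache =
        new ConcurrentDictionary<(TypeCode Src, TypeCode Dst), Lazy<CastKernel>>();

    private static int _generated;

    // Number of times Generate actually ran.
    public static int GeneratedCount => Volatile.Read(ref _generated);

    // Returns the cached kernel for the pair, generating it on first use.
    public static CastKernel Get(TypeCode src, TypeCode dst) =>
        Cache.GetOrAdd(
            (src, dst),
            key => new Lazy<CastKernel>(() => Generate(key.Src, key.Dst))).Value;

    // Placeholder for the real IL generation step.
    private static CastKernel Generate(TypeCode src, TypeCode dst)
    {
        Interlocked.Increment(ref _generated);
        return (s, d, count) => { };
    }

    // Requests a kernel for every (src, dst) combination, as a test over all
    // type pairs would, and returns how many pairs were visited.
    public static int WarmAllPairs(TypeCode[] types)
    {
        int pairs = 0;
        foreach (TypeCode s in types)
        {
            foreach (TypeCode d in types)
            {
                Get(s, d);
                pairs++;
            }
        }
        return pairs;
    }
}
```

With 12 supported types, WarmAllPairs would visit the 144 pairs mentioned above. A real test would additionally run each kernel and compare its output against the existing Converts-based behavior.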

Complexity Assessment

| Aspect | Difficulty | Notes |
|---|---|---|
| IL emission basics | Easy | Copy patterns from ILKernelGenerator.Binary.cs |
| Conversion opcodes | Easy | IL has native Conv_* opcodes |
| Decimal handling | Medium | Requires Convert.ToDecimal() call |
| SIMD widening | Medium | Well-documented intrinsics |
| Testing 144 pairs | Tedious | Straightforward but time-consuming |
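
For the Decimal row, there is no Conv_* opcode that targets System.Decimal, so the kernel has to emit a method call instead. The sketch below shows one way that could look for a single value, using Convert.ToDecimal(int). The class name DecimalCastSketch is hypothetical, and the real generator would presumably embed the same call inside its loop body.

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;

public static class DecimalCastSketch
{
    public static Func<int, decimal> BuildInt32ToDecimal()
    {
        // Resolve the exact overload Convert.ToDecimal(int).
        MethodInfo toDecimal = typeof(Convert).GetMethod(
            nameof(Convert.ToDecimal), new[] { typeof(int) });
        if (toDecimal == null)
            throw new MissingMethodException("Convert.ToDecimal(int) not found.");

        var method = new DynamicMethod(
            "Int32ToDecimal", typeof(decimal), new[] { typeof(int) });

        ILGenerator il = method.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);
        il.Emit(OpCodes.Call, toDecimal); // static call replaces a Conv_* opcode
        il.Emit(OpCodes.Ret);

        return (Func<int, decimal>)method.CreateDelegate(typeof(Func<int, decimal>));
    }
}
```

Because decimal is a 16-byte struct, a loop-based kernel would store the result with Stobj rather than a Stind_* opcode, which is part of why this row is rated Medium rather than Easy.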

Related Files

Will be deleted:

  • src/NumSharp.Core/Backends/Unmanaged/UnmanagedMemoryBlock.Casting.cs (2,228 lines)

Will be simplified:

  • src/NumSharp.Core/Utilities/ArrayConvert.cs (can reuse cast kernels)

New file:

  • src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Cast.cs (~300 lines)

References

  • Existing pattern: ILKernelGenerator.Binary.cs, ILKernelGenerator.Unary.cs
  • Design doc: docs/examples/CastKernel_Proposal.cs
  • Parent tracking issue: docs/ISSUE_IL_MIGRATION.md

Labels

  • NumPy 2.x Compliance: Aligns behavior with NumPy 2.x (NEPs, breaking changes)
  • core: Internal engine: Shape, Storage, TensorEngine, iterators
  • performance: Performance improvements or optimizations
