SdcaLogisticRegression test fails on macOS ARM64 Release — LogLoss exceeds 0.5 threshold #7600

@rokonec

System Information:

  • OS: macOS 15 (Sequoia) ARM64
  • Configuration: Release
  • Helix Queue: osx.15.arm64.open
  • .NET: net8.0

Describe the bug

The SdcaLogisticRegression test fails on macOS ARM64 Release builds with:

Assert.InRange() Failure: Value not in range
Range:  (0 - 0.5)
Actual: 0.50113968629658268

The test at test/Microsoft.ML.Tests/TrainerEstimators/SdcaTests.cs:86 asserts metrics.LogLoss is in range (0, 0.5). On macOS ARM64 Release, the LogLoss is 0.5011 — exceeding the upper bound.

Key observations:

  • Passes on Windows x64 (Debug and Release), Linux x64, and macOS ARM64 Debug
  • Fails only on macOS ARM64 Release
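  • The value is 0.0011 above the bound, a relative deviation of about 0.23%. Note that 0.5 is the test's chosen bound, not a naive-baseline value: a predictor that always outputs probability 0.5 scores ln 2 ≈ 0.693 with the natural log (1.0 with log base 2), so a LogLoss of 0.5011 does not by itself show the model is worse than that baseline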
  • This was previously hidden because the macOS Helix queues (OSX.13.Arm64.Open) were decommissioned, so macOS CI wasn't running at all. Updating to osx.15.arm64.open in PR #7599 ("Update macOS Helix queues from decommissioned OSX.13 to osx.15") exposed this
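
For reference, the arithmetic behind the observations above can be checked with a short sketch. This is plain Python for illustration, not ML.NET code; `mean_log_loss` is a hypothetical helper, and the balanced 50/50 label set is an assumption chosen only to show what a constant 0.5 predictor scores. Which log base the metric uses changes the scale of the numbers, not the comparison.

```python
import math

def mean_log_loss(labels, probs, base=math.e):
    """Mean negative log-likelihood of binary labels, given predicted P(label == 1)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p_true = p if y == 1 else 1.0 - p  # probability assigned to the true class
        total += -math.log(p_true, base)
    return total / len(labels)

# Assumption: a balanced label set, scored by a predictor that always says 0.5.
labels = [1, 0] * 50
constant = [0.5] * len(labels)
baseline_nats = mean_log_loss(labels, constant)          # natural log
baseline_bits = mean_log_loss(labels, constant, base=2)  # log base 2

# Values taken from the assertion failure in this issue.
bound = 0.5
actual = 0.50113968629658268
excess = actual - bound
relative_excess = excess / bound

print(f"constant-0.5 baseline: {baseline_nats:.4f} nats, {baseline_bits:.4f} bits")
print(f"excess over bound: {excess:.6f} ({relative_excess:.3%} of the bound)")
```

The constant predictor lands at ln 2 or 1.0 depending on the base, both well above 0.5, so the failing value says the model missed the test's bound by a small margin rather than that it fell behind a coin flip.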

Root cause analysis:

The SDCA solver is iterative and uses floating-point arithmetic that is sensitive to:

  • ARM64 Release JIT optimizations (FMA instruction fusion, instruction reordering)
  • Low l2Regularization: 0.001f making the optimizer more sensitive to numerical drift
  • Small dataset (100 samples) amplifying per-iteration rounding differences

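The test uses MLContext(seed: 1) for determinism, but a fixed seed only controls random choices, not floating-point rounding. The working hypothesis is that JIT optimization differences between Debug and Release on ARM64 lead the optimizer to converge to a slightly different (worse) solution; this still needs to be confirmed (see fix 1 below).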
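
To make the FMA point concrete, here is a minimal sketch of why a fused multiply-add can give a different result from a separate multiply and add. It is Python, not the SDCA code, and it emulates the fused operation with exact rational arithmetic followed by a single rounding; the inputs are hand-picked to make the effect visible. Whether the JIT actually fuses operations on the SDCA code path is not established here.

```python
from fractions import Fraction

def unfused_multiply_add(a, b, c):
    """a * b is rounded to a double, then + c is rounded again (two roundings)."""
    return a * b + c

def fused_multiply_add(a, b, c):
    """Emulated FMA: compute a * b + c exactly, then round once to a double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)

# Hand-picked inputs: the exact product is 1 - 2**-60, which rounds to 1.0
# as a double, so the unfused path loses the low-order term entirely.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

unfused = unfused_multiply_add(a, b, c)
fused = fused_multiply_add(a, b, c)

print("unfused:", unfused)
print("fused:  ", fused)
```

A single difference of this size is negligible, but an iterative solver feeds each result into the next update, so two builds that round differently can follow slightly different trajectories even with the same seed.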

Possible fixes:

  1. Investigate ARM64 numerical stability in the SDCA implementation — determine if FMA or other optimizations cause meaningful quality degradation
  2. Relax the test bound from 0.5 to e.g. 0.55 — acknowledging cross-platform variance for this tiny dataset while still validating the model trains to a reasonable state
  3. Increase training data — 100 samples is very small and amplifies numerical sensitivity
  4. Add [Trait] to skip on ARM64 Release if this is considered acceptable platform variance
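
As a rough guide for weighing fixes 2 and 3, the sketch below estimates how little it takes to move a mean LogLoss by the observed amount. It assumes the metric is a mean over 100 evaluated samples and uses the natural log; the starting probability of 0.9 is hypothetical and chosen only for illustration.

```python
import math

# Values from this issue; n_samples assumes the metric averages over all 100 samples.
n_samples = 100
bound = 0.5
actual = 0.50113968629658268

# Total per-sample loss that must be added to lift the mean from the bound to the actual value.
total_excess = n_samples * (actual - bound)

# Hypothetical: one sample whose true-class probability starts at 0.9.
# Find the probability it must drop to so that it alone accounts for the excess.
p_before = 0.9
p_after = p_before * math.exp(-total_excess)
single_sample_shift = -math.log(p_after) - (-math.log(p_before))

def mean_shift(n):
    """Change in the mean caused by that same single-sample shift with n samples."""
    return single_sample_shift / n

print(f"total excess across samples: {total_excess:.4f}")
print(f"one sample moving from {p_before} to {p_after:.3f} accounts for all of it")
print(f"mean shift with 100 samples:  {mean_shift(100):.6f}")
print(f"mean shift with 1000 samples: {mean_shift(1000):.6f}")
```

Under these assumptions, one prediction drifting from 0.9 to roughly 0.8 is enough to cross the bound at 100 samples, while the same drift moves the mean by a tenth as much at 1000 samples. That supports either widening the bound or enlarging the dataset, but it does not replace the investigation in fix 1.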

To Reproduce:

Run on macOS ARM64 in Release configuration:

dotnet test test/Microsoft.ML.Tests/Microsoft.ML.Tests.csproj --filter "FullyQualifiedName~SdcaLogisticRegression" -c Release

Labels: bug, untriaged
