Skip to content

Add invert_permute operator with SYCL implementation for XPU#71

Open
aagalleg wants to merge 4 commits into
intel:mainfrom
aagalleg:feat/invert_permute
Open

Add invert_permute operator with SYCL implementation for XPU#71
aagalleg wants to merge 4 commits into
intel:mainfrom
aagalleg:feat/invert_permute

Conversation

@aagalleg

@aagalleg aagalleg commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

This PR introduces the invert_permute operator to fbgemm-xpu, enabling inverse permutations on Intel XPU devices.

Changes

Core Implementation

  • SYCL Kernel (invert_permute_kernel.cpp/h): High-performance SYCL implementation that computes inverse permutations on XPU devices
  • Operator Registration (ops_registry.cpp): Registers the operator with PyTorch's dispatch system using conditional schema registration to avoid conflicts
  • Python API (ops.py): Clean Python wrapper function with type hints for easy integration

Infrastructure

  • Build System (CMakeLists.txt): Added SYCL kernel to build configuration
  • Torch Library Utilities (torch_library.h): Helper utilities for schema existence checking and operator registration
  • Cleanup: Removed .gitkeep files from populated directories

Testing

  • Comprehensive Test Suite (test_invert_permute.py):
    • Correctness validation for int32/int64 data types
    • Edge cases (empty, single element, identity, reverse permutations)
    • Input validation and error handling
    • CPU-XPU consistency verification

cc: @manuelhsantana, @flezaalv

aagalleg added 4 commits June 18, 2026 22:37
- Add invert_permute kernel to CMake build
- Implement invert_permute Python wrapper in ops.py
- Register invert_permute operator with schema existence check
- Add torch_library.h utility for schema validation
Add SYCL/XPU kernel implementation for invert_permute operation.
Add complete test coverage for invert_permute operator on XPU
devices, covering correctness, validation, parity, and performance.

Test coverage includes:
- Correctness tests for int32/int64 with edge cases (empty, single
  element, identity, reverse, random permutations)
- Input validation tests for invalid dimensions and dtypes
- Meta function tests for torch.compile compatibility
- PyTorch opcheck validation for operator conventions
- Parametric tests with varying sizes (1 to 1M elements)
- CPU-XPU parity tests to ensure consistent results
- Performance benchmarks measuring execution time and bandwidth

@dvrogozh dvrogozh left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aagalleg, I missed the beginning of the PRs chain and left all the comments in #73. Comments are pretty generic, can you please, refer to them?

For the convenience, please, mark all follow up PRs as drafts - in this way we will clearly know what should be reviewed first.

@dvrogozh dvrogozh left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to inline comments:

  1. Repository has lint CI - please, address identified issues.
  2. What we implement and merge must be tested under CI. And IMHO doing that by engineers who implemented the code is the best variant rather than offloading that to dedicated CI engineers. Thus, please, modify .github/workflows/ci.yml and add a test section for FBGEMM following the pattern outlined for torchcodec (see link below). The test job for FBGEMM will be much simpler. Target only bmg for now is fine. Note that fbgemm wheel is already getting build by CI.

test-torchcodec:

.has_value();
}

} // namespace fbgemm_xpu::utils::torch No newline at end of file

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add empty line in the end of file.

// SYCL Kernel Functor Implementations
// ============================================================================

void InvertPermuteKernelInt32::operator()(const sycl::nd_item<1>& item) const {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems templetable kernel over the operand type. Why don't we do that especially considering that the reference CUDA kernel is doing exactly that [1]?

[1] https://github.com/pytorch/FBGEMM/blob/8744183e4bd7390a674a0d8736bf4d7ef2d0516e/fbgemm_gpu/src/sparse_ops/sparse_invert_permute.cu#L15


__all__ = [
"dense_embedding_codegen_lookup_function",
"invert_permute",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I am confused. Why do we expose these XPU specific APIs from fbgemm xpu? who is supposed to use them? My understanding was that we are just register low level entrypoints for XPU key so that standard fbgemm APIs start to work on XPU device. Shouldn't that be that user calls fbgemm::invert_permute for example?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants