Commit 178ad06

mivertowski and claude committed
docs: Expand documentation for Metal backend support
- Add complete Metal backend API documentation (docs/api/backends/metal.md) - Update exception docs with MetalError and MSLCompilationError - Update accelerator docs with METAL DeviceType and metal_available - Update unified-buffer docs with .metal property and mark_metal_dirty() - Update API index with MetalBackend in package structure - Update home page with Metal features and installation tabs - Update installation guide with Metal requirements and options - Update memory-management article with Metal unified memory notes - Update gpu-computing article with multi-backend examples - Update gpu-optimization guide with Metal-specific tips - Update mkdocs.yml navigation to include Metal backend page - Fix version mismatch: update __version__ to 0.2.0 in __init__.py 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
1 parent e4c324e commit 178ad06

12 files changed

Lines changed: 774 additions & 55 deletions


docs/api/backends/metal.md

Lines changed: 509 additions & 0 deletions
Large diffs are not rendered by default.

docs/api/core/accelerator.md

Lines changed: 34 additions & 2 deletions
@@ -27,6 +27,7 @@ class DeviceType(Enum):
     """Type of compute device."""
     CPU = "cpu"
     CUDA = "cuda"
+    METAL = "metal"  # macOS/Apple Silicon
 ```
 
 ### DeviceInfo
@@ -65,6 +66,13 @@ def cuda_available() -> bool:
     """Check if CUDA is available."""
 ```
 
+### metal_available
+
+```python
+def metal_available() -> bool:
+    """Check if Metal is available (macOS only)."""
+```
+
 ## Properties
 
 ### devices
@@ -91,6 +99,14 @@ def cuda_available(self) -> bool:
     """Whether CUDA devices are available."""
 ```
 
+### metal_available
+
+```python
+@property
+def metal_available(self) -> bool:
+    """Whether Metal devices are available (macOS only)."""
+```
+
 ### current_device
 
 ```python
@@ -151,6 +167,20 @@ else:
     print("Running in CPU mode")
 ```
 
+### Check Metal Availability (macOS)
+
+```python
+from pydotcompute.core.accelerator import metal_available, get_accelerator
+
+if metal_available():
+    print("Metal is available on Apple Silicon!")
+    acc = get_accelerator()
+    for device in acc.devices:
+        if device.device_type.name == "METAL":
+            print(f"  GPU: {device.name}")
+            print(f"  GPU Cores: {device.multiprocessor_count}")
+```
+
 ### Memory Monitoring
 
 ```python
@@ -169,6 +199,8 @@ if acc.cuda_available:
 ## Notes
 
 - The `Accelerator` is a singleton - use `get_accelerator()` to access it
-- CPU is always available as device 0
-- CUDA devices are numbered starting from 1 when available
+- CPU is always available as a fallback device
+- CUDA devices are available on systems with NVIDIA GPUs
+- Metal devices are available on macOS with Apple Silicon (M1/M2/M3/M4)
 - Memory info for CPU returns system RAM information
+- Metal memory info includes MLX cache and peak memory usage
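
The new availability checks compose into a simple backend-selection pattern. A minimal sketch, assuming `cuda_available()` is importable from `pydotcompute.core.accelerator` alongside the documented `metal_available()` and `get_accelerator()`:

```python
from pydotcompute.core.accelerator import (
    cuda_available,    # assumed to sit alongside metal_available at module level
    get_accelerator,
    metal_available,
)

# Prefer a GPU backend when one is present; CPU is always available as a fallback.
if metal_available():
    backend = "metal"  # Apple Silicon (macOS)
elif cuda_available():
    backend = "cuda"   # NVIDIA GPU
else:
    backend = "cpu"

print(f"Selected backend: {backend}")
for device in get_accelerator().devices:
    print(f"  {device.device_type.name}: {device.name}")
```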

docs/api/core/unified-buffer.md

Lines changed: 54 additions & 5 deletions
@@ -4,7 +4,7 @@ Host-device memory abstraction with lazy synchronization.
 
 ## Overview
 
-`UnifiedBuffer` provides a unified view of memory that can exist on both host (CPU) and device (GPU). It tracks which copy is current and automatically synchronizes when needed.
+`UnifiedBuffer` provides a unified view of memory that can exist on host (CPU), CUDA device, and Metal device (macOS). It tracks which copy is current and automatically synchronizes when needed.
 
 ```python
 from pydotcompute import UnifiedBuffer
@@ -79,13 +79,26 @@ def host(self) -> np.ndarray:
 @property
 def device(self) -> Any:
     """
-    Get device (GPU) view of data.
+    Get device (CUDA GPU) view of data.
 
     Automatically syncs from host if host is dirty.
     Returns CuPy array if CUDA available, else NumPy array.
     """
 ```
 
+### metal
+
+```python
+@property
+def metal(self) -> Any:
+    """
+    Get Metal (Apple GPU) view of data.
+
+    Automatically syncs from host if host is dirty.
+    Returns MLX array if Metal available (macOS only).
+    """
+```
+
 ### shape
 
 ```python
@@ -145,7 +158,14 @@ def mark_host_dirty(self) -> None:
 
 ```python
 def mark_device_dirty(self) -> None:
-    """Mark device data as modified (host copy is stale)."""
+    """Mark CUDA device data as modified (host copy is stale)."""
+```
+
+### mark_metal_dirty
+
+```python
+def mark_metal_dirty(self) -> None:
+    """Mark Metal device data as modified (host copy is stale)."""
 ```
 
 ### fill
@@ -230,20 +250,49 @@ buf.sync_to_device()
 print(buf.state)  # SYNCHRONIZED
 ```
 
+### Metal Example (macOS)
+
+```python
+from pydotcompute import UnifiedBuffer
+import numpy as np
+
+buf = UnifiedBuffer((1000,), dtype=np.float32)
+
+# Initialize on host
+buf.host[:] = np.random.randn(1000).astype(np.float32)
+buf.mark_host_dirty()
+
+# Access on Metal (automatically syncs from host)
+metal_array = buf.metal  # Returns MLX array
+
+# Compute on Metal
+import mlx.core as mx
+result = mx.sum(metal_array)
+mx.eval(result)
+print(f"Sum: {float(result)}")
+
+# After Metal modifies data
+buf.mark_metal_dirty()
+```
+
 ## Performance Tips
 
-1. **Minimize Transfers**: Access `host` or `device` properties sparingly - each access may trigger a sync
+1. **Minimize Transfers**: Access `host`, `device`, or `metal` properties sparingly - each access may trigger a sync
 
 2. **Batch Operations**: Do all host work, then sync once, then do all device work
 
-3. **Use Pinned Memory**: For buffers that transfer frequently, use `pinned=True`
+3. **Use Pinned Memory**: For CUDA buffers that transfer frequently, use `pinned=True`
 
 4. **Explicit Sync**: For performance-critical code, use explicit `sync_to_device()` / `sync_to_host()` instead of relying on lazy sync
 
 5. **Check State**: Use `state` property to understand when syncs will occur
 
+6. **Unified Memory on Metal**: Apple Silicon's unified memory architecture makes Metal transfers virtually free
+
 ## Notes
 
 - On systems without CUDA, `device` returns the same NumPy array as `host`
+- On systems without Metal (non-macOS), `metal` raises an error
 - Pinned memory requires CUDA and may be limited by system resources
 - Large buffers should use explicit sync to avoid unexpected latency
+- Metal uses unified memory on Apple Silicon, eliminating physical data transfers
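
Performance tip 4 above deserves a concrete illustration. A minimal sketch of the explicit-sync pattern, using only `UnifiedBuffer` methods documented in this diff:

```python
import numpy as np

from pydotcompute import UnifiedBuffer

buf = UnifiedBuffer((4096,), dtype=np.float32)

# Do all host-side work first...
buf.host[:] = np.arange(4096, dtype=np.float32)
buf.mark_host_dirty()

# ...then sync once, explicitly, instead of paying a lazy sync later.
buf.sync_to_device()

# Device-side work now sees an already-synchronized buffer.
device_view = buf.device  # no implicit transfer triggered here

# After the device writes results, flag the host copy as stale and sync back.
buf.mark_device_dirty()
buf.sync_to_host()
print(buf.state)  # SYNCHRONIZED
```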

docs/api/exceptions.md

Lines changed: 38 additions & 1 deletion
@@ -37,7 +37,9 @@ PyDotComputeError
 │   └── MessageValidationError
 ├── BackendError
 │   ├── BackendNotAvailableError
-│   └── BackendExecutionError
+│   ├── BackendExecutionError
+│   └── MetalError
+│       └── MSLCompilationError
 └── MemoryError
     ├── AllocationError
     └── OutOfMemoryError
@@ -240,6 +242,41 @@ class BackendExecutionError(BackendError):
         super().__init__(backend, f"Execution failed: {reason}")
 ```
 
+### MetalError
+
+```python
+class MetalError(BackendError):
+    """Raised for Metal-specific errors on macOS."""
+
+    def __init__(self, message: str) -> None:
+        super().__init__("Metal", message)
+```
+
+**When raised:**
+
+- Metal/MLX not available on the system
+- Metal memory allocation fails
+- Metal operations fail
+
+### MSLCompilationError
+
+```python
+class MSLCompilationError(MetalError):
+    """Raised when Metal Shading Language compilation fails."""
+
+    def __init__(self, shader_name: str, error_message: str) -> None:
+        self.shader_name = shader_name
+        self.error_message = error_message
+        super().__init__(
+            f"Failed to compile MSL shader '{shader_name}': {error_message}"
+        )
+```
+
+**When raised:**
+
+- Custom MSL shader compilation fails
+- Invalid Metal shader syntax
+
 ## Memory Exceptions
 
 ### MemoryError
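
Because `MSLCompilationError` subclasses `MetalError`, callers can catch shader failures narrowly and other Metal failures broadly. A hedged sketch; the `pydotcompute.exceptions` import path and the `build_pipeline()` helper are illustrative assumptions, not documented API:

```python
# Import path is an assumption; the diff documents the classes, not their module.
from pydotcompute.exceptions import MetalError, MSLCompilationError


def build_pipeline(name: str, source: str):
    """Hypothetical stand-in for whatever compiles MSL in the Metal backend."""
    raise MSLCompilationError(name, "unknown type name 'flaot4'")


try:
    build_pipeline("vector_add", "kernel void vector_add(...) {}")
except MSLCompilationError as exc:
    # Narrow handler: a specific shader failed to compile.
    print(f"shader={exc.shader_name!r} error={exc.error_message!r}")
except MetalError as exc:
    # Broad handler: any other Metal-specific failure (allocation, availability).
    print(f"Metal error: {exc}")
```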

docs/api/index.md

Lines changed: 3 additions & 1 deletion
@@ -22,7 +22,8 @@ pydotcompute/
 ├── backends/           # Compute backends
 │   ├── base.py         # Backend interface
 │   ├── cpu.py          # CPU simulation
-│   └── cuda.py         # CUDA backend
+│   ├── cuda.py         # CUDA backend
+│   └── metal.py        # Metal backend (macOS)
 └── decorators/         # API decorators
     ├── kernel.py       # @kernel decorator
     ├── ring_kernel.py  # @ring_kernel decorator
@@ -51,6 +52,7 @@ pydotcompute/
 - **[Backend Interface](backends/base.md)**: Abstract backend API
 - **[CPUBackend](backends/cpu.md)**: CPU simulation backend
 - **[CUDABackend](backends/cuda.md)**: NVIDIA GPU backend
+- **[MetalBackend](backends/metal.md)**: Apple Silicon GPU backend (macOS)
 
 ### Decorators

docs/articles/background/gpu-computing.md

Lines changed: 42 additions & 12 deletions
@@ -4,6 +4,12 @@ Understanding GPU architecture and why persistent kernels matter.
 
 ## GPU Architecture
 
+PyDotCompute supports multiple GPU backends:
+
+- **CUDA**: NVIDIA GPUs (Windows, Linux)
+- **Metal**: Apple Silicon GPUs via MLX (macOS)
+- **CPU**: Fallback simulation for development/testing
+
 ### CPU vs GPU
 
 ```
@@ -173,23 +179,46 @@ Efficiency: 10/12 = 83%
 
 ### Unified Memory
 
-PyDotCompute's `UnifiedBuffer` uses CUDA Unified Memory:
+PyDotCompute's `UnifiedBuffer` abstracts memory across backends:
 
-```python
-from pydotcompute import UnifiedBuffer
+=== "CUDA"
 
-# Single buffer, accessible from both host and device
-buf = UnifiedBuffer((1000,), dtype=np.float32)
+    ```python
+    from pydotcompute import UnifiedBuffer
 
-# Host access
-buf.host[:] = data  # Automatic page migration
+    # Single buffer, accessible from both host and device
+    buf = UnifiedBuffer((1000,), dtype=np.float32)
 
-# Device access
-result = kernel(buf.device)  # Data migrates to GPU
+    # Host access
+    buf.host[:] = data  # Automatic page migration
 
-# Host access again
-output = buf.host[:]  # Data migrates back
-```
+    # Device access
+    result = kernel(buf.device)  # Data migrates to GPU
+
+    # Host access again
+    output = buf.host[:]  # Data migrates back
+    ```
+
+=== "Metal (macOS)"
+
+    ```python
+    from pydotcompute import UnifiedBuffer
+
+    # On Apple Silicon, memory is truly unified
+    buf = UnifiedBuffer((1000,), dtype=np.float32)
+
+    # Host access
+    buf.host[:] = data
+
+    # Metal access (no physical transfer needed!)
+    metal_array = buf.metal  # Returns MLX array
+
+    # CPU and GPU share the same physical memory
+    output = buf.host[:]  # Virtually free
+    ```
+
+!!! tip "Apple Silicon Advantage"
+    Apple Silicon's unified memory architecture means CPU and GPU share the same physical memory. This eliminates the traditional host-device transfer bottleneck, making Metal particularly efficient for streaming workloads.
 
 ## Kernel Launch Overhead
 
@@ -276,6 +305,7 @@ PyDotCompute addresses these GPU challenges:
 | State management | Manual | Automatic |
 | Synchronization | Explicit | Message-based |
 | Memory tracking | Manual | UnifiedBuffer |
+| Backend portability | Vendor-specific | Multi-backend (CUDA, Metal, CPU) |
 
 ## Next Steps
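
The new portability row summarizes the tabbed examples above: one `UnifiedBuffer` code path can serve whichever backend is present. A minimal sketch, reusing the `metal_available()` check and buffer properties documented elsewhere in this commit:

```python
import numpy as np

from pydotcompute import UnifiedBuffer
from pydotcompute.core.accelerator import metal_available

buf = UnifiedBuffer((1000,), dtype=np.float32)
buf.host[:] = np.random.randn(1000).astype(np.float32)
buf.mark_host_dirty()

# One code path, any backend: take whichever device view exists here.
if metal_available():
    device_view = buf.metal   # MLX array on Apple Silicon
else:
    device_view = buf.device  # CuPy array with CUDA, NumPy on CPU fallback
```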
