Commit 178ad06

mivertowski and claude committed
docs: Expand documentation for Metal backend support
- Add complete Metal backend API documentation (docs/api/backends/metal.md) - Update exception docs with MetalError and MSLCompilationError - Update accelerator docs with METAL DeviceType and metal_available - Update unified-buffer docs with .metal property and mark_metal_dirty() - Update API index with MetalBackend in package structure - Update home page with Metal features and installation tabs - Update installation guide with Metal requirements and options - Update memory-management article with Metal unified memory notes - Update gpu-computing article with multi-backend examples - Update gpu-optimization guide with Metal-specific tips - Update mkdocs.yml navigation to include Metal backend page - Fix version mismatch: update __version__ to 0.2.0 in __init__.py 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
1 parent e4c324e commit 178ad06

12 files changed

Lines changed: 774 additions & 55 deletions


docs/api/backends/metal.md

Lines changed: 509 additions & 0 deletions
Large diffs are not rendered by default.

docs/api/core/accelerator.md

Lines changed: 34 additions & 2 deletions
@@ -27,6 +27,7 @@ class DeviceType(Enum):
     """Type of compute device."""
     CPU = "cpu"
     CUDA = "cuda"
+    METAL = "metal"  # macOS/Apple Silicon
 ```
 
 ### DeviceInfo
@@ -65,6 +66,13 @@ def cuda_available() -> bool:
     """Check if CUDA is available."""
 ```
 
+### metal_available
+
+```python
+def metal_available() -> bool:
+    """Check if Metal is available (macOS only)."""
+```
+
 ## Properties
 
 ### devices
@@ -91,6 +99,14 @@ def cuda_available(self) -> bool:
     """Whether CUDA devices are available."""
 ```
 
+### metal_available
+
+```python
+@property
+def metal_available(self) -> bool:
+    """Whether Metal devices are available (macOS only)."""
+```
+
 ### current_device
 
 ```python
@@ -151,6 +167,20 @@ else:
     print("Running in CPU mode")
 ```
 
+### Check Metal Availability (macOS)
+
+```python
+from pydotcompute.core.accelerator import metal_available, get_accelerator
+
+if metal_available():
+    print("Metal is available on Apple Silicon!")
+    acc = get_accelerator()
+    for device in acc.devices:
+        if device.device_type.name == "METAL":
+            print(f"  GPU: {device.name}")
+            print(f"  GPU Cores: {device.multiprocessor_count}")
+```
+
 ### Memory Monitoring
 
 ```python
@@ -169,6 +199,8 @@ if acc.cuda_available:
 ## Notes
 
 - The `Accelerator` is a singleton - use `get_accelerator()` to access it
-- CPU is always available as device 0
-- CUDA devices are numbered starting from 1 when available
+- CPU is always available as a fallback device
+- CUDA devices are available on systems with NVIDIA GPUs
+- Metal devices are available on macOS with Apple Silicon (M1/M2/M3/M4)
 - Memory info for CPU returns system RAM information
+- Metal memory info includes MLX cache and peak memory usage
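
The new availability checks compose into a simple backend-selection pattern. A minimal sketch, assuming `cuda_available()` is importable from `pydotcompute.core.accelerator` alongside the documented `metal_available()` and `get_accelerator()`:

```python
from pydotcompute.core.accelerator import (
    cuda_available,    # assumed to sit alongside metal_available at module level
    get_accelerator,
    metal_available,
)

# Prefer a GPU backend when one is present; CPU is always available as a fallback.
if metal_available():
    backend = "metal"  # Apple Silicon (macOS)
elif cuda_available():
    backend = "cuda"   # NVIDIA GPU
else:
    backend = "cpu"

print(f"Selected backend: {backend}")
for device in get_accelerator().devices:
    print(f"  {device.device_type.name}: {device.name}")
```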

docs/api/core/unified-buffer.md

Lines changed: 54 additions & 5 deletions
@@ -4,7 +4,7 @@ Host-device memory abstraction with lazy synchronization.
 
 ## Overview
 
-`UnifiedBuffer` provides a unified view of memory that can exist on both host (CPU) and device (GPU). It tracks which copy is current and automatically synchronizes when needed.
+`UnifiedBuffer` provides a unified view of memory that can exist on host (CPU), CUDA device, and Metal device (macOS). It tracks which copy is current and automatically synchronizes when needed.
 
 ```python
 from pydotcompute import UnifiedBuffer
@@ -79,13 +79,26 @@ def host(self) -> np.ndarray:
 @property
 def device(self) -> Any:
     """
-    Get device (GPU) view of data.
+    Get device (CUDA GPU) view of data.
 
     Automatically syncs from host if host is dirty.
     Returns CuPy array if CUDA available, else NumPy array.
     """
 ```
 
+### metal
+
+```python
+@property
+def metal(self) -> Any:
+    """
+    Get Metal (Apple GPU) view of data.
+
+    Automatically syncs from host if host is dirty.
+    Returns MLX array if Metal available (macOS only).
+    """
+```
+
 ### shape
 
 ```python
@@ -145,7 +158,14 @@ def mark_host_dirty(self) -> None:
 
 ```python
 def mark_device_dirty(self) -> None:
-    """Mark device data as modified (host copy is stale)."""
+    """Mark CUDA device data as modified (host copy is stale)."""
+```
+
+### mark_metal_dirty
+
+```python
+def mark_metal_dirty(self) -> None:
+    """Mark Metal device data as modified (host copy is stale)."""
 ```
 
 ### fill
@@ -230,20 +250,49 @@ buf.sync_to_device()
 print(buf.state)  # SYNCHRONIZED
 ```
 
+### Metal Example (macOS)
+
+```python
+from pydotcompute import UnifiedBuffer
+import numpy as np
+
+buf = UnifiedBuffer((1000,), dtype=np.float32)
+
+# Initialize on host
+buf.host[:] = np.random.randn(1000).astype(np.float32)
+buf.mark_host_dirty()
+
+# Access on Metal (automatically syncs from host)
+metal_array = buf.metal  # Returns MLX array
+
+# Compute on Metal
+import mlx.core as mx
+result = mx.sum(metal_array)
+mx.eval(result)
+print(f"Sum: {float(result)}")
+
+# After Metal modifies data
+buf.mark_metal_dirty()
+```
+
 ## Performance Tips
 
-1. **Minimize Transfers**: Access `host` or `device` properties sparingly - each access may trigger a sync
+1. **Minimize Transfers**: Access `host`, `device`, or `metal` properties sparingly - each access may trigger a sync
 
 2. **Batch Operations**: Do all host work, then sync once, then do all device work
 
-3. **Use Pinned Memory**: For buffers that transfer frequently, use `pinned=True`
+3. **Use Pinned Memory**: For CUDA buffers that transfer frequently, use `pinned=True`
 
 4. **Explicit Sync**: For performance-critical code, use explicit `sync_to_device()` / `sync_to_host()` instead of relying on lazy sync
 
 5. **Check State**: Use `state` property to understand when syncs will occur
 
+6. **Unified Memory on Metal**: Apple Silicon's unified memory architecture makes Metal transfers virtually free
+
 ## Notes
 
 - On systems without CUDA, `device` returns the same NumPy array as `host`
+- On systems without Metal (non-macOS), `metal` raises an error
 - Pinned memory requires CUDA and may be limited by system resources
 - Large buffers should use explicit sync to avoid unexpected latency
+- Metal uses unified memory on Apple Silicon, eliminating physical data transfers
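
Performance tip 4 above deserves a concrete illustration. A minimal sketch of the explicit-sync pattern, using only `UnifiedBuffer` methods documented in this diff:

```python
import numpy as np

from pydotcompute import UnifiedBuffer

buf = UnifiedBuffer((4096,), dtype=np.float32)

# Do all host-side work first...
buf.host[:] = np.arange(4096, dtype=np.float32)
buf.mark_host_dirty()

# ...then sync once, explicitly, instead of paying a lazy sync later.
buf.sync_to_device()

# Device-side work now sees an already-synchronized buffer.
device_view = buf.device  # no implicit transfer triggered here

# After the device writes results, flag the host copy as stale and sync back.
buf.mark_device_dirty()
buf.sync_to_host()
print(buf.state)  # SYNCHRONIZED
```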

docs/api/exceptions.md

Lines changed: 38 additions & 1 deletion
@@ -37,7 +37,9 @@ PyDotComputeError
 │   └── MessageValidationError
 ├── BackendError
 │   ├── BackendNotAvailableError
-│   └── BackendExecutionError
+│   ├── BackendExecutionError
+│   └── MetalError
+│       └── MSLCompilationError
 └── MemoryError
     ├── AllocationError
     └── OutOfMemoryError
@@ -240,6 +242,41 @@ class BackendExecutionError(BackendError):
         super().__init__(backend, f"Execution failed: {reason}")
 ```
 
+### MetalError
+
+```python
+class MetalError(BackendError):
+    """Raised for Metal-specific errors on macOS."""
+
+    def __init__(self, message: str) -> None:
+        super().__init__("Metal", message)
+```
+
+**When raised:**
+
+- Metal/MLX not available on the system
+- Metal memory allocation fails
+- Metal operations fail
+
+### MSLCompilationError
+
+```python
+class MSLCompilationError(MetalError):
+    """Raised when Metal Shading Language compilation fails."""
+
+    def __init__(self, shader_name: str, error_message: str) -> None:
+        self.shader_name = shader_name
+        self.error_message = error_message
+        super().__init__(
+            f"Failed to compile MSL shader '{shader_name}': {error_message}"
+        )
+```
+
+**When raised:**
+
+- Custom MSL shader compilation fails
+- Invalid Metal shader syntax
+
 ## Memory Exceptions
 
 ### MemoryError
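
Because `MSLCompilationError` subclasses `MetalError`, callers can catch shader failures narrowly and other Metal failures broadly. A hedged sketch; the `pydotcompute.exceptions` import path and the `build_pipeline()` helper are illustrative assumptions, not documented API:

```python
# Import path is an assumption; the diff documents the classes, not their module.
from pydotcompute.exceptions import MetalError, MSLCompilationError


def build_pipeline(name: str, source: str):
    """Hypothetical stand-in for whatever compiles MSL in the Metal backend."""
    raise MSLCompilationError(name, "unknown type name 'flaot4'")


try:
    build_pipeline("vector_add", "kernel void vector_add(...) {}")
except MSLCompilationError as exc:
    # Narrow handler: a specific shader failed to compile.
    print(f"shader={exc.shader_name!r} error={exc.error_message!r}")
except MetalError as exc:
    # Broad handler: any other Metal-specific failure (allocation, availability).
    print(f"Metal error: {exc}")
```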

docs/api/index.md

Lines changed: 3 additions & 1 deletion
@@ -22,7 +22,8 @@ pydotcompute/
 ├── backends/           # Compute backends
 │   ├── base.py         # Backend interface
 │   ├── cpu.py          # CPU simulation
-│   └── cuda.py         # CUDA backend
+│   ├── cuda.py         # CUDA backend
+│   └── metal.py        # Metal backend (macOS)
 └── decorators/         # API decorators
     ├── kernel.py       # @kernel decorator
     ├── ring_kernel.py  # @ring_kernel decorator
@@ -51,6 +52,7 @@ pydotcompute/
 - **[Backend Interface](backends/base.md)**: Abstract backend API
 - **[CPUBackend](backends/cpu.md)**: CPU simulation backend
 - **[CUDABackend](backends/cuda.md)**: NVIDIA GPU backend
+- **[MetalBackend](backends/metal.md)**: Apple Silicon GPU backend (macOS)
 
 ### Decorators

docs/articles/background/gpu-computing.md

Lines changed: 42 additions & 12 deletions
@@ -4,6 +4,12 @@ Understanding GPU architecture and why persistent kernels matter.
 
 ## GPU Architecture
 
+PyDotCompute supports multiple GPU backends:
+
+- **CUDA**: NVIDIA GPUs (Windows, Linux)
+- **Metal**: Apple Silicon GPUs via MLX (macOS)
+- **CPU**: Fallback simulation for development/testing
+
 ### CPU vs GPU
 
 ```
@@ -173,23 +179,46 @@ Efficiency: 10/12 = 83%
 
 ### Unified Memory
 
-PyDotCompute's `UnifiedBuffer` uses CUDA Unified Memory:
+PyDotCompute's `UnifiedBuffer` abstracts memory across backends:
 
-```python
-from pydotcompute import UnifiedBuffer
+=== "CUDA"
 
-# Single buffer, accessible from both host and device
-buf = UnifiedBuffer((1000,), dtype=np.float32)
+    ```python
+    from pydotcompute import UnifiedBuffer
 
-# Host access
-buf.host[:] = data  # Automatic page migration
+    # Single buffer, accessible from both host and device
+    buf = UnifiedBuffer((1000,), dtype=np.float32)
 
-# Device access
-result = kernel(buf.device)  # Data migrates to GPU
+    # Host access
+    buf.host[:] = data  # Automatic page migration
 
-# Host access again
-output = buf.host[:]  # Data migrates back
-```
+    # Device access
+    result = kernel(buf.device)  # Data migrates to GPU
+
+    # Host access again
+    output = buf.host[:]  # Data migrates back
+    ```
+
+=== "Metal (macOS)"
+
+    ```python
+    from pydotcompute import UnifiedBuffer
+
+    # On Apple Silicon, memory is truly unified
+    buf = UnifiedBuffer((1000,), dtype=np.float32)
+
+    # Host access
+    buf.host[:] = data
+
+    # Metal access (no physical transfer needed!)
+    metal_array = buf.metal  # Returns MLX array
+
+    # CPU and GPU share the same physical memory
+    output = buf.host[:]  # Virtually free
+    ```
+
+!!! tip "Apple Silicon Advantage"
+    Apple Silicon's unified memory architecture means CPU and GPU share the same physical memory. This eliminates the traditional host-device transfer bottleneck, making Metal particularly efficient for streaming workloads.
 
 ## Kernel Launch Overhead
 
@@ -276,6 +305,7 @@ PyDotCompute addresses these GPU challenges:
 | State management | Manual | Automatic |
 | Synchronization | Explicit | Message-based |
 | Memory tracking | Manual | UnifiedBuffer |
+| Backend portability | Vendor-specific | Multi-backend (CUDA, Metal, CPU) |
 
 ## Next Steps
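
The new portability row summarizes the tabbed examples above: one `UnifiedBuffer` code path can serve whichever backend is present. A minimal sketch, reusing the `metal_available()` check and buffer properties documented elsewhere in this commit:

```python
import numpy as np

from pydotcompute import UnifiedBuffer
from pydotcompute.core.accelerator import metal_available

buf = UnifiedBuffer((1000,), dtype=np.float32)
buf.host[:] = np.random.randn(1000).astype(np.float32)
buf.mark_host_dirty()

# One code path, any backend: take whichever device view exists here.
if metal_available():
    device_view = buf.metal   # MLX array on Apple Silicon
else:
    device_view = buf.device  # CuPy array with CUDA, NumPy on CPU fallback
```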
