-
Notifications
You must be signed in to change notification settings - Fork 1
feat(cuda): GPU batch inverse #658
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
ColoCarletti
wants to merge
80
commits into
main
Choose a base branch
from
feat/cuda-pr5-batch-invert
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
80 commits
Select commit
Hold shift + click to select a range
d1a0abf
add first cuda files
ColoCarletti 79634ff
fmt
ColoCarletti ac6fbb5
fix clippy
ColoCarletti 2ceb3b0
gpu 2nd part
ColoCarletti affceb1
feat(cuda): Round 1 GPU LDE+commit dispatch + device-resident handles
ColoCarletti 01172f2
merge main
ColoCarletti c4627e1
Merge branch 'main' into feat/cuda-pr2-r1-gpu-commits
ColoCarletti 01aa5e4
comments fix
ColoCarletti cfc5c19
Merge branch 'main' into feat/cuda-pr2-r1-gpu-commits
MauroToscano ea5696f
Update crypto/stark/src/gpu_lde.rs
ColoCarletti a8cf265
Update crypto/stark/src/gpu_lde.rs
ColoCarletti fb8d31f
Update crypto/stark/src/gpu_lde.rs
ColoCarletti a79f2b5
Update crypto/stark/src/gpu_lde.rs
ColoCarletti 761a2c0
Update crypto/stark/src/gpu_lde.rs
ColoCarletti e066e9d
address reviews
ColoCarletti 7d3d0f0
fix review comments
ColoCarletti cf80771
Merge remote-tracking branch 'origin/main' into feat/cuda-pr2-r1-gpu-…
ColoCarletti 71aba0d
address doc comment suggestions
ColoCarletti 83d91b8
Merge branch 'main' into feat/cuda-pr2-r1-gpu-commits
ColoCarletti 34cae4b
fix
ColoCarletti f076bf4
Merge branch 'main' into feat/cuda-pr2-r1-gpu-commits
gabrielbosio a2cde0f
Pass replay transcript to bus-balance call in verify_vm_minimal
gabrielbosio 46c305b
Update crypto/math-cuda/src/device.rs
ColoCarletti aca3dca
Merge branch 'main' into feat/cuda-pr2-r1-gpu-commits
ColoCarletti 63d7c00
Update crypto/math-cuda/src/device.rs
ColoCarletti eb16c02
Update crypto/math-cuda/src/device.rs
ColoCarletti 66925b1
Update crypto/math-cuda/src/device.rs
ColoCarletti 4e6daf3
Update crypto/math-cuda/src/lde.rs
ColoCarletti 4cd27d9
Update crypto/math-cuda/src/lde.rs
ColoCarletti 5fe390f
Update crypto/math-cuda/src/lde.rs
ColoCarletti 5819930
Update crypto/math-cuda/src/lde.rs
ColoCarletti 33f7c36
Update crypto/math-cuda/src/lde.rs
ColoCarletti 49d3607
Merge branch 'main' into feat/cuda-pr2-r1-gpu-commits
ColoCarletti 99cd59c
add pr3 code
ColoCarletti c52521e
Merge branch 'main' into feat/cuda-pr2-r1-gpu-commits
ColoCarletti 828ee16
fix comments
ColoCarletti 19a36a0
Merge remote-tracking branch 'origin/feat/cuda-pr2-r1-gpu-commits' in…
ColoCarletti 80e1ecb
fix sync stream after D2H in merke.rs
ColoCarletti 3ead022
Merge branch 'main' into feat/cuda-pr3
ColoCarletti 04dd872
fix comments
ColoCarletti 8a67e33
address review feedback
ColoCarletti 1f9394d
Update crypto/math-cuda/src/barycentric.rs
ColoCarletti b07999c
Update crypto/math-cuda/src/barycentric.rs
ColoCarletti c575017
fix imports
ColoCarletti 0ffc661
Merge branch 'feat/cuda-pr3' of github.com:yetanotherco/lambda_vm int…
ColoCarletti 0777f1e
Merge branch 'main' into feat/cuda-pr3
ColoCarletti 2c7b0de
cuda integration tests
ColoCarletti 2f1fe2d
address review feedback
ColoCarletti f254eae
batch invert kernels and parity test
ColoCarletti 84cc04b
DEEP composition kernel
ColoCarletti 0ba7745
fri
ColoCarletti 7046a40
gpu lde
ColoCarletti 065c8f9
gpu_lde
ColoCarletti 7d2810f
fri
ColoCarletti cc840cd
add tests
ColoCarletti fac3974
fix
ColoCarletti bc61a00
Merge branch 'main' into feat/cuda-pr4
ColoCarletti 3c52fdf
fix comments
ColoCarletti 59437f3
add integration tests
ColoCarletti c499ee0
fix comments
ColoCarletti 025813a
refactor test
ColoCarletti f41bb7b
rm dead code, refactor
ColoCarletti 6399cf2
fix
ColoCarletti b8d97d5
rm doc
ColoCarletti 6f3262d
gpu batch inverse
ColoCarletti b422d71
fix
ColoCarletti 95b8025
Merge branch 'feat/cuda-pr4' of github.com:yetanotherco/lambda_vm int…
ColoCarletti b706e48
fallback test
ColoCarletti 5eae98a
Merge branch 'main' into feat/cuda-pr4
ColoCarletti 578cb29
fix_comments
ColoCarletti 50d2541
Merge remote-tracking branch 'origin/feat/cuda-pr4' into feat/cuda-pr…
ColoCarletti 84ae125
cleanup
ColoCarletti ca4efc7
Merge remote-tracking branch 'origin/main' into feat/cuda-pr5-batch-i…
ColoCarletti adbcfe2
fmt
ColoCarletti 7386d0a
Merge branch 'main' into feat/cuda-pr5-batch-invert
MauroToscano a29b013
Merge branch 'main' into feat/cuda-pr5-batch-invert
ColoCarletti 1a1de35
address comments
ColoCarletti d73219e
harden inv_denoms guard, fix scan kernel race
ColoCarletti b0c60e1
fix debug assert
ColoCarletti 577c6b2
Merge branch 'main' into feat/cuda-pr5-batch-invert
diegokingston File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Oops, something went wrong.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,313 @@ | ||
| // Parallel Montgomery batch inverse over ext3 elements. | ||
| // | ||
| // Algorithm: given a[0..N-1] all non-zero, compute a^{-1}[0..N-1] using | ||
| // prefix[i] = a[0] * a[1] * ... * a[i] (inclusive forward scan) | ||
| // suffix[i] = a[i] * a[i+1] * ... * a[N-1] (inclusive backward scan) | ||
| // total = prefix[N-1] = suffix[0] | ||
| // inv_total = 1 / total (one Fermat inversion on host) | ||
| // a^{-1}[i] = prefix[i-1] * inv_total * suffix[i+1] (boundaries use identity) | ||
| // | ||
| // Each scan is a multi-block 3-phase Hillis-Steele scan in shared memory: | ||
| // Phase 1: each block does an inclusive scan over its 256 elements and | ||
| // writes its block sum to a per-block totals array. | ||
| // Phase 2: recursively scan the block totals (host re-launches this same | ||
| // kernel set; recursion depth = ceil(log_256(N))). | ||
| // Phase 3: each block reads its offset (the inclusive prefix of all | ||
| // preceding block sums) and multiplies it into every element. | ||
| // | ||
| // Forward and backward kernels are mirrors of each other. | ||
| // | ||
| // Buffer layouts: all ext3 buffers are interleaved [a0,b0,c0, a1,b1,c1, ...] | ||
| // with one u64 per coordinate. `BLOCK_SIZE = 256` ext3 elements per block | ||
| // uses 6 KB of shared memory, well under the per-SM limit on Ada/Blackwell. | ||
|
|
||
| #include "goldilocks.cuh" | ||
| #include "ext3.cuh" | ||
|
|
||
| #define BLOCK_SIZE 256 | ||
|
|
||
| // --------------------------------------------------------------------------- | ||
| // 1. compute_denoms_ext3 | ||
| // | ||
| // If `subtract_x = 0` (R3 OOD convention): denoms[k * n + i] = z[k] - x[i]. | ||
| // Matches CPU `barycentric_inv_denoms(z, points)` = 1/(z - points[i]). | ||
| // If `subtract_x = 1` (R4 DEEP convention): denoms[k * n + i] = x[i] - z[k]. | ||
| // Matches CPU R4 `denoms.push(x_i - z_k)` convention. | ||
| // | ||
| // Output is ext3-interleaved of length 3 * k_scalars * n. | ||
| // | ||
| // Launched as grid = ceil(total / BLOCK_SIZE), where total = k_scalars * n. | ||
| // Each thread builds one denom. | ||
| // --------------------------------------------------------------------------- | ||
| extern "C" __global__ void compute_denoms_ext3( | ||
| const uint64_t *x_base, // n u64 | ||
| const uint64_t *z_scalars, // 3 * k_scalars u64 | ||
| uint64_t n, | ||
| uint64_t k_scalars, | ||
| uint64_t subtract_x, // 0: z - x; 1: x - z | ||
| uint64_t *denoms_out // 3 * k_scalars * n u64 | ||
| ) { | ||
| uint64_t flat = (uint64_t)blockIdx.x * BLOCK_SIZE + threadIdx.x; | ||
| uint64_t total = k_scalars * n; | ||
| if (flat >= total) return; | ||
|
|
||
| uint64_t k = flat / n; | ||
| uint64_t i = flat - k * n; | ||
|
|
||
| uint64_t x_i = x_base[i]; | ||
| ext3::Fe3 z = { | ||
| z_scalars[k * 3 + 0], | ||
| z_scalars[k * 3 + 1], | ||
| z_scalars[k * 3 + 2], | ||
| }; | ||
| ext3::Fe3 d; | ||
| if (subtract_x == 0) { | ||
| // z - x: lift x to (x, 0, 0), subtract from z. | ||
| d.a = goldilocks::sub(z.a, x_i); | ||
| d.b = z.b; | ||
| d.c = z.c; | ||
| } else { | ||
| // x - z: lift x to (x, 0, 0), subtract z. | ||
| d.a = goldilocks::sub(x_i, z.a); | ||
| d.b = goldilocks::neg(z.b); | ||
| d.c = goldilocks::neg(z.c); | ||
| } | ||
|
|
||
| denoms_out[flat * 3 + 0] = d.a; | ||
| denoms_out[flat * 3 + 1] = d.b; | ||
| denoms_out[flat * 3 + 2] = d.c; | ||
| } | ||
|
|
||
| // --------------------------------------------------------------------------- | ||
| // 2. block_inclusive_scan_fwd_ext3 | ||
| // | ||
| // Per-block forward Hillis-Steele inclusive scan with multiplication. Writes | ||
| // scan_out[gid] = product of input[block_start..=gid] and block_totals[bid] = | ||
| // the product over the entire block. | ||
| // | ||
| // Threads handle out-of-range positions by loading the identity element (1), | ||
| // so a partial last block still produces a correct scan. | ||
| // --------------------------------------------------------------------------- | ||
| extern "C" __global__ void block_inclusive_scan_fwd_ext3( | ||
| const uint64_t *input, // 3 * n u64 | ||
| uint64_t n, | ||
| uint64_t *scan_out, // 3 * n u64 | ||
| uint64_t *block_totals // 3 * K u64, K = ceil(n / BLOCK_SIZE) | ||
| ) { | ||
| __shared__ ext3::Fe3 shmem[BLOCK_SIZE]; | ||
| uint64_t tid = threadIdx.x; | ||
| uint64_t gid = (uint64_t)blockIdx.x * BLOCK_SIZE + tid; | ||
|
|
||
| // Load input or identity. | ||
| if (gid < n) { | ||
| shmem[tid].a = input[gid * 3 + 0]; | ||
| shmem[tid].b = input[gid * 3 + 1]; | ||
| shmem[tid].c = input[gid * 3 + 2]; | ||
| } else { | ||
| shmem[tid] = ext3::one(); | ||
| } | ||
| __syncthreads(); | ||
|
|
||
| // Hillis-Steele inclusive scan: 8 doubling levels for BLOCK_SIZE = 256. | ||
| for (uint32_t offset = 1; offset < BLOCK_SIZE; offset <<= 1) { | ||
| ext3::Fe3 prev = (tid >= offset) ? shmem[tid - offset] : ext3::one(); | ||
| __syncthreads(); | ||
| if (tid >= offset) { | ||
| shmem[tid] = ext3::mul(prev, shmem[tid]); | ||
| } | ||
| __syncthreads(); | ||
| } | ||
|
|
||
| // Write per-element scan result. | ||
| if (gid < n) { | ||
| scan_out[gid * 3 + 0] = shmem[tid].a; | ||
| scan_out[gid * 3 + 1] = shmem[tid].b; | ||
| scan_out[gid * 3 + 2] = shmem[tid].c; | ||
| } | ||
|
|
||
| // Block total = scan value at the last VALID thread of this block. | ||
| // The last valid gid in this block is min(block_end - 1, n - 1). | ||
| // Computing it explicitly (instead of `tid == 255 || gid == n - 1`) | ||
| // ensures EXACTLY ONE thread writes per block — in a partial last | ||
| // block the two conditions would otherwise both fire and race. | ||
| uint64_t block_end = ((uint64_t)blockIdx.x + 1) * BLOCK_SIZE; | ||
| uint64_t last_valid_gid = (block_end - 1 < n - 1) ? (block_end - 1) : (n - 1); | ||
| if (gid == last_valid_gid) { | ||
| block_totals[(uint64_t)blockIdx.x * 3 + 0] = shmem[tid].a; | ||
| block_totals[(uint64_t)blockIdx.x * 3 + 1] = shmem[tid].b; | ||
| block_totals[(uint64_t)blockIdx.x * 3 + 2] = shmem[tid].c; | ||
| } | ||
| } | ||
|
|
||
| // --------------------------------------------------------------------------- | ||
| // 3. apply_block_offsets_fwd_ext3 | ||
| // | ||
| // Phase 3 of the forward scan: each block b > 0 multiplies its per-block | ||
| // scan by `block_totals_scanned[b-1]` (the inclusive prefix of preceding | ||
| // block totals). Block 0 has no offset, so it returns early. | ||
| // --------------------------------------------------------------------------- | ||
| extern "C" __global__ void apply_block_offsets_fwd_ext3( | ||
| uint64_t *scan_inout, // 3 * n u64 (modified in place) | ||
| uint64_t n, | ||
| const uint64_t *block_totals_scanned // 3 * K u64, inclusive prefix of phase-1 totals | ||
| ) { | ||
| if (blockIdx.x == 0) return; | ||
| uint64_t tid = threadIdx.x; | ||
| uint64_t gid = (uint64_t)blockIdx.x * BLOCK_SIZE + tid; | ||
| if (gid >= n) return; | ||
|
|
||
| ext3::Fe3 offset = { | ||
| block_totals_scanned[(blockIdx.x - 1) * 3 + 0], | ||
| block_totals_scanned[(blockIdx.x - 1) * 3 + 1], | ||
| block_totals_scanned[(blockIdx.x - 1) * 3 + 2], | ||
| }; | ||
| ext3::Fe3 val = { | ||
| scan_inout[gid * 3 + 0], | ||
| scan_inout[gid * 3 + 1], | ||
| scan_inout[gid * 3 + 2], | ||
| }; | ||
| ext3::Fe3 res = ext3::mul(offset, val); | ||
| scan_inout[gid * 3 + 0] = res.a; | ||
| scan_inout[gid * 3 + 1] = res.b; | ||
| scan_inout[gid * 3 + 2] = res.c; | ||
| } | ||
|
|
||
| // --------------------------------------------------------------------------- | ||
| // 4. block_inclusive_scan_rev_ext3 | ||
| // | ||
| // Mirror of `block_inclusive_scan_fwd_ext3` for the suffix product: | ||
| // suffix[i] = input[i] * input[i+1] * ... * input[n-1] | ||
| // | ||
| // Block b processes pos_from_end in [b*B, (b+1)*B), where gid = n-1-pos_from_end. | ||
| // Inside shmem the order is reversed so a forward Hillis-Steele scan over | ||
| // the loaded values produces the suffix scan in the original index space. | ||
| // --------------------------------------------------------------------------- | ||
| extern "C" __global__ void block_inclusive_scan_rev_ext3( | ||
| const uint64_t *input, | ||
| uint64_t n, | ||
| uint64_t *scan_out, | ||
| uint64_t *block_totals | ||
| ) { | ||
| __shared__ ext3::Fe3 shmem[BLOCK_SIZE]; | ||
| uint64_t tid = threadIdx.x; | ||
| uint64_t pos_from_end = (uint64_t)blockIdx.x * BLOCK_SIZE + tid; | ||
| bool valid = pos_from_end < n; | ||
| uint64_t gid = valid ? (n - 1 - pos_from_end) : 0; | ||
|
|
||
| if (valid) { | ||
| shmem[tid].a = input[gid * 3 + 0]; | ||
| shmem[tid].b = input[gid * 3 + 1]; | ||
| shmem[tid].c = input[gid * 3 + 2]; | ||
| } else { | ||
| shmem[tid] = ext3::one(); | ||
| } | ||
| __syncthreads(); | ||
|
|
||
| for (uint32_t offset = 1; offset < BLOCK_SIZE; offset <<= 1) { | ||
| ext3::Fe3 prev = (tid >= offset) ? shmem[tid - offset] : ext3::one(); | ||
| __syncthreads(); | ||
| if (tid >= offset) { | ||
| shmem[tid] = ext3::mul(prev, shmem[tid]); | ||
| } | ||
| __syncthreads(); | ||
| } | ||
|
|
||
| if (valid) { | ||
| scan_out[gid * 3 + 0] = shmem[tid].a; | ||
| scan_out[gid * 3 + 1] = shmem[tid].b; | ||
| scan_out[gid * 3 + 2] = shmem[tid].c; | ||
| } | ||
|
|
||
| // Mutually-exclusive last-thread mask (same idea as fwd): the last | ||
| // valid pos_from_end in this block is min(block_end - 1, n - 1). | ||
| uint64_t block_end_rev = ((uint64_t)blockIdx.x + 1) * BLOCK_SIZE; | ||
| uint64_t last_valid_pos = (block_end_rev - 1 < n - 1) ? (block_end_rev - 1) : (n - 1); | ||
| if (pos_from_end == last_valid_pos) { | ||
| block_totals[(uint64_t)blockIdx.x * 3 + 0] = shmem[tid].a; | ||
| block_totals[(uint64_t)blockIdx.x * 3 + 1] = shmem[tid].b; | ||
| block_totals[(uint64_t)blockIdx.x * 3 + 2] = shmem[tid].c; | ||
| } | ||
| } | ||
|
|
||
| // --------------------------------------------------------------------------- | ||
| // 5. apply_block_offsets_rev_ext3 | ||
| // | ||
| // Phase 3 of the suffix scan. Block b > 0 multiplies its per-block scan | ||
| // by the inclusive prefix of block totals from blocks [0..b-1] (which, in | ||
| // the reverse-block indexing, correspond to the indices LARGER than this | ||
| // block's gids). | ||
| // --------------------------------------------------------------------------- | ||
| extern "C" __global__ void apply_block_offsets_rev_ext3( | ||
| uint64_t *scan_inout, | ||
| uint64_t n, | ||
| const uint64_t *block_totals_scanned | ||
| ) { | ||
| if (blockIdx.x == 0) return; | ||
| uint64_t tid = threadIdx.x; | ||
| uint64_t pos_from_end = (uint64_t)blockIdx.x * BLOCK_SIZE + tid; | ||
| if (pos_from_end >= n) return; | ||
| uint64_t gid = n - 1 - pos_from_end; | ||
|
|
||
| ext3::Fe3 offset = { | ||
| block_totals_scanned[(blockIdx.x - 1) * 3 + 0], | ||
| block_totals_scanned[(blockIdx.x - 1) * 3 + 1], | ||
| block_totals_scanned[(blockIdx.x - 1) * 3 + 2], | ||
| }; | ||
| ext3::Fe3 val = { | ||
| scan_inout[gid * 3 + 0], | ||
| scan_inout[gid * 3 + 1], | ||
| scan_inout[gid * 3 + 2], | ||
| }; | ||
| ext3::Fe3 res = ext3::mul(offset, val); | ||
| scan_inout[gid * 3 + 0] = res.a; | ||
| scan_inout[gid * 3 + 1] = res.b; | ||
| scan_inout[gid * 3 + 2] = res.c; | ||
| } | ||
|
|
||
| // --------------------------------------------------------------------------- | ||
| // 6. batch_inverse_combine_ext3 | ||
| // | ||
| // out[i] = prefix[i-1] * inv_total * suffix[i+1] | ||
| // | ||
| // Boundaries: prefix[-1] = identity, suffix[n] = identity. | ||
| // inv_total = 1 / (prefix[n-1]) = 1 / (suffix[0]); the caller computes it | ||
| // on host via Fermat's little theorem (one extension-field inverse per | ||
| // batch) and uploads as a 3 * u64 device buffer. | ||
| // --------------------------------------------------------------------------- | ||
| extern "C" __global__ void batch_inverse_combine_ext3( | ||
| const uint64_t *prefix, // 3 * n u64 | ||
| const uint64_t *suffix, // 3 * n u64 | ||
| const uint64_t *inv_total, // 3 u64 | ||
| uint64_t n, | ||
| uint64_t *out // 3 * n u64 | ||
| ) { | ||
| uint64_t i = (uint64_t)blockIdx.x * BLOCK_SIZE + threadIdx.x; | ||
| if (i >= n) return; | ||
|
|
||
| ext3::Fe3 inv_t = {inv_total[0], inv_total[1], inv_total[2]}; | ||
|
|
||
| ext3::Fe3 p; | ||
| if (i == 0) { | ||
| p = ext3::one(); | ||
| } else { | ||
| p.a = prefix[(i - 1) * 3 + 0]; | ||
| p.b = prefix[(i - 1) * 3 + 1]; | ||
| p.c = prefix[(i - 1) * 3 + 2]; | ||
| } | ||
|
|
||
| ext3::Fe3 s; | ||
| if (i == n - 1) { | ||
| s = ext3::one(); | ||
| } else { | ||
| s.a = suffix[(i + 1) * 3 + 0]; | ||
| s.b = suffix[(i + 1) * 3 + 1]; | ||
| s.c = suffix[(i + 1) * 3 + 2]; | ||
| } | ||
|
|
||
| ext3::Fe3 tmp = ext3::mul(p, inv_t); | ||
| ext3::Fe3 res = ext3::mul(tmp, s); | ||
|
|
||
| out[i * 3 + 0] = res.a; | ||
| out[i * 3 + 1] = res.b; | ||
| out[i * 3 + 2] = res.c; | ||
| } | ||
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many of the index multiplications could be hoisted outside of the function to become mere additions.
The choice of
subract_x = 0meaning "actually DO subtract x" is counter-intuitive.