Skip to content
Merged
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
74 changes: 74 additions & 0 deletions main/acle.md
Original file line number Diff line number Diff line change
Expand Up @@ -471,6 +471,7 @@ Armv8.4-A [[ARMARMv84]](#ARMARMv84). Support is added for the Dot Product intrin

* Added support for modal 8-bit floating point matrix multiply-accumulate widening intrinsics.
* Added support for 16-bit floating point matrix multiply-accumulate widening intrinsics.
* Added support for range prefetch intrinsic when `__ARM_FEATURE_RPRFM` is defined.

### References

Expand Down Expand Up @@ -3613,6 +3614,79 @@ values.
| KEEP | 0 | Temporal fetch of the addressed location (that is, allocate in cache normally) |
| STRM | 1 | Streaming fetch of the addressed location (that is, memory used only once) |

The following intrinsic is also available when `__ARM_FEATURE_RPRFM` is defined:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the expectation for the builtin to fail compilation when this feature is not present? or be treated as a NOP? The existing _pld documentation suggests the latter.

I guess it comes down to whether the builtin provides useful information that can be used even if RPRFM is not available (e.g. by emitting other prefetch instructions, or non-temporal accesses).

The reason I mention this is because if there's only a weak link to the underlying instruction then we'll not want to use a feature macro, but instead something that just says the builtin is available, like __ARM_PREFIX_RANGE.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the existing documentation above the _pld builtin applies here too, in that the range builtin should not fail when the feature is not present and is treated as a nop.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think I agree with Paul, I think we should have a __ARM_PREFETCH_RANGE macro to tell the user if this part of the ACLE is currently implemented for the compiler and is available. Then it can be used for conditional compilation?

I think we could also have a __ARM_FEATURE_RPRFM macro, to indicate if the actual feat is enabled though?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated this to describe the __ARM_PREFETCH_RANGE macro.

@AlfGalf Given that the __ARM_FEATURE_RPRFM macro should be used to query whether this is implemented, I'm not sure why we would also want to add __ARM_FEATURE_RPRFM if the instruction does not require something like +rprfm?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I support @kmclaughlin-arm , the FEAT_RPRFM is OPTIONAL from Armv8.0, and not guarded by any dedicated feature flag, moreover it looks like RPRFM is a preferred disassembly for PRFM with UXTW, as guarded by this test https://github.com/llvm/llvm-project/blob/f9a3076180bfa1ebffd518971eefd30939a5ced6/llvm/test/MC/AArch64/rprfm.s#L273 , so not sure what we would achieve with having __ARM_FEATURE_RPRFM.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah, okay, agreed best to not have __ARM_FEATURE_RPRFM


``` c
void __pldx_range(/*constant*/ unsigned int /*access_kind*/,
/*constant*/ unsigned int /*retention_policy*/,
/*constant*/ size_t /*reuse distance*/,
/*constant*/ signed int /*stride*/,
/*constant*/ unsigned int /*count*/,
/*constant*/ signed int /*length*/,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a strongly held view, but after reading the description of __pld_range I do feel its "parameters" have a more natural order (i.e. length, count, stride, reuse distance). What do you think to using the same order for __pldx_range?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better if the order is consistent for both of the intrinsics, and it makes sense to keep the ordering from bits 0-63 used by __pld_range.

void const volatile *addr);
```

Generates a data prefetch instruction for a range of addresses starting from a
given base address. Locations within the specified address ranges are prefetched
into one or more caches. This intrinsic allows the specification of the
expected access kind (read or write), the data retention policy (temporal or
streaming) and the reuse distance, stride, count and length metadata values.

The access kind and data retention policy arguments can only be one of the
following values.

| **Access Kind** | **Value** | **Summary** |
| --------------- | --------- | ---------------------------------------- |
| PLD | 0 | Fetch the addressed location for reading |
| PST | 1 | Fetch the addressed location for writing |

| **Retention Policy** | **Value** | **Summary** |
| -------------------- | --------- | -------------------------------------------------------------------------- |
| KEEP | 0 | Temporal fetch of the addressed location (that is, allocate in cache normally) |
| STRM | 1 | Streaming fetch of the addressed location (that is, memory used only once) |

The table below describes the ranges of the reuse distance, stride, count and length arguments.

| **Metadata** | **Range** | **Summary** |
| -------------- | ------------------- | -------------------------------------------------------------------- |
| Reuse Distance | | Maximum number of bytes to be accessed before executing the next |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also add the Range here for Reuse Distance?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @CarolineConcatto, the reason for removing the Reuse Distance range was discussed here: #423 (comment)

| | | RPRFM instruction that specifies the same range. All values are |
| | | rounded up to the nearest power of 2 in the range 32KiB to 512MiB. |
| | | Values exceeding the maximum of 512MiB will be represented by 0, |
| | | indicating distance not known. |
| | | Note: This value is ignored if a streaming prefetch is specified. |
| Stride | [-2MiB, +2MiB) | Number of bytes to advance the block address by after `Length` |
| | | bytes have been accessed. Note: This value is ignored if Count is 1. |
| Count | [1, 65536] | Number of blocks to be accessed. |
| Length | [-2MiB, +2MiB) | Number of contiguous bytes to be accessed. |

``` c
void __pld_range(/*constant*/ unsigned int /*access_kind*/,
/*constant*/ unsigned int /*retention_policy*/,
unsigned long /*metadata*/,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this is quite late to request a change, sorry about that, but is it worth using uint64_t for metadata instead of unsigned long to avoid any issues with targets where long != 64 bits?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or long long, happy with either :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, I think it makes sense to change it. Metadata is now using uint64_t.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

void const volatile *addr);
```

Generates a data prefetch instruction for a range of addresses starting from a
given base address. Locations within the specified address ranges are prefetched
into one or more caches. The access kind and retention policy arguments can
have the same values as in `__pldx_range`. The bits of the metadata argument
are interpreted as follows:

| **Metadata** | **Bits** | **Range** | **Summary** |
| -------------- | -------- | --------------- | ------------------------------------------------------------ |
| Length | 0-21 | [-2MiB, +2MiB) | Signed integer representing the number of contiguous |
| | | | bytes to be accessed. |
| Count | 37-22 | [0, 65535] | Unsigned integer representing number of blocks of data |
| | | | to be accessed, minus 1. |
| Stride | 59-38 | [-2MiB, +2MiB) | Signed integer representing the number of bytes to advance |
| | | | the block address by after `Length` bytes have been |
| | | | accessed. This value is ignored if Count is 0. |
| Reuse Distance | 63-60 | [0, 15] | Indicates the maximum number of bytes to be accessed before |
| | | | executing the next RPRFM instruction that specifies the same |
| | | | range. Bits encode decreasing powers of two in the range |
| | | | 1 (512MiB) to 15 (32KiB). 0 indicates distance not known. |

### Instruction prefetch

``` c
Expand Down
Loading