ras: aest: extend AEST support to Device Tree frontend #1146
umang-chheda wants to merge 24 commits into
This patch introduces the creation of AEST platform devices, where each device represents a logical "error node device" grouping one or more AEST nodes from the ACPI table. Instead of relying on the optional 'error_node_device' field in the AEST table[1], this commit uses the interrupt number as the sole identifier for the parent device. This design simplifies the driver logic by providing a single, consistent mechanism for grouping nodes. The 'error_node_device' field can be unspecified, but an AEST node is always physically associated with a parent component. The interrupt number serves as a reliable proxy for this association. This approach is based on the safe assumption that distinct hardware components (e.g., SMMU, CMN, GIC) are assigned unique error interrupts and do not share them. [1]: https://developer.arm.com/documentation/den0085/latest Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com> Link: https://patch.msgid.link/20260122094656.73399-2-tianruidong@linux.alibaba.com Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
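The grouping described above can be sketched in plain C. This is a minimal userspace illustration, not the driver's code: `struct node`, `MAX_DEVS`, and `group_nodes_by_irq()` are hypothetical names, and the linear search stands in for whatever lookup the driver actually uses.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for an AEST node. */
struct node {
	int irq;       /* error interrupt number taken from the table */
	int dev_index; /* filled in: which parent device the node belongs to */
};

#define MAX_DEVS 16

/*
 * Assign every node to a parent device keyed only by its interrupt number.
 * dev_irqs[] receives the IRQ of each device that was created.
 * Returns the number of distinct devices, or -1 if MAX_DEVS is exceeded.
 */
static int group_nodes_by_irq(struct node *nodes, size_t n,
			      int dev_irqs[MAX_DEVS])
{
	int ndevs = 0;

	for (size_t i = 0; i < n; i++) {
		int d;

		/* Reuse an existing device if one already owns this IRQ. */
		for (d = 0; d < ndevs; d++)
			if (dev_irqs[d] == nodes[i].irq)
				break;

		if (d == ndevs) {
			if (ndevs == MAX_DEVS)
				return -1;
			dev_irqs[ndevs++] = nodes[i].irq;
		}
		nodes[i].dev_index = d;
	}
	return ndevs;
}
```

Nodes that share an interrupt end up under the same parent, which only gives the intended result if the "one error interrupt per hardware component" assumption stated in the commit message holds.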
Parse register information from the AEST table in the probe function, create the corresponding structures, and map the AEST records. Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com> Link: https://patch.msgid.link/20260122094656.73399-3-tianruidong@linux.alibaba.com Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
Support for various AEST group formats allows for flexible configuration of AEST node address space sizes and maximum record counts per group. Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com> Link: https://patch.msgid.link/20260122094656.73399-4-tianruidong@linux.alibaba.com Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
…IO register Use record_read/write to access both system registers and MMIO registers through a single interface while keeping the code concise. Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com> Link: https://patch.msgid.link/20260122094656.73399-5-tianruidong@linux.alibaba.com Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
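One way to picture a unified accessor is a record that remembers how it is reached and a pair of helpers that dispatch on that. The sketch below is a userspace approximation under stated assumptions: `struct record`, `enum access_type`, and the array-backed "banks" are hypothetical, and the signatures are not the ones used in the series. In a kernel driver the two branches would typically use system-register and MMIO accessors instead of array indexing.

```c
#include <stdint.h>

/* How a given error record is reached (hypothetical names). */
enum access_type { ACCESS_SYSREG, ACCESS_MMIO };

struct record {
	enum access_type type;
	uint64_t *sysreg_bank;        /* stand-in for system register access */
	volatile uint64_t *mmio_base; /* stand-in for a mapped MMIO window */
};

/* Read register index 'reg' of a record, whatever its access type. */
static uint64_t record_read(const struct record *r, unsigned int reg)
{
	if (r->type == ACCESS_MMIO)
		return r->mmio_base[reg];
	return r->sysreg_bank[reg];
}

/* Write register index 'reg' of a record, whatever its access type. */
static void record_write(const struct record *r, unsigned int reg,
			 uint64_t val)
{
	if (r->type == ACCESS_MMIO)
		r->mmio_base[reg] = val;
	else
		r->sysreg_bank[reg] = val;
}
```

Callers then never need to know which kind of record they hold, which is the conciseness benefit the commit message refers to.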
The RAS version of a component can be probed via its ERRDEVARCH register. In cases where a component (e.g., SMMU) does not implement an ERRDEVARCH register, the driver falls back to using the RAS version of the Processing Element (PE). Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com> Link: https://patch.msgid.link/20260122094656.73399-6-tianruidong@linux.alibaba.com Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
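The fallback can be sketched as below. The field positions (a PRESENT flag at bit 20 and a 4-bit REVISION field at bits [19:16]) are my reading of the ERRDEVARCH layout and should be treated as assumptions to verify against the Arm RAS specification; `ras_version()` and its parameters are hypothetical and do not come from the series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed ERRDEVARCH field layout; verify against the RAS specification. */
#define ERRDEVARCH_PRESENT   (UINT32_C(1) << 20)
#define ERRDEVARCH_REV_SHIFT 16
#define ERRDEVARCH_REV_MASK  UINT32_C(0xf)

/*
 * Return the RAS revision to use for a component.
 * has_errdevarch: whether the component implements ERRDEVARCH at all.
 * errdevarch:     raw register value (ignored if not implemented).
 * pe_version:     RAS version advertised by the PE, used as the fallback.
 */
static unsigned int ras_version(bool has_errdevarch, uint32_t errdevarch,
				unsigned int pe_version)
{
	if (has_errdevarch && (errdevarch & ERRDEVARCH_PRESENT))
		return (errdevarch >> ERRDEVARCH_REV_SHIFT) & ERRDEVARCH_REV_MASK;

	/* No usable ERRDEVARCH (e.g. SMMU): fall back to the PE's version. */
	return pe_version;
}
```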
Add the injection registers described in the Common Fault Injection Model Extension. Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com> Link: https://patch.msgid.link/20260122094656.73399-7-tianruidong@linux.alibaba.com Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
The CE threshold defines the number of Correctable Errors (CE) that must occur in a record before triggering an interrupt. Error records support multiple threshold configurations, including 8B, 16B, and 32B. This patch detects the supported threshold settings for error records and sets the default threshold to 1, ensuring an interrupt is generated for every CE occurrence. Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com> Link: https://patch.msgid.link/20260122094656.73399-8-tianruidong@linux.alibaba.com Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
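To illustrate how a threshold of 1 yields one interrupt per CE, here is a small sketch that assumes the hardware counts upward and raises the interrupt when the counter wraps through zero. Under that assumption the driver would preload the counter so that exactly `threshold` further errors cause the wrap. The helper name, the treatment of the width as a bit count, and the clamping policy are all hypothetical.

```c
#include <stdint.h>

/*
 * Compute the value to preload into a CE counter of 'width_bits' bits so
 * that it overflows after 'threshold' corrected errors.
 * Assumption: the interrupt fires when the counter wraps through zero.
 */
static uint32_t ce_counter_preload(unsigned int width_bits, uint32_t threshold)
{
	uint32_t max = (width_bits >= 32) ? UINT32_MAX
					  : (UINT32_C(1) << width_bits) - 1;

	/* Clamp to a usable range: at least 1, at most the counter maximum. */
	if (threshold == 0)
		threshold = 1;
	if (threshold > max)
		threshold = max;

	return max - threshold + 1;
}
```

With a threshold of 1 the preload equals the counter maximum, so the very next corrected error wraps the counter.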
The interrupt numbers for certain error records may be explicitly programmed into their configuration registers. For PPIs, each core maintains its own copy of the aest_device structure. Because handling RAS errors involves complex processing such as EDAC and memory_failure, all handling is deferred to a bottom-half context. Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com> Link: https://patch.msgid.link/20260122094656.73399-9-tianruidong@linux.alibaba.com Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
Move the configuration of interrupts and CE thresholds into the CPU hotplug callbacks for the per-CPU AEST node. Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com> Link: https://patch.msgid.link/20260122094656.73399-10-tianruidong@linux.alibaba.com Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
Expose certain AEST driver information to userspace.
Only root can access these interfaces because they include
hardware-sensitive information:
ls /sys/kernel/debug/aest/
memory<id> smmu<id> ...
ls /sys/kernel/debug/aest/memory<id>/
record0 record1 ...
All details at:
Documentation/ABI/testing/debugfs-aest
Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com>
Link: https://patch.msgid.link/20260122094656.73399-11-tianruidong@linux.alibaba.com
Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
This commit introduces error counting functionality for AEST records. Previously, error statistics were not directly available for individual error records or AEST nodes. Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com> Link: https://patch.msgid.link/20260122094656.73399-12-tianruidong@linux.alibaba.com Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
This commit introduces the ability to configure the Corrected Error (CE) threshold for AEST records through debugfs. This allows administrators to dynamically adjust the CE threshold for error reporting. Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com> Link: https://patch.msgid.link/20260122094656.73399-13-tianruidong@linux.alibaba.com Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
AEST offers both soft and hard injection. Soft injection simulates errors in software, providing flexibility to define the error register content. Hard injection, on the other hand, utilizes error injection registers to introduce hardware faults, strictly requiring values that adhere to their specifications. Read Documentation/ABI/testing/debugfs-aest to learn how to use them. Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com> Link: https://patch.msgid.link/20260122094656.73399-14-tianruidong@linux.alibaba.com Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
The AEST table includes vendor error nodes to support components that do not implement the standard Arm RAS architecture[1]. Each vendor node may have its own initialization and interrupt handling functions. This patch supplies a framework for processing vendor error nodes; the vendor processing function is bound to the vendor HID. [1]: https://developer.arm.com/documentation/ddi0587/latest/ Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com> Link: https://patch.msgid.link/20260122094656.73399-15-tianruidong@linux.alibaba.com Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
The CMN (Coherent Mesh Network) architecture incorporates five distinct device types. Each device type is associated with an error group register set. The struct aest_cmn_700 models a single CMN instance, while struct aest_cmn_700_child represents an individual CMN device. CMN's error records utilize a memory-mapped single error record view [1]. Critically, one error record corresponds to one AEST node, implying that a single CMN instance can generate hundreds of AEST nodes. To manage this scale, this driver introduces a virtual AEST node, which represents an entire CMN device, such as an HNI or HNF. This allows an HNF AEST node, for instance, to leverage its errgsr register to pinpoint which specific error record has reported an error. During the AEST probe phase, the CMN AEST driver identifies the CMN node type using the cmn_node_info register. It then reorganizes all AEST nodes belonging to the same CMN node type into a cohesive CMN AEST node structure. To locate the relevant CMN register addresses, the CMN's presence in the DSDT is required, along with the CMN node offset specified in the AEST vendor specification data [1]. [1]: https://developer.arm.com/documentation/102308/latest/ Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com> Link: https://patch.msgid.link/20260122094656.73399-16-tianruidong@linux.alibaba.com Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
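The role of errgsr in the virtual node can be sketched as a bit scan. This assumes a single 64-bit status word in which a set bit means the record with that index has something to report; the real register may span several words, and `reporting_records()` is a hypothetical helper, not a function from the series.

```c
#include <stdint.h>

/*
 * Collect the indices of all error records flagged in an errgsr value.
 * Assumption: bit i set means record i has reported an error.
 * Returns the number of indices written to out[].
 */
static unsigned int reporting_records(uint64_t errgsr, unsigned int out[64])
{
	unsigned int n = 0;

	for (unsigned int i = 0; i < 64; i++)
		if (errgsr & (UINT64_C(1) << i))
			out[n++] = i;

	return n;
}
```

The virtual node's handler then only needs to visit the records returned here instead of polling every record of the CMN device.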
Add a trace event for hardware errors reported by the ARMv8 RAS extension registers. A userspace application can monitor this trace event and decode the error information. Signed-off-by: Ruidong Tian <tianruidong@linux.alibaba.com> Link: https://patch.msgid.link/20260122094656.73399-17-tianruidong@linux.alibaba.com Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
… messages Two related fixes for processor nodes with ACPI_AEST_PROC_FLAG_SHARED or ACPI_AEST_PROC_FLAG_GLOBAL set (e.g. cluster L3 cache, DSU): 1. aest_dev_is_oncore() returns true for any PROCESSOR_ERROR_NODE, causing shared processor nodes (which use an SPI) to take the cpuhp/PPI path. cpuhp_setup_state() is called instead of aest_online_dev(), so aest_config_irq() is never called and the hardware IRQ-config register is never programmed. Fix aest_dev_is_oncore() to check irq_is_percpu() on the registered IRQ. Only nodes whose FHI or ERI is a per-CPU PPI take the oncore path, nodes with an SPI take aest_online_dev(). 2. alloc_aest_node_name() uses processor_id for the node name of all processor nodes. Shared/global nodes have processor_id=0 (the field is unused when SHARED/GLOBAL is set), so every shared node and the per-PE node for CPU 0 both got the name "processor.0", making error logs ambiguous. For shared/global nodes, build the name as "processor.<resource_type>.<device_id>" (e.g. "processor.cache.1") so each node has a unique, meaningful identifier. Per-PE nodes keep the original "processor.<mpidr>" form. Also add proc_flags to struct aest_event so aest_print() can distinguish shared from per-PE nodes and print an appropriate message. Link: https://lore.kernel.org/lkml/20260505-aest-devicetree-support-v1-1-d5d6ffacf0a5@oss.qualcomm.com/ Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
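The naming rule from the second fix can be sketched with `snprintf()`. The flag constants, the `make_node_name()` helper, and the hexadecimal rendering of the MPIDR are assumptions made for illustration; only the "processor.<resource_type>.<device_id>" and "processor.<mpidr>" shapes come from the commit message.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical flag values standing in for the ACPI_AEST_PROC_FLAG_* bits. */
#define PROC_FLAG_GLOBAL (1u << 0)
#define PROC_FLAG_SHARED (1u << 1)

/*
 * Build a node name. Shared/global nodes are named after their resource
 * type and device id; per-PE nodes keep the MPIDR-based name.
 * Returns the snprintf() result.
 */
static int make_node_name(char *buf, size_t len, unsigned int flags,
			  const char *resource, unsigned int device_id,
			  uint64_t mpidr)
{
	if (flags & (PROC_FLAG_GLOBAL | PROC_FLAG_SHARED))
		return snprintf(buf, len, "processor.%s.%u", resource, device_id);

	/* Hex formatting of the MPIDR is an assumption of this sketch. */
	return snprintf(buf, len, "processor.%llx", (unsigned long long)mpidr);
}
```

With this rule a shared L3 cache node and the per-PE node for CPU 0 no longer collide on the same name.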
The error counts visible under: /sys/kernel/debug/aest/<dev>/processor<cpu>/<node>/err_count always reported zero, even though corrected errors (CEs) were being serviced by the interrupt handler. aest_oncore_dev_init_debugfs() sets up per-CPU debugfs entries but wired them up incorrectly in two places: - this_cpu_ptr(adev->adev_oncore) was used inside for_each_possible_cpu(). This always selects the slot for the CPU executing the init code, so all debugfs files ended up referencing the same per-CPU aest_device instance instead of the CPU indicated by the loop variable. - The code referenced adev->nodes[i], i.e. the template nodes allocated before __setup_ppi, rather than the per-CPU copies at percpu_dev->nodes[i]. The IRQ handler updates CE counters in the per-CPU records created by __setup_ppi; the template records are never touched at runtime, so err_count always read as zero. Fix this by: - Using per_cpu_ptr(adev->adev_oncore, cpu) when iterating over CPUs. - Wiring debugfs files to percpu_dev->nodes[i] so counters reflect the data updated by the IRQ handler. - Using adev->nodes[i].name for debugfs directory names. The per-CPU node receives name via a shallow memcpy and is not the authoritative source. Link: https://lore.kernel.org/lkml/20260505-aest-devicetree-support-v1-2-d5d6ffacf0a5@oss.qualcomm.com/ Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
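The first bug is easy to reproduce outside the kernel. The sketch below models per-CPU storage as a plain array and mimics the two accessors; `this_cpu_slot()`, `per_cpu_slot()`, and the `wire_*()` helpers are hypothetical userspace stand-ins, not kernel APIs.

```c
#define NR_CPUS 4

struct counter {
	unsigned long err_count;
};

static struct counter percpu_dev[NR_CPUS]; /* one instance per CPU */
static int current_cpu;                    /* CPU running the init code */

/* Mimics this_cpu_ptr(): always the running CPU's slot. */
static struct counter *this_cpu_slot(void)
{
	return &percpu_dev[current_cpu];
}

/* Mimics per_cpu_ptr(ptr, cpu): the slot of the requested CPU. */
static struct counter *per_cpu_slot(int cpu)
{
	return &percpu_dev[cpu];
}

/* Buggy wiring: every file points at the init CPU's counter. */
static void wire_buggy(struct counter *files[NR_CPUS])
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		files[cpu] = this_cpu_slot();
}

/* Fixed wiring: each file points at its own CPU's counter. */
static void wire_fixed(struct counter *files[NR_CPUS])
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		files[cpu] = per_cpu_slot(cpu);
}
```

If CPU 2 later records errors, the buggy wiring still shows CPU 0's untouched counter for every file, which matches the "always zero" symptom described above.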
The record_implemented bitmap uses the same semantics as the rest of the driver: a SET bit means the record is NOT implemented (skip it), a CLEAR bit means the record IS implemented (process it). aest_node_init_debugfs() and aest_node_err_count_show() were iterating all record_count records unconditionally, creating debugfs entries and accumulating error counts for unimplemented records too. Fix both functions to skip records where the corresponding bit is set in node->record_implemented, consistent with how aest_node_foreach_record() handles the same bitmap. Link: https://lore.kernel.org/lkml/20260505-aest-devicetree-support-v1-3-d5d6ffacf0a5@oss.qualcomm.com/ Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
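Because the bitmap is inverted relative to what its name suggests, a short sketch may help. It sums per-record counts while skipping records whose bit is set; `sum_err_count()` and the single 64-bit bitmap word are simplifications for illustration, not the driver's implementation.

```c
#include <stdint.h>

/*
 * Sum error counts over the implemented records of a node.
 * Semantics as described above: a SET bit in record_implemented means the
 * record is NOT implemented and must be skipped.
 */
static unsigned long sum_err_count(const unsigned long *counts,
				   uint64_t record_implemented,
				   unsigned int record_count)
{
	unsigned long total = 0;

	for (unsigned int i = 0; i < record_count && i < 64; i++) {
		if (record_implemented & (UINT64_C(1) << i))
			continue; /* unimplemented record: skip it */
		total += counts[i];
	}
	return total;
}
```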
The driver unconditionally calls panic() whenever an unrecoverable, uncontainable UE (UET_UC or UET_UEU) is detected. There is no way for the user to suppress this behaviour, which makes it difficult to test UE injection or to run in environments where a kernel panic on every UE is undesirable. Add a module parameter `aest_panic_on_ue`. When set to 0, the driver logs the UE and continues instead of panicking. Usage: # Boot time (kernel cmdline) aest.aest_panic_on_ue=0 # Runtime echo 0 > /sys/module/aest/parameters/aest_panic_on_ue Link: https://lore.kernel.org/lkml/20260505-aest-devicetree-support-v1-4-d5d6ffacf0a5@oss.qualcomm.com/ Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
The Arm Error Source Table (AEST) specification describes how firmware exposes RAS error source topology to the operating system. On ACPI systems this information is provided via the AEST ACPI table. Introduce Device Tree bindings that provide an equivalent description of AEST error sources for DT-based platforms. Link: https://lore.kernel.org/lkml/20260505-aest-devicetree-support-v1-5-d5d6ffacf0a5@oss.qualcomm.com/ Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
Add a Device Tree frontend for the Arm AEST RAS framework, allowing the existing AEST core driver to be used on DT-only systems. The DT frontend parses the "arm,aest" Device Tree hierarchy and populates the same internal structures as the ACPI-based implementation. It is initialized at the same layer as ACPI and is mutually exclusive with it, ensuring identical behaviour regardless of the firmware interface in use. Link: https://lore.kernel.org/lkml/20260505-aest-devicetree-support-v1-6-d5d6ffacf0a5@oss.qualcomm.com/ Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
Add AEST RAS error source nodes for the Lemans SoC. The DT describes a processor error source covering all CPU cores and a shared L3 cache error source for the cluster. These nodes model the hardware error reporting blocks and associated interrupts as required by the Arm AEST specification. Link: https://lore.kernel.org/lkml/20260505-aest-devicetree-support-v1-7-d5d6ffacf0a5@oss.qualcomm.com/ Co-developed-by: Faruque Ansari <faruque.ansari@oss.qualcomm.com> Signed-off-by: Faruque Ansari <faruque.ansari@oss.qualcomm.com> Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
Add AEST RAS error source nodes for the Monaco SoC. The DT describes a processor error source covering all CPU cores and a shared L3 cache error source for the cluster. These nodes model the hardware error reporting blocks and associated interrupts as required by the Arm AEST specification. Link: https://lore.kernel.org/lkml/20260505-aest-devicetree-support-v1-8-d5d6ffacf0a5@oss.qualcomm.com/ Co-developed-by: Faruque Ansari <faruque.ansari@oss.qualcomm.com> Signed-off-by: Faruque Ansari <faruque.ansari@oss.qualcomm.com> Signed-off-by: Umang Chheda <umang.chheda@oss.qualcomm.com>
This series extends Tian Ruidong’s [1] ACPI-based AEST support series
to also cover Device Tree based platforms.
While the existing AEST driver relies on the AEST ACPI table [3], many
embedded Arm platforms use Device Tree exclusively and cannot use the
driver today. This series adds a DT frontend that mirrors the ACPI
implementation and feeds the same core driver, keeping ACPI and DT
paths functionally equivalent.
Along the way, several correctness issues were identified in the core
driver and are fixed in the first part of this series.
The DT frontend is mutually exclusive with ACPI and does not introduce
any DT-specific logic into the core.