[Bug] - Intermittent hibernation resume failure on AL2023 aarch64 (kernel 6.18.20) — `PM: Image not found (code -22)`

**Describe the bug**

When an EC2 instance running AL2023 (aarch64) is hibernated and then started again, it intermittently fails to resume and becomes unreachable (no SSH, no application response). The instance state shows as `running` in the EC2 console, but the kernel never completes resume. Recovery requires a `Stop` followed by `Start` (a cold boot), at which point the system comes up normally.

The kernel log on the next cold boot shows:

```
PM: Image not found (code -22)
PM: Image not found (code -16)
```

`code -22` is `-EINVAL`, indicating that the kernel did not find a valid hibernation image header at the configured `resume_offset`. `code -16` is `-EBUSY`.
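To check directly what sits at `resume_offset`, the signature at the end of that page can be read from the block device. The following is a minimal sketch, not a verified diagnostic: it assumes a 4096-byte page size, that `resume_offset` is counted in page-size units, and that the signature is the last 10 bytes of the page (`S1SUSPEND` for a hibernation image, `SWAPSPACE2` for a plain swap header). These layout details are my reading of `kernel/power/swap.c` and should be confirmed against the 6.18 source. The function names are hypothetical.

```python
SIG_LEN = 10  # assumed: both signatures occupy the last 10 bytes of the header page


def read_page_signature(path, offset_pages, page_size=4096):
    """Return the last SIG_LEN bytes of the page at `offset_pages` within `path`."""
    with open(path, "rb") as dev:
        dev.seek(offset_pages * page_size + page_size - SIG_LEN)
        return dev.read(SIG_LEN)


def classify_signature(sig):
    """Map raw signature bytes to a human-readable label (assumed signature strings)."""
    if sig.startswith(b"S1SUSPEND"):
        return "hibernation image present"
    if sig in (b"SWAPSPACE2", b"SWAP-SPACE"):
        return "plain swap header, no image"
    return "unrecognized"
```

On the affected instance this would be called as root, for example `classify_signature(read_page_signature("/dev/nvme0n1p1", 5324800))`, ideally right after a failed resume and before anything rewrites the swap header.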

The `/swap` file managed by `ec2-hibinit-agent` is fragmented across **3 separate XFS allocation groups** with discontinuous physical block ranges, while the kernel is given only a single `resume_offset` value. `filefrag` and `xfs_bmap` also report different first-block offsets for the same untouched file (details below); this may be a difference in reporting units between the two tools, or it may indicate extent reallocation between the hibernate and resume steps.

A related issue on `aws/amazon-ec2-hibinit-agent` ([#25](https://github.com/aws/amazon-ec2-hibinit-agent/issues/25)) flagged the use of `fallocate` on XFS and has since been closed; whatever change closed it may not fully cover the fragmented multi-AG case on a recent kernel.

**To Reproduce**

Steps to reproduce the behavior:

1. Launch an EC2 instance from `ami-0bf96732a1c71350f` (AL2023 aarch64) on `t4g.large` (8 GiB RAM), with hibernation enabled at launch and an encrypted root EBS volume. With AL2023's default XFS geometry (`agsize` ~4 GiB), a swap file sized for 8 GiB of RAM (about 7.6 GiB here) cannot fit in a single allocation group and must span at least two AGs.
2. Wait for `hibinit-agent.service` to complete first-boot setup
3. Hibernate the instance via the EC2 console or `aws ec2 stop-instances --hibernate`
4. Start the instance
5. Observe: instance reaches `running` state but is unreachable. After recovering with `Stop` + `Start`, `journalctl -k | grep PM:` shows `PM: Image not found (code -22)`

The failure is **intermittent** — only some hibernate/resume cycles fail. `Stop` + `Start` recovers reliably every time.

**Expected behavior**

Hibernate followed by Start should reliably resume the instance to its pre-hibernation state, as documented for AL2023 hibernation support.

**Screenshots**

<img width="1496" height="573" alt="Image" src="https://github.com/user-attachments/assets/336dc179-8fa2-41c8-a77e-55f158fcbf02" />

**Desktop (please complete the following information):**

Not applicable (server-side issue). Replacing with the relevant environment data:

- **AMI:** `ami-0bf96732a1c71350f`
- **OS:** Amazon Linux 2023
- **Kernel:** `6.18.20-20.229.amzn2023.aarch64`
- **Architecture:** aarch64 (Graviton)
- **Instance type:** `t4g.large` (2 vCPU, 8 GiB RAM, Graviton2)
- **Region:** eu-central-1
- **Hibernation agent:** `ec2-hibinit-agent-1.0.10-2.amzn2023.noarch`
- **Root filesystem:** XFS (AL2023 default)
- **Root volume:** 30 GiB EBS, encrypted, ~77% used at time of report

**Smartphone (please complete the following information):**

Not applicable.

**Additional context**

### Kernel cmdline (resume configuration)

```
BOOT_IMAGE=(hd0,gpt1)/boot/vmlinuz-6.18.20-20.229.amzn2023.aarch64
root=UUID=911c4ca1-6548-40cd-9ab1-eb37b1abb990 ro
console=tty0 console=ttyS0,115200n8
nvme_core.io_timeout=4294967295
rd.emergency=poweroff rd.shell=0
selinux=1 security=selinux quiet numa_cma=1:64M no_console_suspend=1
resume_offset=5324800 resume=/dev/nvme0n1p1
```
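For anyone comparing these values across instances, the resume parameters can be pulled out of the command line programmatically. This is a small sketch; `parse_cmdline` is a hypothetical helper, it does not handle quoted values, and for repeated keys such as `console=` the last occurrence wins.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a dict; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


# Command line as reported in this issue, joined onto one line.
CMDLINE = (
    "BOOT_IMAGE=(hd0,gpt1)/boot/vmlinuz-6.18.20-20.229.amzn2023.aarch64 "
    "root=UUID=911c4ca1-6548-40cd-9ab1-eb37b1abb990 ro "
    "console=tty0 console=ttyS0,115200n8 "
    "nvme_core.io_timeout=4294967295 "
    "rd.emergency=poweroff rd.shell=0 "
    "selinux=1 security=selinux quiet numa_cma=1:64M no_console_suspend=1 "
    "resume_offset=5324800 resume=/dev/nvme0n1p1"
)

params = parse_cmdline(CMDLINE)
print(params["resume"], int(params["resume_offset"]))
```

On a live system the same function can be fed the contents of `/proc/cmdline`, and the result compared with `/sys/power/resume_offset`.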

### `/sys/power/*`

```
$ cat /sys/power/resume
259:1
$ cat /sys/power/resume_offset
5324800
$ cat /sys/power/disk
[shutdown] reboot test_resume
```

### Swap state

`swapon --show` and `/proc/swaps` both report **no active swap**. This appears consistent with hibinit-agent's design (the swap file is used as a reserved disk region only, not via `swapon`).
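The "no active swap" check can be scripted when collecting data from several instances. A minimal sketch, assuming the usual `/proc/swaps` layout of one header line followed by one line per active swap area (`active_swaps` is a hypothetical helper):

```python
def active_swaps(proc_swaps_text):
    """Return the names of active swap areas from the text of /proc/swaps."""
    lines = proc_swaps_text.strip().splitlines()
    # The first line is the column header; every following line is one swap area.
    return [line.split()[0] for line in lines[1:] if line.strip()]


HEADER_ONLY = "Filename\t\t\t\tType\t\tSize\t\tUsed\t\tPriority\n"
print(active_swaps(HEADER_ONLY))
```

With only the header present, as on the affected instance, the result is an empty list.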

### `/swap` file

```
$ ls -lah /swap
-rw-------. 1 root root 7.7G Apr 24 14:42 /swap

$ findmnt -no FSTYPE /
xfs
```

### `filefrag -v /swap` (read first)

```
File size of /swap is 8172470272 (1995232 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       0:    5324800..   5324800:      1:
   1:        1..  958207:    5324801..   6283007: 958207:             unwritten
```

### `xfs_bmap -v /swap` (read seconds later, file untouched)

```
/swap:
 EXT: FILE-OFFSET           BLOCK-RANGE        AG AG-OFFSET            TOTAL FLAGS
   0: [0..7]:               42598400..42598407  5 (711680..711687)         8 000101
   1: [8..7665663]:         42598408..50264063  5 (711688..8377343)  7665656 001010
   2: [7665664..14975999]:  26199040..33509375  3 (1067008..8377343) 7310336
   3: [14976000..15961855]: 12173312..13159167  1 (3795968..4781823)  985856 000101
```

The file spans **3 different XFS allocation groups** (AG 5, 3, 1) with discontinuous physical block ranges. The kernel is only given one `resume_offset` value.
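The fragmentation claim can be cross-checked from the `xfs_bmap -v` rows themselves. The sketch below parses the four rows shown above, counts physically contiguous runs, and verifies that the AG and AG-OFFSET columns agree with `agsize` from `xfs_info`. It assumes `xfs_bmap` reports in 512-byte units; the regular expression and names are illustrative only.

```python
import re
from collections import namedtuple

SECTOR = 512                               # xfs_bmap reporting unit (assumed)
AGSIZE_SECTORS = 1047168 * 4096 // SECTOR  # agsize from xfs_info, in 512-byte units

BMAP = """\
   0: [0..7]:               42598400..42598407  5 (711680..711687)         8 000101
   1: [8..7665663]:         42598408..50264063  5 (711688..8377343)  7665656 001010
   2: [7665664..14975999]:  26199040..33509375  3 (1067008..8377343) 7310336
   3: [14976000..15961855]: 12173312..13159167  1 (3795968..4781823)  985856 000101
"""

Extent = namedtuple("Extent", "start end ag ag_off total")
ROW = re.compile(
    r"\s*\d+: \[\d+\.\.\d+\]:\s+(\d+)\.\.(\d+)\s+(\d+)\s+\((\d+)\.\.\d+\)\s+(\d+)"
)


def parse(text):
    """Parse extent rows from `xfs_bmap -v` output, skipping non-matching lines."""
    return [Extent(*map(int, m.groups()))
            for m in map(ROW.match, text.splitlines()) if m]


extents = parse(BMAP)
ags = sorted({e.ag for e in extents})
# Each block start should equal AG number * AG size + offset within the AG.
ag_columns_ok = all(e.ag * AGSIZE_SECTORS + e.ag_off == e.start for e in extents)
# A new run starts wherever an extent does not begin right after the previous one.
runs = 1 + sum(1 for a, b in zip(extents, extents[1:]) if a.end + 1 != b.start)
total_bytes = sum(e.total for e in extents) * SECTOR

print(ags, ag_columns_ok, runs, total_bytes)
```

For the data in this report, extents 0 and 1 are adjacent, so the four extents form three physically contiguous runs, one per allocation group, and the extent lengths sum to the file size reported by `filefrag`.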

The `filefrag` and `xfs_bmap` outputs were taken seconds apart on the same file with no writes in between, yet they report different first-block physical offsets (`5324800` vs `42598400`). I'd appreciate maintainer input on whether this is an extent reallocation, a difference in offset reporting between the tools, or something else.
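One explanation worth ruling out before assuming reallocation is a unit mismatch: `filefrag -v` reports in filesystem blocks (4096 bytes here, per its header line), whereas `xfs_bmap` reports in 512-byte units regardless of the filesystem block size. The quick check below, a sketch using only the numbers quoted above, converts the `xfs_bmap` values to 4096-byte blocks.

```python
FS_BLOCK = 4096  # filefrag -v header: "blocks of 4096 bytes"
BMAP_UNIT = 512  # xfs_bmap reporting unit

RATIO = FS_BLOCK // BMAP_UNIT  # 512-byte units per filesystem block


def bmap_to_fs_block(unit_offset):
    """Convert an xfs_bmap offset (512-byte units) to a filesystem block number."""
    return unit_offset // RATIO


filefrag_first = 5324800       # filefrag: physical_offset of extent 0
xfs_bmap_first = 42598400      # xfs_bmap: first value of BLOCK-RANGE for extent 0
filefrag_ext1_end = 6283007    # filefrag: last physical block of extent 1
xfs_bmap_ext1_end = 50264063   # xfs_bmap: last 512-byte unit of extent 1

print(bmap_to_fs_block(xfs_bmap_first) == filefrag_first)        # True
print(bmap_to_fs_block(xfs_bmap_ext1_end) == filefrag_ext1_end)  # True
```

Both comparisons hold, so under this unit assumption the two tools describe the same physical blocks, and the first block also matches `resume_offset=5324800`. If that reading is right, the discrepancy would not by itself be evidence of extent reallocation.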

### `xfs_info /`

```
meta-data=/dev/nvme0n1p1         isize=512    agcount=8, agsize=1047168 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=0
data     =                       bsize=4096   blocks=7861499, imaxpct=25
         =                       sunit=128    swidth=128 blks
log      =internal log           bsize=4096   blocks=16384, version=2
```

With `agsize=1047168 blks` (1047168 × 4096 bytes, just under 4 GiB per AG) and a swap file of 8172470272 bytes (about 7.6 GiB; `ls -h` rounds this up to 7.7G), the file cannot fit in a single AG.
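The arithmetic behind that statement, as a short sketch using the `xfs_info` and `filefrag` values above:

```python
import math

BLOCK_SIZE = 4096          # bsize from xfs_info
AGSIZE_BLOCKS = 1047168    # agsize from xfs_info
SWAP_BYTES = 8172470272    # file size from filefrag

ag_bytes = AGSIZE_BLOCKS * BLOCK_SIZE
min_ags = math.ceil(SWAP_BYTES / ag_bytes)

print(f"AG size:   {ag_bytes} bytes ({ag_bytes / 2**30:.2f} GiB)")
print(f"Swap file: {SWAP_BYTES} bytes ({SWAP_BYTES / 2**30:.2f} GiB)")
print(f"Minimum AGs needed: {min_ags}")
```

The lower bound is two allocation groups. The observed layout uses three, which may simply reflect how much free space each AG had left on a volume that was about 77% full.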

### Kernel log around resume

```
kernel: Kernel command line: ... resume_offset=5324800 resume=/dev/nvme0n1p1
kernel: PM: Image not found (code -22)
kernel: PM: genpd: Disabling unused power domains
systemd[1]: Created slice system-systemd\x2dhibernate\x2dresume.slice
kernel: PM: Image not found (code -22)
kernel: PM: Image not found (code -16)
```
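The numeric codes in these messages are negated Linux errno values. A small sketch for decoding them with the Python standard library (`decode_kernel_code` is a hypothetical helper; the names come from the errno table of the platform running the script):

```python
import errno
import os


def decode_kernel_code(code):
    """Translate a negative kernel return code into (symbol, description)."""
    number = -code
    return errno.errorcode.get(number, "UNKNOWN"), os.strerror(number)


for code in (-22, -16):
    symbol, text = decode_kernel_code(code)
    print(code, symbol, text)
```

On Linux this maps `-22` to `EINVAL` and `-16` to `EBUSY`.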

### Workaround

`Stop` + `Start` (cold boot) recovers reliably every time. Hibernation is effectively unreliable for unattended workloads on this AMI/kernel/filesystem combination.

### What would help triage

- Whether the kernel resume path (pre-`swapon`) reads the full extent map from the swap header, or assumes contiguity from `resume_offset`
- Whether AL2023's hibinit-agent is validated against XFS root volumes where RAM size forces a multi-AG swap file (i.e., RAM > ~4 GiB on default XFS geometry)
- Whether this combination (XFS root + kernel 6.18 + aarch64 + hibernation) has been tested end-to-end in CI


