[Bug] - Intermittent hibernation resume failure on AL2023 aarch64 (kernel 6.18.20) — PM: Image not found (code -22) #1088

@p24-max

Describe the bug

When an EC2 instance running AL2023 (aarch64) is hibernated and then started again, it intermittently fails to resume and becomes unreachable (no SSH, no application response). The instance state shows as running in the EC2 console, but the kernel never completes resume. Recovery requires a Stop followed by Start (a cold boot), at which point the system comes up normally.

The kernel log on the next cold boot shows:

PM: Image not found (code -22)
PM: Image not found (code -16)

Code -22 is -EINVAL, indicating the kernel could not parse a valid hibernation image header at the configured resume_offset. Code -16 is -EBUSY.
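
For reference, the numeric codes in these messages can be decoded with Python's errno table. This is only a convenience sketch: describe_pm_code is a hypothetical helper name, and it assumes the host's errno numbering matches the kernel's, which holds on Linux.

```python
import errno
import os


def describe_pm_code(code: int) -> str:
    """Map a negative kernel return code such as -22 to its errno name and text."""
    number = abs(code)
    # errno.errorcode maps numeric values to symbolic names on this host.
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{code}: {name} ({os.strerror(number)})"


for code in (-22, -16):
    print(describe_pm_code(code))
```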

The /swap file managed by ec2-hibinit-agent is fragmented across 3 separate XFS allocation groups with discontinuous physical block ranges, while the kernel is given only a single resume_offset value. In addition, filefrag and xfs_bmap report different first-block offsets for the same untouched file (details below). The two values differ by exactly a factor of 8, which would be consistent with a difference in block units (4096-byte vs 512-byte) rather than extent reallocation between the hibernate and resume steps, but I'd like confirmation.

A related issue on aws/amazon-ec2-hibinit-agent (#25) flagged use of fallocate on XFS but was closed; that fix may not fully cover the fragmented multi-AG case on a recent kernel.

To Reproduce

Steps to reproduce the behavior:

  1. Launch an EC2 instance from ami-0bf96732a1c71350f (AL2023 aarch64) on t4g.large (8 GiB RAM), with hibernation enabled at launch and an encrypted root EBS volume. With AL2023's default XFS geometry (agsize ~4 GiB), an 8 GiB swap file cannot fit in a single allocation group and is forced to span multiple AGs.
  2. Wait for hibinit-agent.service to complete first-boot setup
  3. Hibernate the instance via the EC2 console or aws ec2 stop-instances --hibernate
  4. Start the instance
  5. Observe: instance reaches running state but is unreachable. After recovering with Stop + Start, journalctl -k | grep PM: shows PM: Image not found (code -22)

The failure is intermittent — only some hibernate/resume cycles fail. Stop + Start recovers reliably every time.

Expected behavior

Hibernate followed by Start should reliably resume the instance to its pre-hibernation state, as documented for AL2023 hibernation support.

Environment

This is a server-side issue, so the template's desktop fields do not apply. Relevant environment data:

  • AMI: ami-0bf96732a1c71350f
  • OS: Amazon Linux 2023
  • Kernel: 6.18.20-20.229.amzn2023.aarch64
  • Architecture: aarch64 (Graviton)
  • Instance type: t4g.large (2 vCPU, 8 GiB RAM, Graviton2)
  • Region: eu-central-1
  • Hibernation agent: ec2-hibinit-agent-1.0.10-2.amzn2023.noarch
  • Root filesystem: XFS (AL2023 default)
  • Root volume: 30 GiB EBS, encrypted, ~77% used at time of report

Additional context

Kernel cmdline (resume configuration)

BOOT_IMAGE=(hd0,gpt1)/boot/vmlinuz-6.18.20-20.229.amzn2023.aarch64
root=UUID=911c4ca1-6548-40cd-9ab1-eb37b1abb990 ro
console=tty0 console=ttyS0,115200n8
nvme_core.io_timeout=4294967295
rd.emergency=poweroff rd.shell=0
selinux=1 security=selinux quiet numa_cma=1:64M no_console_suspend=1
resume_offset=5324800 resume=/dev/nvme0n1p1

/sys/power/*

$ cat /sys/power/resume
259:1
$ cat /sys/power/resume_offset
5324800
$ cat /sys/power/disk
[shutdown] reboot test_resume
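
The command-line and sysfs values above can be cross-checked mechanically. The following is a minimal sketch, not part of hibinit-agent: parse_resume_params and offsets_agree are hypothetical helper names, and the inputs are copied from this report rather than read from a live system.

```python
def parse_resume_params(cmdline: str) -> dict:
    """Extract resume= and resume_offset= from a kernel command line string."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in ("resume", "resume_offset"):
            params[key] = value
    return params


def offsets_agree(cmdline: str, sysfs_offset: str) -> bool:
    """True if resume_offset on the command line equals the sysfs value."""
    params = parse_resume_params(cmdline)
    return int(params.get("resume_offset", -1)) == int(sysfs_offset.strip())


# Values copied from this report; on a live system they would come from
# /proc/cmdline and /sys/power/resume_offset.
CMDLINE = (
    "root=UUID=911c4ca1-6548-40cd-9ab1-eb37b1abb990 ro quiet "
    "resume_offset=5324800 resume=/dev/nvme0n1p1"
)
print(parse_resume_params(CMDLINE))
print(offsets_agree(CMDLINE, "5324800\n"))  # True
```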

Swap state

swapon --show and /proc/swaps both report no active swap. This is consistent with hibinit-agent's design (the swap file is used as a reserved disk region only, not via swapon).

/swap file

$ ls -lah /swap
-rw-------. 1 root root 7.7G Apr 24 14:42 /swap

$ findmnt -no FSTYPE /
xfs

filefrag -v /swap (read first)

File size of /swap is 8172470272 (1995232 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       0:    5324800..   5324800:      1:
   1:        1..  958207:    5324801..   6283007: 958207:             unwritten

xfs_bmap -v /swap (read seconds later, file untouched)

/swap:
 EXT: FILE-OFFSET           BLOCK-RANGE        AG AG-OFFSET            TOTAL FLAGS
   0: [0..7]:               42598400..42598407  5 (711680..711687)         8 000101
   1: [8..7665663]:         42598408..50264063  5 (711688..8377343)  7665656 001010
   2: [7665664..14975999]:  26199040..33509375  3 (1067008..8377343) 7310336
   3: [14976000..15961855]: 12173312..13159167  1 (3795968..4781823)  985856 000101

The file spans 3 different XFS allocation groups (AG 5, 3, 1) with discontinuous physical block ranges. The kernel is only given one resume_offset value.
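
The fragmentation can be confirmed directly from the xfs_bmap -v output. A minimal sketch, assuming the column layout shown above and ignoring "hole" lines; parse_extents and physical_breaks are hypothetical helper names:

```python
import re

# index: [file range]: physical start..end, then the AG number
EXTENT_RE = re.compile(
    r"^\s*(\d+):\s*\[(\d+)\.\.(\d+)\]:\s*(\d+)\.\.(\d+)\s+(\d+)\s"
)


def parse_extents(text):
    """Return (index, phys_start, phys_end, ag) for each mapped extent."""
    extents = []
    for line in text.splitlines():
        m = EXTENT_RE.match(line)
        if m:  # header, filename and "hole" lines do not match
            idx, _fstart, _fend, start, end, ag = map(int, m.groups())
            extents.append((idx, start, end, ag))
    return extents


def physical_breaks(extents):
    """Pairs of extent indexes where the next extent does not start
    immediately after the previous one ends."""
    return [
        (a[0], b[0])
        for a, b in zip(extents, extents[1:])
        if b[1] != a[2] + 1
    ]


BMAP = """\
/swap:
 EXT: FILE-OFFSET           BLOCK-RANGE        AG AG-OFFSET            TOTAL FLAGS
   0: [0..7]:               42598400..42598407  5 (711680..711687)         8 000101
   1: [8..7665663]:         42598408..50264063  5 (711688..8377343)  7665656 001010
   2: [7665664..14975999]:  26199040..33509375  3 (1067008..8377343) 7310336
   3: [14976000..15961855]: 12173312..13159167  1 (3795968..4781823)  985856 000101
"""

extents = parse_extents(BMAP)
print(physical_breaks(extents))         # [(1, 2), (2, 3)]
print(sorted({e[3] for e in extents}))  # [1, 3, 5]
```

Extents 0 and 1 are physically adjacent; the breaks fall at the two AG transitions.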

The filefrag and xfs_bmap outputs were taken seconds apart on the same file with no writes in between, yet they report different first-block physical offsets (5324800 vs 42598400). Note that 5324800 × 8 = 42598400, so the two values agree if filefrag counts 4096-byte blocks (as its header states) and xfs_bmap counts 512-byte blocks. I'd appreciate maintainer confirmation that this is only a difference in offset units between the tools and not an extent reallocation or something else.
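
The two offsets can be checked for a pure unit difference. A minimal sketch, assuming filefrag reports 4096-byte blocks (per its header line) and xfs_bmap reports 512-byte blocks; to_bmap_units is a hypothetical helper name:

```python
FILEFRAG_BLOCK = 4096  # bytes, from "blocks of 4096 bytes" in the filefrag header
BMAP_BLOCK = 512       # bytes, assumed unit of the xfs_bmap BLOCK-RANGE column


def to_bmap_units(filefrag_block: int) -> int:
    """Convert a filefrag block number (or length) to xfs_bmap units."""
    return filefrag_block * FILEFRAG_BLOCK // BMAP_BLOCK


print(to_bmap_units(5324800))  # 42598400, start of xfs_bmap extent 0
print(to_bmap_units(5324801))  # 42598408, start of xfs_bmap extent 1
print(to_bmap_units(958207))   # 7665656, TOTAL of xfs_bmap extent 1
```

Under that assumption the first two extents match exactly in both tools, and resume_offset=5324800 equals the filefrag value.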

xfs_info /

meta-data=/dev/nvme0n1p1         isize=512    agcount=8, agsize=1047168 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=0
data     =                       bsize=4096   blocks=7861499, imaxpct=25
         =                       sunit=128    swidth=128 blks
log      =internal log           bsize=4096   blocks=16384, version=2

With agsize=1047168 blks (~4 GiB per AG) and a ~7.6 GiB swap file (8172470272 bytes; ls rounds this up to 7.7G), the file cannot fit in a single AG.
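
The arithmetic behind this can be written out explicitly. A minimal sketch using the values from xfs_info and filefrag above; locate is a hypothetical helper, and it assumes linear 4096-byte block numbers (as filefrag prints them) rather than XFS's internal AG-encoded block numbers:

```python
AGSIZE_BLOCKS = 1047168  # agsize from xfs_info
BLOCK_SIZE = 4096        # bsize from xfs_info
SWAP_BYTES = 8172470272  # file size from filefrag

ag_bytes = AGSIZE_BLOCKS * BLOCK_SIZE
# Ceiling division: lower bound on the number of AGs the file must touch.
min_ags = -(-SWAP_BYTES // ag_bytes)


def locate(fs_block: int):
    """Return (AG number, block offset within that AG) for a linear block."""
    return divmod(fs_block, AGSIZE_BLOCKS)


print(ag_bytes)         # 4289200128
print(min_ags)          # 2
print(locate(5324800))  # (5, 88960)
```

Two AGs is only a lower bound: AG metadata and existing data on a ~77% full volume reduce the contiguous free space per AG, which fits the three AGs actually observed. The offset 88960 corresponds to 711680 in xfs_bmap's 512-byte units, matching the AG-OFFSET column for extent 0.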

Kernel log around resume

kernel: Kernel command line: ... resume_offset=5324800 resume=/dev/nvme0n1p1
kernel: PM: Image not found (code -22)
kernel: PM: genpd: Disabling unused power domains
systemd[1]: Created slice system-systemd\x2dhibernate\x2dresume.slice
kernel: PM: Image not found (code -22)
kernel: PM: Image not found (code -16)
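
To compare cycles, the codes can be tallied from saved kernel logs. A minimal sketch, assuming the message format shown above; count_pm_codes is a hypothetical helper and the sample lines are copied from this report:

```python
import re
from collections import Counter

PM_CODE_RE = re.compile(r"PM: Image not found \(code (-?\d+)\)")


def count_pm_codes(lines):
    """Count occurrences of each 'Image not found' return code."""
    counts = Counter()
    for line in lines:
        match = PM_CODE_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


LOG = [
    "kernel: PM: Image not found (code -22)",
    "kernel: PM: genpd: Disabling unused power domains",
    r"systemd[1]: Created slice system-systemd\x2dhibernate\x2dresume.slice",
    "kernel: PM: Image not found (code -22)",
    "kernel: PM: Image not found (code -16)",
]
print(dict(count_pm_codes(LOG)))  # {-22: 2, -16: 1}
```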

Workaround

Stop + Start (cold boot) recovers reliably every time. Hibernation is effectively unreliable for unattended workloads on this AMI/kernel/filesystem combination.

What would help triage

  • Whether the kernel resume path (pre-swapon) reads the full extent map from the swap header, or assumes contiguity from resume_offset
  • Whether AL2023's hibinit-agent is validated against XFS root volumes where RAM size forces a multi-AG swap file (i.e., RAM > ~4 GiB on default XFS geometry)
  • Whether this combination (XFS root + kernel 6.18 + aarch64 + hibernation) has been tested end-to-end in CI
