Describe the bug
When an EC2 instance running AL2023 (aarch64) is hibernated and then started again, it intermittently fails to resume and becomes unreachable (no SSH, no application response). The instance state shows as running in the EC2 console, but the kernel never completes resume. Recovery requires a Stop followed by Start (a cold boot), at which point the system comes up normally.
The kernel log on the next cold boot shows:
PM: Image not found (code -22)
PM: Image not found (code -16)
code -22 is -EINVAL, indicating the kernel could not parse a valid hibernation image header at the configured resume_offset.
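Both codes in the log lines above are negated Linux errno values. A minimal sketch for decoding them, assuming a Linux host with Python 3 (the helper name decode_pm_code is only illustrative):

```python
import errno
import os


def decode_pm_code(code: int) -> str:
    """Map a negative kernel return code (e.g. -22) to its errno name and message."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{code} = -{name} ({os.strerror(number)})"


# The two codes seen in the "PM: Image not found" lines.
for code in (-22, -16):
    print(decode_pm_code(code))
```

On Linux, 22 maps to EINVAL and 16 maps to EBUSY.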
The /swap file managed by ec2-hibinit-agent is fragmented across 3 separate XFS allocation groups with discontinuous physical block ranges, while the kernel is given only a single resume_offset value. In addition, filefrag and xfs_bmap report different first-block offsets for the same untouched file (details below), which may indicate extent reallocation between the hibernate and resume steps or a difference in the units the two tools report.
A related issue on aws/amazon-ec2-hibinit-agent (#25) flagged use of fallocate on XFS but was closed; that fix may not fully cover the fragmented multi-AG case on a recent kernel.
To Reproduce
Steps to reproduce the behavior:
- Launch an EC2 instance from ami-0bf96732a1c71350f (AL2023 aarch64) on t4g.large (8 GiB RAM), with hibernation enabled at launch and an encrypted root EBS volume. With AL2023's default XFS geometry (agsize ~4 GiB), an 8 GiB swap file cannot fit in a single allocation group and is forced to span multiple AGs.
- Wait for hibinit-agent.service to complete first-boot setup
- Hibernate the instance via the EC2 console or aws ec2 stop-instances --hibernate
- Start the instance
- Observe: instance reaches running state but is unreachable. After recovering with Stop + Start, journalctl -k | grep PM: shows PM: Image not found (code -22)
The failure is intermittent — only some hibernate/resume cycles fail. Stop + Start recovers reliably every time.
Expected behavior
Hibernate followed by Start should reliably resume the instance to its pre-hibernation state, as documented for AL2023 hibernation support.
Screenshots
Desktop (please complete the following information):
Not applicable (server-side issue). Replacing with the relevant environment data:
- AMI: ami-0bf96732a1c71350f
- OS: Amazon Linux 2023
- Kernel: 6.18.20-20.229.amzn2023.aarch64
- Architecture: aarch64 (Graviton)
- Instance type: t4g.large (2 vCPU, 8 GiB RAM, Graviton2)
- Region: eu-central-1
- Hibernation agent: ec2-hibinit-agent-1.0.10-2.amzn2023.noarch
- Root filesystem: XFS (AL2023 default)
- Root volume: 30 GiB EBS, encrypted, ~77% used at time of report
Smartphone (please complete the following information):
Not applicable.
Additional context
Kernel cmdline (resume configuration)
BOOT_IMAGE=(hd0,gpt1)/boot/vmlinuz-6.18.20-20.229.amzn2023.aarch64
root=UUID=911c4ca1-6548-40cd-9ab1-eb37b1abb990 ro
console=tty0 console=ttyS0,115200n8
nvme_core.io_timeout=4294967295
rd.emergency=poweroff rd.shell=0
selinux=1 security=selinux quiet numa_cma=1:64M no_console_suspend=1
resume_offset=5324800 resume=/dev/nvme0n1p1
/sys/power/*
$ cat /sys/power/resume
259:1
$ cat /sys/power/resume_offset
5324800
$ cat /sys/power/disk
[shutdown] reboot test_resume
Swap state
swapon --show and /proc/swaps both report no active swap. This is consistent with hibinit-agent's design (the swap file is used as a reserved disk region only, not via swapon).
/swap file
$ ls -lah /swap
-rw-------. 1 root root 7.7G Apr 24 14:42 /swap
$ findmnt -no FSTYPE /
xfs
filefrag -v /swap (read first)
File size of /swap is 8172470272 (1995232 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 0: 5324800.. 5324800: 1:
1: 1.. 958207: 5324801.. 6283007: 958207: unwritten
xfs_bmap -v /swap (read seconds later, file untouched)
/swap:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
0: [0..7]: 42598400..42598407 5 (711680..711687) 8 000101
1: [8..7665663]: 42598408..50264063 5 (711688..8377343) 7665656 001010
2: [7665664..14975999]: 26199040..33509375 3 (1067008..8377343) 7310336
3: [14976000..15961855]: 12173312..13159167 1 (3795968..4781823) 985856 000101
The file spans 3 different XFS allocation groups (AG 5, 3, 1) with discontinuous physical block ranges. The kernel is only given one resume_offset value.
The filefrag and xfs_bmap outputs were taken seconds apart on the same file with no writes in between, yet they report different first-block physical offsets (5324800 vs 42598400). I'd appreciate maintainer input on whether this is an extent reallocation, a difference in offset reporting between the tools, or something else.
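One possibility worth ruling out first is a pure unit difference. filefrag -v prints physical offsets in filesystem blocks (4096 bytes here, per its header line), whereas xfs_bmap -v prints block ranges in 512-byte units according to its man page. A small sketch that converts the filefrag numbers above into 512-byte units so they can be compared with the xfs_bmap rows directly (names are illustrative):

```python
FS_BLOCK_SIZE = 4096   # from the filefrag header: "blocks of 4096 bytes"
BMAP_UNIT = 512        # xfs_bmap reports 512-byte blocks (per its man page)


def fs_blocks_to_bmap_units(block: int) -> int:
    """Convert an offset in filesystem blocks to 512-byte units."""
    return block * FS_BLOCK_SIZE // BMAP_UNIT


# filefrag extents copied from above: (first physical block, last physical block)
filefrag_extents = [(5324800, 5324800), (5324801, 6283007)]

for first, last in filefrag_extents:
    start = fs_blocks_to_bmap_units(first)
    # The last 512-byte unit of an extent is the end of its last fs block.
    end = fs_blocks_to_bmap_units(last + 1) - 1
    print(f"filefrag {first}..{last} -> {start}..{end} in 512-byte units")
```

With the values in this report the converted ranges are 42598400..42598407 and 42598408..50264063, the same as xfs_bmap extents 0 and 1. So the discrepancy looks like a unit difference rather than an extent reallocation, and resume_offset=5324800 appears to match the file's first 4 KiB block.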
xfs_info /
meta-data=/dev/nvme0n1p1 isize=512 agcount=8, agsize=1047168 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=0
= reflink=1 bigtime=1 inobtcount=1 nrext64=0
data = bsize=4096 blocks=7861499, imaxpct=25
= sunit=128 swidth=128 blks
log =internal log bsize=4096 blocks=16384, version=2
With agsize=1047168 blks (~4 GiB per AG) and a 7.7 GiB swap file, the file cannot fit in a single AG.
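The arithmetic behind that statement, as a rough sketch using only the agsize and bsize from xfs_info and the file size from filefrag above (constant names are illustrative):

```python
import math

AG_SIZE_BLOCKS = 1047168      # agsize from xfs_info
FS_BLOCK_SIZE = 4096          # bsize from xfs_info
SWAP_FILE_BYTES = 8172470272  # file size reported by filefrag

GIB = 1024 ** 3

ag_bytes = AG_SIZE_BLOCKS * FS_BLOCK_SIZE
# Lower bound only: it ignores per-AG metadata and space already in use.
min_ags = math.ceil(SWAP_FILE_BYTES / ag_bytes)

print(f"AG capacity: {ag_bytes / GIB:.2f} GiB")
print(f"swap file:   {SWAP_FILE_BYTES / GIB:.2f} GiB")
print(f"minimum AGs needed: {min_ags}")
```

The lower bound is two AGs. The layout observed here uses three, so the allocation is more fragmented than the geometry alone requires.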
Kernel log around resume
kernel: Kernel command line: ... resume_offset=5324800 resume=/dev/nvme0n1p1
kernel: PM: Image not found (code -22)
kernel: PM: genpd: Disabling unused power domains
systemd[1]: Created slice system-systemd\x2dhibernate\x2dresume.slice
kernel: PM: Image not found (code -22)
kernel: PM: Image not found (code -16)
Workaround
Stop + Start (cold boot) recovers reliably every time. Hibernation is effectively unreliable for unattended workloads on this AMI/kernel/filesystem combination.
What would help triage
- Whether the kernel resume path (pre-swapon) reads the full extent map from the swap header, or assumes contiguity from resume_offset
- Whether AL2023's hibinit-agent is validated against XFS root volumes where RAM size forces a multi-AG swap file (i.e., RAM > ~4 GiB on default XFS geometry)
- Whether this combination (XFS root + kernel 6.18 + aarch64 + hibernation) has been tested end-to-end in CI
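For the first question, one check that might narrow things down is to read the page at resume_offset directly and see which signature it holds. This is a sketch under the assumption that the mainline header layout applies: the signature sits in the last 10 bytes of the page, with S1SUSPEND when a hibernation image is present and SWAPSPACE2 for a plain swap header. The function names and the 4096-byte page size are assumptions, and reading the block device needs root.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB kernel pages
SIG_LEN = 10      # length of the signature field at the end of the header page


def read_resume_signature(device: str, resume_offset: int,
                          page_size: int = PAGE_SIZE) -> bytes:
    """Return the last SIG_LEN bytes of the page at resume_offset (in pages)."""
    with open(device, "rb") as dev:
        dev.seek(resume_offset * page_size)
        page = dev.read(page_size)
    if len(page) != page_size:
        raise ValueError("short read: resume_offset is past the end of the device")
    return page[-SIG_LEN:]


def describe_signature(sig: bytes) -> str:
    """Classify the signature bytes found at the resume offset."""
    if sig.startswith(b"S1SUSPEND"):
        return "hibernation image header present"
    if sig == b"SWAPSPACE2":
        return "plain swap header, no image"
    return f"unrecognised signature: {sig!r}"
```

For example, calling describe_signature(read_resume_signature("/dev/nvme0n1p1", 5324800)) after a failed resume could indicate whether an image header was ever written at that offset.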