Conversation
There was a problem hiding this comment.
Pull request overview
This PR enables discard_unpacked_layers = true in containerd's config during VHD build to reduce VHD size by removing compressed layer data after images are pulled. According to the PR description, this saves approximately 1GB of disk space, reducing /var/lib/containerd from 21GB to 20GB.
Changes:
- Enables
discard_unpacked_layersin containerd config before pulling container images during VHD build - Removes the temporary containerd config after VHD build completes, before generating the VHD
- Adds logging of containerd directory space usage to VHD logs
| # CSE generates the production config at provisioning time. | ||
| if [ -f /etc/containerd/config.toml ]; then | ||
| rm -f /etc/containerd/config.toml | ||
| systemctl restart containerd || true |
There was a problem hiding this comment.
The containerd restart at line 1014 should check the exit status. Currently it uses || true which silently ignores failures. After removing the config.toml, if containerd fails to restart, the VHD build should fail rather than continue with a broken containerd service. Consider using proper error handling or at least logging the failure before continuing.
| systemctl restart containerd || true | |
| if ! systemctl restart containerd; then | |
| echo "Failed to restart containerd after removing /etc/containerd/config.toml" >&2 | |
| exit 1 | |
| fi |
| sed -i 's/discard_unpacked_layers = false/discard_unpacked_layers = true/' /etc/containerd/config.toml | ||
| # If discard_unpacked_layers was not in the default config, add it under the correct section | ||
| if ! grep -q "discard_unpacked_layers = true" /etc/containerd/config.toml; then | ||
| sed -i '/\[plugins\."io\.containerd\.grpc\.v1\.cri"\.containerd\][[:space:]]*$/a\ discard_unpacked_layers = true' /etc/containerd/config.toml |
There was a problem hiding this comment.
The sed command on line 536 uses a regular expression pattern that may not be portable across all OS distributions. The [[:space:]]*$ pattern might behave differently on Ubuntu vs Azure Linux/Mariner. Additionally, the pattern assumes a specific formatting of the TOML file from containerd config default, which could vary across containerd versions. Consider testing this with multiple containerd versions to ensure it works reliably, or use a more robust TOML parser like tomlq or yq to manipulate the config file.
| sed -i '/\[plugins\."io\.containerd\.grpc\.v1\.cri"\.containerd\][[:space:]]*$/a\ discard_unpacked_layers = true' /etc/containerd/config.toml | |
| sed -i '/^\[plugins\."io\.containerd\.grpc\.v1\.cri"\.containerd\]/a\ discard_unpacked_layers = true' /etc/containerd/config.toml |
| # Enable discard_unpacked_layers during VHD build to reduce VHD size. | ||
| # Containerd will discard unpacked layer data after it's been committed to a snapshot. | ||
| # This config is only used during VHD build; CSE overwrites it at provisioning time. | ||
| echo "Enabling discard_unpacked_layers for VHD build" | ||
| mkdir -p /etc/containerd | ||
| containerd config default > /etc/containerd/config.toml | ||
| sed -i 's/discard_unpacked_layers = false/discard_unpacked_layers = true/' /etc/containerd/config.toml | ||
| # If discard_unpacked_layers was not in the default config, add it under the correct section | ||
| if ! grep -q "discard_unpacked_layers = true" /etc/containerd/config.toml; then | ||
| sed -i '/\[plugins\."io\.containerd\.grpc\.v1\.cri"\.containerd\][[:space:]]*$/a\ discard_unpacked_layers = true' /etc/containerd/config.toml | ||
| fi | ||
| systemctl restart containerd | ||
| for i in $(seq 1 30); do | ||
| if ctr version >/dev/null 2>&1; then | ||
| break | ||
| fi | ||
| sleep 1 | ||
| done |
There was a problem hiding this comment.
This change lacks test coverage. The VHD build process now enables discard_unpacked_layers, which fundamentally changes how containerd handles layer data during VHD build. This should have:
- E2E tests to verify VHD images work correctly after this change
- Tests to ensure the CoreDNS binary extraction still works (since it depends on image mount operations)
- Tests to verify the containerd config cleanup doesn't break provisioning
Consider adding test cases in the e2e test suite to validate these behaviors.
| fi | ||
| sleep 1 | ||
| done | ||
|
|
There was a problem hiding this comment.
The containerd restart wait loop lacks error handling for the timeout case. After 30 seconds (lines 539-544), if ctr version never succeeds, the script continues anyway. This could lead to subsequent operations failing in confusing ways. Add explicit error handling after the loop:
if ! ctr version >/dev/null 2>&1; then
echo "ERROR: containerd failed to become ready after 30 seconds"
exit $ERR_CONTAINERD_INSTALL_TIMEOUT
fi| if ! ctr version >/dev/null 2>&1; then | |
| echo "ERROR: containerd failed to become ready after 30 seconds" | |
| exit $ERR_CONTAINERD_INSTALL_TIMEOUT | |
| fi |
| # Remove the temporary VHD-build-time containerd config so it doesn't persist in the VHD. | ||
| # CSE generates the production config at provisioning time. | ||
| if [ -f /etc/containerd/config.toml ]; then | ||
| rm -f /etc/containerd/config.toml | ||
| systemctl restart containerd || true | ||
| fi |
There was a problem hiding this comment.
The comment states "CSE generates the production config at provisioning time" but the config removal (lines 1012-1015) happens unconditionally. This means if any future code in this script tries to use containerd after line 1015, it will be running with no config file or potentially stale config. Consider:
- Verifying no code after this point depends on containerd operations
- Adding a comment warning that no containerd operations should happen after this point
- Or keeping the config and letting CSE overwrite it (as the comment at line 529 claims happens anyway)
| sed -i 's/discard_unpacked_layers = false/discard_unpacked_layers = true/' /etc/containerd/config.toml | ||
| # If discard_unpacked_layers was not in the default config, add it under the correct section | ||
| if ! grep -q "discard_unpacked_layers = true" /etc/containerd/config.toml; then | ||
| sed -i '/\[plugins\."io\.containerd\.grpc\.v1\.cri"\.containerd\][[:space:]]*$/a\ discard_unpacked_layers = true' /etc/containerd/config.toml |
There was a problem hiding this comment.
The sed replacement on line 533 only works if containerd's default config contains the exact text "discard_unpacked_layers = false". If the default config format changes in future containerd versions (e.g., uses different spacing, "= false" becomes "=false", or the setting is commented out), this replacement will silently fail, and the fallback insertion at line 536 will be needed. However, if both fail, the script continues without the intended optimization and without any error indication.
Consider making the logic more robust:
- Always check after both sed commands whether the setting is present
- Exit with an error if neither method succeeded
- Log which method succeeded for debugging purposes
| sed -i 's/discard_unpacked_layers = false/discard_unpacked_layers = true/' /etc/containerd/config.toml | |
| # If discard_unpacked_layers was not in the default config, add it under the correct section | |
| if ! grep -q "discard_unpacked_layers = true" /etc/containerd/config.toml; then | |
| sed -i '/\[plugins\."io\.containerd\.grpc\.v1\.cri"\.containerd\][[:space:]]*$/a\ discard_unpacked_layers = true' /etc/containerd/config.toml | |
| # First, try to flip the default value from false to true if present. | |
| sed -i 's/discard_unpacked_layers = false/discard_unpacked_layers = true/' /etc/containerd/config.toml | |
| # Check if the setting is now present and enabled (allowing for flexible whitespace). | |
| if grep -q 'discard_unpacked_layers[[:space:]]*=[[:space:]]*true' /etc/containerd/config.toml; then | |
| echo "discard_unpacked_layers enabled via replacement in default containerd config" | |
| else | |
| echo "discard_unpacked_layers not found after replacement; attempting insertion into containerd config" | |
| # If discard_unpacked_layers was not in the default config, add it under the correct section. | |
| sed -i '/\[plugins\."io\.containerd\.grpc\.v1\.cri"\.containerd\][[:space:]]*$/a\ discard_unpacked_layers = true' /etc/containerd/config.toml | |
| # Verify that the insertion succeeded. | |
| if grep -q 'discard_unpacked_layers[[:space:]]*=[[:space:]]*true' /etc/containerd/config.toml; then | |
| echo "discard_unpacked_layers enabled via insertion into containerd config" | |
| else | |
| echo "Error: failed to enable discard_unpacked_layers in /etc/containerd/config.toml" >&2 | |
| exit 1 | |
| fi |
| systemctl restart containerd || true | ||
| fi | ||
|
|
||
| echo "containerd directory space: $(du -sh /var/lib/containerd)" >> ${VHD_LOGS_FILEPATH} |
There was a problem hiding this comment.
The PR description states this change "saves 1GB" but doesn't specify what the baseline was. The log at line 1017 will help track this metric going forward, but consider also logging the space usage BEFORE enabling discard_unpacked_layers to better quantify the actual savings in VHD build logs. This would help validate whether the complexity and risk of this change is worth the space savings, especially given the critical bug with CoreDNS binary extraction.
| # Enable discard_unpacked_layers during VHD build to reduce VHD size. | ||
| # Containerd will discard unpacked layer data after it's been committed to a snapshot. | ||
| # This config is only used during VHD build; CSE overwrites it at provisioning time. | ||
| echo "Enabling discard_unpacked_layers for VHD build" | ||
| mkdir -p /etc/containerd | ||
| containerd config default > /etc/containerd/config.toml | ||
| sed -i 's/discard_unpacked_layers = false/discard_unpacked_layers = true/' /etc/containerd/config.toml | ||
| # If discard_unpacked_layers was not in the default config, add it under the correct section | ||
| if ! grep -q "discard_unpacked_layers = true" /etc/containerd/config.toml; then | ||
| sed -i '/\[plugins\."io\.containerd\.grpc\.v1\.cri"\.containerd\][[:space:]]*$/a\ discard_unpacked_layers = true' /etc/containerd/config.toml | ||
| fi | ||
| systemctl restart containerd | ||
| for i in $(seq 1 30); do | ||
| if ctr version >/dev/null 2>&1; then | ||
| break | ||
| fi | ||
| sleep 1 | ||
| done | ||
|
|
There was a problem hiding this comment.
Critical: Setting discard_unpacked_layers = true before pulling images will cause the extractAndCacheCoreDnsBinary function to fail. The function at line 976 uses ctr -n k8s.io images mount to extract the CoreDNS binary, but this operation requires the unpacked layer data that containerd will have discarded. When discard_unpacked_layers is enabled, containerd removes unpacked layer data after committing to snapshots, making it impossible to mount images for filesystem access later.
This configuration should be enabled AFTER extractAndCacheCoreDnsBinary completes (after line 1008), not before image pulls. Move this entire block to after line 1008 to avoid breaking the CoreDNS binary extraction.
| # Enable discard_unpacked_layers during VHD build to reduce VHD size. | |
| # Containerd will discard unpacked layer data after it's been committed to a snapshot. | |
| # This config is only used during VHD build; CSE overwrites it at provisioning time. | |
| echo "Enabling discard_unpacked_layers for VHD build" | |
| mkdir -p /etc/containerd | |
| containerd config default > /etc/containerd/config.toml | |
| sed -i 's/discard_unpacked_layers = false/discard_unpacked_layers = true/' /etc/containerd/config.toml | |
| # If discard_unpacked_layers was not in the default config, add it under the correct section | |
| if ! grep -q "discard_unpacked_layers = true" /etc/containerd/config.toml; then | |
| sed -i '/\[plugins\."io\.containerd\.grpc\.v1\.cri"\.containerd\][[:space:]]*$/a\ discard_unpacked_layers = true' /etc/containerd/config.toml | |
| fi | |
| systemctl restart containerd | |
| for i in $(seq 1 30); do | |
| if ctr version >/dev/null 2>&1; then | |
| break | |
| fi | |
| sleep 1 | |
| done |
This PR sets
discard_unpacked_layers = truein containerd before pulling images, which removes all the compressed filesi checked the VM and it doesnt save much all it saves is 1GB
I checked that we dont store
.tarfiles anymore using content lsWhat this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #