Base Commit: 7e655bc
Review Date: January 7, 2026
Branch: cloud66-deployment
- deleter.js - Enhanced error handling and cache cleanup
- src/middlewares/common.js - Added retry logic for 5xx errors and request timeout
- src/middlewares/static.js - Minor fixes
- TROUBLESHOOTING.md - Documentation updates
- scripts/ - Multiple monitoring and maintenance scripts
- systemd/ - Systemd service and timer files
- doc/ - New documentation files
- README_DISK_SPACE.md - Disk space management guide
Changes:
- Added retry logic for HTTP 5xx errors (502, 503, 504)
- Improved cache cleanup (removes files older than 7 days instead of clearing everything)
- Better error logging for HTTP status codes
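The retry behavior listed above can be sketched as follows. This is a minimal illustration, not the code in deleter.js: the helper name `withRetry`, the `maxRetries` and `baseDelayMs` options, the exponential backoff, and the exact log wording are all assumptions. The cap on attempts is what bounds the extra API load, since a request that always fails costs at most `maxRetries + 1` calls.

```javascript
// Hypothetical helper: retry an async call while it returns a retryable 5xx status.
const RETRYABLE_STATUSES = new Set([502, 503, 504]);

function isRetryableStatus(status) {
  return RETRYABLE_STATUSES.has(status);
}

// `request` is any function returning a promise of an object with a `status` field.
async function withRetry(request, { maxRetries = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; ; attempt += 1) {
    const response = await request();
    // Stop on success, on a non-retryable status, or once the retry budget is spent.
    if (!isRetryableStatus(response.status) || attempt >= maxRetries) {
      return response;
    }
    // Exponential backoff (assumed): baseDelayMs, 2x, 4x, ...
    const delayMs = baseDelayMs * 2 ** attempt;
    console.warn(`🔄 Retrying after HTTP ${response.status} (retry ${attempt + 1}/${maxRetries})`);
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}
```

A caller would wrap its existing request function, for example `await withRetry(() => callApi(url))`, where `callApi` stands in for whatever function deleter.js actually uses.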
Risk Assessment:
- ✅ Low Risk: Retry logic is additive, doesn't change existing behavior
- ✅ Low Risk: Cache cleanup is more conservative (7 days vs immediate)
- ⚠️ Medium Risk: New retry logic could increase API load if 5xx errors are frequent
- ✅ Benefit: Better resilience to transient API failures
Testing Needed:
- Verify retry behavior doesn't cause excessive API calls
- Verify cache cleanup works correctly
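To make the cache cleanup check concrete, here is a small sketch of an age-based cleanup. It assumes the cache is a flat directory of files and that age is judged by modification time; `cacheDir`, `isExpired`, and `cleanOldCacheFiles` are hypothetical names, and the actual logic in deleter.js may differ.

```javascript
const fs = require("fs");
const path = require("path");

const MAX_AGE_MS = 7 * 24 * 60 * 60 * 1000; // 7 days, per the change description

// Pure predicate so the age rule can be tested without touching the disk.
function isExpired(mtimeMs, nowMs = Date.now(), maxAgeMs = MAX_AGE_MS) {
  return nowMs - mtimeMs > maxAgeMs;
}

// Removes regular files in cacheDir whose mtime is older than 7 days.
// Returns the number of files removed. Subdirectories are left alone.
function cleanOldCacheFiles(cacheDir, nowMs = Date.now()) {
  let removed = 0;
  for (const entry of fs.readdirSync(cacheDir, { withFileTypes: true })) {
    if (!entry.isFile()) continue;
    const filePath = path.join(cacheDir, entry.name);
    if (isExpired(fs.statSync(filePath).mtimeMs, nowMs)) {
      fs.unlinkSync(filePath);
      removed += 1;
    }
  }
  return removed;
}
```

Because the rule keys on modification time, a file that is read often but never rewritten would still be removed after 7 days. That is one case worth covering when verifying the cleanup.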
Changes:
- Added retry logic for HTTP 5xx errors in API calls
- Added 30-second timeout for site metadata loading
- Improved error handling and logging
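One common way to add a timeout like the one listed above is to race the pending operation against a timer. The sketch below assumes that approach. `withTimeout` and `METADATA_TIMEOUT_MS` are illustrative names, and common.js may implement the timeout differently.

```javascript
const METADATA_TIMEOUT_MS = 30_000; // 30 seconds, per the change description

// Rejects if `promise` has not settled within `timeoutMs`.
function withTimeout(promise, timeoutMs = METADATA_TIMEOUT_MS, label = "operation") {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timeout after ${timeoutMs} ms`)),
      timeoutMs
    );
  });
  // Clear the timer in every case so a fast response leaves no pending timer behind.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Note that racing only stops the caller from waiting; the underlying request keeps running unless it is also cancelled (for example with an AbortController).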
Risk Assessment:
- ✅ Low Risk: Timeout prevents hanging requests (critical fix)
- ✅ Low Risk: Retry logic is additive
- ⚠️ Medium Risk: 30-second timeout might be too short for slow API responses
- ✅ Benefit: Prevents 504 Gateway Timeout errors from hanging requests
Testing Needed:
- Verify timeout works correctly
- Verify retries don't cause excessive API load
- Test with slow API responses
Changes:
- Fixed bug: Moved res.writeHead() before writing data
- Headers must be written before response data (HTTP protocol requirement)
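The ordering rule can be illustrated with a short sketch. The function name `sendStaticFile` and its parameters are hypothetical, since the real handler in static.js is not shown in this review. In Node's http module, the first body write flushes the headers, so a `writeHead()` call placed after it can no longer set the status or headers.

```javascript
// `res` is expected to be an http.ServerResponse (or anything with the same methods).
function sendStaticFile(res, body, contentType = "text/html") {
  // Status line and headers go first ...
  res.writeHead(200, {
    "Content-Type": contentType,
    "Content-Length": Buffer.byteLength(body),
  });
  // ... then the body.
  res.end(body);
}
```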
Risk Assessment:
- ✅ Low Risk: Bug fix, ensures headers are sent correctly
- ✅ Benefit: Prevents potential HTTP protocol violations
- ✅ No Breaking Changes: Just fixes order of operations
New Files:
- scripts/nginx-health-monitor-updated.sh - Enhanced health monitor
- scripts/nginx-auto-restart.sh - Auto-restart for stuck nginx
- scripts/disk-space-cleanup.sh - Disk space management
- scripts/post-deploy-fix.sh - Post-deployment fixes
- scripts/verify-monitoring-setup.sh - Setup verification
- systemd/*.service and systemd/*.timer - Systemd configuration
Risk Assessment:
- ✅ Low Risk: Scripts are operational, don't affect application code
- ✅ Benefit: Automated monitoring and recovery
- ⚠️ Note: Scripts need to be deployed to /opt/forge-scripts/ on the server
| Component | Risk Level | Impact | Mitigation |
|---|---|---|---|
| deleter.js | MEDIUM | API retry logic | Monitor API call frequency |
| common.js | MEDIUM | Request timeout | Test timeout behavior |
| static.js | LOW | Minor changes | Standard testing |
| Scripts | LOW | Operational only | Deploy separately |
Overall Risk: MEDIUM - Changes are mostly additive improvements with timeout protection
- Retry logic for 5xx errors (deleter.js, common.js)
- Request timeout protection (common.js)
- Improved cache cleanup (deleter.js)
- Enhanced error logging
- Test API retry behavior with simulated 502 errors
- Test request timeout with slow API responses
- Verify cache cleanup doesn't remove active cache files
- Test error handling paths
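For the simulated 502 and slow-response tests, a local stub server is one option. The sketch below is a hypothetical test aid, not part of the change set: `createFlakyServer`, `failures`, and `delayMs` are made-up names. It answers 502 for the first few requests and 200 afterwards, and can delay every response to imitate a slow API.

```javascript
const http = require("http");

// Hypothetical test stub: responds 502 to the first `failures` requests, then 200.
// `delayMs` postpones every response to imitate a slow upstream API.
function createFlakyServer({ failures = 2, delayMs = 0 } = {}) {
  let seen = 0;
  const server = http.createServer((req, res) => {
    seen += 1;
    const requestNumber = seen; // capture now; `seen` may change before the timer fires
    const status = requestNumber <= failures ? 502 : 200;
    setTimeout(() => {
      res.writeHead(status, { "Content-Type": "application/json" });
      res.end(JSON.stringify({ request: requestNumber, status }));
    }, delayMs);
  });
  // Lets a test assert how many calls the client actually made.
  server.requestCount = () => seen;
  return server;
}
```

Pointing the application at this server (on a port chosen by `server.listen(0)`) and then reading `server.requestCount()` gives a direct measure of how many calls the retry logic made.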
# Stage production code changes
git add deleter.js src/middlewares/common.js src/middlewares/static.js TROUBLESHOOTING.md .gitignore
# Commit with descriptive message
git commit -m "feat: Add 5xx error retry logic and request timeout protection
- Add retry logic for HTTP 5xx errors in deleter.js and common.js
- Add 30-second timeout for site metadata loading to prevent hanging requests
- Improve cache cleanup to remove files older than 7 days
- Enhanced error logging for better debugging
- Fixes: 504 Gateway Timeout errors from hanging requests"
# Stage operational scripts and documentation
git add scripts/ systemd/ doc/ README_DISK_SPACE.md
# Commit separately
git commit -m "feat: Add monitoring and maintenance scripts
- Add nginx health monitor with stuck worker detection
- Add nginx auto-restart script for stuck workers
- Add disk space cleanup script and systemd timer
- Add post-deploy fix script
- Add monitoring setup verification script
- Add comprehensive documentation"
- Push to cloud66-deployment branch
- Cloud66 will automatically deploy code changes
- Note: Scripts need manual deployment to /opt/forge-scripts/
After Cloud66 deployment completes:
# Copy updated health monitor
sudo cp /app/scripts/nginx-health-monitor-updated.sh /opt/forge-scripts/nginx-health-monitor.sh
sudo chmod +x /opt/forge-scripts/nginx-health-monitor.sh
# Verify it has the container detection fix
grep "docker ps --format" /opt/forge-scripts/nginx-health-monitor.sh
# Copy systemd files
sudo cp /app/systemd/nginx-health-monitor.* /etc/systemd/system/
sudo systemctl daemon-reload
# Enable and start timer
sudo systemctl enable nginx-health-monitor.timer
sudo systemctl start nginx-health-monitor.timer
# Verify
sudo systemctl status nginx-health-monitor.timer
# Copy auto-restart script
sudo cp /app/scripts/nginx-auto-restart.sh /opt/forge-scripts/
sudo chmod +x /opt/forge-scripts/nginx-auto-restart.sh
# Add to crontab (runs every 5 minutes)
sudo crontab -e
# Add: */5 * * * * /opt/forge-scripts/nginx-auto-restart.sh >/dev/null 2>&1
# Run verification script
sudo /app/scripts/verify-monitoring-setup.sh
# Or manually test
sudo /opt/forge-scripts/nginx-health-monitor.sh
If issues occur after deployment:
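# Revert the most recent commit. If both commits above were pushed, HEAD is the
# scripts commit and the application changes are in HEAD~1; revert that one instead.
git revert HEAD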
git push origin cloud66-deployment
# Cloud66 will redeploy automatically
- The changes are mostly independent
- Can disable problematic features via environment variables if needed
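As an illustration of switching features off through environment variables, the sketch below reads a boolean flag and a numeric setting with safe defaults. The variable names `ENABLE_5XX_RETRY` and `METADATA_TIMEOUT_MS` are hypothetical; this review does not name the variables the application actually reads.

```javascript
// Interprets common truthy spellings; anything else counts as "off".
function readFlag(env, name, defaultValue) {
  const raw = env[name];
  if (raw === undefined || raw === "") return defaultValue;
  return ["1", "true", "yes", "on"].includes(raw.trim().toLowerCase());
}

// Falls back to the default when the value is missing, non-numeric, or not positive.
function readPositiveInt(env, name, defaultValue) {
  const parsed = Number.parseInt(env[name], 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : defaultValue;
}

// Hypothetical variable names, with defaults matching the behavior described in this review.
const config = {
  retryEnabled: readFlag(process.env, "ENABLE_5XX_RETRY", true),
  metadataTimeoutMs: readPositiveInt(process.env, "METADATA_TIMEOUT_MS", 30000),
};
```

With flags like these, a misbehaving feature could be disabled by changing the environment configuration and redeploying, without reverting code.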
- API Call Frequency - Ensure retry logic doesn't cause excessive calls
- Request Timeouts - Monitor for timeout errors in logs
- Cache Cleanup - Verify old cache files are being removed
- Nginx Worker Status - Check for stuck workers
- Error Rates - Monitor 502/504 error rates
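The checks above can also be turned into counts rather than eyeballed. The sketch below tallies log lines by the same markers this document greps for (timeout, 🔄 Retrying, 🧹 Cleaned). The function name `summarizeLogs` is hypothetical, and the sample log lines in the usage note are invented to show the shape of the input.

```javascript
// Patterns mirror the grep expressions used in this document.
const LOG_PATTERNS = {
  timeouts: /timeout/i,
  retries: /🔄 Retrying/,
  cleanups: /🧹 Cleaned/,
};

// Takes raw log text and returns how many lines match each pattern.
function summarizeLogs(text) {
  const counts = { timeouts: 0, retries: 0, cleanups: 0 };
  for (const line of text.split(/\r?\n/)) {
    for (const [key, pattern] of Object.entries(LOG_PATTERNS)) {
      if (pattern.test(line)) counts[key] += 1;
    }
  }
  return counts;
}
```

For example, `summarizeLogs("🔄 Retrying (1/3)\nRequest timeout")` returns `{ timeouts: 1, retries: 1, cleanups: 0 }`. A retry count that keeps rising alongside the 502/504 rate would be the signal to revisit the retry settings.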
# Watch for timeout errors
docker logs <container> 2>&1 | grep -i "timeout"
# Watch for retry activity
docker logs <container> 2>&1 | grep "🔄 Retrying"
# Watch for cache cleanup
docker logs <container> 2>&1 | grep "🧹 Cleaned"
Code Changes: ✅ Ready for deployment
Risk Level: MEDIUM (acceptable with monitoring)
Recommendation: Deploy with monitoring
Next Steps:
- Review and approve this document
- Commit changes
- Deploy via Cloud66
- Set up scripts on server
- Monitor for 24 hours