
Changes the value of config vm_memballoon_stats_period to 60 #8520

Draft
RodrigoDLopez wants to merge 2 commits into apache:4.22 from RodrigoDLopez:restore_vm_memballoon_stats_period_config_default_value

Conversation

@RodrigoDLopez
Contributor

@RodrigoDLopez RodrigoDLopez commented Jan 16, 2024

Description

Currently, the default value of the vm_memballoon_stats_period configuration is 0. As a result, out of the box in ACS, memory metrics for Windows instances (with virtio drivers correctly installed and configured) are compromised when using the KVM hypervisor. This pull request proposes setting the vm_memballoon_stats_period configuration value to 60. This correction addresses situations such as the one reported in issue #8453.

fixes: #8453
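For operators hitting this on existing deployments before a new default lands, the workaround discussed in #8453 is to set the period explicitly in each KVM host's agent configuration (the value shown is the one proposed by this PR; a cloudstack-agent restart is typically needed for the change to take effect):

```properties
# /etc/cloudstack/agent/agent.properties (per KVM host)
# A non-zero period enables periodic memballoon statistics collection;
# 0 disables it, which is what breaks memory metrics for Windows guests.
vm.memballoon.stats.period=60
```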

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

  • Before changes: (screenshot)

  • After changes: (screenshot)

How Has This Been Tested?

A Windows VM was created, and it was confirmed that the VirtIO drivers were correctly installed. However, upon inspecting the metrics collected by ACS, it was observed that they did not align with the actual consumption of the VM. Subsequently, after changing the configuration value from 0 to 60, the metrics collected by ACS matched the consumption values reported by the VM.

@codecov

codecov bot commented Jan 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (c3b77cb) 18.28% compared to head (fa25d98) 30.80%.
Report is 5 commits behind head on main.

Additional details and impacted files
@@              Coverage Diff              @@
##               main    #8520       +/-   ##
=============================================
+ Coverage     18.28%   30.80%   +12.52%     
- Complexity    16826    33998    +17172     
=============================================
  Files          4848     5341      +493     
  Lines        324301   375027    +50726     
  Branches      45561    54554     +8993     
=============================================
+ Hits          59290   115525    +56235     
+ Misses       256437   244211    -12226     
- Partials       8574    15291     +6717     
Flag Coverage Δ
simulator-marvin-tests 24.68% <0.00%> (+6.39%) ⬆️
uitests 4.39% <ø> (?)
unit-tests 16.50% <100.00%> (?)

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@DaanHoogland
Contributor

@andrijapanicsb @alexandremattioli any ideas on this?

@weizhouapache
Member

This value can be set manually, or by Ansible/Chef/Puppet.
I prefer to let users decide.

@andrijapanicsb
Contributor

Many of the global settings are set to a certain value, many of them to 0, and can be easily changed (like most other global settings). I prefer to keep it as it is and allow the operator to decide what to change.

@RodrigoDLopez
Contributor Author

The value of this configuration was originally proposed as 60 to avoid misunderstandings such as those operators are currently experiencing. However, this value was later modified, and the change was not clearly communicated to operators. As a result, numerous users may be experiencing the same issue reported in #8453. It might be a good idea to follow the steps taken with another configuration in 6c9e8a0, where the value of vm.stats.max.retention.time was changed from 0 to 720 for the sake of simplicity for users/operators. Therefore, I suggest doing the same here; if operators wish, they can change the value back to 0.

BTW: using the value 0 makes the monitoring stop working in KVM.

@weizhouapache
Member

The value of this configuration was originally proposed with 60 to avoid misunderstandings as the one that are happening with operators. However, this value has been modified, and there is no information regarding this change communicated clearly for operators. As a result, numerous users may be experiencing the same issue reported in #8453. It might be a good idea to follow the steps taken with this other configurations 6c9e8a0, where the value of vm.stats.max.retention.time configuration was changed from 0 to 720 for the sake of simplicity for users/operators. Therefore, I suggest doing the same here, and if operators wish, they can modify the value to 0.

BTW: using the value 0 makes the monitoring stop working in KVM.

We are free to change the default value in a PR, or before an official release. Once the configuration is included in an official release, we should be careful about changing the default value. @RodrigoDLopez

@weizhouapache
Member

BTW: using the value 0 makes the monitoring stop working in KVM.

It is worth mentioning that I faced an issue caused by it in the past: #8453 (comment)

Although @shaerul said it worked well in his testing, I think it would be better to enable/disable memory ballooning and statistics at the VM level.

@andrijapanicsb
Contributor

(For what it's worth: I was told by a RH engineer in 2019 that memory auto-ballooning is an abandoned project, so be careful with this feature.)

@RodrigoDLopez changed the title from "Restores value of config vm_memballoon_stats_period back to 60" to "Changes the value of config vm_memballoon_stats_period to 60" on Jan 18, 2024
@RodrigoDLopez
Contributor Author

I believe that metrics monitoring should work without operators having to worry about configurations and enabling/disabling the feature. Changing the configuration value from '60' to '0' compromised the functionality for the KVM hypervisor, which in turn can make it seem as if it does not work out of the box. An example of this is operators lacking visibility into the necessity of adding this configuration to the agent.properties file (see: #8453 (comment)).

From my perspective, using the value '60' resolves issues similar to those faced by @shaerul in #8453. What are the main impacts you foresee with this change?


Regarding the issue encountered during the migration of a Windows instance with the memory balloon enabled (#8453 (comment)), could you explain the error scenario in detail? This would allow me to attempt to reproduce locally and, if necessary, extend this PR to address the issue related to the migration of Windows VMs with the memory balloon feature enabled.

@weizhouapache
Member

I believe that metrics monitoring should work without operators having to worry about configurations and enabling/disabling the feature. Changing the configuration value from '60' to '0' compromised the functionality for the KVM hypervisor; which in turn, can make it seem as it does not work out of the box. An example of this is operators lacking visibility regarding the necessity of adding this configuration in the agent.properties file (see: #8453 (comment)).

In my opinion, the wrong memory usage is not a critical issue for operators.
I believe most cloud operators use automation tools (Chef, Ansible, Puppet, etc.), so it is very easy to apply the change on all KVM hosts.

From my perspective, using the value '60' resolves issues similar to those faced by @shaerul in #8453. What are the main impacts you foresee with this change?

As I said, I have faced an issue in the past. I am not sure whether the issue was caused by the Windows virtio drivers or by qemu/libvirt.
Anyway, we do not need to make the decision for users.
There is already a configuration for users to enable/disable it or set a different value.

Regarding the issue encountered during the migration of a Windows instance with the memory balloon enabled (#8453 (comment)), could you explain the error scenario in detail? This would allow me to attempt to reproduce locally and, if necessary, extend this PR to address the issue related to the migration of Windows VMs with the memory balloon feature enabled.

I do not remember the details very clearly. The KVM host was Ubuntu 18 or 20.
The issue was difficult to reproduce; it happened to some VMs with virtio drivers, large memory, and memory-intensive applications.
But I am sure it was caused by memory stats, because the issue was gone after disabling memory stats on the Windows VMs.

cc @RodrigoDLopez

Contributor

@GutoVeronezi GutoVeronezi left a comment


@andrijapanicsb @weizhouapache, what are your concerns with this change? Without it, a nice feature in ACS that enables users to monitor their VMs via the UI does not work out of the box for KVM deployments, which can cause disappointment and misunderstandings like the ones already noticed on the mailing list and in GitHub issues.

@weizhouapache
Member

what are your concerns with this change? Without it, a nice feature in ACS, that enables users to monitor users' VMs via UI does not work out of the box for KVM deployments, which can cause disappointments and misunderstandings as the ones already noticed in the mailing list and issues in Github.

Can you please see my comments above? @GutoVeronezi

@GutoVeronezi
Contributor

can please see my comments above ? @GutoVeronezi

@weizhouapache, you said you faced an issue in the past, but did not specify its details or how to reproduce it. On the other hand, you said you are not sure whether the issue was caused by the virtio drivers; therefore, we do not have a solid reason not to change it.

@weizhouapache
Member

can please see my comments above ? @GutoVeronezi

@weizhouapache you said you have faced an issue in the past, but did not specify how to reproduce it or its details. On the other hand, you said that you are not sure if the issue was caused by the virtio drivers; therefore, we do not have a solid reason to not change it.

@GutoVeronezi
I have mentioned this, but I am sure it was caused by memory stats, because the issue was gone after disabling memory stats on the Windows VMs.

Anyone who wants to enable memory stats collection and is happy to take the risk, go for it. ACS already provides the option to do it.

@GutoVeronezi
Contributor

@GutoVeronezi I have mentioned this but I am sure it was caused by memory stat, because the issue is gone after disabling memory stat on Windows vms.

Anyone who wants to enable the memory stats collection and happy to take the risky, go for it. ACS has already provided the option to do it.

@weizhouapache do you have steps to reproduce it?

@weizhouapache
Member

@GutoVeronezi I have mentioned this but I am sure it was caused by memory stat, because the issue is gone after disabling memory stat on Windows vms.
Anyone who wants to enable the memory stats collection and happy to take the risky, go for it. ACS has already provided the option to do it.

@weizhouapache do you have steps to reproduce it?

@GutoVeronezi
You know some issues are difficult to reproduce.
The issue could be related to the template (Windows version, virtio drivers, etc.) and the host (Linux OS version, qemu/libvirt version), etc.
I did not face any issue with Linux VMs caused by it, so I made some changes to enable it on Linux VMs but disable it for Windows VMs.

@GutoVeronezi
Contributor

@GutoVeronezi you know some issues are difficult to reproduce. the issue could be related to the template (Windows version, virtio drivers, etc), and the host (Linux OS version, qemu/libvirt version), etc I did not face any issue with Linux VMs caused by it. so I made some changes to enable it on Linux VMs, but disable it for Windows VMs.

But then we do not have a solid reason not to change the value, just speculation. If we could at least reproduce the error you say exists, we could look for the real cause and solve the problem (or at least find an explanation).

@weizhouapache
Member

@GutoVeronezi you know some issues are difficult to reproduce. the issue could be related to the template (Windows version, virtio drivers, etc), and the host (Linux OS version, qemu/libvirt version), etc I did not face any issue with Linux VMs caused by it. so I made some changes to enable it on Linux VMs, but disable it for Windows VMs.

But then we do not have a solid reason to not change the value, just speculations. If at least we could reproduce the error you saying that exist, we could look for the real cause and solve the problem (or at least find an explanation).

@GutoVeronezi
If you are running production, you might know: when we face an issue (a Windows VM got frozen after live migration, in my case), the first thing we need to do is find the workaround that fixes the issue and then avoid it (in my case, the workaround was disabling memory stats).
Root cause? In my case, we would need to dive into the source code of QEMU and the virtio driver. The root cause is not important to me, as I can accept the problem of a Windows VM showing wrong memory usage (100%), which is a minor issue, especially compared with the issue of a Windows VM getting frozen.

Changing vm_memballoon_stats_period to 60 will add a new setting to the VM definition, which will probably cause some regressions that are out of our control. We should disable it by default.
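For context on what the "new setting in the vm definition" refers to: with a non-zero period, the KVM agent asks libvirt to poll balloon statistics, which appears in the guest's domain XML roughly as below (a sketch based on the libvirt domain format, not copied from this PR):

```xml
<!-- memballoon device with a 60-second stats polling period -->
<memballoon model='virtio'>
  <stats period='60'/>
</memballoon>
```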

@GutoVeronezi
Contributor

changing vm_memballoon_stats_period to 60, will add a new setting in the vm definition, which will probably cause some regression which is out of our control. We should disable it by default.

But again, it is speculation. We have solid use cases of operators having to contact the community to understand why their Windows VMs show wrong stats; and every time, the solution is to change agent.properties. However, we do not have use cases that corroborate the situation you said you faced a long time ago. Indeed, we have use cases showing that the aforementioned situation does not happen; see #8453 (comment). Therefore, we do not have a solid use case/reason not to change the value.

Please bring solid use cases so we can discuss it further.

@weizhouapache
Member

changing vm_memballoon_stats_period to 60, will add a new setting in the vm definition, which will probably cause some regression which is out of our control. We should disable it by default.

But again, it is a speculation. We have solid use cases of operators having to contact the community to understand why their Windows VMs are with wrong stats; and every time, the solution is to change the agent.properties. However, we do not have use cases that corroborate with the situation you said you faced a long time ago. Indeed, we do have use cases that shows that the aforementioned situation does not happen, vide #8453 (comment). Therefore, we do not have a solid use case/reason to not change the value.

Please, bring solid use cases so we can discuss it further.

Interesting.
Your use case is solid; my issue is not.

@sureshanaparti sureshanaparti added this to the 4.21.0 milestone Jun 5, 2025
@weizhouapache weizhouapache removed this from the 4.21.0 milestone Aug 29, 2025
@DaanHoogland
Contributor

@weizhouapache @GutoVeronezi , is this still worth discussing?

@DaanHoogland DaanHoogland marked this pull request as draft October 10, 2025 08:19
@leonyonz

leonyonz commented Oct 22, 2025

Is the workaround that @weizhouapache mentioned already implemented in the newest version? I'm facing the same issue on CloudStack 4.19.1.1 and fixed it with the provided workaround.

I have not experienced any VM frozen problems during live migration after implementing the workaround.

@weizhouapache
Member

Is workaround that @weizhouapache mention already implement on newest version ? im facing same issues on CloudStack 4.19.1.1 and fix it with the workaround provide.

I have not experienced any VM frozen problems during live migration after implementing the workaround.

@leonyonz
If there is no recent issue with Windows VM migration when memory ballooning is enabled, the issue might have been solved in qemu.

feel free to continue @RodrigoDLopez @GutoVeronezi
I have no objections

@daviftorres
Contributor

Dear @RodrigoDLopez,

Thank you for taking the initiative to define a useful default value for the vm.memballoon.stats.period variable.

After identifying the configuration tweaks that resolved our previous issues, I documented them using Ansible to ensure parameters are consistently set as needed. For example:

    - name: Agent configuration customization - Stats period.
      ansible.builtin.lineinfile:
        path: /etc/cloudstack/agent/agent.properties
        regexp: '^vm.memballoon.stats.period'
        line: 'vm.memballoon.stats.period=5'
        state: present
      ignore_errors: true

I’d also like to suggest a few additional defaults that could be beneficial to consider:

    - name: Agent configuration customization - CPU mode.
      ansible.builtin.lineinfile:
        path: /etc/cloudstack/agent/agent.properties
        regexp: '^guest.cpu.mode'
        line: 'guest.cpu.mode=host-passthrough'
        state: present
      ignore_errors: true

    - name: Agent configuration customization - CPU features.
      ansible.builtin.lineinfile:
        path: /etc/cloudstack/agent/agent.properties
        regexp: '^guest.cpu.features'
        line: 'guest.cpu.features=vmx aes smx mmx ht'
        state: present
      ignore_errors: true

    - name: Agent configuration customization - Video driver.
      ansible.builtin.lineinfile:
        path: /etc/cloudstack/agent/agent.properties
        regexp: '^vm.video.hardware'
        line: 'vm.video.hardware=qxl'
        state: present
      ignore_errors: true

    - name: Agent configuration customization - Console proxy.
      ansible.builtin.lineinfile:
        path: /etc/cloudstack/agent/agent.properties
        regexp: '^consoleproxy.httpListenPort'
        line: 'consoleproxy.httpListenPort=443'
        state: present
      ignore_errors: true

    - name: Configuring UEFI booting.
      ansible.builtin.copy:
        dest: /etc/cloudstack/agent/uefi.properties
        content: |
          guest.nvram.template.secure=/usr/share/OVMF/OVMF_VARS_4M.ms.fd
          guest.nvram.template.legacy=/usr/share/OVMF/OVMF_VARS_4M.fd
          guest.nvram.path=/var/lib/libvirt/qemu/nvram/
          guest.loader.secure=/usr/share/OVMF/OVMF_CODE_4M.secboot.fd
          guest.loader.legacy=/usr/share/OVMF/OVMF_CODE_4M.fd

Lastly, it might also be worth setting the log level to DEBUG by default, since detailed logs are always required for support and troubleshooting:

    - name: Agent configuration customization - Debug level logging.
      ansible.builtin.replace:
        path: /etc/cloudstack/agent/log4j-cloud.xml
        regexp: 'INFO'
        replace: 'DEBUG'
        backup: yes
      ignore_errors: true

These are just suggestions, shared to contribute to the collective knowledge of developers and practitioners.

Best regards,

@rajujith

@RodrigoDLopez Since this is for the 4.22.1 release, could you retarget the PR to the 4.22 branch?

@DaanHoogland DaanHoogland changed the base branch from main to 4.22 January 13, 2026 08:22
@DaanHoogland DaanHoogland requested a review from Copilot January 27, 2026 10:22
Contributor

Copilot AI left a comment


Pull request overview

This pull request changes the default value of the vm_memballoon_stats_period configuration from 0 to 60 seconds to fix memory metrics reporting for Windows instances on KVM hypervisors. The issue caused CloudStack to report 100% memory utilization for Windows VMs even when actual usage was significantly lower.

Changes:

  • Updated default value from 0 to 60 seconds in AgentProperties.java (code and JavaDoc)
  • Updated the commented example value in agent.properties configuration file

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
agent/src/main/java/com/cloud/agent/properties/AgentProperties.java Changed the default value for VM_MEMBALLOON_STATS_PERIOD property from 0 to 60 and updated the JavaDoc documentation
agent/conf/agent.properties Updated the commented example configuration value from 0 to 60
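Based on the file list above, the code side of the change amounts to updating the default in the property definition. A simplified, self-contained sketch of that shape follows; the Property class here is a hypothetical stand-in for the real com.cloud.agent.properties.AgentProperties.Property, not the actual CloudStack class:

```java
// Sketch only: a minimal stand-in for AgentProperties' Property type,
// showing a property name paired with its default value.
public class MemballoonDefaultSketch {

    static final class Property<T> {
        private final String name;
        private final T defaultValue;

        Property(final String name, final T defaultValue) {
            this.name = name;
            this.defaultValue = defaultValue;
        }

        String getName() {
            return name;
        }

        T getDefaultValue() {
            return defaultValue;
        }
    }

    // This PR changes the default from 0 to 60 (seconds).
    static final Property<Integer> VM_MEMBALLOON_STATS_PERIOD =
            new Property<>("vm.memballoon.stats.period", 60);

    public static void main(final String[] args) {
        System.out.println(VM_MEMBALLOON_STATS_PERIOD.getName() + "="
                + VM_MEMBALLOON_STATS_PERIOD.getDefaultValue());
    }
}
```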



Projects

Status: In Progress