-
Notifications
You must be signed in to change notification settings - Fork 2
Add Support for Migration Statistics API Call #43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: gardenlinux
Are you sure you want to change the base?
Add Support for Migration Statistics API Call #43
Conversation
e29356d to
4101b11
Compare
f49e577 to
fdf5858
Compare
f612cf6 to
e9a3321
Compare
e9a3321 to
ecb5f45
Compare
e6c80dd to
fe8cd0d
Compare
fe8cd0d to
a87fe95
Compare
a87fe95 to
6d68c0c
Compare
vm-migration/src/progress.rs
Outdated
| /// [live-migration protocol]: super::protocol | ||
| #[derive(Clone, Debug, serde::Serialize, serde::Deserialize)] | ||
| pub struct MigrationProgressAndStatus { | ||
| /// UNIX timestamp of the start of the live-migration process. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please note that this structure will be public API for ever and once we reploy it, it will be hard to change
6d68c0c to
c934e0e
Compare
c934e0e to
78699eb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your work on this!
I wish the PR description, or some other form of documentation, explained the intended use cases for this feature.
Is the main use case debugging slow/hanging live migrations, or is this intended to be used to collect metrics for every live migration which will then be used to provide statistical insights? If it is the latter, it would be good to document (somewhere) the compatibility story with standard metrics services like Prometheus. I think @snue kowns a thing or two about such topics 🙂
The metrics crate seems to be the metrics analogue of log/tracing and it would be good to know why that isn't a good fit for the task(s) addressed by this PR. I am assuming that there is some context I am missing and that it is quite possible that I have some misconceptions about the problems you are trying to address.
Otherwise I would just like to say that the code looks good and it was easy to follow the implementation for the most part.
| pub memory_bytes_remaining_iteration: u64, | ||
| /// The amount of transmitted 4k pages. | ||
| pub memory_pages_4k_transmitted: u64, | ||
| /// The amount of remaining 4k pages for this iteration. | ||
| pub memory_pages_4k_remaining_iteration: u64, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: What about larger pages?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VMM + KVM transparently splits huge-pages for a migration. I'm however not sure if this also applies to read-only pages. We should discuss this again with Thomas
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From offline discussion. KVM splits all pages to 4k for the scope of the migration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The commit message contains:
an
vm_migrationAPI call
but it should be "a vm_migration API call".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
next time, please write the title of the commit message to the comment. I have no chance to find out which commit message this comment belongs to other than brute force :)
vmm/src/api/mod.rs
Outdated
| /// The progress of a possibly ongoing live migration. | ||
| VmMigrationProgress(Box<Option<MigrationProgressAndStatus>>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the None variant will only ever be present if the API is called before a live migration has started?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think about refactoring the whole thing to Result<MigrationProgress> and drop the Option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm I think Option makes the most sense as is the most consistent API without suddenly changing semantics
| { | ||
| Ok(info) => { | ||
| let mut response = Response::new(Version::Http11, StatusCode::OK); | ||
| let info_serialized = serde_json::to_string(&info).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: It would be good to either replace unwrap with expect or to add a comment explaining why it is OK to unwrap in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not happy with it but it is done like this everywhere in ch-remote. I'd much rather change this in upstream CHV
a1b5313 to
708695d
Compare
708695d to
40e58e3
Compare
I've updated the code.
I don't see how this would help. These statistics are not "nice helpers that we need sometimes to debug". They are a fundamental and important functionality we want to export via API. I hope the new PR description makes the intent here clearer. |
ceb915e to
5b63ea5
Compare
amphi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just a few small comments and nits.
| // Give management software a chance to fetch the migration state. | ||
| // The VMM already executes on the other side and keeping Cloud Hypervisor running for a | ||
| // couple of more seconds is fine. | ||
| info!("Sleeping five seconds before shutting off."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this waiting, and maybe also the mark_as_finished() a few lines above, should happen in check_migration_result. Otherwise the management software would see the mark_as_failed very late.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree about the sleep. Not sure about the other. Other thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update. all of this need to happen in the end of the migration thread, not in check_migration_result. Otherwise, the API thread is blocked and API users can never query the latest state.
a83f554 to
7ee8334
Compare
7ee8334 to
f937d87
Compare
f937d87 to
679c6f1
Compare
On-behalf-of: SAP philipp.schuster@sap.com Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
This is the first commit in a series of commits to introduce a new API endpoint in Cloud Hypervisor to report progress and live-insights about an ongoing live migration. Having live and frequently refreshing statistics/metrics about an ongoing live migration is especially interesting for debugging and monitoring. For the first time, we will be able to see how live-migrations behave and create benchmarking infrastructure around it. The ch driver in libvirt will use these information to populate its `virsh domjobinfo` information. We will add a new API endpoint to query information. Further, the endpoint will be interesting to query information about a previously failed or canceled live migration. Specifically interesting about this API endpoint is that it will be the first endpoint that needs the "asynchronization" of the API: more than one API request in parallel. This needs support at least in the HTTP API and the internal API. The "SendMigration" call is long-running and active even if someone is querying the new endpoint. On-behalf-of: SAP philipp.schuster@sap.com Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
This is part of the commit series to enable live updates about an ongoing live migration. See the first commit for an introduction. On-behalf-of: SAP philipp.schuster@sap.com Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
This is part of the commit series to enable live updates about an ongoing live migration. See the first commit for an introduction. In this commit, we add the HTTP endpoint to export ongoing VM live-migration progress. This work was made possible because of the following fundamental prerequisites: - internal API was made async - http thread was made async This way, one can send requests to fetch the latest state without blocking in any code path of the API. On-behalf-of: SAP philipp.schuster@sap.com Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
This is part of the commit series to enable live updates about an ongoing live migration. See the first commit for an introduction. This commit prepares the avoidance of naming clashes in the following. On-behalf-of: SAP philipp.schuster@sap.com Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
This is part of the commit series to enable live updates about an ongoing live migration. See the first commit for an introduction. This commit actually brings all the functionality together. The first version has the limitation that we populate the latest snapshot once per memory iteration, although this is the most interesting part by far. In a follow-up, we can make this more fine-grained. On-behalf-of: SAP philipp.schuster@sap.com Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
This is part of the commit series to enable live updates about an ongoing live-migration. See the first commit for an introduction. There isn't really an error that can happen when we query this endpoint. A previous snapshot may either be there or not. It also doesn't make sense here to check if the current VM is running, as users should always be able to query information about the past (failed or canceled) live migration. On-behalf-of: SAP philipp.schuster@sap.com Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
This is part of the commit series to enable live updates about an ongoing live migration. See the first commit for an introduction. On-behalf-of: SAP philipp.schuster@sap.com Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
The logging is not very spammy nor costly (iterations take seconds to dozens of minutes) and is clearly a win for us to debug things. On-behalf-of: SAP philipp.schuster@sap.com Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
679c6f1 to
307475e
Compare
TL;DR
This extends the internal API with a
vm_progressfunction and adds avm.migration-progressHTTP endpoint including support inch-remoteviach-remote migration-processto query the latest migration progress.Motivation
Monitoring a live-migration with live updated information is very important for debugging, development, and monitoring. This is something that verbose logging can't achieve as there is a clear desire to have that structured information somewhere on the outside - and also to prevent spammy logs.
The most interesting part is the pre-copy phase where we get information on each new memory iteration. The first version is rather coarse-grained with one update per memory iteration. More to follow.
The ch driver in libvirt will use these information to populate its
virsh domjobinfoinformation.Further, the endpoint will be interesting to query information about a
previously failed or canceled live migration.
Prerequisites
The two major pre-requisites were:
Steps Before Merge
ch-remote