-
Notifications
You must be signed in to change notification settings - Fork 594
HDDS-14446. Clarify Datanode disk space related commands and metrics #9646
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| /** | ||
| * @return raw filesystem capacity (cached) for the configured volume path. | ||
| */ | ||
| public long getFsCapacity() { | ||
| return source.getCapacity(); | ||
| } | ||
|
|
||
| /** | ||
| * @return raw filesystem available space (cached) for the configured volume path. | ||
| */ | ||
| public long getFsAvailable() { | ||
| return source.getAvailable(); | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @siddhantsangwan for the patch.
Instead of adding these new methods in VolumeUsage, I suggest getting the same values via realUsage() in a single call, and passing that to getCurrentUsage().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the patch, I updated the PR.
| /** | ||
| * Simple class for grouping disk space related values. | ||
| */ | ||
| public static final class FsUsageTotals { | ||
| private final long capacity; | ||
| private final long available; | ||
| private final long used; | ||
|
|
||
| private FsUsageTotals(long capacity, long available, long used) { | ||
| this.capacity = capacity; | ||
| this.available = available; | ||
| this.used = used; | ||
| } | ||
|
|
||
| public long getCapacity() { | ||
| return capacity; | ||
| } | ||
|
|
||
| public long getAvailable() { | ||
| return available; | ||
| } | ||
|
|
||
| public long getUsed() { | ||
| return used; | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I missed this earlier: can we use SpaceUsageSource.Fixed instead?
ozone/hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/fs/SpaceUsageSource.java
Lines 48 to 77 in 65f0af0
| /** | |
| * A static source of space usage. Can be a point in time snapshot of a | |
| * real volume usage, or can be used for testing. | |
| */ | |
| final class Fixed implements SpaceUsageSource { | |
| private final long capacity; | |
| private final long available; | |
| private final long used; | |
| public Fixed(long capacity, long available, long used) { | |
| this.capacity = capacity; | |
| this.available = Math.max(Math.min(available, capacity - used), 0); | |
| this.used = used; | |
| } | |
| @Override | |
| public long getCapacity() { | |
| return capacity; | |
| } | |
| @Override | |
| public long getAvailable() { | |
| return available; | |
| } | |
| @Override | |
| public long getUsedSpace() { | |
| return used; | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense - updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Can you please check acceptance test failure? Happened for both:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes the failures were related. Since the json output for usageinfo changed, the related robot test was failing. I pushed a fix to datanode.robot. CI is passing now - https://github.com/siddhantsangwan/ozone/actions/runs/21211360808.
adoroszlai
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @siddhantsangwan for updating the patch, LGTM. Let's wait for others to take a look, too.
sodonnel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, although I am not too familiar with this area.
priyeshkaratha
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @siddhantsangwan for working on this. Overall code LGTM.
|
|
||
| private static final MetricsInfo CAPACITY = | ||
| Interns.info("Capacity", "Capacity"); | ||
| Interns.info("OzoneCapacity", "Ozone usable capacity (after reserved space adjustment)"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though the new naming conventions make more sense, this is a breaking change for existing metrics consumers. They will need to update their usage (for example, from Capacity to OzoneCapacity).
How are such breaking changes tracked and communicated? Is there any documentation or process to notify consumers that they need to make updates? For example, if system tests depend on these metrics, how do we ensure these changes are identified and handled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally, we have annotations in the code to mark the stability and intended audience of an interface. See InterfaceAudience.java and InterfaceStability.java. For example, SCMNodeMetrics is marked as @InterfaceAudience.Private which means it's Unstable as per the documentation in:
Classes that are {@link InterfaceAudience.Private} are to be considered unstable
/**
* No guarantee is provided as to reliability or stability across any
* level of release granularity.
*/
@Documented
@Retention(RetentionPolicy.RUNTIME)
public @interface Unstable {
}
I considered this when making this change. The options were:
- To not change the existing names at all and only introduce new terms Filesystem Capacity etc. This wouldn't have solved the confusion.
- To keep both - the existing terms capacity, available and scmUsed and also introduce ozoneCapacity, ozoneAvailable, ozoneUsed. This would have caused more confusion.
- Change the existing names and introduce fs stats, which is what I did in this PR.
Moreover, I think this change is required for correctness - see https://issues.apache.org/jira/browse/HDDS-14388. Currently even ozone devs find it hard to understand these stats and have to spend a non trivial amount of time to figure them out. Overall, I think we should try to make them clear across the ozone code.
devabhishekpal
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @siddhantsangwan for this improvement.
I just had one nit, but overall this patch LGTM, +1
| @@ -0,0 +1,36 @@ | |||
| /* | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a nit, can we move this to overview.types or datanode.types instead of an individual file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be taken up in https://issues.apache.org/jira/browse/HDDS-14483?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@devabhishekpal let's handle this in HDDS-14483 since that involves Recon overview pages changes.
|
Thanks for the reviews everyone. Merging this now that CI is green. |
What changes were proposed in this pull request?
Please see the jira for the problem description.
Changes proposed:
capacity = fsCapacity - du.reservedThis is what the new proto message looks like:
Note
fsUsedis not reported in the heartbeat as it can be calculated where needed usingfsCapacity - fsAvailable.capacity, now we clearly differentiate by usingOzone Capacityto meanFilesystem Capacity - du.reservedetc.What's remaining (to be done in subsequent pull requests):
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-14446
How was this patch tested?
Edited and added various unit tests.
The Recon overview page is still the same as before:
usageinfocommand shows the new names and the new fs stats:SCM JMX metrics:
Datanode Web UI main page (ozone-datanode-1):
Note that Total Capacity shown above should be the same as Filesystem Capacity. I've retained Total Capacity just because it was already there. We could remove it if needed.
Datanode JMX metrics (ozone-datanode-1):
Recon Datanode page
PR is draft while CI runs in my fork.