Skip to content

HIVE-27126: queue level resource stats for YARN RM.#6501

Open
architjainjain wants to merge 1 commit into
apache:masterfrom
architjainjain:HIVE-27126-yarn-RM
Open

HIVE-27126: queue level resource stats for YARN RM.#6501
architjainjain wants to merge 1 commit into
apache:masterfrom
architjainjain:HIVE-27126-yarn-RM

Conversation

@architjainjain

@architjainjain architjainjain commented May 20, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Display Yarn resources availability in real time along with Tez job progress

Why are the changes needed?

Right we don't know if the queue used in Hive query execution is having enough resources or not. We are already displaying tez job details like number of task per each vertex and how are they progressing.

If the resources available are not good enough to execute the query in parallel or query that use to take shorter time is taking time, then we can use this new detail to understand that queue is busy and resources are not enough

Does this PR introduce any user-facing change?

yes, it shows queue level metrics.

How was this patch tested?

the patch was tested by creating yarn cluster and running queries, with different queues.
image

after disabling the queue metrics:
set hive.tez.queue.metrics.refresh.interval=0s;
image

@sonarqubecloud

Copy link
Copy Markdown

@abstractdog abstractdog left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @architjainjain so far, dropped some comments, but I wasn't able to fully read it through, I'll get back after, in the meantime I can do some testing too hopefully

* behaviour added as part of HIVE-27126.
*
* We capture stdout via a ByteArrayOutputStream and inspect the rendered output.
*/public class TestInPlaceUpdate {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line break before public

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

new String(new char[InPlaceUpdate.MIN_TERMINAL_WIDTH]).replace("\0", "-");

/** Counts non-overlapping occurrences of {@code needle} in {@code haystack}. */
private static int countOccurrences(String haystack, String needle) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about reusing commons-lang3 StringUtils.countMatches

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +32 to +33
* Unit tests for InPlaceUpdate — focusing on the queue-metrics separator
* behaviour added as part of HIVE-27126.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestInPlaceUpdate sounds like a generic unit test class for InPlaceUpdate, so this comment is supposed to get outdated over time

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

/** Minimal ProgressMonitor stub — returns empty headers/rows/footer. */
private static ProgressMonitor makeMonitor(String queueMetrics) {
return new ProgressMonitor() {
@Override public List<String> headers() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public in new line, pls take care of the same below too

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +178 to +182





Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove extra lines

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +318 to +324
for (int i = 0; i < 5; i++) {
collectors[i] = new YarnQueueMetricsCollector(
mockYarnClient, "default", refreshIntervalMs, "jitter-test-query-" + i);
assertNotNull("Collector " + i + " should be created successfully", collectors[i]);
}
// If we get here, all 5 collectors were created with their own jittered delays
// without conflict or exception - thundering herd fix is in place

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

considering mock yarn clients, creating concurrent collectors for jitter testing doesn't seem that valuable to me...I don't think they will ever throw an exception

mockYarnClient, "default", 30, "circuit-breaker-recovery-query");

try {
// Wait for recovery - snapshot should eventually be populated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need extra assertion about the failure first to make 100% sure we actually hit circuit breaker logic

Comment on lines +428 to +429
public void testCircuitBreakerDoesNotAffectSuccessfulCollection() throws Exception {
// Normal operation - no failures, circuit breaker should never activate

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is just a simple happy testing, I guess it's covered by testSuccessfulMetricsCollection

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abstractdog Thanks for catching this! I've added JavaDoc to clarify that this test is part of the circuit breaker test suite. While it looks similar to testSuccessfulMetricsCollection(), it specifically validates that the circuit breaker doesn't interfere with normal operations. The circuit breaker suite needs three scenarios: failure, recovery, and happy path. This test covers the happy path behavior for circuit breaker specifically, not general metrics collection. I've added cross-references to the related tests for clarity.

Comment on lines +485 to +493
when(mockQueueStats.getAllocatedMemoryMB()).thenReturn(1024L);
when(mockQueueStats.getAvailableMemoryMB()).thenReturn(1024L);
when(mockQueueStats.getAllocatedVCores()).thenReturn(4L);
when(mockQueueStats.getAvailableVCores()).thenReturn(4L);
when(mockQueueStats.getNumAppsRunning()).thenReturn(1L);
when(mockQueueStats.getPendingContainers()).thenReturn(0L);
when(mockQueueInfo.getQueueStatistics()).thenReturn(mockQueueStats);
when(mockQueueInfo.getCapacity()).thenReturn(0.5f);
when(mockYarnClient.getQueueInfo("default")).thenReturn(mockQueueInfo);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like defining all these variables adds a lot of boilerplate to the unit tests, whereas in this case we're only interested in collection timestamp, could a separate method be created just for adding happy value, and it can be overridden per test, like:

  when(mockQueueStats.getAllocatedMemoryMB()).thenReturn(1024L);
  when(mockQueueStats.getAvailableMemoryMB()).thenReturn(1024L);
  when(mockQueueStats.getAllocatedVCores()).thenReturn(4L);
  when(mockQueueStats.getAvailableVCores()).thenReturn(4L);
  when(mockQueueStats.getNumAppsRunning()).thenReturn(1L);
  when(mockQueueStats.getPendingContainers()).thenReturn(0L);
  when(mockQueueInfo.getQueueStatistics()).thenReturn(mockQueueStats);
  when(mockQueueInfo.getCapacity()).thenReturn(0.5f);
  when(mockYarnClient.getQueueInfo("default")).thenReturn(mockQueueInfo);

method javadoc can also tell that these are only default happy values for the unit tests

Comment on lines +824 to +830
// HIVE-27126: Thrift regeneration omitted setStartTimeIsSet(true) from the constructor.
// Explicitly call setStartTime() to set the isset flag required for Thrift validation.
tProgressUpdateResp.setStartTime(progressUpdate.startTimeMillis);
if (progressUpdate.queueMetrics() != null && !progressUpdate.queueMetrics().isEmpty()) {
tProgressUpdateResp.setQueueMetrics(progressUpdate.queueMetrics());
}
resp.setProgressUpdateResponse(tProgressUpdateResp);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this change needed?

@architjainjain architjainjain Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works around a Thrift code generation bug. When we regenerated Thrift code after adding the queueMetrics field, the generated constructor for TProgressUpdateResp accepts startTimeMillis but fails to set the isset flag. Without this flag, Thrift serialization treats the field as unset, causing clients to receive incomplete progress updates. The explicit setStartTime() call ensures both the value AND the isset flag are properly set, maintaining backward compatibility with Thrift clients.

without this test the generated file is not having the starttime isset flag updated.

https://github.com/apache/hive/pull/6501/changes#diff-d71ddeafeea57e0fbf6a9dfda436eab06b69a39ac46c815cc1b15bfffe5f35e0L155

this.status = status;
this.footerSummary = footerSummary;
this.startTime = startTime;
setStartTimeIsSet(true);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abstractdog this is getting removed. so we added manually after the constructor call.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants