Completed functionality for `admin check` command #5348

kevinrr888 · 2025-02-21T15:49:44Z

New checks for admin check command

Implemented SYSTEM_CONFIG check:
- Checks ZooKeeper locks for Accumulo server processes
- Checks ZooKeeper table nodes
- Checks that the WAL metadata in ZooKeeper is valid
Added new SERVER_CONFIG check:
- Checks that all configured properties are valid
- Checks that required properties are present in the config
Added new tests in AdminCheckIT for SYSTEM_CONFIG and SERVER_CONFIG. Added failing test cases for all checks.
Updated description for admin check run command functionality #4892. This describes all the checks thus far for the new accumulo admin check command. See this for the complete functionality of the command.
Considered adding check to MetadataConstraints to ensure tablets reference files that actually exist (i.e., in MetadataConstraints.validateDataFileMetadata, check that stf.getPath() exists in HDFS). This would make sure the file exists before the mutation is ever written and would also check that it exists when we run the admin check command for metadata (root metadata, root table, and metadata table): see MetadataCheckRunner.checkColumns(). However, this led to a ComprehensiveIT test failure, so I assume the file may not always exist before the metadata for it is written...
Deleted CheckServerConfig (run via accumulo check-server-config) as the new accumulo admin check run SERVER_CONFIG will inherently do the same checks. Maybe we don't want to completely get rid of this check yet. Maybe it would be better to have the check just output that the check has moved under accumulo admin check?
There were a couple other existing check commands that I originally considered moving under this new admin command (CheckAccumuloProperties and CheckCompactionConfig), however, their usage is specifically to check a file irrespective of any running Accumulo instance, so I did not feel it would make sense to move these under the new admin command.

This PR along with #4957 closes #4892

- Implemented SYSTEM_CONFIG check: - Checks ZooKeeper locks for Accumulo server processes - Checks ZooKeeper table nodes - Checks that the WAL metadata in ZooKeeper is valid - Added new SERVER_CONFIG check: - Checks that all configured properties are valid - Checks that required properties are present in the config - Added new tests in `AdminCheckIT` for SYSTEM_CONFIG and SERVER_CONFIG. Added failing test cases for all checks. - Deleted CheckServerConfig (run via `accumulo check-server-config`) as the new `accumulo admin check run SERVER_CONFIG` will inherently do the same checks.

kevinrr888 · 2025-02-21T15:52:30Z

test/src/main/java/org/apache/accumulo/test/AdminCheckIT.java

 import java.util.TreeMap;
 import java.util.function.Supplier;
-import java.util.regex.Pattern;



General comment about AdminCheckIT: considered changing from ConfigurableMacBase to SharedMiniCluserBase for this test from suggestion of @dlmarion. Prior to these new changes, SharedMiniClusterBase was sufficient and much faster, but some of these tests now do damage to the MAC to test for failure cases of the checks (e.g., deleting essential files, messing with metadata), so stuck with ConfigurableMacBase

kevinrr888 · 2025-02-21T15:54:25Z

server/base/src/main/java/org/apache/accumulo/server/conf/CheckServerConfig.java

-import com.google.auto.service.AutoService;
-
-@AutoService(KeywordExecutable.class)
-public class CheckServerConfig implements KeywordExecutable {


I deleted CheckServerConfig since now this is checked through this SERVER_CONFIG check: the ServerContext object is constructed the same (see ServerUtilOpts) and we end up calling context.getConfiguration() in the SERVER_CONFIG check, so the same checks are done with a few additions. Maybe we don't want to completely get rid of this check yet. Maybe it would be better to have the check just output that the check has moved under accumulo admin check?

kevinrr888 · 2025-02-21T15:55:29Z

...base/src/main/java/org/apache/accumulo/server/util/checkCommand/ServerConfigCheckRunner.java

+    // there are many properties that should be set (default value or user set), identifying them
+    // all and checking them here is unrealistic. Some property that is not set but is expected
+    // will likely result in some sort of failure eventually anyway. We will just check a few
+    // obvious required properties here.
+    Set<Property> requiredProps = Set.of(Property.INSTANCE_ZK_HOST, Property.INSTANCE_ZK_TIMEOUT,
+        Property.INSTANCE_SECRET, Property.INSTANCE_VOLUMES, Property.GENERAL_THREADPOOL_SIZE,
+        Property.GENERAL_DELEGATION_TOKEN_LIFETIME,
+        Property.GENERAL_DELEGATION_TOKEN_UPDATE_INTERVAL, Property.GENERAL_IDLE_PROCESS_INTERVAL,
+        Property.GENERAL_LOW_MEM_DETECTOR_INTERVAL, Property.GENERAL_LOW_MEM_DETECTOR_THRESHOLD,
+        Property.GENERAL_PROCESS_BIND_ADDRESS, Property.GENERAL_SERVER_LOCK_VERIFICATION_INTERVAL,
+        Property.MANAGER_CLIENTPORT, Property.TSERV_CLIENTPORT, Property.GC_CYCLE_START,
+        Property.GC_CYCLE_DELAY, Property.GC_PORT, Property.MONITOR_PORT, Property.TABLE_MAJC_RATIO,
+        Property.TABLE_SPLIT_THRESHOLD);


any other important ones I left out? Any I shouldn't have included?

Not a change for this PR, but wondering if this should be pushed to the validation code of each property. For example could we attempt to do something like the following in addition to the code that goes through the defined props above. Thinking if a props validation fails on null or empty string then its "required" and should be set. Looking at some of the important props, like instance.volumes their types would need change from something besides STRING that is more specific, which would be a good general change to make (would be good to have specific type to do validation for instance volumes and that could include not accepting empty string).

for(var prop : Property.values()) { var value = config.get(prop); if (!Property.isValidProperty(prop.getKey(), value)) { log.warn("Invalid property (key={} val={}) found in the config", prop, value); } }

If the rest of the code worked like this, then would not need this list here.

As the code is currently written it loops over all the props the loop in the prev comment may not work well because the get method replaces w/ the default value when not present.

In general it seems like it would be best to move the concept of a required property into the Property class in some form. Then the entire system could react appropriately when a required property is not present and is requested. For now a list in this class seems fine.

I experimented w/ validating the volume prop in #5365 based on the exploration done as part of this.

This sounds like a good idea to me. There are a lot of PropertyType.STRING when a String isn't really what the property is. PropertyType exists to check that the property is valid; setting the PropertyType to STRING is just a way to ignore this validation. I wonder if it would be best for a 1 to 1 mapping Property to PropertyType. This would probably be overkill though, another idea could be to analyze the PropertyTypes that are always valid. From briefly looking, PropertyType.PATH, PropertyType.STRING, PropertyType.URI are always valid. I don't think any properties should always be valid. Those that are PropertyType.STRING could probably be given a more appropriate existing PropertyType or a new one. PATH and URI could have validation.

In addition to this, can analyze all properties, determine if they are required or not, and change the validation:

Properties that are not required could accept empty string, null, or a valid value (where validity is well defined)

Properties that are required could only accept a valid value

Like you said, for this PR, can just push this list of required properties into Property. Maybe for now/in this PR this list of required props is only accessed/checked in this admin check. Might be a bit of a scope creep to start checking this list elsewhere in the code. Could do it in follow on.

Definitely want to avoid any scope creep in this PR. Identified some areas that need improvement based on this work, we can open follow on issues or PRs for those.

Could leave the list as is in the PR. For follow on issues, do we need two issues? One for addressing the STRING types and another for somehow representing and documenting required props in the Property.java?

I think either would be fine, but it might be easier as just one issue. There would be overlap in these changes so might be hard to split up/work on as two separate issues/PRs. For example, instance.volumes would need to be a required property (which would be tied to validation in it's PropertyType) and would need to be moved away from PropertyType.STRING

...er/base/src/main/java/org/apache/accumulo/server/util/checkCommand/RootTableCheckRunner.java

kevinrr888 · 2025-02-21T15:58:08Z

...base/src/main/java/org/apache/accumulo/server/util/checkCommand/SystemConfigCheckRunner.java

+        log.trace("Checking if the WAL path {} found in the metadata exists in HDFS...",
+            parseRes.getSecond());
+        if (!context.getVolumeManager().exists(parseRes.getSecond())) {
+          log.warn("WAL metadata for tserver {} references a WAL that does not exist",
+              tserverInstanceAtWals);
+          status = Admin.CheckCommand.CheckStatus.FAILED;
+        }


Let me know if this part might is a bit much to be checking here

This check is good, but it has a race condition. File can be removed after by the gc after getting the list of wals. Could do the following to avoid all race conditions.

wals1 = set of wals from ZK in the open or closed state (gc will eventually remove wals in unreferenced state)

tservers1 = set of live tservers

missingSet = set of wals that are in walSet1 and not in hdfs

wals2 = set of wals from ZK in the open or closed state

tservers2 = set of live tservers

Any thing in missingSet that is also in wals2, tservers1 and tservers2 is really odd. The wal state implies its in use (before and after the dfs check) and the tservers were alive before and after the dfs checks (the GC will eventually remove inuse logs assigned to dead tserver). So this walog should exist in hdfs. Could use guavas Sets.intersection for these checks.

Did not see a test for this case.

Thanks, I pushed 8a0488c which also included a test case for this. Let me know your thoughts.

By "live tservers" I went with what tservers are present in the "/wals" path in ZK, but I realize now you may have meant something like TabletMetadata.getLiveTServers since maybe the data at "/wals" may still reference dead tservers... Or maybe both are equivalent

you may have meant something like TabletMetadata.getLiveTServers

Yeah, that that is what I was thinking of. I was looking at when the GC deletes walogs. Its when is unreferenced or not in the set of live tservers from zookeeper. Use LiveTserverSet to get those.

...base/src/main/java/org/apache/accumulo/server/util/checkCommand/SystemConfigCheckRunner.java

kevinrr888 · 2025-03-25T19:36:48Z

@keith-turner - Could you take another look at this when you get the chance? I've updated this to work for 4.0. Since there's no more 3.x, this should make things simpler here. Main things changed in latest commit were:

Addressed Completed functionality for admin check command #5348 (comment)
SystemConfigCheckRunner changes

There are also some outstanding comments/questions

…) with RPC_PROCESS_BIND_ADDRESS (the replacement)

dlmarion · 2026-01-27T20:38:43Z

...base/src/main/java/org/apache/accumulo/server/util/checkCommand/SystemConfigCheckRunner.java

    printRunning();
+
+    log.trace("********** Checking validity of some ZooKeeper nodes **********");
+    status = checkZkNodes(context, status);


I was looking at MetadataTableCheckRunner to get familiar with admin check in general and I thought I saw a bug where status from one check could return Failed, and then get overwritten by the next check. Digging in, that's not the case, as the method only returns Failed, otherwise it returns the status argument.

Maybe not for this PR, but I wonder if in general it would be better for each check to use booleans for pass / fail and use the following style instead:

boolean status = true; status &= check1(arg1, arg2, ...); status &= check2(arg1, arg2, ...);

I'd be fine with this as a follow-on. I went with the current approach because there's multiple statuses of a check:

public enum CheckStatus { OK, FAILED, SKIPPED_DEPENDENCY_FAILED, FILTERED_OUT; }

Running a check will only produce OK or FAILED (the other statuses are if a check is skipped/not run), so this could be converted to bool, but may make things a bit less clear in my opinion

Could keep the enums and add static method to CheckStatus that merges multiple statuses like static CheckStatus merge(CheckStatus ... statuses)

From what I could find SKIPPED_DEPENDENCY_FAILED and FILTERED_OUT were only used in the code that you see below:

accumulo/server/base/src/main/java/org/apache/accumulo/server/util/Admin.java

Lines 1428 to 1461 in c1dec00

private static int executeRunCheckCommand(CheckCommand cmd, List<CheckCommand.Check> givenChecks,

ServerContext context, ServerUtilOpts opts) throws Exception {

// Get all the checks in the order they are declared in the enum

final var allChecks = CheckCommand.Check.values();

final TreeMap<CheckCommand.Check,CheckCommand.CheckStatus> checkStatus = new TreeMap<>();

for (CheckCommand.Check check : allChecks) {

if (depsFailed(check, checkStatus)) {

checkStatus.put(check, CheckCommand.CheckStatus.SKIPPED_DEPENDENCY_FAILED);

} else {

if (givenChecks.contains(check)) {

checkStatus.put(check, cmd.getCheckRunner(check).runCheck(context, opts, cmd.fixFiles));

} else {

checkStatus.put(check, CheckCommand.CheckStatus.FILTERED_OUT);

}

}

}

printChecksResults(checkStatus);

if (checkStatus.values().stream()

.anyMatch(status -> status == CheckCommand.CheckStatus.FAILED)) {

return 5;

} else {

return 0;

}

}

private static boolean depsFailed(CheckCommand.Check check,

TreeMap<CheckCommand.Check,CheckCommand.CheckStatus> checkStatus) {

return check.getDependencies().stream()

.anyMatch(dep -> checkStatus.get(dep) == CheckCommand.CheckStatus.FAILED

|| checkStatus.get(dep) == CheckCommand.CheckStatus.SKIPPED_DEPENDENCY_FAILED);

}

If runCheck on line 1439 returned a boolean, then you could convert that to OK or FAILED at that time and I think it would make the checks a little easier to read. I'm happy to do it in a follow-on.

Good point I thought things would be more convoluted than that. I agree this would be a good change

I'll leave this comment unresolved so we don't forget about it when this is merged

dlmarion · 2026-01-27T20:45:37Z

...base/src/main/java/org/apache/accumulo/server/util/checkCommand/SystemConfigCheckRunner.java

+            log.debug("No {} appear to be running. This may or may not be expected.", serverType);
+          } else {
+            // no exception and >= 1 server found
+            log.trace("Verified ZooKeeper lock(s) for {}", servers);


Do we want to print out all of the servers, or just the number of them? I'm wondering if there is any value to a really long list of servers in the log.

Yeah, this is a log in normal/healthy scenario, no real benefit of a verbose log here.

I replaced with logging the size where server size >= 1 in 1af0ad3 but kept the log the same where server size == 1 since that log will be short.

dlmarion · 2026-01-27T20:48:40Z

...base/src/main/java/org/apache/accumulo/server/util/checkCommand/SystemConfigCheckRunner.java

+      // each child node of the tserver should be WAL metadata
+      final var wals = zrw.getChildren(tserverPath);
+      if (wals.isEmpty()) {
+        log.debug("No WAL metadata found for tserver {}", tsi);


is this a failure?

I think this is possible in a normal scenario where the tserver only hosts tablets that have never been written to. This seems unlikely, but not sure if this should fail in this case... Maybe could make this a WARN but still not fail. I can add a comment and improve the log message too.

Also when looking at this again, realized I don't check that this has a node for each running TServer, which I think it always should:
var tserverInstancesAtWals = zrw.getChildren(WalStateManager.ZWALS);

NVM I did have that check but removed it after Keith's suggestion: #5348 (comment)

Made this a WARN and improved log message

- RPC_PROCESS_BIND_ADDRESS was added as a required property in the server config check, but this prop does not need to be set. Removed this from the required properties - Fixed some flaky assertions

keith-turner · 2026-01-27T22:27:10Z

Noticed #6078 and #6079 while looking at this.

server/base/src/main/java/org/apache/accumulo/server/conf/CheckAccumuloProperties.java

...base/src/main/java/org/apache/accumulo/server/util/checkCommand/SystemConfigCheckRunner.java

keith-turner · 2026-01-27T23:45:33Z

...base/src/main/java/org/apache/accumulo/server/util/checkCommand/SystemConfigCheckRunner.java

+        log.trace("Checking if the WAL path {} found in the metadata exists in HDFS...",
+            parseRes.getSecond());
+        if (!context.getVolumeManager().exists(parseRes.getSecond())) {
+          log.warn("WAL metadata for tserver {} references a WAL that does not exist",
+              tserverInstanceAtWals);
+          status = Admin.CheckCommand.CheckStatus.FAILED;
+        }


This check is good, but it has a race condition. File can be removed after by the gc after getting the list of wals. Could do the following to avoid all race conditions.

wals1 = set of wals from ZK in the open or closed state (gc will eventually remove wals in unreferenced state)

tservers1 = set of live tservers

missingSet = set of wals that are in walSet1 and not in hdfs

wals2 = set of wals from ZK in the open or closed state

tservers2 = set of live tservers

Any thing in missingSet that is also in wals2, tservers1 and tservers2 is really odd. The wal state implies its in use (before and after the dfs check) and the tservers were alive before and after the dfs checks (the GC will eventually remove inuse logs assigned to dead tserver). So this walog should exist in hdfs. Could use guavas Sets.intersection for these checks.

Did not see a test for this case.

… case for this check

kevinrr888 commented Feb 21, 2025

View reviewed changes

kevinrr888 self-assigned this Feb 21, 2025

kevinrr888 added this to the 3.1.0 milestone Feb 21, 2025

kevinrr888 linked an issue Feb 21, 2025 that may be closed by this pull request

admin check run command functionality #4892

Open

10 tasks

simplified checks for compactors ZK locks and scan servers ZK locks

fe97ad6

kevinrr888 changed the title ~~New checks for admin check command~~ Completed functionality for admin check command Feb 24, 2025

kevinrr888 mentioned this pull request Feb 28, 2025

Adds property validation for instance.volumes property #5365

Merged

jkucera0 reviewed Mar 10, 2025

View reviewed changes

...base/src/main/java/org/apache/accumulo/server/util/checkCommand/SystemConfigCheckRunner.java Outdated Show resolved Hide resolved

ctubbsii changed the base branch from 3.1 to main March 13, 2025 23:51

ctubbsii modified the milestones: 3.1.0, 4.0.0 Mar 13, 2025

Merge remote-tracking branch 'upstream/main' into 3.1-feature-4892-part2

fbfa1fc

kevinrr888 added 2 commits May 16, 2025 16:36

Merge branch 'main' into 3.1-feature-4892-part2

7edd220

Merge branch 'main' into 3.1-feature-4892-part2

48b828a

kevinrr888 mentioned this pull request Jun 3, 2025

Add check for bulk import loaded columns that have no corresponding live fate operation #5175

Open

kevinrr888 added 2 commits January 27, 2026 14:57

Merge branch 'main' into 3.1-feature-4892-part2

815d246

replaced usage of GENERAL_PROCESS_BIND_ADDRESS (deprecated and unused…

9d77743

…) with RPC_PROCESS_BIND_ADDRESS (the replacement)

dlmarion reviewed Jan 27, 2026

View reviewed changes

Fixed AdminCheckIT failures:

f695fe1

- RPC_PROCESS_BIND_ADDRESS was added as a required property in the server config check, but this prop does not need to be set. Removed this from the required properties - Fixed some flaky assertions

keith-turner reviewed Jan 27, 2026

View reviewed changes

kevinrr888 added 5 commits January 28, 2026 09:32

Merge branch 'main' into 3.1-feature-4892-part2

ed10610

update AdminCheckIT from changes in main

9aa52a5

address some review

1af0ad3

removed unneccessary comment

666a8ca

address suggestion for WAL metadata check, added test to test failure…

8a0488c

… case for this check

	private static int executeRunCheckCommand(CheckCommand cmd, List<CheckCommand.Check> givenChecks,
	ServerContext context, ServerUtilOpts opts) throws Exception {
	// Get all the checks in the order they are declared in the enum
	final var allChecks = CheckCommand.Check.values();
	final TreeMap<CheckCommand.Check,CheckCommand.CheckStatus> checkStatus = new TreeMap<>();

	for (CheckCommand.Check check : allChecks) {
	if (depsFailed(check, checkStatus)) {
	checkStatus.put(check, CheckCommand.CheckStatus.SKIPPED_DEPENDENCY_FAILED);
	} else {
	if (givenChecks.contains(check)) {
	checkStatus.put(check, cmd.getCheckRunner(check).runCheck(context, opts, cmd.fixFiles));
	} else {
	checkStatus.put(check, CheckCommand.CheckStatus.FILTERED_OUT);
	}
	}
	}

	printChecksResults(checkStatus);

	if (checkStatus.values().stream()
	.anyMatch(status -> status == CheckCommand.CheckStatus.FAILED)) {
	return 5;
	} else {
	return 0;
	}
	}

	private static boolean depsFailed(CheckCommand.Check check,
	TreeMap<CheckCommand.Check,CheckCommand.CheckStatus> checkStatus) {
	return check.getDependencies().stream()
	.anyMatch(dep -> checkStatus.get(dep) == CheckCommand.CheckStatus.FAILED
	\|\| checkStatus.get(dep) == CheckCommand.CheckStatus.SKIPPED_DEPENDENCY_FAILED);
	}

Completed functionality for admin check command #5348

Are you sure you want to change the base?

Completed functionality for admin check command #5348

Conversation

kevinrr888 commented Feb 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kevinrr888 Feb 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kevinrr888 Feb 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kevinrr888 Jan 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kevinrr888 commented Mar 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dlmarion Jan 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kevinrr888 Jan 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

keith-turner commented Jan 27, 2026

Uh oh!

Uh oh!

Uh oh!

Completed functionality for `admin check` command #5348

Completed functionality for `admin check` command #5348

kevinrr888 commented Feb 21, 2025 •

edited

Loading

kevinrr888 Feb 21, 2025 •

edited

Loading

kevinrr888 Feb 28, 2025 •

edited

Loading

kevinrr888 Jan 28, 2026 •

edited

Loading

kevinrr888 commented Mar 25, 2025 •

edited

Loading

dlmarion Jan 28, 2026 •

edited

Loading

kevinrr888 Jan 28, 2026 •

edited

Loading