HDDS-3128. Add ozone debug commands for Kerberos#9868
HDDS-3128. Add ozone debug commands for Kerberos#9868adoroszlai merged 20 commits intoapache:masterfrom
Conversation
adoroszlai
left a comment
There was a problem hiding this comment.
Thanks @navinko for working on this. I think we should add these new commands under ozone debug, not at the top-level. Also, please implement as a picocli @Command. See VersionDebug as an example. KDiag would need to be copied to Ozone and adapted.
Hi @adoroszlai Thank you for reviewing . I have one doubt
|
The latter (copy to Ozone, change to use picocli instead of its own argument parsing, replace Hadoop-specific variables/properties with Ozone-specific ones, etc.). |
Thank you , I will update the PR shortly. |
|
Hi @adoroszlai |
adoroszlai
left a comment
There was a problem hiding this comment.
Thanks @navinko for updating the patch. I'll need some time for the review.
adoroszlai
left a comment
There was a problem hiding this comment.
Thanks @navinko for updating the patch, and sorry for the delay in review.
In addition to inline comments, I have the following suggestions:
- create a grouping subcommand
ozone debug kerberos(class:KerberosSubcommands) - move the two new subcommands under
ozone debug kerberosaskerbname=>ozone debug kerberos translate-principalkdiag=>ozone debug kerberos diagnose
- move all classes to a single package
org.apache.hadoop.ozone.debug.kerberos(no need for separateauthtolocalandkdiagpackages)
| print(conf, "ozone.acl.enabled"); | ||
| print(conf, "ozone.acl.authorizer.class"); | ||
| print(conf, "hadoop.security.authorization"); | ||
| print(conf, "ozone.om.security.client.protocol.acl"); | ||
|
|
||
| print(conf, "hdds.security.client.datanode.container.protocol.acl"); | ||
| print(conf, "hdds.security.client.scm.container.protocol.acl"); | ||
| print(conf, "hdds.security.client.scm.block.protocol.acl"); | ||
| print(conf, "hdds.security.client.scm.certificate.protocol.acl"); |
There was a problem hiding this comment.
Please use constants from HddsConfigKeys, OzoneConfigKeys etc. where available.
There was a problem hiding this comment.
there were few fields not available and i tried avoiding cyclic dependency , used as it is string .
| String value = conf.get(key); | ||
| System.out.println(key + " = " | ||
| + (value == null ? "(unset)" : value)); |
There was a problem hiding this comment.
Please reuse code across various probes. The same logic also appears in HttpAuthProbe, OzonePrincipalProbe and SecurityConfigProbe. Maybe a parent class ConfigProbe?
There was a problem hiding this comment.
Thanks @adoroszlai tried incorporation suggestions.
|
|
||
| String name(); | ||
|
|
||
| boolean run() throws Exception; |
There was a problem hiding this comment.
I think test() would be a better name.
| String realm = KerberosUtil.getDefaultRealm(); | ||
| System.out.println("Default realm = " + realm); | ||
| } catch (Exception e) { | ||
| System.out.println("WARNING: Unable to determine default realm"); |
There was a problem hiding this comment.
I think warnings and errors should be printed to System.err.
There was a problem hiding this comment.
used error stream , was using out only ealier .
| @Override | ||
| public Void call() throws Exception { | ||
| System.out.println("-- Kerberos Principal Translation --"); | ||
| OzoneConfiguration conf = new OzoneConfiguration(); |
There was a problem hiding this comment.
Let KerbNameDebug extends AbstractSubcommand, then configuration can be obtained via getOzoneConf(). It allows setting specific properties, as well as alternative config files.
The same applies to OzoneKDiag.
There was a problem hiding this comment.
Thanks @adoroszlai tried incorporation suggestions.
|
|
||
| @Override | ||
| public boolean run() { | ||
| System.out.println("-- Authorization Configuration --"); |
There was a problem hiding this comment.
All DiagnosticProbe implementations have name() that could be automatically printed instead of duplicating the title in run().
| System.out.println("PASS : " + pass); | ||
| System.out.println("WARN : " + warn); | ||
| System.out.println("FAIL : " + fail); | ||
| return null; |
There was a problem hiding this comment.
We may want to set non-zero exit code if there are any failures. I think it can be done by using Callable<Integer> instead of Callable<Void>.
There was a problem hiding this comment.
Thanks @adoroszlai tried incorporation suggestions.
| warn++; | ||
| System.out.println("[WARN] " + probe.name()); |
There was a problem hiding this comment.
I tried introducing problems, e.g. by making kinit non-executable, removing permissions from /etc/krb5.conf, etc. The probes returned false, which resulted only in warnings. But the system is non-functional with this setup. Shouldn't it give errors instead of warnings?
-- Kerberos kinit Command --
Executable kinit must be available on PATH
PATH = /opt/hadoop/libexec:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/jdk-21.0.2/bin:/opt/hadoop/bin
kinit not found on PATH
[WARN] Kerberos kinit Command
-- Kerberos Ticket --
ERROR checking kerberos credentials: Can't get Kerberos realm
[WARN] Kerberos Ticket
adoroszlai
left a comment
There was a problem hiding this comment.
Thanks @navinko for updating the patch.
| @MetaInfServices(DebugSubcommand.class) | ||
| public class TranslatePrincipalSubcommand extends AbstractSubcommand | ||
| implements Callable<Void>, DebugSubcommand { |
There was a problem hiding this comment.
Please remove leftover references to DebugSubcommand (now translate-principal is under KerberosSubcommand).
There was a problem hiding this comment.
I am really sorry ,my bad:)
I am still working on the pr din't realised it is not draft anymore.
Will update the PR shortly.
Thanks @adoroszlai tried incorporation suggestions. |
|
Hi @adoroszlai thank you for reviewing and suggestions . |
adoroszlai
left a comment
There was a problem hiding this comment.
Thanks a lot @navinko for updating the patch, LGTM. Let's give others some time to take a look.
|
@Gargi-jais11 would you like to take a look? |
Sure @adoroszlai will take a look into this. |
Gargi-jais11
left a comment
There was a problem hiding this comment.
Thanks @navinko for working on the patch. This will really be helpful change.
Please find inline comments.
| BufferedWriter writer = new BufferedWriter( | ||
| new OutputStreamWriter(buffer, StandardCharsets.UTF_8)); |
There was a problem hiding this comment.
A BufferedWriter is created that wraps the ByteArrayOutputStream buffer. writer is created and closed in finally — that's it. The stream that actually captures probe output is PrintStream ps. Only ps is ever used.
BufferWriter is a dead code and can be remove Line 67-68 and Line 89.
There was a problem hiding this comment.
My bad , din't realise while fixing spotbugs issues. Fixed now.
| } catch (Exception e) { | ||
| System.err.println("ERROR: Failed to translate principal " | ||
| + principal + " : " + e.getMessage()); | ||
| } | ||
| } | ||
| return null; |
There was a problem hiding this comment.
This subcommand impliments Callable<Void> and always returns null (exit code 0), even when a principal fails t o translates.
It should be Callable<Integer> and return a non-zero code if any principal translation fails — consistent with how DiagnoseSubcommand which returns 1 on failures.
There was a problem hiding this comment.
Thanks @Gargi-jais11 I just fixed it. I initially thought to keep it simpler since we are not doing many things here but keeping it consistent with diagnose design overall will be great .
| if (output.contains("ERROR:")) { | ||
| fail++; | ||
| out().println("[FAIL] " + probe.name()); | ||
| } else if (!valid) { | ||
| warn++; | ||
| out().println("[WARN] " + probe.name()); | ||
| } else { | ||
| pass++; | ||
| out().println("[PASS] " + probe.name()); | ||
| } | ||
| out().println(); | ||
| } |
There was a problem hiding this comment.
Right now FAIL is infferred with output.contains("ERROR:") while WARN/PASS use probe.test()’s boolean return. Those two signals can disagree (e.g. a probe prints ERROR: but still returns success, or the opposite).
It is better to have one source of truth and drop the string scan.
Right Approach would be:
step1: create them as an enum
public enum ProbeResult {
PASS, // probe ran cleanly, everything looks healthy
WARN, // probe ran, something is misconfigured but not broken
FAIL // probe detected a critical problem
}
Step2: Update the DiagnosticProbe interface to:
ProbeResult test(OzoneConfiguration conf) throws Exception; // was: boolean
step3: then you can Update DiagnoseSubcommand — remove the string scan entirely and use the Enum created
step4: Update probes to return ProbeResult explicitly
Here are examples of all three cases:
Purely informational probe (was always return true) — EnvironmentProbe
WARN probe (suboptimal but not broken) — AuthorizationProbe
FAIL probe (critical, system broken) — KerberosConfigProbe
A probe with both WARN and FAIL paths — KerberosTicketProbe
There was a problem hiding this comment.
Used enum and updated the code.
|
|
||
| try { | ||
| UserGroupInformation.setConfiguration(conf); | ||
| String authType = conf.getTrimmed("hadoop.security.authentication"); |
There was a problem hiding this comment.
Please use constant here: HADOOP_SECURITY_AUTHENTICATION
| * | ||
| * The probe does NOT attempt to perform a login (kinit), Instead it reports | ||
| * the current state so operators can diagnose security issues. | ||
| * Checks the current Kerberos authentication state of the process. |
There was a problem hiding this comment.
Nit: Duplicate this can be removed.
| // Force Kerberos failure | ||
| System.setProperty("java.security.krb5.conf", "/invalid/path"); |
There was a problem hiding this comment.
I think you need to update the comment as java.security.krb5.conf is for jvm kerberos probe to fail and accordingly assertion should check for -- JVM Kerberos Properties -- not Kerberos configuration.
DiagnoseSubcommand prints a banner for every probe:
So "-- Kerberos Configuration --" is always printed when that probe runs, whether it PASS, WARN, or FAIL.
Asserting that string only proves “this section appeared,” not that KerberosConfigProbe was the thing that failed.
So: the test works for “something Kerberos-related failed,” but the intent and assertions don’t match what really broke.
There was a problem hiding this comment.
Updated the comment , although the failure case i was testing for KerberosConfigProbe ,thats is why i was checking "-- Kerberos Configuration --
Now updated it to -- JVM Kerberos Properties --
There was a problem hiding this comment.
That's fine you could have also done it either way.
| boolean valid = true; | ||
|
|
||
| // Print relevant configs | ||
| print(conf, OzoneConfigKeys.OZONE_ACL_ENABLED); |
There was a problem hiding this comment.
We should also add "ozone.authorization.enabled" in the authorisation probe,since this acts as a master switch to enable/disable authorization checks in Ozone(admin privilege checks and ACL checks).
| print(conf, OzoneConfigKeys.OZONE_ACL_ENABLED); | |
| print(conf, OzoneConfigKeys.OZONE_AUTHORIZATION_ENABLED); | |
| print(conf, OzoneConfigKeys.OZONE_ACL_ENABLED); |
There was a problem hiding this comment.
thanks it make sense , added.
|
Thanks @Gargi-jais11 for the review, tried addressing review suggestions. |
Gargi-jais11
left a comment
There was a problem hiding this comment.
Thanks @navinko for updating the patch. Just few more minor comments.
| warn("Ozone security is disabled (ozone.security.enabled=false). " | ||
| + "Authorization checks are not enforced in non-secure mode."); |
There was a problem hiding this comment.
Instead of hardcoding please use the constants for key and value:
"Ozone security is disabled (" + OZONE_SECURITY_ENABLED_KEY + "=" + OZONE_SECURITY_ENABLED_KEY_DEFAULT + ". Authorization checks are not enforced in non-secure mode."
| OzoneConfigKeys.OZONE_AUTHORIZATION_ENABLED, true); | ||
|
|
||
| if (!ozoneAuthEnabled) { | ||
| warn("Ozone authorization is disabled (ozone.authorization.enabled=false). " |
There was a problem hiding this comment.
Similarly use the constants for ozone.authorization.enabled and its default value.
There was a problem hiding this comment.
Thanks @Gargi-jais11
Updated the changes .
Gargi-jais11
left a comment
There was a problem hiding this comment.
Thanks @navinko for updating the patch. LGTM!
Please document this tool as well. Is there any tracking jira for documenting this new tool added?
Thanks @Gargi-jais11 |
|
|
||
| // Read rule | ||
| String rules = conf.getTrimmed( | ||
| "hadoop.security.auth_to_local", "DEFAULT"); |
There was a problem hiding this comment.
Can use CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTH_TO_LOCAL.
| public static final String HDDS_SCM_HTTP_AUTH_TYPE = "hdds.scm.http.auth.type"; | ||
|
|
There was a problem hiding this comment.
Just noticed that this is a duplicate of:
There was a problem hiding this comment.
My bad , removed and reused exiting.
|
|
||
| // Validate Ozone ACLs enforcement | ||
| boolean ozoneAclEnabled = conf.getBoolean( | ||
| OzoneConfigKeys.OZONE_ACL_ENABLED, false); |
There was a problem hiding this comment.
| OzoneConfigKeys.OZONE_ACL_ENABLED, false); | |
| OzoneConfigKeys.OZONE_ACL_ENABLED, OzoneConfigKeys.OZONE_ACL_ENABLED_DEFAULT); |
There was a problem hiding this comment.
Thanks @adoroszlai ,
I have just updated the changes.
adoroszlai
left a comment
There was a problem hiding this comment.
Thanks @navinko for the reminder that the patch is updated.
|
Thanks @navinko for the patch, @Gargi-jais11 for the review. |
Thank you so much @adoroszlai , @Gargi-jais11 for your continued support through the review process and for getting this PR merged. |
What changes were proposed in this pull request?
HDDS-3128. Add support for kdiag and kerbname commands to ozone script
1: ozone debug kerberos translate-principal .. one or many .
This command translates Kerberos principals to local user names using the configured hadoop.security.auth_to_local rules.
Features:
2: ozone debug kerberos diagnose
This command provides Kerberos diagnostics for Ozone clusters similar to hadoop kdiag.
Features:
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-3128
How was this patch tested?