
In my server application I'm connecting to a Kerberos-secured Hadoop cluster from my Java application. I'm using various components like the HDFS file system, Oozie, Hive, etc. On application startup I call

UserGroupInformation.loginUserFromKeytabAndReturnUGI( ... );

This returns a UserGroupInformation instance which I keep for the application's lifetime. When performing privileged actions I launch them with ugi.doAs(action).
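
To illustrate, the pattern looks roughly like this (the principal, keytab path, and action are placeholders, not my real values):

    import java.security.PrivilegedExceptionAction;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    public class HadoopClientApp {

        private UserGroupInformation ugi;

        public void start() throws Exception {
            // One-time keytab login at application startup.
            ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
                "myservice@EXAMPLE.COM", "/etc/security/keytabs/myservice.keytab");
        }

        public boolean exists(final String path) throws Exception {
            // Every privileged action runs inside ugi.doAs(...).
            return ugi.doAs((PrivilegedExceptionAction<Boolean>) () -> {
                FileSystem fs = FileSystem.get(new Configuration());
                return fs.exists(new Path(path));
            });
        }
    }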

This works fine, but I wonder if and when I should renew the Kerberos ticket in the UserGroupInformation. I found the method UserGroupInformation.checkTGTAndReloginFromKeytab(), which seems to do the ticket renewal whenever the ticket is close to expiry. I also found that this method is called by various Hadoop tools, WebHdfsFileSystem for example.

Now if I want my server application (possibly running for months or even years) to never experience ticket expiry, what is the best approach? To put it as concrete questions:

  1. Can I rely on the various Hadoop clients to call checkTGTAndReloginFromKeytab whenever it's needed?
  2. Should I ever call checkTGTAndReloginFromKeytab myself in my code?
  3. If so, should I do that before every single call to ugi.doAs(...), or rather set up a timer and call it periodically (and how often)?
  • Can you check under what condition 'UserGroupInformation.checkTGTAndReloginFromKeytab()' is called in those Hadoop tools you're mentioning? I have also been using UserGroupInformation.loginUserFromKeytabAndReturnUGI for quite some time in long-running applications and have not faced any issue yet. – Harinder Jan 05 '16 at 16:53
  • I didn't face any issues yet either, and I'm never calling it myself. But I wonder if I'm just lucky or it's by design ;) – Jan Zyka Jan 05 '16 at 20:02
  • No, I think we don't need to call this explicitly, since neither of us is facing issues without it. But let's wait and see if someone else shares some knowledge on this :) – Harinder Jan 06 '16 at 11:29
  • My 2 cents: http://stackoverflow.com/questions/33211134/hbase-kerberos-connection-renewal-strategy/33243360#33243360 – Samson Scharfrichter Jan 06 '16 at 12:09
  • Maybe you can find insights in Steve Loughran's GitBook explicitly named **Hadoop and Kerberos: The Madness Beyond the Gate** (I heard that he's still debugging the Kerberos client in Spark these days - once an exorcist, always an exorcist...) – Samson Scharfrichter Jan 06 '16 at 14:35
  • Last comment: what you are doing (persistent service connecting to HDFS, Hive, Oozie, etc) sounds very similar to what **Cloudera Hue** UI does. I never looked into their code base, but you should. – Samson Scharfrichter Jan 06 '16 at 14:39
  • @SamsonScharfrichter Thanks for those valuable 2 cents :). So from what I understand, in our case it's working because the renewable lifetime is long enough, and we don't see this fail because under the hood the Hadoop API calls are executing the equivalent of the kinit -R command? – Harinder Jan 06 '16 at 16:46
  • ...or maybe some thread / process somewhere silently uses `checkTGTAndReloginFromKeytab()` and re-creates the TGT in the *global* cache, for all other threads/processes on the same machine to use (although they are blissfully ignorant of that Invisible Hand). – Samson Scharfrichter Jan 06 '16 at 16:55
  • @SamsonScharfrichter yes that could be the case, but we need to check the API for that to be sure what's happening there. Thanks a lot. – Harinder Jan 09 '16 at 05:57
  • Or let's start a bounty seeking some more advice from specialized sources. – Harinder Jan 09 '16 at 05:59
  • Big thumbs up for mentioning Hadoop and Kerberos: The Madness Beyond the Gate. I just want to drop a direct link for anyone who lands here in the future: https://steveloughran.gitbooks.io/kerberos_and_hadoop/content/sections/hadoop_and_kerberos.html – Chris Nauroth Jan 09 '16 at 07:58
  • As a followup to the book, know that I'm writing a "kdiag" command for Kerberos diagnostics, current output logs: https://issues.apache.org/jira/browse/SLIDER-1035 – stevel Jan 09 '16 at 18:30

1 Answer


Hadoop committer here! This is an excellent question.

Unfortunately, it's difficult to give a definitive answer to this without a deep dive into the particular usage patterns of the application. Instead, I can offer general guidelines and describe when Hadoop would handle ticket renewal or re-login from a keytab automatically for you, and when it wouldn't.

The primary use case for Kerberos authentication in the Hadoop ecosystem is Hadoop's RPC framework, which uses SASL for authentication. Most of the daemon processes in the Hadoop ecosystem handle this by doing a single one-time call to UserGroupInformation#loginUserFromKeytab at process startup. Examples of this include the HDFS DataNode, which must authenticate its RPC calls to the NameNode, and the YARN NodeManager, which must authenticate its calls to the ResourceManager. How is it that daemons like the DataNode can do a one-time login at process startup and then keep on running for months, long past typical ticket expiration times?

Since this is such a common use case, Hadoop implements an automatic re-login mechanism directly inside the RPC client layer. The code for this is visible in the RPC Client#handleSaslConnectionFailure method:

    // try re-login
    if (UserGroupInformation.isLoginKeytabBased()) {
      UserGroupInformation.getLoginUser().reloginFromKeytab();
    } else if (UserGroupInformation.isLoginTicketBased()) {
      UserGroupInformation.getLoginUser().reloginFromTicketCache();
    }

You can think of this as "lazy evaluation" of re-login. It only re-executes login in response to an authentication failure on an attempted RPC connection.

Knowing this, we can give a partial answer. If your application's usage pattern is to login from a keytab and then perform typical Hadoop RPC calls, then you likely do not need to roll your own re-login code. The RPC client layer will do it for you. "Typical Hadoop RPC" means the vast majority of Java APIs for interacting with Hadoop, including the HDFS FileSystem API, the YarnClient and MapReduce Job submissions.
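
As an illustration (a minimal sketch, not prescriptive; the principal and keytab path are placeholders), this is the pattern that gets the automatic re-login for free, because the FileSystem operations go through Hadoop RPC:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    public class RpcBasedClient {
        public static void main(String[] args) throws Exception {
            // One-time keytab login at process startup.
            UserGroupInformation.loginUserFromKeytab(
                "myservice@EXAMPLE.COM", "/etc/security/keytabs/myservice.keytab");

            // These calls use Hadoop RPC, so an expired ticket triggers the
            // automatic re-login shown above in handleSaslConnectionFailure.
            FileSystem fs = FileSystem.get(new Configuration());
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath());
            }
        }
    }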

However, some application usage patterns do not involve Hadoop RPC at all. An example of this would be applications that interact solely with Hadoop's REST APIs, such as WebHDFS or the YARN REST APIs. In that case, the authentication model uses Kerberos via SPNEGO as described in the Hadoop HTTP Authentication documentation.

Knowing this, we can add more to our answer. If your application's usage pattern does not utilize Hadoop RPC at all, and instead sticks solely to the REST APIs, then you must roll your own re-login logic. This is exactly why WebHdfsFileSystem calls UserGroupInformation#checkTGTAndReloginFromKeytab, just like you noticed. WebHdfsFileSystem chooses to make the call right before every operation. This is a fine strategy, because UserGroupInformation#checkTGTAndReloginFromKeytab only renews the ticket if it's "close" to expiration. Otherwise, the call is a no-op.
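
A minimal sketch of that strategy (the callRestApi helper is hypothetical; the checkTGTAndReloginFromKeytab call before each operation is the point):

    import java.io.IOException;

    import org.apache.hadoop.security.UserGroupInformation;

    public class RestBasedClient {

        // Call before every REST operation; this is a no-op unless the
        // TGT is close to expiration.
        private void ensureFreshTgt() throws IOException {
            UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab();
        }

        public String getFileStatus(String path) throws Exception {
            ensureFreshTgt();
            // Hypothetical helper that issues the SPNEGO-authenticated
            // HTTP request to WebHDFS.
            return callRestApi("/webhdfs/v1" + path + "?op=GETFILESTATUS");
        }

        private String callRestApi(String uri) throws IOException {
            // ... SPNEGO HTTP call elided for brevity ...
            throw new UnsupportedOperationException("placeholder");
        }
    }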

As a final use case, let's consider an interactive process, not logging in from a keytab, but rather requiring the user to run kinit externally before launching the application. In the vast majority of cases, these are going to be short-running applications, such as Hadoop CLI commands. However, in some cases these can be longer-running processes. To support longer-running processes, Hadoop starts a background thread to renew the Kerberos ticket "close" to expiration. This logic is visible in UserGroupInformation#spawnAutoRenewalThreadForUserCreds. There is an important distinction here though compared to the automatic re-login logic provided in the RPC layer. In this case, Hadoop only has the capability to renew the ticket and extend its lifetime. Tickets have a maximum renewable lifetime, as dictated by the Kerberos infrastructure. After that, the ticket won't be usable anymore. Re-login in this case is practically impossible, because it would imply re-prompting the user for a password, and they likely walked away from the terminal. This means that if the process keeps running beyond expiration of the ticket, it won't be able to authenticate anymore.

Again, we can use this information to inform our overall answer. If you rely on a user to login interactively via kinit before launching the application, and if you're confident the application won't run longer than the Kerberos ticket's maximum renewable lifetime, then you can rely on Hadoop internals to cover periodic renewal for you.

If you're using keytab-based login, and you're just not sure if your application's usage pattern can rely on the Hadoop RPC layer's automatic re-login, then the conservative approach is to roll your own. @SamsonScharfrichter gave an excellent answer here about rolling your own.

HBase Kerberos connection renewal strategy

Finally, I should add a note about API stability. The Apache Hadoop Compatibility guidelines discuss the Hadoop development community's commitment to backwards-compatibility in full detail. The interface of UserGroupInformation is annotated LimitedPrivate and Evolving. Technically, this means the API of UserGroupInformation is not considered public, and it could evolve in backwards-incompatible ways. As a practical matter, there is a lot of code already depending on the interface of UserGroupInformation, so it's simply not feasible for us to make a breaking change. Certainly within the current 2.x release line, I would not have any fear about method signatures changing out from under you and breaking your code.

Now that we have all of this background information, let's revisit your concrete questions.

Can I rely on the various Hadoop clients to call checkTGTAndReloginFromKeytab whenever it's needed?

You can rely on this if your application's usage pattern is to call the Hadoop clients, which in turn utilize Hadoop's RPC framework. You cannot rely on this if your application's usage pattern only calls the Hadoop REST APIs.

Should I ever call checkTGTAndReloginFromKeytab myself in my code?

You'll likely need to do this if your application's usage pattern is solely to call the Hadoop REST APIs instead of Hadoop RPC calls. You would not get the benefit of the automatic re-login implemented inside Hadoop's RPC client.

If so, should I do that before every single call to ugi.doAs(...), or rather set up a timer and call it periodically (and how often)?

It's fine to call UserGroupInformation#checkTGTAndReloginFromKeytab right before every action that needs to be authenticated. If the ticket is not close to expiration, then the method will be a no-op. If you're suspicious that your Kerberos infrastructure is sluggish, and you don't want client operations to pay the latency cost of re-login, then that would be a reason to do it in a separate background thread. Just be sure to stay a little bit ahead of the ticket's actual expiration time. You might borrow the logic inside UserGroupInformation for determining if a ticket is "close" to expiration. In practice, I've never personally seen the latency of re-login be problematic.
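
If you do go the background-thread route, a minimal sketch with a ScheduledExecutorService might look like this (the one-minute period is an arbitrary illustration, not a recommendation):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    import org.apache.hadoop.security.UserGroupInformation;

    public class TgtRenewer {

        public static ScheduledExecutorService startRenewalThread(final UserGroupInformation ugi) {
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleWithFixedDelay(new Runnable() {
                @Override
                public void run() {
                    try {
                        // No-op unless the TGT is close to expiration.
                        ugi.checkTGTAndReloginFromKeytab();
                    } catch (Exception e) {
                        // Log and retry on the next tick; don't kill the thread.
                        e.printStackTrace();
                    }
                }
            }, 1, 1, TimeUnit.MINUTES);
            return scheduler;
        }
    }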

  • Thanks a lot. My usage is typically the RPC calls you mentioned, using the Java APIs like FileSystem, the MapReduce client, job.submit(). And I use UserGroupInformation.loginUserFromKeytabAndReturnUGI for getting the UGI. So I think I need not worry about my ticket renewal. – Harinder Jan 09 '16 at 10:29
  • Amazing answer. Hence awarding the bounty to you. – Harinder Jan 11 '16 at 04:52
  • Aceness. Exactly what I was looking for. – Jan Zyka Jan 11 '16 at 08:41
  • Wonderful answer, you rock Chris :) – tdebroc Feb 26 '16 at 10:17
  • Hi Chris, excellent answer indeed, but I have a question: how would Kerberos settings affect the automatic ticket renewal done at the Hadoop RPC layer? For instance, how would certain parameters in the Kerberos conf files like renew_lifetime and ticket_lifetime impact it? Can you just validate if my understanding is correct? As long as the ticket is renewable and has some ticket_lifetime and renew_lifetime, Hadoop would be able to keep renewing the ticket indefinitely unless the app is stopped, right? – Niranjan Subramanian Oct 15 '16 at 12:47
  • @NiranjanSubramanian, yes, Hadoop can keep renewing the ticket based on the Kerberos configuration. This is not based on the Hadoop code itself directly reading krb5.conf. Instead, Hadoop code accesses the information attached to the ticket itself (the JDK class javax.security.auth.kerberos.KerberosTicket). The Hadoop code inspects the ticket and always tries to keep it renewed slightly ahead of its expiration, so that from the client's perspective, the ticket always works. – Chris Nauroth Oct 18 '16 at 18:38
  • @ChrisNauroth Thanks for the explanation. One last question: is there any recommended Kerberos configuration documented by Hadoop? I can't find any information on how Kerberos should be configured. Is it okay to assume that the default configuration would suffice and no specific configuration is recommended? – Niranjan Subramanian Oct 19 '16 at 05:52
  • @NiranjanSubramanian, I don't believe there are specific recommendations around krb5.conf for Hadoop. Typically, this relates more to the operations team's decisions on how they want to manage Kerberos. I usually see ticket_lifetime=24h and renew_lifetime=7d, but this is really more of a guideline than a requirement. We often run system tests using much shorter lifetimes to try to stress test the renewal process, so I know that Hadoop works fine with much shorter lifetimes than this. – Chris Nauroth Oct 19 '16 at 19:49
  • @ChrisNauroth Thanks once again this answers my question. :) – Niranjan Subramanian Oct 20 '16 at 05:00
  • @ChrisNauroth I'm quite confused about when UserGroupInformation#spawnAutoRenewalThreadForUserCreds kicks in. It has the limitation that the ticket is no longer renewable after its maximum renewable lifetime. In my case I'm just using the FileSystem API for writing continuously to HDFS; basically my app is an HDFS writer, and I do a single one-time call to UserGroupInformation#loginUserFromKeytab at process startup. Do I need to worry about the above, or would Client#handleSaslConnectionFailure suffice? Can you explain the difference in login mechanism between these two APIs? – Niranjan Subramanian Oct 26 '16 at 06:48
  • @NiranjanSubramanian, `UserGroupInformation#spawnAutoRenewalThreadForUserCreds` is only relevant for applications that rely solely on the Kerberos ticket cache. In general, that means an interactive login that calls `kinit` before launching a command. If your application instead uses a keytab and calls `UserGroupInformation#loginUserFromKeytab` at startup before calling the `FileSystem` API, then you don't need to worry about this. The logic in the RPC layer to re-login automatically from the keytab has you covered. – Chris Nauroth Oct 26 '16 at 16:37
  • @ChrisNauroth Thanks once again for the clarification. :) Hopefully this should be my last doubt any idea what would've caused this exception java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "hc4t01087/16.202.8.232"; destination host is: "g4t7501.houston.net":8020;? – Niranjan Subramanian Oct 26 '16 at 17:59
  • @ChrisNauroth Our application has an issue with auto renewal. We are using JDK 8 for our application, and that might create an issue with automatic renewal, [HADOOP-10786](https://issues.apache.org/jira/browse/HADOOP-10786). If we are not able to downgrade to the JDK 7 line due to other dependencies, will calling `checkTGTAndReloginFromKeytab` before all RPCs solve the issue, and is it advisable, given that it's a no-op if the ticket is not close to expiration? – manthosh Dec 08 '16 at 02:22
  • @manthosh No, unfortunately if you are suffering from [HADOOP-10786](https://issues.apache.org/jira/browse/HADOOP-10786), then calling `checkTGTAndReloginFromKeytab` won't help. The nature of that bug is that it tricks the relogin code into thinking that the original login was not based on a keytab, so `checkTGTAndReloginFromKeytab` becomes a no-op. AFAIK, the only workarounds are either to downgrade to an earlier JDK version or upgrade to a Hadoop version that contains the [HADOOP-10786](https://issues.apache.org/jira/browse/HADOOP-10786) patch. – Chris Nauroth Dec 09 '16 at 21:23
  • @ChrisNauroth Can you shed some light on [this](http://stackoverflow.com/questions/41087997/auto-renewal-of-kerberos-ticket-not-working-from-java)? – manthosh Dec 13 '16 at 05:35
  • @manthosh , thanks for sharing the details of your setup on your new question. I posted an answer there. – Chris Nauroth Dec 13 '16 at 17:01
  • @ChrisNauroth Need some clarification again. When an application calls UserGroupInformation#loginUserFromKeytab at process start, does the application obtain a ticket by itself even if the ticket cache is empty? If yes, can you throw some light on how it's done? Or does the user have to do a manual kinit beforehand for the principal this application uses? If a ticket isn't manually obtained via kinit into the ticket cache, the application throws the exception "No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)". Is a manual kinit a required precondition? – Niranjan Subramanian May 09 '17 at 18:19
  • @NiranjanSubramanian, for an application calling `UserGroupInformation#loginUserFromKeytab`, the application obtains a ticket for itself. It does this by reading the keytab file passed into the method and using the credentials stored in that file to authenticate with Kerberos. It should not be necessary to run `kinit` or otherwise populate the ticket cache before launching such an application. If this isn't working, then it implies something is failing with login from the keytab file. Try checking permissions on the keytab file, and try testing the keytab using `kinit -kt`. – Chris Nauroth May 12 '17 at 23:28
  • @ChrisNauroth Thanks for the explanation. So when the application obtains a ticket, it doesn't interact with the Kerberos ticket cache at all, right? As you've mentioned in your previous comment, it just gets the ticket as the class javax.security.auth.kerberos.KerberosTicket, probably stores it and renews it internally, right? – Niranjan Subramanian May 13 '17 at 04:21
  • @ChrisNauroth I got the answer to my previous question. So when the HDFS client authenticates with the Kerberos server, internally it uses the default value of useTicketCache = false in its configuration and puts the obtained TGT in the authenticated Subject's private credential set. – Niranjan Subramanian May 15 '17 at 12:06
  • We are seeing [this JDK bug](https://bugs.openjdk.java.net/browse/JDK-8147772) in a stack trace originating from `SaslRpcClient#saslConnect` (which immediately calls `GssKrb5Client#evaluateChallenge`). This starts happening right at the time the ticket expires (in our case, 10 hours). So it appears that the JDK bug is stopping the Hadoop RPC auto-renewal mechanism from working as intended. Does anyone have any advice on working around this until the JDK bug is fixed? – Jeff Evans Aug 24 '17 at 17:54