5/5/11

Your Authentication is only as good as your last packet

Hello again,

It seems a bit unfortunate that the times when I'm really inspired to write are the times when everything (or almost everything) comes crashing down.

So, in short, today I'll write a little about Kerberos authentication and how it's affected by the smallest things.

Just a while ago, one of our clients called with a problem - his web server was working awfully slowly, to the point where certain actions timed out after about 20 minutes of being stuck on a POST request.
To clarify, there are two web servers involved, both running Windows Server 2003 Standard x86 with IIS 6. The application in question was using Kerberos authentication to confirm credentials on certain actions.

The first thing we noticed was that when we switched IIS to anonymous authentication the request didn't hang (though the action itself didn't do what it should, because it had no credentials), and a page was returned almost immediately.
Second thing we noticed was that one of the servers took a really long time to get through the login process.

So the immediate thing to do was to run Authentication Diagnostics for IIS by Microsoft (Download x86 Version Here) - it's a pretty handy utility to troubleshoot with. We ran some diagnostics and everything seemed OK, except for a failure saying that some users running application services didn't have an SPN registered in Active Directory. That seemed strange to me at the time because, as far as I recalled, SPNs were usually reserved for computer objects (they can in fact also be registered on the user accounts that services run under, which is what the tool was checking).
The next thing I decided to do was to iisreset both of the web servers and see if resetting the app pools together did anything of use. After the iisreset on one of the servers, everything seemed to work fine. I was glad to see that at least I got stuff working again, and quickly switched to the second server to check it, but it was still doing its iisreset so I had to wait. Once the iisreset was done on the second server I checked it and saw that it still didn't work, and when I went back to the "working one" it had reverted to not working. I had a feeling that the brief success was connected to the fact that the second server was down at the time, so I decided to stop the WWW publishing service (W3SVC) on the second server and guess what - the "working one" was working again. This fit with the fact that the server whose service I had stopped was the same server that took forever to log in.
After trying to troubleshoot with Event Viewer on both the web server and the DCs it was trying to contact, and coming up with nothing in particular, one of our network architects told us to run a network sniffer on both the web server and the DC and see what was going on - a great idea, seeing that we had no leads whatsoever on the problem except that it had something to do with authentication.
After some digging with the sniffer we noticed that all the Kerberos (KRB) requests from the problematic server were made over UDP. A short comparison with the working server showed that it was using TCP for its KRB requests. I already knew that there's a way to force a server to use TCP for Kerberos authentication, but I still didn't know what the problem was with using UDP.
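
The comparison between the two servers boils down to counting port 88 traffic by transport. Here is a minimal sketch of that tally in Python. The `capture` list is made-up sample data standing in for a sniffer export, and `kerberos_transports` is just a name I picked for illustration; it is not part of any sniffer's API.

```python
from collections import Counter

KERBEROS_PORT = 88  # well-known KDC port, used for both UDP and TCP


def kerberos_transports(packets):
    """Count packets addressed to the KDC port, grouped by transport."""
    counts = Counter()
    for transport, dst_port in packets:
        if dst_port == KERBEROS_PORT:
            counts[transport.upper()] += 1
    return counts


# Hypothetical capture summary: (transport, destination port) per packet.
capture = [("udp", 88), ("udp", 88), ("tcp", 80), ("udp", 53), ("tcp", 88)]

if __name__ == "__main__":
    for transport, count in kerberos_transports(capture).items():
        print(f"{transport}: {count} packet(s) to port {KERBEROS_PORT}")
```

A server whose tally shows only UDP for port 88 is the one worth a closer look.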
I went online and stumbled upon KB 244474 (by Microsoft, of course), which explained that up to Windows 2003 the default transport for Kerberos authentication is UDP, meaning that UDP is tried first and TCP second (as far as I can tell). The main issue with UDP according to the KB is that Kerberos over UDP is supported only up to a certain packet size, and that if a fragmented packet is received by the DC, it will be dropped.
Knowing that, we went back to the sniffer and noticed that the web server was sending requests in the following fashion: first a fragmented packet right at the size limit, then a smaller packet following it. The sniffer on the DC showed that only the smaller packet was being received. From this I gathered the following: the web server was trying to send requests larger than the limit for UDP, so they were fragmented and sent according to the size limit, but the DC was dropping them because they were fragmented. This seemed to be the cause of the hangs we'd been having. I also figured that the process wasn't timing out because the DC was receiving only part of every request and so never completed the negotiation. Most of this comes from common sense and past experience rather than confirmed fact; if I get it confirmed I'll be sure to post back.
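
To make the "one big packet plus one small packet" pattern concrete, here is a rough Python sketch of the arithmetic. It assumes plain IPv4 with a 20-byte header, an 8-byte UDP header and a 1500-byte MTU. The 1900-byte message size and the thresholds in the usage lines are made-up numbers for illustration, not values taken from our capture, and `choose_transport` is only my simplified reading of the size rule in the KB.

```python
IP_HEADER = 20   # bytes, IPv4 header without options (assumption)
UDP_HEADER = 8   # bytes


def choose_transport(message_len, max_packet_size):
    """Simplified rule: UDP up to the threshold, TCP for anything larger."""
    return "UDP" if message_len <= max_packet_size else "TCP"


def ipv4_fragments(udp_payload_len, mtu=1500):
    """Return the on-wire size of each IPv4 fragment of one UDP datagram."""
    remaining = UDP_HEADER + udp_payload_len
    # Every fragment except the last must carry a multiple of 8 data bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    sizes = []
    while remaining > 0:
        chunk = min(per_fragment, remaining)
        sizes.append(chunk + IP_HEADER)
        remaining -= chunk
    return sizes


if __name__ == "__main__":
    print(choose_transport(1900, 2000))  # hypothetical threshold that still allows UDP
    print(ipv4_fragments(1900))          # one full-size fragment plus a small one
    print(ipv4_fragments(1400))          # fits in a single packet
```

If either fragment goes missing on the way, the receiver cannot reassemble the datagram, so the whole request is lost even though part of it arrived.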
So, once we forced the problematic server to use TCP for KRB requests, the long login time was solved and so was the original problem.
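
For reference, the way KB 244474 describes forcing TCP is a `MaxPacketSize` DWORD under the Kerberos `Parameters` key, set to 1 so that practically every request exceeds the UDP threshold. The Python sketch below only builds the text of a .reg file for that value; `force_tcp_reg` is a helper name I made up, and the key path and value should be checked against the KB before importing anything on a real server. As I understand it, the change takes effect only after a restart.

```python
# Registry location described in KB 244474 (verify against the KB before use).
KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa\Kerberos\Parameters"


def force_tcp_reg(max_packet_size=1):
    """Build .reg file text that sets the Kerberos MaxPacketSize DWORD."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{KEY}]",
        # A value of 1 means any message larger than 1 byte goes over TCP.
        f'"MaxPacketSize"=dword:{max_packet_size:08x}',
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(force_tcp_reg())
```

Generating the text rather than writing to the registry directly keeps the sketch harmless to run anywhere and leaves the actual import as a deliberate manual step.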
This is what I think happened to cause the problem in the first place - 
The two web servers are separate, but they use the same back-end servers. I guess that in order to perform actions involving the back-end servers they had to authenticate with a DC. The problematic server was trying to access the back-end servers, and once it had opened a transaction with them it went on to authenticate with a DC, a process that it could not complete in a timely manner. At the same time, the second server was prevented from working with the back-end servers because they already had an open transaction with the first one. This seems to me like the most probable reason for this whole mess.
I've yet to confirm all the details, but I'm pretty sure I got most of it right (if not all of it), and solving the problem seems like enough evidence on that account.

My thoughts on the subject
  1. Starting with Windows 2008, the default protocol for Kerberos authentication is TCP. I can't see why Microsoft didn't release a hotfix to make Windows 2003 servers act the same way.
  2. When working in a front-end/back-end configuration using AD credentials, be sure to open a transaction only after the credentials have been verified by a DC.
  3. IIS Authentication Diagnostics should have tested for UDP authentication specifically, knowing that this is the default protocol in Windows 2003.
  4. Don't ignore "minor" problems while facing larger issues. These so-called "minor" issues could very well be your lead on the bigger problem.
Anyways, now everything seems to be fine. If I gather any more info on the subject, I'll revisit the issue in future posts.

TCP is Power,
Dani.