7/10/11

Orphaned Sites and how to treat them

Hello again,

Today I'll be talking about time sync and what it has to do with GPO.
Recently, we've been experiencing issues with stations in certain sites not receiving GPOs from the domain.
We'd been troubleshooting this issue for a while and there seemed to be no problem with the stations themselves: their logon process was correct and each station knew which DC was closest to it (or so we thought).
As it turns out, the one thing common to all the stations experiencing GPO difficulties was W32Time errors in the System event log.
We used the nltest tool to enable netlogon debugging (you can do so by typing the following in cmd - nltest /dbflag:0x2080ffff). Then we restarted the W32Time service and tried to sync it manually. We noticed in the log that the station knew it was in a certain site (and it was the right site) and got the DC corresponding to our site topology, but the station declined that DC because it was from a different site.

Right now, I'd like to take a few lines to understand the "concept" of DC-less sites.
When you look through the basics of Active Directory design, there's a basic rule that says that in most cases, each site should get at least one DC (taking the number of users and the bandwidth of the site in question into account). One of the exceptions is a site created only in order to provide other "site-oriented" services (like DFS Namespaces, for example).
In cases such as those, stations in certain sites are left "orphaned" and seek out a DC to "adopt" them. Microsoft predicted this scenario and built an auto-coverage mechanism that depends on the site topology (i.e. site links), thus providing you with an automatic mechanism that returns the closest resource of the relevant service ("closest" meaning minimal total cost of site links). In short, you get the closest server that can provide the needed service (like a DC for logon requests).
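To make the "closest by site-link cost" idea concrete, here's a little sketch (in Python, with made-up site names and costs - a model for illustration, not anything the locator actually runs) of how a DC-less site ends up adopted by the cheapest covering site:

```python
# Hypothetical model of AD "auto site coverage": a DC-less site is
# covered by the DC whose site has the lowest total site-link cost.
# All site names, links and costs below are invented for this example.
import heapq

def cheapest_dc_site(links, start, dc_sites):
    """Dijkstra over the site-link graph; returns (site, total cost)."""
    dist = {start: 0}
    queue = [(0, start)]
    while queue:
        cost, site = heapq.heappop(queue)
        if site in dc_sites:
            return site, cost
        for neighbor, link_cost in links.get(site, []):
            new_cost = cost + link_cost
            if new_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor))
    return None, None

# Site "A" has no DC; sites "B" and "C" do.
links = {
    "A": [("B", 100), ("C", 250)],
    "B": [("A", 100), ("C", 50)],
    "C": [("A", 250), ("B", 50)],
}
print(cheapest_dc_site(links, "A", {"B", "C"}))  # ('B', 100)
```

The real mechanism lives inside netlogon and DNS, of course - this is just the cost comparison that decides which site "adopts" the orphan.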

Now that we've established a common baseline on "AD site auto-coverage", I'll elaborate a little more on our situation. All the stations in question were located in DC-less sites, thus depending on the auto-coverage mechanism. As far as we could see, this worked perfectly: netlogon logged that the station was in site "A" and got a DC from its parent site ("B"), but for some reason the station refused to accept the returned DC. After a few tries, the station declared that there were no available domain controllers, which meant that any cached info the station held about a reliable DC to contact was being flushed on a regular basis.
Thus far, we concluded that the station wasn't able to establish a reliable connection with a DC to get GPO information, but didn't know why.
Turns out the W32Time service has a parameter called CrossSiteSyncFlags, which is responsible for the "cross-site" behavior of W32Time. In short, it defines what's to be done when W32Time needs to rely on the site topology. The parameter has 3 modes: the first doesn't allow the time service to sync with any source outside the machine's site; the second only allows syncing within the machine's site and with the PDC; the option we needed (and which wasn't configured) was the third one - the machine can synchronize with a source outside its site. This parameter can be configured via GPO (more about it in the Tech Reference). As soon as we configured the parameter to the third "mode", there were no more time issues on the stations, and it seemed that all was well.

Sadly, after we restarted the stations, the same problem occurred. We started wondering whether there were any other site-related reasons that could prevent a station from getting GPOs. Just to be clear - we have a very strict firewall policy concerning inter-site connections. The short version is: no station is allowed to communicate outside the site topology, meaning stations can only communicate with machines in sites that have a link to the station's own site. After some thinking, it was concluded (not by me) that since we're using mostly Windows 2003 domain controllers, we probably don't have SiteCostedReferrals enabled.
What SiteCostedReferrals essentially means is that the DFS namespace mechanism will work according to the site topology as defined by link costs. If you don't enable it, clients in sites that don't have a DFS resource of their own will access other available members of the DFS namespace at random. As you all know, the sysvol and netlogon folders are essential to the group policy process and are accessed like a regular DFS namespace. So, in our case, stations in sites with no DC (no DFS resource) went on to access random sites and were blocked by our nice and friendly firewall. When we enabled SiteCostedReferrals on all our DCs, the problem sorted itself out and the stations could finally make a stable connection to the \\domain\sysvol folder. This "issue" was resolved starting with Windows Server 2008.
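To picture the difference, here's a sketch (Python again, with invented servers and costs - a model of the ordering, not the actual DFS code): with site-costed referrals the out-of-site targets come back ordered by site-link cost, and without it they come back in effectively random order, which is exactly what our firewall was blocking:

```python
# Model of DFS referral ordering: in-site targets first, then either
# cost-ordered (SiteCostedReferrals on) or shuffled (off).
import random

def order_referrals(targets, client_site, site_cost, site_costed=True):
    """targets: list of (server, site); site_cost: cost from client_site."""
    in_site = [t for t in targets if t[1] == client_site]
    out_site = [t for t in targets if t[1] != client_site]
    if site_costed:
        out_site.sort(key=lambda t: site_cost[t[1]])  # honor topology
    else:
        random.shuffle(out_site)  # any reachable target, topology ignored
    return in_site + out_site

targets = [("DC1", "B"), ("DC2", "C"), ("DC3", "D")]
cost_from_a = {"B": 100, "C": 250, "D": 500}
print(order_referrals(targets, "A", cost_from_a))
# [('DC1', 'B'), ('DC2', 'C'), ('DC3', 'D')]
```

With site_costed=False, a client in site "A" may be handed DC3 first - a server our firewall would never let it reach.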

So, a few things you need to keep in mind if you're implementing DC-less sites in your Active Directory design : 
  • Make sure that the machines in your domain are accessing other resources on the domain based on the domain's site topology.
  • Make sure your time sync mechanism is functioning, and if not, look into your CrossSiteSyncFlags configuration.
  • If your DCs (or any DFS-enabled servers, for that matter) run Windows Server 2003 or earlier, enable SiteCostedReferrals - or bear with the fact that machines in sites with no local DFS resource will not comply with your site topology.
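On the CrossSiteSyncFlags bullet - besides GPO, the same setting can be flipped directly in the registry. To the best of my memory the shape is the following (values: 0 - no cross-site sync, 1 - PDC only, 2 - any source, the "third mode" from above); treat it as a sketch and verify against the Tech Reference before deploying:

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\Config]
; 2 = "All": the machine may sync with a time source outside its own site
"CrossSiteSyncFlags"=dword:00000002
```

A restart of the W32Time service is needed for the change to take effect.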

Yours truly,
Dani .H

Technical Reference

6/27/11

An LDAP in VBScript's Clothing

Hello again,

It's been a while since my last post. Haven't been inspired much lately.
This post will be a short one, and hopefully the next one will be longer and will interest you more - I'm planning on doing a piece regarding DFS.

So, we had a situation today that required us to change a specific user attribute for all users in a specific group.
There are a couple of ways to deal with this situation. If it's a so-called "mainstream" attribute, you can use the infamous dsmod cmd tool - it allows you to change certain attributes, and if you combine it with dsquery and dsget it will come through for you. Sadly, not all cases are that simple, forcing you to utilize a more advanced set of solutions.
The first thing that comes to mind is PowerShell. If you wish to use PowerShell, you can do so using the DirectoryServices library. Another way, a little less complex, is using the QAD (Quest Active Directory) add-on.
There's also a handy tool called ADModify, enabling you to accomplish a vast variety of LDAP actions.

To my great disappointment, I had no Internet access and no machines with PS installed on them in the domain in question, so I did what any of you would do (I imagine) - wrote a VBScript. 

The common way to connect to your AD with VBScript is by creating an object that has a DN (Distinguished Name) string like so : 
dim objADUser
set objADUser = GetObject("LDAP://CN=username,OU=someou,DC=domain,DC=com")

In case you want to change a certain attribute of the user object, it's easily done in the following way :
objADUser.Put "AttributeName", "NewValueForAttribute"
objADUser.SetInfo
(note that Put only updates the local property cache - it's SetInfo that writes the change back to AD)

In my case, I needed to get all the users of a certain group, which is done by writing :
dim objADGroup
set objADGroup = GetObject("LDAP://CN=groupname,OU=someou,DC=domain,DC=com")
objADGroup.GetInfo
(so far we've set up the group for exploration)
arrMembers = objADGroup.GetEx("member")
(this here, gets the group's membership and sets it into the arrMembers array)

After doing all this, we just need to combine it into one simple script, like so :
dim objADUser
dim objADGroup
dim arrMembers
set objADGroup = GetObject("LDAP://CN=groupname,OU=someou,DC=domain,DC=com")
objADGroup.GetInfo
arrMembers = objADGroup.GetEx("member")
For Each strMember In arrMembers
    set objADUser = GetObject("LDAP://" & strMember)
    objADUser.Put "AttributeName", "NewValueForAttribute"
    objADUser.SetInfo
Next


I suppose this can be done in a more elegant and efficient way, but these few lines do the job well.
You're welcome to use them if you wish.

There are also other tools out there to modify AD objects, but seeing that writing a script is so simple, there's no real reason to reach for a third-party tool.

Hope you've learnt something from this post.
Until next time,
Dani .H

Technical Reference:
Great tutorials about VBScripting for AD - Users, Groups

5/5/11

Your Authentication is only as good as your last packet

Hello again,

It seems a bit unfortunate that the times when I'm really inspired to write are the times when everything (or almost everything) comes crashing down.

So, in short, today I'll write a little about Kerberos authentication and how it's affected by the smallest things.

Just a while ago, one of our clients called with a problem - his web server was working awfully slow, to the point where certain actions timed out after about 20 minutes of being stuck on a POST request.
To clarify the setup, both web servers are running Windows 2003 Std x86 with IIS 6. The application in question was using Kerberos authentication to confirm credentials on certain actions.

The first thing we noticed was that when we switched IIS to anonymous authentication, the request didn't hang (though the action itself didn't perform as needed, since it had no credentials), and we got a page back almost immediately.
The second thing we noticed was that one of the servers took a really long time to get through the login process.

So the immediate thing to do was to run Microsoft's Authentication Diagnostics for IIS - it's a pretty handy utility to troubleshoot with. We ran some diagnostics and everything seemed OK, except for a failure saying that some of the users running application services didn't have an SPN available in Active Directory, which seemed strange, because as far as I recalled, SPNs were usually reserved for computer objects.
The next thing I decided to do was to iisreset both web servers and see if resetting the app pools did anything of use. After the iisreset on one of the servers, everything seemed to work fine. I was glad to see that at least I got stuff working again, and quickly switched to the second server to check it, but it was still mid-iisreset so I had to wait. Once the iisreset was done on the second server, I checked it and saw that it still didn't work - but when I went back to the "working" one, it had reverted to not working again. For some reason I had a feeling that the earlier success was connected to the fact that the second server was down, so I decided to stop the w3p service on the second server and guess what - the "working" one was working again. This was in tune with the fact that the server whose w3p service was stopped was the same server that took forever to log in.
After trying to troubleshoot with Event Viewer on both the web servers and the DCs they were trying to contact, and coming away with nothing in particular, one of our network architects told us to run a network sniffer on both a web server and a DC and see what's going on - a great idea, seeing that we had no leads whatsoever on the problem, except that it had something to do with authentication.
After some digging with the sniffer, we noticed that all the KRB requests were being made over UDP. A short comparison with the working server showed that it was using TCP for KRB requests. I already knew there's a way to force my server to use TCP for Kerberos authentication, but I still didn't know what the problem was with using UDP for this process.
I went online and stumbled upon KB 244474 (by Microsoft, of course), which explains that up to Windows 2003, the default protocol for Kerberos authentication is UDP, meaning UDP is tried first and TCP second (as far as I can tell). The main issue with UDP, according to the KB, is that KRB authentication over UDP is only supported up to a certain packet size, and that if a fragmented packet is received by the DC - it will be dropped. Knowing that, we went back to the sniffer and noticed that the web server was sending requests in the following fashion: first a fragmented packet on the verge of the size limit, with a smaller packet following it. The sniffer on the DC logged that only the smaller packet was being received. From this I gathered the following: the web server was trying to send packets larger than the UDP limit, so the packets were fragmented and sent according to the size limit, but the DC was dropping them on the basis of them being fragmented. This seemed to be the issue causing the hangs we'd been having. I also figured that the process wasn't timing out quickly because the DC was only receiving part of every request and thus never completing the negotiation. That last part relies mostly on common sense and past experience - if I get it confirmed, I'll be sure to post back.
So, once we forced the problematic server to use TCP for KRB requests, the long login time was solved and so was the original problem.
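For completeness, this is the registry change KB 244474 describes for forcing TCP, as I remember it (the Parameters key may need to be created if it doesn't exist) - double-check against the KB before applying:

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa\Kerberos\Parameters]
; Kerberos messages larger than MaxPacketSize (bytes) go over TCP.
; A value of 1 effectively forces TCP for every request.
"MaxPacketSize"=dword:00000001
```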
This is what I think happened to cause the problem in the first place - 
Both web servers are separate, but they use the same back-end servers. I guess that in order to perform actions involving the back-end servers, they had to authenticate with a DC. The problematic server was trying to access the back-end servers, and once it established a transaction with them, it went on to authenticate with a DC - a process it could not accomplish in a timely manner. At the same time, the second server was prevented from working with the back-end servers, because they already had an open transaction with another server. This seems to me like the most probable explanation for this whole mess.
I've yet to confirm all the details, but I'm pretty sure I got most of it right (if not all of it), and solving the problem seems like enough evidence on that account.

My thoughts on the subject
  1. Starting with Windows 2008, the default protocol for kerberos authentication is TCP. I can't see why Microsoft didn't release a Hotfix to make Windows 2003 Servers act the same way.
  2. When working in a front-to-back-end configuration using AD credentials, be sure to open a transaction only after the credentials have been verified by a DC.
  3. IIS Authentication Diagnostics should have tested for UDP authentication specifically, knowing that this is the default protocol in Windows 2003.
  4. Don't ignore "minor" problems while facing larger issues. These so called "minor" issues could very well be your lead on the bigger problem.
Anyways, now everything seems to be fine. If I gather any more info on the subject, I'll revisit the issue in future posts.

TCP is Power,
Dani.

4/15/11

When stations play Hide and Seek with DCs

Hey guys,

I hope to keep this post short, because it's not really something super complex. 
BTW, I'll be on vacation this week, so don't expect anything out of the ordinary - the muse usually comes to me at work :)

Today I'd like to talk about how a station chooses (or rather locates) a DC to communicate with. It's been on my mind this week, because I've had an opportunity to be a part of a technical job interview and the guy we interviewed didn't seem to know how this process works. I'd like to dedicate this post to him. 

To start off, I just want to point out that Microsoft probably has a more thorough explanation on their technet library, so I'm just going to simplify it a bit for those who want to complete a successful job interview and don't have the tolerance to read a lot of complex (and sometimes - good for nothing) technical terms.
So, first of all, the name of the process that commences the "search" is the "DC Locator" (no surprises here, I hope). The DC Locator works its "charm" through the netlogon service, which means that if netlogon is down locally, you won't be able to find a DC to authenticate with (but you'll probably get an error that extensively explains this).
Next, your nifty DC Locator will try to query a DNS server, and as you've probably guessed - if you have no connection to your DNS servers, you won't be able to locate a DC. As for the DNS query itself: if this is the first time your station attempts the query, it will first resolve the domain name and save it in the netlogon cache for future use (to save you some time on your next logon, hopefully).
Now, the DC Locator will query the DNS server of its choice for a DC. It does so through the _msdcs zone, and it will prefer a DC in the same subnet. Once a DC has been found, the "client" will establish communication with it using LDAP (Lightweight Directory Access Protocol) in order to gain access to Active Directory. The DC identifies the site the "client" belongs to using the client's IP subnet. If the current DC isn't the optimal choice (i.e. not a DC in the closest site), it returns the name of the client's optimal site - if the client has already failed to communicate with a DC in that site, it will continue "working" with the current DC; otherwise it will query the DNS with a site-specific query. Once the client establishes a connection, it "caches" the info for future netlogon usage. In a case where the client cached a non-optimal DC entry, it flushes its cache after 15 minutes and reattempts this whole process from the top when needed.
Any further actions will include : logon, authentication, etc.
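For the curious, here's a tiny sketch (Python, with placeholder domain and site names) of the SRV record names the DC Locator asks the DNS for - the site-specific query once the site is known, and the domain-wide one as a fallback:

```python
# Model of the DNS SRV names used by the DC Locator (see the Technet
# reference below); "domain.com" and the site name are placeholders.
def locator_queries(domain, site=None):
    queries = []
    if site:  # site-specific query is preferred once the site is known
        queries.append(f"_ldap._tcp.{site}._sites.dc._msdcs.{domain}")
    queries.append(f"_ldap._tcp.dc._msdcs.{domain}")  # domain-wide fallback
    return queries

print(locator_queries("domain.com", site="SiteA"))
# ['_ldap._tcp.SiteA._sites.dc._msdcs.domain.com',
#  '_ldap._tcp.dc._msdcs.domain.com']
```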

Well, it turned out longer than I expected, so I hope it'll still simplify the whole process for you.

Good luck on your job interviews,
Dani ;)

Technical Reference : How Domain Controllers Are Located in Windows(Technet)

4/13/11

Time (or Space) is running out.. on Exchange ?!

Hello Again,

This time I'm going to talk about an issue that has happened to me a few times (to my misfortune). Hopefully, it'll help you deal with said issue in a more relaxed fashion and save you some trouble.
Picture the following scene - your favorite monitoring system alerts that space is running low on the drive that stores your Exchange transaction logs (for a specific storage group), but nobody notices, and it keeps running out.
This usually happens when your Exchange Server isn't being backed up in time.
One way to prevent this from happening, is making your monitoring system alert in a proper way - so there's no way this kind of thing can be overlooked. Another way is to schedule a backup of the exchange server more frequently (or schedule one at all - if you didn't think about it earlier).
Now, free space on your "log drive" is reaching its critical mark. Once it reaches zero available space, Exchange will automatically dismount all the stores of the storage group in question - but fear not, there is a way to deal with it.
First, you can try a manual backup of the Exchange server (with your favorite backup manager, or even ntbackup) - you might still have enough time to save the day.
Second - or should I say, if time is of the essence - you'll need to resort to extreme measures: deleting all the transaction logs of the storage group in question. "But wait, won't that affect me in some horrible way?" - Rest assured, dear reader, follow these steps and everything will be OK.
  1. Navigate to the MDBDATA folder on the drive in question. While in the folder, select the first 3-4 days of logs "on record" and copy them to a location that can contain them (i.e. another drive that has lots of free space).
  2. When you're finished copying, open a new Notepad instance and do the following - 
    • Open the Exchange System Manager
    • Navigate to the storage group in question
    • For each store, locate the .edb file location and copy it to a new line in the recently opened Notepad instance.
  3. Now, for each line in your Notepad instance, add the following in front of the line - eseutil /mh
  4. After completing these steps you are prepared to dismount all the stores of the group, but be advised: this action will temporarily disconnect all mailboxes connected to these stores (on the bright side, it would've happened anyway - if not now, then later). You can now dismount all the stores of the group in question.
  5. Once you're done dismounting, open a CMD prompt, navigate to the folder where the Exchange server is installed, copy the content of your Notepad instance and paste it into the CMD prompt. For each line that runs, you should get "Clean Shutdown" in the "Shutdown State" line. (If this is not the case, you now have a serious problem that's beyond the scope of these steps.)
  6. Navigate to the MDBDATA folder from step 1 and remove the files you backed up earlier.
  7. Now, select the rest of the files and cut them. Create a new folder and paste them into that folder.
  8. Now mount all the stores (and pray it all works as it should).
  9. Once the stores are mounted, you can delete the new folder you created earlier and the files you backed up in step 1.
  10. Done.
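The Notepad trick above can, of course, be scripted. Here's a small sketch (Python, with made-up paths) that builds the eseutil /mh lines from the list of .edb locations you collected:

```python
# Turn the .edb paths collected from Exchange System Manager into
# "eseutil /mh" command lines, ready to paste into a CMD prompt.
# The store paths below are invented examples.
def build_mh_commands(edb_paths):
    return ['eseutil /mh "%s"' % path for path in edb_paths]

stores = [
    r"D:\Exchsrvr\MDBDATA\priv1.edb",
    r"D:\Exchsrvr\MDBDATA\pub1.edb",
]
for line in build_mh_commands(stores):
    print(line)
```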
To make sure this does not happen again - do as I mentioned earlier and make sure your backup plan for the exchange server is planned right and executed as scheduled.
Hopefully, you'll never have to resort to these measures, but just in case you do, I hope this post helps you.

Best Wishes, 
Dani.

p.s.
If you have any thoughts on the subject, leave a comment and I'll be sure to reply :)

4/11/11

NetApp, W32Time and stuff between them

Hello World,

This time I'm going to talk about an issue that surfaced recently. One of our storage guys claimed that his NetApp machines weren't getting a good time-sync service from our domain - drifting away in time to the point where they could no longer cooperate with our domain due to an exceeded time skew.
He also claimed he was sure this happens because most of our DCs don't have SP2 installed.

It seemed kinda strange, getting schooled by a storage guy, and stranger still that I hadn't noticed any time-sync problems anywhere in the domain.
I've decided to look into it and found some inconsistencies in his words.
First - there is no fix related to time synchronization in the release notes of Windows Server 2003 Service Pack 2, so there's no way time sync was failing for him because of the missing SP.
Second - I found out this doesn't happen in other domains in the forest, even though conditions there are similar. Add to that the fact that, apparently, setting the clock on each NetApp machine manually is too much for one man to do (we have at least one NetApp machine in each site, and we have lots of sites).

All these things didn't add up, so we decided to apply some best practices (courtesy of NetApp), and now the storage guys are reporting no issues. I'll lay out some of them (only the general ones) for you :)

  • First and foremost - NetApp machines are site-aware. As long as their subnet is defined under the AD site configuration, they are able to locate "favored" DCs all on their own. In our case, we had a preferred DC manually defined on each machine - and it was the same one for all the machines, even if the site was connected over a low-bandwidth WAN link. We removed the manual records, seeing that we have a perfectly good site configuration.
  • When defining the "Time Authority" for your NetApp machines, be sure to specify the FQDN of your domain in order for the site integration to work properly.
  • NetApp can do time synchronization over two different protocols, one of them being NTP. The best practice for any domain would be: all the DCs syncing with the PDCE of their domain, all the PDCEs in the forest syncing via NTP with an authoritative DC in the forest, and that DC syncing (also via NTP) with some external time source (that's my personal opinion).
  • Make sure the time daemon is online on each machine (you'll find that in some cases it's switched off for no particular reason).
  • As a best practice you should - "Set the timed window for adding a random offset within 5 minutes of the actual time update/verification. This way not all the systems are talking to the time server at exactly the same time every hour." - this can save you unnecessary timeouts.
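That last bullet is easy to picture with a few lines of code - here's a sketch (Python, with my own numbers) of an hourly sync interval plus a random offset inside a 5-minute window, so systems don't all hit the time server at once:

```python
# Model of the "random offset within 5 minutes" recommendation:
# each system waits the base interval plus its own jitter.
import random

def next_sync_delay(base_interval=3600, window=300, rng=random):
    """Seconds until the next sync: the hourly interval plus jitter."""
    return base_interval + rng.uniform(0, window)

delay = next_sync_delay()
assert 3600 <= delay <= 3900  # somewhere inside the 5-minute window
```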
I think that's all for now. If I have any further conclusions, I'll be sure to post them back here.

    in hope of better time sync results, 
    Dani. :D

    Reference : Windows File Services Best Practices with NetApp Storage Systems (Downloadable technical reference from the Network Appliance website).