7/10/11

Orphaned Sites and how to treat them

Hello again,

Today I'll be talking about time sync and what it has to do with GPO.
Recently, we've been experiencing issues with stations in certain sites not receiving GPOs from the domain.
We troubleshot this for a while and there seemed to be no problem with the stations: their logon process was correct and each station knew which DC was the closest to it (or so we thought).
As it turns out, the problem common to all the stations that had difficulties with GPOs was W32Time errors in the System event log.
We used the nltest tool to enable Netlogon debug logging (you can do so by typing the following in cmd - nltest /dbflag:0x2080ffff). Then we restarted the W32Time service and tried to sync it manually. We noticed in the log that the station knew which site it was in (and it was the right site) and that it got the DC our site topology dictated, but the station declined that DC because it was from a different site.

Right now, I'd like to take a few lines to explain the "concept" of DC-less sites.
When you look through the basics of Active Directory design, there's a basic principle that says that in most cases each site should get at least one DC (considering the user "bandwidth" of the site in question). One of the exceptions is a site that exists only to provide other "site-oriented" services (like DFS Namespaces, for example).
In cases like these, the stations in such sites are left "orphaned" and seek out a DC to "adopt" them. Microsoft anticipated this scenario and built an auto-coverage mechanism that depends on the site topology (i.e. site links), so you automatically get back the closest resource for the relevant service ("closest" meaning the lowest total site link cost). In short, you get the closest server that can provide the needed service (like a DC handling logon requests).
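
To make the "closest by site link cost" idea concrete, here is a minimal Python sketch. It is only a model of the "lowest cost wins" rule, not of how Netlogon or the DCs actually implement auto-coverage. The site names, the link costs and the function name closest_dc_site are all made up for illustration.

```python
import heapq

def closest_dc_site(site_links, dc_sites, start):
    """Return (site, total_cost) for the cheapest-to-reach site that hosts a DC.

    site_links: iterable of (site_a, site_b, cost) tuples (hypothetical data).
    dc_sites:   set of site names that contain at least one DC.
    start:      the DC-less site we are looking from.
    """
    # Build an undirected adjacency map from the site links.
    graph = {}
    for a, b, cost in site_links:
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    # Plain Dijkstra: always expand the cheapest known site first.
    best = {start: 0}
    queue = [(0, start)]
    while queue:
        cost, site = heapq.heappop(queue)
        if cost > best.get(site, float("inf")):
            continue  # stale queue entry, a cheaper path was already found
        if site in dc_sites:
            return site, cost  # first DC site popped is the cheapest one
        for neighbour, link_cost in graph.get(site, []):
            new_cost = cost + link_cost
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return None  # no site with a DC is reachable over the site links

# Hypothetical topology: site A has no DC, sites B and C do.
links = [("A", "B", 100), ("A", "C", 300), ("B", "C", 100)]
print(closest_dc_site(links, {"B", "C"}, "A"))  # ('B', 100)
```

In this made-up topology, site "A" is covered by site "B" because the A-B link is the cheapest path to any site that has a DC.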

Now that we've established a common baseline on "AD Site Auto-Coverage", I'll elaborate a little more on our situation. All the stations in question were located in DC-less sites, and so depended on the auto-coverage mechanism. As far as we could see, this (auto-coverage) worked perfectly. Netlogon logged that the station was in site "A" and got a DC from its parent site ("B"), but for some reason the station refused to accept the returned DC. After a few tries the station declared that there weren't any available domain controllers. As a result, any cached info the station held about a reliable DC to contact was being flushed on a regular basis.
Thus far, we had concluded that the station wasn't able to establish a reliable connection with a DC to get GPO information, but we didn't know why.
It turns out that the W32Time service has a parameter called CrossSiteSyncFlags, which controls the "cross-site" behavior of W32Time. In short, it defines what happens when the time service needs to rely on the site topology. The parameter has 3 modes. The first doesn't allow the time service to sync with any source outside the machine's site. The second only allows syncing with sources in the machine's own site and with the PDC. The third, which is the one we needed (and which wasn't configured), lets the machine synchronize with a source outside its site. This parameter can be configured via GPO (more about it in the Tech Reference). As soon as we set the parameter to the third "mode", there were no more time issues on the stations, and it seemed that all was well.
Sadly, after we reset the station, the same problem occurred. We started wondering whether there were any other site-related reasons that could prevent the station from getting GPOs. Just to be clear - we have a very strict firewall policy concerning inter-site connections. The short version is - no station is allowed to communicate outside the site topology, meaning that a station can only communicate with machines in sites that have a link to the station's own site. After some thinking, it was concluded (not by me) that since we're using mostly Windows Server 2003 domain controllers, we probably didn't have SiteCostedReferrals enabled.
What SiteCostedReferrals essentially means is that DFS referrals are ordered according to the site topology as defined by link costs. If you don't enable it, clients in sites that have no DFS resource available will be sent to the other available members of the DFS namespace in random order. As you all know, the sysvol and netlogon folders are essential to Group Policy processing and are accessed like a regular DFS namespace. So, in our case, stations in sites with no DC (no DFS resource) went on to access random sites that were blocked by our nice and friendly firewall. Once we enabled SiteCostedReferrals on all our DCs, the problem just sorted itself out and the stations could finally make a stable connection to the \\domain\sysvol folder. This "issue" was resolved starting with Windows Server 2008.
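
The difference between costed and non-costed referrals can be sketched in a few lines of Python. This is a simplified model, not the real DFS referral algorithm: the server names, site names and link costs are hypothetical, and where real DFS would shuffle the out-of-site targets, the sketch simply keeps them in the order given so the result is repeatable.

```python
def order_referrals(targets, client_site, link_costs, site_costing=True):
    """Order DFS targets for a client (simplified model).

    targets:    list of (server, site) tuples.
    link_costs: {(site_a, site_b): cost} for directly linked sites.
    """
    def cost(site):
        # Sites without a direct link are treated as infinitely far away.
        return link_costs.get((client_site, site),
                              link_costs.get((site, client_site), float("inf")))

    in_site = [t for t in targets if t[1] == client_site]
    others = [t for t in targets if t[1] != client_site]
    if site_costing:
        # With site costing, out-of-site targets are sorted by link cost.
        others.sort(key=lambda t: cost(t[1]))
    # Without site costing the out-of-site order is arbitrary
    # (kept as given here; assume it is random in practice).
    return in_site + others

def firewall_allows(client_site, target_site, link_costs):
    """Model of our firewall rule: only the own site or directly linked sites."""
    return (client_site == target_site
            or (client_site, target_site) in link_costs
            or (target_site, client_site) in link_costs)

# Hypothetical setup: the station sits in DC-less site A, linked only to B.
costs = {("A", "B"): 100}
sysvol_targets = [("DC3", "C"), ("DC2", "B")]

first_costed = order_referrals(sysvol_targets, "A", costs)[0]
first_plain = order_referrals(sysvol_targets, "A", costs, site_costing=False)[0]
print(first_costed, firewall_allows("A", first_costed[1], costs))  # ('DC2', 'B') True
print(first_plain, firewall_allows("A", first_plain[1], costs))    # ('DC3', 'C') False
```

With site costing, the station is pointed at the DC in the linked site first. Without it, the first target may well sit in a site the firewall blocks, which matches what we saw.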

So, a few things you need to keep in mind if you are implementing DC-less sites in your Active Directory design:
  • Make sure that the machines in your domain are accessing other resources on the domain based on the domain's site topology.
  • Make sure your time sync mechanism is functioning, and if not, look into your CrossSiteSyncFlags configuration.
  • If your DCs (or any DFS-enabled servers, for that matter) run Windows Server 2003 or lower, enable SiteCostedReferrals or bear with the fact that machines in sites with no DFS resource available will not comply with your site topology.
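
For the CrossSiteSyncFlags point in the list above, here is a rough Python model of the three modes as described earlier. The numeric values (0, 1, 2) and their names are an assumption based on my reading of the W32Time documentation, so check the Tech Reference before relying on them. The function name and the site names are made up.

```python
# Assumed mapping of CrossSiteSyncFlags values (verify against the Tech Reference).
CROSS_SITE_SYNC_FLAGS = {0: "None", 1: "PdcOnly", 2: "All"}

def may_sync_with(flag, client_site, source_site, source_is_pdc=False):
    """Model whether a time source is acceptable under a given flag value."""
    if flag not in CROSS_SITE_SYNC_FLAGS:
        raise ValueError("unknown CrossSiteSyncFlags value: %r" % (flag,))
    if source_site == client_site:
        return True  # a source in the machine's own site is fine in every mode
    if flag == 1:
        return source_is_pdc  # outside the site, only the PDC is acceptable
    return flag == 2  # 0 rejects every out-of-site source, 2 accepts them

# Station in DC-less site A, offered a DC from site B (not the PDC).
for flag in sorted(CROSS_SITE_SYNC_FLAGS):
    print(flag, CROSS_SITE_SYNC_FLAGS[flag], may_sync_with(flag, "A", "B"))
```

In a DC-less site every possible time source is outside the machine's site, so under this model only the third mode lets a station sync with the DC that auto-coverage hands it (unless that DC happens to be the PDC).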

Yours truly,
Dani .H

Technical Reference