5/12/14

TCP will keep you alive

Recently I took some time looking into TCP keep-alive and how it works in a Windows environment. Seeing that I already had a summary written on the subject, I’ve decided this is an opportunity for a blog post.
So, here we go…
Applications are often client-server based. In some cases you'll want the client to keep the connection to the server open, even if it's been idle for a while. Usually, you'll want that if making a new connection is too "costly", be it in time, performance, or any other consideration.
Connections are usually interrupted by design: servers usually have a timeout period set, after which they close an idle connection. This mechanism is in place to make sure connections are closed when there is no longer a need for them and the client didn't take the initiative to close the connection.
There are also cases when connections are interrupted by a third party for other reasons. One example of this is firewalls. Firewalls have a security mechanism in place which closes stale connections, to make sure they will not be exploited for some sort of attack.
Whatever the reason might be, if you find that you have a good enough reason to keep a connection alive, TCP keep-alive is one way to go about it.
A good explanation of TCP keep-alive can be found here: http://msdn.microsoft.com/en-us/library/aa925764.aspx
How TCP Keep-alive works –
First and foremost, TCP Keep-alive is not enabled by default. In order for it to be enabled, you have to enable it in the application layer, meaning you need an application to access this feature.
You can do this by one of the following:
  • setsockopt() with the SO_KEEPALIVE option
  • WSAIoctl() with the SIO_KEEPALIVE_VALS option
Or, if you're using a .NET application:
  • The SetSocketOption method of the Socket class in the System.Net.Sockets namespace
  • The GetSocketOption method of the Socket class in the System.Net.Sockets namespace
Otherwise, TCP Keep-alive will not be used.
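To make this concrete, here's a minimal sketch in Python (the calls map directly onto the Winsock options above; the idle/interval numbers are arbitrary example values, not recommendations):

```python
import socket

# Create a TCP socket and opt in to keep-alive -- the equivalent of
# calling setsockopt() with SO_KEEPALIVE in Winsock.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# On Windows, Python also exposes the WSAIoctl()/SIO_KEEPALIVE_VALS path,
# which sets per-connection timers: (onoff, idle time in ms, interval in ms).
if hasattr(socket, "SIO_KEEPALIVE_VALS"):
    sock.ioctl(socket.SIO_KEEPALIVE_VALS, (1, 300000, 1000))

# Confirm the option took effect (non-zero means enabled).
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
```

On non-Windows platforms the same per-connection tuning is done with the TCP_KEEPIDLE/TCP_KEEPINTVL options at the IPPROTO_TCP level, where available.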
In Windows (2008/Vista and above), TCP keep-alive has 2 registry values that influence its behavior:
1.      HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\KeepAliveTime
This value is a DWORD measured in milliseconds. It controls how often a TCP connection attempts to verify that an idle connection is still intact. It does so by sending a keep-alive packet.
2.      HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\KeepAliveInterval
This value is also a DWORD measured in milliseconds. It controls how long to wait for a response to the keep-alive packet before sending a retransmission. Once a response is received, keep-alive intervals go back to what was defined in the KeepAliveTime value. The connection will be aborted after 10 failed retransmissions (this number is hard-coded and can't be modified).
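For reference, a .reg fragment setting both values would look something like this. The numbers below are just the OS defaults (2 hours and 1 second) written out explicitly; treat them as placeholders for whatever your scenario calls for:

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters]
; 7,200,000 ms (2 hours) of idle time before the first keep-alive packet
"KeepAliveTime"=dword:006ddd00
; 1,000 ms to wait for a response before retransmitting
"KeepAliveInterval"=dword:000003e8
```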
To add to all of this, it's important to underline that a TCP keep-alive packet contains null data, so its impact on network traffic is minimal and its bandwidth usage is negligible.
If your application supports TCP keep-alive, it will use it once the values above are configured. To make sure that this is the case, contact the vendor/owner of the application in question. Otherwise, you'll need to modify the source code in order to make use of TCP keep-alive.
Until next time,
Dani .H 

3/20/14

Presentation is everything

Hi everyone,

I've made a few presentations recently I'd like to bring to your attention.

First up is a presentation I made at a developers' conference that covers 2 new features in Windows Server 2012. The driving idea behind it was to highlight "sexy" features in Windows Server 2012 and maybe make the developers want to upgrade from whatever they're using right now.

The second one is a short presentation I made covering the basics of Kerberos and ways to troubleshoot Kerberos issues. If you are familiar with Kerberos on any level, this should make things a bit clearer for you.

You can find them right here: http://1drv.ms/1eWdsde

Would love to hear your thoughts about them.

That's enough presenting for now.
Signing out,
Dani .H

9/5/13

Error 258 and the tale of the Hidden attribute

Hi everyone,

It's been a while since I've written my last post, but now I'm back.

I will try to maintain this blog with a steady feed of interesting and relevant info concerning some of the products that Microsoft makes for enterprises and everything around that (Nothing new here).

This time, I would like to go into Microsoft Cluster and some issues regarding Physical Disk resources.
It all started a few days ago...

Background: Got a call from our Ops Center telling me that one of our Exchange 2003 servers (we have a few of them around) could not fail over to the other node in a 2-node cluster.
We're talking about an MS Cluster that is hosted on 2 Windows Server 2003 SP2 x86 Nodes.
For future reference, let's call the currently operating server - Node 1 and the problematic server - Node 2.

Symptoms and Investigation: When I got to investigating the issue, I saw that even though one of the nodes was completely operational, the other one was only able to mount 5 specific Physical Disk resources out of the needed 16. Said disks are hosted on NetApp storage and are published as LUNs to both nodes.
First, we suspected that there might be an issue with the LUN mapping. This one turned out to be fine (too easy..), as both nodes were able to see said LUNs.
Next, we made sure that no FC paths are broken and that there are no other issues on Node 2 that might be affecting the cluster.
When everything seemed to be fine, we, together with the storage team, assumed that there might be some kind of lock/reservation on some of the disks, preventing Node 2 from mounting them.
Turns out that in the case of such locks, it is common practice to fully shut down all concerned machines and boot them one at a time, to make sure each is able to mount the disks that were previously locked. This practice was backed up by the successful resolution of a previous issue that had symptoms in common with ours.
After we carefully switched off all cluster resources, we proceeded to shut down both nodes. As soon as they were both down, we restarted all related LUNs (set each LUN offline and back online, just in case) and proceeded to boot up Node 2.
To our (not so big) surprise, after Node 2 came back to life it was still unable to mount 11 out of 16 Physical Disk cluster resources. At this point, I felt there was some underlying issue that might require a much longer investigation. Seeing that I was only counting on a 10-minute maintenance window, I decided to go back to Node 1 and make Exchange available again before continuing to explore this issue.
We proceeded to shut down Node 2 and turn Node 1 back on. This time, to our great surprise, Node 1 was also unable to mount 11 out of 16 Physical Disk resources.
Over the course of the hours that followed, a few engineers from Microsoft Support and I troubleshot the issue, and here are some of the clues we found:
  • Both servers were able to see all LUNs and enumerate them as disks even though the cluster was not able to mount them.
  • Each time we tried to turn any of the problematic Physical Disk resources online we got the following error in cluster.log (located under %systemroot%\cluster) : 00000948.00000aa8::2013/09/3-17:49:49.075 INFO Physical Disk <Disk K:>: Online, returning final error 258 ResourceState 4 Valid 0
  • 258 is the error code, and the 4 indicates that the cluster tried 4 times and failed to mount the disk in question.
  • We didn't get any 1034 Event IDs in the system event log. Those might suggest there's a disk signature issue (more about this in the Tech reference at the end of this post).
  • After disabling the Cluster Disk driver and turning the cluster service off, we were able to add letters to the volumes on the problematic disks, and after that were able to open and view all the data on said disks. Even so, the cluster still couldn't mount them.

Solution and the reason behind this entire mess: All of the clues listed above led us to believe that KB886702 was describing the issue we were experiencing.
During my investigation process I had already checked some of the volume attributes and come up blank (nothing wrong with them; specifically, the Hidden attribute was set to NO).
All things so far were pointing to a storage issue. While troubleshooting all things storage related, we tried mounting a clone of a snapshot of one of the problematic disks. Before we added it as a cluster resource, we checked the volume attributes on it and found out that all of them were set to YES.
When we failed to mount this disk after adding it as a Physical Disk resource, we thought we should remove all the problematic disks from the cluster and check their attributes again. A word of advice - be sure to map all disk numbers to their respective volumes' drive letters before you remove them as resources.
Turns out that all of them had their respective attributes set to YES.
After switching the attributes back to NO on all disks (more on that later on), we added them as resources and they were all able to mount (the same thing happened after turning on the other node and moving all resources there).
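For what it's worth, checking and clearing the attribute can be sketched with diskpart. The volume number below is a placeholder, and the exact syntax varies by Windows version, so verify it on your OS before touching production disks:

```
DISKPART> select volume 3
DISKPART> attributes volume
DISKPART> attributes volume clear hidden
```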

So... What exactly happened? Why were the attributes OK at some point and not OK a short time after?
Well, first we should address the possible reason for this whole mess. It seems that sometimes, when a backup/restore procedure of sorts fails, it might leave volumes with the wrong "attributes". Actually, they are not wrong, but rather - incompatible with day to day tasks. 
Next, let's try and understand why we were able to access the data on the volumes but weren't able to mount them through the cluster. The cluster service complies with a stricter set of rules and has to complete some tasks on the volume before it can be mounted. While the Hidden attribute (set to YES) does not interfere with us accessing the data on the volume, it does prevent the cluster from mounting said volume.
Lastly, I'll address the reason for the difference in attributes that we saw earlier. When a volume is acting as a part of a cluster, the cluster is responsible for it, hence it will not allow us to view (correctly) or change attributes of volumes under the cluster's "control". As soon as the problematic disks stopped being a part of the cluster, we were able to view the "true" state of the volume attributes and alter it accordingly.

To sum it all up, if you're having a disk-related issue which requires you to check volume attributes - make sure you remove the disk as a cluster resource before you make any checks or changes. Knowing this might have saved me a lot of trouble.

Tech reference

Leave your comments below.

Signing out,
Dani .H

7/10/11

Orphaned Sites and how to treat them

Hello again,

Today I'll be talking about time sync and what it has to do with GPO.
Recently, we've been experiencing some issues with stations in certain sites not receiving GPO from the domain.
We'd been troubleshooting this issue for a while, and there seemed to be no problem with our stations: their logon process was correct, and each station knew which DC was the closest to it (or so we thought).
As it turns out, the common problem among all the stations that were experiencing difficulties with GPO was W32Time errors in the System event log.
We used the nltest tool to enable netlogon debugging (you can do so by typing the following in cmd - nltest /dbflag:0x2080ffff). Then, we restarted the W32Time service and tried to sync it manually. We noticed in the log that the station knew it was in a certain site (and it was the right site) and got the DC corresponding to our site topology, but the station declined that DC because it was from a different site.

Right now, I'd like to take a few lines to understand the "concept" of DC-less sites.
When you look through the basics of Active Directory design, there's a basic concept that says that in most cases each site should get at least one DC (considering the user "bandwidth" of the site in question). One of the exceptions to this is sites that exist only to provide other "site-oriented" services (like DFS namespaces, for example).
In cases such as those, stations in certain sites are left "orphaned" and seek out a DC to "adopt" them. Microsoft predicted this scenario and built an auto-coverage mechanism that depends on the site topology (i.e. site links), thus providing you with an automatic mechanism that returns the closest resource of the relevant service ("closest" being a matter of minimal site-link cost). In short, you get the closest server that can provide the needed service (like a DC for logon requests).

Now that we've established a common baseline on "AD site auto-coverage", I'll elaborate a little bit more on our situation. All the stations in question were located in DC-less sites, thus depending on the auto-coverage mechanism. As far as we could see, this (auto-coverage) worked perfectly. Netlogon logged that the station was in site "A" and got a DC from its parent site ("B"), but for some reason the station refused to accept the returned DC. After a few tries, the station declared that there weren't any available Domain Controllers. As a result, any cached info the station had concerning a reliable DC to contact was being flushed on a regular basis.
Thus far, we had concluded that the station wasn't able to establish a reliable connection with a DC to get GPO information, but we didn't know why.
Turns out that the W32Time service has a parameter called CrossSiteSyncFlags, which is responsible for the "cross site" behavior of W32Time. In short, it defines what's to be done in cases where W32Time needs to rely on the site topology. This parameter has 3 modes: the first doesn't allow the time service to sync with any source outside the machine's site; the second only allows syncing with sources in the site the machine is located in and with the PDC; the option we needed (and which wasn't configured) was the third one - the machine can synchronize with a source outside its site. This parameter can be configured via GPO (more about it in the Technical Reference). As soon as we configured the parameter to the third "mode", there were no more time issues on the stations, and it seemed that all was well.
Sadly, after we restarted the station, the same problem occurred. We started wondering whether there were any other site-related reasons that could prevent the station from getting GPO. Just to be clear - we have a very strict firewall policy concerning inter-site connections. The short version is: no station is allowed to communicate outside the site topology, meaning that a station can only communicate with machines in sites that have a link to the station's own site. After some thinking, it was concluded (not by me) that since we're using mostly Windows 2003 domain controllers, we probably don't have SiteCostedReferrals enabled.
What SiteCostedReferrals essentially means is that the DFS namespace mechanism will work according to the site topology as defined by link costs. If you do not enable this, then sites that don't have a DFS resource available will access other available members of the DFS namespace randomly. As you all know, the SYSVOL and NETLOGON folders are essential to the group policy process and are accessed like a regular DFS namespace. So, in our case, stations in sites with no DC (no DFS resource) went on to access random sites, and those accesses were blocked by our nice and friendly firewall. When we enabled SiteCostedReferrals on all our DCs, the problem just sorted itself out, and the stations could finally make a stable connection with the \\domain\sysvol folder. This "issue" was resolved starting with Windows Server 2008.
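For reference, here's a sketch of the two registry values involved. The paths and numbers are written from memory, so double-check them against the Technical Reference before applying anything:

```
Windows Registry Editor Version 5.00

; 2 = "All" - allow W32Time to sync with a time source outside the machine's site
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\Config]
"CrossSiteSyncFlags"=dword:00000002

; Make DFS referrals (SYSVOL included) follow site-link costs on Windows 2003 servers
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Dfs]
"SiteCostedReferrals"=dword:00000001
```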

So, a few things you need to keep in mind if you are implementing DC-less sites in you Active Directory design : 
  • Make sure that the machines in your domain are accessing other resources on the domain based on the domain's site topology.
  • Make sure your time sync mechanism is functioning, and if not, look into your CrossSiteSyncFlags configuration.
  • If your DCs (or any DFS-enabled servers, for that matter) run Windows Server 2003 or lower, enable SiteCostedReferrals, or live with the fact that machines in sites with no DFS resource available will not comply with your site topology.

Yours truly,
Dani .H

Technical Reference

6/27/11

A LDAP in VBScript's Clothing

Hello again,

It's been a while since my last post. Haven't been inspired much lately.
This post will be a short one, and hopefully the next one will be longer and will interest you more - I'm planning on doing a piece regarding DFS.

So, we had a situation today that required us to change a specific user attribute for all users in a specific group.
There are a couple of ways to deal with this situation. If this is a so-called "mainstream" attribute, you can use the infamous dsmod cmd tool - it allows you to change certain attributes, and if you combine it with dsquery and dsget it will come through for you. Sadly, not all cases are this simple, sometimes forcing you to utilize a more advanced set of solutions.
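For the mainstream case, the combination looks something like this (group name and attribute are hypothetical; dsget prints the member DNs and dsmod reads them from stdin):

```
dsquery group -name "SomeGroup" | dsget group -members | dsmod user -dept "NewDepartment"
```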
The first thing that comes to mind is PowerShell. If you wish to use PowerShell, you can do so using the System.DirectoryServices library. Another way, which is a little less complex, is using the QAD (Quest Active Directory) add-on.
There's also a handy tool called ADModify, enabling you to accomplish a vast variety of LDAP actions.

To my great disappointment, I had no Internet access and no machines with PS installed on them in the domain in question, so I did what any of you would do (I imagine) - wrote a VBScript. 

The common way to connect to your AD with VBScript is by creating an object that has a DN (Distinguished Name) string like so : 
dim objADUser
set objADUser = GetObject("LDAP://CN=username,OU=someou,DC=domain,DC=com")

In case you want to change a certain attribute of the user object, it is easily done in the following way:
objADUser.Put "AttributeName", "NewValueForAttribute"
objADUser.SetInfo
(Put only updates the local property cache; SetInfo commits the change back to Active Directory.)

In my case, I needed to get all users from a certain group, which is done by writing:
dim objADGroup
set objADGroup = GetObject("LDAP://CN=groupname,OU=someou,DC=domain,DC=com")
objADGroup.GetInfo
(so far we've set up the group for exploration)
arrMembers = objADGroup.GetEx("member")
(this gets the group's membership and puts it into the arrMembers array)

After doing all this, we just need to combine it into a simple script, like so:
dim objADGroup
dim arrMembers
dim strMember
dim objADUser
set objADGroup = GetObject("LDAP://CN=groupname,OU=someou,DC=domain,DC=com")
objADGroup.GetInfo
arrMembers = objADGroup.GetEx("member")
For Each strMember In arrMembers
    set objADUser = GetObject("LDAP://" & strMember)
    objADUser.Put "AttributeName", "NewValueForAttribute"
    objADUser.SetInfo
Next


I suppose this can be done in a more elegant and efficient way, but these few lines do the job well.
You're welcome to use them if you wish.

There are also other tools out there to modify AD objects, but seeing that making a script is so simple, there's no reason to use some third-party tools.

Hope you've learnt something from this post.
Until next time,
Dani .H

Technical Reference:
Great tutorials about VBScripting for AD - Users, Groups

5/5/11

Your Authentication is only as good as your last packet

Hello again,

It seems a bit unfortunate that the times when I'm really inspired to write are the times when everything (or almost everything) comes crashing down.

So, in short, today I'll write a little about Kerberos authentication and how it's affected by the smallest things.

Just a while ago, one of our clients called with a problem - their web server was working awfully slowly, to the point where certain actions got a timeout after about 20 minutes of being stuck on a POST request.
In order to clarify, both servers are running on Win 2003 Std x86 with IIS 6. The application in question was using Kerberos Authentication in order to confirm credentials on certain actions.

The first thing we noticed was that when we switched IIS to anonymous authentication the request didn't hang (though the action itself didn't perform as needed because it had no credentials), and we had a page returned almost immediately.
Second thing we noticed was that one of the servers took a really long time to get through the login process.

So the immediate thing to do was to run Authentication Diagnostics for IIS by Microsoft (Download x86 Version Here) - it's a pretty handy utility to troubleshoot with. We ran some diagnostics and everything seemed OK, except for a failure saying that some of the users running application services don't have an SPN available in Active Directory, which seemed strange because, as far as I recall, SPNs are usually reserved for computer objects.
The next thing I decided to do was to iisreset both of the web servers and see if resetting the app pools together did anything of use. After the iisreset on one of the servers, everything seemed to work fine. I was glad to see that at least I got stuff working again, and quickly switched to the second server to check it, but it was still doing its iisreset so I had to wait. Once the iisreset was done on the second server, I checked it and saw that it still didn't work, but when I went back to the "working one" it had reverted to not working again.
For some reason I had a feeling that the last success was connected to the fact that the second server was down, so I decided to stop the web service (W3SVC) on the second server, and guess what - the "working one" was working again. This seemed to be in tune with the fact that the server whose web service was stopped was the same server that took forever to log in.
After trying to troubleshoot with Event Viewer on both the web servers and the DCs they were trying to contact, and coming up with nothing in particular, one of our network architects told us to run a network sniffer on both the web server and the DC and see what's going on - basically a great idea, seeing that we had no leads whatsoever on the problem, except the fact that it had something to do with authentication.
After some digging with the sniffer, we noticed that all the KRB requests were being made over UDP. A short comparison with the working server showed that it was using TCP for KRB requests. I already knew that there's a way to force my server to use TCP for Kerberos authentication, but I still didn't know what the problem was with using UDP for this process.
I went online and stumbled upon KB 244474 (by Microsoft, of course), which explains that up to Windows 2003 the default protocol for Kerberos authentication is UDP, meaning UDP will be tried first and TCP second (as far as I can tell). The main issue with UDP, according to the KB, is that KRB authentication over UDP is only supported up to a certain packet size, and also, if a fragmented packet is received by the DC - it will be dropped.
Knowing that, we went back to the sniffer and noticed that the web server was sending requests in the following fashion: first, a fragmented packet on the verge of the size limit was sent, with a smaller packet following it. The sniffer on the DC logged that only the smaller packet was being received. From this I gathered the following: the web server was trying to send packets larger than the limit set for UDP, so the packets were fragmented and sent according to the size limit, but the DC was dropping them on the basis of them being fragmented. This seemed to be the issue causing the hangs we'd been having. Next, I figured that this process was hanging rather than failing quickly because the DC was only receiving a part of every request and thus never completing the negotiation. This was gathered mostly by relying on common sense and past experience; if I get it confirmed, I'll be sure to post back.
So, once we forced the problematic server to use TCP for KRB requests, the long login time was solved and so was the original problem.
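For completeness, the way to force TCP is the registry value described in KB 244474 (shown here as a sketch; a reboot or service restart may be needed for it to take effect):

```
Windows Registry Editor Version 5.00

; Any Kerberos message larger than 1 byte will be sent over TCP instead of UDP
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa\Kerberos\Parameters]
"MaxPacketSize"=dword:00000001
```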
This is what I think happened to cause the problem in the first place - 
Both web servers are separate, but they use the same back-end servers. I guess that in order to perform actions involving the back-end servers, they had to authenticate with a DC. The problematic server was trying to access the back-end servers, and once it established a transaction with them, it went on to authenticate with a DC - a process it could not accomplish in a timely manner. At the same time, the second server was prevented from working with the back-end servers, because they already had an open transaction with another server. This seems to me like the most probable reason for this whole mess.
I've yet to confirm all the details, but I'm pretty sure I got most of it right (if not all of it), and solving the problem seems like enough evidence on that account.

My thoughts on the subject
  1. Starting with Windows 2008, the default protocol for kerberos authentication is TCP. I can't see why Microsoft didn't release a Hotfix to make Windows 2003 Servers act the same way.
  2. When working in a Front to Back end configuration using AD credentials, be sure to open a transaction after the credentials were verified by a DC.
  3. IIS Authentication Diagnostics should have tested for UDP authentication specifically, knowing that this is the default protocol in Windows 2003.
  4. Don't ignore "minor" problems while facing larger issues. These so called "minor" issues could very well be your lead on the bigger problem.
Anyway, now everything seems to be fine. If I gather any more info on the subject, I'll revisit the issue in future posts.

TCP is Power,
Dani.

4/15/11

When stations play Hide and Seek with DCs

Hey guys,

I hope to keep this post short, because it's not really something super complex. 
BTW, I'll be on vacation this week, so don't expect anything out of the ordinary - muse usually comes to me at work :)

Today I'd like to talk about how a station chooses (or rather locates) a DC to communicate with. It's been on my mind this week, because I've had an opportunity to be a part of a technical job interview and the guy we interviewed didn't seem to know how this process works. I'd like to dedicate this post to him. 

To start off, I just want to point out that Microsoft probably has a more thorough explanation on their technet library, so I'm just going to simplify it a bit for those who want to complete a successful job interview and don't have the tolerance to read a lot of complex (and sometimes - good for nothing) technical terms.
So, first of all, the name of the process that commences the "search" is "DC Locator" (no surprises here I hope..). The DC locator works his "charm" through the netlogon service, which means that if netlogon is down locally, you won't be able to find a DC to authenticate with (but you'll probably get an error that extensively explains this). 
Next, your nifty DC locator will try to query a DNS server, and as you've probably guessed -if you don't have a connection with the DNS servers, you won't be able to locate a DC. So, as to the DNS query, if this is the first time your station attempts to query the DNS, it will first query the domain name and save it in the netlogon cache for future use (to save you some time on your net logon, hopefully). 
Now, the DC locator will query the DNS server of his choice for a dc, it will do so through the msdcs zone and it'll prefer a dc in the same subnet. Once a DC has been found, the "client" will establish communication with it using LDAP(Lightweight Directory Access Protocol), that is, to gain access to Active Directory. The DC will identify the site which the said "client" belongs to using client's IP subnet. If the current DC isn't the optimal choice (i.e. not a DC in the closest site), it will return the name of the client's optimal site - in case the client has already failed to communicate with a DC on that site, it will continue "working" with the current DC, else it will query the DNS with a site-specific query. Once the client establishes connection, it will "cache" the info for netlogon future usage. In a case where the client cached a non-optimal DC entry, it will flush its cache in 15 minutes and will reattempt this whole process from the top when needed.
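If you ever want to watch this process in action, nltest can ask the DC Locator directly (domain.com is a placeholder for your domain's name):

```
nltest /dsgetdc:domain.com
nltest /dsgetdc:domain.com /force
```

The first call returns the DC the locator picked (name, address, site); /force tells it to ignore the netlogon cache and rerun the discovery from scratch.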
Any further actions will include: logon, authentication, etc.

Well, it turned out longer than I expected, so I hope it'll still simplify the whole process for you.

Good luck on your job interviews,
Dani ;)

Technical Reference : How Domain Controllers Are Located in Windows(Technet)