Hi everyone,
It's been a while since I wrote my last post, but now I'm back.
I will try to maintain this blog with a steady feed of interesting and relevant info concerning some of the products that Microsoft makes for enterprises and everything around that (Nothing new here).
This time, I would like to go into MS cluster and some issues regarding Physical Disk resources.
It all started a few days ago...
Background: I got a call from our Ops Center telling me that one of our Exchange 2003 servers (we still have a few of them around) could not fail over to the other node in a 2-node cluster.
We're talking about an MS Cluster that is hosted on 2 Windows Server 2003 SP2 x86 Nodes.
For future reference, let's call the currently operating server - Node 1 and the problematic server - Node 2.
Symptoms and Investigation: When I started investigating, I saw that even though one of the nodes was completely operational, the other one was only able to mount 5 specific Physical Disk resources out of the needed 16. Said disks are hosted on a NetApp storage system and are published as LUNs to both nodes.
First, we suspected that there might be an issue with the LUN mapping. This one turned out to be fine (too easy..), as both server nodes indicated the ability to view said LUNs.
Next, we made sure that no FC paths are broken and that there are no other issues on Node 2 that might be affecting the cluster.
When everything seemed to be fine, we, together with the Storage team guys, assumed that there might be some kind of lock/reservation on some of the disks that was preventing Node 2 from mounting them.
Turns out that in the case of such locks, it is a common practice to fully shut down all concerned machines and boot only one at a time, to make sure that it is able to mount the disks that were previously locked. This practice was backed up by a successful resolution of a previous issue that had symptoms in common with ours.
After we carefully took all cluster resources offline, we proceeded to shut down both nodes. As soon as they were both down, we restarted all related LUNs (set the LUN state to offline and back to online, just in case) and proceeded to boot up Node 2.
To our (not so big) surprise, after Node 2 came back to life it was still unable to mount 11 out of 16 Physical Disk cluster resources. At this point, I felt that there was some underlying issue that might take much longer to investigate. Seeing that I was only counting on a 10 minute maintenance window, I decided that I should go back to Node 1 and make Exchange available again before I continued exploring this issue.
We proceeded to shut down Node 2 and turn Node 1 back on. This time, to our great surprise, Node 1 was also unable to mount 11 out of 16 Physical Disk resources.
Over the course of the hours that followed, a few engineers from Microsoft Support and I were troubleshooting the issue, and here are some of the "clues" we found:
- Both servers were able to see all LUNs and enumerate them as disks even though the cluster was not able to mount them.
- Each time we tried to bring any of the problematic Physical Disk resources online, we got the following error in cluster.log (located under %systemroot%\cluster): 00000948.00000aa8::2013/09/3-17:49:49.075 INFO Physical Disk <Disk K:>: Online, returning final error 258 ResourceState 4 Valid 0
- 258 is the Win32 error code (WAIT_TIMEOUT), and ResourceState 4 is the cluster's "Failed" resource state: the attempt to bring the disk in question online timed out and the resource was marked as failed.
- We didn't get any 1034 Event IDs in the system event log. Those might suggest that there's a disk signature issue (more about this in the Tech reference at the end of this post).
- After disabling the Cluster Disk driver and turning off the Cluster service, we were able to assign drive letters to volumes on the problematic disks and then open and view all the data on said disks. Even so, the cluster still couldn't mount them.
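When 11 disks fail at once, reading cluster.log by eye gets old quickly. Here is a minimal Python sketch that pulls the resource name, error code and resource state out of lines shaped like the one quoted above. The regular expression is modelled on that single sample line only, so treat it as an assumption about the log format; the function names are my own, and the small lookup tables cover just the values relevant here.

```python
import re

# Modelled on the single sample line quoted above; other cluster.log
# entries may be shaped differently (assumption).
ONLINE_ERROR = re.compile(
    r"(?P<pid>[0-9a-fA-F]+)\.(?P<tid>[0-9a-fA-F]+)::"
    r"(?P<timestamp>\S+)\s+(?P<level>\w+)\s+"
    r"Physical Disk <(?P<resource>[^>]+)>: Online, returning final error "
    r"(?P<error>\d+)\s+ResourceState (?P<state>\d+)\s+Valid (?P<valid>\d+)"
)

# Only the values that matter for this post, not complete tables.
WIN32_ERRORS = {258: "WAIT_TIMEOUT"}
RESOURCE_STATES = {2: "Online", 3: "Offline", 4: "Failed"}


def parse_online_failure(line):
    """Return a dict describing a failed Online attempt, or None if no match."""
    match = ONLINE_ERROR.search(line)
    if match is None:
        return None
    error = int(match.group("error"))
    state = int(match.group("state"))
    return {
        "resource": match.group("resource"),
        "timestamp": match.group("timestamp"),
        "error": error,
        "error_name": WIN32_ERRORS.get(error, "unknown"),
        "state": state,
        "state_name": RESOURCE_STATES.get(state, "unknown"),
    }


def failing_resources(lines):
    """Return the sorted, de-duplicated resource names that logged a final error."""
    found = set()
    for line in lines:
        parsed = parse_online_failure(line)
        if parsed is not None:
            found.add(parsed["resource"])
    return sorted(found)
```

Feeding it the lines of cluster.log (for example via `open(path, errors="replace")`) gives a quick list of which Physical Disk resources are failing and with which code.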
Solution and the reason behind this entire mess: All of the "clues" listed above led us to believe that KB886702 describes the issue we were experiencing.
During my investigation I had already checked some of the volume attributes and come up blank (nothing wrong with them, and specifically - the hidden attribute was set to NO).
Everything so far was pointing to a storage issue. While troubleshooting all things storage related, we tried mounting a clone of a snapshot of one of the problematic disks. Before we added it as a cluster resource we checked the volume attributes on it and found that all of them were set to YES.
When we failed to mount this disk after adding it as a Physical Disk resource, we thought that we should remove all the problematic disks from the cluster and check their attributes again. A word of advice - be sure to map all disk numbers to their respective volumes' drive letters before you remove them as resources.
Turns out that all of them had their respective attributes set to YES.
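With this many volumes to check, it can help to flag the set attributes automatically. Here is a small Python sketch that assumes you have captured each volume's attribute report as plain text with one "Name : Yes/No" pair per line (diskpart's `detail volume` prints something along these lines, but the exact labels vary between Windows versions, so the sample in the usage note is illustrative and not real output). The function name is hypothetical.

```python
def attributes_set(report_text):
    """Return the attribute names whose value is 'Yes' in 'Name : Value' lines."""
    flagged = []
    for raw_line in report_text.splitlines():
        # Split on the last colon so labels that contain a colon still work.
        name, separator, value = raw_line.rpartition(":")
        if not separator:
            continue  # not a "Name : Value" line
        if value.strip().lower() == "yes":
            flagged.append(name.strip())
    return flagged
```

For example, calling it on a report where Hidden shows Yes and Readonly shows No returns a list containing only "Hidden". An empty list means the volume looks clean, but as described in this post, that result is only trustworthy once the disk is no longer a cluster resource.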
After switching the attributes back to NO on all disks (more on that later on), we added them as resources and they all were able to mount (same thing happened after turning on the other node and moving all resources there).
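Clearing attributes on many volumes by hand is error prone, so here is a Python sketch that generates a diskpart script for a list of volume numbers. It assumes your diskpart build supports the `attributes volume clear` command with the `hidden`, `readonly`, `nodefaultdriveletter` and `shadowcopy` flags; check `help attributes` inside diskpart first. The volume numbers and the function name are placeholders of mine, not what we typed during the incident.

```python
# Flags accepted by diskpart's "attributes volume" command (assumed to be
# available on your Windows build; verify before relying on this).
DEFAULT_FLAGS = ("hidden", "readonly", "nodefaultdriveletter", "shadowcopy")


def build_diskpart_script(volume_numbers, flags=DEFAULT_FLAGS):
    """Return diskpart script text that clears the given flags on each volume."""
    lines = []
    for number in volume_numbers:
        lines.append(f"select volume {int(number)}")
        for flag in flags:
            lines.append(f"attributes volume clear {flag}")
    if not lines:
        return ""
    return "\n".join(lines) + "\n"
```

Save the returned text to a file and run it with `diskpart /s <file>`. Review the generated script before running it, and only run it against disks that have already been removed from the cluster.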
So... What exactly happened? Why were the attributes OK at some point and not OK a short time after?
Well, first we should address the possible reason for this whole mess. It seems that sometimes, when a backup/restore procedure of sorts fails, it might leave volumes with the wrong "attributes". Actually, they are not wrong, but rather - incompatible with day to day tasks.
Next, let's try to understand why we were able to access the data on the volumes but weren't able to mount them through the cluster. The cluster service complies with a stricter set of rules and has to complete some tasks on the volume before it can be mounted. While the Hidden attribute (set to YES) does not interfere with us accessing the data on the volume, it does prevent the cluster from mounting said volume.
Lastly, I'll address the reason for the difference in attributes that we saw earlier. When a volume is acting as a part of a cluster, the cluster is responsible for it, hence it will not allow us to view (correctly) or change attributes of volumes under the cluster's "control". As soon as the problematic disks stopped being a part of the cluster, we were able to view the "true" state of the volume attributes and alter it accordingly.
To sum it all up, if you're having a disk related issue which requires you to check volume attributes - make sure you remove the disk as a cluster resource before you make any checks or changes. Knowing this might have saved me a lot of trouble.
Tech reference :
Leave your comments below.
Signing out,
Dani .H