One of my colleagues noticed rather strange behavior with one of the Windows Server 2008 R2 servers which is part of a 7-node Hyper-V cluster built with HP BL460c G6 blade servers. This behavior showed up during host level backups of Hyper-V virtual machines.

What problems were observed?

  • The System Reserved Partition did not have a label and was offline after a reboot
  • A CSV could be offline if it was owned by this particular cluster node

These problems could be temporarily solved by assigning a drive letter. Oddly enough, Disk Manager and Diskpart did not agree with each other about whether the disk was online or offline. Diskpart showed the problematic partition as offline while Disk Manager showed it as online. Diskpart reported it as offline when the disk had no label:

Without a disk label:
[screenshot]

With a disk label:
[screenshot]

Other symptoms:

  • Host level (external) backups of Hyper-V child partitions failed (while internal backups of VMs with a backup agent would succeed)
  • During this external backup multiple VHDs are created, but apparently nothing is written to them.

Ultimately the Hyper-V VSS writer faults with a non-retryable error. However, if we move the VM to another host, the backup completes successfully.
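To spot a faulted writer quickly on each node, the output of `vssadmin list writers` can be checked from a script. The following is a minimal sketch, not something we used at the time. It assumes the usual English output layout (a `Writer name: '...'` line followed by indented `State:` and `Last error:` lines); the function names `parse_writers` and `list_writers` are made up for this example.

```python
import re
import subprocess


def parse_writers(text):
    """Map writer name -> (state, last_error) from `vssadmin list writers` text.

    Assumes each writer block starts with "Writer name: '...'" and contains
    "State:" and "Last error:" lines (English-language output).
    """
    writers = {}
    name = None
    state = None
    for raw in text.splitlines():
        line = raw.strip()
        match = re.match(r"Writer name:\s*'(.+)'", line)
        if match:
            name, state = match.group(1), None
            continue
        if name and line.startswith("State:"):
            state = line.split(":", 1)[1].strip()
        elif name and line.startswith("Last error:"):
            writers[name] = (state, line.split(":", 1)[1].strip())
            name = None
    return writers


def list_writers():
    """Run vssadmin on the local host (Windows, elevated prompt) and parse it."""
    result = subprocess.run(
        ["vssadmin", "list", "writers"], capture_output=True, text=True
    )
    return parse_writers(result.stdout)
```

On the affected host, `list_writers()` could then be compared with the result from a healthy node to confirm that only the Hyper-V writer is in a failed state.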

Multiple attempts to solve this problem were made:

  • Evict node from cluster and rejoin cluster
  • Remove backup agent (DPM 2010) and add it again
  • Remove Server Backup feature and add it again
  • Remove DPM 2010 Protection Group and recreate it
  • Backup with and without label on reserved partition

So far the only option left was to reinstall the server.

Just by coincidence I found a recent KB article called “System Partition goes offline on Windows Server 2008 and Windows Server 2008 R2 after installing some 3rd Party Disk or Storage Management Software” dated September 29, 2010: http://support.microsoft.com/kb/2419286/

This article names three issues:

Issue 1:

Hyper-V Role Cannot be Installed “Failure configuring Windows features”

Issue 2:

“Failed to prepare storage for testing on node 1 status 87” during Cluster Validation

Issue 3:

“The boot configuration data store could not be opened. The system cannot find the file specified.”

In our case it was issue 2 that troubled us.

The resolution was to online the System Reserved Partition:

Diskpart
List volume
Select volume n (n = the number of the 100 MB System Reserved Partition)
Online volume
Exit
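The Diskpart steps above can also be run unattended with `diskpart /s <scriptfile>`. The following is a minimal sketch under a few assumptions: the volume number has already been looked up with `list volume`, the script runs from an elevated prompt on the host, and the helper names `build_online_script` and `run_online_script` are invented for this example.

```python
import os
import subprocess
import tempfile


def build_online_script(volume_number):
    """Return diskpart script text that brings one volume online."""
    if volume_number < 0:
        raise ValueError("volume number must be non-negative")
    lines = [
        "select volume {}".format(volume_number),
        "online volume",
        "exit",
    ]
    return "\n".join(lines) + "\n"


def run_online_script(volume_number):
    """Write the script to a temp file and hand it to diskpart /s.

    Returns diskpart's exit code. Windows only, needs administrator rights.
    """
    script = build_online_script(volume_number)
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
        handle.write(script)
        path = handle.name
    try:
        return subprocess.run(["diskpart", "/s", path]).returncode
    finally:
        os.remove(path)
```

For example, `run_online_script(1)` would bring volume 1 online, provided volume 1 really is the 100 MB partition on that host. Check with `list volume` first, because the numbering differs per server.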

Or from Disk Management:

Diskmgmt.msc
Select the 100MB volume and Right-Click it
Change drive letter & path
Assign a drive letter

This is something we had already found out. The bad thing is that the problem returns after a reboot: the partition goes offline again.

We checked the following:

  • No 3rd party disk or storage management software is installed
  • No anti-virus software is installed on the cluster node

______________________________________________________________
Update: October 6, 2010

Talking to several people in my network we’ve come up with the following (partial and possibly full) solutions:

  1. Assign a drive letter and leave it in place across a reboot: the result was that the system partition stayed online, but the host level backup of guest partitions with DPM 2010 still failed (this idea was presented by several people, among others Kurt Roggen and an engineer at Microsoft)
  2. Run chkdsk /r on the system drive (part of the solution) and run the System Update Readiness Tool, then replace the corrupt .mum files on the host that was having issues with copies from a fully functional Hyper-V host:
    http://support.microsoft.com/kb/947821 (this tip was presented by Annur Sumar)

Unfortunately we were under pressure to get the host running again, so we just reinstalled it. That is solution #3 and, although not very efficient, one that works in almost all situations :-)

So thanks for the great feedback to all that contributed!