Keeping your Virtual Active Directory Domain Controllers Safe

In my last blog post I sent out a red alert on a killer Windows Update that had not been sufficiently tested. The net result was a full crash of a two-node System Center fabric management cluster. The fabric was still in the making and backups were only provisionally taken in the form of Virtual Machine exports of the most important virtual machines. As fellow Hyper-V MVP Aidan Finn wrote unambiguously: “Something Has Gone Very Wrong With Microsoft Patch Testing”.

Where did it go wrong?

I was actually demonstrating the fantastic Cluster Aware Updating functionality in Windows Server 2012 clusters, which would automatically move all VMs off a host, update it, reboot it, live migrate the VMs back to the updated cluster node and move on to the next.
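
To make that rolling sequence concrete, here is a toy simulation in Python. It is only a sketch of the drain, update, fail-back order described above, not how Cluster Aware Updating is implemented; the function name `rolling_update` and the node and VM names are made up for illustration.

```python
def rolling_update(cluster):
    """Simulate one drain/update/fail-back pass over every node.

    `cluster` maps a (hypothetical) node name to the list of VMs it hosts.
    The dict is modified in place; the ordered action log is returned.
    """
    log = []
    nodes = list(cluster)
    for node in nodes:
        others = [n for n in nodes if n != node]
        if not others:
            raise ValueError("need at least two nodes to drain a host")
        moved = list(cluster[node])
        # Drain: live migrate every VM to another node (round-robin).
        for i, vm in enumerate(moved):
            target = others[i % len(others)]
            cluster[node].remove(vm)
            cluster[target].append(vm)
            log.append(f"migrate {vm}: {node} -> {target}")
        log.append(f"update and reboot {node}")
        # Fail back: return the VMs that originally ran on this node.
        for vm in moved:
            for holder in others:
                if vm in cluster[holder]:
                    cluster[holder].remove(vm)
                    break
            cluster[node].append(vm)
            log.append(f"migrate {vm}: back to {node}")
    return log


if __name__ == "__main__":
    fabric = {"node1": ["dc1", "vmm1"], "node2": ["dc2"]}
    for line in rolling_update(fabric):
        print(line)
```

Every step in this loop depends on live migration, which is why a fault in the migration path can reach every node in turn.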

The problematic July Update Rollup KB2855336, which was one of the updates to be processed, is actually a collection of fixes for what were originally 20 issues in several areas. A still unidentified part of that rollup caused a 0x000000D1 Stop error while live migrating a VM on a Windows Server 2012-based server. Because Cluster Aware Updating uses the Live Migration mechanism to place a host in maintenance mode, the combination with this update sent shockwaves through the cluster. In this case both cluster nodes crashed within minutes.

Catch 22

Ironically enough, this same July Update Rollup also contained an important fix for a problem that has been around for some time: Active Directory database becomes corrupted when a Windows Server 2012-based Hyper-V host server crashes (KB2853952).

Symptoms

Assume that you have a Windows Server 2012-based virtualized domain controller on a Windows Server 2012-based Hyper-V host server. When the Hyper-V host server crashes or encounters a power outage, the Active Directory database may become corrupted.

Cause

This issue occurs because the guest system requests the Hyper-V server to turn off disk caching on a disk. However, the Hyper-V server misinterprets the request and keeps disk caching enabled.

If you try to disable write caching manually, you will see this error: “Windows could not change the write-caching setting for the device. Your device might not support this feature or changing the setting.” On a physical domain controller this has never been a problem.

CORRUPTED DOMAIN CONTROLLER

It takes very little imagination to guess what happened to Active Directory if you combine the full STOP of the fabric management cluster and the AD domain controllers that were virtualized on that same Hyper-V cluster without the required updates and hotfixes.

I ended up filing a case with Microsoft Support, and although it was just before last weekend, a really knowledgeable AD support specialist got in touch and helped me identify the damage that was done. The DC running all the FSMO roles, including the Schema Master, was on a VM that was totally broken. PowerShell, Server Manager and even the Event Viewer could no longer be started, although fortunately we could still open a command prompt. Remoting into the machine was also no longer possible.

In short, we took exports of both Active Directory domain controller VMs, rebooted, got SYSVOL replication working again (which took at least two hours to repair) and moved the Schema Master and the other FSMO roles to a freshly installed DC. We then tried to demote the corrupted DC, which was not possible in the GUI or in interactive mode. Because the command prompt still worked, I was able to run:

dcpromo /unattend /username:<domain admin> /userdomain:<domain> /password:<Domain Admin password> /administratorpassword:<local admin password>

And this was only the domain controller. Meanwhile, a Cluster Shared Volume disk had also been damaged during the crash and required a Chkdsk run.

It turned out that one VM’s virtual hard disk was corrupt beyond repair. Several other VMs had to be reinstalled because they showed unexpected behavior. At the end of the day I realized I had seen the largest corruption in my entire 30-year career in the IT industry.

Analysis

A full STOP of a cluster node, just like a power outage that is not covered by a UPS or no-break system, can damage your files and databases when data in the write cache is unprotected. When a virtual domain controller starts, it requests that the disk cache be disabled on the virtual disk controller (IDE or SCSI). On a physical domain controller this is handled correctly, but the virtual domain controller would report success although in reality it could not disable the cache. As a result, AD at the JET database layer did not request write-through I/O (writing directly to the disk without using the disk cache), because it assumed the disk cache had already been disabled. This is the situation that could lead to corruption if the physical Hyper-V host unexpectedly rebooted.

KB2853952 fixes this situation so that the OS correctly reports that the disk cache cannot be disabled. The JET database, now properly informed, requests that I/O be written directly to disk, which is then correctly handled by the Hyper-V storage filter driver in the guest.
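
The decision logic behind the bug and the fix can be reduced to a small truth table. The Python sketch below is only a model of the behavior as described in this post; the function names are hypothetical and it is not code from Active Directory, JET or Hyper-V.

```python
def needs_write_through(disable_cache_reported_ok):
    """The database only asks for write-through I/O when it has been told
    that the disk cache could NOT be disabled."""
    return not disable_cache_reported_ok


def data_at_risk(cache_really_disabled, disable_cache_reported_ok):
    """Writes are unprotected when a cache is still active and nothing
    bypasses it."""
    write_through = needs_write_through(disable_cache_reported_ok)
    return (not cache_really_disabled) and (not write_through)


if __name__ == "__main__":
    # Physical DC: the cache really is disabled and success is reported.
    print("physical DC at risk:", data_at_risk(True, True))
    # Virtual DC before the fix: success is reported, cache stays enabled.
    print("virtual DC before fix at risk:", data_at_risk(False, True))
    # Virtual DC after the fix: the failure is reported honestly.
    print("virtual DC after fix at risk:", data_at_risk(False, False))
```

In this model the only dangerous combination is the one where the guest is told the cache was disabled while it actually was not.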

To enhance recoverability of a file or file system, applications such as Microsoft SQL Server and Microsoft JET can specify the FILE_FLAG_WRITE_THROUGH flag to instruct the system to write through any intermediate cache and go directly to disk. The system can still cache write operations, but cannot lazily flush them.
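
To give an idea of what this means for an application, here is a minimal Python sketch. Note the substitution: Python's `os` module does not, as far as I know, expose FILE_FLAG_WRITE_THROUGH directly, so this example uses an explicit flush (`os.fsync`) after each write instead, which aims at the same goal of not leaving data in a lazily flushed cache. The helper name `durable_append` is made up for this illustration.

```python
import os


def durable_append(path, record):
    """Append `record` (bytes) to `path` and flush it to disk before returning.

    This is a flush-after-write approximation, not true write-through I/O.
    """
    # O_BINARY only exists on Windows; fall back to 0 elsewhere.
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | getattr(os, "O_BINARY", 0)
    fd = os.open(path, flags)
    try:
        view = memoryview(record)
        while view:
            written = os.write(fd, view)  # may write fewer bytes than asked
            view = view[written:]
        os.fsync(fd)  # ask the OS to push the file's data to the device
    finally:
        os.close(fd)
```

Whether the data then really reaches stable media still depends on every layer below honoring the request, which is exactly where the virtual domain controller went wrong.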

If a system failure occurs, NTFS has enough information in the log to complete or abort any partial NTFS transaction. During recovery operations, NTFS redoes each committed transaction found in the log file. Then NTFS locates in the log file the transactions that were not committed at the time of the system failure and undoes each metadata operation recorded in the log file. Because NTFS flushes the log to disk before any metadata changes are written to disk, NTFS has complete information available about any metadata changes that need to be rolled back during recovery.
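
The redo and undo passes can be illustrated with a toy write-ahead log. This is a simplified sketch of the general technique, not the NTFS on-disk log format; the record layout and the `recover` function are assumptions made for the example.

```python
def recover(disk, log):
    """Replay a toy write-ahead log against `disk` (a dict) after a crash.

    Log records are tuples (hypothetical layout):
      ("update", txid, key, old_value, new_value)
      ("commit", txid)
    An old_value of None means the key did not exist before the update.
    """
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    # Redo pass: reapply every update of a committed transaction, oldest first.
    for rec in log:
        if rec[0] == "update" and rec[1] in committed:
            _, _, key, _, new = rec
            disk[key] = new
    # Undo pass: roll back updates of uncommitted transactions, newest first.
    for rec in reversed(log):
        if rec[0] == "update" and rec[1] not in committed:
            _, _, key, old, _ = rec
            if old is None:
                disk.pop(key, None)
            else:
                disk[key] = old
    return disk


if __name__ == "__main__":
    # Transaction 1 committed but its change never reached the disk;
    # transaction 2 wrote to the disk but never committed.
    disk = {"a": 0, "b": 9}
    log = [("update", 1, "a", 0, 1), ("commit", 1), ("update", 2, "b", None, 9)]
    print(recover(disk, log))
```

The scheme only works if the log itself reaches the disk before the changes it describes, which is why a cache that silently stays enabled is so dangerous.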

I can also refer you to fellow Hyper-V MVP Didier van Hoye who wrote a blog on NTFS and the flush command:
http://workinghardinit.wordpress.com/tag/forced-unit-access/

Case closed, you might think?

However, even after applying the re-issued July Update Rollup KB2855336, which includes KB2853952, we still see an event in the Directory Service event log that could seem alarming to users:

Active Directory Domain Services could not disable the software-based write cache on the following hard disk. Hard disk: c:  Data might be lost during system failures.

GAINING BACK THE TRUST

As we speak, the KB2853952 article describing the problem is being rewritten to better describe what has been corrected. My strong suggestion to the Microsoft product teams is to not only explain this in the KB article, but also to write an Informational event to the event logs of both the Hyper-V server and the virtual AD domain controller. If the VM can confirm that the DC is running on a supported hypervisor, it should not be all that hard to also confirm that the Hyper-V storage filter has been notified by the hypervisor that write caching has been disabled.

As the English saying goes, “the proof is in the pudding”. We’ll have to see if virtual domain controllers are now fully protected against a bug check or power failure. Although I have been virtualizing all AD domain controllers since we started rolling out Windows Server 2012 Hyper-V for customers, for the current project we have taken these measures:

  1. We’ve added one physical domain controller holding the Schema Master and all other FSMO roles.
  2. We converted our dynamic VHDX virtual disks to fixed size because we want to rule out the risk of running out of disk space.
  3. We’ve installed KB2853952, which is part of the July 2013 Update Rollup KB2855336.
  4. We are currently fully up to date with all updates, using Cluster Aware Updating (CAU) and a Windows Server Update Services (WSUS) server that is dedicated to the Hyper-V and Scale-Out File Servers.
  5. We first test updates and hotfixes on a research/test cluster with an equivalent configuration.
  6. We then run CAU manually to install Windows updates with a delay of at least 1 month, to catch disasters like the July update. Hopefully we can pick up notifications of problems via social media and blogs.
  7. When necessary we also run CAU manually to install hotfixes with a delay of at least 1 month, and we triple-check their necessity.
  8. We moved back from the vSCSI controller to the virtual IDE controller and the C: drive as the default location for the AD database, even though generation 2 virtual machines will abandon the vIDE controller in favor of the vSCSI controller. When we gain more experience with the new generation 2 VMs we might convert to them in the future.
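
The gating rule from measures 5 to 7 (test first, then wait at least a month) can be written down as a simple check. This is only a sketch: the function name is hypothetical, "1 month" is approximated as 30 days, and the dates in the example are illustrative rather than actual release dates.

```python
from datetime import date, timedelta


def eligible_for_production(released, today, tested_on_lab, min_age_days=30):
    """Return True when an update may be installed on the production cluster.

    An update qualifies only if it passed the test cluster AND has been
    public for at least `min_age_days` (an assumed stand-in for "1 month").
    """
    old_enough = (today - released) >= timedelta(days=min_age_days)
    return tested_on_lab and old_enough


if __name__ == "__main__":
    released = date(2013, 7, 1)  # illustrative date only
    print(eligible_for_production(released, date(2013, 7, 20), True))
    print(eligible_for_production(released, date(2013, 8, 5), True))
```

The waiting period gives the community time to report problems before the update touches production hosts.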

As it stands, we don’t take ANY risks with the database that can truly be called the foundation of everything else we run in our datacenters. I can only hope Microsoft reserves enough resources – and here I mean intelligent people as well as plenty of the equipment and configurations we use in modern enterprises – to test, test and once again test updates so that we can gain back the trust we have lost.

Further reference:

See http://www.hyper-v.nu/archives/hvredevoort/2013/07/some-more-background-on-windows-update-kb2855336/

5 Comments

  1. July 22, 2013    

    Hi Hans,

    I have a quick question.

    When you mention
    “…We moved back from the vSCSI controller to using the virtual IDE controller …”

    Can you explain why? Isn’t it safer to keep the AD DB on a vSCSI disk to avoid corruption? Or do you trust Microsoft that it will not break again?

    Also, with the hotfix applied, do you still see the error in AD saying it’s not able to disable caching on the C: drive?

    • adminHans
      July 22, 2013    

      Hi Emmanuel,

      I still see the same error after installing all updates and hotfixes on both host and guest.
      Until now I’ve always used vIDE and never saw problems. This time the AD database was on vSCSI, but I don’t blame that configuration directly; rather, I blame the combination of the bug check and the absence of the AD database fix. My point is that, now that this is solved, I want to see confirmation that the host has disabled the write cache for the VHDX holding the AD database.

      Cheers, Hans

  2. July 24, 2013    

    Hi!
    I installed all updates, but I still see the error in the log on the DC: the write cache is not disabled. What did MS change in KB2855336? Or did MS only make the error show up in the log, without solving the problem?

    • adminHans
      July 25, 2013    

      That is what I see too. According to MS and the updated KB article, this is solved.
      -H

  3. Bororo
    August 22, 2013    

    I’m a little bit confused about whether KB2855336 should be installed on both host and VM. I just installed Server 2012 Standard with the Hyper-V role, and KB2855336 was offered via Windows Update. Then I installed a virtualized DC, and KB2855336 is not installed there (it was not offered via Windows Update). When I checked the write cache I found that the host is OK (cache disabled) but the virtual machine disks are NOT, as the cache is enabled and cannot be disabled. Should I install KB2855336 manually on the VM?
