Keeping your Virtual Active Directory Domain Controllers Safe
In my last blog I sent out a red alert on a killer Windows Update that had not been sufficiently tested. The net result was a full crash of a two-node System Center fabric management cluster. The fabric was still in the making and backups were only provisionally taken in the form of Virtual Machine exports of the most important virtual machines. As fellow Hyper-V MVP Aidan Finn wrote unambiguously: “Something Has Gone Very Wrong With Microsoft Patch Testing”
Where did it go wrong?
I was actually demonstrating the fantastic Cluster Aware Updating functionality in Windows Server 2012 clusters, which would automatically move all VMs off a host, update it, reboot it, live migrate the VMs back to the updated cluster node and move on to the next.
The problematic July Update Rollup KB2855336, which was one of the updates to be processed, is actually a collection of originally 20 fixes that solve problems in several areas. A still unidentified part of that rollup caused a 0x000000D1 Stop error while live migrating a VM on a Windows Server 2012-based server. Because Cluster Aware Updating uses the Live Migration mechanism to place a host in maintenance mode, the combination with this update sent shockwaves through the cluster. In this case both cluster nodes crashed within minutes.
Ironically enough this same July Update Rollup also contained an important fix for a problem that has been around for some time: Active Directory database becomes corrupted when a Windows Server 2012-based Hyper-V host server crashes (KB2853952).
Assume that you have a Windows Server 2012-based virtualized domain controller on a Windows Server 2012-based Hyper-V host server. When the Hyper-V host server crashes or encounters a power outage, the Active Directory database may become corrupted.
This issue occurs because the guest system requests the Hyper-V server to turn off disk caching on a disk. However, the Hyper-V server misinterprets the request and keeps disk caching enabled.
If you try to disable the write caching manually you will see this error: “Windows could not change the write-caching setting for the device. Your device might not support this feature or changing the setting.” On a physical domain controller this has never been a problem.
CORRUPTED DOMAIN CONTROLLER
It takes very little imagination to guess what happened to Active Directory if you combine the full STOP of the fabric management cluster and the AD domain controllers that were virtualized on that same Hyper-V cluster without the required updates and hotfixes.
I ended up filing a case with Microsoft Support and, although it was just before last weekend, a really knowledgeable AD support specialist got in touch and helped me identify the damage that was done. The DC running all the FSMO roles, including the Schema master, was on a VM that was totally broken. PowerShell, Server Manager and even the Event Viewer could no longer be started, although fortunately we could still open a command prompt. Remoting into the machine was also no longer possible.
In short, we took exports of both Active Directory domain controller VMs, rebooted, got SYSVOL replication working again (which took at least two hours to repair), and moved the Schema and other FSMO roles to a freshly installed DC. We then tried to demote the corrupted DC, which was not possible in the GUI or in interactive mode. Because the command prompt still worked, I was able to run:
dcpromo /unattend /username:<domain admin> /userdomain:<domain> /password:<Domain Admin password> /administratorpassword:<local admin password>
And this was only the domain controller. Meanwhile, a Cluster Shared Volumes disk had also been damaged during the crash and required a Chkdsk.
It turned out that one VM’s virtual hard disk was corrupt beyond repair. Several other VMs had to be reinstalled because they showed unexpected behavior. At the end of the day I realized I had seen the largest corruption in my entire 30-year career in the IT industry.
A full STOP of a cluster node, just like a power outage without the protection of a UPS or no-break system, can damage your files and databases when data in the write cache is unprotected. When a virtual domain controller starts, it requests that the disk cache be disabled on the virtual disk controller (IDE or SCSI). On a physical domain controller this is handled correctly, but the virtual domain controller would report success although in reality it could not disable the cache. As a result, AD at the Jet database layer did not request write-through I/O (writing directly to the disk without using the disk cache), because it assumed the disk cache had already been disabled. This is the situation that could lead to corruption if the physical Hyper-V host rebooted unexpectedly.
KB2853952 fixes this situation so that the OS correctly reports that the disk cache cannot be disabled. The Jet database, now properly informed, requests that I/O be written directly to disk, which is then handled correctly by the Hyper-V storage filter driver in the guest.
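The before-and-after behavior described above can be reduced to a small decision model. This is only a toy sketch in Python: the names `IoMode`, `choose_io_mode` and `data_at_risk` are made up for illustration and do not correspond to anything in Windows, Hyper-V or the Jet database engine.

```python
# Toy model only: every name here is an illustrative assumption,
# not a Windows, Hyper-V or Jet (ESE) API.
from enum import Enum


class IoMode(Enum):
    CACHED = "cached"                # ordinary writes; safe only if the disk cache is really off
    WRITE_THROUGH = "write-through"  # every write is forced past the cache


def choose_io_mode(cache_disable_reported_ok: bool) -> IoMode:
    """Pick the I/O mode the database uses after asking to disable the disk cache."""
    if cache_disable_reported_ok:
        return IoMode.CACHED
    return IoMode.WRITE_THROUGH


def data_at_risk(cache_really_disabled: bool, mode: IoMode) -> bool:
    """Writes can be lost in a host crash only if a live cache is used without write-through."""
    return (not cache_really_disabled) and mode is IoMode.CACHED


# Before the fix: the guest reports success although the cache is still enabled.
before = choose_io_mode(cache_disable_reported_ok=True)
# After the fix: the guest reports that the cache cannot be disabled.
after = choose_io_mode(cache_disable_reported_ok=False)

print(before.value, data_at_risk(False, before))  # cached True
print(after.value, data_at_risk(False, after))    # write-through False
```

The point of the model is that the cache state is the same in both cases; only the honesty of the report changes, and that alone decides whether the database protects itself.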
To enhance recoverability of a file or file system, applications such as Microsoft SQL and Microsoft JET can specify the FILE_FLAG_WRITE_THROUGH flag to instruct the system to write through any intermediate cache and go directly to disk. The system can still cache write operations, but cannot lazily flush them.
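FILE_FLAG_WRITE_THROUGH is a flag for the Win32 CreateFile call, and Python's os module does not expose it directly. As a rough analogy only, the sketch below opens a file with O_SYNC where the platform offers it (POSIX) and additionally calls os.fsync, so that each record is pushed to stable storage before the function returns. The function name `durable_append` and the file name are made up for this example.

```python
import os
import tempfile


def durable_append(path: str, record: bytes) -> None:
    """Append a record and do not return until the OS has been asked to persist it."""
    # O_SYNC (POSIX) is the nearest analog of write-through; O_BINARY only exists on Windows.
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND
    flags |= getattr(os, "O_SYNC", 0) | getattr(os, "O_BINARY", 0)
    fd = os.open(path, flags, 0o600)
    try:
        view = memoryview(record)
        while len(view) > 0:          # os.write may write fewer bytes than requested
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                  # explicit flush as a fallback where O_SYNC is unavailable
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "journal.log")
    durable_append(log_path, b"commit 1\n")
    durable_append(log_path, b"commit 2\n")
    with open(log_path, "rb") as fh:
        contents = fh.read()
```

Note that this only covers what the operating system is asked to do. As the crash described in this post shows, a layer below it that silently keeps caching can still defeat the request.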
If a system failure occurs, NTFS has enough information in the log to complete or abort any partial NTFS transaction. During recovery operations, NTFS redoes each committed transaction found in the log file. Then NTFS locates in the log file the transactions that were not committed at the time of the system failure and undoes each metadata operation recorded in the log file. Because NTFS flushes the log to disk before any metadata changes are written to disk, NTFS has complete information available about any metadata changes that need to be rolled back during recovery.
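The redo and undo passes described above can be illustrated with a toy write-ahead log. This is a simplified sketch in Python and not how NTFS stores or replays its log: the record layout, the `recover` function and the example keys are all assumptions made for illustration.

```python
# Toy write-ahead log; the record format is an assumption for illustration:
#   ("update", txid, key, old_value, new_value)  or  ("commit", txid)
from typing import Dict, List, Optional, Tuple

LogRecord = Tuple


def recover(disk: Dict[str, str], log: List[LogRecord]) -> Dict[str, str]:
    """Return the consistent state after a crash, given the on-disk state and the log."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    state = dict(disk)

    # Redo pass: reapply every update of a committed transaction, in log order.
    for rec in log:
        if rec[0] == "update" and rec[1] in committed:
            _, _, key, _, new_value = rec
            state[key] = new_value

    # Undo pass: roll back updates of uncommitted transactions, newest first.
    for rec in reversed(log):
        if rec[0] == "update" and rec[1] not in committed:
            _, _, key, old_value, _ = rec
            if old_value is None:
                state.pop(key, None)   # the key did not exist before the transaction
            else:
                state[key] = old_value

    return state


log = [
    ("update", 1, "fileA", None, "created"),
    ("update", 2, "fileB", "v1", "v2"),
    ("commit", 1),                      # transaction 2 never committed
]
# At crash time, transaction 2's change had reached disk but transaction 1's had not.
disk_after_crash = {"fileB": "v2"}

recovered = recover(disk_after_crash, log)
print(recovered)  # {'fileB': 'v1', 'fileA': 'created'}
```

The sketch only works because the log itself is assumed to be on disk before the changes it describes, which is exactly the guarantee that an unflushed write cache can break.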
I can also refer you to fellow Hyper-V MVP Didier van Hoye who wrote a blog on NTFS and the flush command:
Case closed, you might think?
However, even after applying the re-issued July Update Rollup KB2855336 including KB2853952, we still see an event in the Directory Service event log that could seem alarming to users:
Active Directory Domain Services could not disable the software-based write cache on the following hard disk. Hard disk: c: Data might be lost during system failures.
GAINING BACK THE TRUST
As we speak, the KB2853952 article describing the problem is being rewritten to better describe what has been corrected. My strong suggestion to the Microsoft product teams is to not only explain this in the KB article, but also to write an Informational event in the event logs of the Hyper-V server and the virtual AD domain controller. If the VM can confirm that the DC is running on a supported hypervisor, it should not be all that hard to also confirm that the Hyper-V storage filter is notified by the hypervisor that write caching has been disabled.
As the English saying goes, “the proof is in the pudding”. We’ll have to see if virtual domain controllers are now fully protected against a bug check or power failure. Although I have been virtualizing all AD domain controllers since we started rolling out Windows Server 2012 Hyper-V for customers, for the current project we have taken these measures:
- We’ve added one physical domain controller holding the Schema and all other FSMO roles.
- We converted our dynamic VHDX virtual disks to fixed size because we want to rule out the risk of running out of disk space.
- We’ve installed KB2853952 which is part of the July 2013 Update Rollup KB2855336.
- We are currently fully up-to-date with all updates using Cluster Aware Updating (CAU) and Windows Server Update Services (WSUS) which is dedicated to only the Hyper-V and Scale-Out File Servers.
- We first test updates and hotfixes on a research/test cluster with an equivalent configuration.
- We then run CAU manually to install Windows updates with a delay of at least 1 month, to catch disasters like the July update. Hopefully we can pick up notification of problems via social media and blogs.
- When necessary we also run CAU manually to install hotfixes with a delay of at least 1 month, and we triple-check their necessity.
- We moved back from the vSCSI controller to the virtual IDE controller and the C: drive as the default location for the AD database, even though generation 2 virtual machines will abandon the vIDE controller in favor of the vSCSI controller. When we gain more experience with the new generation 2 VMs, we might convert to them.
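The one-month waiting period from the list above is easy to check mechanically before a manual CAU run. The sketch below is a hypothetical Python helper, not part of CAU or WSUS: the function name, the 30-day approximation of "1 month" and the update identifiers are all assumptions made for illustration.

```python
from datetime import date, timedelta
from typing import Dict, List

# Assumption: "at least 1 month" is approximated as 30 days.
MIN_AGE = timedelta(days=30)


def eligible_updates(release_dates: Dict[str, date], today: date) -> List[str]:
    """Return the updates that have been public for at least MIN_AGE, sorted by name."""
    return sorted(
        update_id
        for update_id, released in release_dates.items()
        if today - released >= MIN_AGE
    )


# Hypothetical update identifiers and release dates, for illustration only.
pending = {
    "KB-EXAMPLE-A": date(2013, 6, 11),
    "KB-EXAMPLE-B": date(2013, 7, 9),
}

print(eligible_updates(pending, today=date(2013, 7, 20)))  # ['KB-EXAMPLE-A']
```

A list like this is only a first filter. Whether an update that is old enough is also safe still depends on the test cluster results and on what others report.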
As it stands, we don’t take ANY risks with the database that can be truly called the foundation of everything else we run in our datacenters. I can only hope Microsoft reserves enough resources – and here I mean the intelligent people as well as plenty of equipment and configuration we use in modern enterprises – to test, test and once again test updates so that we can gain back the trust we have lost.
This entry was posted by Hans Vredevoort on July 20, 2013 at 14:37, and is filed under Hans Vredevoort, Hyper-v.