Update: The story continues: vNICs and VMs lose connectivity at random on Windows Server 2012 R2

See the end of this post for the latest updates.

In this post Marc van Eijk points out connectivity issues with VMs and vNICs: at random, a virtual machine or vNIC would lose connectivity completely. After a simple live migration the virtual machine would resume connectivity.

Marc has already logged support cases with Microsoft and HP, and they are investigating this issue. Last week I ran into the same issue; here is my configuration:

We are currently experiencing network connectivity issues with one of the cluster networks in our Windows Server 2012 R2 Hyper-V cluster environment.

Our environment is as follows:

- Two HP BL460 G7 servers (name of the servers: Host01 and Host02)

- 6x HP NC553i Dual Port FlexFabric 10Gb Converged Network Adapters (only 2 active)

- Installed with Windows Server 2012 R2 Hyper-V (full edition)

- Configured in a Windows Failover Cluster

The NICs are installed with the following driver:

Driver: Emulex, Driver date: 5-6-2013, Driver version: 4.6.203.1

We have configured a switch-independent NIC team with dynamic load balancing and two team members. On top of this NIC team we have created a vSwitch.

On this vSwitch we have created three vNICs of type Management OS:

- Management

- Live Migration

- Cluster CSV

Every vNIC is configured in a separate VLAN. Only the Live Migration network may be used for live migration traffic (configured in Windows Failover Clustering). A rough PowerShell sketch of this configuration is shown below.
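
For reference, this is a minimal PowerShell sketch of how such a team, vSwitch and Management OS vNICs could be built. The adapter names, team name, vSwitch name and VLAN IDs are assumptions for illustration only; substitute your own.

    # Sketch only: switch-independent team with dynamic load balancing (names/VLANs are examples).
    New-NetLbfoTeam -Name "Team1" -TeamMembers "NIC1","NIC2" -TeamingMode SwitchIndependent -LoadBalancingAlgorithm Dynamic -Confirm:$false

    # vSwitch on top of the team, then three Management OS vNICs.
    New-VMSwitch -Name "vSwitch1" -NetAdapterName "Team1" -AllowManagementOS $false
    Add-VMNetworkAdapter -ManagementOS -Name "Management" -SwitchName "vSwitch1"
    Add-VMNetworkAdapter -ManagementOS -Name "LiveMigration" -SwitchName "vSwitch1"
    Add-VMNetworkAdapter -ManagementOS -Name "ClusterCSV" -SwitchName "vSwitch1"

    # Put every vNIC in its own VLAN (the IDs are examples).
    Set-VMNetworkAdapterVlan -ManagementOS -VMNetworkAdapterName "Management" -Access -VlanId 10
    Set-VMNetworkAdapterVlan -ManagementOS -VMNetworkAdapterName "LiveMigration" -Access -VlanId 20
    Set-VMNetworkAdapterVlan -ManagementOS -VMNetworkAdapterName "ClusterCSV" -Access -VlanId 30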

The initial installation and configuration of Hyper-V and the Windows Failover Cluster went fine. Communication between the hosts in the cluster was possible over all networks.

The Cluster Validation Wizard ran successfully without any warnings or errors.

After the installation of the Hyper-V cluster we started creating and installing virtual machines. No problems at all, until we built a specific VM called VM06. This VM was created on host Host01.

When the VM resides on this host, everything is fine. As soon as we move this virtual machine (via live migration) to Host02, the cluster network called Live Migration goes down and communication on this network between the two Hyper-V hosts is no longer possible. When we move the virtual machine back to Host01, the Live Migration cluster network comes back online. The same happens when we shut down the virtual machine while it resides on Host02: the Live Migration cluster network comes back online.

When we change the NIC teaming configuration to an Active/Standby configuration, as Marc described in his blog, this network issue does not appear.
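
In PowerShell, that change could look roughly like this (the team and member names are assumptions):

    # Put one of the two team members in standby; the other stays active.
    Set-NetLbfoTeamMember -Team "Team1" -Name "NIC2" -AdministrativeMode Standby
    Get-NetLbfoTeamMember -Team "Team1" | Format-Table Name, AdministrativeMode, OperationalStatus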

Microsoft asked us to disable Large Send Offload: “Get-NetAdapterLso | Disable-NetAdapterLso” (with NIC teaming in active/active). However, the issue was still there.

Update 11-26-2013 14:45: After disabling RSS and RSC (which did not change the situation), Hans suggested disabling VMQ. We used PowerShell to disable VMQ on all interfaces: “Get-NetAdapterVmq | Disable-NetAdapterVmq” … and yes, disabling VMQ does the trick. Of course this is not a solution but only a workaround. These findings have been logged in the case at Microsoft and they will investigate further. The full workaround sequence is sketched below.
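
For completeness, the offload-related workarounds mentioned above boil down to the following commands. Treat this as a temporary measure and re-enable the features once a fixed driver or hotfix is installed.

    # Temporary workaround: disable LSO, RSS, RSC and VMQ on all physical adapters.
    Get-NetAdapterLso | Disable-NetAdapterLso
    Get-NetAdapterRss | Disable-NetAdapterRss
    Get-NetAdapterRsc | Disable-NetAdapterRsc
    Get-NetAdapterVmq | Disable-NetAdapterVmq

    # Re-enable later with the matching Enable-NetAdapter* cmdlets, e.g.:
    # Get-NetAdapterVmq | Enable-NetAdapterVmq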

Update 12-02-2013 10:45: After applying update KB2887595-v2 to both of our Hyper-V nodes, the network problems with our Live Migration network are gone. Even with VMQ enabled the network stays up and running. However, this update fixes the problem in our situation but not in the situation that Marc describes, so it seems we have two different issues here.

We (Hans, Marc and I) will continue to investigate this issue and will keep you updated on www.hyper-v.nu!

44 Comments

  1. November 26, 2013    

    Looks like it carried over from Windows 2008
    http://support.microsoft.com/kb/2681638

  2. November 27, 2013    

    Same issue on our end. This is the configuration we used, based on reading your article, that resolved the issue for us.

    [Teaming Configuration]
    Windows 2012 Native Teaming
    Switch Independent
    Address Hash
    Standby Adapter Defined

    [Disabled NIC Options]
    Receive Side Scaling
    Large Send Offload Version 2
    Recv Segment Coalescing
    Virtual Machine Queues

  3. November 28, 2013    

    The command “Disable-NetAdapterVmq” was no solution for us. On the contrary: my hosts completely lost their network connections.

    • Chris
      March 5, 2014    

      You also need to disable VMQ on the VM after disabling VMQ on the host, then rebooting.

      It’s under the “advanced” options for the network adapter configuration for the VM’s settings.
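
      From the Hyper-V host this can also be done with PowerShell; a minimal sketch (the VM name is just an example):

        # Disable VMQ for all network adapters of the given VM.
        Get-VMNetworkAdapter -VMName "VM06" | Set-VMNetworkAdapter -VmqWeight 0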

  4. November 28, 2013    

    No success with Calvin’s solution.

  5. November 29, 2013    

    Regarding this issue we are experiencing: we found this evening that HP released an update this afternoon. There is not much documentation, but it would appear they have changed something. We will report back after we test it tomorrow.

    ftp://ftp.hp.com/pub/softlib2/software1/supportpack-generic/p1235385378/v89111

    • adminHans
      November 29, 2013    

      Hi Marc,

      Apparently HP has included the latest Emulex driver version 10.0.430.1003 (cp021450).
      We have already tried this Emulex version and it gave identical results.
      Unfortunately we are waiting for another update.

      Best regards,
      Hans

  6. December 1, 2013    

    Thanks. We have two support calls open, one with Microsoft and one with HP, and I have been on to both account managers. Hope to get an answer on what will be done Monday. Watch this space…

    • December 2, 2013    

      Hi guys, see my last update on the article!

      Peter

      • December 3, 2013    

        Still the same issue. Like others, we have had no help from HP and MS so far.

        • adminHans
          December 6, 2013    

          File a case with both HP and MS and we can connect you to the other cases.

          Hans

        • Chris
          March 5, 2014    

          I also have cases open with both MSFT and HP; email me if you get any progress or just want to discuss the issue and verify we are both seeing the same symptoms; asdlkf@[spamfilter]asdlkf.net. Remove the spam filter and brackets part.

  7. December 3, 2013    

    We tried the update from the last post with no success. We have just spoken to our account manager at HP, who is now escalating.

    We also found a post on the Microsoft forums, which we tried without success:

    http://social.technet.microsoft.com/Forums/windowsserver/en-US/5c344e30-4fb0-4d29-a1d7-133267057206/teaming-in-windows-2012-causes-network-connection-timedout?forum=winserver8gen

    We also noted that we had two different installation media (TechNet/MSDN and Volume Licence). We have tried both, as they have slight differences, and this produced the same results.

    HP are due on site to replace a blade motherboard, having made us trash it by downgrading firmware; maybe they will have seen this issue before. Hopefully we now have the escalation route in HP.

    Will keep you posted…

  8. December 3, 2013    

    We have seen that even when pings drop, we can still get to file shares on the server we cannot ping. Surely this suggests something in the networking stack. IPv6 pings work but IPv4 does not. What are others seeing?

  9. December 4, 2013    

    Hi,
    Just found your blog today.

    We have been fighting this issue for 3+ weeks now. We have 2 x BL490 G7 and 2 x BL460 Gen8 servers with the exact same issues in our PRODUCTION environment. They are running Windows Server 2012 R2.
    On top of that we are fighting random BSODs on all 4 servers. When we analyze the memory dumps, they point to the Emulex NIC drivers as the problem.
    We have a bunch of BL490 G6 servers WITHOUT any of these problems, as they have the integrated NetFlex Broadcom NICs.

    We have reported the problem to HP and they have escalated the case to their Level 2 and 3 engineers.

    We have followed a couple of action plans from them to solve the issues. So far they have not solved anything.

    HP’s solutions have been to reflash the same firmware on the Emulex NICs and to reapply the same (latest) drivers. That didn’t help much (the reasoning was that HP SUM sometimes fails the installation).

    Next they wanted me to deactivate affinity in the VMQ settings. That didn’t help with the BSODs either.
    Then I tried disabling VMQ on all adapters, and that helped a bit with the “lost connection after Live Migration” issue.
    But there is still no solution for the random BSODs.

    I will also post any updates I receive from HP.

    /Kim

    • adminHans
      December 6, 2013    

      Thanks Kim,
      We are not seeing BSODs, but seeing that the source is also related to Emulex makes me wonder. Perhaps you can send your case IDs with MS and HP to me at hans AT hyper-v DOT nu.

      >Hans

  10. December 4, 2013    

    Well, another day has passed and we are still no further forward. Microsoft have spent a large amount of time connected, but with no success.

    HP have been a little thin on the ground, and despite escalating they are not coming forward with much.

    It is likely we are going to have to abandon R2, as we have to deliver and cannot afford these holdups. My recommendation: if you are looking to run HP hardware on Windows Server 2012 R2, look at alternatives or hold off until they resolve this.

    We have a few more calls and if we do get more details we will post.

    • adminHans
      December 6, 2013    

      Hi Mark,
      Yes, it is certainly a nasty one, but hold on: there is an updated Emulex driver underway. We have been stable so far with VMQ switched off. If you can share your HP and MS case numbers I will hook them up to the collection we already have. There is plenty of focus now from the MS product teams, HP and Emulex, but no solution yet.
      >Hans

  11. December 5, 2013    

    This issue seems to rear its ugly head at random during reboots of VMs too, not just live migration. The VM boots back up, the NIC within the guest is enabled, but no traffic goes through. Just disabling the guest’s NIC and enabling it again resolves the issue (a PowerShell equivalent is sketched at the end of this comment).

    Just seen this page, so I have not tried much yet in terms of fixing it apart from the above.

    Kit used is HP DL385p G8 – Server 2012 R2 in a cluster.
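
    A quick way to bounce the NIC from inside the guest with PowerShell (the adapter name is an example; use Get-NetAdapter to find yours):

      # Equivalent to disabling and re-enabling the guest NIC in one step.
      Restart-NetAdapter -Name "Ethernet"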

    • adminHans
      December 6, 2013    

      Thanks for adding that. It is not something we have noticed, but we cannot reproduce it either while VMQ is off.
      >Hans

  12. December 6, 2013    

    We seem to have found a solution, but it does disable some offloading features, which may have a performance impact.

    Our adapter details are:

    HP Flexfabric 10gb 2 port 554FLB Adapter
    Firmware version 4.6.247.5

    The driver version is 10.0.403.1003 and is found in the Windows Server 2012 R2 Supplement for the Service Pack for ProLiant 2013.09.0 B:
    ftp://ftp.hp.com/pub/softlib2/software1/supportpack-generic/p1235385378/v89111

    The settings on the advanced tab of the driver are:

    Class of service (802.1p) = Disable priority
    Enhanced transmission selection = Disabled
    CPU Affinity
      Preferred NUMA Mode = Not present
      Receive CPU = Not present
      Transmit CPU = Not present
    Interrupt Moderation = Adaptive 30k int/sec
    Maximum number of RSS queues = 8
    Receive buffers = 896
    TCP Offload optimisation = Optimise throughput
    Transmit buffers = 256
    Virtual machine queues
      Transmit = Disabled
      Virtual machine queues = Disabled
    Protocol offloads
      IPv4
        Checksum
          IP Checksum offload = Disabled
          TCP Checksum offload = Disabled
          UDP Checksum offload = Disabled
        Large send offload v1 = Disabled
        Large send offload v2 = Disabled
      Recv segment coalescing (IPv4) = Enabled
      TCP connection offload = Disabled

    We also had to run 2 commands:
    Netsh int tcp set global chimney=disabled
    Netsh int tcp set global rss=disabled

    If anyone would like to try this, please let us know how you get on; we have run it all day on multiple hosts with many reboots. A PowerShell sketch for scripting these settings follows below.
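
    If you prefer to script settings like these rather than click through the driver’s advanced tab, something along these lines may work. The display names are driver-specific guesses, so check the output of Get-NetAdapterAdvancedProperty first.

      # List the exact display names the driver exposes.
      Get-NetAdapterAdvancedProperty -Name "NIC1" | Format-Table DisplayName, DisplayValue

      # Example: disable two of the settings by display name (names vary per driver).
      Set-NetAdapterAdvancedProperty -Name "NIC1" -DisplayName "Virtual Machine Queues" -DisplayValue "Disabled"
      Set-NetAdapterAdvancedProperty -Name "NIC1" -DisplayName "Large Send Offload V2 (IPv4)" -DisplayValue "Disabled"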

  13. December 9, 2013    

    Has anyone tried this? We have had it in place all weekend with no adverse effects. We are currently creating a 3-node Hyper-V cluster and all seems well.

    • adminHans
      December 9, 2013    

      Hi Mark,

      Did you install the Emulex driver manually from the CP02145.exe file as part of the latest supplement?
      We have been successful for over a week with just VMQ disabled on the LBFO team with the Hyper-V switch.
      Marc rebuilt after wiping the networking configuration, going back to the inbox driver.
      Installing MSw2k12r2-rtmsupplement-2013.09.0.B.win.exe does not recognize an updated driver, so the driver was installed manually.
      With the Emulex WS2012 R2 driver 10.0.430.1003 we could not set RSS queues.
      With the HP provided Emulex driver in CP02145.exe in the latest supplement we could set the RSS queues to 16.
      When you install CP02145.exe manually, it discovers an old Emulex driver and updates it to the same version as the WS2012 R2 driver 10.0.430.1003 on the Emulex site.
      We now have a repro again: guest VM cluster nodes lose their connection when live migrated between cluster nodes.

      Still under heavy investigation. A new driver is promised and is being tested.

      Best regards,
      Hans

  14. December 11, 2013    

    We’re actually having a similar issue here with KB2887595 and Intel X520 10Gb adapters. After installing the update, live migrations cause the host to crash. The dump file points to the Intel driver as the cause. We tried multiple drivers with the same result. Uninstalling the update resolves the issue and the cluster is solid again. Anxious to hear what comes of your MS case.

  15. December 12, 2013    

    Need to check. I have also had an excellent conversation with HP; they have escalated our case to their UK labs, who are investigating. No timelines yet, but they have assured me that a full fix will be confirmed.

    We have avoided playing with settings as we need to start some deployment work. We are now waiting on HP.

  16. adminHans
    December 19, 2013    

    @Mark
    We’ve also heard that the problem has been identified and that work is being done on an updated driver/firmware. I doubt we will see it before the end of 2013.
    -Hans

  17. January 8, 2014    

    Hi,
    Any news on the driver..??

    We got a beta driver from HP, but that was only for Windows Server 2012 and not for R2.

    We have seen this issue a couple of times in our newly updated R2 production cluster and we are now disabling VMQ on all adapters to get it more stable.

    Hope it will be fixed soon..

    -Kim

  18. January 15, 2014    

    Hi, we have just received an update from HP on this very issue. I’ll let you know if it fixes the problem.

    • adminHans
      January 22, 2014    

      Can you be more specific on what you received and what actions to take?

      Thanks, Hans

  19. January 21, 2014    

    Hi,
    Anyone got any updates on this? I’m running 2x HP DL380 G6 with the onboard Broadcom BCM5709C NICs (also known as HP 382i Multi…).

    Due to these problems I can’t deploy the vSwitch from VMM, and I believe I have tried every bit and byte there is to this… I am doing a fresh install, again, to test out the following:

    1. Clean OS install, with all available MS Updates from Windows Updates (no driver update)
    2. Install the latest drivers from Broadcom web
    3. Disable VMQ on physical NICs
    4. Disable chimney, offloading and what not from above comments
    5. Test to create the vSwitch

    …I do expect a BSOD in my face very soon… I will post an update tomorrow.

  20. January 22, 2014    

    …and still… no dice

  21. January 23, 2014    

    By disabling the power management options on the physical NICs I can now at least create a vSwitch with Hyper-V Manager, so that is a step further… Anyhow, pushing out the config from VMM still doesn’t work. I have tried all of the above, in various ways…

    Troubleshooting this is very time-consuming, so to be honest I think I just have to drop 2012 R2 for now and make my system work on 2012…

  22. January 24, 2014    

    We have an HP ProLiant DL380p Gen8 using the HP NC365T PCIe quad-port Gigabit server adapter. The Hyper-V team consists of 3 ports, with 1 port dedicated to management. It works with VMQ and offloads enabled only if 1 adapter is in standby mode and 2 are active (switch independent, dynamic load). We first had to install all(!) Windows Updates, including the NIC driver from Microsoft (Intel LAN update, version 12.7.28.0, dated 13.05.2013). After that, ipconfig, ping and RDP to the VMs work smoothly.

    Anyhow HP NCU was much better… o_O

    hope this helps someone.
    regards,
    david

  23. January 27, 2014    

    Try this hotfix: http://support.microsoft.com/kb/2913659. It has fixed the issue for us so far. We still have LSOv2 turned off and have statically assigned VMQ processor ranges to our NICs, but everything else is running in the default config.
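
    For anyone wanting to do the same, a minimal sketch of statically assigning VMQ processor ranges; the adapter names and processor numbers are examples and depend on your core count and NUMA layout.

      # Give each 10 GbE port its own, non-overlapping range of cores for VMQ.
      Set-NetAdapterVmq -Name "NIC1" -BaseProcessorNumber 2 -MaxProcessors 8
      Set-NetAdapterVmq -Name "NIC2" -BaseProcessorNumber 16 -MaxProcessors 8
      Get-NetAdapterVmq | Format-Table Name, Enabled, BaseProcessorNumber, MaxProcessors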

  24. January 27, 2014    

    Forgot to add…we are also running the Emulex driver version 10.0.430.1047 from their website.

  25. January 28, 2014    

    Was the communication problem after live migrating a VM from one host to another bi-directional or only uni-directional? That is, I have seen it only prevent all other VMs and hosts from communicating with the newly live-migrated VM, while, strangely, the newly migrated VM can still communicate with everything else; in fact it never stopped communicating during the migration. After shutting down the migrated VM and restarting it, the other VMs and hosts were able to communicate with it again.

  26. February 3, 2014    

    Same problem here: 2012 R2 + Hyper-V, running on HP DL380p G8 with the 4-port 331FLR 1Gb adapter. The MS hotfix didn’t help; now trying with VMQ disabled.
    All servers are affected. It was running without problems on 2012 (R1).

    Klaus

  27. February 4, 2014    

    Hi

    The hotfix http://support.microsoft.com/kb/2913659 did not help with our problem. We have now removed the Emulex adapters and replaced them with IBM QLogic 8200 adapters. We have run our servers for one week without problems.

    -Robert

  28. spiros
    February 16, 2014    

    Same problem: Hyper-V 2012 R2 on an HP ML310 G8 V2, with high ping latency (up to 180 ms).
    The VMQ workaround did not work.
    Downloaded HP SUM from: ftp://ftp.hp.com/pub/softlib2/software1/supportpack-generic/p1235385378/v89111/
    Copied and extracted it on my Hyper-V host, then remoted in and installed it from the command line, updated the repository with the latest updates from the HP FTP site, updated the Broadcom NIC drivers and rebooted.

    Ping times returned to 1ms !

    Will use the HP SUM for future installations as well !!!

  29. Peter
    March 25, 2014    

    LACP from one blade to two Flex modules is not supported. The 2 Flex modules operate as two independent switches. LACP only works between two devices, not three. So active/standby is the only correct way to go if you connect one LOM to one Flex module and the other LOM to the other. This is what your build looks like. Please correct me if I’m wrong.

    Greetz,
    Peter

  30. April 8, 2014    

    HP has released “critical” firmware for the Emulex-based FlexFabric blade adapters. The HP link is really long; just search for HP advisory c04218016.

    We had BL460c Gen8 servers in a C7000 with Virtual Connect 4.10 firmware. Before applying this critical fix, we were unable to even create a 2-host cluster. There were various errors, but the most prevalent were WMI errors during the cluster validation tests.

    Hopefully this VMQ nightmare is about to end………

    • April 9, 2014    

      Just checked with sources at JP:
      It does not relate to the VMQ issue.
      Hans

  31. April 15, 2014    

    Hi, this might be marginally off-topic as I’m not using Emulex or Broadcom adapters, but might help anyone who comes across this thread as I did when researching my issue.

    Using VMQ with HP’s NC523SF 10Gb NICs causes high network latency when pinging our iSCSI SAN (3PAR 7400). If we disable VMQ the problem goes away. However, we don’t want our host bandwidth restricted, so we needed to find another solution.

    The NC523SF card is actually a QLogic QLE3242-CU. HP’s latest offered driver is v5.3.10.1101. QLogic’s latest version is v5.3.12.0925, which fixed my problem.
    It obviously leads to potential HP support issues as it’s an unsanctioned driver, but it does appear to do the job.

    Cheers,
    Ross

  1. Update List for Windows Server 2012 R2 Hyper-V on December 13, 2013 at 14:03
  2. November 2013 UR KB2887595 causing network problems on December 22, 2013 at 17:52
  3. hyper-v.nu – Definitive Guide to Hyper-V 2012 R2 Network Architectures on March 6, 2014 at 12:51
