vNICs and VMs lose connectivity at random on Windows Server 2012 R2

A couple of weeks ago Windows Server 2012 R2 and System Center 2012 R2 reached the GA milestone. We started with a lab environment to validate our designs. During the deployment we experienced connectivity issues with VMs and vNICs. At random, a virtual machine or vNIC would lose connectivity completely. After a simple live migration the virtual machine would resume connectivity. After verifying our VLAN configuration a couple of times, things got even weirder: after live migrating the virtual machine back to the host where it had lost connectivity, it remained accessible. Most virtual machines were functioning properly and there was no clear pattern in which virtual machines were affected or when. Without a way to reproduce the issue on demand it was complex to troubleshoot.

A week later I did an implementation at a customer site. The design was based on a two-node Windows Server 2012 R2 Hyper-V cluster with System Center workloads and a five-node Windows Server 2012 R2 Hyper-V cluster for production workloads. The nodes of the production cluster were deployed using the bare metal deployment process in System Center VMM 2012 R2. All the hosts were deployed successfully, but we were having issues creating a cluster from these nodes. The cluster validation wizard showed connectivity issues between the nodes. As you might know from my previous blog on bare metal deployment, System Center VMM 2012 R2 can only create a NIC team with the Logical Switch if a vSwitch is created on top of the NIC team. This requires vNICs in the management OS for host connectivity. After validating the VLAN configuration we rebooted the host. Connectivity resumed when a host was rebooted, but at random different hosts lost connectivity again. We were experiencing a situation similar to the one in our lab environment.
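VMM creates these management OS vNICs as part of the Logical Switch deployment; for reference, the equivalent plain PowerShell looks roughly like the sketch below (the switch name, vNIC name and VLAN ID are example values, not the customer's actual configuration):

# Create a vNIC in the management OS attached to the existing vSwitch
Add-VMNetworkAdapter -ManagementOS -Name "Management" -SwitchName "LogicalSwitch"
# Put the management vNIC on its VLAN
Set-VMNetworkAdapterVlan -ManagementOS -VMNetworkAdapterName "Management" -Access -VlanId 10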

There was another similarity between the two environments. Both the customer site and our lab consisted of an HP BladeSystem c7000 with BL460c Gen8 blades that contained HP FlexFabric 10Gb 2-port 554FLB Adapters. These BladeSystems use Virtual Connect technology for converged networking. We upgraded our Virtual Connect to the latest version 4.10 before implementing Windows Server 2012 R2, but the customer was still running version 3.75. The HP FlexFabric 10Gb 2-port 554FLB Adapter is based on Emulex hardware, and an inbox driver with version number 10.0.430.570 is provided by Microsoft. After contacting my friend Patrick Lownds at HP, he provided me with a link to the Microsoft Windows Server 2012 R2 Supplement for the HP Service Pack. Running this did not provide any driver updates. The details of the HP FlexFabric 10Gb 2-port 554FLB Adapter showed that this is Emulex hardware. A search on the Emulex site provided a newer version of the driver. After installing the new driver, version 10.0.430.1003, the issue occurred again.

We submitted a case with Microsoft and I have been debugging this issue with a Software Development Engineer from Microsoft (who reviewed my blog series on NIC Teaming about a year ago) for the last week. I must say kudos to Silviu for his assistance every evening this week, and to Don Stanwyck for communicating with HP. I also reached out to a couple of community members to ask if the issue sounded familiar. Rob Scheepens (Sr. Support Escalation Engineer at Microsoft Netherlands) was aware of another customer with the same issue on exactly the same hardware, and yesterday evening I was contacted by another one. Same issue, same hardware. This morning I was pinged by Kristian Nese, who has a repro of the issue with 2x IBM OCe11102-N2-X Emulex 10GbE adapters in a team (created from VMM) with Emulex driver version 10.0.430.570.

The issue is not solved yet, but I thought that a quick post would prevent a lot of people from wasting valuable time on troubleshooting. Please submit a case with your hardware vendor, as this will raise the priority on their side. I’ll update the blog with any progress or relevant information.

A possible temporary workaround seems to be configuring the NIC team members as active/standby. I have not been able to test and confirm this.
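For anyone who wants to try that workaround, a minimal PowerShell sketch is shown below; the team name "Team1" and member name "NIC2" are examples only, so substitute your own names (standby members are only meaningful for switch-independent teams):

# List the current members and their administrative mode
Get-NetLbfoTeamMember -Team "Team1"

# Place one member in standby so the team effectively runs active/standby
Set-NetLbfoTeamMember -Name "NIC2" -AdministrativeMode Standby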

137 Comments

  1. November 22, 2013    

    Hey, I noticed a similar problem with Qlogic CNA 8142 and R2.
    A vNIC sometimes stops; a server restart or a disable/enable of the vNIC gets it up and running again.
    Disabling Large Send Offload v2 on all NICs seems to have made the problem disappear.
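    For reference, that can be done in one go from PowerShell; a small sketch (when no -IPv4/-IPv6 switch is given, Disable-NetAdapterLso turns LSO off for both protocols):

    # Disable Large Send Offload on every adapter, including team interfaces
    Get-NetAdapter | Disable-NetAdapterLso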

    • T|CK
      February 13, 2014    

      This did the trick for me. Thank you.

    • February 24, 2014    

      Quite alarming that network hardware offloads are so poorly tested.

      -H

    • Leandro
      March 27, 2014    

      It’s working!. Thanks, Johan.

  2. November 23, 2013    

    I am experiencing similar problems. While I am able to RDP into the VM, I cannot connect to it over the network; ping and net view \\ComputerName don’t work.

    Restarting the host fixes the problem until it randomly happens again.

  3. November 26, 2013    

    We have a very similar issue, but with Windows 2012 on the exact same hardware. We have been working with HP support for over 2 weeks now and no resolution yet.

  4. November 27, 2013    

    Hi, we have the same problem on two HP DL380 Gen8 servers with 2 x 4-port 1Gb cards installed.

    Restarting the host fixes the problem for only two of the three vNICs.

  5. November 27, 2013    

    Same Problem with HP DL 380 Gen8 and Chelsio TR420-CR Cards.

  6. November 28, 2013    

    We are having the same issues on a new c7000 with HP Gen8 blades using 10G Flex. We have 2 identical setups, one on Server 2012 (R1), the other on Server 2012 R2; the R2 environment is having the exact same issues. We also have some DL360 Gen8s with the same issues.

    We have calls logged with HP but not getting much back. Has anyone had any joy?

  7. November 29, 2013    

    Did you try disabling Adaptive Interrupt Coalescing on the Emulex CNA? It’s normally enabled by default and should be manually disabled when building new hypervisor hosts (regardless of the hypervisor manufacturer)
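    As far as I know this is exposed as an advanced property of the Emulex driver; a hedged sketch of checking and disabling it from PowerShell (the adapter name and the exact DisplayName string depend on your driver version, so verify them with the first command before changing anything):

    # List the advanced properties the driver exposes (names vary per driver version)
    Get-NetAdapterAdvancedProperty -Name "Ethernet 1" | Format-Table DisplayName, DisplayValue
    # Disable the coalescing property if it is present under this DisplayName
    Set-NetAdapterAdvancedProperty -Name "Ethernet 1" -DisplayName "Adaptive Interrupt Coalescing" -DisplayValue "Disabled"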

    • adminHans
      November 29, 2013    

      Hi Christoph,
      No we did not try that. Without VMQ the problems went away, but we’ll re-test after HP/Emulex has released new drivers.

      Best regards, Hans

  8. December 4, 2013    

    Yes I have the same hardware, but due to time constraints not able to test.

    A missing factor in all the publications from both Peter and Marc is the LoadBalancingAlgorithm that is being used. In Windows Server 2012 there were “Hyper-V Port” and “Address Hash” (with either TransportPorts, IPAddresses or MACAddresses). In Windows Server 2012 R2 there is a new Dynamic mode.

    It looks like the new default when creating the Team on physical servers via PowerShell or GUI is Dynamic.

    There is not much reading material on this new Dynamic mode; I only found this: “Moves outbound streams from team member to team member as needed to balance team member utilization. [...] Inbound traffic is routed to a particular team member”.

    All 3 “Address Hash” methods have the following explanation added by Microsoft: “All inbound traffic arrives on the primary team member”.
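    A quick way to check which algorithm an existing team uses, and to switch it back to the pre-R2 behaviour, is sketched below; "Team1" is an example name only:

    # Show teaming mode and load balancing algorithm for all teams
    Get-NetLbfoTeam | Format-Table Name, TeamingMode, LoadBalancingAlgorithm
    # Switch a team from Dynamic back to Hyper-V Port
    Set-NetLbfoTeam -Name "Team1" -LoadBalancingAlgorithm HyperVPort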

    • adminHans
      December 6, 2013    

      Hi Menno,
      One of our first troubleshooting attempts was to change LB back to Hyper-V port. It seemed to solve the problem, but only for a short time. The vNICs started communicating again until the VMs were live migrated. In our case a 2-node guest cluster which was spread across the cluster.

      Meanwhile we have collected several similar cases, mostly related to Emulex but also to Chelsio iWARP network adapters, which are being looked at by the server and networking product teams in Redmond. A new Emulex driver is in the process of being tested, but we may be dealing with an intrinsic problem at a deeper level. To be continued …

      Best regards,
      Hans Vredevoort

  9. December 12, 2013    

    We are running a 12-node cluster with BL460c Gen8 blades with the Emulex adapters. We started having this issue when upgrading to Virtual Connect 4.10. After that, if we live migrate an NLB VM onto a host that resides in the enclosure with VC 4.10, NLB starts behaving strangely. We also had VMs losing network connectivity when moving them to a host with VC 4.10; no problems with the hosts running on VC 3.75. I have another Hyper-V cluster with BL465c G7 blades; they also have problems with NLB and VC 4.10. Anyone else having issues with NLB and VC 4.10?

    • February 24, 2014    

      We have that combination (VC 4.10 and NLB) but virtual failover clusters have the same issues.
      Cannot test with older firmware unfortunately.
      -H

  10. December 16, 2013    

    I have just set up a 16-node 2012 R2 Hyper-V cluster using BL465c/NC554FLB with the FlexFabric 10/24 modules. Each blade has 8 NICs presented through VC 4.01: 2 of the NICs are for iSCSI, and then we have 3 teams of 2 for the management, host, and VM networks. The host team is created on each host using PowerShell and allocated 100 MB through VC. The management and server VM teams are applied through VMM using logical switches and uplink port profiles. There are several VLANs trunked on these NICs through VC.

    After setting up a few new production VMs on the cluster we are now experiencing the VM network disconnect issue described above. Once a VM disconnects from the network we cannot migrate it off, disconnect/reconnect the NIC, nothing. If you try to perform any task against the unstable VM it will lock up, and the only way to get it back is to reboot the VM host through iLO (forcefully).

    A case has been opened with MS premier support. The first thing they had us do was break the two NIC teams applied by VMM, set VMQ on each pNIC for a specific starting proc and range, and rebuild the teams through VMM. This didn’t resolve the issue. Next, they sent me KB2887595 to apply to all 16 nodes. Also, any VM that didn’t have a port classification assigned got one. The port classifications are what signal the vNIC to use VMQ according to the VMM documentation. Not sure the last two fixed our issue but we haven’t had it occur since then, not yet anyway.

  11. December 16, 2013    

    Forgot to add. The load balancing method we are using for the teams is the default “Dynamic” for all teams. The host created team is using RSS and not VMQ since this team isn’t connected to a vSwitch. The server and management teams are both connected to a separate vSwitch and therefore using VMQ.
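    Whether an adapter ends up using RSS or VMQ can be checked directly from PowerShell; a small sketch:

    # Show per-adapter VMQ settings (enabled/disabled, base processor, max processors)
    Get-NetAdapterVmq
    # Show per-adapter RSS settings
    Get-NetAdapterRss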

    • December 17, 2013    

      Thanks for your comments Will.
      We are coming closer to a solution. If you encounter the issue again you can use the following workaround.
      get-netadapter | disable-netadaptervmq
      This will disable VMQ on all adapters. Unfortunately this will direct all incoming vNIC traffic to the first proc. But at least you will have connectivity to those VMs.
      Marc

  12. December 17, 2013    

    Upon further research I have found that HP’s QuickSpecs doc on the NC554FLB adapter does not mention support for VMQ. However, both the HP Flex-10 10Gb 2-port 530FLB Adapter and the HP Ethernet 10Gb 2-port 560FLB Adapter (Intel chipset) mention support for VMQ. The NC554FLB adapter is the only one that supports the FlexFabric 10/24 interconnects we bought, which offer the most flexibility (Ethernet and/or fiber). I know for a fact that the previous generation NC553m works just fine with vSphere 5.1 and the FlexFabric 10/24 modules, having set up a production VMware cluster on that hardware at a previous job. The NC553m does mention compatibility with VMQ and VMware NetQueue. Just a hunch but I’m betting VMware would work just fine with the NC554FLB adapter.

    • December 20, 2013    

      Called HP to verify that the nc554FLB adapters do support VMQ since it isn’t specified in the quickspec. They do support it.

    • Chris Butler
      May 1, 2014    

      > “Just a hunch but I’m betting VMWare would work just fine with the nc554flb adapter.”

      Unfortunately on the vSphere side we have had PSODs and host NIC uplink outages for the past 14 months related to the NC554 adapter. NIC uplink failures occurred using ESXi 5.1 and 3 variants of bnx2 driver/firmware combos until Emulex found and patched an issue with mbox command timeouts. We recently upgraded to ESXi 5.5, which switched us to the elxnet driver/firmware. On ESXi 5.5 we have seen PSOD-based outages on 2 variants of firmware/driver. Over the past week we have been upgrading to a 3rd variant of driver/firmware. Today we experienced another PSOD but haven’t confirmed whether it is still an issue with the Emulex NICs.

      Glad to know these issues are hypervisor agnostic!

      • Chris Butler
        May 1, 2014    

        Forgot to mention we use stacked HP c7000′s with 4 Flex-Fabric modules in each. The blades are BL465c Gen8 with Emulex 554M and 554FLB adapters in each.

      • May 2, 2014    

        Hi Chris,
        Thanks for adding this information. I saw an earlier comment also referring to similar problems with Emulex. The fact that Emulex has serious problems on multiple platforms really makes me wonder if Emulex wants to be a serious candidate for the future. I have lost all confidence and will move to other vendors.

        Best regards, Hans

  13. December 18, 2013    

    We recently added a new enclosure with 460 Gen8 blades. Experiencing the same issues with 2012 R2 HyperV and machines dropping connection until live migrated to a different host.

    I just added a new enclosure last week. It arrived with VC software 3.60. I noticed this in the VC log after bringing it up to VC 4.10. Could this new SR-IOV functionality be the cause of the issues? If I check in BIOS, SR-IOV is disabled.

    2013-12-12T23:44:43-06:00 VCEFX3C4218012H vcmd: [PRO::6043:Info] SR-IOV Virtual Functions added : Profile: Profile_lxiddb10, Enet connection: 3, Number VFs: 24
    2013-12-12T23:44:43-06:00 VCEFX3C4218012H vcmd: [PRO::6043:Info] SR-IOV Virtual Functions added : Profile: Profile_lxiddb10, Enet connection: 4, Number VFs: 24

  14. December 18, 2013    

    Just spent most of the day with an MS PFE onsite diving into this issue. It appears that disabling Large Send Offload v2 for IPv4 and IPv6 on all NICs and team adapters has resolved the issue. Large Send Offload v2 is a known issue on Windows 2003 and Windows 2008 guest OSes, and if you are running those as guests you need to disable it on the host NICs.

    • December 20, 2013    

      I spoke too soon. LSOv2 didn’t fix the issue, although it did appear to make the issue less frequent. We have found out since that most of the machines that are losing connectivity are sending lots of data. One VM is a SEP server and one is our SolarWinds poller. These two always lose connectivity eventually and it seems to be because of the bandwidth utilized. We have tried putting the machines into port classifications that match their bandwidth but this doesn’t help. We have another call with MS premier support today and I will post anything we find later.

  15. December 19, 2013    

    I have the same issue here. LargeSendOffloadV2 is disabled on all my NICs; however, when a guest OS is sending/receiving loads of data (backup-to-disk via LAN) the network connectivity of all guests drops.
    I have one NIC for Hyper-V access and one for the guest network, which is not shared by the host OS.
    When I disable and re-enable the second NIC, the problem is solved instantly.
    I’ve removed antivirus software and disabled LargeSendOffloadV2 on my Hyper-V Server 2012 R2, which runs on an HP ML350p Gen8 using HP Ethernet 1Gb 4-port 331i adapters.

    • adminHans
      December 19, 2013    

      Thanks for your additional comments
      -Hans

  16. January 7, 2014    

    Last week I upgraded our 4 node HP Proliant DL360p Gen8 cluster to 2012 R2 using this as a guide due to some issues on one node with the CSV.
    All nodes have the onboard HP Ethernet 1Gb 4-port 331FLR Adapter for Live migration/management/LAN and internet connectivity. Also a 2-port 530T Adapter for connection to our SAN using iSCSI.
    Validation reports were fine when starting the upgrade. However after the upgrade I am facing several problems and the problem mentioned above one of them.
    At this moment I am still facing 2 problems and fixed one.
    The first fixed problem was very high latency on the 1GB network connections (ping times between 100 and 600ms). After some digging I found that the driver that Windows Server 2012 R2 installed for the 1Gb NIC’s caused this problem. The fix was installing the latest 2012 driver from the HP site.
    After this was fixed the next problem occurred after a couple of days. On one of the nodes all virtual machines lost network communication on one of the adapters. This happened twice up till now so for now this node is offline for further investigation. The other nodes seem to run fine up till now.
    As mentioned in your blog, a live migration of the virtual machines and a reboot of the node will fix the connection problem, but that brings me to yet another problem. After the reboot I am not able to see the content of the ClusterStorage folder (and failback of the virtual machines is impossible). As mentioned, I had this problem also when running 2012 (on another node). Then it only occurred on one node. Now I have it on all nodes.
    To fix this I have to stop the cluster service and manually remove all connections to the CSVs (iSCSI targets). Then reconnect all connections and start the cluster service again. If there was a connectivity problem this would not work I guess.
    At this moment I want to tackle the problems one by one and to fix the connectivity issue I was wondering if anybody has an idea if I can use the newly released drivers (16.2.0.4b) of Broadcom instead of the HP driver now installed (16.0.0.17)?

  17. January 12, 2014    

    Just an update. Our issues of having the guest vNICs disconnect have not been resolved. It has been over a month with tickets escalated with HP and MS. However, it does appear that Emulex released a new driver on their website. http://www.emulex.com/downloads/emulex/drivers/windows/windows-server-2012-r2/drivers/

    The driver installs with a date of 11/17/2013 and version 10.0.430.1047. I’m going to go ahead and install the driver today and let you know if we have a fix.

    The problem appears to be a compatibility issue between the MS inbox teaming driver and the manufacturer’s driver. VMware does not have this issue as far as I know.

  18. January 14, 2014    

    Thanks for posting this Mark.

    We are also facing the same issue with c7000 and BL460c Gen8 servers. In fact, the initial issue I had was with the FlexFabric, where formatting a 1 TB disk was taking days to complete. After logging a call with HP, we got the Flex module replaced. Even after replacing the faulty FlexFabric module, the network disconnects kept happening. Recently, we upgraded our core networking infrastructure to 10G and after that my observation was that this issue came down drastically.

    However, after a few weeks of smooth running, I observed network disconnects with the SEP server yesterday. I see the comment of Will Moore, who is also having issues with SEP servers.

    Will Moore – please let us know if the Emulex driver upgrade helped in fixing this issue.

    Cheers
    Shaba

  19. January 14, 2014    

    Hi
    I have been in contact with Emulex regarding driver 10.0.430.1047 and they have confirmed that this driver does not fix this issue. We can’t wait any longer for Emulex and have ordered new QLogic NICs.
    -Robert

  20. January 16, 2014    

    Well, the Emulex 10.0.430.1047 driver DID NOT fix the issue for us. We decided to evict a cluster node, delete all the teaming adapters and set up a single vSwitch with a single NIC, still with VLAN trunking from HP VC. We put one of the machines that has the network disconnect issue on this standalone host, tagged the NIC with the correct VLAN and let it run over the weekend last weekend. The machine disconnected from the network on Sunday morning, except this time all you had to do was disconnect/reconnect the guest NIC from the network through Hyper-V settings and that brought the machine back online with no hanging or host reboots (this was all prior to the Emulex driver update).

    I updated the driver on Sunday after the machine disconnected and made sure the machine was back on the network after the driver update (verified by others as well). On Monday morning, I checked the machine and it was again disconnected from the network, only this time the guest NIC settings had completely disappeared (no VLAN tag, no network connected, nothing); very strange. So, we decided to completely eliminate VLAN tagging from the equation and set up HP VC to present only one network on the NIC. I reconfigured the guest NIC with no tagging and it has not disconnected once since Monday. This is the longest the machine has been on the network since our issue occurred.

    At this point, it appears that the issue may be related to the way Hyper-V virtual switches and VM networks handle trunking/VLAN tagging and stripping those tags. It’s too early to say it doesn’t have something to do with teaming also, but we sort of eliminated that by reproducing the same issue on a switch connected to a single NIC (with tagging at the vNIC level). I am going to test the same scenario with teamed NICs connected to a vSwitch without VLAN tagging and see if this works. In the end, if VLAN tagging/trunking is the culprit, we will have to re-evaluate our design and there will be a lot of functionality (converged networking) lost until there is a fix. VMware has been doing this for years with little or no issues. We have decided to bail on 2012 R2 for the moment. Microsoft should be paying us to beta test their software….

  21. January 20, 2014    

    We have 1 out of 4 hosts that has this problem. Since last week I have been gradually increasing the load on the host. I started with 1 VM and after a couple of days added one more. Up till now no disconnects and all seems to run smoothly. I just added 2 more VMs to the host and let’s see what that brings.
    FYI, we do not use VLAN tagging/trunking and also no teaming at this point, although the teaming is at least one of the reasons we upgraded to R2.

    • January 22, 2014    

      Yesterday the second node started having this problem. I guess we will have to add some NICs too and stop using the onboard NICs

    • January 22, 2014    

      I just upgraded the node that started to show this problem first to the latest version of the Microsoft Windows Server 2012 R2 Supplement for HP Service Pack. Date on the FTP server is 1/11/2014. ftp://ftp.hp.com/pub/softlib2/software1/supportpack-generic/p1235385378/v89111

      It updated drivers and firmware of all the NICs! We have a 530T dual-port NIC in the server that lost all its settings and had to be reconfigured. So if you upgrade, keep that in mind.

      Let’s hope this fixes the problem. It also fixed our iSCSI problem! The host is back in production and hopefully it will stay online.

      If so I will update the rest of the nodes and keep you posted.

      • adminHans
        January 22, 2014    

        Hi Frank,

        We used the Emulex driver in that Nov 1 2013 update, but to no avail in our case.

        Best regards, Hans

        • January 23, 2014    

          We have no hardware that needs the Emulex driver.

          We have broadcom based NICs that have the same problem.

          • January 23, 2014    

            We installed the supplement from HP right after it came out. In our case, it didn’t update the Emulex driver or firmware so it didn’t fix our issue. However, MS tier 3 support gave us two new things to try yesterday. We implemented the hotfix on a standalone node (no teaming or VLAN tagging) and the PowerShell offload script on the cluster (teamed NICs, converged networking), hoping to narrow down the fix. I will update if the VMs still disconnect.

            Ran this on cluster nodes: Set-NetOffloadGlobalSetting -TaskOffload Disabled

            Installed this on standalone: http://support.microsoft.com/kb/2913659

          • February 4, 2014    

            If you install using HP SUM it will not install cp021705.exe; it tells me the server is already up to date. However, running it manually will update the NIC firmware, boot code, etc.

            The server I updated a week ago has not gone down since the update.

  22. January 21, 2014    

    We have built a windows 2012 4 node cluster using no teaming, no VLAN tagging, and gone back to using the traditional 5 networks (CSV, HB, LM, Host, VMs) for Hyper-V. The problem of the guest vNICs disconnecting from the network still remains. This is on the same bl465c gen8 hardware using HP virtual connect. Needless to say, this is beyond frustrating. Not sure where to go from here.

  23. January 27, 2014    

    This hotfix http://support.microsoft.com/kb/2913659 seems to have fixed the issue we were having with guest vNICs disconnecting from the network after an intermittent amount of time. After patching our cluster nodes with the hotfix, we haven’t had a VM guest lose network connectivity for over 24 hours. It was happening quite regularly with several VMs that are sending/receiving lots of network traffic. If you haven’t applied this hotfix and you are experiencing this issue and/or others with your virtual switches, do it!

  24. January 30, 2014    

    So the fix fixed the problem for us. We were using Broadcom NICs; the pain is over.

  25. January 31, 2014    

    The patch didn’t help – for 3 days everything was good, and then the problems began again. For now only migration helps.

    • February 4, 2014    

      It appears I spoke too soon. The patch did help the situation by prolonging the time in between failures but we have had 2 instances of machines disconnecting from the network since applying the patch. Live migrating the machines will bring them back online but this is not fixed obviously. The suggestion made to me by MS tier 3 support was to disable task offloading on all the NICs on the cluster nodes. This setting, along with the patch should work around the issue.

      Set-NetOffloadGlobalSetting -TaskOffload Disabled

      • February 5, 2014    

        Task offload doesn’t seem to have fixed the issue either. I think it’s officially time to call Hyper-V 2012 R2 a bust until MS can retool it to actually work the way it was intended. VMware or XenServer, I’m sure, work just fine. Unbelievable.

        • Andre
          March 21, 2014    

          Hi,

          with VMWare 5.x we’re having the same issues as Hyper-V. Disconnects, high latency issues, etc.

  26. February 3, 2014    

    Have any of you managed to get results with this hotfix?

  27. February 6, 2014    

    Hi Will,

    What is the performance impact of this setting?

    Up till this morning I had high hopes the issue was fixed by applying the latest supplement from HP and in particular the broadcom update. But this morning around half past 11 after 2 weeks one of our servers lost connection on one port again. So my last resort is this setting but I need to know what the performance impact will be.

    • February 7, 2014    

      There is a performance impact but it seems negligible with modern hardware. That being said, the setting is there to enhance performance obviously so disabling it is really a “workaround” IMO. The patch definitely prolonged the outages but it didn’t fix the underlying issue. We are only seeing the issue now on guest VM’s that have really high network utilization.

      The issue is still there for us WITH the patch and task offloading disabled but only happening with 2 machines in particular now.

      • February 12, 2014    

        Hi Will,

        We escalated it to HP support; they are familiar with the problem and the engineer I spoke to had several cases he had worked on.

        They started with Action plan 1:

        Please perform the below activity in the Bios.
        Go to RBSU > Power Management Options > Advanced Power Management Options > Minimum Processor Idle Power State .
        Select No C-states

        Next a lot of disabling:
        Disable the VMQ (virtual machine queue) in the bios
        Disable the Large Send Offload,
        Disable TCP offload,
        Disable IP SEC Offloading,
        Disable Receive Side Scaling

        I have disabled all of them and going to increase the load on this node to see if that fixes our problem. However I agree with you that this is at best a workaround.
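        For reference, the NIC-level items on that list can also be applied from PowerShell rather than per adapter in the GUI; a hedged sketch that hits every adapter at once (the C-state change itself still has to be done in RBSU):

        # Disable VMQ, LSO, IPsec task offload and RSS on all adapters
        Get-NetAdapter | Disable-NetAdapterVmq
        Get-NetAdapter | Disable-NetAdapterLso
        Get-NetAdapter | Disable-NetAdapterIPsecOffload
        Get-NetAdapter | Disable-NetAdapterRss
        # Disable TCP task offload globally
        Set-NetOffloadGlobalSetting -TaskOffload Disabled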

        • February 12, 2014    

          Thanks Frank. Please keep us updated on your progress.

          I’ve pretty much given up on R2 at this point. Fortunate for Microsoft is the fact that it’s a lot harder to justify soft costs than financial ones; otherwise, I would’ve made a case for VMware about 2 months ago.

          This is my first experience with Hyper-V and it has been such a bad experience that I’m not sure I will ever recommend it to anyone. I ran the same hardware setup with VMware 5.1 and vCenter at my previous job and it ran FLAWLESSLY. I had high hopes for SCVMM and Hyper-V on 2012 R2 with all the new features (stuff VMware has done for YEARS) but this whole experience has pretty much ruined my opinion of Hyper-V.

          • February 13, 2014    

            Hi Will,

            We run HyperV from the beginning and up till now it has been rock solid stable. We had some minor issues but we also had issues with VMWare and for us HyperV was the solution.

            Today I received a mail from HP support to send them some pictures of the NIC’s

            Need more information:
            1. Has the customer tried with different CAT5 or CAT6 cables.
            2. Please ask the customer to try with different cables to confirm whether it is a cable issue or NIC Adapter issue.
            4. Please capture high resolution pictures of the NIC Adapter PORTs.
            5. Please capture high resolution pictures of the NIC Adapter PORTs with CAT5 cable connected to it.
            6. Please capture high resolution pictures of the NIC Adapters showing the HP Labels on them.
            7. What exactly happens when the NIC Cable shakes, when touched or pulled gently, when connected to the NIC Adapter?
            8. Any Link LEDs on the NIC Adapter when there is a link loss due to loose connectivity?

            Guess the ghost in the datacenter is pulling some cables at night.

          • February 13, 2014    

            BTW I am adding a Intel based NIC to every node today to see if that solves our problem. We cannot wait for a solution from HP or MS at this moment.

  28. Arnold
    February 13, 2014    

    We have the same issue, and we already upgraded our cluster machines to Gen2, so our workaround is that we just bought an Intel card and have no problems.
    So the problem is with Broadcom and Emulex NICs.
    More info:
    http://www.hypervtx.in/archives/marcve/2013/11/vnics-and-vms-loose-connectivity-at-random-on-windows-server-2012-r2/

  29. February 13, 2014    

    I know it isn’t completely fair to bash Hyper-V because it does work well in certain instances but up until the 2012 release it was nowhere near the enterprise product that every other company had offered. Microsoft has done a good job of making the new release competitive in the market but if the new features don’t work what good are they? It’s too bad that we couldn’t get it to work in our environment.

    Good luck and I hope they eventually figure out why this is happening.

    Is there anyone on here running a 2012 R2 hyper-v cluster with SCVMM networking applied, teaming, VLAN trunking, etc. in a production environment?

  30. Richard
    February 13, 2014    

    I am very concerned by this problem as I am in the planning phase for a multi-cluster deployment of Hyper-V 2012 R2 with SCVMM 2012 R2, with a fully converged networking model (multiple vNICs). The target hardware is IBM x3650 M4 servers with… Emulex 10 GbE dual-port NICs. @richardlemelin

  31. Joerg Maletzky
    February 13, 2014    

    Hi Will,

    Did you disable encapsulated packet task offload with the Disable-NetAdapterEncapsulatedPacketTaskOffload cmdlet?
    http://msdn.microsoft.com/en-us/library/windows/hardware/dn144775(v=vs.85).aspx

    You can check it with the Get-NetAdapterEncapsulatedPacketTaskOffload cmdlet.

    Greetings
    Joerg

  32. Tim Fleming
    February 13, 2014    

    We ran into exactly the same issue in server 2012 RTM. We’re running 2 x 5 node HP DL360p Gen8 clusters with 4 x 10gig Emulex NICs (NC552 & NC554), Load balanced failover teams, logical virtual switches etc.

    We started noticing problems in early September, which seems to correlate with when we started to load the clusters up with more VMs. This was supported by being able to reproduce the issue by loading up a server with lots of smallish VMs, so it is the number of VMs, not the load.

    Live migrating the guest to another node temporarily restored network connectivity, we ended up writing a powershell script to ping the VMs and migrate them if they stopped responding.

    We logged our MS “Premier” support job on the 16th of Oct and after many hours (20+ wasted premier support) of emails and calls with both HP and MS, we discovered our own work around and closed the job on the 4th of December. We found that it was VMQ causing issues and after running the below script we thankfully haven’t seen a reoccurrence.


    get-netadapter -name | disable-netadaptervmq
    sleep 30
    while((get-netlbfoteam |where-object {$_.name -like "*"}).Status -notlike "up"){ sleep 5}
    get-netadapter -name | disable-netadaptervmq

    Like the above posts, we went through different firmware, drivers, TCP offload settings and even bcdedit.exe /set USEPLATFORMCLOCK, all with no result. The annoying thing with testing is that we would think we’d finally found a fix and then have the issue occur again after a day or 2.

    It is very disappointing to hear that this is still an issue in 2012 R2, however I’m glad we didn’t waste our time going to R2 which is what we had talked about as a possible work around.

    As an aside we’re running the exact same hardware for 2 VMWare clusters which haven’t missed a beat. It certainly makes it hard to justify the ‘cost savings’ of running Hyper-V when the result has been significant disruption to businesses and incredibly higher support costs. Our planned migration from VMware to Hyper-V has been in progress for almost 3 years now, if VMWare didn’t cost us significantly more than Hyper-V I’d be pushing for the opposite.

  33. Joerg Maletzky
    February 14, 2014    

    Hi Will,

    Did you disable encapsulated packet task offload with the Disable-NetAdapterEncapsulatedPacketTaskOffload cmdlet?
    http://msdn.microsoft.com/en-us/library/windows/hardware/dn144775(v=vs.85).aspx

    You can check it per Get-NetAdapterEncapsulatedPacketTaskOffload cmdlet.

    Greetings
    Joerg

    • Samirf
      February 16, 2014    

      Hi,
      We experienced the issue with Windows Server 2012 (not R2); we use HP FlexFabric 10Gb 2-port 554FLB adapters on Gen8 HP blades. The workaround (according to the Windows Server 2012 NIC Teaming (LBFO) White Paper) is to configure the core assignment when using VMQ and Hyper-V Port, so the processors don’t overlap.
      ———————————————————
      • If the team is in Sum-of-Queues mode the team members’ processors should be, to the extent possible, non-overlapping. For example, in a 4-core host (8 logical processors) with a team of 2 10Gbps NICs, you could set the first one to use base processor of 2 and to use 4 cores; the second would be set to use base processor 6 and use 2 cores.
      • If the team is in Min-Queues mode the processor sets used by the team members must be identical.
      ——————————————————–
      Maybe it may help.
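      Translated into PowerShell, the white paper example for a 2-member Sum-of-Queues team could look like the sketch below; the adapter names are placeholders and the processor numbers must be adjusted to your own host:

      # First team member: base processor 2, up to 4 processors
      Set-NetAdapterVmq -Name "NIC1" -BaseProcessorNumber 2 -MaxProcessors 4
      # Second team member: base processor 6, up to 2 processors
      Set-NetAdapterVmq -Name "NIC2" -BaseProcessorNumber 6 -MaxProcessors 2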

      Samir FARHAT
      Infra Consultant

      • February 25, 2014    

        Hi Samir,

        That might be an important detail to add as often mistakes are made with teams of this kind.

        -H

  34. Hayden
    February 14, 2014    

    Hi Guys,
    We are tracking this with HP also – the current workaround is as stated; disabling VMQ has stabilised this for us. HP have provided us with a beta Emulex driver for testing purposes. I will post any results/findings.

    Cheers,
    Hayden

    • February 25, 2014    

      Hi Hayden,

      Could you please verify if these beta drivers are the same firmware and driver version as the one released on Feb 18:
      cp022157.exe was released with the HP Service Pack for ProLiant (SPP) Version 2014.02.0

      What we observed is that VMQ is fixed but that Live Migration is very slow on an LBFO team without a vswitch (so we should benefit from RSS)

      Cheers, Hans

  35. February 17, 2014    

    I added the Intel NICs Thursday and up till now no more disruptions.
    The Intel-based cards use an MS driver instead of an HP driver, so maybe that will help also.

    HP support is giving it a 50/50 chance it will work. They promised to bring a fix for the Broadcom NICs in a future release. However, the ETA is unknown.

    • February 18, 2014    

      A new Emulex driver was launched today

  36. February 18, 2014    
  37. Joerg Maletzky
    February 19, 2014    
    • February 19, 2014    

      I saw this and installed it last night. There are no notes as to what it fixes but my HP rep is looking into it. We’ll see…..

      • February 19, 2014    

        The new Emulex drivers don’t fix the issue; however, with the help of some of my colleagues we have figured out a way to repro the issue on command. This will hopefully lead to some relevant memory dumps that MS can use to get us a final fix!

        • Theodoor van der Kooij
          February 20, 2014    

          I have been following this article since the start of this year because we upgraded our cluster servers to 2012 R2 at the end of December. A few weeks later we got problems with the network interfaces. It looks like we have the same problems everybody else here has (we use the standard Broadcom adapter). If you really can reproduce the issue on command I would be very interested in how to do that.
          We have logged cases with HP and Microsoft and perhaps they can do something with the information.

        • February 20, 2014    

          Hi Will.

          Could you tell us how to repro the issue on command?

          That way I can rather quickly find out whether our Intel NIC solution will work.

          • February 24, 2014    

            Here is how we have successfully reproduced the issue. We setup 3 test VM’s and are using D-ITG to generate network traffic between two of the machines while at the same time using the SQL replay agent to simulate SQL transactions. More often than not, D-ITG is enough to recreate the issue without running the replay agent at the same time.

            You have to make sure that the sender and receiver computers for D-ITG are not on the same host. If sender and receiver are on the same host the issue will not occur since the traffic never leaves the hyper-v switch.

            For the SQL replay agent we setup computer A as the controller and replay agent, Computer B is a replay agent, and Computer C is setup as the SQL server.

            SQL replay setup: http://blogs.msdn.com/b/mspfe/archive/2012/11/08/using-distributed-replay-to-load-test-your-sql-server-part-1.aspx

            D-ITG Download: http://traffic.comics.unina.it/software/ITG/download.php

            Setup D-ITG to send 20000 or more packets per second with the max packet size (65535). This is the quickest way to make the issue occur.

          • February 24, 2014    

            Hi Will,

            Thanks for submitting your repro.
            My repro is fairly simple:
            Create a simple 2-node WS2012 R2 guest cluster (no roles configured)
            Use a file share witness
            Spread the 2 nodes across the Hyper-V cluster nodes
            Do a couple of live migrations of the guest cluster nodes
            With VMQ enabled, we soon see network disconnects, partitioned networks and cluster node eviction.

            Best regards,
            Hans

        • February 25, 2014    

          See my answer to Hayden (25 Feb 14)
          -H

  38. February 25, 2014    

    In the documentation of cp022157 (latest Emulex fw/driver) only these improvements are mentioned:
    • The maximum supported transmit queue depth is now 2048
    • This driver addresses an issue that prevents enumeration of Virtual Functions connections in an HP Virtual Connect version 4.10 environment.
    This is more SR-IOV related and not VMQ unless the increased transmit queue depth is responsible for VMQ fix.

    • February 25, 2014    

      Are you still able to make the disconnect issue occur after updating the drivers?

      • AJ
        February 26, 2014    

        Hi all,

        The HP cp022157 does not address any VMQ issues. I have it on good authority that a new driver+firmware combination is scheduled to be released by the end of March to address a number of VMQ issues that have been identified. Some of you may have already been provided a beta of this package and I think we would all welcome you to share your experience.

        Also please note that the major server vendors such as Dell, HP, IBM, etc. do not create the firmware or drivers for these NICs. Instead they obtain the firmware and drivers from the OEM such as Broadcom, Emulex, QLogic, Intel, Mellanox, etc. Many of the server vendors do test the firmware and driver combination provided by the OEM against their specific equipment – and 99% of the time if an issue is found it affects all server vendors.

        In the case of Windows and Hyper-V, most testing is limited to that required in the Windows HCK to achieve logo certification. Given the significant number of issues that have been identified in regard to NICs and their offload capabilities, one could infer that the Windows HCK tests used by these hardware manufacturers may be less than “complete”.

      • February 26, 2014    

        Yes, some of our customers are still able to reproduce the disconnect issue, a.k.a. the VMQ issue, after updating to version 1109 of the driver.

        • February 27, 2014    

          Why is it that you can’t add anything (except disk) to a running VM? How long has VMware been able to do this? Hyper-V fail.

      • March 1, 2014    

        Hey Will,

        I’m not running the latest drivers from SP 2014.02. The cluster has been really stable since VMQ was disabled. I will wait until VMQ is addressed in the upcoming update.
        Have you updated and seeing disconnects? If so, what is LM speed?

        -H

        • March 3, 2014    

          I bailed on R2 and went back to 2012 R1 and it is very stable. We have an 8 node cluster with no issues.

  39. Matthew
    February 28, 2014    

    Definitely still an issue with us. We are using Intel X520s NICs, Cisco Nexus 3524s, and HP servers for our hosts.

    This is very frustrating, as you all are experiencing. We get intermittent drops using switch independent mode for certain VMs. We’ve ended up removing adapters from our teams for the meantime. I’m considering moving to a switch dependent mode with LACP. Any thoughts?

    Thanks,

    Matthew

    • March 3, 2014    

      Go back to 2012 R1 if you have to stay on Hyper-V. R2 has too many issues. If you are using SCVMM R2 you can still use all the VMM networking, teaming, etc. with R1. LM is a little slower but that trumps being unstable.

  40. Lars R
    March 10, 2014    

    http://support.microsoft.com/kb/2913659/en-us Does this have anything to do with this problem?

    • March 10, 2014    

      Hi Lars,

      Although an important hotfix, it is not the solution for the VMQ problem with Emulex adapters. A driver/firmware pair is being tested by MS/HP/Emulex and is expected to be available end of March/early April.

      Best regards, Hans

      • Valdi Hafdal
        March 12, 2014    

        This is unbelievable… this is a very serious issue and the time it has taken to get a resolution is pretty much unacceptable (this thread started on 22 November). I was starting the migration process to W2012R2 and host number 2 has this strange behaviour. I really hope the hardware manufacturer releases a fix to correct this as soon as possible!
        I disabled VMQ on the network adapters to see if that would bypass the issue but that was not the case. I un-teamed the adapters and created a VM switch directly on one adapter; that did not do anything to help. I installed Service Pack 2014.02 and that did not solve the issue.
        I removed the VM switch and created a Logical Network to see if that would help, but the same behaviour.
        But I bet if I turn off all features of all the network adapters everything will be great………. Hmmm, why are we buying expensive hardware in the first place if we can’t use it? It’s like buying a condom that has the tip cut off!

        • Valdi Hafdal
          March 12, 2014    

          Maybe I should add that our environment is an HP BladeSystem c7000 with BL460c Gen8 blades! :)

        • March 13, 2014    

          Yes it is very sad that this problem is still unsolved. Most people I talked to were successful by disabling VMQ via the Emulex driver properties (both options) for EACH FlexNIC and again via PowerShell Get-NetAdapterVMQ -name “name of adapter” | Disable-NetAdapterVmq. Do not Hide Unused FlexNICs in the Virtual Connect Server Profile.
          -H

          • March 18, 2014    

            Disabling VMQ doesn’t fix the issue. It will come back. VMware fixes the issue. You know why this hasn’t become a bigger deal than it is to MS? No one runs Hyper-V 2012 R2 in production! You know why? It sucks!

  41. Mark House
    March 14, 2014    

    I can confirm at least the BSOD issue on Dell servers with Intel x520 10Gb nics. We have a six server farm on 2012R2, and whenever a live migration was triggered manually or by VMM, the BSOD would follow. We never had the issue with guests losing connectivity though.

    At first, the crashes didn’t point to anything in particular but Dell support tracked it down to the intel nic driver.

    We tried different driver versions (Dell, Intel, and MS), but no fix. It wasn’t until reading this blog and disabling VMQ that we were able to stop the crashes, and it’s been 100% stable since.

    Thank you to hyper-v.nu and everyone that replied.

    • Leonid
      March 16, 2014    
      • Mark House
        March 17, 2014    

        Not yet, Leonid. Unfortunately I’ll have to wait until my next maintenance window to try it out. We’ve been rock solid stable (knock on wood) for a good three weeks now with VMQ disabled.

        That being said, I’d rather have it (VMQ) on and working properly…

        • Mark House
          March 18, 2014    

          Hmm… read this link over at Aidan Finn’s site regarding that update: http://www.aidanfinn.com/?p=16031 .

          He’s saying the update is not related to the issues with the 10Gb cards.

          Now I get worried… Perhaps there will be a hotfix to the hotfix mixed with a driver update?

          • Jeff Graves
            March 27, 2014    

            There are two issues in play. We have the BSOD-on-live-migration problem with KB2887595 on Intel X540-T2 NICs as well. Uninstalling that hotfix resolves that issue. The same LBFO driver is in KB2913659, so if you install that, you’ll have the same BSOD problem. Right now, the only workaround for BSODs during live migration with the Intel NICs is to remove those updates.

  42. Egils Kaupuzs
    March 18, 2014    

    I have a WS 2012 R2 Hyper-V cluster with the latest current MS updates including http://support.microsoft.com/kb/2913659 installed.
    Servers have Emulex OneConnect OCe11102-N-X 10GbE NICs.
    VMs were experiencing network disconnections of 4-8 seconds during live migration, apparently long enough for SCOM to send alerts each time a VM was live migrated. An interesting fact I noticed is that VMs running WS 2008 R2 as the guest OS experience shorter disconnections than ones running WS 2012 R2 as the guest OS.
    In the hope of fixing this, I installed the latest Emulex driver (10.0.718.26-9) and firmware (10.0.803.19), both released Feb 2014 (http://www.emulex.com/downloads/emulex/drivers/windows/windows-server-2012-r2/drivers/).
    After the update the previous problem remains and another, far more critical one appeared.
    Now when changing the “Enable virtual machine queues” setting in a VM’s vNIC properties, the operation takes a long time, times out, the Hyper-V host locks up and all VMs on it lose network. The only fix is to restart the Hyper-V host, which crashes all VMs on it.
    As suggested, I will try to disable VMQ in the Emulex drivers and on the hosts and see if it helps.
    -Egils

    • March 18, 2014    

      Hi Egils,
      It doesn’t seem to get any better. The driver you downloaded states that only WS2012, WS2008 and WS2008 R2 are supported.
      What server hardware do you use? We are close to retesting with current HP driver and new firmware.
      For now disabling VMQ is the safest workaround.
      Best regards, Hans

      • Egils Kaupuzs
        March 18, 2014    

        Well on the driver page under OSes, WS 2012 R2 is not listed, but in User Manual under Operating System Requirements, it is listed.
        -Egils

        • Egils Kaupuzs
          March 18, 2014    

          Btw, the servers are x3690 X5. Our other WS 2012 R2 cluster on Dell PowerEdge servers and Intel X520 NICs doesn’t have such issues.
          -Egils

          • March 18, 2014    

            Thanks for update. Intel is usually ok, although some have reported problems with Intel in this thread with HP drivers
            -H

        • March 18, 2014    

          So Emulex forgot to update compatibility list then. What’s new ;-)

      • Alexander Eriksson
        March 18, 2014    

        This issue occurred for us after we migrated to 2012 R2.

        We are using HP DL360G8 machines with HP Ethernet 1Gb 4-port 331FLR Adapter in a LBFO Team for the Virtual Switch. We are only running VM Guest traffic on this NIC Team.

        When is a fix expected?

  43. Egils Kaupuzs
    March 25, 2014    

    Hi,
    After disabling VMQ in Emulex drivers things look better now. 0-1 pings lost during live migration and no host lockups when changing VMQ on VMs.
    Egils

  44. Dale
    March 25, 2014    

    HI All,

    Did you manage to resolve this?
    I have the same issue.

    My guests running in a 2012 R2 environment just intermittently drop the LAN.
    The fix is to edit the guest settings, disconnect the NIC from the vSwitch, apply, and then reconnect.

    Regards

    • March 25, 2014    

      Unfortunately it is not so simple. The connectivity issues will come back.
      The only workaround in our current configuration with HP Virtual Connect Flex-10 and HP/Emulex 554FLC 10GbE NICs is to disable VMQ in the driver and with PowerShell.
      Be careful when you do this as it will disconnect all your VMs from the network. So evacuate all VMs first before you do this.
      Best regards, Hans

    • Nathan Raper
      March 25, 2014    

      Just chiming in for the sake of solidarity. I am a VMware user currently experiencing the exact same behavior as outlined above. I spoke to VMware and the engineer I spoke to was leaning toward it being a Windows issue as well. This is the first time I’ve experienced it, all with a fleet of 10+ Windows Server 2012 R2 machines I just provisioned in the last week.

      I’m trying to figure out a potential fix involving delaying the probing response times at the network level. I’ve found a possible fix involving IOS, but not NX-OS (and we’re using Nexus switches).

      • March 26, 2014    

        Hi Nathan,
        I appreciate your solidarity and as some other commenters have observed, this Emulex/VMQ issue does not seem unique to Windows Server 2012 R2/Hyper-V.
        If you could keep us updated about your observations, please report back!
        Best regards, Hans

      • March 26, 2014    

        Hi Nathan,
        Correct me if I’m wrong, but VMware applies NetQueue, which also leverages the RSS hardware queues, similar to VMQ. If there is a firmware issue in the Emulex NIC related to the RSS queues, then it would be only logical that VMware users also see the same problems. We could verify this theory by disabling NetQueue.
        Could you give that a try?
        Best regards, Hans

      • Bjorn Lagace
        March 27, 2014    

        Hi all,

        We also own a c7000 – FlexFabric 3.75 – BL460c Gen8 setup and experienced a similar issue.

        My symptom was the following :

        A new server profile in FlexFabric -> assigned to a new blade.
        2 network cards on the default public LAN.
        Both received a DHCP address.
        One worked perfectly, the other dropped random pings and sometimes had timeouts of seconds.

        After a half day troubleshooting I found out the following :
        Problem stayed within the chassis, placing that server in another chassis no problem.
        Problem moved from slot, same server and profile but other slot…same problem.
        Problem stayed within profile !!!!! I left the profile unassigned and created a new one….problem gone!
        Assigned the old profile to another server…and yes that server dropped pings.

        My guess :
        Each time you create a server profile, it picks a mac address out of the virtual available pool.
        Delete the corrupt profile and create a new profile, ended in having that same mac address in the server again and failing.
        So I named that profile ‘dont_use’ and left it unassigned, took a new profile and solved my problem.
        Dunno if it’s 100% related to this case, but if you need details, just contact me.

        Bjorn Lagace
        System Engineer
        bjorn.lagace@terbeke.be
        0032 495 58 62 84

        ps: I’m planning to upgrade my FlexFabric to the 4.x release…but after reading the above horror story…I’ll wait a bit.
        ps2: So far I haven’t contacted HP, because their support is a disaster as the first level never puts me through to second/third level. And what’s the use of having to talk to somebody that knows less about it than you to solve a problem?

        • March 27, 2014    

          Hi Bjorn,
          That is an interesting find. Please let me know if you don’t get vNIC disconnects after your move to VC 4.x.
          Thanks, Hans

          • Bjorn Lagace
            March 28, 2014    

            Will do, but I decided to give it a wait.
            I already have an ‘incompatible’ upgrade of my IRF stack this weekend :-)

        • AJ
          March 29, 2014    

          At least for my environment I am using factory MACs (not VC assigned MACs) and still have the problem. Recreating the Server Profiles in my case does not impact the issue.

        • April 3, 2014    

          I have many C7000 enclosures. The ones in non prod that are running 4.10 Virtual connect have this issue. The ones in production running 3.70 do not have this issue.

          • April 4, 2014    

            Hi Bjorn,
            That is an interesting comment you made, but could you please provide a little bit more detail for the production/working situation:
            OS Version
            Blade model
            Emulex NIC name/model
            Emulex firmware version
            Emulex driver version
            Team configuration
            vSwitch configuration

            Thanks, Hans

          • Bruce Lautenschlager
            April 10, 2014    

            @Jim – that’s interesting. We started testing with VC firmware 3.75, but had such bad pause frame issues we were forced to go to VC 4.xx firmware mid last year. So I never got far enough to test it with the earlier VC firmware.

            But we’re definitely suffering from the VMQ issue (C7000s, G8s with Emulex, Flex 10, 2012 R2/Hyper-V) and we can’t go back to older VC firmware – pause frame issues will just plain disable one of the Flex-10s in the chassis within a day.

            Honestly, we’d have had this in production months ago were it not for these two issues, but we can’t get past POCs because it’s so flaky. Disabling VMQ is indeed a workaround, but a poor one – I can’t see spreading this around our data centers in this fashion.

            Anyone hear anything new on firmware/drivers from Emulex? Will it be just drivers, or both?

            Thanks,
            Bruce

  45. Nils
    April 4, 2014    

    Like Mark House, we also get BSODs with Intel X520 10Gb cards and IBM x3550 M4 servers.
    I disabled VMQ on all NICs and that fixed the problem. Thanks hyper-v.nu, you saved me a lot of time debugging.
    I hope the issue will be fixed in the new update pack from MS which comes on April 10th.

    • April 4, 2014    

      Hi Nils,
      Although disabling VMQ is a workaround we should get rid of as soon as possible, it is good to read that you can now run without BSODs.
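      For anyone still needing the workaround: VMQ can be switched off (and back on) per physical adapter from PowerShell; a minimal sketch, the adapter name is just an example:

        # List adapters and their current VMQ state
        Get-NetAdapterVmq

        # Workaround: disable VMQ on the affected adapter (example name)
        Disable-NetAdapterVmq -Name "Ethernet 1"

        # Re-enable VMQ once fixed firmware/drivers are in place
        Enable-NetAdapterVmq -Name "Ethernet 1"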
      -H

      • Alexander Eriksson
        April 29, 2014    

        We have been working with MS Premier Support on this case for a couple of weeks now.
        The issue is that the synthetic vNIC driver crashes in the guests. Using the legacy vNIC driver makes it work, which indicates a driver/firmware problem.

        So according to MS we have to wait for HP, and until then disable VMQ.

        RESOLUTION: The permanent solution is pending the firmware and driver that need to be created and released by HP and Emulex.
        The workaround is disabling VMQ at the Hyper-V level and keeping the synthetic vNIC.

        • April 29, 2014    

          Hi Alexander,
          Thanks for your update. We realize it is taking an unusually long time to get this problem fixed. The last thing we heard was that Emulex has been testing beta firmware, but it has not been made available to HP yet. It is quite sad that so little communication comes from Emulex about the complexity of the problem. Meanwhile we are quite used to disabling VMQ and fully agree with the resolution you mention.
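          If anyone else wants to reproduce your synthetic vs. legacy comparison, a legacy (emulated) adapter can be added to a Generation 1 VM from PowerShell; a minimal sketch, assuming illustrative VM and switch names:

            # The VM must be off to add a legacy adapter; VM and switch names are examples
            Stop-VM -Name "TestVM"
            Add-VMNetworkAdapter -VMName "TestVM" -IsLegacy $true -SwitchName "LogicalSwitch01"
            Start-VM -Name "TestVM"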
          Best regards,
          Hans

  46. April 8, 2014    

    HP has released “critical” firmware for the Emulex-based FlexFabric blade adapters. The HP link is really long; just search for HP advisory c04218016.

    We had BL460c Gen8 servers in a C7000 with Virtual Connect 4.10 firmware. Before applying this critical fix, we were unable even to create a two-host cluster. There were various errors, but the most prevalent were WMI errors during the cluster validation tests.

    Hopefully this VMQ nightmare is about to end…

    • AJ
      April 9, 2014    

      The aforementioned fix does not specifically address VMQ.

  47. Valdi Hafdal
    April 11, 2014    

    Things are looking up.
    I contacted HP and they said that this was an MS problem and I should contact them.
    I contacted MS and they started to go over all sorts of things related to the NIC (VMQ, RSS and offloading) and nothing helped… Then I did a little test: I changed the server profile in Virtual Connect, connected an adapter, and created a switch in Hyper-V to use that NIC… That behaved the same way…
    Then the amazing thing happened – I restarted the server and a message was displayed: “Virtual Connect is doing configuration changes – system will reboot”… After the reboot I have full connectivity on all virtual machines, using both the original switch and the new one.
    I will report more after more detailed testing. BUT IT LOOKS LIKE HP IS TO BLAME!
    Question: can somebody else change their server profile and reboot, to see if that is the case for more users?
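
    (For anyone repeating this test: a Hyper-V switch bound to a specific NIC can be created with the inbox cmdlet; a minimal sketch, switch/adapter/VM names are just examples:)

      # Create a test vSwitch bound to one of the newly connected NICs
      New-VMSwitch -Name "TestSwitch" -NetAdapterName "NIC5" -AllowManagementOS $false

      # Move a test VM's network adapter over to it
      Connect-VMNetworkAdapter -VMName "TestVM" -SwitchName "TestSwitch"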

    • AJ
      April 13, 2014    

      I’m glad the “reconfig” in VC appears to have resolved your problem. (It would be nice if you could outline what you changed in the server profile.) However, I suspect that your problem will reappear. If HP said it was a Microsoft problem, that typically means they found evidence of something (though that would surprise me). Maybe you have an HP case number you can share, so the rest of us can reference it with HP Support?

      • Valdi Hafdal
        April 14, 2014    

        Hi,
        I had NICs 1, 2, 3 and 4 connected to a network.
        I connected 5 and 6 but still had connection issues, so I created a VM switch on 5 and 6.
        I rebooted and got the message “Boot device is being configured by Virtual Connect. System will reboot shortly…” After the reboot, connectivity looked normal on the 5/6 team. I changed the VM to use the first VM switch on the 1/2 team and that was also working.
        The HP case: 4724723384

        I have put four VMs back into production and I really hope this is the end of this :)

  48. Eric
    April 17, 2014    

    We experienced the disconnect issue today with all the same symptoms as described in this post and thread.

    I will attempt to disable VMQ tonight and see if this happens again. We have been running this 2-node Hyper-V 2012 R2 cluster for a few months and this is the first time this has occurred.

    2x Dell PowerEdge R620 with Broadcom NetXtreme 57xx NICs
    Two NICs are teamed for cluster traffic with the Switch Independent configuration and Dynamic distribution, as recommended by Microsoft.
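
    (For reference, a team in that configuration is created with the inbox LBFO cmdlet; a minimal sketch, team and member names are examples:)

      # Switch Independent / Dynamic team, as recommended for Hyper-V on 2012 R2
      New-NetLbfoTeam -Name "ClusterTeam" -TeamMembers "NIC1","NIC2" -TeamingMode SwitchIndependent -LoadBalancingAlgorithm Dynamic

      # Verify the resulting configuration
      Get-NetLbfoTeam -Name "ClusterTeam" | Format-List TeamingMode, LoadBalancingAlgorithm, Members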

  49. June 10, 2014    

    We also have the same problems on four ProLiant ML360 Gen8 machines running Server 2012 R2 and Hyper-V. We found that the IPv6 connection still works when the IPv4 connection has lost connectivity. The problem is still not solved.

    Regards, Oli

  50. Choy
    June 16, 2014    

    Hi All,

    I am having the same issue, but not on HP hardware; it happened on my old Intel I350-T2 card.
    My system also runs 2012 R2, and Windows and the drivers are all up to date. My temporary solution is also to turn off the VMQ feature on the Intel driver’s advanced settings page.

    Besides this, I also noticed that if I limit the port to use only one processor, the connection losses start happening. However, the connection is stable if I limit it to 2, 3, or more processors (>1 processor).

    Strange…
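
    (The processor limit mentioned above can also be inspected and changed with the inbox VMQ cmdlets instead of the Intel advanced settings page; a rough sketch, the adapter name is an example and behaviour may differ per driver:)

      # Show the current VMQ processor assignment per adapter
      Get-NetAdapterVmq | Format-Table Name, Enabled, BaseProcessorNumber, MaxProcessors

      # Example: allow the adapter to use more than one processor for its VMQ queues
      Set-NetAdapterVmq -Name "Ethernet 1" -BaseProcessorNumber 2 -MaxProcessors 4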

  51. Ryan
    July 10, 2014    

    The issue still occurs with VMQ disabled and no NIC teaming, although the problem now takes two weeks to reoccur instead of happening daily.

    HP ML350 Gen8 with the 4-port 331i NIC on the 16.4.0.1 HP driver, which is the latest.

  52. Np
    August 5, 2014    

  Trackbacks

  1. The story continues: vNICs and VMs loose connectivity at random on Windows Server 2012 R2 on November 26, 2013 at 13:17
  2. Monthly Roundup: November 2013 on December 4, 2013 at 17:45
  3. November 2013 UR KB2887595 causing network problems on December 22, 2013 at 17:51
  4. Virtual Networking Problems With Emulex NICs And Windows Server 2012 R2 on February 13, 2014 at 17:13
  5. hyper-v.nu – Definitive Guide to Hyper-V R2 Network Architectures on February 28, 2014 at 14:28
  6. What the heck is wrong with Hyper-V/SCVMM 2012? | The Admin Rules on March 18, 2014 at 22:13
  7. Hyper-V 2012 R2 virtual machines lose randomly network connections . Be carefull with Emulex NICs! | UP2V on June 16, 2014 at 16:36
  8. hyper-v.nu – Emulex driver and firmware update on June 20, 2014 at 08:11
