
High KAVG and disk latency on MSCS virtual servers

I came across this at a client's site the other day.

Two MSCS clusters connected to SYMMETRIC storage were seeing 2000ms latency on the RDMs attached to the servers.

At first glance everything appeared to be set up correctly. The ESXTOP stats, however, showed that all of the time was being spent in the kernel: KAVG climbed to upwards of 2000ms and commands queued, then the latency would clear and later return.

This made me look at the PSP configuration for the RDMs. It was set to Round Robin, which is not supported for use with MSCS, I am guessing for this very reason. It makes sense when you think about it: the virtual machine has a lock on the LUN, so when the path switches the queue builds up, and the KAVG is the time spent waiting in the kernel for the path to switch back.
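To check which devices are on Round Robin, the output of `esxcli storage nmp device list` can be scanned for the `Path Selection Policy` line of each device. The Python below is only a rough sketch of that idea: the sample output is trimmed and made up, the device IDs are hypothetical, and the helper names `parse_psp` and `fixed_psp_commands` are my own. It flags every Round Robin device, so you still need to pick out the ones that are actually MSCS RDMs before changing anything.

```python
# Illustrative, trimmed output in the shape printed by
# `esxcli storage nmp device list`. The device IDs are hypothetical.
SAMPLE_OUTPUT = """\
naa.60000000000000000000000000000001
   Device Display Name: Example Fibre Channel Disk (naa.60000000000000000000000000000001)
   Path Selection Policy: VMW_PSP_RR
   Path Selection Policy Device Config: {policy=rr}
naa.60000000000000000000000000000002
   Device Display Name: Example Fibre Channel Disk (naa.60000000000000000000000000000002)
   Path Selection Policy: VMW_PSP_FIXED
   Path Selection Policy Device Config: {preferred=none}
"""


def parse_psp(output):
    """Map each device ID to its Path Selection Policy."""
    policies = {}
    device = None
    for line in output.splitlines():
        if line and not line[0].isspace():
            # An unindented line starts a new device block.
            device = line.strip()
        elif device and line.strip().startswith("Path Selection Policy:"):
            # The trailing colon keeps "... Policy Device Config:" lines out.
            policies[device] = line.split(":", 1)[1].strip()
    return policies


def fixed_psp_commands(policies):
    """Build the esxcli commands that would move Round Robin devices to Fixed."""
    return [
        f"esxcli storage nmp device set --device {dev} --psp VMW_PSP_FIXED"
        for dev, psp in sorted(policies.items())
        if psp == "VMW_PSP_RR"
    ]


if __name__ == "__main__":
    for cmd in fixed_psp_commands(parse_psp(SAMPLE_OUTPUT)):
        print(cmd)
```

The script only prints the commands rather than running them, so they can be reviewed and run by hand on the host for the RDMs that need it.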

Please see THIS article from VMware for MSCS setup.

This had not been a problem prior to upgrading to ESXi 5.1. That is because any new device added in 5.1 that is in an active/active configuration is automatically assigned Round Robin by default. Prior to 5.1 it would be assigned the Fixed PSP for active/active, which is what it should be in an MSCS setup.

7 Comments

  1. rardoe
    rardoe August 20, 2013

    Helped me out today, thanks.

    • Scott Norris
      Scott Norris August 21, 2013

      no worries, glad it was of some help 🙂

  2. abhishek
    abhishek March 20, 2014

    Hi, please let me know whether, on ESXi 5.1, the Fixed PSP is applicable for Exchange Server 2007 with CCR using RDMs (non-shared).

    Regards,
    Abhishek Agarwal

    • Scott Norris
      Scott Norris March 20, 2014

      Hi Abhishek,

      No, I don’t believe so. With CCR, both nodes (active/passive) have, I believe, their own storage and do not run shared storage.
      This only applies if two Windows servers have access to the same LUNs, generally through MSCS. For example, this would be applicable if Exchange was running in SCC.

      If you look at KB1037959 there is a small section on CCR:
      “Non-shared storage clustering

      Non-shared storage clustering refers to configurations where no shared storage is required to store the application’s data or quorum information. Data is replicated to other cluster nodes (for example, CCR) or distributed among the nodes (for example, DAG).

      These configurations do not require additional VMware considerations regarding a specific storage protocol or number of nodes, and can be deployed on virtual in the same way as physical.”

      From this I would say RR should be fine, but if you are seeing high KAVG latency then change it to Fixed and see if it goes away. It can be changed live with no impact to the virtual machine or the host.

  3. abhishek
    abhishek March 20, 2014

    Hi Scott,

    Thanks for your swift reply, I hope this helps.
    I am getting KAVG values of about 40-50 ms on a couple of RDMs though, so I think I need to relook at the queue depth value as well.

    Regards,
    Abhishek

    • Scott Norris
      Scott Norris March 20, 2014

      40-50ms is way too high, but I don’t think it’s high enough to be a PSP issue.
      In ESXTOP, hit “u” and look at all the LUNs. If you see it queue and the latency go up, then the queue go down and the latency go down, it’s a PSP problem.

      If it is just a steady 40-50ms then it is almost guaranteed that it’s a queue problem.
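
The rule of thumb above can be written down as a small Python sketch that takes (KAVG, queued commands) readings noted from the ESXTOP “u” view and labels the pattern. The function name `classify_kavg` and both thresholds (2ms as "elevated" and a 10x swing as "bursty") are assumptions picked for illustration, not VMware figures, so adjust them to your environment.

```python
def classify_kavg(samples, elevated_ms=2.0, burst_ratio=10.0):
    """Label a series of (kavg_ms, queued) readings taken in time order.

    elevated_ms and burst_ratio are illustrative assumptions.
    """
    kavg = [k for k, _ in samples]
    lowest, highest = min(kavg), max(kavg)

    if highest < elevated_ms:
        return "ok"

    # Queue length observed at the worst latency reading.
    peak_queue = max(q for k, q in samples if k == highest)

    # Latency swings between near idle and very high, with queuing at
    # the peak: the pattern described for a PSP problem.
    if highest >= burst_ratio * max(lowest, 0.1) and peak_queue > 0:
        return "bursty: check the PSP"

    # Latency never drops back down: the pattern described for a
    # queue depth problem.
    if lowest >= elevated_ms:
        return "steady: check queue depth"

    return "unclear"


if __name__ == "__main__":
    print(classify_kavg([(1, 0), (2000, 30), (1500, 25), (2, 0)]))
    print(classify_kavg([(45, 4), (50, 5), (42, 4), (48, 5)]))
```

The first series mirrors the 2000ms spikes from the post and the second the steady 40-50ms case from this thread.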

      Cheers

  4. Marc
    Marc December 11, 2015

    Thanks a lot 😉
