SQL Server IO handling mechanism can be severely affected by high CPU usage
Are you using SSD or SAN / NAS based storage solution and sporadically observe SQL Server experiencing high IO wait times or from time to time your DAS / HDD becomes very slow according to SQL Server statistics? Read on… I need your help to up vote my connect item – https://connect.microsoft.com/SQLServer/feedback/details/744650/sql-server-io-handling-mechanism-can-be-severely-affected-by-high-cpu-usage. Instead of taking few seconds, queries could take minutes/hours to complete when CPU is busy.
In SQL Server when a query / request needs to read data that is not in data cache or when the request has to write to disk, like transaction log records, the request / task will queue up the IO operation and wait for it to complete (task in suspended state, this wait time is the resource wait time). When the IO operation is complete, the task will be queued to run on the CPU. If the CPU is busy executing other tasks, this task will wait (task in runnable state) until other tasks in the queue either complete or get suspended due to waits or exhaust their quantum of 4ms (this is the signal wait time, which along with resource wait time will increase the overall wait time). When the CPU becomes free, the task will finally be run on the CPU (task in running state).
The signal wait time can be up to 4ms per runnable task, this is by design. So if a CPU has 5 runnable tasks in the queue, then this query after the resource becomes available might wait up to a maximum of 5 X 4ms = 20ms in the runnable state (normally less as other tasks might not use the full quantum).
In case the CPU usage is high, let’s say many CPU intensive queries are running on the instance, there is a possibility that the IO operations that are completed at the Hardware and Operating System level are not yet processed by SQL Server, keeping the task in the resource wait state for longer than necessary. In case of an SSD, the IO operation might even complete in less than a millisecond, but it might take SQL Server 100s of milliseconds, for instance, to process the completed IO operation. For example, let’s say you have a user inserting 500 rows in individual transactions. When the transaction log is on an SSD or battery backed up controller that has write cache enabled, all of these inserts will complete in 100 to 200ms. With a CPU intensive parallel query executing across all CPU cores, the same inserts might take minutes to complete. WRITELOG wait time will be very high in this case (both under sys.dm_io_virtual_file_stats and sys.dm_os_wait_stats). In addition you will notice a large number of WAITELOG waits since log records are written by LOG WRITER and hence very high signal_wait_time_ms leading to more query delays. However, Performance Monitor Counter, PhysicalDisk, Avg. Disk sec/Write will report very low latency times.
Such delayed IO handling also occurs to read operations with artificially very high PAGEIOLATCH_SH wait time (with number of PAGEIOLATCH_SH waits remaining the same). This problem will manifest more and more as customers start using SSD based storage for SQL Server, since they drive the CPU usage to the limits with faster IOs. We have a few workarounds for specific scenarios, but we think Microsoft should resolve this issue at the product level. We have a connect item open – https://connect.microsoft.com/SQLServer/feedback/details/744650/sql-server-io-handling-mechanism-can-be-severely-affected-by-high-cpu-usage - (with example scripts) to reproduce this behavior, please up vote the item so the issue will be addressed by the SQL Server product team soon.
Thanks for your help and best regards,