Monday, November 18, 2013

The precise meaning of I/O wait time in Linux

Some time ago I had a discussion with some systems guys about the exact meaning of the I/O wait time which is displayed by top as a percentage of total CPU time. Their answer was that it is the time spent by the CPU(s) while waiting for outstanding I/O operations to complete. Indeed, the man page for the top command defines this as the "time waiting for I/O completion".

However, this definition is obviously not correct (or at least not complete), because a CPU never spends clock cycles waiting for an I/O operation to complete. Instead, if a task running on a given CPU blocks on a synchronous I/O operation, the kernel will suspend that task and allow other tasks to be scheduled on that CPU.

So what is the exact definition then? There is an interesting Server Fault question that discusses this. Somebody came up with the following definition that describes I/O wait time as a sub-category of idle time:

iowait is time that the processor/processors are waiting (i.e. are in an idle state and do nothing), during which there were in fact outstanding disk I/O requests.

That makes perfect sense for uniprocessor systems, but there is still a problem with that definition when applied to multiprocessor systems. In fact, "idle" is a state of a CPU, while "waiting for I/O completion" is a state of a task. However, as pointed out earlier, a task waiting for outstanding I/O operations is not running on any CPU. So how can the I/O wait time be accounted for on a per-CPU basis?

For example, let's assume that on an otherwise idle system with 4 CPUs, a single, completely I/O bound task is running. Will the overall I/O wait time be 100% or only 25%? I.e. will the I/O wait time be reported as 100% on all 4 CPUs, or as 100% on a single CPU (and 0% on the others)? This can easily be checked with a simple experiment. One can simulate an I/O bound process using the following command, which simply reads data from the hard disk as fast as it can:

dd if=/dev/sda of=/dev/null bs=1MB

Note that you need to execute this as root and if necessary change the input file to the appropriate block device for your hard disk.
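
If you are not sure which block device to use, lsblk lists the disks available on the system. While dd is running, you can also verify the claim made earlier that a task waiting for I/O does not occupy a CPU: ps should report the dd process in state D (uninterruptible sleep) most of the time, rather than R (running). A quick check, assuming this is the only dd process on the system:

lsblk
ps -o pid,stat,wchan,comm -C dd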

Looking at the CPU stats in top (press 1 to get per-CPU statistics), you will see something like this:

%Cpu0  :  3,1 us, 10,7 sy,  0,0 ni,  3,5 id, 82,4 wa,  0,0 hi,  0,3 si,  0,0 st
%Cpu1  :  3,6 us,  2,0 sy,  0,0 ni, 90,7 id,  3,3 wa,  0,0 hi,  0,3 si,  0,0 st
%Cpu2  :  1,0 us,  0,3 sy,  0,0 ni, 96,3 id,  2,3 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu3  :  3,0 us,  0,3 sy,  0,0 ni, 96,3 id,  0,3 wa,  0,0 hi,  0,0 si,  0,0 st

This clearly indicates that a single I/O bound task only increases the I/O wait time on a single CPU. Note that you may occasionally see the task "switch" from one CPU to another. That is because the Linux kernel tries to schedule a task on the CPU it ran on last (in order to improve CPU cache hit rates). The taskset command can be used to "pin" a process to a given CPU so that the experiment becomes more reproducible (note that the first command line argument is not a CPU number, but a mask):

taskset 1 dd if=/dev/sda of=/dev/null bs=1MB
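
If the mask notation is confusing, taskset also accepts a plain list of CPU numbers via the -c option; the following should be equivalent to the mask 1 used above (i.e. CPU 0):

taskset -c 0 dd if=/dev/sda of=/dev/null bs=1MB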

Another interesting experiment is to run a CPU bound task at the same time on the same CPU:

taskset 1 sh -c "while true; do true; done"

The I/O wait time now drops to 0 on that CPU (and also remains 0 on the other CPUs), while user, system and softirq time together account for 100% CPU usage:

%Cpu0  : 80,3 us, 15,5 sy,  0,0 ni,  0,0 id,  0,0 wa,  0,0 hi,  4,3 si,  0,0 st
%Cpu1  :  4,7 us,  3,4 sy,  0,0 ni, 91,3 id,  0,0 wa,  0,0 hi,  0,7 si,  0,0 st
%Cpu2  :  2,3 us,  0,3 sy,  0,0 ni, 97,3 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu3  :  2,7 us,  4,3 sy,  0,0 ni, 93,0 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st

That is expected because I/O wait time is a sub-category of idle time, and the CPU to which we pinned both tasks is never idle.
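
To double-check that both tasks really compete for the same CPU, ps can show the processor each task was last scheduled on (the psr column); with the pinning above, both the dd process and the busy-looping shell should report 0 (the listing may also pick up unrelated sh processes):

ps -o pid,psr,stat,comm -C dd,sh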

These findings allow us to deduce the exact definition of I/O wait time:

For a given CPU, the I/O wait time is the time during which that CPU was idle (i.e. didn't execute any tasks) and there was at least one outstanding disk I/O operation requested by a task scheduled on that CPU (at the time it generated that I/O request).
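
top derives these percentages from per-CPU counters maintained by the kernel. To look at the raw numbers yourself, inspect /proc/stat: the fifth value on each cpuN line is the accumulated I/O wait time of that CPU, in clock ticks:

grep '^cpu' /proc/stat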

Note that this nuance is not merely academic; it has practical consequences. For example, on a system with many CPUs, even if there is a problem with I/O performance, the observed overall I/O wait time may still be small if the problem only affects a single task. It also means that while it is generally correct to say that faster CPUs tend to increase I/O wait time (simply because a faster CPU tends to be idle more often), that statement is no longer true if one replaces "faster" by "more".
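
For that reason, on machines with many CPUs it is worth looking at the per-CPU breakdown rather than only at the aggregate value. Besides pressing 1 in top, the mpstat tool from the sysstat package reports the same per-CPU %iowait, e.g. refreshed once per second:

mpstat -P ALL 1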

4 comments:

  1. Great article! But I ran into a question in my own practice: when I run dd if=/dev/sda of=/dev/null bs=1MB, I see I/O wait time not only on one CPU core, but on all of them. See below:

    %Cpu0 : 5.7 us, 5.7 sy, 0.0 ni, 50.5 id, 34.8 wa, 3.3 hi, 0.0 si, 0.0 st
    %Cpu1 : 3.0 us, 3.3 sy, 0.0 ni, 72.5 id, 20.9 wa, 0.3 hi, 0.0 si, 0.0 st
    %Cpu2 : 7.0 us, 4.3 sy, 0.0 ni, 62.0 id, 26.7 wa, 0.0 hi, 0.0 si, 0.0 st
    %Cpu3 : 4.3 us, 2.3 sy, 0.0 ni, 89.6 id, 3.7 wa, 0.0 hi, 0.0 si, 0.0 st

    My Linux box's kernel version is 3.12.13-gentoo. Have you seen this before? Can you explain it? Thanks!

    Replies
    1. As explained in the article, the Linux kernel tries to schedule a task on the CPU it ran last. For some reason, on your system, the kernel switches the dd task more frequently from one CPU to the other. Nevertheless, assuming that dd is the only task doing I/O, there can be at most one CPU in state I/O wait at any given point in time. Indeed, 34.8+20.9+26.7+3.7=86.1 which is close to but lower than 100.

  2. Hi Andreas, after reading your article and doing my own study, I concluded the following:
    When an idle core (100.0%id) is given a task, the core will try to allocate all of its resources to finish the task as soon as possible (shown with 100.0%us), but on some occasions the core just can't allocate all of its resources to the task (maybe 40.0%us or lower) because it has to wait for the task's I/O, if it has any (shown with 60.0%wa).


    During my own study, I encountered a situation like the following:
    When an idle core (100.0%id) is given a task, the core only allocated a fraction of its resources to the task (around 20.0%us), while spending most of its remaining resources doing nothing (around 78.0%id). FYI, it is a low-I/O task (the core only spent roughly 2.0%wa waiting for its I/O).


    So, my question is: what does the core spend its resources on, other than executing a task (shown in %us and %sy) and waiting for a task's I/O (shown in %wa)?
    Could it be the network (TCP/IP) or anything else? Are there any tools to confirm these assumptions?

    Replies
    1. That depends on the type of process, but it generally involves getting stack traces of the running process. For Java this can easily be done using JMX; for C/C++ you can use gdb, but you will need to load the debug symbols for the process and the libraries it uses (in particular glibc).
