Objectives
- Explain the importance of I/O scheduling
- Understand how the Completely Fair Queueing (CFQ) and Deadline algorithms work
I/O Scheduling
The I/O scheduling layer is the interface between the generic block layer and the low-level physical device drivers. Both the Virtual Memory and Virtual File System layers submit I/O requests to block devices; it is the job of the I/O scheduling layer to prioritize and order these requests before they are handed to the block devices.
Any I/O scheduling system has to satisfy certain conflicting requirements:
- Hardware access time should be minimized
- Requests must be organized according to physical location on the disk
- Requests should be merged to the extent possible to get as big a contiguous region as possible, which also minimizes disk access time
- Requests should be satisfied with as low a latency as is feasible
- Write operations can usually wait to migrate from caches to disk without stalling processes. Read operations, however, almost always require a process to wait for completion before proceeding further. Favoring reads over writes leads to better parallelism and system responsiveness
- Processes should share the I/O bandwidth in a fair, or at least consciously prioritized, fashion; even if it means some overall performance slowdown of the I/O layer, process throughput should not suffer inordinately
I/O Scheduler Choices
Since these demands can be conflicting, different I/O schedulers may be appropriate for different workloads, such as a large database server vs. a desktop system. Furthermore, different hardware may mandate different strategies. In order to provide flexibility, the Linux kernel uses an object-oriented scheme, in which pointers to the various needed functions are supplied in a data structure, the particular one of which can be selected at boot on the kernel command line, as in:
linux... elevator=[cfq|deadline|noop]
At least one of the I/O scheduling algorithms must be compiled into the kernel. The current choices are:
- Completely Fair Queueing (CFQ)
- Deadline Scheduling
- noop (a simple scheme)
Modern distributions choose either CFQ or Deadline.
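If a scheduler was chosen on the kernel command line, the selection can be verified after boot by examining /proc/cmdline; the output below is only illustrative and will differ from system to system:
$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz root=/dev/sda1 ro elevator=deadline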
I/O Scheduling and SSD Devices
The gradual introduction of SSD (Solid State Drive) devices which use flash memory to emulate hard disks has important implications for I/O scheduling.
Such devices do not require an elevator scheme, and they benefit from wear leveling to spread I/O over the device, since flash cells have a limited number of write/erase cycles. One can examine
/sys/block/<device>/queue/rotational
to see whether or not a device is an SSD, as in:
$ cat /sys/block/sda/queue/rotational
1
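A value of 1 indicates a rotating disk, while 0 indicates a non-rotational (SSD) device. As a quick sketch, every block device on a system can be checked in one pass; the device names and values shown below are only examples:
$ for f in /sys/block/*/queue/rotational ; do echo "$f: $(cat $f)" ; done
/sys/block/sda/queue/rotational: 1
/sys/block/sdb/queue/rotational: 0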
Tunables and Switching the I/O Scheduler at Runtime
Each of the I/O schedulers exposes parameters which can be used to tune behavior at run time. The parameters are accessed through the pseudo-filesystem mounted at
/sys
In addition, it is possible to use different I/O schedulers for different devices. The choice can be made easily through the command line. For example:
$ cat /sys/block/sda/queue/scheduler
noop deadline [cfq]
$ echo noop > /sys/block/sda/queue/scheduler
$ cat /sys/block/sda/queue/scheduler
[noop] deadline cfq
The actual tunables vary according to the particular I/O scheduler, and can be found under:
/sys/block/<device>/queue/iosched
For a disk using CFQ:
$ ls -l /sys/block/sda/queue/iosched
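The files in this directory are ordinary sysfs entries and can be read and written (the latter as root). As a sketch, assuming sda is using CFQ, the quantum parameter could be inspected and raised; the exact set of tunables and their defaults depend on the kernel version:
$ cat /sys/block/sda/queue/iosched/quantum
4
$ echo 8 > /sys/block/sda/queue/iosched/quantum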
CFQ (Completely Fair Queueing) Scheduler
The CFQ method has the goal of equal spreading of I/O bandwidth among all processes submitting requests.
Theoretically, each process has its own I/O queue, which works together with a dispatch queue that receives the actual requests on the way to the device. In actual practice, the number of queues is fixed (at 64) and a hash based on the process ID is used to select a queue when a request is submitted.
Dequeuing of requests is done round-robin style over all the queues, each of which works in FIFO (First In, First Out) order. Thus the work is spread out. To avoid excessive seeking, an entire round is selected and then sorted into the dispatch queue before the actual I/O requests are issued to the device.
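As a toy illustration only, and not the kernel's actual hash function, the idea of mapping a process ID onto one of the 64 queues can be pictured as:
$ echo $(( $$ % 64 ))
23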
CFQ Tunables
In the list below, the parameter HZ is a kernel-configured quantity that corresponds to the number of jiffies per second, which the kernel uses as a coarse measure of time.
- quantum
- Maximum number of requests dispatched in one round of service (Default = 4)
- queued
- Minimum request allocation per queue (Default = 8)
- fifo_expire_sync
- FIFO timeout for sync requests (Default = HZ/2)
- fifo_expire_async
- FIFO timeout for async requests (Default = 5 * HZ)
- fifo_batch_expire
- Rate at which the FIFOs expire (Default = HZ/8)
- back_seek_max
- Maximum backwards seek, in KB (Default = 16K)
- back_seek_penalty
- Penalty for a backwards seek (Default = 2)
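As a sketch of adjusting one of these parameters, assuming sda is using CFQ and the commands are run as root, backward seeks could be disabled entirely by setting the backward-seek window to zero; the default value shown is only illustrative:
$ cat /sys/block/sda/queue/iosched/back_seek_max
16384
$ echo 0 > /sys/block/sda/queue/iosched/back_seek_max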
Deadline Scheduler
The Deadline I/O scheduler aggressively reorders requests with the simultaneous goals of improving overall performance and preventing large latencies for individual requests, thereby limiting starvation.
With each request, the kernel associates a deadline. Read requests get higher priority than write requests.
Five separate I/O queues are maintained:
- Two sorted lists are maintained, one for reading and one for writing, and arranged by starting block
- Two FIFO lists are maintained, again one for reading and one for writing. These lists are sorted by submission time
- A fifth queue contains the requests that are to be shoveled to the device driver itself. This is called the dispatch queue.
Exactly how the requests are peeled off the first four queues and placed on the fifth (the dispatch queue) is where the art of the algorithm lies.
Deadline Tunables
- read_expire
- How long (in milliseconds) a read request is guaranteed to occur within (Default = HZ/2 = 500)
- write_expire
- How long (in milliseconds) a write request is guaranteed to occur within (Default = 5*HZ = 5000)
- writes_starved
- How many times reads are given preference over writes before a write is dispatched (Default = 2)
- fifo_batch
- How many requests should be moved from the sorted scheduler list to the dispatch queue, when the deadlines have expired (Default = 16)
- front_merges
- Back merges are more common than front merges as a contiguous request usually continues to the next block. Setting this parameter to 0 disables front merges and can give a boost if you know they are unlikely to be needed (Default = 1)
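As a sketch of tuning these, assuming sda is using the deadline scheduler and the commands are run as root, the read deadline could be tightened and writes allowed to wait a little longer; the defaults shown are those listed above and may differ by kernel version:
$ cat /sys/block/sda/queue/iosched/read_expire
500
$ echo 250 > /sys/block/sda/queue/iosched/read_expire
$ echo 4 > /sys/block/sda/queue/iosched/writes_starved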