Chapter 24. I/O Monitoring and Tuning

Monitoring

Introduction

The relationship between I/O and system performance is complex. To tackle the challenge of resource monitoring, we can use iostat, iotop, and ionice.

Objectives

  • Use iostat to monitor system I/O device activity
  • Use iotop to display a constantly updated table of current I/O usage
  • Use ionice to set both the I/O scheduling class and the priority for a given process

iostat

iostat is the workhorse for monitoring I/O device activity on the system.

$ iostat 
Linux 4.2.3-300.fc23.x86_64 (localhost.localdomain)     07/24/2016     _x86_64_    (1 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.52    0.01    0.59    0.07    0.00   95.81

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               4.22       104.74        18.39     621056     109057
sdb               0.05         0.67         0.00       3984          1
sdc               0.03         0.26         0.00       1546          2
sdd               0.02         0.16         0.00        967          2
scd0              0.00         0.01         0.00         64          0
dm-0              5.17       103.49        18.39     613625     109048
dm-1              0.02         0.13         0.00        760          0
loop0             0.01         0.18         0.00       1049          6
dm-2              0.01         0.19         0.00       1106          1
md0               0.01         0.11         0.00        665          1
loop1             0.01         0.19         0.00       1112          0
dm-3              0.01         0.17         0.00       1036          0

These data mean:

  • tps
    • I/O transfers (requests) per second issued to the device
  • blocks read and written per unit time
    • Where blocks are generally sectors of 512 bytes
  • total blocks read and written

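Since blocks are generally 512-byte sectors, converting a block count to KB is simple arithmetic: multiply by 512 and divide by 1024 (i.e., divide by two). A quick sketch using shell arithmetic (the block count here is an arbitrary example figure):

```shell
# Convert a count of 512-byte blocks to KB: each block is half a KB.
blocks=8192    # arbitrary example figure
echo "${blocks} blocks = $(( blocks * 512 / 1024 )) KB"
```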
Information is broken out by disk partition (and, if LVM is being used, also by dm (device mapper) logical volumes).

iostat Options

A different display is generated by the -k option, which shows results in KB instead of blocks:

$ iostat -k

In order to display information by device name, use the -N option:

$ iostat -N
Linux 4.2.3-300.fc23.x86_64 (localhost.localdomain)     07/24/2016     _x86_64_    (1 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.08    0.01    0.20    0.03    0.00   98.68

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               1.34        28.25         9.94     624720     219757
sdb               0.01         0.18         0.00       3984          1
sdc               0.01         0.07         0.00       1546          2
sdd               0.01         0.04         0.00        967          2
scd0              0.00         0.00         0.00         64          0
fedora-root       1.70        27.91         9.94     617289     219748
fedora-swap       0.00         0.03         0.00        760          0
loop0             0.00         0.05         0.00       1049          6
myvg-mylvm        0.00         0.05         0.00       1106          1
md0               0.00         0.03         0.00        665          1
loop1             0.00         0.05         0.00       1112          0
secret-disk1      0.00         0.05         0.00       1036          0

A much more detailed report can be obtained by using the -x option (for extended):

$ iostat -xk
Linux 4.2.3-300.fc23.x86_64 (localhost.localdomain)     07/24/2016     _x86_64_    (1 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.11    0.01    0.20    0.03    0.00   98.66

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.07     0.33    0.85    0.48    27.65     9.88    56.61     0.00    1.85    1.64    2.21   0.79   0.10
sdb               0.00     0.00    0.01    0.00     0.18     0.00    25.38     0.00    0.86    0.86    1.00   0.57   0.00
sdc               0.00     0.00    0.01    0.00     0.07     0.00    15.80     0.00    0.76    0.75    1.33   0.65   0.00
sdd               0.00     0.00    0.01    0.00     0.04     0.00    14.15     0.00    0.61    0.59    1.33   0.53   0.00
scd0              0.00     0.00    0.00    0.00     0.00     0.00     6.40     0.00    0.60    0.60    0.00   0.60   0.00
dm-0              0.00     0.00    0.91    0.78    27.32     9.88    44.13     0.00    2.01    2.18    1.81   0.61   0.10
dm-1              0.00     0.00    0.00    0.00     0.03     0.00    16.00     0.00    1.38    1.38    0.00   1.38   0.00
loop0             0.00     0.00    0.00    0.00     0.05     0.00    35.17     0.00    2.82    1.35   16.00   2.75   0.00
dm-2              0.00     0.00    0.00    0.00     0.05     0.00    35.71     0.00    0.87    0.87    1.00   0.50   0.00
md0               0.00     0.00    0.00    0.00     0.03     0.00    20.18     0.00    0.00    0.00    0.00   0.00   0.00
loop1             0.00     0.00    0.00    0.00     0.05     0.00    35.32     0.00    1.41    1.41    0.00   0.81   0.00
dm-3              0.00     0.00    0.00    0.00     0.05     0.00    48.19     0.00    1.88    1.88    0.00   1.07   0.00

Extended iostat fields

Field      Meaning
Device     Device or partition name
rrqm/s     Number of read requests merged per second, queued to the device
wrqm/s     Number of write requests merged per second, queued to the device
r/s        Number of read requests per second, issued to the device
w/s        Number of write requests per second, issued to the device
rkB/s      KB read from the device per second
wkB/s      KB written to the device per second
avgrq-sz   Average request size, in 512-byte sectors
avgqu-sz   Average queue length of requests issued to the device
await      Average time (in msecs) between when a request is issued and when it is completed:
           queue time plus service time
svctm      Average service time (in msecs) for I/O requests
%util      Percentage of CPU time during which the device serviced requests

Note that if the utilization percentage approaches 100, the device is saturated.
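A quick way to spot this from the extended output is to filter on the last column with awk; the following sketch flags any device above a threshold (the 90 percent cutoff and the helper function are just an illustration, not part of iostat):

```shell
# Filter for `iostat -xk` output: device lines start with a lowercase name
# (sda, dm-0, ...) and %util is the last field; print any device above an
# arbitrary 90 percent threshold. On a live system you would feed it with:
#     iostat -xk | flag_saturated
flag_saturated() {
    awk '$1 ~ /^[a-z]/ && $NF + 0 > 90 { print $1, $NF }'
}

# Demonstration on two synthetic device lines (only sdx is saturated):
printf 'sda 0.07 0.33 27.65 9.88 0.79 0.10\nsdx 9.00 8.00 500.2 400.1 0.99 97.30\n' \
    | flag_saturated
```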

iotop

Another useful utility is iotop, which must be run as root. It displays a table of current I/O usage and updates periodically like top.

$ sudo iotop

Within the output, the PRIO column will display values such as:

  • be
    • Best Effort
  • rt
    • Real Time

Using ionice to set I/O priorities

The ionice utility lets you set both the I/O scheduling class and priority for a given process. It takes the form:

$ ionice [-c class] [-n priority] [-p pid] [COMMAND [ARGS] ]

If a pid is given with the -p argument, the results apply to the requested process; otherwise, they apply to the process that will be started by COMMAND with its possible arguments. If no arguments are given, ionice returns the scheduling class and priority of the current shell process, as in:

$ ionice

The -c parameter specifies the I/O scheduling class, which can have the following 3 values:

I/O Scheduling Class   -c value   Meaning
Real time              1          Gets first access to the disk; can starve other processes. The priority defines how big a time slice each process gets.
Best effort            2          All programs serviced in round-robin fashion, according to priority settings. The default.
Idle                   3          No access to disk I/O unless no other program has asked for it for a defined period.

Both the best effort and real time classes take a priority parameter, which can range from 0 to 7, with 0 being the highest priority. An example:

$ ionice -c 2 -n 3 -p 30078

Note: ionice works only when using the CFQ I/O scheduler, discussed in the next section.
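For instance, a disk-heavy background job can be confined to the idle class so it only touches the disk when nothing else wants it; a minimal sketch (dd is just an illustrative payload, and the file name is arbitrary):

```shell
# Run a command in the idle class (-c 3); it receives disk time only when
# no other process is performing I/O. dd here is just a throwaway payload.
ionice -c 3 dd if=/dev/zero of=/tmp/ionice-demo.img bs=1M count=4 status=none

# Query the scheduling class of an already-running process (this shell):
ionice -p $$
rm -f /tmp/ionice-demo.img
```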

Lab 24.1: bonnie++

bonnie++ is a widely available benchmarking program that tests and measures the performance of drives and filesystems. It is descended from bonnie, an earlier implementation.

Results can be read from the terminal window or directed to a file, and also saved in csv format (comma-separated values). The companion programs, bon_csv2html and bon_csv2txt, can be used to convert to html and plain text output formats. We recommend you read the man page for bonnie++ before using it, as it has quite a few options regarding which tests
to perform and how exhaustive and stressful they should be. A quick synopsis is obtained with:

$ bonnie++ -help
bonnie++: invalid option -- ’h’
usage:
bonnie++ [-d scratch-dir] [-c concurrency] [-s size(MiB)[:chunk-size(b)]]
[-n number-to-stat[:max-size[:min-size][:num-directories[:chunk-size]]]]
[-m machine-name] [-r ram-size-in-MiB]
[-x number-of-tests] [-u uid-to-use:gid-to-use] [-g gid-to-use]
[-q] [-f] [-b] [-p processes | -y] [-z seed | -Z random-file]
[-D]

Version: 1.96
A quick test can be obtained with a command like:

$ time sudo bonnie++ -n 0 -u 0 -r 100 -f -b -d /mnt

where:

  • -n 0 means don’t perform the file creation tests.
  • -u 0 means run as root.
  • -r 100 means pretend you have 100 MB of RAM.
  • -f means skip per character I/O tests.
  • -b means do a fsync after every write, which forces flushing to disk rather than just writing to cache.
  • -d /mnt just specifies the directory to place the temporary file created; make sure it has enough space, in this case 300 MB, available.

If you don’t supply a figure for your memory size, the program will figure out how much the system has and will create
a testing file 2-3 times as large. We are not doing that here because the test would take much longer, and the short version is enough to get a feel for things.
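The figure bonnie++ would base its default sizing on can be estimated by reading MemTotal out of /proc/meminfo; this sketch just does the arithmetic (the 2x multiplier is the low end of the 2-3x rule above):

```shell
# MemTotal in /proc/meminfo is given in KiB; convert to MiB and double it
# to estimate the default-style test file size.
ram_mib=$(awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo)
echo "RAM: ${ram_mib} MiB; a default run would use roughly $(( 2 * ram_mib )) MiB"
```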

On an RHEL 7 system:

$ time sudo bonnie++ -n 0 -u 0 -r 100 -f -b -d /mnt
Using uid:0, gid:0.
Writing intelligently...done
Rewriting...done
Reading intelligently...done
start ’em...done...done...done...done...done...
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
q7 300M 99769 14 106000 12 +++++ +++ 257.3 1
Latency 226us 237us 418us 624ms
1.96,1.96,q7,1,1415992158,300M,,,,99769,14,106000,12,,,+++++,+++,257.3,1,,,,,,,,,,,,,,,,,,,226us,237us,,418us,624ms,,,,,,

On an Ubuntu 14.04 system, running as a virtual machine under hypervisor on the same physical machine:

$ time sudo bonnie++ -n 0 -u 0 -r 100 -f -b -d /mnt
Using uid:0, gid:0.
Writing intelligently...done
Rewriting...done
Reading intelligently...done
start ’em...done...done...done...done...done...
Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
ubuntu 300M 70000 61 43274 31 470061 96 2554 91
Latency 306ms 201ms 9276us 770ms
1.97,1.97,ubuntu,1,1415983257,300M,,,,70000,61,43274,31,,,470061,96,2554,91,,,,,,,,,,,,,,,,,,,306ms,201ms,,9276us,770ms,,,

You can clearly see the drop in performance.

Assuming you have saved the previous outputs as a file called bonnie++.out, you can convert the output to html:

$ bon_csv2html < bonnie++.out > bonnie++.html

or to plain text with:

$ bon_csv2txt < bonnie++.out > bonnie++.txt
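If the bon_csv2html/bon_csv2txt helpers are not available, individual figures can still be pulled out of the CSV line with awk. In the 1.96-format sample shown earlier, the sequential block-write rate (99769 K/sec) lands in the tenth comma-separated field; field positions may shift between versions, so check against your own output:

```shell
# Pull the sequential block-write rate out of a bonnie++ CSV line; in the
# 1.96-format sample above, it is the tenth comma-separated field.
csv='1.96,1.96,q7,1,1415992158,300M,,,,99769,14,106000,12,,,+++++,+++,257.3,1'
echo "$csv" | awk -F, '{ print "block write: " $10 " K/sec" }'
```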

After reading the documentation, try longer and larger, more ambitious tests. Try some of the tests we turned off. If
your system is behaving well, save the results for future benchmarking comparisons when the system is sick.

Lab 24.2: fs_mark

The fs_mark benchmark gives a low level bashing to file systems, using heavily asynchronous I/O across multiple directories and drives. It’s a rather old program written by Ric Wheeler that has stood the test of time. It can be downloaded from http://sourceforge.net/projects/fsmark/. Once you have obtained the tarball, you can unpack it and compile it with:

$ tar zxvf fs_mark-3.3.tgz
$ cd fs_mark
$ make

Read the README file as we are only going to touch the surface.
If the compile fails with an error like:

$ make
....
/usr/bin/ld: cannot find -lc

it is because you haven’t installed the static version of glibc. You can install it on Red Hat-based systems with:

$ sudo yum install glibc-static

and on SUSE-related systems with:

$ sudo zypper install glibc-devel-static

On Debian-based systems, the relevant static library is installed along with the shared one, so no additional package needs to be sought.

For a test, we are going to create 1000 files, each 10 KB in size, and after each write we’ll perform an fsync to flush out to disk. This can be done in the /tmp directory with the command:

$ fs_mark -d /tmp -n 1000 -s 10240
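Note that the raw payload is small: 1000 files of 10240 bytes is only about 10 MB, so most of the disk traffic comes from the per-file fsync and the accompanying metadata and journal updates. The arithmetic, as a quick shell check:

```shell
# Total file data in the fs_mark run: 1000 files of 10240 bytes each.
files=1000; bytes=10240
echo "about $(( files * bytes / 1024 / 1024 )) MiB of file data, plus metadata"
```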

While this is running, gather extended iostat statistics with:

$ iostat -x -d /dev/sda 2 20

in another terminal window.

The numbers you should surely note are the number of files per second reported by fs_mark and the device utilization percentage (%util) reported by iostat. If this is approaching 100 percent, you are I/O-bound. Depending on what kind of filesystem you are using, you may be able to get improved results by changing the mount options. For example, for ext3 or ext4 you can try:

$ mount -o remount,barrier=1 /tmp

or for ext4 you can try:

$ mount -o remount,journal_async_commit /tmp

See how your results change. Note that these options may cause problems if you have a power failure, or other ungraceful system shutdown; i.e., there is likely to be a trade-off between stability and speed. Documentation about some of the mount options can be found with the kernel source under Documentation/filesystems and the man page for mount.
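To verify which options are actually in effect before and after remounting, you can query findmnt (from util-linux); a sketch (on systems where /tmp is not a separate filesystem, -T resolves it to the containing mount):

```shell
# Show the active mount options for the filesystem containing /tmp;
# -T resolves a path to its mount point, -n drops the heading line.
findmnt -T /tmp -no OPTIONS
```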