Chapter 22. System Monitoring

Introduction

We will cover common utilities for process monitoring, such as ps, top, and sar. Most of these commands make heavy use of the pseudo-filesystems mounted at /proc and /sys.

Objectives

  • Recognize and use available system monitoring tools
  • Use /proc and /sys pseudo-filesystems
  • Use sar to gather system activity and performance data
  • Create reports

Available monitoring tools

Process and Load Monitoring Utilities

Utility Purpose Package
top Display process activity, dynamically updated procps
uptime How long the system has been running, and the load averages procps
ps Detailed information about processes procps
pstree A tree of processes and their connections psmisc
mpstat Multiple processor usage sysstat
iostat CPU utilization and I/O statistics sysstat
sar Display and collect information about system activity sysstat
numastat Information about NUMA (Non-Uniform Memory Architecture) numactl
strace Information about all system calls a process makes strace
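
A few of these can be sampled right away; a quick sketch (output will differ from system to system):

```shell
# Quick first looks at process activity and load:
uptime                 # how long up, users logged in, 1/5/15-minute load averages
ps -ef | head -5       # first few lines of a full-format process listing
```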

Memory Monitoring Utilities

Utility Purpose Package
free Brief summary of memory usage procps
vmstat Detailed virtual memory statistics and block I/O procps
pmap Process memory map procps
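
Both free and vmstat draw their numbers from /proc/meminfo; a minimal sketch pulling a few of those fields out directly:

```shell
# Print total, free, and available memory in MB, straight from /proc/meminfo.
awk '/^(MemTotal|MemFree|MemAvailable):/ { printf "%-13s %8d MB\n", $1, $2/1024 }' \
    /proc/meminfo
```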

I/O Monitoring Utilities

Utility Purpose Package
iostat CPU utilization and I/O statistics sysstat
sar Display and collect information about system activity sysstat
vmstat Detailed virtual memory statistics and block I/O, dynamically updated procps

Network Monitoring Utilities

Utility Purpose Package
netstat Detailed networking statistics net-tools
iptraf Gather information on network interfaces iptraf
tcpdump Detailed analysis of network packets and traffic tcpdump
wireshark Detailed network traffic analysis wireshark

/proc Basics

This directory originally contained information about processes; over time it has grown to include a great deal of information about system properties, such as interrupts, memory, networking, etc.

A survey of /proc

Let's take a look at what resides in the /proc pseudo-filesystem:

$ ls -F /proc
1/     1465/  1739/  1966/  2213/  2401/  2748/  615/  752/  836/  89/        devices      latency_stats  swaps
10/    1476/  1751/  1970/  2218/  2406/  2777/  621/  755/  837/  895/       diskstats    loadavg        sys/
1014/  1486/  1759/  2/     2222/  2409/  29/    630/  756/  838/  899/       dma          locks          sysrq-trigger
1027/  15/    1762/  20/    2229/  2411/  3/     641/  76/   84/   9/         driver/      mdstat         sysvipc/
1043/  1517/  177/   2013/  2231/  2414/  30/    643/  77/   846/  909/       execdomains  meminfo        thread-self@
1046/  1528/  1770/  2018/  2233/  2431/  3060/  644/  78/   847/  916/       fb           misc           timer_list
1050/  1534/  1775/  2078/  2248/  2470/  3503/  647/  79/   848/  917/       filesystems  modules        timer_stats
1072/  1536/  1780/  2083/  2249/  2475/  3511/  664/  797/  85/   92/        fs/          mounts@        tty/
1081/  1555/  1784/  21/    2261/  25/    3516/  687/  8/    850/  93/        interrupts   mtrr           uptime
11/    16/    1791/  2102/  2272/  2549/  3526/  688/  80/   854/  94/        iomem        net@           version
12/    1601/  1794/  2107/  2286/  2561/  354/   695/  807/  855/  acpi/      ioports      pagetypeinfo   vmallocinfo
1269/  1609/  1796/  2111/  2293/  2563/  461/   696/  809/  86/   asound/    irq/         partitions     vmstat
1293/  1611/  18/    2123/  23/    2588/  462/   7/    81/   861/  buddyinfo  kallsyms     sched_debug    zoneinfo
13/    1621/  1855/  2145/  2301/  26/    472/   727/  82/   862/  bus/       kcore        schedstat
1300/  1696/  19/    2155/  2308/  2631/  473/   728/  828/  864/  cgroups    keys         scsi/
1319/  1697/  1910/  2172/  2389/  2640/  492/   73/   83/   868/  cmdline    key-users    self@
1326/  17/    1919/  2182/  2395/  2643/  493/   74/   830/  869/  consoles   kmsg         slabinfo
135/   1732/  1921/  2187/  2399/  2644/  5/     75/   834/  87/   cpuinfo    kpagecount   softirqs
14/    1737/  1930/  22/    24/    2693/  593/   750/  835/  879/  crypto     kpageflags   stat

Let's take a look at a single process, for example PID 2111:

$ ls -lF /proc/2111
total 0
dr-xr-xr-x. 2 abernal abernal 0 Jul 23 22:48 attr/
-rw-r--r--. 1 abernal abernal 0 Jul 23 22:48 autogroup
-r--------. 1 abernal abernal 0 Jul 23 22:48 auxv
-r--r--r--. 1 abernal abernal 0 Jul 23 22:48 cgroup
--w-------. 1 abernal abernal 0 Jul 23 22:48 clear_refs
-r--r--r--. 1 abernal abernal 0 Jul 23 17:26 cmdline
-rw-r--r--. 1 abernal abernal 0 Jul 23 22:48 comm
-rw-r--r--. 1 abernal abernal 0 Jul 23 22:48 coredump_filter
-r--r--r--. 1 abernal abernal 0 Jul 23 22:48 cpuset
lrwxrwxrwx. 1 abernal abernal 0 Jul 23 22:48 cwd -> /home/abernal/
-r--------. 1 abernal abernal 0 Jul 23 22:48 environ
lrwxrwxrwx. 1 abernal abernal 0 Jul 23 17:34 exe -> /usr/libexec/at-spi2-registryd*
dr-x------. 2 abernal abernal 0 Jul 23 22:48 fd/
dr-x------. 2 abernal abernal 0 Jul 23 22:48 fdinfo/
-rw-r--r--. 1 abernal abernal 0 Jul 23 22:48 gid_map
-r--------. 1 abernal abernal 0 Jul 23 22:48 io
-r--r--r--. 1 abernal abernal 0 Jul 23 22:48 latency
-r--r--r--. 1 abernal abernal 0 Jul 23 22:48 limits
-rw-r--r--. 1 abernal abernal 0 Jul 23 22:48 loginuid
dr-x------. 2 abernal abernal 0 Jul 23 22:48 map_files/
-r--r--r--. 1 abernal abernal 0 Jul 23 22:48 maps
-rw-------. 1 abernal abernal 0 Jul 23 22:48 mem
-r--r--r--. 1 abernal abernal 0 Jul 23 22:48 mountinfo
-r--r--r--. 1 abernal abernal 0 Jul 23 22:48 mounts
-r--------. 1 abernal abernal 0 Jul 23 22:48 mountstats
dr-xr-xr-x. 6 abernal abernal 0 Jul 23 22:48 net/
dr-x--x--x. 2 abernal abernal 0 Jul 23 22:48 ns/
-r--r--r--. 1 abernal abernal 0 Jul 23 22:48 numa_maps
-rw-r--r--. 1 abernal abernal 0 Jul 23 22:48 oom_adj
-r--r--r--. 1 abernal abernal 0 Jul 23 22:48 oom_score
-rw-r--r--. 1 abernal abernal 0 Jul 23 22:48 oom_score_adj
-r--------. 1 abernal abernal 0 Jul 23 22:48 pagemap
-r--------. 1 abernal abernal 0 Jul 23 22:48 personality
-rw-r--r--. 1 abernal abernal 0 Jul 23 22:48 projid_map
lrwxrwxrwx. 1 abernal abernal 0 Jul 23 22:48 root -> //
-rw-r--r--. 1 abernal abernal 0 Jul 23 22:48 sched
-r--r--r--. 1 abernal abernal 0 Jul 23 22:48 schedstat
-r--r--r--. 1 abernal abernal 0 Jul 23 22:48 sessionid
-rw-r--r--. 1 abernal abernal 0 Jul 23 22:48 setgroups
-r--r--r--. 1 abernal abernal 0 Jul 23 22:48 smaps
-r--------. 1 abernal abernal 0 Jul 23 22:48 stack
-r--r--r--. 1 abernal abernal 0 Jul 23 17:27 stat
-r--r--r--. 1 abernal abernal 0 Jul 23 22:48 statm
-r--r--r--. 1 abernal abernal 0 Jul 23 17:27 status
-r--------. 1 abernal abernal 0 Jul 23 22:48 syscall
dr-xr-xr-x. 5 abernal abernal 0 Jul 23 22:48 task/
-r--r--r--. 1 abernal abernal 0 Jul 23 22:48 timers
-rw-r--r--. 1 abernal abernal 0 Jul 23 22:48 uid_map
-r--r--r--. 1 abernal abernal 0 Jul 23 22:48 wchan

Let's see the contents of the status file:

$ cat /proc/2111/status
Name:    at-spi2-registr
State:    S (sleeping)
Tgid:    2111
Ngid:    0
Pid:    2111
PPid:    1
TracerPid:    0
Uid:    1000    1000    1000    1000
Gid:    1000    1000    1000    1000
FDSize:    64
Groups:    10 1000
NStgid:    2111
NSpid:    2111
NSpgid:    1966
NSsid:    1966
VmPeak:      290744 kB
VmSize:      225208 kB
VmLck:           0 kB
VmPin:           0 kB
VmHWM:        7340 kB
VmRSS:        7340 kB
VmData:      148024 kB
VmStk:         136 kB
VmExe:          84 kB
VmLib:       11108 kB
VmPTE:         180 kB
VmPMD:          12 kB
VmSwap:           0 kB
Threads:    3
SigQ:    0/11754
SigPnd:    0000000000000000
ShdPnd:    0000000000000000
SigBlk:    0000000000000000
SigIgn:    0000000000001000
SigCgt:    0000000180000000
CapInh:    0000000000000000
CapPrm:    0000000000000000
CapEff:    0000000000000000
CapBnd:    0000003fffffffff
Seccomp:    0
Cpus_allowed:    1
Cpus_allowed_list:    0
Mems_allowed:    00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list:    0
voluntary_ctxt_switches:    12428
nonvoluntary_ctxt_switches:    67
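
The same fields can be read programmatically. A sketch that extracts a few of them from the reading process's own entry, via the /proc/self shortcut:

```shell
# Print selected fields from the current process's own status file.
awk -F':[ \t]+' '/^(Name|State|VmRSS|Threads):/ { print $1 "=" $2 }' \
    /proc/self/status
```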

Let's also check the interrupts:

$ cat /proc/interrupts
           CPU0       
  0:        137   IO-APIC   2-edge      timer
  1:       4739   IO-APIC   1-edge      i8042
  8:          0   IO-APIC   8-edge      rtc0
  9:          0   IO-APIC   9-fasteoi   acpi
 12:        155   IO-APIC  12-edge      i8042
 14:          0   IO-APIC  14-edge      ata_piix
 15:      19586   IO-APIC  15-edge      ata_piix
 19:     104062   IO-APIC  19-fasteoi   enp0s3
 21:      29455   IO-APIC  21-fasteoi   0000:00:0d.0, snd_intel8x0
 22:      16037   IO-APIC  22-fasteoi   ohci_hcd:usb1
NMI:          0   Non-maskable interrupts
LOC:     985933   Local timer interrupts
SPU:          0   Spurious interrupts
PMI:          0   Performance monitoring interrupts
IWI:          0   IRQ work interrupts
RTR:          0   APIC ICR read retries
RES:          0   Rescheduling interrupts
CAL:          0   Function call interrupts
TLB:          0   TLB shootdowns
TRM:          0   Thermal event interrupts
THR:          0   Threshold APIC interrupts
DFR:          0   Deferred Error APIC interrupts
MCE:          0   Machine check exceptions
MCP:         67   Machine check polls
HYP:          0   Hypervisor callback interrupts
ERR:          0
MIS:          0
PIN:          0   Posted-interrupt notification event
PIW:          0   Posted-interrupt wakeup event
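
The counts in /proc/interrupts are cumulative since boot, so a single reading says little about current activity; a sketch that diffs two samples taken a second apart to show which lines are firing now:

```shell
# Compare two snapshots of /proc/interrupts to spot active interrupt lines.
a=$(mktemp); b=$(mktemp)
cat /proc/interrupts > "$a"
sleep 1
cat /proc/interrupts > "$b"
diff "$a" "$b" | head      # lines whose counts changed in the last second
rm -f "$a" "$b"
```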

Most of the tunable parameters can be found under /proc/sys:

$ ls -F /proc/sys
abi/  crypto/  debug/  dev/  fs/  kernel/  net/  sunrpc/  vm/
  • abi
    • Contains files with application binary information, rarely used
  • debug
    • Debugging parameters, for now just some control of exception reporting
  • dev
    • Device parameters including subdirectories for cdrom, scsi, raid and parport
  • fs
    • Filesystem parameters, including quota, file handles used and maximums, inode and directory information, etc.
  • kernel
    • Kernel parameters
  • net
    • Network parameters, with subdirectories for ipv4, netfilter, etc
  • vm
    • Virtual memory parameters

Viewing and changing these parameters can be done with simple commands. For example, the maximum number of threads allowed on the system can be seen by looking at:

$ ls -l /proc/sys/kernel/threads-max
$ cat /proc/sys/kernel/threads-max
129498

We can modify the value and verify the change

$ sudo bash -c 'echo 100000 > /proc/sys/kernel/threads-max'
$ cat /proc/sys/kernel/threads-max
100000

The same can be accomplished with sysctl by

$ sudo sysctl kernel.threads-max=100000
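
Changes made either way are lost at the next reboot. To make a setting persistent, it can be placed in a sysctl configuration file that is applied at boot; a sketch, where the file name under /etc/sysctl.d is only an example:

```shell
# /etc/sysctl.d/90-threads.conf   (example file name)
# Lines here are applied at boot; sudo sysctl -p <file> applies them immediately.
kernel.threads-max = 100000
```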

/sys Basics

The /sys pseudo-filesystem is an integral part of the Unified Device Model. Conceptually, it is based on a device tree; one can walk through it and see the buses, devices, etc. It also now contains information which may or may not be strictly related to devices, such as kernel modules.

A survey of /sys

Support for the sysfs virtual filesystem is built into all modern kernels.

$ ls -F /sys
block/  bus/  class/  dev/  devices/  firmware/  fs/  hypervisor/  kernel/  module/  power/

Network devices can be examined with

$ ls -lF /sys/class/net

Looking at the first Ethernet card

$ ls -l /sys/class/net/eth0

To check the mtu parameter, we can type:

$ cat /sys/class/net/eth0/mtu
1500

The intention with sysfs is to have one text value per line, although this is not expected to be rigorously enforced.
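
This one-value-per-file layout makes sysfs easy to script against; a small sketch listing every network interface along with two of its attributes:

```shell
# Show each network interface with its MTU and operational state.
for dev in /sys/class/net/*; do
    printf '%-12s mtu=%-6s state=%s\n' \
        "${dev##*/}" "$(cat "$dev/mtu")" "$(cat "$dev/operstate")"
done
```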

The underlying device for the Ethernet card can be examined with:

$ ls -l /sys/class/net/eth0/device

sar

sar, the System Activity Reporter, is an all-purpose tool for gathering system activity and performance data, and is also useful for creating reports.

The backend tool for sar is sadc (system activity data collector), which accumulates the statistics and stores them in the /var/log/sa directory, with a daily frequency by default, though this can be adjusted. Data collection can be started from the command line, and regular periodic collection is usually started as a cron job stored in /etc/cron.d/sysstat.

The sar tool then reads in this data and then produces a report.

Syntax

$ sar [options] [interval] [count]

Sample

$ sar 3 3
Linux 4.2.3-300.fc23.x86_64 (localhost.localdomain)     07/23/2016     _x86_64_    (1 CPU)

11:47:04 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
11:47:07 PM     all      2.05      0.00      0.00      0.00      0.00     97.95
11:47:10 PM     all      2.04      0.00      0.34      0.00      0.00     97.62
11:47:13 PM     all      1.71      0.00      0.00      0.00      0.00     98.29
Average:        all      1.93      0.00      0.11      0.00      0.00     97.95

This command gives a report about the CPU usage

sar Options

Option Meaning
-A Almost all information
-b I/O and transfer rate statistics (similar to iostat)
-B Paging statistics including page faults
-d Block device activity
-n Network statistics
-P Per CPU statistics (as in sar -P ALL 3)
-q Queue lengths (run queue, processes and threads)
-r Swap and memory utilization statistics
-R Memory statistics
-u CPU utilization
-v Statistics about inodes, files, and file handles
-w Context switching statistics
-W Swapping statistics, pages in and out per second
-f Extract information from specified file, created by the -o option
-o Save readings in the file specified, to be read in later with -f option
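
The -o and -f options pair naturally: record binary data now, report on it later. A sketch, where the file name is just an example and the sysstat package is assumed to be installed:

```shell
# Record three 2-second samples to a file, then replay the report from it.
sar -o /tmp/sardata 2 3 > /dev/null     # collect, discarding the live report
sar -f /tmp/sardata                     # re-read the saved binary data
rm -f /tmp/sardata
```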

Giving paging statistics

$ sar -B 3 3
Linux 4.2.3-300.fc23.x86_64 (localhost.localdomain)     07/23/2016     _x86_64_    (1 CPU)

11:58:45 PM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
11:58:48 PM      0.00      4.14     26.21      0.00   1861.72      0.00      0.00      0.00      0.00
11:58:51 PM      0.00      0.00      7.51      0.00   1412.63      0.00      0.00      0.00      0.00
11:58:54 PM      0.00     13.56     40.00      0.00   1587.80      0.00      0.00      0.00      0.00
Average:         0.00      5.92     24.60      0.00   1619.82      0.00      0.00      0.00      0.00

And giving I/O transfer statistics

$ sar -b 3 3
Linux 4.2.3-300.fc23.x86_64 (localhost.localdomain)     07/24/2016     _x86_64_    (1 CPU)

12:00:52 AM       tps      rtps      wtps   bread/s   bwrtn/s
12:00:55 AM      0.00      0.00      0.00      0.00      0.00
12:00:58 AM      0.00      0.00      0.00      0.00      0.00
12:01:01 AM      0.00      0.00      0.00      0.00      0.00
Average:         0.00      0.00      0.00      0.00      0.00

Lab 22.1: Using stress

stress is a C language program written by Amos Waterland at the University of Oklahoma, licensed under the GPL v2. It is designed to place a configurable amount of stress by generating various kinds of workloads on the system.

If you are lucky you can install stress directly from your distribution’s packaging system. Otherwise, you can obtain the
source from http://people.seas.harvard.edu/~apw/stress, and then compile and install by doing:

$ tar zxvf stress-1.0.4.tar.gz
$ cd stress-1.0.4
$ ./configure
$ make
$ sudo make install

There may exist pre-packaged downloadable binaries in the .deb and .rpm formats; see the home page for details and locations.

Once installed, you can do:

$ stress --help

for a quick list of options, or

$ info stress

for more detailed documentation.

As an example, the command:

$ stress -c 8 -i 4 -m 6 -t 20s

will:

  • Fork off 8 CPU-intensive processes, each spinning on a sqrt() calculation.
  • Fork off 4 I/O-intensive processes, each spinning on sync().
  • Fork off 6 memory-intensive processes, each spinning on malloc(), allocating 256 MB by default. The size can be changed as in --vm-bytes 128M.
  • Run the stress test for 20 seconds.

After installing stress, you may want to start up your system’s graphical system monitor, which you can find on your application menu, or run from the command line, which is probably gnome-system-monitor or ksysguard.

Now begin to put stress on the system. The exact numbers you use will depend on your system's resources, such as the number of CPUs and the amount of RAM.

For example, doing

$ stress -m 4 -t 20s

puts only a memory stressor on the system. Play with combinations of the switches and see how they impact each other. You may find the stress program useful to simulate various high load conditions.
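
The rising load can also be watched from a second terminal, or inline; a sketch assuming stress is installed:

```shell
# Run a short CPU-only stress in the background and watch the load climb.
stress -c 2 -t 10s > /dev/null &
for i in 1 2 3; do uptime; sleep 3; done
```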