CPU: Load Average

二龙 · 2023-03-06 · Technical Articles

1. Overview of the principle

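Load average is the average number of processes in the runnable or uninterruptible state per unit of time, that is, the average number of active processes. It has no direct relationship with CPU usage.
  * A runnable process is one that is using the CPU or waiting for the CPU: the processes that ps shows in the R state (Running or Runnable).
  * An uninterruptible process is one in the middle of a critical kernel-mode code path that cannot be interrupted, most commonly waiting for an I/O response from a hardware device: the processes that ps shows in the D state (Uninterruptible Sleep, also called Disk Sleep).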
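
The two states above can be counted from ps output. The following is a minimal sketch, assuming a procps-style ps whose `state` column starts with R or D; the function name `count_active` is made up for illustration and is not part of the original article. It gives an instantaneous snapshot, whereas the load average reported by the kernel is smoothed over time, so the two numbers will not match exactly.

```shell
#!/bin/sh
# count_active: read one process-state string per line on stdin (as printed by
# `ps -e -o state=`) and report how many processes are runnable (R) or in
# uninterruptible sleep (D). The function name is illustrative only.
count_active() {
    awk '
        $1 ~ /^R/ { r++ }   # Running or Runnable
        $1 ~ /^D/ { d++ }   # Uninterruptible Sleep
        END { printf "R=%d D=%d active=%d\n", r, d, r + d }
    '
}

# Usage on a live system (ps itself is usually counted as one R process):
#   ps -e -o state= | count_active
```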

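CPU usage is a measure of how busy the CPU is per unit of time, and it does not necessarily track load average. For example:
  * CPU-intensive processes: heavy CPU use raises the load average, and the two metrics agree;
  * I/O-intensive processes: waiting for I/O also raises the load average, but CPU usage is not necessarily high;
  * Many processes waiting to be scheduled onto a CPU: this also raises the load average, and CPU usage is high as well.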
  
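For example, what does a load average of 2 mean?
On a system with only 2 CPUs, it means every CPU is exactly fully occupied.
On a system with 4 CPUs, it means the CPUs are 50% idle.
On a system with only 1 CPU, it means half of the processes cannot get CPU time.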
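
The arithmetic behind these three cases is simply the load average divided by the number of CPUs. A small sketch follows; `load_per_cpu` is a hypothetical helper name, and the live-system line assumes a Linux host where /proc/loadavg and nproc are available.

```shell
#!/bin/sh
# load_per_cpu LOAD NCPU: print the load average divided by the CPU count.
# A result of 1.00 means the CPUs are exactly saturated; below 1.00 there is
# idle capacity; above 1.00 some active processes have to wait.
load_per_cpu() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN { printf "%.2f\n", load / ncpu }'
}

load_per_cpu 2 2   # 1.00: every CPU is exactly busy
load_per_cpu 2 4   # 0.50: half of the CPU capacity is idle
load_per_cpu 2 1   # 2.00: two active processes share a single CPU

# On a live Linux system (assumes /proc/loadavg and nproc exist):
#   load_per_cpu "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```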

2. CPU-intensive processes

a. Use the stress command to simulate eight CPUs at 100% usage
[root@172-16-104-112 ~]# stress --cpu 8 --timeout 600

b. Use mpstat to watch how CPU usage changes
## -P ALL monitors all CPUs; the trailing 5 prints one set of data every 5 seconds
[root@172-16-104-112 ~]# mpstat -P ALL 5
Linux 3.10.0-1127.el7.x86_64 (172-16-104-112) 	2021年12月15日 	_x86_64_	(8 CPU)

22时03分14秒  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
22时03分19秒  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
22时03分19秒    0  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
22时03分19秒    1  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
22时03分19秒    2  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
22时03分19秒    3  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
22时03分19秒    4  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
22时03分19秒    5  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
22时03分19秒    6  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
22时03分19秒    7  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

c. Find the processes that drive CPU usage to 100%
## Print one set of data after a 5-second interval
[root@172-16-104-112 ~]# pidstat -u 5 1
Linux 3.10.0-1127.el7.x86_64 (172-16-104-112) 	2021年12月15日 	_x86_64_	(8 CPU)

22时04分53秒   UID       PID    %usr %system  %guest    %CPU   CPU  Command
22时04分58秒     0     13699  100.00    0.00    0.00  100.00     2  stress
22时04分58秒     0     13700  100.00    0.00    0.00  100.00     7  stress
22时04分58秒     0     13701   99.80    0.00    0.00   99.80     0  stress
22时04分58秒     0     13702  100.00    0.00    0.00  100.00     5  stress
22时04分58秒     0     13703   99.80    0.00    0.00   99.80     4  stress
22时04分58秒     0     13704   99.80    0.00    0.00   99.80     6  stress
22时04分58秒     0     13705   99.80    0.00    0.00   99.80     3  stress
22时04分58秒     0     13706  100.00    0.00    0.00  100.00     1  stress
22时04分58秒     0     13710    0.00    0.20    0.00    0.20     4  pidstat

Analysis:
The output in b shows all eight CPUs at 100% usage while their iowait is 0. This indicates that the rise in load average is caused by CPU usage reaching 100%.
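
To compare %usr, %sys and %iowait without reading the whole table, the "all" row can be picked out of the mpstat output. This is a rough sketch under stated assumptions: the column names are the ones shown in the output above (%usr, %sys, %iowait), and `summarize_mpstat` is a hypothetical helper, not an mpstat feature. Columns are located by header name, so a locale that adds an AM/PM field to the timestamp should not shift the result.

```shell
#!/bin/sh
# summarize_mpstat: read `mpstat -P ALL` output on stdin and print %usr, %sys
# and %iowait for every "all" row. Column positions are taken from the header
# row, so they do not depend on the timestamp format.
summarize_mpstat() {
    awk '
        /%usr/ {                               # header row: remember positions
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        ("CPU" in col) && $(col["CPU"]) == "all" {
            printf "usr=%s sys=%s iowait=%s\n",
                   $(col["%usr"]), $(col["%sys"]), $(col["%iowait"])
        }
    '
}

# Usage on a live system (prints the sample row and the average row):
#   mpstat -P ALL 5 1 | summarize_mpstat
```

A high usr value with iowait near 0 matches the CPU-intensive case in this section.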

3. I/O-intensive processes

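a. Use stress or stress-ng again, this time to simulate I/O pressure by calling sync continuously
stress -i 1 --timeout 600 (this command may not produce real I/O pressure: stress uses the sync() system call, which flushes buffered data in memory to disk. On a freshly installed virtual machine the buffers may be small, so little I/O is generated and most of the cost is system-call overhead)
stress-ng -i 1 --hdd 1 --timeout 600 (--hdd reads and writes temporary files)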
[root@172-16-104-112 bin]# stress-ng -i 8 --hdd 1 --timeout 600
stress-ng: info:  [13718] dispatching hogs: 1 hdd, 8 io

b. Use mpstat to watch how CPU usage changes:
## Show the metrics for all CPUs and print one set of data after a 5-second interval
[root@172-16-104-112 ~]# mpstat -P ALL 5 1
Linux 3.10.0-1127.el7.x86_64 (172-16-104-112) 	2021年12月15日 	_x86_64_	(8 CPU)

22时12分20秒  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
22时12分25秒  all    0.28    0.00   34.31    6.48    0.00    0.51    0.69    0.00    0.00   57.75
22时12分25秒    0    0.20    0.00   31.79    6.04    0.00    0.20    0.80    0.00    0.00   60.97
22时12分25秒    1    0.40    0.00   35.83    3.04    0.00    0.00    0.61    0.00    0.00   60.12
22时12分25秒    2    1.01    0.00   47.69    3.82    0.00    0.00    0.80    0.00    0.00   46.68
22时12分25秒    3    0.40    0.00   32.86    5.65    0.00    0.20    0.60    0.00    0.00   60.28
22时12分25秒    4    0.21    0.00   40.88    0.21    0.00    3.77    0.63    0.00    0.00   54.30
22时12分25秒    5    0.00    0.00   32.73   17.17    0.00    0.00    0.81    0.00    0.00   49.29
22时12分25秒    6    0.20    0.00   27.85    9.96    0.00    0.00    0.61    0.00    0.00   61.38
22时12分25秒    7    0.00    0.00   24.54    5.68    0.00    0.00    1.01    0.00    0.00   68.76

c. Find which process is causing the high iowait
## Print one set of data after a 5-second interval; -u selects CPU metrics
[root@172-16-104-112 ~]# pidstat -u 5 1
Linux 3.10.0-1127.el7.x86_64 (172-16-104-112) 	2021年12月15日 	_x86_64_	(8 CPU)

22时13分26秒   UID       PID    %usr %system  %guest    %CPU   CPU  Command
22时16分17秒     0     13719    1.99   97.81    0.00   99.80     7  stress-ng-hdd
22时16分17秒     0     13720    0.00   20.68    0.00   20.68     5  stress-ng-io
22时16分17秒     0     13721    0.00   19.88    0.00   19.88     4  stress-ng-io
22时16分17秒     0     13722    0.20   19.88    0.00   20.08     1  stress-ng-io
22时16分17秒     0     13723    0.00   20.87    0.00   20.87     0  stress-ng-io
22时16分17秒     0     13724    0.00   20.28    0.00   20.28     0  stress-ng-io
22时16分17秒     0     13725    0.20   20.28    0.00   20.48     0  stress-ng-io
22时16分17秒     0     13726    0.00   20.87    0.00   20.87     1  stress-ng-io
22时16分17秒     0     13727    0.00   19.68    0.00   19.68     3  stress-ng-io

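Analysis:
The output in c shows CPU usage rising, almost all of it %system in the stress-ng processes, and the output in b shows iowait rising as well (6.48% overall, up to 17.17% on CPU 5). This indicates that the rise in load average is caused by the increase in iowait.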
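
Since pidstat -u reports CPU time rather than waiting, another way to see which processes are blocked on I/O is to list those currently in the D state. A minimal sketch, assuming a procps-style ps; `list_d_state` is a hypothetical helper name, and only the first word of each command name is printed.

```shell
#!/bin/sh
# list_d_state: read "STATE PID COMMAND" lines on stdin and print the PID and
# command of every process in uninterruptible sleep (state starting with D).
list_d_state() {
    awk '$1 ~ /^D/ { print $2, $3 }'
}

# Usage on a live system (separate -o options keep the headers empty):
#   ps -e -o state= -o pid= -o comm= | list_d_state
```

D-state processes come and go quickly, so running this several times while the stress test is active gives a better picture than a single snapshot.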

4. Scenario with a large number of processes

a. Use stress to simulate 24 CPU-bound processes
[root@172-16-104-112 bin]# stress -c 24 --timeout 600
stress: info: [14301] dispatching hogs: 24 cpu, 0 io, 0 vm, 0 hdd

b. Run pidstat to look at the processes
[root@172-16-104-112 ~]# pidstat -u 5 1
Linux 3.10.0-1127.el7.x86_64 (172-16-104-112) 	2021年12月15日 	_x86_64_	(8 CPU)

22时20分04秒   UID       PID    %usr %system  %guest    %CPU   CPU  Command
22时20分09秒     0     13439    0.39    0.58    0.00    0.97     7  top
22时20分09秒     0     14302   32.75    0.00    0.00   32.75     4  stress
22时20分09秒     0     14303   32.75    0.00    0.00   32.75     3  stress
22时20分09秒     0     14304   32.36    0.00    0.00   32.36     7  stress
22时20分09秒     0     14305   32.75    0.00    0.00   32.75     4  stress
22时20分09秒     0     14306   32.75    0.00    0.00   32.75     5  stress
22时20分09秒     0     14307   32.75    0.00    0.00   32.75     0  stress
22时20分09秒     0     14308   31.98    0.00    0.00   31.98     7  stress
22时20分09秒     0     14309   32.56    0.00    0.00   32.56     0  stress
22时20分09秒     0     14310   32.56    0.00    0.00   32.56     5  stress
22时20分09秒     0     14311   32.75    0.00    0.00   32.75     5  stress
22时20分09秒     0     14312   32.75    0.00    0.00   32.75     6  stress
22时20分09秒     0     14313   32.75    0.00    0.00   32.75     2  stress
22时20分09秒     0     14314   32.75    0.00    0.00   32.75     1  stress
22时20分09秒     0     14315   32.56    0.00    0.00   32.56     0  stress
22时20分09秒     0     14316   32.75    0.00    0.00   32.75     2  stress
22时20分09秒     0     14317   32.75    0.00    0.00   32.75     6  stress
22时20分09秒     0     14318   32.75    0.00    0.00   32.75     1  stress
22时20分09秒     0     14319   32.56    0.00    0.00   32.56     1  stress
22时20分09秒     0     14320   32.56    0.00    0.00   32.56     2  stress
22时20分09秒     0     14321   32.75    0.00    0.00   32.75     3  stress
22时20分09秒     0     14322   32.56    0.00    0.00   32.56     4  stress
22时20分09秒     0     14323   32.17    0.00    0.00   32.17     7  stress
22时20分09秒     0     14324   32.56    0.00    0.00   32.56     6  stress
22时20分09秒     0     14325   32.75    0.00    0.00   32.75     3  stress
22时20分09秒     0     14334    0.19    0.58    0.00    0.78     7  pidstat

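Analysis:
24 processes are competing for 8 CPUs, which is why each stress process gets only about 32% CPU.
The default sysstat on CentOS is somewhat old, so pidstat does not show the %wait column here.
After upgrading sysstat to version 11.5.5 or later, from source or RPM, the column is displayed.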
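
The per-process figure in the pidstat output can be checked with a quick calculation. The sketch below assumes identical CPU-bound processes that the scheduler treats fairly; `cpu_share` is a hypothetical helper name.

```shell
#!/bin/sh
# cpu_share NCPU NPROC: expected %CPU per process when NPROC identical
# CPU-bound processes share NCPU CPUs fairly. A single process cannot use
# more than one CPU here, so the result is capped at 100.
cpu_share() {
    awk -v ncpu="$1" -v nproc="$2" 'BEGIN {
        share = 100 * ncpu / nproc
        if (share > 100) share = 100
        printf "%.2f\n", share
    }'
}

cpu_share 8 24   # 33.33, close to the roughly 32.75% reported by pidstat
cpu_share 8 8    # 100.00, the CPU-intensive case from section 2
```

The measured values sit slightly below the ideal 33.33% because other processes (top, pidstat, and the system itself) also take a little CPU time.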