Published at 2024-02-15 | Last Update 2024-02-15
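Notes on the CPU hardware fundamentals and kernel subsystems that matter for Linux server performance.
My knowledge is limited, so the text may contain errors or outdated content; please use it with discretion.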
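Potential risks of idle=poll
5.15 kernel documentation “CPU Idle Time Management”
5.15 kernel documentation “NO_HZ: Reducing Scheduling-Clock Ticks”
5.15 kernel documentation “AMD64 Specific Boot Options”
As covered earlier, idle=poll forces the processor to stay in C0 so that it keeps its highest performance.
However, the kernel documentation warns in several places that this setting is risky; this section collects those warnings.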
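Before weighing the risks, it helps to know whether a machine was actually booted with this parameter. The following is a minimal Python sketch that inspects the kernel command line in /proc/cmdline. The function names are made up for illustration, and reporting the last occurrence when idle= appears more than once is an assumption about precedence, not something the kernel documentation quoted here states.

```python
from typing import Optional


def idle_override(cmdline: str) -> Optional[str]:
    """Return the value of the idle= parameter on a kernel command line, or None.

    Assumption: if idle= appears several times, the last one is reported.
    """
    value = None
    for token in cmdline.split():
        if token.startswith("idle="):
            value = token[len("idle="):]
    return value


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the running kernel's command line; return "" if it cannot be read."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read().strip()
    except OSError:
        return ""


if __name__ == "__main__":
    mode = idle_override(read_cmdline())
    if mode == "poll":
        print("idle=poll is set: idle CPUs spin in C0 instead of entering C-states")
    else:
        print(f"idle override: {mode!r}")
```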
5.15 kernel documentation “CPU Idle Time Management”
[Note that using ``idle=poll`` is somewhat drastic in many cases, as preventing idle
CPUs from saving almost any energy at all may not be the only effect of it.
For example, on Intel hardware it effectively prevents CPUs from using
P-states (see |cpufreq|) that require any number of CPUs in a package to be
idle, so it very well may hurt single-thread computations performance as well as
energy-efficiency. Thus using it for performance reasons may not be a good idea
at all.]
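This passage is rather dense. Building on the earlier posts in this series, here is a plain-language reading:
Besides high power consumption, idle=poll has other consequences. For example, on Intel hardware it keeps CPUs out of the P-states that require some CPUs in the package to be idle (i.e. the turbo frequencies), which can hurt single-thread performance as well as energy efficiency.
Also, this document was written by Intel engineers, but anyone who has read how turbo (frequency boosting) works will see that the problem is not limited to Intel CPUs.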
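One rough way to observe the effect described above is to compare each core's current frequency with its base frequency while a single-threaded load is running. The sketch below reads the cpufreq sysfs files scaling_cur_freq and base_frequency. It assumes the base_frequency attribute is present, which as far as I know is only the case with the intel_pstate driver on some systems; cores without it are skipped. The function names are illustrative.

```python
from pathlib import Path
from typing import Dict, Optional

CPU_ROOT = Path("/sys/devices/system/cpu")


def read_khz(path: Path) -> Optional[int]:
    """Read a frequency in kHz from a sysfs file; None if missing or malformed."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def above_base(cur_khz: int, base_khz: int) -> bool:
    """True if a core currently runs above its base (non-turbo) frequency."""
    return cur_khz > base_khz


def turbo_report(root: Path = CPU_ROOT) -> Dict[str, bool]:
    """Map each CPU name (e.g. "cpu0") to whether it is currently above base."""
    report = {}
    for cpufreq in sorted(root.glob("cpu[0-9]*/cpufreq")):
        cur = read_khz(cpufreq / "scaling_cur_freq")
        base = read_khz(cpufreq / "base_frequency")
        if cur is None or base is None:
            continue  # attribute not exposed by this driver; skip the core
        report[cpufreq.parent.name] = above_base(cur, base)
    return report


if __name__ == "__main__":
    for cpu, boosted in turbo_report().items():
        print(f"{cpu}: {'above base' if boosted else 'at or below base'}")
```

If no core ever shows up as above base under a single-threaded load on a machine booted with idle=poll, but some do after removing the parameter, that would be consistent with the documentation's warning. This is a spot check, not a benchmark.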
5.15 kernel documentation “NO_HZ: Reducing Scheduling-Clock Ticks”
NO_HZ: Reducing Scheduling-Clock Ticks:
Known Issues
d. On x86 systems, use the "idle=poll" boot parameter.
However, please note that use of this parameter can cause
your CPU to overheat, which may cause thermal throttling
to degrade your latencies -- and that this degradation can
be even worse than that of dyntick-idle. Furthermore,
this parameter effectively disables Turbo Mode on Intel
CPUs, which can significantly reduce maximum performance.
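This is filed under Known Issues and is stated much more clearly than in the previous document:
idle=poll effectively disables Intel Turbo Mode, so cores cannot boost above the base frequency and peak performance drops significantly.
5.15 kernel documentation “AMD64 Specific Boot Options”
This is the boot-parameter reference. It uses Intel CPUs as the example, but the problem is not limited to Intel; many parameters and features currently in use on AMD are simply not covered in this document. The relevant entry: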
Idle loop
=========
idle=poll
Don't do power saving in the idle loop using HLT, but poll for rescheduling
event. This will make the CPUs eat a lot more power, but may be useful
to get slightly better performance in multiprocessor benchmarks. It also
makes some profiling using performance counters more accurate.
Please note that on systems with MONITOR/MWAIT support (like Intel EM64T
CPUs) this option has no performance advantage over the normal idle loop.
It may also interact badly with hyperthreading.
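In short, idle=poll can improve multiprocessor benchmark results in some scenarios and can make some profiling more accurate; on platforms with MONITOR/MWAIT support, however, it brings no performance gain.
Dell Whitepaper: Controlling Processor C-State Usage in Linux
A technical whitepaper from the server vendor Dell. One passage reads: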
If a user wants the absolute minimum latency, kernel parameter “idle=poll” can be used to keep the
processors in C0 even when they are idle (the processors will run in a loop when idle, constantly
checking to see if they are needed). If this kernel parameter is used, it should not be necessary to
disable C-states in BIOS (or use the “idle=halt” kernel parameter).
Take care when keeping processors in C0, though--this will increase power usage considerably.
Also, hyperthreading should probably be
disabled, as keeping processors in C0 can interfere with proper operation of logical cores
(hyperthreading). (The hyperthreading hardware works best when it knows when the logical processors
are idle, and it doesn’t know that if processors are kept busy in a loop when they are not running
useful code.)
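How the hyperthreading hardware works: as the whitepaper notes, it performs best when it knows which logical processors are idle, and a polling idle loop hides that information.
A user reported that idle=poll combined with hyperthreading significantly degraded concurrent performance. Linus replied:
I really don’t think you should really ever use “idle=poll” on HT-enabled hardware,
(HT is short for hyperthreading.)
It appears that idle=poll conflicts with how turbo frequency and hyperthreading work.
Scenarios and test cases are still needed to verify this. Experts with hands-on experience are welcome to share.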
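As a starting point for such verification, the combination that the whitepaper and Linus both warn about can at least be detected. The sketch below reads /sys/devices/system/cpu/smt/active, which recent kernels expose, and checks the kernel command line for idle=poll. The function names are illustrative, and the check only flags the configuration; it does not measure any performance impact.

```python
from typing import Optional


def smt_active(path: str = "/sys/devices/system/cpu/smt/active") -> Optional[bool]:
    """Return True/False for the kernel's SMT state, or None if it is unknown."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read().strip() == "1"
    except OSError:
        return None


def risky_combo(cmdline: str, smt: Optional[bool]) -> bool:
    """True when idle=poll is on the command line and SMT is known to be active."""
    return "idle=poll" in cmdline.split() and smt is True


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the running kernel's command line; return "" if it cannot be read."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read().strip()
    except OSError:
        return ""


if __name__ == "__main__":
    if risky_combo(read_cmdline(), smt_active()):
        print("idle=poll with hyperthreading enabled: sibling threads may be slowed")
    else:
        print("idle=poll + hyperthreading combination not detected")
```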
An HP machine:
$ dmesg -T
kernel: ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20180810/exfield-393)
kernel: ACPI Error: Method parse/execution failed \_SB.PMI0._PMM, AE_AML_BUFFER_LIMIT (20180810/psparse-516)
kernel: ACPI Error: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20180810/power_meter-338)
...
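This happens because HP's BIOS implementation does not follow the specification. The error has no real effect on hardware performance or the like, but the log volume can be large: a dozen or so lines per minute, continuously.
A Lenovo machine: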
$ dmesg -T
kernel: power_meter ACPI000D:00: Found ACPI power meter.
kernel: power_meter ACPI000D:00: Found ACPI power meter.
...
If a k8s node hits the problems above, the cause may be a deployed prometheus/node_exporter [2]; try disabling its hwmon collector.
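To judge how noisy a node is before and after changing the exporter, the two kinds of messages shown above can be counted. The sketch below is a small Python helper that tallies them in dmesg output; the function name and category labels are made up for illustration, and the patterns are taken only from the sample lines in this post. On the exporter side, node_exporter disables a default collector with a --no-collector.<name> flag, so the hwmon one would be --no-collector.hwmon.

```python
import re
from collections import Counter

# Patterns derived from the sample log lines above; other firmware may differ.
PATTERNS = {
    "acpi_error": re.compile(r"ACPI Error:"),
    "power_meter_found": re.compile(r"power_meter ACPI000D:\d+: Found ACPI power meter"),
}


def count_power_meter_noise(dmesg_text: str) -> Counter:
    """Count ACPI error lines and repeated power-meter discovery lines."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                break  # each line is counted in at most one category
    return counts
```

A possible usage is to save `dmesg -T` output to a file at two points in time and compare the counts returned for each.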