TWSD-676: AVX enhancement for rebooting issues
Review Request #799 — Created April 16, 2025 and updated
| Information | |
|---|---|
| wli | |
| AVX2 | |
| rel_avx_2_7_4 | |
| TWSD-676 | |
| Reviewers | jasonchang, timlai |
We've received reports from several customers experiencing AVX reboot issues.
However, due to the lack of debugging tools and logs, we've been unable to identify the root cause or provide a detailed RCA to those customers.
This ticket adds the tools and logging mechanisms needed to track and analyze reboot-related issues.

Added the following tools to record detailed system status:
- Kdump for capturing kernel crash dumps.
- SAR for monitoring overall system metrics, including CPU, memory, network, and disk usage.
- ping_gw.sh script to check node-to-gateway connectivity every 30 seconds.
- ping_gw test:
[2025-04-16 01:21:31] skip due to gateway not setup
[2025-04-16 01:22:01] skip due to gateway not setup
[2025-04-16 01:22:11] ping to 10.10.50.1 succeeded, latency=0.304 ms
[2025-04-16 01:22:31] skip due to gateway not setup
[2025-04-16 01:22:54] ping to 10.10.50.1 succeeded, latency=0.245 ms
[2025-04-16 01:23:01] skip due to gateway not setup
[2025-04-16 01:23:29] ping to 10.10.50.1 succeeded, latency=0.198 ms
[2025-04-16 01:23:31] skip due to gateway not setup
[2025-04-16 01:24:01] skip due to gateway not setup
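A periodic check that produces log lines in the format above could be sketched roughly as follows. This is not the actual ping_gw.sh from this change: the function names, the use of `ip route` to find the gateway, the 2-second ping timeout, the `--loop` flag, and the "failed" message are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of a gateway connectivity check; NOT the real ping_gw.sh.

# Print the default gateway found in `ip route` output read on stdin,
# or nothing if no default route is present.
default_gw() {
    awk '$1 == "default" && $2 == "via" { print $3; exit }'
}

# Extract the round-trip time (in ms) from ping output read on stdin.
ping_latency() {
    sed -n 's/.*time=\([0-9.]*\) *ms.*/\1/p' | head -n 1
}

# Run one check and print a timestamped result line.
check_once() {
    ts=$(date '+%Y-%m-%d %H:%M:%S')
    gw=$(ip route show default 2>/dev/null | default_gw)
    if [ -z "$gw" ]; then
        echo "[$ts] skip due to gateway not setup"
        return 0
    fi
    if out=$(ping -c 1 -W 2 "$gw" 2>/dev/null); then
        lat=$(printf '%s\n' "$out" | ping_latency)
        echo "[$ts] ping to $gw succeeded, latency=$lat ms"
    else
        # Assumed wording; the real script's failure message is not shown here.
        echo "[$ts] ping to $gw failed"
    fi
}

# Loop only when asked to, so the functions above can be sourced on their own.
if [ "${1:-}" = "--loop" ]; then
    while true; do
        check_once
        sleep 30
    done
fi
```

Keeping the parsing in small functions that read from stdin makes it possible to check them against captured `ip route` and `ping` output without touching the network.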
- SAR:
AVX#ls /var/log/sa/
sa15 sa16 sar15
AVX#sar -u -f /var/log/sa/sa16
Linux 3.10.0-327.28.2.11.el7.x86_64 (O7Cvxy1NC3) 04/16/2025 x86_64 (20 CPU)
12:00:01 AM CPU %user %nice %system %iowait %steal %idle
12:10:01 AM all 0.01 0.00 0.46 0.01 0.00 99.51
12:20:01 AM all 0.01 0.00 0.47 0.01 0.00 99.51
12:30:01 AM all 0.01 0.00 0.47 0.01 0.00 99.51
Average: all 0.01 0.00 0.47 0.01 0.00 99.51
12:33:38 AM LINUX RESTART
12:40:01 AM CPU %user %nice %system %iowait %steal %idle
12:50:01 AM all 0.01 0.00 0.46 0.01 0.00 99.51
01:00:01 AM all 0.01 0.00 0.46 0.01 0.00 99.51
Average: all 0.01 0.00 0.46 0.01 0.00 99.51
01:07:18 AM LINUX RESTART
01:08:01 AM CPU %user %nice %system %iowait %steal %idle
01:09:01 AM all 0.25 0.00 0.81 0.02 0.00 98.92
01:10:01 AM all 0.03 0.00 0.44 0.01 0.00 99.51
01:11:01 AM all 0.02 0.00 0.45 0.02 0.00 99.51
01:12:01 AM all 0.01 0.00 0.47 0.01 0.00 99.52
01:13:01 AM all 0.01 0.00 0.45 0.01 0.00 99.54
01:14:01 AM all 0.01 0.00 0.47 0.03 0.00 99.49
01:15:01 AM all 0.01 0.00 0.49 0.01 0.00 99.49
01:16:01 AM all 0.01 0.00 0.48 0.01 0.00 99.50
01:17:01 AM all 0.01 0.00 0.47 0.06 0.00 99.46
01:18:01 AM all 0.01 0.00 0.46 0.01 0.00 99.52
01:19:01 AM all 0.01 0.00 0.45 0.01 0.00 99.53
01:20:01 AM all 0.02 0.00 0.46 0.01 0.00 99.51
01:21:01 AM all 0.02 0.00 0.46 0.01 0.00 99.51
01:22:01 AM all 0.01 0.00 0.43 0.01 0.00 99.54
01:23:01 AM all 0.01 0.00 0.48 0.01 0.00 99.50
01:24:01 AM all 0.02 0.00 0.46 0.01 0.00 99.51
01:25:01 AM all 0.01 0.00 0.47 0.01 0.00 99.51
01:26:01 AM all 0.02 0.00 0.44 0.01 0.00 99.53
01:27:01 AM all 0.01 0.00 0.46 0.02 0.00 99.50
01:28:01 AM all 0.01 0.00 0.49 0.01 0.00 99.49
01:29:01 AM all 0.01 0.00 0.46 0.01 0.00 99.51
01:30:01 AM all 0.01 0.00 0.48 0.03 0.00 99.48
01:31:01 AM all 0.02 0.00 0.47 0.02 0.00 99.49
01:32:01 AM all 0.01 0.00 0.45 0.01 0.00 99.52
01:33:01 AM all 0.01 0.00 0.46 0.01 0.00 99.52
01:34:01 AM all 0.01 0.00 0.47 0.01 0.00 99.51
01:35:01 AM all 0.01 0.00 0.47 0.01 0.00 99.50
01:36:01 AM all 0.02 0.00 0.46 0.01 0.00 99.52
01:37:01 AM all 0.01 0.00 0.49 0.01 0.00 99.49
01:38:01 AM all 0.01 0.00 0.47 0.01 0.00 99.50
01:39:01 AM all 0.01 0.00 0.47 0.01 0.00 99.51
01:40:01 AM all 0.01 0.00 0.48 0.01 0.00 99.50
Average: all 0.02 0.00 0.48 0.01 0.00 99.49
AVX#
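For reboot analysis, the useful part of the SAR output is usually the last sample recorded before each LINUX RESTART marker. A minimal sketch of pulling that out with awk is below; the function name `last_sample_before_restart` is hypothetical and not part of this change, and the filter assumes the plain-text `sar -u` layout shown above.

```shell
# Hypothetical helper: for each "LINUX RESTART" marker in sar output on stdin,
# print the restart time and the last data sample seen before it.
last_sample_before_restart() {
    awk '
        /LINUX RESTART/ {
            t = $0
            sub(/[ \t]*LINUX RESTART.*/, "", t)   # keep only the timestamp
            print "restart at " t ", last sample: " (last == "" ? "none" : last)
            last = ""
            next
        }
        # Skip the banner, column headers, averages, and blank lines.
        /^Linux/ || /%idle/ || /^Average:/ || NF == 0 { next }
        { last = $0 }
    '
}
```

It would be used as `sar -u -f /var/log/sa/sa16 | last_sample_before_restart`. A large gap between the last sample and the restart time, or a sample with unusual load, gives a starting point for the RCA.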
- Kdump:
AVX#
AVX#systemctl status kdump
● kdump.service - Crash recovery kernel arming
Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled)
Active: active (exited) since Wed 2025-04-16 01:48:10 CST; 3s ago
Process: 11603 ExecStart=/usr/bin/kdumpctl start (code=exited, status=0/SUCCESS)
Main PID: 11603 (code=exited, status=0/SUCCESS)
Apr 16 01:48:10 O7Cvxy1NC3 dracut[11987]: drwxr-xr-x 3 root root 0 Apr 16 01:47 usr/share/zoneinfo
Apr 16 01:48:10 O7Cvxy1NC3 dracut[11987]: drwxr-xr-x 2 root root 0 Apr 16 01:47 usr/share/zoneinfo/Asia
Apr 16 01:48:10 O7Cvxy1NC3 dracut[11987]: -rw-r--r-- 1 root root 388 Oct 8 2015 usr/share/zoneinfo/Asia/Shanghai
Apr 16 01:48:10 O7Cvxy1NC3 dracut[11987]: drwxr-xr-x 2 root root 0 Apr 16 01:47 var
Apr 16 01:48:10 O7Cvxy1NC3 dracut[11987]: lrwxrwxrwx 1 root root 11 Apr 16 01:47 var/lock -> ../run/lock
Apr 16 01:48:10 O7Cvxy1NC3 dracut[11987]: lrwxrwxrwx 1 root root 6 Apr 16 01:47 var/run -> ../run
Apr 16 01:48:10 O7Cvxy1NC3 dracut[11987]: ========================================================================
Apr 16 01:48:10 O7Cvxy1NC3 kdumpctl[11603]: kexec: loaded kdump kernel
Apr 16 01:48:10 O7Cvxy1NC3 kdumpctl[11603]: Starting kdump: [OK]
Apr 16 01:48:10 O7Cvxy1NC3 systemd[1]: Started Crash recovery kernel arming.
AVX#
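Beyond `systemctl status`, whether kdump is actually armed can be checked from a script. The sketch below is an illustration only, not part of this change: the function names are hypothetical, and it assumes a Linux system that exposes `/proc/cmdline` and `/sys/kernel/kexec_crash_loaded`.

```shell
# Hypothetical helper: succeed if the kernel command line on stdin
# reserves memory for a crash kernel.
has_crashkernel() {
    grep -Eq '(^| )crashkernel='
}

# Hypothetical helper: report whether kdump looks ready to capture a crash.
kdump_ready() {
    if ! has_crashkernel < /proc/cmdline; then
        echo "crashkernel= not set on kernel command line"
        return 1
    fi
    # The kernel reports 1 here once a crash kernel has been loaded via kexec.
    if [ "$(cat /sys/kernel/kexec_crash_loaded 2>/dev/null)" != "1" ]; then
        echo "crash kernel not loaded"
        return 1
    fi
    if ! systemctl is-active --quiet kdump; then
        echo "kdump.service not active"
        return 1
    fi
    echo "kdump armed"
}
```

Calling `kdump_ready` after boot gives a one-line answer that can be logged alongside the other checks.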
