TWSD-676: AVX enhancement for rebooting issues

Review Request #799, created April 16, 2025

Owner: wli
Repository: AVX2
Branch: rel_avx_2_7_4
Bugs: TWSD-676
Reviewers: jasonchang, timlai

Several customers have reported unexpected AVX reboots.
Because debugging tools and logs are missing, we have been unable to identify the root cause or provide a detailed root cause analysis (RCA) to those customers.
This ticket adds the tools and logging needed to track and analyze reboot-related issues.

Added the following tools to record detailed system status:

  1. Kdump for capturing kernel crash dumps.
  2. SAR for monitoring overall system metrics, including CPU, memory, network, and disk usage.
  3. ping_gw.sh script to check node-to-gateway connectivity every 30 seconds.

Test results:

  1. ping_gw test:

[2025-04-16 01:21:31] skip due to gateway not setup
[2025-04-16 01:22:01] skip due to gateway not setup
[2025-04-16 01:22:11] ping to 10.10.50.1 succeeded, latency=0.304 ms
[2025-04-16 01:22:31] skip due to gateway not setup
[2025-04-16 01:22:54] ping to 10.10.50.1 succeeded, latency=0.245 ms
[2025-04-16 01:23:01] skip due to gateway not setup
[2025-04-16 01:23:29] ping to 10.10.50.1 succeeded, latency=0.198 ms
[2025-04-16 01:23:31] skip due to gateway not setup
[2025-04-16 01:24:01] skip due to gateway not setup
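
When collecting these logs for an RCA, a short summary can be easier to read than the raw lines. The following is a minimal Python sketch, not part of this change, that counts succeeded and skipped checks and averages the reported latency. The function name `summarize_ping_log` is hypothetical, and the two line patterns are taken only from the excerpt above; the real ping_gw.sh may emit other messages (for example on ping failure), which this sketch counts as "other".

```python
import re
from statistics import mean

# Line shapes copied from the log excerpt above (assumption: no other formats matter here).
OK_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] ping to (?P<gw>\S+) succeeded, latency=(?P<ms>[\d.]+) ms$"
)
SKIP_RE = re.compile(r"^\[(?P<ts>[^\]]+)\] skip due to gateway not setup$")


def summarize_ping_log(lines):
    """Count succeeded/skipped checks and average the reported latency in ms."""
    latencies = []
    skipped = 0
    other = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        ok = OK_RE.match(line)
        if ok:
            latencies.append(float(ok.group("ms")))
        elif SKIP_RE.match(line):
            skipped += 1
        else:
            other += 1  # unknown line shape, e.g. a failure message
    return {
        "succeeded": len(latencies),
        "skipped": skipped,
        "other": other,
        "avg_latency_ms": mean(latencies) if latencies else None,
    }


sample = [
    "[2025-04-16 01:22:01] skip due to gateway not setup",
    "[2025-04-16 01:22:11] ping to 10.10.50.1 succeeded, latency=0.304 ms",
    "[2025-04-16 01:22:54] ping to 10.10.50.1 succeeded, latency=0.245 ms",
    "[2025-04-16 01:23:29] ping to 10.10.50.1 succeeded, latency=0.198 ms",
]
print(summarize_ping_log(sample))
```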

  2. SAR:

AVX#ls /var/log/sa/
sa15 sa16 sar15

AVX#sar -u -f /var/log/sa/sa16
Linux 3.10.0-327.28.2.11.el7.x86_64 (O7Cvxy1NC3) 04/16/2025 x86_64 (20 CPU)

12:00:01 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
12:10:01 AM     all      0.01      0.00      0.46      0.01      0.00     99.51
12:20:01 AM     all      0.01      0.00      0.47      0.01      0.00     99.51
12:30:01 AM     all      0.01      0.00      0.47      0.01      0.00     99.51
Average:        all      0.01      0.00      0.47      0.01      0.00     99.51

12:33:38 AM LINUX RESTART

12:40:01 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
12:50:01 AM     all      0.01      0.00      0.46      0.01      0.00     99.51
01:00:01 AM     all      0.01      0.00      0.46      0.01      0.00     99.51
Average:        all      0.01      0.00      0.46      0.01      0.00     99.51

01:07:18 AM LINUX RESTART

01:08:01 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
01:09:01 AM     all      0.25      0.00      0.81      0.02      0.00     98.92
01:10:01 AM     all      0.03      0.00      0.44      0.01      0.00     99.51
01:11:01 AM     all      0.02      0.00      0.45      0.02      0.00     99.51
01:12:01 AM     all      0.01      0.00      0.47      0.01      0.00     99.52
01:13:01 AM     all      0.01      0.00      0.45      0.01      0.00     99.54
01:14:01 AM     all      0.01      0.00      0.47      0.03      0.00     99.49
01:15:01 AM     all      0.01      0.00      0.49      0.01      0.00     99.49
01:16:01 AM     all      0.01      0.00      0.48      0.01      0.00     99.50
01:17:01 AM     all      0.01      0.00      0.47      0.06      0.00     99.46
01:18:01 AM     all      0.01      0.00      0.46      0.01      0.00     99.52
01:19:01 AM     all      0.01      0.00      0.45      0.01      0.00     99.53
01:20:01 AM     all      0.02      0.00      0.46      0.01      0.00     99.51
01:21:01 AM     all      0.02      0.00      0.46      0.01      0.00     99.51
01:22:01 AM     all      0.01      0.00      0.43      0.01      0.00     99.54
01:23:01 AM     all      0.01      0.00      0.48      0.01      0.00     99.50
01:24:01 AM     all      0.02      0.00      0.46      0.01      0.00     99.51
01:25:01 AM     all      0.01      0.00      0.47      0.01      0.00     99.51
01:26:01 AM     all      0.02      0.00      0.44      0.01      0.00     99.53
01:27:01 AM     all      0.01      0.00      0.46      0.02      0.00     99.50
01:28:01 AM     all      0.01      0.00      0.49      0.01      0.00     99.49
01:29:01 AM     all      0.01      0.00      0.46      0.01      0.00     99.51
01:30:01 AM     all      0.01      0.00      0.48      0.03      0.00     99.48
01:31:01 AM     all      0.02      0.00      0.47      0.02      0.00     99.49
01:32:01 AM     all      0.01      0.00      0.45      0.01      0.00     99.52
01:33:01 AM     all      0.01      0.00      0.46      0.01      0.00     99.52
01:34:01 AM     all      0.01      0.00      0.47      0.01      0.00     99.51
01:35:01 AM     all      0.01      0.00      0.47      0.01      0.00     99.50
01:36:01 AM     all      0.02      0.00      0.46      0.01      0.00     99.52
01:37:01 AM     all      0.01      0.00      0.49      0.01      0.00     99.49
01:38:01 AM     all      0.01      0.00      0.47      0.01      0.00     99.50
01:39:01 AM     all      0.01      0.00      0.47      0.01      0.00     99.51
01:40:01 AM     all      0.01      0.00      0.48      0.01      0.00     99.50
Average:        all      0.02      0.00      0.48      0.01      0.00     99.49
AVX#
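
The `LINUX RESTART` markers in the `sar -u` output are the entries most relevant to a reboot RCA. As a rough illustration, assuming the text output of `sar -u -f <file>` has been saved to a string, the Python sketch below pairs each restart marker with the timestamp of the last `all` sample recorded before it. The function name `find_restarts` is hypothetical and the patterns are based only on the output shown above.

```python
import re

# A restart marker: "<time> [AM|PM] LINUX RESTART"
RESTART_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}(?: [AP]M)?)\s+LINUX RESTART")
# A data row for all CPUs: "<time> [AM|PM] all <values...>"
SAMPLE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}(?: [AP]M)?)\s+all\s")


def find_restarts(sar_text):
    """Return (restart_time, last_sample_time_before_restart) tuples."""
    restarts = []
    last_sample = None
    for raw in sar_text.splitlines():
        line = raw.strip()
        restart = RESTART_RE.match(line)
        if restart:
            restarts.append((restart.group(1), last_sample))
            continue
        sample = SAMPLE_RE.match(line)
        if sample:
            last_sample = sample.group(1)
    return restarts


sar_excerpt = """\
12:00:01 AM CPU %user %nice %system %iowait %steal %idle
12:30:01 AM all 0.01 0.00 0.47 0.01 0.00 99.51
Average: all 0.01 0.00 0.47 0.01 0.00 99.51

12:33:38 AM LINUX RESTART

12:40:01 AM CPU %user %nice %system %iowait %steal %idle
01:00:01 AM all 0.01 0.00 0.46 0.01 0.00 99.51
Average: all 0.01 0.00 0.46 0.01 0.00 99.51

01:07:18 AM LINUX RESTART
"""

for restart_time, last_sample in find_restarts(sar_excerpt):
    print(f"restart at {restart_time}, last sample before it at {last_sample}")
```

The gap between the last sample and the restart marker bounds the window in which the reboot happened, which helps when correlating with the ping_gw log or a kdump vmcore.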

  3. Kdump:

AVX#systemctl status kdump
● kdump.service - Crash recovery kernel arming
   Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled)
   Active: active (exited) since Wed 2025-04-16 01:48:10 CST; 3s ago
  Process: 11603 ExecStart=/usr/bin/kdumpctl start (code=exited, status=0/SUCCESS)
 Main PID: 11603 (code=exited, status=0/SUCCESS)

Apr 16 01:48:10 O7Cvxy1NC3 dracut[11987]: drwxr-xr-x 3 root root 0 Apr 16 01:47 usr/share/zoneinfo
Apr 16 01:48:10 O7Cvxy1NC3 dracut[11987]: drwxr-xr-x 2 root root 0 Apr 16 01:47 usr/share/zoneinfo/Asia
Apr 16 01:48:10 O7Cvxy1NC3 dracut[11987]: -rw-r--r-- 1 root root 388 Oct 8 2015 usr/share/zoneinfo/Asia/Shanghai
Apr 16 01:48:10 O7Cvxy1NC3 dracut[11987]: drwxr-xr-x 2 root root 0 Apr 16 01:47 var
Apr 16 01:48:10 O7Cvxy1NC3 dracut[11987]: lrwxrwxrwx 1 root root 11 Apr 16 01:47 var/lock -> ../run/lock
Apr 16 01:48:10 O7Cvxy1NC3 dracut[11987]: lrwxrwxrwx 1 root root 6 Apr 16 01:47 var/run -> ../run
Apr 16 01:48:10 O7Cvxy1NC3 dracut[11987]: ========================================================================
Apr 16 01:48:10 O7Cvxy1NC3 kdumpctl[11603]: kexec: loaded kdump kernel
Apr 16 01:48:10 O7Cvxy1NC3 kdumpctl[11603]: Starting kdump: [OK]
Apr 16 01:48:10 O7Cvxy1NC3 systemd[1]: Started Crash recovery kernel arming.
AVX#
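
Besides `systemctl status kdump`, the kernel itself reports whether a crash kernel is loaded through the sysfs file `/sys/kernel/kexec_crash_loaded` (`1` when loaded, `0` otherwise). The Python sketch below reads that file as an additional check; it is an illustration only and not part of this change. The function name `crash_kernel_loaded` is hypothetical, and whether Python is available on the AVX device is an assumption.

```python
from pathlib import Path


def crash_kernel_loaded(sysfs_file="/sys/kernel/kexec_crash_loaded"):
    """Return True if a crash (kdump) kernel is loaded, False if not,
    or None if the sysfs file cannot be read (for example on a non-Linux host)."""
    try:
        return Path(sysfs_file).read_text().strip() == "1"
    except OSError:
        return None


state = crash_kernel_loaded()
if state is None:
    print("kexec_crash_loaded not readable on this system")
elif state:
    print("crash kernel is loaded; kdump is armed")
else:
    print("crash kernel is NOT loaded; a panic would not produce a vmcore")
```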
