Bug 1177800 - [Ten64] getsysinfo caused kernel error (synchronous external abort)
Status: NEW
Classification: openSUSE
Product: openSUSE Tumbleweed
Component: Kernel
Version: Current
Hardware: aarch64 openSUSE Tumbleweed
Priority: P5 - None
Severity: Major
Assigned To: openSUSE Kernel Bugs
Reported: 2020-10-16 12:54 UTC by Andreas Färber
Modified: 2022-07-08 05:40 UTC (History)
Description Andreas Färber 2020-10-16 12:54:10 UTC
Running `getsysinfo` (via ssh) on Tumbleweed 20201011 with kernel-default 5.9.0 from Kernel:HEAD repository caused a kernel error, with ssh getting stuck and reconnections failing. Serial login still worked.

zehn:~ # getsysinfo
/proc/bus/input
/proc/cpuinfo
/proc/device-tree
/proc/devices
/proc/fb
/proc/filesystems
/proc/interrupts
/proc/iomem
/proc/ioports
/proc/meminfo
/proc/modules
/proc/net/dev
/proc/partitions
/proc/scsi
/proc/tty
/proc/version
/sys
/usr/sbin/getsysinfo: line 23:  2761 Segmentation fault      cp -x -a --parents "$i" "$dir/$host" 2> /dev/null
/var/lib/hardware/udi
/proc/mounts

System data written to: /tmp/zehn.tar.gz
zehn:~ # 


[  544.956478] Internal error: synchronous external abort: 96000210 [#1] SMP
[  544.963295] Modules linked in: af_packet ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter xt_tcpudp ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security iscsi_ibft iscsi_boot_sysfs ip_set nfnetlink ebtable_filter ebtables rfkill ip6table_filter ip6_tables iptable_filter ip_tables x_tables fsl_dpaa2_ptp ptp_qoriq fsl_dpaa2_eth phylink xgmac_mdio hid_generic fsl_mc_dpio usbhid cdc_acm i2c_mux_pca954x i2c_mux tpm_i2c_atmel spi_fsl_qspi qoriq_thermal leds_gpio optee tee uio_pdrv_genirq uio qoriq_cpufreq nls_iso8859_1 nls_cp437 vfat fat drm xhci_plat_hcd xhci_hcd usbcore caam_jr mmc_block libdes authenc caamhash_desc caamalg_desc crypto_engine rtc_ds1307 mp886x aes_ce_blk crypto_simd cryptd aes_ce_cipher crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce dpaa2_console nvme nvme_core dwc3 sdhci_of_esdhc caam sdhci_pltfm ulpi
[  544.963547]  error sdhci udc_core roles mmc_core i2c_imx btrfs blake2b_generic libcrc32c xor xor_neon raid6_pq sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua
[  545.066303] CPU: 0 PID: 2761 Comm: cp Not tainted 5.9.0-1.g11733e1-default #1 openSUSE Tumbleweed (unreleased)
[  545.076303] Hardware name: traverse ten64/ten64, BIOS 2020.07-rc1-g1d4a3d9d5c 07/28/2020
[  545.084393] pstate: 60000085 (nZCv daIf -PAN -UAO BTYPE=--)
[  545.089967] pc : dw_pcie_read+0x48/0xc0
[  545.093799] lr : dw_pcie_access_other_conf.isra.0+0xc4/0x120
[  545.099452] sp : ffff800012203bf0
[  545.102759] x29: ffff800012203bf0 x28: 0000000000000400 
[  545.108067] x27: ffff00833e683400 x26: ffff008348240000 
[  545.113375] x25: ffff800010085000 x24: 0000000000000000 
[  545.118683] x23: ffff00834b239880 x22: ffff800012203cc4 
[  545.123991] x21: 0000000000000004 x20: 0000000000000400 
[  545.129299] x19: ffff00834b2398a8 x18: 0000000000000000 
[  545.134607] x17: 0000000000000000 x16: ffffafadaf1be170 
[  545.139915] x15: 0000000000000000 x14: 0000000000000000 
[  545.145222] x13: 0000000000000000 x12: 0000000000000040 
[  545.150530] x11: ffff00834bd8c920 x10: 0000000000000000 
[  545.155837] x9 : ffffafadaf12e8f4 x8 : 0000000002080000 
[  545.161145] x7 : 0000000000000000 x6 : ffff800010f00000 
[  545.166452] x5 : 0000000000000000 x4 : 0000000000000908 
[  545.171759] x3 : 0000000000000003 x2 : ffff800012203cc4 
[  545.177066] x1 : 0000000000000004 x0 : ffff800010085400 
[  545.182374] Call trace:
[  545.184816]  dw_pcie_read+0x48/0xc0
[  545.188298]  dw_pcie_rd_conf+0x11c/0x150
[  545.192217]  pci_user_read_config_dword+0xa8/0x190
[  545.197004]  pci_read_config+0x1f8/0x264
[  545.200923]  sysfs_kf_bin_read+0x78/0xa0
[  545.204840]  kernfs_file_direct_read+0x90/0x220
[  545.209365]  kernfs_fop_read+0x44/0x50
[  545.213111]  vfs_read+0xb8/0x1e4
[  545.216334]  ksys_read+0x78/0x110
[  545.219643]  __arm64_sys_read+0x28/0x34
[  545.223476]  el0_svc_common.constprop.0+0x84/0x230
[  545.228261]  do_el0_svc+0x30/0xa0
[  545.231572]  el0_svc+0x18/0x50
[  545.234621]  el0_sync_handler+0x90/0x254
[  545.238538]  el0_sync+0x158/0x180
[  545.241849] Code: 528010e0 d50323bf b900005f d65f03c0 (b9400001) 
[  545.247941] ---[ end trace 68383e7eecaae870 ]---
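
For readers unfamiliar with the number printed after "synchronous external abort", the following is a minimal sketch of how that ESR value can be split into its fields. The field positions (EC in bits 31:26, ISS in bits 24:0, with DFSC in ISS bits 5:0 and WnR in ISS bit 6 for data aborts) follow the Arm architecture definition of the exception syndrome register; the function name `decode_esr` is only illustrative and not part of any existing tool.

```shell
#!/bin/bash
# Sketch: decode a few fields of an arm64 ESR value as printed in the
# "Internal error: ...: <esr>" line. decode_esr is a hypothetical helper.
decode_esr() {
    local esr=$(( 0x$1 ))                # hex string without the 0x prefix
    local ec=$(( (esr >> 26) & 0x3f ))   # exception class, bits [31:26]
    local iss=$(( esr & 0x1ffffff ))     # instruction specific syndrome, bits [24:0]
    local dfsc=$(( iss & 0x3f ))         # data fault status code, ISS bits [5:0]
    local wnr=$(( (iss >> 6) & 1 ))      # write-not-read, ISS bit 6 (0 means read)
    printf 'EC=0x%02x DFSC=0x%02x WnR=%d\n' "$ec" "$dfsc" "$wnr"
}

decode_esr 96000210
```

For the value in this trace, EC 0x25 is a data abort taken without a change in exception level (that is, in the kernel), DFSC 0x10 is a synchronous external abort that is not on a translation table walk, and WnR 0 indicates a read, which matches the config-space read in the call trace.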
Comment 1 Andreas Färber 2020-10-16 13:15:11 UTC
(In reply to Andreas Färber from comment #0)
> Running `getsysinfo` (via ssh) on Tumbleweed 20201011 with kernel-default
> 5.9.0 from Kernel:HEAD repository caused a kernel error, with ssh getting
> stuck and reconnections failing. Serial login still worked.

Correction: pressing Enter on the serial console did give me a login prompt, but the login timed out.
After reset and reboot the /tmp tarball was gone.

Sadly, getsysinfo does not support a command-line argument for the output location.

On a retry I was able to log in via serial and copy the file elsewhere, but the reboot got stuck in RCU errors and I had to reset again.
Comment 2 Andreas Färber 2020-10-16 13:52:13 UTC
Similar issue with Tumbleweed's 5.8.14 - ssh still working (to exit) but serial running into watchdog BUGs afterwards.

[  171.622422] Internal error: synchronous external abort: 96000210 [#1] SMP
[  171.629242] Modules linked in: af_packet ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter xt_tcpudp ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security iscsi_ibft iscsi_boot_sysfs ip_set nfnetlink ebtable_filter ebtables rfkill ip6table_filter ip6_tables iptable_filter ip_tables x_tables fsl_dpaa2_ptp fsl_dpaa2_eth ptp_qoriq phylink hid_generic fsl_mc_dpio xgmac_mdio usbhid cdc_acm i2c_mux_pca954x i2c_mux tpm_i2c_atmel qoriq_thermal spi_fsl_qspi optee tee uio_pdrv_genirq uio leds_gpio qoriq_cpufreq nls_iso8859_1 nls_cp437 vfat fat drm xhci_plat_hcd xhci_hcd usbcore mmc_block caam_jr rtc_ds1307 libdes authenc caamhash_desc caamalg_desc crypto_engine mp886x aes_ce_blk crypto_simd cryptd aes_ce_cipher crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce dpaa2_console nvme sdhci_of_esdhc sdhci_pltfm nvme_core dwc3 sdhci caam
[  171.629496]  mmc_core error ulpi udc_core roles i2c_imx btrfs blake2b_generic libcrc32c xor xor_neon raid6_pq sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua
[  171.732254] CPU: 0 PID: 2539 Comm: cp Not tainted 5.8.14-1-default #1 openSUSE Tumbleweed
[  171.740429] Hardware name: traverse ten64/ten64, BIOS 2020.07-rc1-g1d4a3d9d5c 07/28/2020
[  171.748519] pstate: 60000085 (nZCv daIf -PAN -UAO BTYPE=--)
[  171.754092] pc : dw_pcie_read+0x48/0xc0
[  171.757925] lr : dw_pcie_access_other_conf.isra.0+0xcc/0x124
[  171.763578] sp : ffff800012153be0
[  171.766886] x29: ffff800012153be0 x28: 0000000000000400 
[  171.772194] x27: ffff00833f992400 x26: ffff0083481c2000 
[  171.777501] x25: ffff80001008d000 x24: 0000000000000000 
[  171.782809] x23: ffff800012153cc4 x22: 0000000000000004 
[  171.788117] x21: ffff00834b214080 x20: 0000000000000400 
[  171.793424] x19: ffff00834b2140a8 x18: 0000000000000000 
[  171.798732] x17: 0000000000000000 x16: ffffc50cf0986f10 
[  171.804039] x15: 0000000000000000 x14: 0000000000000000 
[  171.809347] x13: 0000000000000000 x12: 0000000000000040 
[  171.814654] x11: ffff00834bd8d488 x10: 0000000000000001 
[  171.819962] x9 : ffffc50cf1047ecc x8 : 0000000002080000 
[  171.825270] x7 : 0000000000000000 x6 : ffff800010f00000 
[  171.830578] x5 : 0000000000000000 x4 : 0000000000000908 
[  171.835885] x3 : 0000000000000003 x2 : ffff800012153cc4 
[  171.841193] x1 : 0000000000000004 x0 : ffff80001008d400 
[  171.846502] Call trace:
[  171.848944]  dw_pcie_read+0x48/0xc0
[  171.852428]  dw_pcie_rd_conf+0x148/0x180
[  171.856347]  pci_user_read_config_dword+0xa8/0x190
[  171.861135]  pci_read_config+0x1f8/0x264
[  171.865054]  sysfs_kf_bin_read+0x78/0xa0
[  171.868971]  kernfs_file_direct_read+0x90/0x220
[  171.873497]  kernfs_fop_read+0x44/0x50
[  171.877241]  vfs_read+0xb8/0x1d0
[  171.880464]  ksys_read+0x78/0x10c
[  171.883773]  __arm64_sys_read+0x28/0x34
[  171.887604]  el0_svc_common.constprop.0+0x84/0x230
[  171.892390]  do_el0_svc+0x30/0xa0
[  171.895700]  el0_svc+0x18/0x50
[  171.898749]  el0_sync_handler+0x90/0x254
[  171.902666]  el0_sync+0x158/0x180
[  171.905979] Code: 528010e0 d50323bf b900005f d65f03c0 (b9400001) 
[  171.912069] ---[ end trace 54772d5159fa3103 ]---
Comment 3 Andreas Färber 2020-10-18 04:19:47 UTC
For comparison, on a MacchiatoBin with kernel 5.8.14 `getsysinfo` also produces one kernel error on serial console, but the machine remains usable:

[ 1252.719947] BUG: Bad page state in process getsysinfo  pfn:7fe40

mack:~ # getsysinfo
/proc/bus/input
/proc/cpuinfo
/proc/device-tree
/proc/devices
/proc/fb
/proc/filesystems
/proc/interrupts
/proc/iomem
/proc/ioports
/proc/meminfo
/proc/modules
/proc/net/dev
/proc/partitions
/proc/scsi
/proc/tty
/proc/version
/sys
/var/lib/hardware/udi
/proc/mounts

System data written to: /tmp/mack.tar.gz
mack:~ #
Comment 4 Andreas Färber 2020-10-21 15:49:49 UTC
Confirming that running `/usr/sbin/getsysinfo` as non-root user works okay.
Comment 5 Miroslav Beneš 2022-01-07 14:32:05 UTC
Forgotten one, sorry about that, Andreas. Is the issue still present with the latest TW?
Comment 6 Mathew McBride 2022-07-08 05:40:57 UTC
I had a fresh look into this today and managed to find the cause of the problem!

In summary, the Layerscape PCIe controller generates a synchronous abort when PCI config data is read for the PCIe switch/bridge.

This read does not happen in normal operation but is triggered by getsysinfo archiving/enumerating the /sys tree, where one can read out the pci config register as a file.

The synchronous abort problem exists in mainline kernels / non SUSE systems as well.

The Ten64 retail (1064-0201C) board has a Diodes/Pericom PI7C9X2G304SV PCIe switch that splits one PCIe lane into two PCIe 2.0 lanes for the miniPCIe slots:

lspci -nn
0000:00:00.0 PCI bridge [0604]: Freescale Semiconductor Inc Device [1957:80c0] (rev 10)
0001:00:00.0 PCI bridge [0604]: Freescale Semiconductor Inc Device [1957:80c0] (rev 10)
0001:01:00.0 PCI bridge [0604]: Pericom Semiconductor Device [12d8:b304] (rev 01)
0001:02:01.0 PCI bridge [0604]: Pericom Semiconductor Device [12d8:b304] (rev 01)
0001:02:02.0 PCI bridge [0604]: Pericom Semiconductor Device [12d8:b304] (rev 01)
0001:03:00.0 Unclassified device [0002]: MEDIATEK Corp. MT7915E 802.11ax PCI Express Wireless Network Adapter [14c3:7915]
0001:04:00.0 Network controller [0280]: Qualcomm Atheros QCA986x/988x 802.11ac Wireless Network Adapter [168c:003c]
0002:00:00.0 PCI bridge [0604]: Freescale Semiconductor Inc Device [1957:80c0] (rev 10)
0002:01:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO [144d:a80a]
root@recovery000afa24295d:/tmp# lspci -tnn
-+-[0002:00]---00.0-[01-ff]----00.0
 +-[0001:00]---00.0-[01-ff]----00.0-[02-04]--+-01.0-[03]----00.0
 |                                           \-02.0-[04]----00.0

If the PCIe switch is hidden (by disabling its upstream PCIe controller in the FDT blob) or missing (it has been removed from some Ten64 board variants), the problem does not occur and getsysinfo will not cause a panic.

FreeBSD had a similar issue and the cause sounds very similar to what is happening here.

"pci: Don't try to read cfg registers of non-existing devices
Instead of returning 0xffs some controllers, such as Layerscape generate
an external exception when someone attempts to read any register
of config space of a non-existing device other than PCIR_VENDOR.
This causes a kernel panic.
Fix it by bailing during device enumeration if a device vendor register
returns invalid value. (0xffff)
Use this opportunity to replace some hardcoded values with a macro."
From https://cgit.freebsd.org/src/commit/?id=68cbe189fdd3c572476f8af9219a5d335f05b51a
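
The check described in that commit message can be illustrated with a small sketch. This is not the FreeBSD code and not a proposed Linux patch; it only shows the decision rule "an all-ones vendor ID means no device, so stop probing". The helper name `vendor_present` and the list of vendor IDs in the loop are assumptions made for the illustration (the two real IDs are taken from the lspci output above).

```shell
#!/bin/bash
# Sketch of the rule from the quoted commit message: treat a vendor ID of
# 0xffff as "no device present" and do not touch any further registers.
# vendor_present is a hypothetical helper, not an existing kernel or FreeBSD API.
vendor_present() {
    case "$(printf '%s' "$1" | tr 'A-F' 'a-f')" in
        0xffff|ffff) return 1 ;;   # invalid vendor ID: nothing behind this slot
        *) return 0 ;;
    esac
}

# Simulated enumeration over vendor IDs as they might be read from config space.
for v in 0x12d8 0xffff 0x14c3; do
    if vendor_present "$v"; then
        echo "$v: device present, continue probing"
    else
        echo "$v: no device, skip remaining registers"
    fi
done
```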

I have been able to isolate it down to the 'config' sysfs file. Here is a reduced testcase:
for i in $(find /sys/devices/platform/soc/3500000.pcie -type f); do
    echo "Opening $i"
    echo "------------------------------------------"
    sleep 1 # allow time for console to flush
    cat "$i"
    echo "------------------------------------------"
done
....
------------------------------------------
Opening /sys/devices/platform/soc/3500000.pcie/pci0001:00/0001:00:00.0/0001:01:00.0/0001:02:02.0/config
------------------------------------------
[  150.192901] Internal error: synchronous external abort: 96000210 [#1] SMP
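
Since the abort was narrowed down to the 'config' attribute, a variant of the loop that leaves those files out can be sketched as follows. This is only a sketch under the assumption, taken from the finding above, that the other attributes are harmless to read on this board; that has not been verified for every file. The function name `list_without_pci_config` is made up for this example, and the function itself only prints paths and reads nothing.

```shell
#!/bin/bash
# Sketch: list regular files below a directory, leaving out files named
# 'config' (the PCI config-space attribute in sysfs). Nothing is read here.
# list_without_pci_config is a hypothetical helper name.
list_without_pci_config() {
    find "$1" -type f ! -name config | LC_ALL=C sort
}

# Possible usage on the affected controller (not executed here):
#   list_without_pci_config /sys/devices/platform/soc/3500000.pcie |
#       while IFS= read -r f; do echo "Opening $f"; cat "$f"; done
```

Filtering by file name keeps the sketch simple; it would also skip any unrelated file that happens to be called 'config'.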

I have verified that the problem also exists on non-SUSE systems (including with 5.19.0-rc5), so it is a kernel bug that getsysinfo merely triggers.

Here is the trace from the latest Tumbleweed snapshot:
openSUSE-Tumbleweed-ARM-JeOS-efi.aarch64-2022.07.01-Snapshot20220704.raw.xz

Linux localhost.localdomain 5.18.6-1-default #1 SMP PREEMPT_DYNAMIC Thu Jun 23 05:46:18 UTC 2022 (5aa0763) aarch64 aarch64 aarch64 GNU/Linux
[   36.849750][ T2016] Internal error: synchronous external abort: 96000210 [#1] SMP
[   36.857252][ T2016] Modules linked in: af_packet mt7915e ath10k_pci ath10k_core mt76_connac_lib mt76 ath mac80211 libarc4 fsl_dpaa2_eth pcs_lynx cfg80211 phylink rfkill i2c_mux_pca954x i2c_mux pci_endpoint_test tpm_i2c_atmel qoriq_thermal tee sfp uio_pdrv_genirq mdio_i2c leds_gpio uio qoriq_cpufreq nls_iso8859_1 nls_cp437 vfat fat fuse drm ip_tables x_tables xhci_plat_hcd xhci_hcd caam_jr crypto_engine usbcore dpaa2_caam caamhash_desc caamalg_desc aes_ce_blk aes_ce_cipher crct10dif_ce ghash_ce gf128mul sha2_ce sha256_arm64 sha1_ce sp805_wdt fsl_mc_dpio dpaa2_console authenc libdes caam nvme nvme_core error dwc3 sdhci_of_esdhc sdhci_pltfm sdhci udc_core rtc_fsl_ftm_alarm roles mmc_core ulpi i2c_imx usb_common gpio_keys btrfs blake2b_generic xor xor_neon raid6_pq libcrc32c dm_mirror dm_region_hash dm_log dm_mod sg
[   36.929404][ T2016] CPU: 0 PID: 2016 Comm: cp Not tainted 5.18.6-1-default #1 openSUSE Tumbleweed a3ce01492e87efb4fa7f3baf169c992c0c69c4b7
[   36.941846][ T2016] Hardware name: traverse ten64/ten64, BIOS 2020.07-rc1-ga94e0d21 03/15/2022
[   36.950460][ T2016] pstate: 204000c5 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   36.958119][ T2016] pc : pci_generic_config_read+0x44/0xcc
[   36.963613][ T2016] lr : pci_generic_config_read+0x30/0xcc
[   36.969099][ T2016] sp : ffff80000a31b9f0
[   36.973105][ T2016] x29: ffff80000a31b9f0 x28: ffff08be45472400 x27: 0000000000000400
[   36.980941][ T2016] x26: 00000000000003ff x25: ffff08be45472000 x24: 0000000000001000
[   36.988779][ T2016] x23: 0000000000001000 x22: ffff80000a31bae4 x21: ffffbac41ea22fa0
[   36.996616][ T2016] x20: ffff80000a31ba64 x19: 0000000000000004 x18: 0000000000000000
[   37.004453][ T2016] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[   37.012289][ T2016] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[   37.020132][ T2016] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffbac41ca785dc
[   37.027975][ T2016] x8 : 0000000000000004 x7 : ffff800008e00000 x6 : ffff800008e00000
[   37.035816][ T2016] x5 : ffff08be41a93c80 x4 : 0000000000000908 x3 : 0000000000000000
[   37.043656][ T2016] x2 : 0000000000000000 x1 : ffff08be4a0de000 x0 : ffff800008202400
[   37.051494][ T2016] Call trace:
[   37.054631][ T2016]  pci_generic_config_read+0x44/0xcc
[   37.059774][ T2016]  dw_pcie_rd_other_conf+0x24/0x7c
[   37.064741][ T2016]  pci_user_read_config_dword+0x84/0x124
[   37.070229][ T2016]  pci_read_config+0xf0/0x2a0
[   37.074760][ T2016]  sysfs_kf_bin_read+0x78/0xa0
[   37.079378][ T2016]  kernfs_fop_read_iter+0xac/0x1d4
[   37.084344][ T2016]  new_sync_read+0xd8/0x160
[   37.088700][ T2016]  vfs_read+0x19c/0x1e4
[   37.092710][ T2016]  ksys_read+0x78/0x10c
[   37.096718][ T2016]  __arm64_sys_read+0x28/0x34
[   37.101248][ T2016]  invoke_syscall+0x78/0x100
[   37.105693][ T2016]  el0_svc_common.constprop.0+0x58/0x190
[   37.111181][ T2016]  do_el0_svc+0x30/0x90
[   37.115191][ T2016]  el0_svc+0x34/0x130
[   37.119029][ T2016]  el0t_64_sync_handler+0x10c/0x140
[   37.124080][ T2016]  el0t_64_sync+0x1a0/0x1a4
[   37.128439][ T2016] Code: 7100067f 540001c0 71000a7f 54000280 (b9400001)
[   37.135228][ T2016] ---[ end trace 0000000000000000 ]---
[   37.140539][ T2016] note: cp[2016] exited with preempt_count 1

And from Leap 15.4:
Linux localhost 5.14.21-150400.22-default #1 SMP PREEMPT_DYNAMIC Wed May 11 06:57:18 UTC 2022 (49db222) aarch64 aarch64 aarch64 GNU/Linux
[  445.922445][ T2950] Call trace:
[  445.925582][ T2950]  pci_generic_config_read+0x40/0x100
[  445.930810][ T2950]  dw_pcie_rd_other_conf+0x20/0x80
[  445.935777][ T2950]  pci_user_read_config_dword+0x88/0x140