I'm currently working on integrating the ARM64 ftrace patches that enable full support for "dynamic ftrace with registers". Specifically, I'm working on the 4.9.200 kernel for the Pixel 3a (sargo), and the patches I'm referring to are the following:

These patches require compiler support for the '-fpatchable-function-entry=2' option, introduced in GCC 8.x, which is why I switched the kernel build to GCC 9.1. Compiling the kernel with this option correctly inserts the 2 ARM64 NOP instructions at the start of each traceable function.
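To illustrate what the option does outside the kernel, here is a minimal user-space sketch. It uses the per-function 'patchable_function_entry' attribute (available in GCC 8+ and Clang 10+) instead of the command-line option, and a helper that counts how many of the first instruction words at a function's entry match the AArch64 NOP encoding. The names 'traced_add' and 'count_leading_aarch64_nops' are hypothetical and only serve the example; the count is only meaningful on an ARM64 build, and other hardening options (e.g. branch protection) may place an extra instruction before the NOPs.

```c
#include <stdint.h>
#include <string.h>

/* AArch64 encoding of NOP; other architectures use different encodings. */
#define AARCH64_NOP 0xd503201fu

/*
 * Hypothetical function used only for this sketch. The attribute asks the
 * compiler for 2 NOPs at the function entry, which is the per-function
 * equivalent of building with -fpatchable-function-entry=2.
 */
__attribute__((patchable_function_entry(2), noinline))
int traced_add(int a, int b)
{
    return a + b;
}

/*
 * Count how many of the first `n` 32-bit words starting at `fn` are
 * AArch64 NOPs, stopping at the first word that is not a NOP.
 */
int count_leading_aarch64_nops(const void *fn, int n)
{
    int count = 0;

    for (int i = 0; i < n; i++) {
        uint32_t insn;

        /* memcpy avoids alignment and aliasing assumptions. */
        memcpy(&insn, (const unsigned char *)fn + 4 * i, sizeof(insn));
        if (insn != AARCH64_NOP)
            break;
        count++;
    }
    return count;
}
```

Calling 'count_leading_aarch64_nops((const void *)(uintptr_t)traced_add, 2)' on an ARM64 build is expected to report the patchable NOPs when nothing else precedes them; disassembling the object with 'objdump -d' is an equivalent check.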

The issue is that the kernel built with the ported "dynamic ftrace with registers" patch (the 4.8.x and 4.20.x versions are very similar) crashes during the transition from kernel-land to user-land, specifically in the call to 'do_execve()' that spawns '/init'. The ftrace initialization and the whole initial boot sequence of the crashing kernel are identical to those of a properly booting kernel (e.g. a kernel with "dynamic ftrace without registers" support).

Verbose logging ('debug', 'ignore_loglevel', 'initcall_debug' and an increased log buffer shift) is enabled, but the crash output does not show the reason for the failure (e.g. an invalid instruction or an invalid memory access).

I attempted to enable full KASAN+KCOV support, but this proved impossible because the generated LZ4 image is too big to be loaded by the Pixel 3a bootloader, resulting in a "FAILED (remote: 'Error verifying the received boot.img: Buffer Too Small')" fastboot error. Flashing the boot image is possible, but after the crash the device enters a bootloop in which it's impossible to obtain the logs from '/sys/fs/pstore', because re-flashing the working boot image flushes the crash logs.

As an additional experiment, I ported the 4.8.x patch to the 4.9.x kernel and the 4.20.x patch to the 4.19.x kernel for the HiKey620 board (ARM64-based). Both kernels booted successfully (using the latest AOSP built from the 'master' branch), and "dynamic ftrace with registers" could be used through the API from a kernel module. This left me wondering what the difference is between the 4.9.x kernels for the HiKey620 board and the Pixel 3a.

I've also experimented with the kernel option 'CONFIG_DEBUG_RODATA' to disable read-only memory enforcement (e.g. this old issue https://github.com/raspberrypi/linux/issues/2166 reports the ARM kernel crashing when ftrace is enabled, which turned out to be a read-only memory issue); in my case the full boot sequence is working fine, so I excluded that as a possible cause.

To make sure that the '/init' (actually '/system/bin/init') binary is not executed at all, I put some logs and an infinite loop as the very first instructions of the entry point ('int main(..)' in the 'init/main.cpp' file). The boot process clearly does not reach that point, which led me to exclude a problem with the setup functions of the 'first' and 'second' user-land init stages.

The following links point to the verbose logs of the crashing kernel (4.9.200 with "dynamic ftrace with registers") and the booting kernel (4.9.200 with "dynamic ftrace without registers"):

What would be the best way to debug the issue? Is there anything obvious causing the issue that I'm missing?

Update 1

I managed to get a KASAN+KCOV build working by compiling the kernel with Clang 10 and enabling the 'CONFIG_CC_OPTIMIZE_FOR_SIZE' option, which uses the -Os compilation flag to shrink the kernel image. Enabling 'CONFIG_GZIP' reduces the size even more, allowing fastboot to boot the kernel without flashing. Clang 10 was compiled from source, which now contains full support for the '-fpatchable-function-entry' option. Even in this case the crash logs do not hint at anything in particular (e.g. no KASAN crashes or warnings).

While looking for similar problems I ran into what looks like a very similar issue, which has no solution: https://unix.stackexchange.com/questions/243515/why-cant-the-kernel-run-init.

Update 2

Compiling the '/system/bin/init' binary statically leads to the proper execution of the FirstStageMain function (whose logs are identical to those of a properly booting kernel). The crash has now moved from the previously mentioned 'do_execve()' call to the 'execv()' call at the very end of the FirstStageMain function, which is supposed to execute the '/system/bin/init' binary again (using the 2SI ramdisk SAR boot logic), but with a different argument, argv[1] = "selinux_setup".

Matteo Favaro

1 Answer

The issue was caused by the need to disable the '-fpatchable-function-entry=2' option when compiling the vDSO shared library on the Pixel 3a 4.9.200 kernel.

The change was introduced by commit 28b1a824a4f44da46983cd2c3249f910bd4b797b in the mainline kernel, but it wasn't present in the original patches that I was trying to back-port and, according to the experiments with the HiKey620 board, it is not necessary on all devices.

Other components of the kernel are also built with the option disabled, and all of them seem to have special requirements for the generated assembly (e.g. a specific layout, or the use of non-ABI-compliant registers that may be corrupted in the presence of instrumentation).

Checking the mainline kernel for places where the 'CC_FLAGS_FTRACE' variable is removed seems to be a good indicator of the components that absolutely need to avoid its use.
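As a rough illustration of what such a removal looks like, here is a sketch of a kbuild fragment for the arm64 vDSO Makefile. 'CFLAGS_REMOVE_<object>.o' is standard kbuild syntax for stripping flags from a single object. The object name 'vgettimeofday.o' and the file path are assumptions taken from the mainline layout and may differ in the Pixel 3a 4.9.200 tree. The sketch also assumes that the back-ported patches put '-fpatchable-function-entry=2' into 'CC_FLAGS_FTRACE', as mainline does.

```make
# arch/arm64/kernel/vdso/Makefile (path as in mainline; an assumption here)
#
# Strip the ftrace instrumentation flags, which are assumed to include
# -fpatchable-function-entry=2, from the vDSO C object.
# vgettimeofday.o is an assumed object name; adjust it to the objects
# actually built in the target tree.
CFLAGS_REMOVE_vgettimeofday.o = $(CC_FLAGS_FTRACE)
```

Running 'git grep -n CC_FLAGS_FTRACE' in a mainline checkout lists the other Makefiles that remove the flags in the same way.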

Matteo Favaro