I am seeing a kernel crash on ARM64 (aarch64). The system has 4 cores. I investigated the issue and found the following.

Here is the C code (linux/drivers/cpuidle/governors/menu.c):

static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
{
    struct menu_device *data = this_cpu_ptr(&menu_devices);
    int latency_req = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);

Here is the code of pm_qos_request (linux/kernel/power/qos.c):

int pm_qos_request(int pm_qos_class)
{
    return pm_qos_read_value(pm_qos_array[pm_qos_class]->constraints);
}

In the assembly code, the value of pm_qos_class appears to change from PM_QOS_CPU_DMA_LATENCY (macro value 1) to an abnormal value (0x0499162b), which makes the kernel crash.

Here is the crash log:

Unable to handle kernel paging request at virtual address ffffffc025140188
pgd = ffffffc006fd3000
[ffffffc025140188] *pgd=0000000000000000, *pud=0000000000000000
Internal error: Oops: 96000006 [#1] PREEMPT SMP
...
CPU: 2 PID: 0 Comm: swapper/2 Tainted: P                4.1.45 #1
Hardware name: Broadcom-v8A (DT)
task: ffffffc0160d5540 ti: ffffffc0160d8000 task.ti: ffffffc0160d8000
PC is at pm_qos_request+0x8/0x18
LR is at menu_select+0x3c/0x3f8
pc : [<ffffffc0000c9330>] lr : [<ffffffc000302804>] pstate: 800001c5
sp : ffffffc0160dbed0
x29: ffffffc0160dbed0 x28: 0000000000002701 
x27: ffffffc0177d97d4 x26: 0000000000000006 
x25: 0000000000002076 x24: ffffffc0006b25a0 
x23: 0000000017164000 x22: 0000000077359400 
x21: ffffffc000675770 x20: ffffffc0177d9770 
x19: ffffffc0177d97b4 x18: 0000000000000000 
x17: 0000000000000000 x16: ffffffc0000f59e8 
x15: 0000000000000000 x14: 00000000f6b86920 
x13: 00000000f6b85de0 x12: 0000000000000000 
x11: 0000000000000040 x10: 0000000000000000 
x9 : 0000000000ffffb7 x8 : 0000000000000037 
x7 : ffffffc0177d78b8 x6 : 0000000000000001 
x5 : 000000000000270b x4 : 0000000000001b3d 
x3 : 0000000000002704 x2 : ffffffc0177d97c8 
x1 : ffffffc0004b5030 x0 : 000000000499162b 

Let's look into the assembly code of menu_select

(gdb) disassemble menu_select
Dump of assembler code for function menu_select:
   0xffffffc0003027c8 <+0>: stp x29, x30, [sp,#-128]!
   0xffffffc0003027cc <+4>: mov x29, sp
   0xffffffc0003027d0 <+8>: stp x21, x22, [sp,#32]
   0xffffffc0003027d4 <+12>:    adrp    x21, 0xffffffc000675000 <vmstat_work+88>
   0xffffffc0003027d8 <+16>:    stp x23, x24, [sp,#48]
   0xffffffc0003027dc <+20>:    stp x19, x20, [sp,#16]
   0xffffffc0003027e0 <+24>:    mrs x23, tpidr_el1
   0xffffffc0003027e4 <+28>:    mov x24, x0
   0xffffffc0003027e8 <+32>:    add x21, x21, #0x770
   0xffffffc0003027ec <+36>:    mov w0, #0x1                    // #1
   0xffffffc0003027f0 <+40>:    str x1, [x29,#104]
   0xffffffc0003027f4 <+44>:    add x20, x21, x23
   0xffffffc0003027f8 <+48>:    stp x25, x26, [sp,#64]
   0xffffffc0003027fc <+52>:    stp x27, x28, [sp,#80]
   0xffffffc000302800 <+56>:    bl  0xffffffc0000c9328 <pm_qos_request>
   0xffffffc000302804 <+60>:    mov w22, w0

w0 is set to #1 by this instruction before the branch to pm_qos_request:

mov w0, #0x1                    // #1

But when the crash happens in pm_qos_request, the value of w0 is not #1 (it is 0x0499162b according to the crash log).

Let's look at the assembly of pm_qos_request

(gdb) disassemble pm_qos_request
Dump of assembler code for function pm_qos_request:
   0xffffffc0000c9328 <+0>: adrp    x1, 0xffffffc0004b5000 <__func__.12082+8>
   0xffffffc0000c932c <+4>: add x1, x1, #0x30
   0xffffffc0000c9330 <+8>: ldr x0, [x1,w0,sxtw #3]
   0xffffffc0000c9334 <+12>:    ldr x0, [x0]
   0xffffffc0000c9338 <+16>:    ldr w0, [x0,#16]
   0xffffffc0000c933c <+20>:    ret

The crash point is at 0xffffffc0000c9330. x0 is not modified between the entry of pm_qos_request and that instruction, so I expect it to still hold the input value #1, but it does not: the crash log shows 0x0499162b.

My question: is there any possibility that w0 (or another general-purpose register) changes randomly as described above?

Appreciate any advice.

Thank you in advance.

kid1412hv
  • Possibly because the code that you haven't shown that was executed in `pm_qos_request` changed the value of X0. – Ross Ridge Nov 27 '20 at 06:43
  • x0 is a call-clobbered register in the standard calling convention, in fact used for return values. w0 is the low 32 bits of x0. It's 100% expected that functions will use `x0` as a scratch register. See [What are callee and caller saved registers?](https://stackoverflow.com/a/56178078) for more about volatile vs. non-volatile registers. In fact, *within* a function, it's normal for even the call-preserved registers to have different values, if the function is planning to restore them before eventually returning. – Peter Cordes Nov 27 '20 at 07:04
  • Hi Peter, thanks for the reply. I see that menu_select uses w0 as the input value for pm_qos_request, and w0 is then used inside the function. But I don't understand why it changes from #1 to another value (in this case 0x0499162b). I don't see any asm code that changes w0 after it is assigned #1 and before the crash point. – kid1412hv Nov 27 '20 at 08:44
  • That is mysterious; the load shouldn't both fault *and* have actually updated x0. Reopened now that you've ruled out the obvious explanation. It seems unlikely a buggy interrupt handler could be corrupting registers; if that was happening, you'd expect the system to be totally unusable, and not crash in a consistent spot. (Unless it's only triggered by some driver that isn't used automatically?) – Peter Cordes Nov 27 '20 at 09:49
  • Some suggestions, presuming you are working on this code. 1: Disable interrupts on this CPU around that call; if that masks it, you almost certainly have a buggy driver stomping on stuff. 2: Replace pm_qos_request with a little assembly one, which saves x0 into another register (x2?), then runs the above opcodes. Check the copy when the crash occurs; move the copy down one op and retry. It looks like a juicy one. – mevets Nov 27 '20 at 17:01
  • Is this reproducible, or a one-time crash that you're trying to postmortem? – Nate Eldredge Nov 27 '20 at 19:07
  • Thanks for your replies. It is a one-time crash. I found the following on the internet and it seems related to the issue: "Once IRQs have been reenabled in IRQ mode there is a possibility of LR corruption even if the callee saves/restores LR. Consider the case where the processor is executing 'BL bar2' when the IRQ is signalled. The current instruction (the BL) will be completed and will store the return address in LR and set the PC to bar2. But before the first instruction of bar2 can execute, the IRQ will be handled and overwrite/corrupt LR (game over)." – kid1412hv Nov 30 '20 at 07:14
