Question

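I am trying to write a block device driver that hands I/O requests off to a separate worker thread. The worker thread waits for a request to arrive and then performs the actual I/O.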

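I am using a completion structure to synchronize the worker thread with the transfer-request function. The worker thread calls wait_for_completion before it goes on to process a request, and it is woken up when the transfer-request function receives one.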
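
For readers unfamiliar with completions, the counter semantics can be sketched in user space. This is only an analogy built on pthreads, not kernel code: the `model_*` names are hypothetical, and the kernel version uses a spinlock and a wait queue instead of a mutex and a condition variable. The point it illustrates is that a completion carries a `done` counter, so a wait consumes one posted completion, and `completion_done()` reports whether a posted completion is still unconsumed.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the kernel's struct completion. */
struct model_completion {
    pthread_mutex_t lock;
    pthread_cond_t  wait;
    unsigned int    done;   /* posted completions not yet consumed */
};

/* Start with nothing posted, like init_completion(). */
void model_init_completion(struct model_completion *c)
{
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->wait, NULL);
    c->done = 0;
}

/* Post one completion and wake one sleeper; does not wait for the sleeper. */
void model_complete(struct model_completion *c)
{
    pthread_mutex_lock(&c->lock);
    c->done++;
    pthread_cond_signal(&c->wait);
    pthread_mutex_unlock(&c->lock);
}

/* Sleep until a completion has been posted, then consume it. */
void model_wait_for_completion(struct model_completion *c)
{
    pthread_mutex_lock(&c->lock);
    while (c->done == 0)
        pthread_cond_wait(&c->wait, &c->lock);
    c->done--;
    pthread_mutex_unlock(&c->lock);
}

/* True while a posted completion has not been consumed yet. */
bool model_completion_done(struct model_completion *c)
{
    bool posted;

    pthread_mutex_lock(&c->lock);
    posted = (c->done != 0);
    pthread_mutex_unlock(&c->lock);
    return posted;
}
```

In this model, a successful wait already resets the state by decrementing the counter, so the waiter does not need to re-initialize the object before waiting again.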

However, when I try to insmod the module, my system simply hangs. This is the backtrace shown for each CPU:

[  152.036031] BUG: soft lockup - CPU#0 stuck for 22s! [mount:1752]
[  152.036046] CPU 0
[  152.036046] Call Trace:
[  152.036046]  [<ffffffff8108d3e3>] smp_call_function+0x33/0x60
[  152.036046]  [<ffffffff8108d443>] on_each_cpu+0x33/0xa0
[  152.036046]  [<ffffffff81189e95>] __blkdev_put+0x185/0x1f0
[  152.036046]  [<ffffffff8115570a>] __fput+0xaa/0x200
[  152.036046]  [<ffffffff81151f3f>] filp_close+0x5f/0x90
[  152.036046]  [<ffffffff8115201d>] sys_close+0xad/0x120
[  152.036046]  [<ffffffff815a7312>] system_call_fastpath+0x16/0x1b

[  158.233026] INFO: rcu_preempt_state detected stalls on CPUs/tasks: { 2} (detected by 0, t=60002 jiffies)
[  158.233031] sending NMI to all CPUs:

[  158.233031] NMI backtrace for cpu 0
[  158.233031] CPU 0 
[  158.233031] Call Trace:
[  158.233031]  [<ffffffff8102018f>] arch_trigger_all_cpu_backtrace+0x4f/0x90
[  158.233031]  [<ffffffff810ca429>] print_other_cpu_stall+0x159/0x1c0
[  158.233031]  [<ffffffff810cb5e1>] __rcu_pending+0x31/0x180
[  158.233031]  [<ffffffff810cbb4a>] rcu_check_callbacks+0x11a/0x190
[  158.233031]  [<ffffffff8106505f>] update_process_times+0x3f/0x80
[  158.233031]  [<ffffffff8108772b>] tick_sched_timer+0x5b/0xc0
[  158.233031]  [<ffffffff8107a2ce>] __run_hrtimer+0x6e/0x240
[  158.233031]  [<ffffffff8107acf5>] hrtimer_interrupt+0xe5/0x200
[  158.233031]  [<ffffffff8101efb3>] smp_apic_timer_interrupt+0x63/0xa0
[  158.233031]  [<ffffffff815a7e5e>] apic_timer_interrupt+0x6e/0x80
[  158.233031]  [<ffffffff8108d30e>] smp_call_function_many+0x1fe/0x2a0
[  158.233031]  [<ffffffff8108d3e3>] smp_call_function+0x33/0x60
[  158.233031]  [<ffffffff8108d443>] on_each_cpu+0x33/0xa0
[  158.233031]  [<ffffffff81189e95>] __blkdev_put+0x185/0x1f0
[  158.233031]  [<ffffffff8115570a>] __fput+0xaa/0x200
[  158.233031]  [<ffffffff81151f3f>] filp_close+0x5f/0x90
[  158.233031]  [<ffffffff8115201d>] sys_close+0xad/0x120
[  158.233031]  [<ffffffff815a7312>] system_call_fastpath+0x16/0x1b

[  158.233291] CPU 3 and CPU 1
[  158.233347] Call Trace:
[  158.233352]  [<ffffffff8101eb68>] lapic_next_event+0x18/0x20
[  158.233356]  [<ffffffff81087288>] tick_dev_program_event+0x38/0x100
[  158.233360]  [<ffffffff8107ad2d>] hrtimer_interrupt+0x11d/0x200
[  158.233363]  [<ffffffff8101efb3>] smp_apic_timer_interrupt+0x63/0xa0
[  158.233366]  [<ffffffff815a7e5e>] apic_timer_interrupt+0x6e/0x80
[  158.233371]  [<ffffffff81029aa2>] native_safe_halt+0x2/0x10
[  158.233374]  [<ffffffff8100a67d>] default_idle+0x4d/0x2a0
[  158.233378]  [<ffffffff810011a6>] cpu_idle+0x86/0xd0

[  137.645911] CPU 2 
[  137.645911] Call Trace:
[  137.645911]  [<ffffffff8159d766>] wait_for_common+0x26/0x150
[  137.645911]  [<ffffffffa01674e2>] tsdd_worker_thread+0x72/0x1b0 [tsdd]
[  137.645911]  [<ffffffff81075eee>] kthread+0x7e/0x90
[  137.645911]  [<ffffffff815a94f4>] kernel_thread_helper+0x4/0x10

...and eventually the system hangs.

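I have a vague understanding of what seems to be happening here. The (tsdd) worker thread is running on CPU 2, waiting in wait_for_completion(). On CPU 0 there is a close system call, which appears to be waiting for a response from all other CPUs. It gets a response from every CPU except CPU 2. It waits too long (BUG: soft lockup - CPU#0 stuck for 22s!) and a timer interrupt fires.

Now this interrupt is broadcast to all CPUs. wait_for_completion() makes the thread on CPU 2 wait until completion_done. Even when an interrupt occurs, it does not wake the thread. When a timer interrupt fires in this situation, it is sent to all CPUs, including CPU 2, where our thread is stuck in wait_for_completion(). CPU 2 cannot service the interrupt, and the system eventually hangs.

Is this observation correct, or is something else going on?

Here is a short pseudocode version: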

static struct request *sch_req = NULL;  //global
static struct completion *comp = NULL;  // initialized in module_init

void transfer_req(req_queue) {
    req = blk_fetch_request(req_queue);
    served = 0;
    while (!served) {
        if (completion_done(comp))
            continue;
        sch_req = req;
        sch_queue = req_queue;
        complete(comp);
        served = 1;
    }
}

void worker_thread() {
    while (!kthread_should_stop()) {
        if (wait_for_completion(comp))
            continue;
        while (sch_req) {
            perform_IO(sch_req);
            sch_req = blk_fetch_request(sch_queue);
        }
        init_completion(comp);
    }
}

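Can someone help me figure out what is wrong here? I would also like to understand how to fix it. I even tried wait_for_completion_interruptible, but that does not seem to solve the problem.

Thanks

P.S. - Sorry for the long post (I could not attach the log file).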
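
One common way to structure this kind of handoff, sketched here as a user-space analogy rather than as a fix for this particular driver, is for the submitting side to queue the request and signal the worker without ever looping until the worker catches up. The sketch below uses pthreads; every name in it (`struct handoff`, `handoff_submit`, `handoff_take`, `handoff_stop`, `HANDOFF_SLOTS`) is hypothetical. In a driver, the rough counterparts would be a request list protected by a spinlock together with `wake_up()` and `wait_event_interruptible()`, but that mapping is an assumption and is not taken from the code above.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define HANDOFF_SLOTS 16    /* arbitrary capacity chosen for the sketch */

/* Hypothetical bounded queue shared by the submitter and the worker. */
struct handoff {
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
    void           *slot[HANDOFF_SLOTS];
    size_t          head;       /* index of the oldest queued item */
    size_t          count;      /* number of queued items */
    bool            stopping;   /* set once the worker should exit */
};

void handoff_init(struct handoff *h)
{
    pthread_mutex_init(&h->lock, NULL);
    pthread_cond_init(&h->nonempty, NULL);
    h->head = 0;
    h->count = 0;
    h->stopping = false;
}

/* Submitter side: queue the item and wake the worker.
 * Returns false when the queue is full instead of waiting for space. */
bool handoff_submit(struct handoff *h, void *req)
{
    bool ok = false;

    pthread_mutex_lock(&h->lock);
    if (h->count < HANDOFF_SLOTS) {
        h->slot[(h->head + h->count) % HANDOFF_SLOTS] = req;
        h->count++;
        ok = true;
        pthread_cond_signal(&h->nonempty);
    }
    pthread_mutex_unlock(&h->lock);
    return ok;
}

/* Worker side: sleep until an item is queued or a stop is requested.
 * Returns false only when stopping and nothing is left to process. */
bool handoff_take(struct handoff *h, void **req)
{
    bool got = false;

    pthread_mutex_lock(&h->lock);
    while (h->count == 0 && !h->stopping)
        pthread_cond_wait(&h->nonempty, &h->lock);
    if (h->count > 0) {
        *req = h->slot[h->head];
        h->head = (h->head + 1) % HANDOFF_SLOTS;
        h->count--;
        got = true;
    }
    pthread_mutex_unlock(&h->lock);
    return got;
}

/* Ask the worker to exit once the queue has drained. */
void handoff_stop(struct handoff *h)
{
    pthread_mutex_lock(&h->lock);
    h->stopping = true;
    pthread_cond_broadcast(&h->nonempty);
    pthread_mutex_unlock(&h->lock);
}
```

A worker loop built on this would be `while (handoff_take(&h, &req)) { /* perform the I/O for req */ }`. The design choice is that only the worker ever sleeps; the submitter either queues the item or reports that the queue is full.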

Answer 1

It looks like CPU 0 is waiting on the CSD lock. Have you checked whether IPI interrupts between the CPUs are being lost?
