linux/arch/x86/kernel/irq.c

551 lines
16 KiB
C
Raw Normal View History

// SPDX-License-Identifier: GPL-2.0-only
/*
* Common interrupt code for 32 and 64 bit
*/
#include <linux/cpu.h>
#include <linux/interrupt.h>
#include <linux/kernel_stat.h>
#include <linux/of.h>
#include <linux/seq_file.h>
#include <linux/smp.h>
#include <linux/ftrace.h>
#include <linux/delay.h>
#include <linux/export.h>
x86: Don't include linux/irq.h from asm/hardirq.h The next patch in this series will have to make the definition of irq_cpustat_t available to entering_irq(). Inclusion of asm/hardirq.h into asm/apic.h would cause circular header dependencies like asm/smp.h asm/apic.h asm/hardirq.h linux/irq.h linux/topology.h linux/smp.h asm/smp.h or linux/gfp.h linux/mmzone.h asm/mmzone.h asm/mmzone_64.h asm/smp.h asm/apic.h asm/hardirq.h linux/irq.h linux/irqdesc.h linux/kobject.h linux/sysfs.h linux/kernfs.h linux/idr.h linux/gfp.h and others. This causes compilation errors because of the header guards becoming effective in the second inclusion: symbols/macros that had been defined before wouldn't be available to intermediate headers in the #include chain anymore. A possible workaround would be to move the definition of irq_cpustat_t into its own header and include that from both, asm/hardirq.h and asm/apic.h. However, this wouldn't solve the real problem, namely asm/harirq.h unnecessarily pulling in all the linux/irq.h cruft: nothing in asm/hardirq.h itself requires it. Also, note that there are some other archs, like e.g. arm64, which don't have that #include in their asm/hardirq.h. Remove the linux/irq.h #include from x86' asm/hardirq.h. Fix resulting compilation errors by adding appropriate #includes to *.c files as needed. Note that some of these *.c files could be cleaned up a bit wrt. to their set of #includes, but that should better be done from separate patches, if at all. Signed-off-by: Nicolai Stange <nstange@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-07-29 12:15:33 +02:00
#include <linux/irq.h>
#include <asm/irq_stack.h>
#include <asm/apic.h>
#include <asm/io_apic.h>
#include <asm/irq.h>
#include <asm/mce.h>
#include <asm/hw_irq.h>
#include <asm/desc.h>
#include <asm/traps.h>
#include <asm/thermal.h>
#include <asm/posted_intr.h>
#include <asm/irq_remapping.h>
trace,x86: Move creation of irq tracepoints from apic.c to irq.c Compiling without CONFIG_X86_LOCAL_APIC set, apic.c will not be compiled, and the irq tracepoints will not be created via the CREATE_TRACE_POINTS macro. When CONFIG_X86_LOCAL_APIC is not set, we get the following build error: LD init/built-in.o arch/x86/built-in.o: In function `trace_x86_platform_ipi_entry': linux-test.git/arch/x86/include/asm/trace/irq_vectors.h:66: undefined reference to `__tracepoint_x86_platform_ipi_entry' arch/x86/built-in.o: In function `trace_x86_platform_ipi_exit': linux-test.git/arch/x86/include/asm/trace/irq_vectors.h:66: undefined reference to `__tracepoint_x86_platform_ipi_exit' arch/x86/built-in.o: In function `trace_irq_work_entry': linux-test.git/arch/x86/include/asm/trace/irq_vectors.h:72: undefined reference to `__tracepoint_irq_work_entry' arch/x86/built-in.o: In function `trace_irq_work_exit': linux-test.git/arch/x86/include/asm/trace/irq_vectors.h:72: undefined reference to `__tracepoint_irq_work_exit' arch/x86/built-in.o:(__jump_table+0x8): undefined reference to `__tracepoint_x86_platform_ipi_entry' arch/x86/built-in.o:(__jump_table+0x14): undefined reference to `__tracepoint_x86_platform_ipi_exit' arch/x86/built-in.o:(__jump_table+0x20): undefined reference to `__tracepoint_irq_work_entry' arch/x86/built-in.o:(__jump_table+0x2c): undefined reference to `__tracepoint_irq_work_exit' make[1]: *** [vmlinux] Error 1 make: *** [sub-make] Error 2 As irq.c is always compiled for x86, it is a more appropriate location to create the irq tracepoints. Cc: Seiji Aguchi <seiji.aguchi@hds.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-06-21 10:29:05 -04:00
#define CREATE_TRACE_POINTS
x86, trace: Add irq vector tracepoints [Purpose of this patch] As Vaibhav explained in the thread below, tracepoints for irq vectors are useful. http://www.spinics.net/lists/mm-commits/msg85707.html <snip> The current interrupt traces from irq_handler_entry and irq_handler_exit provide when an interrupt is handled. They provide good data about when the system has switched to kernel space and how it affects the currently running processes. There are some IRQ vectors which trigger the system into kernel space, which are not handled in generic IRQ handlers. Tracing such events gives us the information about IRQ interaction with other system events. The trace also tells where the system is spending its time. We want to know which cores are handling interrupts and how they are affecting other processes in the system. Also, the trace provides information about when the cores are idle and which interrupts are changing that state. <snip> On the other hand, my usecase is tracing just local timer event and getting a value of instruction pointer. I suggested to add an argument local timer event to get instruction pointer before. But there is another way to get it with external module like systemtap. So, I don't need to add any argument to irq vector tracepoints now. [Patch Description] Vaibhav's patch shared a trace point ,irq_vector_entry/irq_vector_exit, in all events. But there is an above use case to trace specific irq_vector rather than tracing all events. In this case, we are concerned about overhead due to unwanted events. So, add following tracepoints instead of introducing irq_vector_entry/exit. so that we can enable them independently. - local_timer_vector - reschedule_vector - call_function_vector - call_function_single_vector - irq_work_entry_vector - error_apic_vector - thermal_apic_vector - threshold_apic_vector - spurious_apic_vector - x86_platform_ipi_vector Also, introduce a logic switching IDT at enabling/disabling time so that a time penalty makes a zero when tracepoints are disabled. Detailed explanations are as follows. - Create trace irq handlers with entering_irq()/exiting_irq(). - Create a new IDT, trace_idt_table, at boot time by adding a logic to _set_gate(). It is just a copy of original idt table. - Register the new handlers for tracpoints to the new IDT by introducing macros to alloc_intr_gate() called at registering time of irq_vector handlers. - Add checking, whether irq vector tracing is on/off, into load_current_idt(). This has to be done below debug checking for these reasons. - Switching to debug IDT may be kicked while tracing is enabled. - On the other hands, switching to trace IDT is kicked only when debugging is disabled. In addition, the new IDT is created only when CONFIG_TRACING is enabled to avoid being used for other purposes. Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com> Link: http://lkml.kernel.org/r/51C323ED.5050708@hds.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Steven Rostedt <rostedt@goodmis.org>
2013-06-20 11:46:53 -04:00
#include <asm/trace/irq_vectors.h>
DEFINE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
EXPORT_PER_CPU_SYMBOL(irq_stat);
atomic_t irq_err_count;
/*
* 'what should we do if we get a hw irq event on an illegal vector'.
* each architecture has to answer this themselves.
*/
void ack_bad_irq(unsigned int irq)
{
if (printk_ratelimit())
pr_err("unexpected IRQ trap at vector %02x\n", irq);
/*
* Currently unexpected vectors happen only on SMP and APIC.
* We _must_ ack these because every local APIC has only N
* irq slots per priority level, and a 'hanging, unacked' IRQ
* holds up an irq slot - in excessive cases (when multiple
* unexpected vectors occur) that might lock up the APIC
* completely.
* But only ack when the APIC is enabled -AK
*/
apic_eoi();
}
#define irq_stats(x) (&per_cpu(irq_stat, x))
/*
* /proc/interrupts printing for arch specific interrupts
*/
int arch_show_interrupts(struct seq_file *p, int prec)
{
int j;
seq_printf(p, "%*s: ", prec, "NMI");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->__nmi_count);
seq_puts(p, " Non-maskable interrupts\n");
#ifdef CONFIG_X86_LOCAL_APIC
seq_printf(p, "%*s: ", prec, "LOC");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->apic_timer_irqs);
seq_puts(p, " Local timer interrupts\n");
seq_printf(p, "%*s: ", prec, "SPU");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->irq_spurious_count);
seq_puts(p, " Spurious interrupts\n");
seq_printf(p, "%*s: ", prec, "PMI");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->apic_perf_irqs);
seq_puts(p, " Performance monitoring interrupts\n");
seq_printf(p, "%*s: ", prec, "IWI");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->apic_irq_work_irqs);
seq_puts(p, " IRQ work interrupts\n");
seq_printf(p, "%*s: ", prec, "RTR");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->icr_read_retry_count);
seq_puts(p, " APIC ICR read retries\n");
if (x86_platform_ipi_callback) {
seq_printf(p, "%*s: ", prec, "PLT");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->x86_platform_ipis);
seq_puts(p, " Platform interrupts\n");
}
#endif
#ifdef CONFIG_SMP
seq_printf(p, "%*s: ", prec, "RES");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->irq_resched_count);
seq_puts(p, " Rescheduling interrupts\n");
seq_printf(p, "%*s: ", prec, "CAL");
for_each_online_cpu(j)
x86/irq: Do not substract irq_tlb_count from irq_call_count Since commit: 52aec3308db8 ("x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR") the TLB remote shootdown is done through call function vector. That commit didn't take care of irq_tlb_count, which a later commit: fd0f5869724f ("x86: Distinguish TLB shootdown interrupts from other functions call interrupts") ... tried to fix. The fix assumes every increase of irq_tlb_count has a corresponding increase of irq_call_count. So the irq_call_count is always bigger than irq_tlb_count and we could substract irq_tlb_count from irq_call_count. Unfortunately this is not true for the smp_call_function_single() case. The IPI is only sent if the target CPU's call_single_queue is empty when adding a csd into it in generic_exec_single. That means if two threads are both adding flush tlb csds to the same CPU's call_single_queue, only one IPI is sent. In other words, the irq_call_count is incremented by 1 but irq_tlb_count is incremented by 2. Over time, irq_tlb_count will be bigger than irq_call_count and the substract will produce a very large irq_call_count value due to overflow. Considering that: 1) it's not worth to send more IPIs for the sake of accurate counting of irq_call_count in generic_exec_single(); 2) it's not easy to tell if the call function interrupt is for TLB shootdown in __smp_call_function_single_interrupt(). Not to exclude TLB shootdown from call function count seems to be the simplest fix and this patch just does that. This bug was found by LKP's cyclic performance regression tracking recently with the vm-scalability test suite. I have bisected to commit: 3dec0ba0be6a ("mm/rmap: share the i_mmap_rwsem") This commit didn't do anything wrong but revealed the irq_call_count problem. IIUC, the commit makes rwc->remap_one in rmap_walk_file concurrent with multiple threads. When remap_one is try_to_unmap_one(), then multiple threads could queue flush TLB to the same CPU but only one IPI will be sent. Since the commit was added in Linux v3.19, the counting problem only shows up from v3.19 onwards. Signed-off-by: Aaron Lu <aaron.lu@intel.com> Cc: Alex Shi <alex.shi@linaro.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com> Link: http://lkml.kernel.org/r/20160811074430.GA18163@aaronlu.sh.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-11 15:44:30 +08:00
seq_printf(p, "%10u ", irq_stats(j)->irq_call_count);
seq_puts(p, " Function call interrupts\n");
seq_printf(p, "%*s: ", prec, "TLB");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->irq_tlb_count);
seq_puts(p, " TLB shootdowns\n");
#endif
#ifdef CONFIG_X86_THERMAL_VECTOR
seq_printf(p, "%*s: ", prec, "TRM");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->irq_thermal_count);
seq_puts(p, " Thermal event interrupts\n");
#endif
#ifdef CONFIG_X86_MCE_THRESHOLD
seq_printf(p, "%*s: ", prec, "THR");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->irq_threshold_count);
seq_puts(p, " Threshold APIC interrupts\n");
#endif
#ifdef CONFIG_X86_MCE_AMD
seq_printf(p, "%*s: ", prec, "DFR");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->irq_deferred_error_count);
seq_puts(p, " Deferred Error APIC interrupts\n");
#endif
#ifdef CONFIG_X86_MCE
seq_printf(p, "%*s: ", prec, "MCE");
for_each_online_cpu(j)
seq_printf(p, "%10u ", per_cpu(mce_exception_count, j));
seq_puts(p, " Machine check exceptions\n");
seq_printf(p, "%*s: ", prec, "MCP");
for_each_online_cpu(j)
seq_printf(p, "%10u ", per_cpu(mce_poll_count, j));
seq_puts(p, " Machine check polls\n");
#endif
#ifdef CONFIG_X86_HV_CALLBACK_VECTOR
if (test_bit(HYPERVISOR_CALLBACK_VECTOR, system_vectors)) {
seq_printf(p, "%*s: ", prec, "HYP");
for_each_online_cpu(j)
seq_printf(p, "%10u ",
irq_stats(j)->irq_hv_callback_count);
seq_puts(p, " Hypervisor callback interrupts\n");
}
#endif
#if IS_ENABLED(CONFIG_HYPERV)
if (test_bit(HYPERV_REENLIGHTENMENT_VECTOR, system_vectors)) {
seq_printf(p, "%*s: ", prec, "HRE");
for_each_online_cpu(j)
seq_printf(p, "%10u ",
irq_stats(j)->irq_hv_reenlightenment_count);
seq_puts(p, " Hyper-V reenlightenment interrupts\n");
}
if (test_bit(HYPERV_STIMER0_VECTOR, system_vectors)) {
seq_printf(p, "%*s: ", prec, "HVS");
for_each_online_cpu(j)
seq_printf(p, "%10u ",
irq_stats(j)->hyperv_stimer0_count);
seq_puts(p, " Hyper-V stimer0 interrupts\n");
}
#endif
seq_printf(p, "%*s: %10u\n", prec, "ERR", atomic_read(&irq_err_count));
#if defined(CONFIG_X86_IO_APIC)
seq_printf(p, "%*s: %10u\n", prec, "MIS", atomic_read(&irq_mis_count));
#endif
#if IS_ENABLED(CONFIG_KVM)
seq_printf(p, "%*s: ", prec, "PIN");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->kvm_posted_intr_ipis);
seq_puts(p, " Posted-interrupt notification event\n");
seq_printf(p, "%*s: ", prec, "NPI");
for_each_online_cpu(j)
seq_printf(p, "%10u ",
irq_stats(j)->kvm_posted_intr_nested_ipis);
seq_puts(p, " Nested posted-interrupt event\n");
seq_printf(p, "%*s: ", prec, "PIW");
for_each_online_cpu(j)
seq_printf(p, "%10u ",
irq_stats(j)->kvm_posted_intr_wakeup_ipis);
seq_puts(p, " Posted-interrupt wakeup event\n");
#endif
#ifdef CONFIG_X86_POSTED_MSI
seq_printf(p, "%*s: ", prec, "PMN");
for_each_online_cpu(j)
seq_printf(p, "%10u ",
irq_stats(j)->posted_msi_notification_count);
seq_puts(p, " Posted MSI notification event\n");
#endif
return 0;
}
/*
* /proc/stat helpers
*/
u64 arch_irq_stat_cpu(unsigned int cpu)
{
u64 sum = irq_stats(cpu)->__nmi_count;
#ifdef CONFIG_X86_LOCAL_APIC
sum += irq_stats(cpu)->apic_timer_irqs;
sum += irq_stats(cpu)->irq_spurious_count;
sum += irq_stats(cpu)->apic_perf_irqs;
sum += irq_stats(cpu)->apic_irq_work_irqs;
sum += irq_stats(cpu)->icr_read_retry_count;
if (x86_platform_ipi_callback)
sum += irq_stats(cpu)->x86_platform_ipis;
#endif
#ifdef CONFIG_SMP
sum += irq_stats(cpu)->irq_resched_count;
sum += irq_stats(cpu)->irq_call_count;
#endif
#ifdef CONFIG_X86_THERMAL_VECTOR
sum += irq_stats(cpu)->irq_thermal_count;
#endif
#ifdef CONFIG_X86_MCE_THRESHOLD
sum += irq_stats(cpu)->irq_threshold_count;
#endif
#ifdef CONFIG_X86_HV_CALLBACK_VECTOR
sum += irq_stats(cpu)->irq_hv_callback_count;
#endif
#if IS_ENABLED(CONFIG_HYPERV)
sum += irq_stats(cpu)->irq_hv_reenlightenment_count;
sum += irq_stats(cpu)->hyperv_stimer0_count;
#endif
#ifdef CONFIG_X86_MCE
sum += per_cpu(mce_exception_count, cpu);
sum += per_cpu(mce_poll_count, cpu);
#endif
return sum;
}
u64 arch_irq_stat(void)
{
u64 sum = atomic_read(&irq_err_count);
return sum;
}
static __always_inline void handle_irq(struct irq_desc *desc,
struct pt_regs *regs)
{
if (IS_ENABLED(CONFIG_X86_64))
generic_handle_irq_desc(desc);
else
__handle_irq(desc, regs);
}
static __always_inline int call_irq_handler(int vector, struct pt_regs *regs)
{
struct irq_desc *desc;
int ret = 0;
desc = __this_cpu_read(vector_irq[vector]);
if (likely(!IS_ERR_OR_NULL(desc))) {
handle_irq(desc, regs);
} else {
ret = -EINVAL;
if (desc == VECTOR_UNUSED) {
pr_emerg_ratelimited("%s: %d.%u No irq handler for vector\n",
x86/irq: Fix do_IRQ() interrupt warning for cpu hotplug retriggered irqs During heavy CPU-hotplug operations the following spurious kernel warnings can trigger: do_IRQ: No ... irq handler for vector (irq -1) [ See: https://bugzilla.kernel.org/show_bug.cgi?id=64831 ] When downing a cpu it is possible that there are unhandled irqs left in the APIC IRR register. The following code path shows how the problem can occur: 1. CPU 5 is to go down. 2. cpu_disable() on CPU 5 executes with interrupt flag cleared by local_irq_save() via stop_machine(). 3. IRQ 12 asserts on CPU 5, setting IRR but not ISR because interrupt flag is cleared (CPU unabled to handle the irq) 4. IRQs are migrated off of CPU 5, and the vectors' irqs are set to -1. 5. stop_machine() finishes cpu_disable() 6. cpu_die() for CPU 5 executes in normal context. 7. CPU 5 attempts to handle IRQ 12 because the IRR is set for IRQ 12. The code attempts to find the vector's IRQ and cannot because it has been set to -1. 8. do_IRQ() warning displays warning about CPU 5 IRQ 12. I added a debug printk to output which CPU & vector was retriggered and discovered that that we are getting bogus events. I see a 100% correlation between this debug printk in fixup_irqs() and the do_IRQ() warning. This patchset resolves this by adding definitions for VECTOR_UNDEFINED(-1) and VECTOR_RETRIGGERED(-2) and modifying the code to use them. Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=64831 Signed-off-by: Prarit Bhargava <prarit@redhat.com> Reviewed-by: Rui Wang <rui.y.wang@intel.com> Cc: Michel Lespinasse <walken@google.com> Cc: Seiji Aguchi <seiji.aguchi@hds.com> Cc: Yang Zhang <yang.z.zhang@Intel.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: janet.morgan@Intel.com Cc: tony.luck@Intel.com Cc: ruiv.wang@gmail.com Link: http://lkml.kernel.org/r/1388938252-16627-1-git-send-email-prarit@redhat.com [ Cleaned up the code a bit. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-05 11:10:52 -05:00
__func__, smp_processor_id(),
vector);
x86/irq: Fix do_IRQ() interrupt warning for cpu hotplug retriggered irqs During heavy CPU-hotplug operations the following spurious kernel warnings can trigger: do_IRQ: No ... irq handler for vector (irq -1) [ See: https://bugzilla.kernel.org/show_bug.cgi?id=64831 ] When downing a cpu it is possible that there are unhandled irqs left in the APIC IRR register. The following code path shows how the problem can occur: 1. CPU 5 is to go down. 2. cpu_disable() on CPU 5 executes with interrupt flag cleared by local_irq_save() via stop_machine(). 3. IRQ 12 asserts on CPU 5, setting IRR but not ISR because interrupt flag is cleared (CPU unabled to handle the irq) 4. IRQs are migrated off of CPU 5, and the vectors' irqs are set to -1. 5. stop_machine() finishes cpu_disable() 6. cpu_die() for CPU 5 executes in normal context. 7. CPU 5 attempts to handle IRQ 12 because the IRR is set for IRQ 12. The code attempts to find the vector's IRQ and cannot because it has been set to -1. 8. do_IRQ() warning displays warning about CPU 5 IRQ 12. I added a debug printk to output which CPU & vector was retriggered and discovered that that we are getting bogus events. I see a 100% correlation between this debug printk in fixup_irqs() and the do_IRQ() warning. This patchset resolves this by adding definitions for VECTOR_UNDEFINED(-1) and VECTOR_RETRIGGERED(-2) and modifying the code to use them. Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=64831 Signed-off-by: Prarit Bhargava <prarit@redhat.com> Reviewed-by: Rui Wang <rui.y.wang@intel.com> Cc: Michel Lespinasse <walken@google.com> Cc: Seiji Aguchi <seiji.aguchi@hds.com> Cc: Yang Zhang <yang.z.zhang@Intel.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: janet.morgan@Intel.com Cc: tony.luck@Intel.com Cc: ruiv.wang@gmail.com Link: http://lkml.kernel.org/r/1388938252-16627-1-git-send-email-prarit@redhat.com [ Cleaned up the code a bit. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-05 11:10:52 -05:00
} else {
__this_cpu_write(vector_irq[vector], VECTOR_UNUSED);
x86/irq: Fix do_IRQ() interrupt warning for cpu hotplug retriggered irqs During heavy CPU-hotplug operations the following spurious kernel warnings can trigger: do_IRQ: No ... irq handler for vector (irq -1) [ See: https://bugzilla.kernel.org/show_bug.cgi?id=64831 ] When downing a cpu it is possible that there are unhandled irqs left in the APIC IRR register. The following code path shows how the problem can occur: 1. CPU 5 is to go down. 2. cpu_disable() on CPU 5 executes with interrupt flag cleared by local_irq_save() via stop_machine(). 3. IRQ 12 asserts on CPU 5, setting IRR but not ISR because interrupt flag is cleared (CPU unabled to handle the irq) 4. IRQs are migrated off of CPU 5, and the vectors' irqs are set to -1. 5. stop_machine() finishes cpu_disable() 6. cpu_die() for CPU 5 executes in normal context. 7. CPU 5 attempts to handle IRQ 12 because the IRR is set for IRQ 12. The code attempts to find the vector's IRQ and cannot because it has been set to -1. 8. do_IRQ() warning displays warning about CPU 5 IRQ 12. I added a debug printk to output which CPU & vector was retriggered and discovered that that we are getting bogus events. I see a 100% correlation between this debug printk in fixup_irqs() and the do_IRQ() warning. This patchset resolves this by adding definitions for VECTOR_UNDEFINED(-1) and VECTOR_RETRIGGERED(-2) and modifying the code to use them. Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=64831 Signed-off-by: Prarit Bhargava <prarit@redhat.com> Reviewed-by: Rui Wang <rui.y.wang@intel.com> Cc: Michel Lespinasse <walken@google.com> Cc: Seiji Aguchi <seiji.aguchi@hds.com> Cc: Yang Zhang <yang.z.zhang@Intel.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: janet.morgan@Intel.com Cc: tony.luck@Intel.com Cc: ruiv.wang@gmail.com Link: http://lkml.kernel.org/r/1388938252-16627-1-git-send-email-prarit@redhat.com [ Cleaned up the code a bit. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-05 11:10:52 -05:00
}
}
return ret;
}
/*
* common_interrupt() handles all normal device IRQ's (the special SMP
* cross-CPU interrupts have their own entry points).
*/
DEFINE_IDTENTRY_IRQ(common_interrupt)
{
struct pt_regs *old_regs = set_irq_regs(regs);
/* entry code tells RCU that we're not quiescent. Check it. */
RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");
if (unlikely(call_irq_handler(vector, regs)))
apic_eoi();
set_irq_regs(old_regs);
}
#ifdef CONFIG_X86_LOCAL_APIC
/* Function pointer for generic interrupt vector handling */
void (*x86_platform_ipi_callback)(void) = NULL;
/*
* Handler for X86_PLATFORM_IPI_VECTOR.
*/
DEFINE_IDTENTRY_SYSVEC(sysvec_x86_platform_ipi)
x86, trace: Introduce entering/exiting_irq() When implementing tracepoints in interrupt handers, if the tracepoints are simply added in the performance sensitive path of interrupt handers, it may cause potential performance problem due to the time penalty. To solve the problem, an idea is to prepare non-trace/trace irq handers and switch their IDTs at the enabling/disabling time. So, let's introduce entering_irq()/exiting_irq() for pre/post- processing of each irq handler. A way to use them is as follows. Non-trace irq handler: smp_irq_handler() { entering_irq(); /* pre-processing of this handler */ __smp_irq_handler(); /* * common logic between non-trace and trace handlers * in a vector. */ exiting_irq(); /* post-processing of this handler */ } Trace irq_handler: smp_trace_irq_handler() { entering_irq(); /* pre-processing of this handler */ trace_irq_entry(); /* tracepoint for irq entry */ __smp_irq_handler(); /* * common logic between non-trace and trace handlers * in a vector. */ trace_irq_exit(); /* tracepoint for irq exit */ exiting_irq(); /* post-processing of this handler */ } If tracepoints can place outside entering_irq()/exiting_irq() as follows, it looks cleaner. smp_trace_irq_handler() { trace_irq_entry(); smp_irq_handler(); trace_irq_exit(); } But it doesn't work. The problem is with irq_enter/exit() being called. They must be called before trace_irq_enter/exit(), because of the rcu_irq_enter() must be called before any tracepoints are used, as tracepoints use rcu to synchronize. As a possible alternative, we may be able to call irq_enter() first as follows if irq_enter() can nest. smp_trace_irq_hander() { irq_entry(); trace_irq_entry(); smp_irq_handler(); trace_irq_exit(); irq_exit(); } But it doesn't work, either. If irq_enter() is nested, it may have a time penalty because it has to check if it was already called or not. The time penalty is not desired in performance sensitive paths even if it is tiny. Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com> Link: http://lkml.kernel.org/r/51C3238D.9040706@hds.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Steven Rostedt <rostedt@goodmis.org>
2013-06-20 11:45:17 -04:00
{
struct pt_regs *old_regs = set_irq_regs(regs);
apic_eoi();
trace_x86_platform_ipi_entry(X86_PLATFORM_IPI_VECTOR);
inc_irq_stat(x86_platform_ipis);
if (x86_platform_ipi_callback)
x86_platform_ipi_callback();
trace_x86_platform_ipi_exit(X86_PLATFORM_IPI_VECTOR);
set_irq_regs(old_regs);
}
#endif
#if IS_ENABLED(CONFIG_KVM)
static void dummy_handler(void) {}
static void (*kvm_posted_intr_wakeup_handler)(void) = dummy_handler;
void kvm_set_posted_intr_wakeup_handler(void (*handler)(void))
{
if (handler)
kvm_posted_intr_wakeup_handler = handler;
else {
kvm_posted_intr_wakeup_handler = dummy_handler;
synchronize_rcu();
}
}
EXPORT_SYMBOL_GPL(kvm_set_posted_intr_wakeup_handler);
/*
* Handler for POSTED_INTERRUPT_VECTOR.
*/
DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_kvm_posted_intr_ipi)
{
apic_eoi();
inc_irq_stat(kvm_posted_intr_ipis);
}
/*
* Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
*/
DEFINE_IDTENTRY_SYSVEC(sysvec_kvm_posted_intr_wakeup_ipi)
{
apic_eoi();
inc_irq_stat(kvm_posted_intr_wakeup_ipis);
kvm_posted_intr_wakeup_handler();
}
/*
* Handler for POSTED_INTERRUPT_NESTED_VECTOR.
*/
DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_kvm_posted_intr_nested_ipi)
{
apic_eoi();
inc_irq_stat(kvm_posted_intr_nested_ipis);
}
#endif
#ifdef CONFIG_X86_POSTED_MSI
/* Posted Interrupt Descriptors for coalesced MSIs to be posted */
DEFINE_PER_CPU_ALIGNED(struct pi_desc, posted_msi_pi_desc);
void intel_posted_msi_init(void)
{
u32 destination;
u32 apic_id;
this_cpu_write(posted_msi_pi_desc.nv, POSTED_MSI_NOTIFICATION_VECTOR);
/*
* APIC destination ID is stored in bit 8:15 while in XAPIC mode.
* VT-d spec. CH 9.11
*/
apic_id = this_cpu_read(x86_cpu_to_apicid);
destination = x2apic_enabled() ? apic_id : apic_id << 8;
this_cpu_write(posted_msi_pi_desc.ndst, destination);
}
/*
* De-multiplexing posted interrupts is on the performance path, the code
* below is written to optimize the cache performance based on the following
* considerations:
* 1.Posted interrupt descriptor (PID) fits in a cache line that is frequently
* accessed by both CPU and IOMMU.
* 2.During posted MSI processing, the CPU needs to do 64-bit read and xchg
* for checking and clearing posted interrupt request (PIR), a 256 bit field
* within the PID.
* 3.On the other side, the IOMMU does atomic swaps of the entire PID cache
* line when posting interrupts and setting control bits.
* 4.The CPU can access the cache line a magnitude faster than the IOMMU.
* 5.Each time the IOMMU does interrupt posting to the PIR will evict the PID
* cache line. The cache line states after each operation are as follows:
* CPU IOMMU PID Cache line state
* ---------------------------------------------------------------
*...read64 exclusive
*...lock xchg64 modified
*... post/atomic swap invalid
*...-------------------------------------------------------------
*
* To reduce L1 data cache miss, it is important to avoid contention with
* IOMMU's interrupt posting/atomic swap. Therefore, a copy of PIR is used
* to dispatch interrupt handlers.
*
* In addition, the code is trying to keep the cache line state consistent
* as much as possible. e.g. when making a copy and clearing the PIR
* (assuming non-zero PIR bits are present in the entire PIR), it does:
* read, read, read, read, xchg, xchg, xchg, xchg
* instead of:
* read, xchg, read, xchg, read, xchg, read, xchg
*/
static __always_inline bool handle_pending_pir(u64 *pir, struct pt_regs *regs)
{
int i, vec = FIRST_EXTERNAL_VECTOR;
unsigned long pir_copy[4];
bool handled = false;
for (i = 0; i < 4; i++)
pir_copy[i] = pir[i];
for (i = 0; i < 4; i++) {
if (!pir_copy[i])
continue;
pir_copy[i] = arch_xchg(&pir[i], 0);
handled = true;
}
if (handled) {
for_each_set_bit_from(vec, pir_copy, FIRST_SYSTEM_VECTOR)
call_irq_handler(vec, regs);
}
return handled;
}
/*
* Performance data shows that 3 is good enough to harvest 90+% of the benefit
* on high IRQ rate workload.
*/
#define MAX_POSTED_MSI_COALESCING_LOOP 3
/*
* For MSIs that are delivered as posted interrupts, the CPU notifications
* can be coalesced if the MSIs arrive in high frequency bursts.
*/
DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
{
struct pt_regs *old_regs = set_irq_regs(regs);
struct pi_desc *pid;
int i = 0;
pid = this_cpu_ptr(&posted_msi_pi_desc);
inc_irq_stat(posted_msi_notification_count);
irq_enter();
/*
* Max coalescing count includes the extra round of handle_pending_pir
* after clearing the outstanding notification bit. Hence, at most
* MAX_POSTED_MSI_COALESCING_LOOP - 1 loops are executed here.
*/
while (++i < MAX_POSTED_MSI_COALESCING_LOOP) {
if (!handle_pending_pir(pid->pir64, regs))
break;
}
/*
* Clear outstanding notification bit to allow new IRQ notifications,
* do this last to maximize the window of interrupt coalescing.
*/
pi_clear_on(pid);
/*
* There could be a race of PI notification and the clearing of ON bit,
* process PIR bits one last time such that handling the new interrupts
* are not delayed until the next IRQ.
*/
handle_pending_pir(pid->pir64, regs);
apic_eoi();
irq_exit();
set_irq_regs(old_regs);
}
#endif /* X86_POSTED_MSI */
x86, trace: Add irq vector tracepoints [Purpose of this patch] As Vaibhav explained in the thread below, tracepoints for irq vectors are useful. http://www.spinics.net/lists/mm-commits/msg85707.html <snip> The current interrupt traces from irq_handler_entry and irq_handler_exit provide when an interrupt is handled. They provide good data about when the system has switched to kernel space and how it affects the currently running processes. There are some IRQ vectors which trigger the system into kernel space, which are not handled in generic IRQ handlers. Tracing such events gives us the information about IRQ interaction with other system events. The trace also tells where the system is spending its time. We want to know which cores are handling interrupts and how they are affecting other processes in the system. Also, the trace provides information about when the cores are idle and which interrupts are changing that state. <snip> On the other hand, my usecase is tracing just local timer event and getting a value of instruction pointer. I suggested to add an argument local timer event to get instruction pointer before. But there is another way to get it with external module like systemtap. So, I don't need to add any argument to irq vector tracepoints now. [Patch Description] Vaibhav's patch shared a trace point ,irq_vector_entry/irq_vector_exit, in all events. But there is an above use case to trace specific irq_vector rather than tracing all events. In this case, we are concerned about overhead due to unwanted events. So, add following tracepoints instead of introducing irq_vector_entry/exit. so that we can enable them independently. - local_timer_vector - reschedule_vector - call_function_vector - call_function_single_vector - irq_work_entry_vector - error_apic_vector - thermal_apic_vector - threshold_apic_vector - spurious_apic_vector - x86_platform_ipi_vector Also, introduce a logic switching IDT at enabling/disabling time so that a time penalty makes a zero when tracepoints are disabled. Detailed explanations are as follows. - Create trace irq handlers with entering_irq()/exiting_irq(). - Create a new IDT, trace_idt_table, at boot time by adding a logic to _set_gate(). It is just a copy of original idt table. - Register the new handlers for tracpoints to the new IDT by introducing macros to alloc_intr_gate() called at registering time of irq_vector handlers. - Add checking, whether irq vector tracing is on/off, into load_current_idt(). This has to be done below debug checking for these reasons. - Switching to debug IDT may be kicked while tracing is enabled. - On the other hands, switching to trace IDT is kicked only when debugging is disabled. In addition, the new IDT is created only when CONFIG_TRACING is enabled to avoid being used for other purposes. Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com> Link: http://lkml.kernel.org/r/51C323ED.5050708@hds.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: Steven Rostedt <rostedt@goodmis.org>
2013-06-20 11:46:53 -04:00
#ifdef CONFIG_HOTPLUG_CPU
/* A cpu has been removed from cpu_online_mask. Reset irq affinities. */
void fixup_irqs(void)
{
unsigned int vector;
struct irq_desc *desc;
struct irq_data *data;
struct irq_chip *chip;
irq_migrate_all_off_this_cpu();
/*
* We can remove mdelay() and then send spurious interrupts to
* new cpu targets for all the irqs that were handled previously by
* this cpu. While it works, I have seen spurious interrupt messages
* (nothing wrong but still...).
*
* So for now, retain mdelay(1) and check the IRR and then send those
* interrupts to new targets as this cpu is already offlined...
*/
mdelay(1);
/*
* We can walk the vector array of this cpu without holding
* vector_lock because the cpu is already marked !online, so
* nothing else will touch it.
*/
for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) {
if (IS_ERR_OR_NULL(__this_cpu_read(vector_irq[vector])))
continue;
if (is_vector_pending(vector)) {
desc = __this_cpu_read(vector_irq[vector]);
raw_spin_lock(&desc->lock);
data = irq_desc_get_irq_data(desc);
chip = irq_data_get_irq_chip(data);
x86/irq: Fix do_IRQ() interrupt warning for cpu hotplug retriggered irqs During heavy CPU-hotplug operations the following spurious kernel warnings can trigger: do_IRQ: No ... irq handler for vector (irq -1) [ See: https://bugzilla.kernel.org/show_bug.cgi?id=64831 ] When downing a cpu it is possible that there are unhandled irqs left in the APIC IRR register. The following code path shows how the problem can occur: 1. CPU 5 is to go down. 2. cpu_disable() on CPU 5 executes with interrupt flag cleared by local_irq_save() via stop_machine(). 3. IRQ 12 asserts on CPU 5, setting IRR but not ISR because interrupt flag is cleared (CPU unabled to handle the irq) 4. IRQs are migrated off of CPU 5, and the vectors' irqs are set to -1. 5. stop_machine() finishes cpu_disable() 6. cpu_die() for CPU 5 executes in normal context. 7. CPU 5 attempts to handle IRQ 12 because the IRR is set for IRQ 12. The code attempts to find the vector's IRQ and cannot because it has been set to -1. 8. do_IRQ() warning displays warning about CPU 5 IRQ 12. I added a debug printk to output which CPU & vector was retriggered and discovered that that we are getting bogus events. I see a 100% correlation between this debug printk in fixup_irqs() and the do_IRQ() warning. This patchset resolves this by adding definitions for VECTOR_UNDEFINED(-1) and VECTOR_RETRIGGERED(-2) and modifying the code to use them. Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=64831 Signed-off-by: Prarit Bhargava <prarit@redhat.com> Reviewed-by: Rui Wang <rui.y.wang@intel.com> Cc: Michel Lespinasse <walken@google.com> Cc: Seiji Aguchi <seiji.aguchi@hds.com> Cc: Yang Zhang <yang.z.zhang@Intel.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: janet.morgan@Intel.com Cc: tony.luck@Intel.com Cc: ruiv.wang@gmail.com Link: http://lkml.kernel.org/r/1388938252-16627-1-git-send-email-prarit@redhat.com [ Cleaned up the code a bit. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-05 11:10:52 -05:00
if (chip->irq_retrigger) {
chip->irq_retrigger(data);
x86/irq: Fix do_IRQ() interrupt warning for cpu hotplug retriggered irqs During heavy CPU-hotplug operations the following spurious kernel warnings can trigger: do_IRQ: No ... irq handler for vector (irq -1) [ See: https://bugzilla.kernel.org/show_bug.cgi?id=64831 ] When downing a cpu it is possible that there are unhandled irqs left in the APIC IRR register. The following code path shows how the problem can occur: 1. CPU 5 is to go down. 2. cpu_disable() on CPU 5 executes with interrupt flag cleared by local_irq_save() via stop_machine(). 3. IRQ 12 asserts on CPU 5, setting IRR but not ISR because interrupt flag is cleared (CPU unabled to handle the irq) 4. IRQs are migrated off of CPU 5, and the vectors' irqs are set to -1. 5. stop_machine() finishes cpu_disable() 6. cpu_die() for CPU 5 executes in normal context. 7. CPU 5 attempts to handle IRQ 12 because the IRR is set for IRQ 12. The code attempts to find the vector's IRQ and cannot because it has been set to -1. 8. do_IRQ() warning displays warning about CPU 5 IRQ 12. I added a debug printk to output which CPU & vector was retriggered and discovered that that we are getting bogus events. I see a 100% correlation between this debug printk in fixup_irqs() and the do_IRQ() warning. This patchset resolves this by adding definitions for VECTOR_UNDEFINED(-1) and VECTOR_RETRIGGERED(-2) and modifying the code to use them. Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=64831 Signed-off-by: Prarit Bhargava <prarit@redhat.com> Reviewed-by: Rui Wang <rui.y.wang@intel.com> Cc: Michel Lespinasse <walken@google.com> Cc: Seiji Aguchi <seiji.aguchi@hds.com> Cc: Yang Zhang <yang.z.zhang@Intel.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: janet.morgan@Intel.com Cc: tony.luck@Intel.com Cc: ruiv.wang@gmail.com Link: http://lkml.kernel.org/r/1388938252-16627-1-git-send-email-prarit@redhat.com [ Cleaned up the code a bit. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-05 11:10:52 -05:00
__this_cpu_write(vector_irq[vector], VECTOR_RETRIGGERED);
}
raw_spin_unlock(&desc->lock);
}
x86/irq: Fix do_IRQ() interrupt warning for cpu hotplug retriggered irqs During heavy CPU-hotplug operations the following spurious kernel warnings can trigger: do_IRQ: No ... irq handler for vector (irq -1) [ See: https://bugzilla.kernel.org/show_bug.cgi?id=64831 ] When downing a cpu it is possible that there are unhandled irqs left in the APIC IRR register. The following code path shows how the problem can occur: 1. CPU 5 is to go down. 2. cpu_disable() on CPU 5 executes with interrupt flag cleared by local_irq_save() via stop_machine(). 3. IRQ 12 asserts on CPU 5, setting IRR but not ISR because interrupt flag is cleared (CPU unabled to handle the irq) 4. IRQs are migrated off of CPU 5, and the vectors' irqs are set to -1. 5. stop_machine() finishes cpu_disable() 6. cpu_die() for CPU 5 executes in normal context. 7. CPU 5 attempts to handle IRQ 12 because the IRR is set for IRQ 12. The code attempts to find the vector's IRQ and cannot because it has been set to -1. 8. do_IRQ() warning displays warning about CPU 5 IRQ 12. I added a debug printk to output which CPU & vector was retriggered and discovered that that we are getting bogus events. I see a 100% correlation between this debug printk in fixup_irqs() and the do_IRQ() warning. This patchset resolves this by adding definitions for VECTOR_UNDEFINED(-1) and VECTOR_RETRIGGERED(-2) and modifying the code to use them. Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=64831 Signed-off-by: Prarit Bhargava <prarit@redhat.com> Reviewed-by: Rui Wang <rui.y.wang@intel.com> Cc: Michel Lespinasse <walken@google.com> Cc: Seiji Aguchi <seiji.aguchi@hds.com> Cc: Yang Zhang <yang.z.zhang@Intel.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: janet.morgan@Intel.com Cc: tony.luck@Intel.com Cc: ruiv.wang@gmail.com Link: http://lkml.kernel.org/r/1388938252-16627-1-git-send-email-prarit@redhat.com [ Cleaned up the code a bit. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-05 11:10:52 -05:00
if (__this_cpu_read(vector_irq[vector]) != VECTOR_RETRIGGERED)
__this_cpu_write(vector_irq[vector], VECTOR_UNUSED);
}
}
#endif
#ifdef CONFIG_X86_THERMAL_VECTOR
static void smp_thermal_vector(void)
{
if (x86_thermal_enabled())
intel_thermal_interrupt();
else
pr_err("CPU%d: Unexpected LVT thermal interrupt!\n",
smp_processor_id());
}
DEFINE_IDTENTRY_SYSVEC(sysvec_thermal)
{
trace_thermal_apic_entry(THERMAL_APIC_VECTOR);
inc_irq_stat(irq_thermal_count);
smp_thermal_vector();
trace_thermal_apic_exit(THERMAL_APIC_VECTOR);
apic_eoi();
}
#endif